Data Dictionary Generation Functions
The following functions create a data dictionary and add descriptions to it.
- adtl.autoparser.create_dict(data: DataFrame | str) → DataFrame
Create a basic data dictionary from a dataset.
Creates a data dictionary from a dataset, including the field name, field type, and common values (defined as occurring more than 25 times in a column). Also creates an empty column for field descriptions, which can either be filled in by hand later or auto-generated with an LLM using generate_descriptions().
- Parameters:
data – Path to a CSV or XLSX file, or a DataFrame, containing the raw data.
- Returns:
Data dictionary containing field names, field types, and common values.
- Return type:
pd.DataFrame
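The "common values" rule described above can be illustrated with a minimal standard-library sketch. The helper name common_values, its threshold parameter, and the decision to skip missing values are assumptions made for illustration; this is not the code adtl itself uses.

```python
from collections import Counter


def common_values(column, threshold=25):
    """Return the values that occur more than `threshold` times in a column.

    Hypothetical helper for illustration only; not part of adtl.
    """
    # Assumption: missing entries (None) are not counted as values.
    counts = Counter(value for value in column if value is not None)
    # Strictly "more than" the threshold, matching the wording above.
    return [value for value, n in counts.items() if n > threshold]


# "unknown" appears exactly 25 times, so it is not treated as common.
column = ["yes"] * 30 + ["no"] * 26 + ["unknown"] * 25 + [None] * 40
print(common_values(column))  # ['yes', 'no']
```

A strict comparison is used because the description says "more than 25 times"; a value seen exactly 25 times is left out.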
- adtl.autoparser.generate_descriptions(data_dict: DataFrame | str) → DataFrame
Generate descriptions for the columns in the dataset.
Uses an LLM to auto-generate descriptions for a data dictionary based on the column headers.
- Parameters:
data_dict – Data dictionary containing the column headers, either as a DataFrame or as a path to a CSV or XLSX file.
- Returns:
Data dictionary with descriptions added.
- Return type:
pd.DataFrame
Class definitions
You can also interact with the base class, DictWriter.
- class adtl.autoparser.DictWriter
Class for inferring a data dictionary from a dataset. The class does not store the data, only the data dictionary it creates.
Use create_dict() to create a data dictionary; it is the function equivalent of the create-dict command-line script.
generate_descriptions() uses an LLM to generate descriptions for the data dictionary from the column headers only, NOT from the data itself.
- Parameters:
api_key – API key corresponding to the chosen LLM provider/model.
- create_dict(data: DataFrame | str) → DataFrame
Create a basic data dictionary from a dataset.
Creates a data dictionary from a dataset, including the field name, field type, and common values (defined as occurring more than 25 times in a column). Also creates an empty column for field descriptions, which can either be filled in by hand later or auto-generated with an LLM using generate_descriptions().
- Parameters:
data – Path to a CSV or XLSX file, or a DataFrame, containing the raw data.
- Returns:
Data dictionary containing field names, field types, and common values.
- Return type:
pd.DataFrame
- generate_descriptions(data_dict: DataFrame | str | None = None) → DataFrame
Generate descriptions for the columns in the dataset.
Uses an LLM to auto-generate descriptions for a data dictionary based on the column headers.
- Parameters:
data_dict – Data dictionary containing the column headers, either as a DataFrame or as a path to a CSV or XLSX file. Can be None if the data dictionary has already been created with create_dict().
- Returns:
Data dictionary with descriptions added.
- Return type:
pd.DataFrame
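The class-based workflow described above might look like the following sketch. It assumes the signatures exactly as documented in this section, and that api_key can be passed to DictWriter as a keyword argument; the wrapper function build_described_dictionary and its arguments are hypothetical names introduced for illustration. Installed versions of adtl may require additional arguments.

```python
def build_described_dictionary(data_path, api_key):
    """Create a data dictionary, then add LLM-generated descriptions.

    Hypothetical wrapper for illustration; assumes the DictWriter
    interface as documented in this section.
    """
    # Imported lazily so the sketch can be loaded without adtl installed.
    from adtl.autoparser import DictWriter

    # Assumption: api_key is accepted as a keyword argument.
    writer = DictWriter(api_key=api_key)

    # The writer keeps the resulting dictionary, not the raw data.
    data_dict = writer.create_dict(data_path)

    # data_dict is omitted here because create_dict() has already run
    # on this writer instance.
    described = writer.generate_descriptions()
    return data_dict, described
```

Calling generate_descriptions() without arguments relies on the documented behaviour that data_dict can be None once create_dict() has been called on the same instance. Only the column headers are sent to the LLM, not the data.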