Mapping Functions

The following functions can be used to create the intermediate mapping CSV required to generate a parser.

adtl.autoparser.create_mapping(data_dictionary: str | DataFrame, table_name: str, save: bool = True, file_name: str = 'mapping_file', table_format: Literal['wide', 'long'] = 'wide') -> DataFrame

Creates a CSV containing the mapping between a data dictionary and a schema.

Takes a data dictionary and matches both the source fields and any common values to the schema. An LLM first matches the source fields to appropriate schema targets, and then matches the common values to the appropriate enum or boolean options.

Parameters:
  • data_dictionary – Path to a CSV or XLSX file, or a DataFrame, containing the data dictionary.

  • table_name – Name of the table being mapped.

  • save – Whether to save the mapping to a CSV file.

  • file_name – Name of the file to save the mapping to.

  • table_format – Format of the table to create, either ‘wide’ or ‘long’.

Returns:

DataFrame containing the mapping between the data dictionary and the schema.

Return type:

pd.DataFrame
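As a rough illustration of the call described above, the sketch below wraps create_mapping in a small helper. It assumes the adtl package is installed and that the autoparser's schema and LLM access have already been configured, which this section does not cover. The helper name build_mapping is hypothetical and is not part of adtl.

```python
def build_mapping(data_dictionary, table_name, table_format="wide"):
    """Return the mapping DataFrame without writing a CSV file.

    `build_mapping` is a hypothetical convenience wrapper, not part of adtl.
    """
    # Imported lazily so the helper can be defined even where adtl is not
    # installed; the call itself needs adtl and a configured LLM.
    from adtl.autoparser import create_mapping

    return create_mapping(
        data_dictionary,            # path to a CSV/XLSX file, or a DataFrame
        table_name,                 # name of the table being mapped
        save=False,                 # skip writing the CSV, only return the DataFrame
        table_format=table_format,  # 'wide' or 'long'
    )
```

A call such as build_mapping("data_dictionary.csv", "patients") would then return the mapping for a wide table; both argument values are placeholders.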

Class definitions

You can also interact with the base classes WideMapper and LongMapper directly.

class adtl.autoparser.WideMapper(data_dictionary: str | DataFrame, table_name: str)

Class for creating an intermediate mapping file linking the data dictionary to a wide schema's fields and values.

create_mapping(save=True, file_name='mapping_file') -> DataFrame

Creates an intermediate mapping dataframe linking the data dictionary to schema fields. The index contains the target (schema) field names, and the columns are:
  • source_description

  • source_field

  • common_values OR choices (depending on the data dictionary)

  • target_values

  • value_mapping

Raises a warning if any field in the schema has no corresponding source field in the data dictionary.

Parameters:
  • save – Whether to save the mapping to a CSV file. If True, lists and dicts are converted to strings before saving.

  • file_name – The name to use for the CSV file.
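Because unmatched schema fields are reported as a warning rather than an error, it can be useful to collect those warnings for review. The sketch below assumes the warning is emitted through Python's standard warnings module, which this page does not confirm. The function fake_create_mapping and its message text are hypothetical stand-ins for a real mapper call.

```python
import warnings


def run_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of warning messages as strings)."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, even ones that would normally be shown once.
        warnings.simplefilter("always")
        result = func(*args, **kwargs)
    return result, [str(w.message) for w in caught]


def fake_create_mapping():
    """Hypothetical stand-in for mapper.create_mapping(save=False)."""
    warnings.warn("No source field found for schema field 'age'", UserWarning)
    return {"age": None}


result, messages = run_and_collect_warnings(fake_create_mapping)
```

With a real mapper, mapper.create_mapping could be passed in place of the stand-in, and the collected messages checked before the mapping file is used.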

match_fields_to_schema() -> DataFrame

Use the LLM to match the target (schema) fields to the descriptions of the source data fields from the data dictionary.

property target_fields: list[str]

Returns a list of fields in the target schema.

property target_types: dict[str, list[str]]

Returns the field types of the target schema.

property target_values: Series

Returns the enum values or boolean options for the target schema.
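The class can be used directly when the schema-side properties need to be inspected before the mapping is built. The following is a minimal sketch under the same assumptions as any other call into adtl.autoparser: the package is installed and its schema and LLM configuration are already in place. The helper name summarise_wide_mapper is hypothetical.

```python
def summarise_wide_mapper(data_dictionary, table_name):
    """Return (schema summary, mapping DataFrame) for a wide table.

    `summarise_wide_mapper` is a hypothetical helper, not part of adtl.
    """
    # Lazy import: requires adtl to be installed when the helper is called.
    from adtl.autoparser import WideMapper

    mapper = WideMapper(data_dictionary, table_name)

    # Properties documented above, read before any LLM call is made here.
    summary = {
        "target_fields": mapper.target_fields,  # list[str]
        "target_types": mapper.target_types,    # dict[str, list[str]]
    }

    # save=False returns the DataFrame without writing a CSV file.
    mapping = mapper.create_mapping(save=False)
    return summary, mapping
```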

class adtl.autoparser.LongMapper(data_dictionary: str | DataFrame, table_name: str)

Class for creating an intermediate mapping file linking the data dictionary to a long-format schema's fields and values.

create_mapping(save=True, file_name='mapping_file') -> DataFrame

Creates an intermediate mapping dataframe linking the data dictionary to schema fields. The index contains the source field names, and the columns are:
  • source_description

  • common_values OR choices (depending on the data dictionary)

  • <variable_name> (the name of the column identified in the config file)

  • value_col

  • any other fields in the long schema

Raises a warning if any field in the schema has no corresponding source field in the data dictionary.

Parameters:
  • save – Whether to save the mapping to a CSV file. If True, lists and dicts are converted to strings before saving.

  • file_name – The name to use for the CSV file.

match_fields_to_schema() -> DataFrame

Use the LLM to match the target (schema) fields to the descriptions of the source data fields from the data dictionary.

set_common_fields(common_fields: dict[str, str])

Assigns values to the common fields of the long table, i.e. fields which should be filled by the same text or source field in every row of the long table.

property target_values: Series

Returns the enum values or boolean options for the target schema.
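For the long format, a typical sequence might be to construct the mapper, assign the common fields, and then create the mapping. The sketch below assumes that order (common fields set before create_mapping is called), which this page does not state explicitly, and it assumes adtl is installed and configured. The helper name build_long_mapping and the example keys in the usage note are hypothetical.

```python
def build_long_mapping(data_dictionary, table_name, common_fields):
    """Return the long-format mapping DataFrame without writing a CSV.

    `build_long_mapping` is a hypothetical helper, not part of adtl.
    """
    # Lazy import: requires adtl to be installed when the helper is called.
    from adtl.autoparser import LongMapper

    mapper = LongMapper(data_dictionary, table_name)

    # common_fields maps a long-table field to the text or source field
    # that should fill it in every row (dict[str, str]).
    mapper.set_common_fields(common_fields)

    # save=False returns the DataFrame without writing a CSV file.
    return mapper.create_mapping(save=False)
```

A call might look like build_long_mapping("data_dictionary.csv", "observations", {"subject_id": "patient_id"}), where the file name, table name, and field names are all placeholders.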