Write a Data Parser#
ADTL requires a TOML specification file which describes how raw data should be converted into a new format, on a field-by-field basis. Every unique data file format (i.e. unique sets of fields and data types) should have a corresponding parser file.
AutoParser exists to semi-automate the process of writing new parser files. This requires a data dictionary (which can be created if it does not already exist, see ‘Create Data dictionary’), and the JSON schema(s) for the target table format(s).
Parser generation is a 2-step process.
Generate intermediate mappings (CSV)#
First, an intermediate mapping file is created which can look like this, for a wide-format table:
target_field |
source_description |
source_field |
common_values |
target_values |
value_mapping |
|---|---|---|---|---|---|
identity |
Identity |
Identité |
|||
name |
Full Name |
Nom complet |
|||
loc_admin_1 |
Province |
Province |
Equateur, Orientale, Katanga, Kinshasa |
||
country_iso3 |
|||||
notification_date |
Notification Date |
DateNotification |
|||
classification |
Classification |
Classicfication |
FISH, amphibie, oiseau, Mammifère, poisson, REPT, OISEAU |
mammal, bird, reptile, amphibian, fish, invertebrate, None |
mammifère=mammal, rept=reptile, fish=fish, oiseau=bird, amphibie=amphibian, poisson=fish |
case_status |
Case Status |
StatusCas |
Vivant, Décédé |
alive, dead, unknown, None |
décédé=dead, vivant=alive |
target_x refers to the desired output format, while source_x refers to the raw data.
In this example, the final row shows that the case_status field in the desired output
format should be filled using data from the StatusCas field in the raw data. The value_mapping
column indicated that all instances of décédé in the raw data should be mapped to dead
in the converted file, and vivant should map to alive.
:::{warning} LLM’s are prone to errors and hallucinations. These intermediate mappings should be manually curated, as the LLM may generate incorrect matches for either the field, or the values within that field. :::
If your desired format has multiple tables, one mapping file should be produced for each table. A similar process is followed for long-format targets, but instead of ‘target_field’ as the table index, ‘source_field’ is used and any source fields which cannot be mapped to a provided variable will be left blank for the user to either map manually, or delete if that data is not required.
Currently, all long-format schemas must provide a list of enums for the field denoted as the ‘variable’ column.
Generate TOML#
This step is automated and should produce a TOML file that conforms to the adtl parser schema, ready for use transforming data.
API#
- adtl.autoparser.create_mapping(data_dictionary: str | DataFrame, table_name: str, save: bool = True, file_name: str = 'mapping_file', table_format: Literal['wide', 'long'] = 'wide') DataFrame
Creates a csv containing the mapping between a data dictionary and a schema.
Takes a data dictionary and matches both the source fields, and any common values to the schema. Uses an LLM to first match the source fields to appropriate schema targets, and then to match the common values to appropriate enum or boolean options.
- Parameters:
data_dictionary – Path to a CSV or XLSX file, or a DataFrame, containing the data dictionary.
table_name – Name of the table being mapped.
save – Whether to save the mapping to a CSV file.
file_name – Name of the file to save the mapping to.
table_format – Format of the table to create, either ‘wide’ or ‘long’.
- Returns:
Dataframe containing the mapping between the data dictionary and the schema.
- Return type:
pd.DataFrame
- adtl.autoparser.create_parser(mappings: DataFrame | str, schema_path: Path, parser_name: str, description: str | None = None, constant_fields: dict[str, dict[str, bool]] | None = None)
Takes the csv mapping file created by create_mapping and writes out a TOML parser
Generates a TOML parser for use with ADTL from the intermediate CSV file generated by create_mapping. This will generate a TOML file that can be used to parse raw data into the format expected by the schema.
- Parameters:
mappings – Path to the CSV file containing the mappings
schema_path – Path to the schema file
parser_name – Name of the parser to create
description – Description of the parser. Defaults to the parser name.
constant_fields – Constant fields are those which are single values, rather than taken from a field in the source data.
- Return type:
None