Parser construction example with a user-provided data dictionary


This file demonstrates the process of constructing a parser file using animal_data_choices.csv as a source dataset.

Before you start: autoparser requires an LLM API key to function, for either OpenAI or Gemini. You should add yours to your environment, as described here. This example uses the OpenAI API; edit the API_KEY line below to match the name you gave yours.

If you would prefer to use Gemini, pass the llm_provider argument to the functions that take the API key, e.g.

writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')

You can also specify which model from either OpenAI or Gemini you wish to use, with the llm_model argument. Your model choice should support Structured Outputs (for OpenAI) or Controlled Generation (for Gemini). The model should be provided as a string recognised by the respective API, e.g. llm_model = "gpt-4o-mini" (the default model when OpenAI is selected as the provider).
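As a companion to the "Before you start" note, here is a minimal sketch of reading the key from the environment and failing early if it is absent. The helper name load_api_key is hypothetical and is not part of autoparser; whether autoparser itself resolves an environment variable name is not shown in this example, so treat this only as one way to obtain the key value.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the environment variable `env_var`.

    Hypothetical helper for illustration; not an autoparser function.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail before any LLM call is made, with a message naming the variable.
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set; "
            "export your API key before running this example."
        )
    return key
```

Failing at this point gives a clearer message than an authentication error raised later from inside the LLM client.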

import pandas as pd

import adtl.autoparser as autoparser

API_KEY = "OPENAI_API_KEY"
autoparser.setup_config(
    {
        "language": "fr",
        "llm_provider": "openai",
        "api_key": API_KEY,
        "max_common_count": 8,
        "schemas": {
            "animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
        },
        "column_mappings": {
            "source_field": "Field Name",
            "source_description": "Description",
            "source_type": "Field Type",
            "choices": "Choices",
        },
    }
)
data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data_choices.csv")
data.head()
| | Identité | Province | DateNotification | Classicfication | Nom complet | Date de naissance | AgeAns | AgeMois | Sexe | StatusCas | DateDec | ContSoins | ContHumain Autre | ContexteContHumain | ContactAnimal | Micropucé | AnimalDeCompagnie | ConditionsPreexistantes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A001 | Equateur | 2024-01-01 | 4 | Luna | 15/03/2022 | 2 | 10 | 2 | 1 | NaN | 1.0 | 2.0 | 2.0 | 1 | 1 | 1 | [arthrite, vomir] |
| 1 | B002 | Equateur | 2024-15-02 | 1 | Max | 21/07/2021 | 3 | 4 | 1 | 2 | 2024-06-01 | 2.0 | 1.0 | 1.0 | 2 | 2 | 1 | NaN |
| 2 | C003 | Equateur | 2024-03-10 | 3 | Coco | 10/02/2023 | 1 | 11 | 2 | 1 | NaN | 1.0 | 2.0 | 2.0 | 1 | 1 | 2 | NaN |
| 3 | D004 | NaN | 2024-04-22 | 2 | Bella | 05/11/2020 | 4 | 5 | 1 | 1 | NaN | 1.0 | NaN | 3.0 | 2 | 2 | 2 | NaN |
| 4 | E005 | NaN | 2024-05-30 | 5 | Charlie | 18/05/2019 | 5 | 3 | 2 | 2 | 2024-07-01 | NaN | NaN | 1.0 | 1 | 1 | 1 | NaN |

You can see from the above data that many of the columns are encoded as numeric codes rather than as strings (e.g. the ‘Sexe’ column contains 1s and 2s rather than labels such as ‘mâle’ or ‘femelle’). The data dictionary must therefore be used to translate those values into meaningful data, so let’s look at that.

data_dict = pd.read_csv("../../../tests/test_autoparser/sources/animals_dd_choices.csv")
data_dict
| | Field Name | Description | Field Type | Choices |
|---|---|---|---|---|
| 0 | Identité | Identity | string | NaN |
| 1 | Province | Province | string | NaN |
| 2 | DateNotification | Notification Date | string | NaN |
| 3 | Classicfication | Classification | string | 1=fish, 2=amphibie, 3=oiseau, 4=mammifère, 5=p... |
| 4 | Nom complet | Full Name | string | NaN |
| 5 | Date de naissance | Date of Birth | string | NaN |
| 6 | AgeAns | Age in Years | number | NaN |
| 7 | AgeMois | Age in Months | number | NaN |
| 8 | Sexe | Gender | string | 1=mâle, 2=femelle, 3=inconnu |
| 9 | StatusCas | Case Status | string | 1=Vivant, 2=Décédé |
| 10 | DateDec | Date of Death | string | NaN |
| 11 | ContSoins | Care Contact | string | 1=oui, 2=non |
| 12 | ContHumain Autre | Other Human Contact | string | 1=oui, 2=non |
| 13 | ContexteContHumain | Human Contact Context | string | 2=non, 1=voyage, 3=autres |
| 14 | ContactAnimal | Animal Contact | string | 1=oui, 2=non |
| 15 | Micropucé | Microchipped | string | 1=oui, 2=non |
| 16 | AnimalDeCompagnie | Pet Animal | string | 1=oui, 2=non |
| 17 | ConditionsPreexistantes | Preexisting Conditions | list | NaN |
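To make the role of the Choices column concrete, the sketch below decodes a few raw values by hand using lookup tables copied from the data dictionary. The decode helper is hypothetical and only illustrates the idea; it is not how autoparser or adtl performs the translation.

```python
# Lookup tables copied from the Choices column of the data dictionary.
SEXE = {"1": "mâle", "2": "femelle", "3": "inconnu"}
CONT_SOINS = {"1": "oui", "2": "non"}


def decode(value, choices):
    """Translate a raw code into its label; missing or unknown codes give None."""
    if value is None or value != value:  # None, or float NaN (NaN != NaN)
        return None
    # Columns containing NaN are read as floats (1.0, 2.0), so normalise
    # 1.0 -> 1 before building the string key.
    if isinstance(value, float) and value.is_integer():
        value = int(value)
    return choices.get(str(value))


# First five values of the Sexe and ContSoins columns shown above.
print([decode(v, SEXE) for v in [2, 1, 2, 1, 2]])
# → ['femelle', 'mâle', 'femelle', 'mâle', 'femelle']
print([decode(v, CONT_SOINS) for v in [1.0, 2.0, 1.0, 1.0, float("nan")]])
# → ['oui', 'non', 'oui', 'oui', None]
```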

Before we use this data dictionary to map our data, we should check that it can be converted and validated for use with AutoParser.

To do this, we can run the format_dict function. It needs to know which columns of the data dictionary hold the field name, description, type and choices; in this example that information was supplied through the column_mappings entry passed to setup_config above.

formatted_data_dict = autoparser.format_dict(data_dict)
formatted_data_dict
| | source_field | source_description | source_type | choices |
|---|---|---|---|---|
| 0 | Identité | Identity | string | None |
| 1 | Province | Province | string | None |
| 2 | DateNotification | Notification Date | string | None |
| 3 | Classicfication | Classification | string | {'1': 'fish', '2': 'amphibie', '3': 'oiseau', ... |
| 4 | Nom complet | Full Name | string | None |
| 5 | Date de naissance | Date of Birth | string | None |
| 6 | AgeAns | Age in Years | number | None |
| 7 | AgeMois | Age in Months | number | None |
| 8 | Sexe | Gender | string | {'1': 'mâle', '2': 'femelle', '3': 'inconnu'} |
| 9 | StatusCas | Case Status | string | {'1': 'Vivant', '2': 'Décédé'} |
| 10 | DateDec | Date of Death | string | None |
| 11 | ContSoins | Care Contact | string | {'1': 'oui', '2': 'non'} |
| 12 | ContHumain Autre | Other Human Contact | string | {'1': 'oui', '2': 'non'} |
| 13 | ContexteContHumain | Human Contact Context | string | {'2': 'non', '1': 'voyage', '3': 'autres'} |
| 14 | ContactAnimal | Animal Contact | string | {'1': 'oui', '2': 'non'} |
| 15 | Micropucé | Microchipped | string | {'1': 'oui', '2': 'non'} |
| 16 | AnimalDeCompagnie | Pet Animal | string | {'1': 'oui', '2': 'non'} |
| 17 | ConditionsPreexistantes | Preexisting Conditions | list | None |

We can see that the dictionary’s headers have now been converted to a format recognised by autoparser, and the choices column contains dictionaries mapping each coded value to its label, rather than the string format of the input dictionary. This data dictionary was successfully validated and is ready to be used for data mapping and parser generation.
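The conversion of a choices string into a dictionary can be pictured with the sketch below. It is an illustration only: parse_choices and its separator arguments are hypothetical names, and the parsing autoparser actually performs may differ.

```python
def parse_choices(text, item_sep=",", pair_sep="="):
    """Convert a string such as '1=oui, 2=non' into {'1': 'oui', '2': 'non'}.

    Missing values (None or NaN) are returned as None.
    """
    if text is None or text != text:  # None, or float NaN (NaN != NaN)
        return None
    choices = {}
    for item in text.split(item_sep):
        if not item.strip():
            continue  # tolerate a trailing separator
        code, sep, label = item.partition(pair_sep)
        if not sep:
            raise ValueError(f"cannot parse choice {item!r}")
        choices[code.strip()] = label.strip()
    return choices


print(parse_choices("2=non, 1=voyage, 3=autres"))
# → {'2': 'non', '1': 'voyage', '3': 'autres'}
```

Keys are kept as strings and in their original order, which matches the shape of the dictionaries shown in the formatted table.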

AutoParser requires that every field (meaning every row in the data dictionary) has a description, and that those descriptions are unique. The field descriptions are what is used to map the raw data to the new schema, so their presence is vital, and it must be possible to tell them apart. A data dictionary will fail validation if the required columns cannot be identified, if descriptions are duplicated or missing, or if the options in the common_values or choices columns cannot be converted to their expected formats (a list of strings or a string dictionary, respectively). You can find help for validation errors in the [troubleshooting](../getting_started/index.md#troubleshooting) section of the docs.
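The two description rules can be checked by hand before calling format_dict. The following is a minimal sketch assuming each row is a plain dict with source_field and source_description keys; check_descriptions is a hypothetical helper and does not reproduce autoparser's own validation.

```python
from collections import Counter


def check_descriptions(rows):
    """Return a list of problems: missing or duplicated descriptions."""
    problems = []
    descriptions = []
    for row in rows:
        desc = row.get("source_description")
        # NaN, None and blank strings all count as a missing description.
        if not isinstance(desc, str) or not desc.strip():
            problems.append(f"{row.get('source_field')}: missing description")
        else:
            descriptions.append(desc.strip())
    for desc, count in Counter(descriptions).items():
        if count > 1:
            problems.append(f"description {desc!r} is used {count} times")
    return problems


rows = [
    {"source_field": "AgeAns", "source_description": "Age in Years"},
    {"source_field": "AgeMois", "source_description": "Age in Years"},
    {"source_field": "Sexe", "source_description": None},
]
print(check_descriptions(rows))
# → ['Sexe: missing description', "description 'Age in Years' is used 2 times"]
```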

Now we’ve validated the data dictionary, we can proceed to create an intermediate mapping file:

mapper = autoparser.WideMapper(formatted_data_dict, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping_choices.csv")

mapping_dict.head()

At this point, you should inspect the mapping file, look for fields and values that have been incorrectly mapped, and edit them where necessary. The mapping file has been written out to example_mapping_choices.csv. A good example is the ‘loc_admin_1’ field; the LLM often maps the common values provided to ‘None’ because the schema denotes this as a free-text field. Instead, delete these mapped values and the parsed data will contain the original free text. Also note any warning printed when the mapping is created; for this dataset the LLM is not expected to find source fields to map to the ‘country_iso3’ or ‘owner’ fields. If the original data did contain an appropriate field for these, you should edit the mapping file accordingly.

Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, example_parser_with_choices.toml:

writer = autoparser.ParserGenerator(
    "example_mapping_choices.csv", "", "example_choices"
)
writer.create_parser("example_parser_with_choices.toml")
Missing required field country_iso3 in animals schema. Adding empty field...

You can view/edit the created parser at example_parser_with_choices.toml, and use it with adtl.

import adtl

data = adtl.parse(
    "example_parser_with_choices.toml",
    "../../../tests/test_autoparser/sources/animal_data_choices.csv",
    "example_choices_output",
)
data["animals"].head()
[example_choices] parsing animal_data_choices.csv: 100%|██████████| 30/30 [00:00<00:00, 22541.94it/s]
[example_choices] validating animals table: 30it [00:00, 124460.06it/s]
| | age_months | age_years | chipped | identity | loc_admin_1 | name | notification_date | pet | underlying_conditions | country_iso3 | case_status | classification | sex | adtl_valid | adtl_error | date_of_death |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 2 | True | A001 | Equateur | Luna | 2024-01-01 | True | [arthrite, vomir] | | alive | mammal | female | False | data.underlying_conditions must be array or null | NaN |
| 1 | 4 | 3 | False | B002 | Equateur | Max | 2024-15-02 | True | NaN | | dead | fish | male | True | NaN | 2024-06-01 |
| 2 | 11 | 1 | True | C003 | Equateur | Coco | 2024-03-10 | False | NaN | | alive | bird | female | True | NaN | NaN |
| 3 | 5 | 4 | False | D004 | NaN | Bella | 2024-04-22 | False | NaN | | alive | amphibian | male | True | NaN | NaN |
| 4 | 3 | 5 | True | E005 | NaN | Charlie | 2024-05-30 | True | NaN | | dead | fish | female | True | NaN | 2024-07-01 |
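The adtl_valid and adtl_error columns show which parsed rows failed schema validation and why. A minimal sketch of collecting the failures follows; it works on plain dicts (as could be obtained with data["animals"].to_dict("records")), and invalid_rows is a hypothetical helper rather than part of adtl. The sample records are copied from the first three rows of the output above.

```python
# First three rows of the parsed output, reduced to the relevant columns.
records = [
    {"identity": "A001", "adtl_valid": False,
     "adtl_error": "data.underlying_conditions must be array or null"},
    {"identity": "B002", "adtl_valid": True, "adtl_error": None},
    {"identity": "C003", "adtl_valid": True, "adtl_error": None},
]


def invalid_rows(rows):
    """Return (identity, error message) for every row that failed validation."""
    return [(r["identity"], r["adtl_error"]) for r in rows if not r["adtl_valid"]]


print(invalid_rows(records))
# → [('A001', 'data.underlying_conditions must be array or null')]
```

Here only row A001 is flagged, and the message points at the underlying_conditions value, so that is the field to review in the mapping or parser file.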