Parser construction example with a user-provided data dictionary

This file demonstrates the process of constructing a parser file using animal_data_choices.csv as a source dataset.

Before you start: autoparser requires an LLM API key to function, for either OpenAI or Gemini. You should add yours to your environment, as described here. This example uses the OpenAI API; edit the API_KEY line below to match the name you gave yours.
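If the function you call expects the key value itself rather than the name of an environment variable, one way to avoid hard-coding it is to read it from the environment first. The sketch below is only an illustration: `load_api_key` is a hypothetical helper, not part of autoparser, and the variable name `OPENAI_API_KEY` is an assumption that should match whatever name you chose.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Return the API key stored under `env_var`, or raise if it is unset."""
    # `environ` is injectable so the helper can be tried out without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(f"Environment variable {env_var!r} is not set")
    return key


# Example with a stand-in environment (the key value here is a dummy).
demo_key = load_api_key("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-dummy"})
```

Failing early with a clear message is usually easier to debug than an authentication error raised later by the provider.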

If you would prefer to use Gemini, pass the llm_provider argument to functions that take the API key, e.g.

writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')

You can also specify which OpenAI or Gemini model to use with the llm_model argument. The model you choose should support Structured Outputs (for OpenAI) or Controlled Generation (for Gemini). Provide the model as a string recognised by the respective API, e.g. llm_model = "gpt-4o-mini" (the default model when OpenAI is selected as the provider).

import pandas as pd

import adtl.autoparser as autoparser

API_KEY = "OPENAI_API_KEY"

autoparser.setup_config(
    {
        "language": "fr",
        "llm_provider": "openai",
        "api_key": API_KEY,
        "max_common_count": 8,
        "schemas": {
            "animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
        },
        "column_mappings": {
            "source_field": "Field Name",
            "source_description": "Description",
            "source_type": "Field Type",
            "choices": "Choices",
        },
    }
)

data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data_choices.csv")
data.head()

	Identité	Province	DateNotification	Classicfication	Nom complet	Date de naissance	AgeAns	AgeMois	Sexe	StatusCas	DateDec	ContSoins	ContHumain Autre	ContexteContHumain	ContactAnimal	Micropucé	AnimalDeCompagnie	ConditionsPreexistantes
0	A001	Equateur	2024-01-01	4	Luna	15/03/2022	2	10	2	1	NaN	1.0	2.0	2.0	1	1	1	[arthrite, vomir]
1	B002	Equateur	2024-15-02	1	Max	21/07/2021	3	4	1	2	2024-06-01	2.0	1.0	1.0	2	2	1	NaN
2	C003	Equateur	2024-03-10	3	Coco	10/02/2023	1	11	2	1	NaN	1.0	2.0	2.0	1	1	2	NaN
3	D004	NaN	2024-04-22	2	Bella	05/11/2020	4	5	1	1	NaN	1.0	NaN	3.0	2	2	2	NaN
4	E005	NaN	2024-05-30	5	Charlie	18/05/2019	5	3	2	2	2024-07-01	NaN	NaN	1.0	1	1	1	NaN

You can see from the data above that many of the columns are encoded as numeric values rather than as strings (e.g. the ‘Sexe’ column contains 1s and 2s, not labels such as ‘mâle’ or ‘femelle’). This means the data dictionary must be used to translate those values into meaningful data, so let’s look at that.
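To make the idea of translating coded values concrete, here is a minimal sketch of decoding one column by hand. It is not how autoparser or adtl perform the translation; `normalise_code` and `decode_column` are hypothetical helpers, and the choices are copied from the ‘Sexe’ row of the data dictionary shown in this example.

```python
def normalise_code(value):
    """Turn 1, 1.0 or "1" into the string "1" so lookups are consistent."""
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value)


def decode_column(values, choices):
    """Replace coded values with labels; leave unknown or missing codes as-is."""
    return [choices.get(normalise_code(v), v) for v in values]


# Codes and labels for the 'Sexe' column, taken from the data dictionary.
sexe_choices = {"1": "mâle", "2": "femelle", "3": "inconnu"}

# The first five 'Sexe' values from the data preview.
decoded = decode_column([2, 1, 2, 1, 2], sexe_choices)
```

Normalising the codes to strings matters because columns with missing values are read as floats (1.0, 2.0) while the dictionary keys are strings.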

data_dict = pd.read_csv("../../../tests/test_autoparser/sources/animals_dd_choices.csv")
data_dict

	Field Name	Description	Field Type	Choices
0	Identité	Identity	string	NaN
1	Province	Province	string	NaN
2	DateNotification	Notification Date	string	NaN
3	Classicfication	Classification	string	1=fish, 2=amphibie, 3=oiseau, 4=mammifère, 5=p...
4	Nom complet	Full Name	string	NaN
5	Date de naissance	Date of Birth	string	NaN
6	AgeAns	Age in Years	number	NaN
7	AgeMois	Age in Months	number	NaN
8	Sexe	Gender	string	1=mâle, 2=femelle, 3=inconnu
9	StatusCas	Case Status	string	1=Vivant, 2=Décédé
10	DateDec	Date of Death	string	NaN
11	ContSoins	Care Contact	string	1=oui, 2=non
12	ContHumain Autre	Other Human Contact	string	1=oui, 2=non
13	ContexteContHumain	Human Contact Context	string	2=non, 1=voyage, 3=autres
14	ContactAnimal	Animal Contact	string	1=oui, 2=non
15	Micropucé	Microchipped	string	1=oui, 2=non
16	AnimalDeCompagnie	Pet Animal	string	1=oui, 2=non
17	ConditionsPreexistantes	Preexisting Conditions	list	NaN

Before we use this data dictionary to map our data, we should check that it can be converted and validated for use with AutoParser.

To do this, we can run the format_dict function. It relies on a configuration that describes how the columns should be mapped; here that is the column_mappings entry passed to setup_config above.

formatted_data_dict = autoparser.format_dict(data_dict)
formatted_data_dict

	source_field	source_description	source_type	choices
0	Identité	Identity	string	None
1	Province	Province	string	None
2	DateNotification	Notification Date	string	None
3	Classicfication	Classification	string	{'1': 'fish', '2': 'amphibie', '3': 'oiseau', ...
4	Nom complet	Full Name	string	None
5	Date de naissance	Date of Birth	string	None
6	AgeAns	Age in Years	number	None
7	AgeMois	Age in Months	number	None
8	Sexe	Gender	string	{'1': 'mâle', '2': 'femelle', '3': 'inconnu'}
9	StatusCas	Case Status	string	{'1': 'Vivant', '2': 'Décédé'}
10	DateDec	Date of Death	string	None
11	ContSoins	Care Contact	string	{'1': 'oui', '2': 'non'}
12	ContHumain Autre	Other Human Contact	string	{'1': 'oui', '2': 'non'}
13	ContexteContHumain	Human Contact Context	string	{'2': 'non', '1': 'voyage', '3': 'autres'}
14	ContactAnimal	Animal Contact	string	{'1': 'oui', '2': 'non'}
15	Micropucé	Microchipped	string	{'1': 'oui', '2': 'non'}
16	AnimalDeCompagnie	Pet Animal	string	{'1': 'oui', '2': 'non'}
17	ConditionsPreexistantes	Preexisting Conditions	list	None

We can see that the dictionary’s headers have now been converted to a format recognised by autoparser, and the choices column contains dictionaries of values mapped to data, rather than the string format of the input dictionary. This data dictionary was successfully validated and is ready to be used for data mapping and parser generation.
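The conversion of a choices string into a dictionary can be pictured with a small sketch. This is an illustration of the idea only, not autoparser's actual implementation; `parse_choices` and its separator parameters are hypothetical.

```python
def parse_choices(raw, item_sep=",", pair_sep="="):
    """Convert a string like "1=oui, 2=non" into {"1": "oui", "2": "non"}.

    Returns None for missing values (anything that is not a non-empty string).
    """
    if not isinstance(raw, str) or not raw.strip():
        return None
    choices = {}
    for item in raw.split(item_sep):
        code, sep, label = item.partition(pair_sep)
        if not sep:
            # No "=" found, so the item cannot be split into code and label.
            raise ValueError(f"Cannot parse choice {item!r}")
        choices[code.strip()] = label.strip()
    return choices


parsed = parse_choices("2=non, 1=voyage, 3=autres")
```

Keys stay as strings and keep their original order, matching the formatted dictionary shown above.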

AutoParser requires that every field (meaning every row in the data dictionary) has a description, and that those descriptions are unique. The field descriptions are what is used to map the raw data to the new schema, so their presence is vital, and it must be possible to tell them apart. A data dictionary will fail validation if the required columns cannot be identified, if descriptions are duplicated or missing, or if the options in the common_values or choices columns cannot be converted to their expected formats (a list of strings or a string dictionary, respectively). You can find help for validation errors in the [troubleshooting](../getting_started/index.md#troubleshooting) section of the docs.
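Before running the validation, it can help to check the description rule yourself. The following is a rough sketch of such a pre-check, not the validation autoparser performs; `check_descriptions` is a hypothetical helper that works on a plain list of description values.

```python
from collections import Counter


def check_descriptions(descriptions):
    """Return (row indices with no description, descriptions used more than once)."""
    def is_present(value):
        return isinstance(value, str) and bool(value.strip())

    missing = [i for i, d in enumerate(descriptions) if not is_present(d)]
    counts = Counter(d.strip() for d in descriptions if is_present(d))
    duplicates = sorted(d for d, n in counts.items() if n > 1)
    return missing, duplicates


# Made-up input for illustration: rows 2 and 4 lack a description,
# and "Province" appears twice.
missing, duplicates = check_descriptions(
    ["Identity", "Province", None, "Province", ""]
)
```

With a pandas data dictionary, the list could come from something like `data_dict["Description"].tolist()`, where the column name is the one used in this example.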

Now that we’ve validated the data dictionary, we can proceed to create an intermediate mapping file:

mapper = autoparser.WideMapper(formatted_data_dict, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping_choices.csv")

mapping_dict.head()

At this point, you should inspect the mapping file, look for fields or values that have been incorrectly mapped, and edit them where necessary. The mapping file has been written out to example_mapping_choices.csv. A good example is the ‘loc_admin_1’ field; the LLM often maps the common values provided to ‘None’ because the schema denotes this as a free-text field. In that case, delete these mapped values and the parsed data will contain the original free text. Also check any warning raised when the mapping was created: the LLM should not have found source fields to map to the ‘country_iso3’ or ‘owner’ fields. If the original data did contain an appropriate field for these, you should edit the mapping file accordingly.
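When reviewing the mapping file, a quick first pass is to list the target fields that received no source field. The sketch below assumes the CSV has `target_field` and `source_field` columns, which is an assumption about the file layout rather than something this page confirms. `unmapped_targets` is a hypothetical helper, and the sample text is made up for illustration.

```python
import csv
import io


def unmapped_targets(csv_text, target_col="target_field", source_col="source_field"):
    """Return target fields whose source field is empty in the mapping CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[target_col]
        for row in reader
        if not (row.get(source_col) or "").strip()
    ]


# Illustrative stand-in for the contents of a mapping file (not real output).
sample = (
    "target_field,source_field\n"
    "identity,Identité\n"
    "country_iso3,\n"
    "owner,\n"
    "sex,Sexe\n"
)
to_review = unmapped_targets(sample)
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the same function.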

Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, example_parser_with_choices.toml:

writer = autoparser.ParserGenerator(
    "example_mapping_choices.csv", "", "example_choices"
)
writer.create_parser("example_parser_with_choices.toml")

Missing required field country_iso3 in animals schema. Adding empty field...

You can view or edit the created parser at example_parser_with_choices.toml, and use it with adtl.

import adtl

data = adtl.parse(
    "example_parser_with_choices.toml",
    "../../../tests/test_autoparser/sources/animal_data_choices.csv",
    "example_choices_output",
)
data["animals"].head()

[example_choices] parsing animal_data_choices.csv: 100%|██████████| 30/30 [00:00<00:00, 22541.94it/s]
[example_choices] validating animals table: 30it [00:00, 124460.06it/s]

	age_months	age_years	chipped	identity	loc_admin_1	name	notification_date	pet	underlying_conditions	case_status	classification	sex	adtl_valid	adtl_error	date_of_death
0	10	2	True	A001	Equateur	Luna	2024-01-01	True	[arthrite, vomir]	alive	mammal	female	False	data.underlying_conditions must be array or null	NaN
1	4	3	False	B002	Equateur	Max	2024-15-02	True	NaN	dead	fish	male	True	NaN	2024-06-01
2	11	1	True	C003	Equateur	Coco	2024-03-10	False	NaN	alive	bird	female	True	NaN	NaN
3	5	4	False	D004	NaN	Bella	2024-04-22	False	NaN	alive	amphibian	male	True	NaN	NaN
4	3	5	True	E005	NaN	Charlie	2024-05-30	True	NaN	dead	fish	female	True	NaN	2024-07-01
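The adtl_valid and adtl_error columns in the output above show which rows failed schema validation and why. As a rough sketch of how those rows might be summarised, the snippet below counts error messages over plain dictionaries. `summarise_errors` is a hypothetical helper, and the rows are a reduced copy of the first three rows of the table, keeping only the columns needed here.

```python
from collections import Counter


def summarise_errors(rows):
    """Count validation error messages among rows flagged as invalid."""
    return Counter(row["adtl_error"] for row in rows if not row["adtl_valid"])


# Reduced copy of the first three output rows shown above.
rows = [
    {
        "identity": "A001",
        "adtl_valid": False,
        "adtl_error": "data.underlying_conditions must be array or null",
    },
    {"identity": "B002", "adtl_valid": True, "adtl_error": None},
    {"identity": "C003", "adtl_valid": True, "adtl_error": None},
]
error_counts = summarise_errors(rows)
```

With the real DataFrame, the rows could be obtained with `data["animals"].to_dict("records")`.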