Parser construction example with a user-provided data dictionary
This file demonstrates the process of constructing a parser file using animal_data_choices.csv as a source dataset.
Before you start: autoparser requires an LLM API key to function, for either OpenAI or Gemini.
You should add yours to your environment, as described here.
This example uses the OpenAI API; edit the API_KEY line below to match the name you gave yours.
If you would prefer to use Gemini, pass the llm_provider argument to the functions that take the API key, e.g.
writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')
You can also specify which model from either OpenAI or Gemini you wish to use, with the llm_model argument. Your model choice should support Structured Outputs (for OpenAI) or Controlled Generation (for Gemini).
The model should be provided as a string recognised by the respective API, e.g. llm_model = "gpt-4o-mini" (the default model when OpenAI is selected as the provider).
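import os
import pandas as pd
import adtl.autoparser as autoparser
# Look up the key by the name of the environment variable it is stored under.
API_KEY = os.environ.get("OPENAI_API_KEY")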
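A placeholder or misspelled key is typically rejected only when the first LLM request is made, so it can help to fail early. The helper below is a minimal sketch and not part of autoparser: `load_api_key` and its `environ` parameter are hypothetical names introduced here, and the check only confirms that the variable is set, not that the key is valid.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY", environ=None):
    """Return the key stored under `env_var`, or raise if it is unset or empty."""
    # `environ` is injectable so the check can be exercised without touching
    # the real environment; it defaults to os.environ.
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var!r} is not set; export it before "
            "calling any function that contacts the LLM provider."
        )
    return key


# Example with a stand-in environment (hypothetical value, not a real key).
demo_key = load_api_key("OPENAI_API_KEY", environ={"OPENAI_API_KEY": "sk-demo"})
```

In a real session you would call `load_api_key()` with no `environ` argument so that the value comes from your own environment.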
autoparser.setup_config(
{
"language": "fr",
"llm_provider": "openai",
"api_key": API_KEY,
"max_common_count": 8,
"schemas": {
"animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
},
"column_mappings": {
"source_field": "Field Name",
"source_description": "Description",
"source_type": "Field Type",
"choices": "Choices",
},
}
)
data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data_choices.csv")
data.head()
| | Identité | Province | DateNotification | Classicfication | Nom complet | Date de naissance | AgeAns | AgeMois | Sexe | StatusCas | DateDec | ContSoins | ContHumain Autre | ContexteContHumain | ContactAnimal | Micropucé | AnimalDeCompagnie | ConditionsPreexistantes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A001 | Equateur | 2024-01-01 | 4 | Luna | 15/03/2022 | 2 | 10 | 2 | 1 | NaN | 1.0 | 2.0 | 2.0 | 1 | 1 | 1 | [arthrite, vomir] |
| 1 | B002 | Equateur | 2024-15-02 | 1 | Max | 21/07/2021 | 3 | 4 | 1 | 2 | 2024-06-01 | 2.0 | 1.0 | 1.0 | 2 | 2 | 1 | NaN |
| 2 | C003 | Equateur | 2024-03-10 | 3 | Coco | 10/02/2023 | 1 | 11 | 2 | 1 | NaN | 1.0 | 2.0 | 2.0 | 1 | 1 | 2 | NaN |
| 3 | D004 | NaN | 2024-04-22 | 2 | Bella | 05/11/2020 | 4 | 5 | 1 | 1 | NaN | 1.0 | NaN | 3.0 | 2 | 2 | 2 | NaN |
| 4 | E005 | NaN | 2024-05-30 | 5 | Charlie | 18/05/2019 | 5 | 3 | 2 | 2 | 2024-07-01 | NaN | NaN | 1.0 | 1 | 1 | 1 | NaN |
You can see from the above data that a lot of the columns are encoded as numeric values rather than as strings (e.g. the ‘Sexe’ column contains 1’s and 2’s, not gender identities). This means the data dictionary must be used to translate those values into meaningful data; so let’s look at that.
data_dict = pd.read_csv("../../../tests/test_autoparser/sources/animals_dd_choices.csv")
data_dict
| | Field Name | Description | Field Type | Choices |
|---|---|---|---|---|
| 0 | Identité | Identity | string | NaN |
| 1 | Province | Province | string | NaN |
| 2 | DateNotification | Notification Date | string | NaN |
| 3 | Classicfication | Classification | string | 1=fish, 2=amphibie, 3=oiseau, 4=mammifère, 5=p... |
| 4 | Nom complet | Full Name | string | NaN |
| 5 | Date de naissance | Date of Birth | string | NaN |
| 6 | AgeAns | Age in Years | number | NaN |
| 7 | AgeMois | Age in Months | number | NaN |
| 8 | Sexe | Gender | string | 1=mâle, 2=femelle, 3=inconnu |
| 9 | StatusCas | Case Status | string | 1=Vivant, 2=Décédé |
| 10 | DateDec | Date of Death | string | NaN |
| 11 | ContSoins | Care Contact | string | 1=oui, 2=non |
| 12 | ContHumain Autre | Other Human Contact | string | 1=oui, 2=non |
| 13 | ContexteContHumain | Human Contact Context | string | 2=non, 1=voyage, 3=autres |
| 14 | ContactAnimal | Animal Contact | string | 1=oui, 2=non |
| 15 | Micropucé | Microchipped | string | 1=oui, 2=non |
| 16 | AnimalDeCompagnie | Pet Animal | string | 1=oui, 2=non |
| 17 | ConditionsPreexistantes | Preexisting Conditions | list | NaN |
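To make the role of the Choices column concrete, the sketch below decodes the first five StatusCas values from the data preview by hand. It is plain Python for illustration only, with variable names made up here; the later steps in this example take care of this translation for you.

```python
# Choices for 'StatusCas', copied from the data dictionary row above.
status_choices = {"1": "Vivant", "2": "Décédé"}

# StatusCas values from the first five rows of the data preview.
raw_status = [1, 2, 1, 1, 2]

# Codes are stored as numbers in the data but as text keys in the dictionary,
# so convert each code to a string before looking it up.
decoded = [status_choices.get(str(code)) for code in raw_status]
print(decoded)
```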
Before we use this data dictionary to map our data, we should check that it can be converted and validated for use with AutoParser.
To do this, we can run the format_dict function, providing a config file that describes how the columns should be mapped, like this one located in the tests directory.
formatted_data_dict = autoparser.format_dict(data_dict)
formatted_data_dict
| | source_field | source_description | source_type | choices |
|---|---|---|---|---|
| 0 | Identité | Identity | string | None |
| 1 | Province | Province | string | None |
| 2 | DateNotification | Notification Date | string | None |
| 3 | Classicfication | Classification | string | {'1': 'fish', '2': 'amphibie', '3': 'oiseau', ... |
| 4 | Nom complet | Full Name | string | None |
| 5 | Date de naissance | Date of Birth | string | None |
| 6 | AgeAns | Age in Years | number | None |
| 7 | AgeMois | Age in Months | number | None |
| 8 | Sexe | Gender | string | {'1': 'mâle', '2': 'femelle', '3': 'inconnu'} |
| 9 | StatusCas | Case Status | string | {'1': 'Vivant', '2': 'Décédé'} |
| 10 | DateDec | Date of Death | string | None |
| 11 | ContSoins | Care Contact | string | {'1': 'oui', '2': 'non'} |
| 12 | ContHumain Autre | Other Human Contact | string | {'1': 'oui', '2': 'non'} |
| 13 | ContexteContHumain | Human Contact Context | string | {'2': 'non', '1': 'voyage', '3': 'autres'} |
| 14 | ContactAnimal | Animal Contact | string | {'1': 'oui', '2': 'non'} |
| 15 | Micropucé | Microchipped | string | {'1': 'oui', '2': 'non'} |
| 16 | AnimalDeCompagnie | Pet Animal | string | {'1': 'oui', '2': 'non'} |
| 17 | ConditionsPreexistantes | Preexisting Conditions | list | None |
We can see that the dictionary’s headers have now been converted to a format recognised by autoparser, and the choices column contains dictionaries of values mapped to data, rather than the string format of the input dictionary. This data dictionary was successfully validated and is ready to be used for data mapping and parser generation.
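The change in the choices column can be pictured with a small stand-alone function. This is a sketch of the idea only, not autoparser's implementation: `parse_choices` is a hypothetical helper, and it assumes entries are separated by commas and written as `code=label`, as in the example dictionary, so a label that itself contains a comma would not parse correctly.

```python
def parse_choices(text):
    """Turn a string such as '1=oui, 2=non' into {'1': 'oui', '2': 'non'}."""
    choices = {}
    for pair in text.split(","):
        # Split on the first '=' only, so a label may itself contain '='.
        code, separator, label = pair.partition("=")
        if not separator:
            raise ValueError(f"Cannot parse choice {pair!r}: expected 'code=label'")
        choices[code.strip()] = label.strip()
    return choices


sexe = parse_choices("1=mâle, 2=femelle, 3=inconnu")
context = parse_choices("2=non, 1=voyage, 3=autres")
print(sexe)
print(context)
```

Keys stay as strings, matching the dictionaries shown in the formatted table, and the original ordering of the entries is preserved.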
AutoParser requires that every field (meaning every row in the data dictionary) has a description, and that those descriptions are unique. The field descriptions are what is used to map the raw data to the new schema, so their presence is vital, and it must be possible to tell them apart. A data dictionary will fail validation if the required columns cannot be identified, if descriptions are duplicated or missing, or if the options in the common_values or choices columns cannot be converted to their expected formats (a list of strings or a string dictionary, respectively). You can find help for validation errors in the [troubleshooting](../getting_started/index.md#troubleshooting) section of the docs.
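If you want to check the description rules yourself before running validation, something along these lines can work. It is a minimal sketch in plain Python, not the check that autoparser runs: `is_missing` and `check_descriptions` are hypothetical helpers, and they cover only the missing and duplicated description cases.

```python
from collections import Counter


def is_missing(value):
    """True for None, NaN (which is never equal to itself) or blank text."""
    return value is None or value != value or not str(value).strip()


def check_descriptions(descriptions):
    """Return human-readable problems found in a list of descriptions."""
    problems = []
    missing_rows = [i for i, d in enumerate(descriptions) if is_missing(d)]
    if missing_rows:
        problems.append(f"missing description in rows {missing_rows}")
    counts = Counter(d for d in descriptions if not is_missing(d))
    duplicated = sorted(d for d, n in counts.items() if n > 1)
    if duplicated:
        problems.append(f"duplicated descriptions: {duplicated}")
    return problems


clean = check_descriptions(["Identity", "Province", "Notification Date"])
flawed = check_descriptions(["Identity", None, "Identity"])
print(clean)
print(flawed)
```

With a pandas data dictionary, the list to pass in would be the description column converted with `list(...)`.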
Now we’ve validated the data dictionary, we can proceed to create an intermediate mapping file:
mapper = autoparser.WideMapper(formatted_data_dict, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping_choices.csv")
mapping_dict.head()
At this point, you should inspect the mapping file, look for fields or values that have been incorrectly mapped, and edit them where necessary. The mapping file has been written out to example_mapping_choices.csv. A good example is the ‘loc_admin_1’ field: the LLM often maps the common values provided to ‘None’ because the schema denotes this as a free-text field. Delete these mapped values and the parsed data will contain the original free text instead. Also note any warnings printed during mapping; the LLM should not have found fields to map to the ‘country_iso3’ or ‘owner’ fields. If the original data did contain an appropriate field for these, you should edit the mapping file accordingly.
Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, example_parser_with_choices.toml:
writer = autoparser.ParserGenerator(
"example_mapping_choices.csv", "", "example_choices"
)
writer.create_parser("example_parser_with_choices.toml")
Missing required field country_iso3 in animals schema. Adding empty field...
You can view or edit the created parser at example_parser_with_choices.toml, and use it with adtl.
import adtl
data = adtl.parse(
"example_parser_with_choices.toml",
"../../../tests/test_autoparser/sources/animal_data_choices.csv",
"example_choices_output",
)
data["animals"].head()
[example_choices] parsing animal_data_choices.csv: 100%|██████████| 30/30 [00:00<00:00, 22541.94it/s]
[example_choices] validating animals table: 30it [00:00, 124460.06it/s]
| | age_months | age_years | chipped | identity | loc_admin_1 | name | notification_date | pet | underlying_conditions | country_iso3 | case_status | classification | sex | adtl_valid | adtl_error | date_of_death |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 2 | True | A001 | Equateur | Luna | 2024-01-01 | True | [arthrite, vomir] | | alive | mammal | female | False | data.underlying_conditions must be array or null | NaN |
| 1 | 4 | 3 | False | B002 | Equateur | Max | 2024-15-02 | True | NaN | | dead | fish | male | True | NaN | 2024-06-01 |
| 2 | 11 | 1 | True | C003 | Equateur | Coco | 2024-03-10 | False | NaN | | alive | bird | female | True | NaN | NaN |
| 3 | 5 | 4 | False | D004 | NaN | Bella | 2024-04-22 | False | NaN | | alive | amphibian | male | True | NaN | NaN |
| 4 | 3 | 5 | True | E005 | NaN | Charlie | 2024-05-30 | True | NaN | | dead | fish | female | True | NaN | 2024-07-01 |
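The adtl_valid and adtl_error columns appear to record whether each parsed row passed schema validation and, if not, why. The sketch below pulls out the failing rows from the five previewed above; it uses plain Python dictionaries with values copied from that preview, so it runs without the parsed output.

```python
# Columns copied from the output preview above (first five rows only).
rows = [
    {"identity": "A001", "adtl_valid": False,
     "adtl_error": "data.underlying_conditions must be array or null"},
    {"identity": "B002", "adtl_valid": True, "adtl_error": None},
    {"identity": "C003", "adtl_valid": True, "adtl_error": None},
    {"identity": "D004", "adtl_valid": True, "adtl_error": None},
    {"identity": "E005", "adtl_valid": True, "adtl_error": None},
]

# Keep only the rows that failed validation, together with their error message.
invalid = [(r["identity"], r["adtl_error"]) for r in rows if not r["adtl_valid"]]
print(invalid)
```

On the real output, and assuming adtl_valid is a boolean column, the equivalent pandas filter would be `animals = data["animals"]` followed by `animals.loc[~animals["adtl_valid"], ["identity", "adtl_error"]]`.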