Parser construction example

This example demonstrates how to construct a parser file, using animal_data.csv as the source dataset.

Before you start: autoparser requires an LLM API key to function, for either OpenAI or Gemini. You should add yours to your environment, as described here. This example uses the OpenAI API; edit the API_KEY line below to match the name you gave yours.
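As a minimal sketch of the environment check described above, the helper below looks up a key by the name of its environment variable and fails early if it is missing. The name `read_api_key` is hypothetical and is not part of autoparser; it only illustrates confirming that the variable exists before any LLM call is made.

```python
import os


def read_api_key(env_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the environment variable `env_name`.

    Hypothetical helper for illustration; not part of autoparser.
    """
    key = os.environ.get(env_name, "").strip()
    if not key:
        raise RuntimeError(
            f"Environment variable {env_name!r} is not set; "
            "add your LLM API key to the environment first."
        )
    return key
```

Failing at this point gives a clearer message than an authentication error raised later from inside the LLM client.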

If you would prefer to use Gemini, use the llm_provider argument in functions where the API key is used, e.g.

writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')

You can also specify which OpenAI or Gemini model you wish to use with the llm_model argument. Your model choice should support Structured Outputs (for OpenAI) or Controlled Generation (for Gemini). The model should be provided as a string recognised by the respective API, e.g. llm_model = "gpt-4o-mini" (the default model when OpenAI is selected as the provider).

import pandas as pd

import adtl.autoparser as autoparser

API_KEY = "OPENAI_API_KEY"

autoparser.setup_config(
    {
        "language": "fr",
        "llm_provider": "openai",
        "api_key": API_KEY,
        "max_common_count": 8,
        "schemas": {
            "animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
        },
    }
)

data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data.csv")
data.head()

	Identité	Province	DateNotification	Classicfication	Nom complet	Date de naissance	AgeAns	AgeMois	Sexe	StatusCas	DateDec	ContSoins	ContHumain Autre	ContexteContHumain	ContactAnimal	Micropucé	AnimalDeCompagnie	ConditionsPreexistantes
0	A001	Equateur	2024-01-01	Mammifère	Luna	15/03/2022	2	10	f	Vivant	NaN	Oui	Non	Non	Oui	Oui	Oui	[arthrite, vomir]
1	B002	Equateur	2024-15-02	FISH	Max	21/07/2021	3	4	m	Décédé	2024-06-01	Non	Oui	Voyage	Non	NON	Oui	NaN
2	C003	Equateur	2024-03-10	oiseau	Coco	10/02/2023	1	11	F	Vivant	NaN	Oui	Non	Non	Oui	Oui	Non	NaN
3	D004	NaN	2024-04-22	amphibie	Bella	05/11/2020	4	5	m	Vivant	NaN	Oui	NaN	Autres	Non	NON	Non	NaN
4	E005	NaN	2024-05-30	poisson	Charlie	18/05/2019	5	3	F	Décédé	2024-07-01	NaN	NaN	Voyage	Oui	Oui	Oui	NaN

Let’s generate a basic data dictionary from this data set. We want to use the configuration file set up for this dataset, located in the tests directory.

writer = autoparser.DictWriter()
data_dict = writer.create_dict(data)
data_dict.head()

	Field Name	Description	Field Type	Common Values
0	Identité	NaN	string	NaN
1	Province	NaN	string	Equateur, Orientale, Katanga
2	DateNotification	NaN	string	NaN
3	Classicfication	NaN	string	FISH, amphibie, oiseau, Mammifère, poisson, REPT
4	Nom complet	NaN	string	NaN

The ‘Common Values’ column indicates fields where there are a limited number of unique values, suggesting mapping to a controlled terminology may have been done, or might be required in the parser. The list of common values is every unique value in the field.
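One way to picture how such a column could be derived is sketched below. This is an assumption for illustration, not autoparser's actual rule: a field gets a list of common values only when its number of distinct non-missing values is at or below a threshold, and `max_count` is a hypothetical parameter standing in for the `max_common_count` setting used earlier.

```python
def common_values(values, max_count=8):
    """Return the unique non-missing values of a field, or None if there are
    none or more than `max_count` of them (treated here as free text)."""
    unique = []
    for v in values:
        if v is None or v != v:  # skip None and NaN (NaN is not equal to itself)
            continue
        if v not in unique:
            unique.append(v)
    return unique if 0 < len(unique) <= max_count else None


province = ["Equateur", "Equateur", "Orientale", None, "Katanga"]
print(common_values(province))  # ['Equateur', 'Orientale', 'Katanga']

identity = [f"A{i:03d}" for i in range(30)]
print(common_values(identity))  # None
```

Under this rule an identifier column, where every row is distinct, gets no common values, which matches the NaN entries shown for Identité and Nom complet.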

Notice that the Description column is empty. To proceed to the next step of the parser generation process, creating the mapping file that links source fields to schema fields, this column must be filled. You can either do this by hand (the descriptions MUST be in English), or use autoparser’s LLM functionality to do it for you, as demonstrated below.
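If you fill in descriptions by hand, it can help to confirm that none were missed before moving on. The sketch below is a hypothetical check, not an autoparser function. It uses plain dictionaries to stand in for rows of the data dictionary, which you could obtain from a pandas DataFrame with `data_dict.to_dict("records")`.

```python
def fields_missing_description(rows):
    """Return the names of fields whose description is empty or missing."""
    missing = []
    for row in rows:
        desc = row.get("Description")
        # Treat None, NaN (not equal to itself) and blank strings as empty.
        if desc is None or desc != desc or not str(desc).strip():
            missing.append(row["Field Name"])
    return missing


rows = [
    {"Field Name": "Identité", "Description": "Identity"},
    {"Field Name": "Province", "Description": float("nan")},
    {"Field Name": "Nom complet", "Description": ""},
]
print(fields_missing_description(rows))  # ['Province', 'Nom complet']
```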

dd_described = writer.generate_descriptions(data_dict)
dd_described.head()

Now that we have a data dictionary with descriptions added, we can proceed to creating an intermediate mapping file:

mapper = autoparser.WideMapper(dd_described, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping.csv")

mapping_dict.head()

/Users/pipliggins/Documents/repos/adtl/src/adtl/autoparser/mapping/wide_mapper.py:152: UserWarning: The following schema fields have not been mapped: ['country_iso3', 'owner']
  warnings.warn(

target_field	source_description	source_field	common_values	target_values	value_mapping
identity	Identity	Identité	None	NaN	NaN
name	Full Name	Nom complet	None	NaN	NaN
loc_admin_1	Province	Province	equateur | katanga | orientale	NaN	equateur=None | katanga=None | orientale=None
country_iso3	None	NaN	NaN	NaN	NaN
notification_date	Notification Date	DateNotification	None	NaN	NaN

At this point, you should inspect the mapping file, which has been written out to example_mapping.csv, look for fields or values that have been incorrectly mapped, and edit them where necessary. A good example is the ‘loc_admin_1’ field: the LLM often maps the common values provided to ‘None’ because the schema denotes this as a free-text field. In that case, delete these mapped values and the parsed data will contain the original free text. Also note the warning above: the LLM did not find source fields to map to the ‘country_iso3’ or ‘owner’ schema fields. If the original data does contain an appropriate field for either of these, you should edit the mapping file accordingly.
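The manual inspection can be supported by a small script. The sketch below assumes the value_mapping cells are stored in the same `source=target | source=target` form displayed in the table above; the function names are hypothetical and not part of autoparser. It lists the source values that were mapped to None so you can decide whether to delete or correct them.

```python
def parse_value_mapping(cell):
    """Split a 'source=target | source=target' string into a dict.

    Targets written as 'None' (or left empty) are returned as Python None.
    """
    mapping = {}
    for pair in cell.split("|"):
        pair = pair.strip()
        if not pair:
            continue
        source, _, target = pair.partition("=")
        target = target.strip()
        mapping[source.strip()] = None if target in ("", "None") else target
    return mapping


def unmapped_values(cell):
    """Return the source values whose mapped target is None."""
    return [src for src, tgt in parse_value_mapping(cell).items() if tgt is None]


cell = "equateur=None | katanga=None | orientale=None"
print(unmapped_values(cell))  # ['equateur', 'katanga', 'orientale']
```

A field where every value comes back unmapped, as with loc_admin_1 here, is a likely candidate for the free-text treatment described above.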

Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, example_parser.toml:

writer = autoparser.ParserGenerator(
    "example_mapping.csv",
    "",
    "example",
)
writer.create_parser("example_parser.toml")

Missing required field country_iso3 in animals schema. Adding empty field...

You can view or edit the created parser at example_parser.toml, and use it with adtl.

import adtl

data = adtl.parse(
    "example_parser.toml",
    "../../../tests/test_autoparser/sources/animal_data.csv",
    "example_output",
)
data["animals"].head()

[example] parsing animal_data.csv: 100%|██████████| 30/30 [00:00<00:00, 22623.00it/s]
[example] validating animals table: 30it [00:00, 120873.31it/s]

	age_months	age_years	chipped	identity	name	notification_date	pet	case_status	classification	sex	underlying_conditions	adtl_valid	date_of_death	loc_admin_1	adtl_error
0	10	2	True	A001	Luna	2024-01-01	True	alive	mammal	female	[arthritis, vomiting]	True	NaN	NaN	NaN
1	4	3	False	B002	Max	2024-15-02	True	dead	fish	male	NaN	True	2024-06-01	NaN	NaN
2	11	1	True	C003	Coco	2024-03-10	False	alive	bird	female	NaN	True	NaN	NaN	NaN
3	5	4	False	D004	Bella	2024-04-22	False	alive	amphibian	male	NaN	True	NaN	NaN	NaN
4	3	5	True	E005	Charlie	2024-05-30	True	dead	fish	female	NaN	True	2024-07-01	NaN	NaN
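In the output above, the second row keeps its notification_date as 2024-15-02 and is still marked as valid. The sketch below shows one way to spot such values after parsing, under the assumption that dates are expected in year-month-day order. The function name is hypothetical, and whether the schema itself should reject these values is not covered here.

```python
from datetime import datetime


def is_iso_date(value):
    """Return True if `value` is a string that parses as a year-month-day date."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except (TypeError, ValueError):
        return False
    return True


dates = ["2024-01-01", "2024-15-02", "2024-03-10"]
print([d for d in dates if not is_iso_date(d)])  # ['2024-15-02']
```

Values flagged this way usually point back to the source data, where the day and month may have been swapped.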