Parser construction example

This file demonstrates the process of constructing a parser file using animals.csv as a source dataset.

Before you start: autoparser requires an LLM API key to function, for either OpenAI or Gemini. You should add yours to your environment, as described here. This example uses the OpenAI API; edit the API_KEY line below to match the name you gave yours.
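One way to avoid hard-coding the key in a notebook is to read it from the environment explicitly. The sketch below assumes the key is stored in an environment variable named OPENAI_API_KEY and that the api_key setting expects the key value itself; both are assumptions, and `load_api_key` is a hypothetical helper, not part of autoparser.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in the given environment variable.

    Hypothetical helper for illustration; the variable name is an assumption.
    """
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of a 401 from the provider.
        raise RuntimeError(
            f"Set the {env_var} environment variable before running this example."
        )
    return key
```

With this in place, `API_KEY = load_api_key()` could stand in for a literal string, and a missing key would fail before any request is sent.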

If you would prefer to use Gemini, pass the llm_provider argument to functions where the API key is used, e.g.

writer.generate_descriptions("fr", data_dict, key=API_KEY, llm_provider='gemini')

You can also specify which OpenAI or Gemini model to use with the llm_model argument. Your chosen model should support Structured Outputs (for OpenAI) or Controlled Generation (for Gemini). Provide the model as a string recognised by the respective API, e.g. llm_model = "gpt-4o-mini" (the default model when OpenAI is selected as the provider).

import pandas as pd

import adtl.autoparser as autoparser

API_KEY = "OPENAI_API_KEY"
autoparser.setup_config(
    {
        "language": "fr",
        "llm_provider": "openai",
        "api_key": API_KEY,
        "max_common_count": 8,
        "schemas": {
            "animals": "../../../tests/test_autoparser/schemas/animals.schema.json"
        },
    }
)
data = pd.read_csv("../../../tests/test_autoparser/sources/animal_data.csv")
data.head()
Identité Province DateNotification Classicfication Nom complet Date de naissance AgeAns AgeMois Sexe StatusCas DateDec ContSoins ContHumain Autre ContexteContHumain ContactAnimal Micropucé AnimalDeCompagnie ConditionsPreexistantes
0 A001 Equateur 2024-01-01 Mammifère Luna 15/03/2022 2 10 f Vivant NaN Oui Non Non Oui Oui Oui [arthrite, vomir]
1 B002 Equateur 2024-15-02 FISH Max 21/07/2021 3 4 m Décédé 2024-06-01 Non Oui Voyage Non NON Oui NaN
2 C003 Equateur 2024-03-10 oiseau Coco 10/02/2023 1 11 F Vivant NaN Oui Non Non Oui Oui Non NaN
3 D004 NaN 2024-04-22 amphibie Bella 05/11/2020 4 5 m Vivant NaN Oui NaN Autres Non NON Non NaN
4 E005 NaN 2024-05-30 poisson Charlie 18/05/2019 5 3 F Décédé 2024-07-01 NaN NaN Voyage Oui Oui Oui NaN

Let’s generate a basic data dictionary from this dataset. We want to use the configuration set up for this dataset, whose schema is located in the tests directory.

writer = autoparser.DictWriter()
data_dict = writer.create_dict(data)
data_dict.head()
   Field Name        Description  Field Type  Common Values
0  Identité          NaN          string      NaN
1  Province          NaN          string      Equateur, Orientale, Katanga
2  DateNotification  NaN          string      NaN
3  Classicfication   NaN          string      FISH, amphibie, oiseau, Mammifère, poisson, REPT
4  Nom complet       NaN          string      NaN

The ‘Common Values’ column indicates fields where there are a limited number of unique values, suggesting mapping to a controlled terminology may have been done, or might be required in the parser. The list of common values is every unique value in the field.
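To make the idea concrete, here is a minimal sketch of how a list of common values could be derived. It is not autoparser's implementation: it assumes a column qualifies when its number of unique non-missing values is at most a threshold (named `max_common_count` here by analogy with the config key, whose exact semantics are an assumption), and `common_values` is a hypothetical function.

```python
def common_values(rows, max_common_count=8):
    """Return {column: sorted unique values} for low-cardinality columns.

    Illustrative only: the thresholding rule is an assumption, not
    autoparser's documented behaviour.
    """
    unique = {}
    for row in rows:
        for column, value in row.items():
            if value is not None:  # skip missing values
                unique.setdefault(column, set()).add(value)
    return {
        column: sorted(values)
        for column, values in unique.items()
        if len(values) <= max_common_count
    }


# A few rows taken from the sample data shown earlier.
rows = [
    {"Identité": "A001", "Province": "Equateur", "Sexe": "f"},
    {"Identité": "B002", "Province": "Equateur", "Sexe": "m"},
    {"Identité": "C003", "Province": "Equateur", "Sexe": "F"},
    {"Identité": "D004", "Province": None, "Sexe": "m"},
    {"Identité": "E005", "Province": None, "Sexe": "F"},
]
result = common_values(rows, max_common_count=3)
```

With a threshold of 3, the identifier column is excluded because every value is unique, while Province and Sexe are kept. Note that "f" and "F" count as distinct values, which is the kind of inconsistency a mapping to controlled terminology would resolve.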

Notice that the Description column is empty. The next step of the parser generation process creates the mapping file linking source -> schema fields, and it requires this column to be filled. You can do this either by hand (the descriptions MUST be in English) or with autoparser’s LLM functionality, demonstrated below.

dd_described = writer.generate_descriptions(data_dict)
dd_described.head()

Now that we have a data dictionary with descriptions added, we can proceed to creating an intermediate mapping file:

mapper = autoparser.WideMapper(dd_described, "animals")
mapping_dict = mapper.create_mapping(file_name="example_mapping.csv")

mapping_dict.head()
/Users/pipliggins/Documents/repos/adtl/src/adtl/autoparser/mapping/wide_mapper.py:152: UserWarning: The following schema fields have not been mapped: ['country_iso3', 'owner']
  warnings.warn(
target_field       source_description  source_field      common_values                   target_values  value_mapping
identity           Identity            Identité          None                            NaN            NaN
name               Full Name           Nom complet       None                            NaN            NaN
loc_admin_1        Province            Province          equateur | katanga | orientale  NaN            equateur=None | katanga=None | orientale=None
country_iso3       None                NaN               NaN                             NaN            NaN
notification_date  Notification Date   DateNotification  None                            NaN            NaN

At this point, you should inspect the mapping file and look for fields/values that have been incorrectly mapped, and edit them where necessary. The mapping file has been written out to example_mapping.csv. A good example is the ‘loc_admin_1’ field; the LLM often maps the common values provided to ‘None’ as the schema denotes this as a free-text field. Instead, delete these mapped values and the parsed data will contain the original free text. Also note the warning above; the LLM should not have found fields to map to the ‘country_iso3’ or ‘owner’ fields. If the original data did contain an appropriate field for these, you should edit the mapping file accordingly.
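The ‘loc_admin_1’ fix can be done in any spreadsheet or text editor, but a scripted version may be useful for repeat runs. The sketch below assumes the mapping CSV has the column headers shown in the table above (target_field, value_mapping, and so on); the on-disk layout of example_mapping.csv is not confirmed here, and `clear_value_mapping` is a hypothetical helper that works on CSV text rather than on the file itself.

```python
import csv
import io


def clear_value_mapping(csv_text, target_field):
    """Blank the value_mapping cell for one target field in mapping CSV text.

    Assumes 'target_field' and 'value_mapping' columns exist (an assumption
    based on the table displayed above).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)  # reading the rows also populates reader.fieldnames
    for row in rows:
        if row["target_field"] == target_field:
            row["value_mapping"] = ""  # keep the original free text when parsing
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Two rows in the assumed layout, mirroring the table above.
sample = (
    "target_field,source_description,source_field,common_values,target_values,value_mapping\n"
    "identity,Identity,Identité,,,\n"
    "loc_admin_1,Province,Province,equateur | katanga | orientale,,"
    "equateur=None | katanga=None | orientale=None\n"
)
edited = clear_value_mapping(sample, "loc_admin_1")
```

Only the value_mapping cell is cleared; the common_values cell is left untouched so the record of what the source contained is preserved.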

Once you have edited the mapping file to your satisfaction, we can go ahead and create the TOML parser file, example_parser.toml:

writer = autoparser.ParserGenerator(
    "example_mapping.csv",
    "",
    "example",
)
writer.create_parser("example_parser.toml")
Missing required field country_iso3 in animals schema. Adding empty field...

You can view/edit the created parser at example_parser.toml, and use it with adtl.

import adtl

data = adtl.parse(
    "example_parser.toml",
    "../../../tests/test_autoparser/sources/animal_data.csv",
    "example_output",
)
data["animals"].head()
[example] parsing animal_data.csv: 100%|██████████| 30/30 [00:00<00:00, 22623.00it/s]
[example] validating animals table: 30it [00:00, 120873.31it/s]
age_months age_years chipped identity name notification_date pet country_iso3 case_status classification sex underlying_conditions adtl_valid date_of_death loc_admin_1 adtl_error
0 10 2 True A001 Luna 2024-01-01 True alive mammal female [arthritis, vomiting] True NaN NaN NaN
1 4 3 False B002 Max 2024-15-02 True dead fish male NaN True 2024-06-01 NaN NaN
2 11 1 True C003 Coco 2024-03-10 False alive bird female NaN True NaN NaN NaN
3 5 4 False D004 Bella 2024-04-22 False alive amphibian male NaN True NaN NaN NaN
4 3 5 True E005 Charlie 2024-05-30 True dead fish female NaN True 2024-07-01 NaN NaN