Getting started
Installation
AutoParser is a Python package that can either be built into your code or run as a command-line interface (CLI). You can install AutoParser using pip:
python3 -m pip install "adtl[autoparser]"
We recommend installing into a virtual environment, and using uv to manage it. To create and activate a virtual environment for AutoParser using uv, run the following commands:
uv sync
. .venv/bin/activate
To see the options available in the CLI, type adtl-autoparser into the command line.
Other requirements
AutoParser relies on LLMs to automatically map raw data fields to a target schema.
In order to use this tool, you will need an API key for either OpenAI
or Google’s Gemini.
You can select which model to use, or keep the defaults, which are OpenAI’s gpt-4o-mini
and Google’s gemini-2.5-flash. Your model choice should support Structured Outputs (for OpenAI) or Controlled Generation (for Gemini). Please be aware that more powerful models, such as OpenAI’s
‘o’ series and Gemini 2.0, will cost more per API call.
These choices can be specified using the config class.
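As a rough illustration of the defaults described above, the sketch below picks a model name for a given provider. The `DEFAULT_MODELS` mapping and the `resolve_model` helper are hypothetical names used only for illustration; they are not part of AutoParser's config class, and the real configuration interface may look quite different.

```python
# Hypothetical helper for illustration only; NOT part of the AutoParser API.
# The default model names are the ones listed in this guide.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "gemini": "gemini-2.5-flash",
}


def resolve_model(provider, model=None):
    """Return the requested model, or the documented default for the provider."""
    key = provider.lower()
    if key not in DEFAULT_MODELS:
        raise ValueError(f"Unsupported provider: {provider!r}")
    # An explicit model choice always wins over the default.
    return model or DEFAULT_MODELS[key]


print(resolve_model("openai"))
print(resolve_model("gemini", "some-other-model"))
```

Keeping the defaults in one mapping makes it easy to see what you will be billed for if you do not override the model.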
The LLM should never see your raw data, only the data dictionary, which contains column headers, text descriptions of what each field contains, and a list of frequently occurring values if present.
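To make the shape of that information concrete, here is a minimal sketch of a column-level summary built from a tiny in-memory CSV using only the standard library. It is not how AutoParser builds its data dictionary: the `summarise_columns` function, its parameters, and the output keys are all assumptions made for this example.

```python
import csv
import io
from collections import Counter


def summarise_columns(csv_text, max_values=3, min_count=2):
    """Hypothetical sketch: one entry per column with its frequent values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    summary = []
    for name in reader.fieldnames or []:
        # Count non-empty values in this column.
        counts = Counter(row[name] for row in rows if row[name])
        # Keep only values that repeat, so one-off entries (e.g. IDs) are dropped.
        common = [v for v, n in counts.most_common(max_values) if n >= min_count]
        summary.append({"field": name, "description": "", "common_values": common})
    return summary


RAW = "id,sex,outcome\n1,F,recovered\n2,M,recovered\n3,F,died\n"
for entry in summarise_columns(RAW):
    print(entry)
```

In this sketch the `id` column ends up with no listed values because none repeat, while `sex` and `outcome` list only their repeated values; the description field is left empty for a person to fill in.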
Supported file formats
AutoParser supports CSV, XLSX and Parquet formats for raw data and data dictionary files, and either JSON or TOML for the target schema.
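If you want to check your inputs before running the tool, a small pre-flight check along the following lines may help. This is a sketch only: `check_suffix` and the two suffix sets are hypothetical names, and the check looks at file extensions rather than file contents. AutoParser's own validation may behave differently.

```python
from pathlib import Path

# Extensions taken from the formats listed in this guide.
DATA_SUFFIXES = {".csv", ".xlsx", ".parquet"}
SCHEMA_SUFFIXES = {".json", ".toml"}


def check_suffix(path, allowed):
    """Hypothetical helper: return the lower-cased suffix, or raise if unsupported."""
    suffix = Path(path).suffix.lower()
    if suffix not in allowed:
        expected = ", ".join(sorted(allowed))
        raise ValueError(f"{path}: expected one of {expected}")
    return suffix


print(check_suffix("data/raw_data.csv", DATA_SUFFIXES))
print(check_suffix("schemas/target.toml", SCHEMA_SUFFIXES))
```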
Quickstart
See the example notebook here for a basic walkthrough of AutoParser's functionality.
If you already have a data dictionary associated with your data, follow this example instead.