Specification#
The ADTL specification describes the field mappings from the source file to the target schema. The format is under development and expected to change.
Specification files can be in TOML or JSON, with TOML preferred due to readability.
Each specification file can refer to one or more tables, which are created in parallel from one source file.
Pydantic schema: This is a Pydantic model to validate adtl parser files.
Type conversion#
If field data types are specified in the target schema, ADTL will attempt to coerce the parsed source data into that format. In the case of numeric data -> integer types, the value will be rounded to the nearest whole number to conform to the schema. If the value can’t be converted, the original data type will be returned.
Where data types aren’t provided, ADTL will return numeric data as-is, and attempt to cast string values first as integers (no rounding applied), then floats, returning the original format if these both fail.
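The fallback casting order described above (integer, then float, then the original value) can be illustrated with a minimal Python sketch. The function name `infer_type` is hypothetical and this is not adtl's own code; it only mirrors the behaviour described in this section.

```python
def infer_type(value):
    """Cast a string to int, then float; otherwise return it unchanged."""
    if not isinstance(value, str):
        return value  # numeric data is returned as-is
    for cast in (int, float):
        try:
            return cast(value)  # int("2.7") fails, so no rounding happens here
        except ValueError:
            continue
    return value  # neither cast worked: keep the original string
```

Under these assumptions, `infer_type("42")` gives an integer, `infer_type("2.7")` gives a float, and a string such as `"n/a"` is returned untouched.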
Metadata#
These metadata fields are defined under a header key adtl.
Required fields#
name: Name of the specification, usually the source data name in lowercase and hyphenated. By convention, this is the same name as the specification file.
description: Description of the specification
tables: Dictionary with keys as names of tables that are mapped from the source file. Each table key contains a dictionary with the following optional keys:
kind: If this is set to groupBy the parser will group rows together according to the groupBy key. The other allowed values are constant, where the table has fixed content (e.g. metadata), oneToOne, where one row in the source data corresponds to one row in the output data, and oneToMany when multiple rows are generated from the same row.
groupBy: Attribute(s) to group by
discriminator: Column name used to choose between subschemas with kind oneToMany
aggregation: Aggregation type; currently lastNotNull and applyCombinedType are supported. Aggregation sets each attribute to the last non-null value in the grouped dataset. applyCombinedType applies combinedType rules over all the rows being grouped, while lastNotNull applies those rules only along a single row, and retains the last row regardless.
schema (optional): Specifies JSON schema to use for validation, can be a relative path, or a URL
common (optional): Specifies common mappings that are applied to every if-block in a kind=oneToMany table.
optional-fields (optional): Specifies list of fields that are ordinarily required under the schema, but are considered optional for this parser.
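As a rough illustration of what a groupBy table with lastNotNull aggregation does, the sketch below collapses rows that share a key and keeps the last non-null value seen for each attribute. The function name and sample rows are hypothetical, and the sketch ignores combinedType rules entirely.

```python
from collections import defaultdict

def group_last_not_null(rows, group_by):
    """Collapse rows sharing a group_by value, keeping the last non-null value per attribute."""
    grouped = defaultdict(dict)
    for row in rows:
        target = grouped[row[group_by]]
        for key, value in row.items():
            if value is not None:
                target[key] = value  # later non-null values overwrite earlier ones
    return list(grouped.values())

rows = [
    {"subjid": "A", "sex": "female", "outcome": None},
    {"subjid": "A", "sex": None, "outcome": "discharged"},
    {"subjid": "B", "sex": "male", "outcome": None},
]
result = group_last_not_null(rows, "subjid")
```

Here the two rows for subject A are merged into one, with sex taken from the first row and outcome from the second.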
Optional fields#
defs: Definitions that can be referred to elsewhere in the schema
include-def (list): List of additional TOML or JSON files to import as definitions
skipFieldPattern: Regex string matching field names which may be skipped if not present in a datafile, following the same syntax as the fieldPattern key.
defaultDateFormat: Default source date format, applied to all fields with either “date_” / “_date” in the field name or that have format date set in the JSON schema
returnUnmatched: Returns all values that are not able to be converted according to the provided rules and formats. For fields with value mappings, it is equivalent to using ignoreMissingKey. Fields using data transformation functions will issue a warning to the terminal describing the error in the transformation. Transformations requiring multiple parameters will only return the current field value that was not transformed.
Warning: This is likely to return columns with non-matching datatypes. External JSON validation may fail. This option is incompatible with the --parquet option to save outputs as parquet files (which requires a consistent type down each column).
emptyFields: If the source data has something padding the empty fields (e.g. an Excel file with ‘NA’ in all the unfilled cells), use this field to specify what that code is so it can be stripped out.
Validation#
adtl supports validation using JSON Schema, up to draft-07 of the specification. Validation is performed using fastjsonschema.
adtl does not raise errors on validation issues. Instead two special columns are added to each table that has an associated schema:
adtl_valid (boolean): True if row is valid according to JSON schema, False otherwise
adtl_error (string): Validation error message returned by fastjsonschema
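The pattern of recording validation results as columns instead of raising errors can be sketched as follows. This is an illustration only: `annotate` and `toy_validate` are hypothetical names, the toy validator stands in for a validator compiled from a JSON schema, and the value placed in adtl_error for valid rows (None here) is an assumption.

```python
def annotate(row, validate):
    """Return a copy of row with adtl_valid and adtl_error columns added."""
    try:
        validate(row)
        return {**row, "adtl_valid": True, "adtl_error": None}
    except Exception as exc:  # a real validator raises its own exception type
        return {**row, "adtl_valid": False, "adtl_error": str(exc)}

def toy_validate(row):
    # Hypothetical stand-in for a compiled JSON Schema validator
    if not isinstance(row.get("age"), int):
        raise ValueError("data.age must be integer")

good = annotate({"age": 42}, toy_validate)
bad = annotate({"age": "forty"}, toy_validate)
```

Downstream code can then filter on adtl_valid rather than handling exceptions during parsing.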
References#
Often, a part of the schema is repeated, and it is better to avoid repeated code. adtl supports references anywhere a dictionary or object is allowed using ref = "someReference".
This would require a someReference key within the top-level definitions section:
[adtl]
name = "parser"
[adtl.tables]
someTable = { groupBy = "subjid", aggregation = "lastNotNull" }
[adtl.defs]
someReference = { values = { 1 = true, 2 = false } }
Often some definitions are repeated across files. adtl supports including
definitions from external files using the include-def keyword under the
[adtl] section. As an example, a mapping of country codes to country names
could be stored in countries.toml:
[countryMap.values]
1 = "ALB"
2 = "ZZZ"
# and so on
This could be included in adtl, and used as a reference just as if it was included in the TOML file directly:
[adtl]
include-def = ["countries.toml"]
# ...
[cases.country_iso3]
field = "country"
ref = "countryMap"
Definition files can also be included from the command line by passing the
--include-def flag to adtl. This is useful when the included file can change
from one run to another, or in cases where the definitions/mappings are located
externally. The following would produce an equivalent result to the
include-def assignment in the above example, assuming data.csv is the source
data file:
adtl parser.toml data.csv --include-def countries.toml
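Conceptually, a reference is replaced by the contents of the named definition. The sketch below shows one way this expansion could work on a parsed specification; `expand_refs` here is a hypothetical helper, and the choice that keys written alongside ref override the definition is an assumption, not documented behaviour.

```python
def expand_refs(fragment, defs):
    """Recursively replace {'ref': name, ...} with the named definition merged in."""
    if isinstance(fragment, dict):
        if "ref" in fragment:
            rest = {k: v for k, v in fragment.items() if k != "ref"}
            fragment = {**defs[fragment["ref"]], **rest}  # explicit keys win (assumption)
        return {k: expand_refs(v, defs) for k, v in fragment.items()}
    if isinstance(fragment, list):
        return [expand_refs(item, defs) for item in fragment]
    return fragment

# Definitions as they might look after loading countries.toml
defs = {"countryMap": {"values": {"1": "ALB", "2": "ZZZ"}}}
rule = {"field": "country", "ref": "countryMap"}
expanded = expand_refs(rule, defs)
```

After expansion the rule reads as if the values mapping had been written inline.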
Table mappings#
Each table has its associated field mappings under a key of the same
name, so there should be a top level [table] section.
Within the table dictionary, keys are fields / attributes in the schema. Values are rules
that describe the mapping from the source data format. There are several valid
rule patterns, listed below. Each rule will either have a field attribute
that is the corresponding field in the source format, or a combinedField
attribute which links multiple fields in the source format, and specifies how
the fields should be combined. Fields can be marked as privacy sensitive using
sensitive = true, which can be used by the parser to take additional steps,
such as hashing the field.
Constant#
Every value in the table is the same constant value
country_iso3 = "GBR"
Field#
Maps to a single field from the source format
[table.date_death] # specifies that date_death is under table named 'table'
field = "flw_date_death"
description = "Date of death"
Field with conditional#
Maps to a single field from the source format only if condition(s) are met. The value is set to null if the condition fails.
field = "foobar"
if = { foobar_type = 4 }
Operations other than equals can be specified as { field_name = {op = value} }
where op is one of < | > | <= | >= | != | =~. Logical operations (and, or, not) are
supported with any = [ condition-list ] (or), all = [ condition-list ] (and),
not = { condition } (not).
Extending the above example, suppose we want to set the value from field foobar only if
foobar_type is 4 and bazbar < 5. For simplicity, the equals operation is optional,
and adtl allows conditions of the form { field_name = value }:
field = "foobar"
if.all = [ # in TOML this is a nested key, like { "if": { "all": [ ... ] } }
{ foobar_type = 4 },
{ bazbar = { "<" = 5 }}
]
The =~ operator allows matching with regular expressions, similar to the Bash
and Perl operators. The following will match strings SARS-CoV 2, SARS COV 2
and sars-cov-2. Case is ignored when matching.
field = "foobar"
if.foobar."=~" = ".*SARS[- ]CoV[- ]2.*"
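To make the condition syntax concrete, here is a small Python sketch of how such condition dictionaries could be evaluated against a row. It is not adtl's implementation: the `matches` function, the internal `"=="` key used for the equals shorthand, and the restriction to one field per simple condition are all assumptions made for illustration.

```python
import operator
import re

OPS = {
    "<": operator.lt, ">": operator.gt, "<=": operator.le,
    ">=": operator.ge, "!=": operator.ne, "==": operator.eq,
    # =~ is a case-insensitive regular expression match
    "=~": lambda value, pattern: re.match(pattern, value, re.IGNORECASE) is not None,
}

def matches(row, condition):
    """Evaluate a condition dictionary against one row of source data."""
    if "any" in condition:
        return any(matches(row, c) for c in condition["any"])
    if "all" in condition:
        return all(matches(row, c) for c in condition["all"])
    if "not" in condition:
        return not matches(row, condition["not"])
    (field, rule), = condition.items()  # exactly one field per simple condition
    if not isinstance(rule, dict):
        rule = {"==": rule}  # { field_name = value } is shorthand for equals
    (op, expected), = rule.items()
    return OPS[op](row[field], expected)

row = {"foobar_type": 4, "bazbar": 3, "foobar": "sars-cov-2"}
```

With this row, `matches(row, {"all": [{"foobar_type": 4}, {"bazbar": {"<": 5}}]})` is true, as is the regular expression condition from the example above.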
The oneToMany table has default conditional behaviour so that rows are only shown if the row is not empty, and contains values which can be mapped correctly if maps are provided. For example, an observation recording the presence/absence of vomiting should only be shown if values map to True/False:
[[table]]
name = "vomiting_nausea"
is_present = { field = "Admission Symptoms.Vomiting", values = {1 = true, 0 = false} } # values = ['0', 'Unknown', '1', 'UNKNOWN', '']
# if.any = [{ "Admission Symptoms.Vomiting" = '1'}, { "Admission Symptoms.Vomiting" = '0'}] <- rule assumed by adtl
If a different or more specific conditional statement is required, e.g. if a row should only be displayed based on the condition of a different field, this behaviour can be overridden by writing an if condition into the parser. Note that this stops any automated generation, so you should specify all conditions under which the row should be displayed. For example:
[[observation]]
name = "transfer_from_other_facility"
phase = "study"
date = { field = "rpt_date" }
if = { rpt_date = { "!=" = "" } } # This is dependent on a date rather than an is_present field, so requires specifying.
is_present = true
Field with unit#
Often values need to be normalised to a particular unit.
This can be done by setting source_unit and unit attributes on a field. The
pint library is used, so the units should be in a format
that pint understands. Generally pint works well with
most common units.
The source_unit field can also be a rule, but unit must be a string. For example,
to set the age based on a field called age_estimateunit which can be months or years:
field = "age_estimate"
source_unit = { field = "age_estimateunit", values = { 1 = "months", 2 = "years" }}
unit = "years"
Field with date#
Normalising date formats is a common transformation.
The date format in the source file is indicated in the source_date key (which
can itself refer to a field, like source_unit), and the date format to be
transformed to is indicated in the date field. If date is not
specified, it defaults to ISO 8601 date format %Y-%m-%d.
Date formats are specified in strftime(3) format.
field = "outcome_date"
source_date = "%d/%m/%Y"
date = "%Y-%m-%d"
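The same conversion can be reproduced with Python's standard datetime module, which uses the strftime format codes. This is a sketch of the transformation only; the `convert_date` helper is hypothetical, and returning None for unparseable values is an assumption rather than documented adtl behaviour.

```python
from datetime import datetime

def convert_date(value, source_date, date="%Y-%m-%d"):
    """Reformat a date string; return None if it does not match source_date."""
    try:
        return datetime.strptime(value, source_date).strftime(date)
    except ValueError:
        return None  # assumption: unparseable dates become null

converted = convert_date("25/12/2020", "%d/%m/%Y")
```

Here `converted` is the ISO 8601 string for 25 December 2020.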
Field with value mapping#
Same as Field, but with an extra values key that describes the
mapping from the values to the ones in the schema. This covers boolean fields,
with the mappings being to true | false | null.
[table.sex_at_birth]
field = "sex"
values = { 1 = "male", 2 = "female", 3 = "non_binary" }
description = "Sex at Birth"
Sometimes, you may want to keep the value as-is if no match was found; this
could be useful when mapping to controlled terminologies, where you want to map
the value to a known set if found, but keep the value as free text if not. This
can be done by adding ignoreMissingKey = true to the rule:
[table.sex_at_birth]
field = "sex"
values = { homme = "male", femme = "female" }
ignoreMissingKey = true
When the parser encounters a field where sex is one of homme or femme it
matches them to male and female respectively. When it encounters any other
string, such as non binaire, it will return non binaire. Contrast this to
the case when we do not specify ignoreMissingKey = true, in which case, the
parser would return null when it does not find a match.
Example with boolean values
[table.has_dementia]
field = "dementia_mhyn"
values = { 1 = true, 2 = false }
description = "Dementia"
If the data for this field has a range of different capitalisations and you wish to
capture them all without specifying each variant, you can add caseInsensitive = true
to the rule:
[table.sex_at_birth]
field = "sex"
values = { homme = "male", femme = "female" }
caseInsensitive = true
When the parser encounters e.g. Homme or FEMME in the data it will still match to
male and female respectively. The parser will still ignore different spellings, e.g.
Home will return null, but it strips any leading or trailing whitespace, so " FEMME "
will also match to female.
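The interaction between values, ignoreMissingKey and caseInsensitive described in this section can be summarised in a short Python sketch. The `map_value` function is hypothetical and only mirrors the behaviour stated above; in particular, stripping whitespace only in the case-insensitive branch is an assumption drawn from the wording of this section.

```python
def map_value(value, values, ignore_missing_key=False, case_insensitive=False):
    """Map a source value through a values dictionary."""
    key = value
    if case_insensitive:
        key = value.strip().lower()  # ignore case and surrounding whitespace
        values = {k.lower(): v for k, v in values.items()}
    if key in values:
        return values[key]
    # No match: keep the original value only if ignoreMissingKey is set
    return value if ignore_missing_key else None

sex_map = {"homme": "male", "femme": "female"}
```

For example, `map_value("non binaire", sex_map)` gives None, while passing `ignore_missing_key=True` returns the original string.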
Field with lists of values#
If a field requires a list of values, a type of enum_list can be added to the rule:
[table.symptoms]
field = "ReportedSymptoms"
type = "enum_list"
values = { "high temp" = "fever", headache = "cephalalgia", "muscle aches"="myalgia" }
ignoreMissingKey = true
When the parser is given a list either in square brackets, e.g. "[high temp, headache]",
or as a comma-separated string, e.g. "muscle aches, high temp", it will attempt to
convert the string into a list of values and find matches for the listed values. As with
a standard value mapping field, it can be tagged to be case insensitive and to return
all fields it cannot match.
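A minimal sketch of the list handling described above, assuming the source value is a string that is either bracketed or comma-separated. The `map_enum_list` function is hypothetical, and dropping unmatched items when ignoreMissingKey is not set is an assumption.

```python
def map_enum_list(raw, values, ignore_missing_key=False):
    """Split a list-like string and map each item through a values dictionary."""
    items = [item.strip() for item in raw.strip().strip("[]").split(",")]
    mapped = []
    for item in items:
        if not item:
            continue  # skip empty entries such as a trailing comma
        if item in values:
            mapped.append(values[item])
        elif ignore_missing_key:
            mapped.append(item)  # keep unmatched values as free text
    return mapped

symptom_map = {"high temp": "fever", "headache": "cephalalgia", "muscle aches": "myalgia"}
bracketed = map_enum_list("[high temp, headache]", symptom_map)
free_text = map_enum_list("muscle aches, high temp, rash", symptom_map, ignore_missing_key=True)
```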
Combined type#
Use to collate data from multiple fields in the source format into one. Requires
a combinedType attribute specifying the combination criteria, and
a fields attribute which is a list of fields that will be combined.
Accepted values for combinedType are:
any - Whether any of the fields are non-null (truthy)
all - Whether all of the fields are non-null (truthy)
min - Minimum of non-null fields
max - Maximum of non-null fields
firstNonNull - First in the list of fields that has a non-null value
list - List of various fields
set - List of various fields, with duplicates removed
With kind=groupBy and applyCombinedType aggregation, combinedType will apply both along a
single row, and across multiple rows being aggregated. If aggregation is lastNotNull, rules only
apply across a single row.
A combinedType can have multiple fields within a fields key, or can specify
multiple fields with a fieldPattern key which is a regex that is matched to the
list of fields:
[table.has_liver_disease]
combinedType = "list"
fields = [
{ fieldPattern = ".*liv.*", values = { 1 = true, 0 = false }}
]
Example of a combinedType = "any" mapping:
[table.has_liver_disease]
combinedType = "any"
fields = [
{ field = "modliv", description = "Moderate liver disease", values = { 1 = true, 0 = false }},
{ field = "mildliver", description = "Mild liver disease", values = { 1 = true, 0 = false }},
]
excludeWhen: list and set fields can have an optional excludeWhen key, which can be none, false-like, or a list of values. When it is none, null values (None in Python) are dropped. When it is false-like, false-like values (bool(x) == False in Python, such as empty lists, boolean False and 0) are excluded. When it is a list, the listed values are excluded.
If excludeWhen is not set, no exclusions take place and all values are returned as-is.
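The combination rules and the excludeWhen options can be summarised in a short Python sketch operating on values that have already been mapped. The function names are hypothetical and this is an illustration of the rules listed above, not adtl's code; preserving order when removing duplicates for set is an assumption.

```python
def apply_exclude(values, exclude_when=None):
    """Filter values according to an excludeWhen setting."""
    if exclude_when is None:
        return list(values)  # no exclusions: return values as-is
    if exclude_when == "none":
        return [v for v in values if v is not None]
    if exclude_when == "false-like":
        return [v for v in values if v]
    return [v for v in values if v not in exclude_when]  # explicit list of values

def combine(values, combined_type, exclude_when=None):
    """Combine already-mapped field values according to combinedType."""
    non_null = [v for v in values if v is not None]
    if combined_type == "any":
        return any(bool(v) for v in values)
    if combined_type == "all":
        return all(bool(v) for v in values)
    if combined_type == "min":
        return min(non_null) if non_null else None
    if combined_type == "max":
        return max(non_null) if non_null else None
    if combined_type == "firstNonNull":
        return next(iter(non_null), None)
    kept = apply_exclude(values, exclude_when)
    if combined_type == "list":
        return kept
    if combined_type == "set":
        return list(dict.fromkeys(kept))  # drop duplicates, keep first-seen order
    raise ValueError(f"unknown combinedType: {combined_type}")
```

For instance, `combine([1, None, 1, 0], "list", "none")` drops only the null entry, whereas `"false-like"` would also drop the 0.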
Skippable fields#
In some cases, a study will be associated with multiple data files, all of which have been filled in to varying degrees. For example, one study site may not provide any follow-up data.
Rather than writing a new parser for every data file with minor differences, parsers can be made
robust to a certain amount of missing data by tagging applicable fields with can_skip = true,
for example:
[[observation]]
name = "cough"
phase = "admission"
date = { field = "admit_date" }
is_present = { field = "cough_ceoccur_v2", description = "Cough", ref = "Y/N/NK", "can_skip" = true }
In this case, if adtl does not find cough_ceoccur_v2 in the data it will skip over the field
and continue, rather than throwing an error.
If there are lots of fields missing all with similar field names, for example if followup data
has been omitted and all the followup fields are labelled with a flw prefix e.g., flw_cough,
flw2_fatigue, this can be specified at the top of the file:
[adtl]
name = "isaric-core"
description = "isaric-core"
skipFieldPattern = "flw.*"
[table.sex_at_birth]
combinedType = "firstNonNull"
excludeWhen = "none"
fields = [
{ field = "sex", values = { 1 = "male", 2 = "female" } },
{ field = "flw_sex_at_birth", values = { 1 = "male", 2 = "female", 3 = "non_binary" } },
{ field = "flw2_sex_at_birth", values = { 1 = "male", 2 = "female", 3 = "non_binary" } },
]
Notice that in this case can_skip does not need to be added to the fields with a flw prefix.
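The effect of skipFieldPattern on field names can be checked with Python's re module. The sketch below assumes the pattern is matched from the start of the field name (re.match); whether adtl anchors the pattern this way is an assumption, and `is_skippable` is a hypothetical helper.

```python
import re

def is_skippable(field_name, skip_field_pattern):
    """Return True if a missing field with this name may be skipped."""
    return re.match(skip_field_pattern, field_name) is not None

fields = ["sex", "flw_sex_at_birth", "flw2_sex_at_birth", "flw_cough"]
skippable = [f for f in fields if is_skippable(f, "flw.*")]
```

Only the fields carrying the flw prefix are treated as skippable; sex would still be required.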
Data transformations (apply)#
Arbitrary functions can be applied to source fields. adtl ships with a library of functions,
found in the transformations.py file, but users may add their own by passing the --include-transform
flag with a single Python file containing their custom functions. Custom functions will only execute if they use only Python’s stdlib or one of adtl’s dependencies.
Parameters other than the source field which need to be passed into the transformation
function must be listed as params, in the same order as they should be
passed to the transformation function.
If the parameter is a field attribute value from the source data, the field name
should be prefixed with a $ to distinguish it from constant strings.
[[table]]
field = "icu_admitted"
apply = { function = "isNotNull" }
[[table]]
field = "brthdtc"
apply = { function = "yearsElapsed", params = ["$dsstdat"] }
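The way params are resolved before a function is called can be sketched as below: entries starting with $ are looked up in the current row, and everything else is passed as a constant. All names here (`apply_function`, `is_not_null`, `years_between`, and the sample fields) are hypothetical stand-ins and do not reproduce the functions shipped in transformations.py.

```python
def is_not_null(value):
    # Hypothetical stand-in for a library transformation
    return value not in (None, "")

def years_between(start_year, end_year):
    # Hypothetical custom transformation taking one extra parameter
    return int(end_year) - int(start_year)

def apply_function(function, row, field, params=()):
    """Call function with the field's value followed by the resolved params."""
    resolved = [
        row[p[1:]] if isinstance(p, str) and p.startswith("$") else p
        for p in params
    ]
    return function(row[field], *resolved)

row = {"icu_admitted": "1", "birth_year": "1980", "visit_year": "2020"}
admitted = apply_function(is_not_null, row, "icu_admitted")
elapsed = apply_function(years_between, row, "birth_year", ["$visit_year"])
```

A custom file passed via --include-transform would contain plain functions like `years_between`, with the field value as the first argument.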
Conditional rows#
For the oneToMany case, each row in the source file generates
multiple rows for the target. This is expressed in the specification by making the
value corresponding to the table key a list instead of an object. Additionally
an if key sets the condition under which the row is emitted.
[[table]]
date = { field = "dsstdtc" }
name = "headache"
if = { headache_cmyn = 1 }
[[table]]
date = { field = "dsstdtc" }
name = "cough"
if = { cough_cmyn = 1 }
Repeated rows#
Often, oneToMany tables (such as the ISARIC observation table) have repeated blocks,
with only the field name and condition changing. Adding a for keyword to a block
repeats it, looping over the given variable(s). If multiple variables are
provided, the block is repeated over the Cartesian product of their values.
Field names within the block use Python format-string syntax (e.g. {n}) to represent the
variable, which is expanded using Python’s str.format.
Example from an ISARIC dataset that contains five followup surveys that ask about observed symptoms after discharge:
[[observation]]
name = "history_of_fever"
phase = "followup"
date = { field = "flw2_survey_date_{n}" }
is_present = { field = "flw2_fever_{n}", values = { 0 = false, 1 = true } }
if.not."flw2_fever_{n}" = 2
for.n.range = [1, 5] # n goes from 1--5 inclusive
# for.n = [1, 3, 5] # can also specify a list
Note that unlike Python ranges, adtl ranges include both start and end of the range.
Variable interpolations in braces can be anywhere in the block. So an if.any
condition could look like
[[observation]]
name = "history_of_fever"
phase = "followup"
date = { field = "flw2_survey_date_{n}" }
is_present = { field = "flw2_fever_{n}", values = { 0 = false, 1 = true } }
if.any = [ { "flw2_fever_{n}" = 1 }, { "flw2_fever_{n}" = 0 } ]
for.n.range = [1, 5] # n goes from 1--5 inclusive
Multiple variables are supported in the for loop. If multiple variables are specified, then the block is repeated for as many instances as the Cartesian product of the values the variables correspond to. As an example the for expression
for = { x = [1, 2], y = [3, 4] }
will loop over the values x, y = [(1, 3), (1, 4), (2, 3), (2, 4)], and a block
with such a loop referring to both variables will get repeated four times:
[[observation]]
field = "field_{x}_{y}"
if."field_{x}_{y}" = 1
for = { x = [1, 2], y = [3, 4] }
will get expanded as
[[observation]]
field = "field_1_3"
if."field_1_3" = 1
[[observation]]
field = "field_1_4"
if."field_1_4" = 1
[[observation]]
field = "field_2_3"
if."field_2_3" = 1
[[observation]]
field = "field_2_4"
if."field_2_4" = 1
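The expansion shown above can be reproduced with itertools.product and str.format. The following is a sketch under the assumption that a block is a plain dictionary as loaded from TOML; `expand_for` and `interpolate` are hypothetical names, and strings containing literal braces (for example some regular expressions) would need escaping in this simplified version.

```python
from itertools import product

def interpolate(obj, variables):
    """Apply str.format to every string (keys included) in a nested structure."""
    if isinstance(obj, str):
        return obj.format(**variables)
    if isinstance(obj, dict):
        return {interpolate(k, variables): interpolate(v, variables) for k, v in obj.items()}
    if isinstance(obj, list):
        return [interpolate(item, variables) for item in obj]
    return obj

def expand_for(block):
    """Repeat a block once per combination of its for-loop variables."""
    block = dict(block)
    loop = block.pop("for", None)
    if loop is None:
        return [block]
    names = list(loop)
    choices = []
    for name in names:
        spec = loop[name]
        if isinstance(spec, dict):  # { range = [start, end] }, inclusive at both ends
            start, end = spec["range"]
            spec = range(start, end + 1)
        choices.append(list(spec))
    return [interpolate(block, dict(zip(names, combo))) for combo in product(*choices)]

block = {
    "field": "field_{x}_{y}",
    "if": {"field_{x}_{y}": 1},
    "for": {"x": [1, 2], "y": [3, 4]},
}
expanded = expand_for(block)
```

The resulting four blocks correspond to the expanded observation entries listed above, in the same order.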
Generated fields#
ADTL can generate content for fields. Currently, this is limited to adding a datetime stamp, or a UUID (currently just UUIDv5) which can link long-format rows to each other, as may be necessary in the oneToMany case.
To add a datetime stamp in an ingestion_date field:
[table.ingestion_date]
generate = {type = "datetime"}
or to create a UUID based off a unique data combination to link medication events:
[[long]]
attribute = "medi_antivial"
event_id = {generate = {type = "uuid5", values = ["subjid", "redcap_repeat_instance", "medi_medtype", "medi_date"]}}
value = {field = "medi_antiviralagent"}
start_date = "medi_date"
[[long]]
attribute = "medi_route"
event_id = {generate = {type = "uuid5", values = ["subjid", "redcap_repeat_instance", "medi_medtype", "medi_date"]}}
value = {field = "medi_medroute", values = {1 = "Oral", 2="IV"}}
start_date = "medi_date"
[[long]]
attribute = "medi_dose"
event_id = {generate = {type = "uuid5", values = ["subjid", "redcap_repeat_instance", "medi_medtype", "medi_date"]}}
value_num = {field = "medi_dosage"}
attribute_unit = {field = "medi_units"}
start_date = "medi_date"
In this case the combined data in a single row of the input file from fields subjid, redcap_repeat_instance,
medi_medtype and medi_date should uniquely identify a single event. Users should ensure that
if one or more of these fields are empty, uniqueness is not compromised; fields which are auto-filled or data-rich
are therefore preferred.
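Because UUIDv5 is a deterministic hash of a namespace and a name, rows built from the same field values receive the same identifier. The sketch below illustrates this with Python's uuid module; the `event_id` helper, the "|" separator and the choice of uuid.NAMESPACE_URL are assumptions for illustration and may differ from what adtl uses.

```python
import uuid

def event_id(row, fields, namespace=uuid.NAMESPACE_URL):
    """Derive a deterministic UUIDv5 from selected fields of a row."""
    name = "|".join(str(row.get(f, "")) for f in fields)  # separator is an assumption
    return str(uuid.uuid5(namespace, name))

fields = ["subjid", "redcap_repeat_instance", "medi_medtype", "medi_date"]
row = {"subjid": "001", "redcap_repeat_instance": 1, "medi_medtype": 2, "medi_date": "2020-05-01"}
later = {**row, "medi_date": "2020-05-02"}

same = event_id(row, fields) == event_id(dict(row), fields)
different = event_id(row, fields) != event_id(later, fields)
```

Every long-format row generated from the same input row gets the same event_id, which is what allows the medication attributes to be linked back together. If medi_date were empty in both rows, the two events would collide, which is why data-rich fields are preferred.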