Home
To set configuration options, create a file called sirad_config.py and place it either in the directory where you are executing the sirad command or somewhere else on your Python path. See _options in config.py for a complete list of possible options and default values.
For an example of a configuration file, see sirad_config.py from the SIRAD worked example repo.
The following options are available:
- DATA_SALT: secret salt used for hashing data values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.
- PII_SALT: secret salt used for hashing PII values. This shouldn't be shared. A warning will be issued if it is not set. Defaults to None.
- LAYOUTS: directory that contains layout files. Defaults to layouts/.
- RAW_DIR, DATA_DIR, PII_DIR, LINK_DIR, RESEARCH_DIR: paths to where the original data, the processed files, and the research files will be saved.
- VERSION: the current version number of the processed and research files.
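As a rough illustration, a sirad_config.py could look like the sketch below. The option names come from the list above. Every value shown (the salts, the directory names, the version number) is a placeholder chosen for this example, apart from the documented default for LAYOUTS.

```python
# sirad_config.py: illustrative values only, see config.py for the real defaults.

# Secret salts. These are placeholders: use long random strings and keep them private.
DATA_SALT = "replace-with-a-secret-value"
PII_SALT = "replace-with-a-different-secret-value"

# Directory that contains the YAML layout files (documented default).
LAYOUTS = "layouts/"

# Hypothetical directory names for each stage of processing.
RAW_DIR = "raw/"
DATA_DIR = "data/"
PII_DIR = "pii/"
LINK_DIR = "link/"
RESEARCH_DIR = "research/"

# Version of the processed and research files (placeholder value).
VERSION = 1
```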
sirad uses YAML files to define the layout, or structure, of raw data files. These YAML files define each column in the incoming data and how it should be processed.
For an example of a YAML layout file, see tax.yaml from the SIRAD worked example repo.
The following properties can be specified in a YAML layout file:
The path to the source file, relative to RAW_DIR.
The following values for file type are supported:
- csv - delimited text file, defaulting to comma-delimited (see delimiter below)
- fixed - fixed width format, which requires the specification of a width property for each field (see fields below)
- xlsx - Excel .xlsx file (note: .xls is not currently supported)
For the csv file type, this specifies the delimiter to use. Common alternatives to comma-delimited include tab-delimited ('\t') and pipe-delimited ('|').
The file encoding to use when opening a source file of type csv or fixed. If you do not know the encoding ahead of time, you can detect it by running the Unix file command on the source file.
Line endings (LF or CRLF) are detected automatically by the file parser.
Non-ASCII characters are automatically transliterated to ASCII according to the character mapping found in readers.py.
Whether to read the first line of the file as the column headers.
A list specifying the header name and type of each field in the source file.
For a fixed-width source file, or when setting header=False:
- The fields list must be in the same order as the contents of the source file.
For a csv or xlsx file:
- You can specify a different order, which will be used as the order in the output.
- Every field that appears in the fields list must also appear with the same name in the source file header.
- If a field exists in the source file header, but not in the fields list, it will be skipped in the output.
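The header-matching rules for csv and xlsx files can be sketched as follows. This is not sirad's own code: select_columns is a hypothetical helper, and raising an error for a field missing from the header is an assumption about how a violation of the rule might be reported.

```python
def select_columns(header, fields):
    """Return the source column index for each requested field, in fields order."""
    positions = {name: i for i, name in enumerate(header)}
    # Every field in the layout must also appear in the source header.
    missing = [name for name in fields if name not in positions]
    if missing:
        raise ValueError("fields not found in source header: " + ", ".join(missing))
    return [positions[name] for name in fields]


header = ["ssn", "last_name", "first_name", "wages"]
fields = ["first_name", "last_name", "wages"]  # a different order, and ssn omitted
indexes = select_columns(header, fields)

row = ["123456789", "Doe", "Jane", "50000"]
output = [row[i] for i in indexes]  # ssn is skipped, and the order follows fields
```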
Each field consists of a name, optionally followed with a dictionary of the following field properties:
Specify date if you wish to interpret the value as a date and convert it to a standardized YYYYMMDD format during processing.
Marks the field as a type of personally identifiable information (PII). The field will be included in the PII_DIR output and not in the DATA_DIR output. The named PII fields used in calculating the sirad_id are:
- first_name
- last_name
- dob
The named PII fields used in censuscoding addresses have one of the following prefixes for address type (additional types can be added by editing research.py):
- home
- mailing
- employer
- employer1
- employer2
- employer3
and one of the following suffixes for the address element:
- _address: a field containing the entire street address including street number, e.g. 3 Main St
- _street: a field containing only the street name
- _street_num: a field containing only the street number
- _city
- _zip5: the five digit zip code
- _zip9: a nine digit zip code
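To make the naming scheme concrete, the sketch below enumerates every prefix and suffix combination listed above. The constant and function names are hypothetical and are not taken from research.py.

```python
from itertools import product

# Prefixes and suffixes exactly as listed in this section.
ADDRESS_TYPES = ["home", "mailing", "employer", "employer1", "employer2", "employer3"]
ADDRESS_ELEMENTS = ["_address", "_street", "_street_num", "_city", "_zip5", "_zip9"]


def address_pii_names():
    """Return every address PII name formed as prefix + suffix."""
    return [prefix + suffix for prefix, suffix in product(ADDRESS_TYPES, ADDRESS_ELEMENTS)]


names = address_pii_names()
# Includes, for example, "home_address" and "employer2_zip5".
```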
Replaces the value with an irreversible SHA-1 hash of the value, using the salt in PII_SALT for PII_DIR output or the DATA_SALT for DATA_DIR output. Commonly used in conjunction with ssn or with sensitive identifiers that will be included in DATA_DIR output.
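A salted SHA-1 hash can be sketched with Python's hashlib as shown below. How sirad actually combines the salt and the value is not stated here, so prepending the salt and encoding as UTF-8 are assumptions, and salted_sha1 is a hypothetical helper name.

```python
import hashlib


def salted_sha1(value, salt):
    """Return the hex SHA-1 digest of salt + value.

    Assumption: the salt is prepended to the value and both are UTF-8 encoded.
    """
    return hashlib.sha1((salt + value).encode("utf-8")).hexdigest()


# The same value hashed with two different salts gives unrelated digests,
# so PII_DIR and DATA_DIR outputs cannot be joined on the hashed value.
pii_hash = salted_sha1("123456789", "example-pii-salt")
data_hash = salted_sha1("123456789", "example-data-salt")
```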
Marks the field as containing a Social Security Number, which will be validated according to the rules found in dataset.py. A field with _invalid appended will be added to the output with the result of the validation.
Specifies the date format in strftime notation for a field of date type. If the input data does not have a consistent format, you can specify multiple formats separated by '|'. The value is split on the '|' separator and each format is attempted in order.
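The multi-format behavior can be sketched as follows, using datetime.strptime from the standard library. normalize_date is a hypothetical helper, not sirad's function, and returning None for an unparseable value stands in for the null described later in this page.

```python
from datetime import datetime


def normalize_date(value, formats):
    """Try each '|'-separated format in order and return YYYYMMDD, or None."""
    for fmt in formats.split("|"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # this format did not match, so try the next one
    return None


formats = "%Y-%m-%d|%m/%d/%Y"
iso_style = normalize_date("1980-03-15", formats)
us_style = normalize_date("03/15/1980", formats)
unparseable = normalize_date("not a date", formats)
```

One consequence of splitting on '|' is that a format string cannot itself contain a literal pipe character.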
For a fixed-width file, this specifies the number of characters that will be read for this field.
Skip the field in all output. This is equivalent to omitting the field from the fields list for a csv or xlsx file, but can be useful if you want to document the existence of the field in the layout file.
Includes the field in the data output. Used to force a field marked pii to be included in both the PII_DIR and DATA_DIR outputs. This is useful in the case where a field is needed for calculating the sirad_id or for censuscoding, but is not actually considered PII. Examples might include dob for sirad_id (date of birth may not be classified as PII in a data sharing agreement) or a city or zip code field for censuscoding.
All output from SIRAD is in pipe-delimited CSV files, and the pipe character is stripped from all field values.
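The output convention can be sketched with the standard csv module. write_rows is a hypothetical helper, and the newline line terminator is an assumption made so the example output is easy to read.

```python
import csv
import io


def write_rows(rows, out):
    """Write rows as pipe-delimited text, removing any pipe inside a value."""
    writer = csv.writer(out, delimiter="|", lineterminator="\n")
    for row in rows:
        writer.writerow([str(value).replace("|", "") for value in row])


buf = io.StringIO()
write_rows([["record_id", "note"], [1, "a|b"]], buf)
text = buf.getvalue()
```

Stripping the pipe from values means a field can never be mistaken for two fields when the file is read back.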
The sirad process command stages output files in the following output directories (which can be deleted after a successful run of the sirad research command):
Contains an output CSV file corresponding to each layout file, using the basename of the layout file. The field record_id is prepended, and is the row number from the source file (1-based indexing). Only fields that were not marked as pii (except those with data=True) are included, in the order provided in the fields list.
Contains an output CSV file corresponding to each layout file, using the basename of the layout file. The row order is randomly shuffled relative to the source file, so that the PII files cannot be directly joined to the data files. The field record_id is prepended, which is the row number after random shuffling (1-based indexing). Only fields marked as pii are included, and they are renamed according to the PII name. Additionally, each field marked as ssn has a corresponding _invalid field, appended after the other fields, that holds the indicator for SSN validation.
Contains an output CSV file corresponding to each layout file, using the basename of the layout file. This file contains a record_id field which corresponds to the record_id in the data file, and a pii_id field which corresponds to the record_id in the PII file. This mapping provides a link between the randomly-shuffled PII rows and the data rows.
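The relationship between the three staged outputs can be sketched in memory as shown below. This is only an illustration of the record_id and pii_id bookkeeping: split_rows is a hypothetical function, the rows are dictionaries rather than files, and the shuffling method is an assumption.

```python
import random


def split_rows(rows, pii_columns, rng=random):
    """Split source rows into data rows, shuffled PII rows, and a link table."""
    n = len(rows)
    # pii_ids[i] is the 1-based position that source row i gets after shuffling.
    pii_ids = list(range(1, n + 1))
    rng.shuffle(pii_ids)

    data, pii, link = [], [None] * n, []
    for i, row in enumerate(rows):
        record_id = i + 1  # 1-based row number in the source file
        data_row = {"record_id": record_id}
        data_row.update({k: v for k, v in row.items() if k not in pii_columns})
        data.append(data_row)

        pii_row = {"record_id": pii_ids[i]}
        pii_row.update({k: row[k] for k in pii_columns})
        pii[pii_ids[i] - 1] = pii_row

        link.append({"record_id": record_id, "pii_id": pii_ids[i]})
    return data, pii, link


rows = [
    {"first_name": "Jane", "wages": "50000"},
    {"first_name": "John", "wages": "42000"},
    {"first_name": "Ana", "wages": "61000"},
]
data, pii, link = split_rows(rows, ["first_name"], rng=random.Random(0))
```

Without the link table, the record_id in a PII row says nothing about which data row it came from. With it, the two can be rejoined by whoever holds the link files.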
The sirad research command generates a final, versioned release of de-identified data that can be used in research. It uses the PII_DIR files to construct the sirad_id and perform censuscoding, and the LINK_DIR to map and prepend any fields constructed from the PII to each of the DATA_DIR files.
An output CSV file corresponding to each layout file is written to RESEARCH_DIR, using the basename of the layout file. If the source file contained PII sufficient to construct a sirad_id (first name, last name, DOB) then a sirad_id field is prepended. For each type of address (home, mailing, employer, employer1, employer2, employer3), if the source file contained PII sufficient for censuscoding (address/zip or street/street_num/zip), then a corresponding triplet of anonymous geolocations (_city, _zip, _blkgrp) is prepended for that address type.
As described above, the following transformations are applied in the final output:
- A row identifier, called record_id, is added to every output file.
- Fields marked as type=date are interpreted according to the format value (which can be a pipe-delimited list of formats), and then transformed to a normalized YYYYMMDD format in the output. Values that cannot be interpreted according to the format string are replaced with nulls, and a warning is printed when the --debug option is used.
- All PII fields are removed from the output, unless they are explicitly marked with data=true.
- The sirad_id field is added to the output for any file that contains sufficient PII to construct it.
- Each field marked as ssn=true has a corresponding _invalid field with the indicator for SSN validation added to the output.
- For each set of address PII fields that can be censuscoded, a triplet of (_city, _zip, _blkgrp) fields is added to the output. Even though the original _city and _zip PII fields are dropped from the output (as per the transformation on PII described above), the censuscoder adds normalized versions of these fields back into the output. To normalize _city, characters are converted to upper case and only letter and space characters are retained. To normalize _zip, only digit characters are retained.
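The two normalization rules for _city and _zip can be sketched as follows. The function names are hypothetical, and restricting letters to ASCII A-Z is an assumption based on the earlier note that non-ASCII characters are transliterated when the source file is read.

```python
import string


def normalize_city(city):
    """Upper-case the value and keep only letters A-Z and spaces."""
    allowed = set(string.ascii_uppercase + " ")
    return "".join(ch for ch in city.upper() if ch in allowed)


def normalize_zip(zip_code):
    """Keep only the digit characters 0-9."""
    return "".join(ch for ch in zip_code if ch in string.digits)


city = normalize_city("St. Paul")
zip5 = normalize_zip("02903 ")
zip9 = normalize_zip("02903-1234")
```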