A CSV file and supporting scripts for converting data between Australian boundaries.
It solves the problem of trying to merge 2 or more datasets based on different census or administrative boundaries such as statistical areas or postcodes.
It does this by providing a list of concordances between pairs of boundaries. e.g. In the image below: 100% of postcode 3126 fits within the Boroondara LGA. However, only ~46% of postcode 3127 fits within that LGA.
In this context, concordance describes what % of residential addresses in a "from" boundary fit within a "to" boundary.
Download & import the ~50Mb concordance file into your database or reporting tool to get started. A script for importing into Postgres is also provided.
- You have sales data by ABS Census 2016 SA2 boundaries & competitor data by local government area (LGA) and need to determine market share. Using the boundary concordance file, you convert the SA2 data to LGA and merge both datasets by LGA ID.
- You have Covid 19 cases by postcode & testing numbers by LGA and need to determine the rate of infection as a % of testing. You use the file to convert the postcode data to LGAs and merge both datasets by LGA ID.
Using this file comes with the following caveats:
- It only works with data related to residents, citizens & consumers. In other words - industrial, commercial & special use data isn't suited to conversion using the concordance file provided
- The data represents % overlaps between boundaries that are a best estimate of how data should be apportioned between 2 boundary sets based on residential address counts. Your data may have biases in it that mean this approach doesn't return the best result. e.g. looking at the image below - if your postcode 3127 customers were mostly on the Boroondara Council side - the boundary concordance file would incorrectly put 54% of them in Whitehorse Council
- The code has been tested against the official ABS Census 2016 to 2021 boundary correspondences and the difference is: ~1% of addresses are in another boundary
- Note: ABS Census 2016 to 2021 boundary conversions aren't in the initial version of the concordance file. Plan is to include these when the ABS Census 2021 data is released in June 2022
- ABS Census meshblock boundary concordances haven't been included as they blow the file out to over 200Mb (there are over 300,000 meshblock bdys). Plan is to provide these as a separate concordance file in the future.
Postcode 3127 split 46%-54% by the Boroondara & Whitehorse council boundary in blue (Credits: boundary data by Australian Bureau of Statistics, under CC BY 4.0; Map tiles by Carto, under CC BY 3.0. Data by OpenStreetMap, under ODbL)
The list of boundary pairs in the file with their overall concordance & average error is in the boundary concordance score CSV file.
A high overall concordance indicates your data can be reliability converted to the new boundary. The lower the concordance the more inaccurate the data conversion will be. Also, concordances are only reliable when going from a smaller boundary to a similar sized or larger one. Conversions from larger to smaller boundaries aren't supplied because of this.
In the score file, the error measures what proportion of data is placed in the wrong boundary when converting a statistic common to both boundaries, like population, from postcode to LGA (for example). Note: error rates are only available for ABS 2016 from/to boundaries as the error check requires census data.
The concordance file is generated by the following process:
- Tag all GNAF addresses with 2016 & 2021 ABS Census boundaries and geoscape 202408 Administrative boundaries
- Remove all addresses in non-residential ABS Census 2021 meshblocks
- Aggregate all residential addresses by a set of from boundary and to boundary pairs (e.g. postcode to LGA)
- Determine the % overlap of residential addresses between both boundary types for all boundary pairs
STEP 1 - Read the Limitations
..if you skipped them.
There are 2 options to get the data:
- Download and import the file
- Run
create_concordance_file.py
to create the Postgres table & CSV file yourself
- Download the concordance file
- Import it into your database/reporting tool of choice. If using Postgres:
- Edit the file path, schema name & table owner in
00_import_concordance_file.sql
in the postgres-scripts folder - Run the SQL script to import the file
- Edit the file path, schema name & table owner in
This requires a knowledge of Python, Postgres & pg_restore.
BTW - if the boundary combination you want isn't in the default concordance file - you need to edit the settings.py
file before running create_concordance_file.py
. If this is too hard - raise an issue and we may be able to generate it for you; noting you shouldn't convert data to a smaller boundary due to the increase in data errors.
Running the script only needs to be done for 3 reasons:
- The boundary from/to combination you need isn't in the standard concordances file
- It's now the future and we've been too lazy to update the concordances file with the latest boundary data from the ABS and/or Geoscape
- You have a license of Geoscape Buildings or Geoscape Land Parcels and want to use the planning zone data in those products to:
- Use a more accurate list of residential addresses to determine the data apportionment percentages (see note below); or
- Use a different set of addresses to apportion your data; e.g. industrial or commercial addresses
Note: The benefit of using Geoscape planning zone data over the default residential address filter (ABS Census 2021 meshblock categories) is limited due to ~2.3m addresses not having a planning zone, The code as-is fills this missing data with ABS Census 2021 meshblock categories.
Running the script requires the following open data, available as Postgres dump files, as well as the optional licensed Geoscape data mentioned above:
- ABS Census 2016 boundaries (download)
- ABS Census 2021 boundaries (download)
- GNAF from gnaf-loader (download)
- Geoscape Administrative Boundaries from gnaf-loader (download)
- ABS Census 2016 data - used to generate error rates only (download)
- Download the above dump files and import them using
pg_restore
- OPTIONAL: If you have access to Geoscape Buildings or Land Parcels data:
- import it into Postgres
- Edit the
02_create_residential_address_table.sql
in the postgres-scripts folder to suit your dataset and schema name - Run the above SQL script
- Review & edit the boundaries to process in
settings.py
as required - make any required changes in the sections near the bottom marked for editing. If optionally using Geoscape Buildings data for residential address - change theresidential_address_source
to use it. - Add
psycopg
to your Python 3.x environment - Run the script (takes ~30-45 mins on a 2017 MacBook Pro with 16Gb of RAM and 4 cores)
The behaviour of the Python script can be controlled by specifying various command line options to the script. Supported arguments are:
--output-path
local path where the boundary concordance files will be output.
--pghost
the host name for the Postgres server. This defaults to thePGHOST
environment variable if set, otherwise defaults tolocalhost
.--pgport
the port number for the Postgres server. This defaults to thePGPORT
environment variable if set, otherwise5432
.--pgdb
the database name for Postgres server. This defaults to thePGDATABASE
environment variable if set, otherwisegeoscape
.--pguser
the username for accessing the Postgres server. This defaults to thePGUSER
environment variable if set, otherwisepostgres
.--pgpassword
password for accessing the Postgres server. This defaults to thePGPASSWORD
environment variable if set, otherwisepassword
.
--geoscape-version
Geoscape version number in YYYYMM format. Defaults to current year and last release month. e.g.202408
.--gnaf-schema
input schema name to store final GNAF tables in. Also the output schema for the concordance table. Defaults tognaf_<geoscape_version>
.--admin-schema
input schema name to store final admin boundary tables in. Defaults toadmin_bdys_<geoscape_version>
.--output-table
name of both output concordance table and file. Defaults toboundary_concordance
.--output-score_table
name of both output concordance QA table and file. Defaults to<output_table>_score
.
- Use the default GNAF & Admin Bdy data:
python create_concordance_file.py --output-path=~/tmp
- Use a different version of GNAF & Admin Bdy data and a custom output table/file name:
python create_concordance_file.py --output-path=~/tmp --output-table="old_bdy_concordance" --admin-schema="admin_bdys_202111" --gnaf-schema="gnaf_202111"
After loading the file into your database/reporting tool of choice - you use it by creating a 3 (or more) table join between the datasets you want to merge & the concordance file/table.
A Postgres SQL script (below) for the above-mentioned Covid 19 postcode to LGA example is included.
To run the example:
- Edit the file path, schema name & table owner in
01_import_nsw_covid_data.sql
in the postgres-scripts/example-usage folder as required - Run the script to import the NSW Covid 19 data (supplied in the data folder)
- Edit the schema name & table owner in
02_join_pc_and_lga_data.sql
as required - Run the script
WITH pc_data AS (
SELECT con.to_id AS lga_id,
con.to_name AS lga_name,
sum(pc.cases::float * con.address_percent / 100.0)::integer AS cases
FROM testing.nsw_covid_cases_20220503_postcode AS pc
INNER JOIN gnaf_202408.boundary_concordance AS con ON pc.postcode = con.from_id
WHERE con.from_source = 'geoscape 202408'
AND con.from_bdy = 'postcode'
AND con.to_source = 'abs 2016'
AND con.to_bdy = 'lga'
GROUP BY lga_id,
lga_name
)
SELECT pc_data.lga_id,
pc_data.lga_name,
lga.tests,
pc_data.cases,
(pc_data.cases::float / lga.tests::float * 100.0)::numeric(4,1) AS infection_rate_percent
FROM testing.nsw_covid_tests_20220503_lga AS lga
INNER JOIN pc_data on pc_data.lga_id = lga.lga_code19;
Example concordance file join to convert postcode level data to LGA and then join with LGA level data
Incorporates or developed using G-NAF © Geoscape Australia licensed by the Commonwealth of Australia under the Open Geo-coded National Address File (G-NAF) End User Licence Agreement.
Incorporates or developed using Administrative Boundaries © Geoscape Australia licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).
Based on Australian Bureau of Statistics data, licensed by the Commonwealth of Australia under Creative Commons Attribution 4.0 International licence (CC BY 4.0).
Covid 19 data © NSW Ministry of Health, licensed by the NSW Government under Creative Commons Attribution 4.0 International licence (CC BY 4.0).