Skip to content

Commit 0c65bb4

Browse files
authored
Merge pull request #1960 from cmu-delphi/ds/geomap
feat: add aggregate_by_weighted_sum to geomapper
2 parents 0751162 + 48e247f commit 0c65bb4

File tree

5 files changed

+361
-138
lines changed

5 files changed

+361
-138
lines changed

.git-blame-ignore-revs

Lines changed: 3 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,2 +1,4 @@
1-
# Format geomap.py with black
1+
# Format geomap.py
22
d4b056e7a4c11982324e9224c9f9f6fd5d5ec65c
3+
# Format test_geomap.py
4+
79072dcdec3faca9aaeeea65de83f7fa5c00d53f
Lines changed: 11 additions & 20 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# Geocoding data processing pipeline
1+
# Geocoding Data Processing
22

33
Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov
44

@@ -7,42 +7,37 @@ Authors: Jingjing Tang, James Sharpnack, Dmitry Shemetov
77
Requires the following source files below.
88

99
Run the following to build the crosswalk tables in `covidcast-indicators/_delph_utils_python/delph_utils/data`
10-
```
10+
11+
```sh
1112
$ python geo_data_proc.py
1213
```
1314

14-
You can see consistency checks and diffs with old sources in ./consistency_checks.ipynb
15+
Find data consistency checks in `./source-file-sanity-check.ipynb`.
1516

1617
## Geo Codes
1718

1819
We support the following geocodes.
1920

20-
- The ZIP code and the FIPS code are the most granular geocodes we support.
21-
- The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros).
22-
- The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
23-
- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html).
24-
- We are reserving 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
21+
- The [ZIP code](https://en.wikipedia.org/wiki/ZIP_Code) is a US postal code used by the USPS and the [FIPS code](https://en.wikipedia.org/wiki/FIPS_county_code) is an identifier for US counties and other associated territories. The ZIP code is five digit code (with leading zeros).
22+
- The FIPS code is a five digit code (with leading zeros), where the first two digits are a two-digit state code and the last three are a three-digit county code (see this [US Census Bureau page](https://www.census.gov/library/reference/code-lists/ansi.html) for detailed information).
23+
- The Metropolitan Statistical Area (MSA) code refers to regions around cities (these are sometimes referred to as CBSA codes). More information on these can be found at the [US Census Bureau](https://www.census.gov/programs-surveys/metro-micro/about.html). We rserve 10001-10099 for states codes of the form 100XX where XX is the FIPS code for the state (the current smallest CBSA is 10100). In the case that the CBSA codes change then it should be verified that these are not used.
2524
- State codes are a series of equivalent identifiers for US state. They include the state name, the state number (state_id), and the state two-letter abbreviation (state_code). The state number is the state FIPS code. See [here](https://en.wikipedia.org/wiki/List_of_U.S._state_and_territory_abbreviations) for more.
2625
- The Hospital Referral Region (HRR) and the Hospital Service Area (HSA). More information [here](https://www.dartmouthatlas.org/covid-19/hrr-mapping/).
27-
FIPS codes depart in some special cases, so we produce manual changes listed below.
2826

29-
## Source files
27+
## Source Files
3028

3129
The source files are requested from a government URL when `geo_data_proc.py` is run (see the top of said script for the URLs). Below we describe the locations to find updated versions of the source files, if they are ever needed.
3230

3331
- ZIP -> FIPS (county) population tables available from [US Census](https://www.census.gov/geographies/reference-files/time-series/geo/relationship-files.html#par_textimage_674173622). This file contains the population of the intersections between ZIP and FIPS regions, allowing the creation of a population-weighted transform between the two. As of 4 February 2022, this source did not include population information for 24 ZIPs that appear in our indicators. We have added those values manually using information available from the [zipdatamaps website](www.zipdatamaps.com).
3432
- ZIP -> HRR -> HSA crosswalk file comes from the 2018 version at the [Dartmouth Atlas Project](https://atlasdata.dartmouth.edu/static/supp_research_data).
3533
- FIPS -> MSA crosswalk file comes from the September 2018 version of the delineation files at the [US Census Bureau](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html).
36-
- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3). The first two digits of a FIPS codes should match the state code here.
34+
- State Code -> State ID -> State Name comes from the ANSI standard at the [US Census](https://www.census.gov/library/reference/code-lists/ansi.html#par_textimage_3).
3735

38-
39-
## Derived files
36+
## Derived Files
4037

4138
The rest of the crosswalk tables are derived from the mappings above. We provide crosswalk functions from granular to coarser codes, but not the other way around. This is because there is no information gained when crosswalking from coarse to granular.
4239

43-
44-
45-
## Deprecated source files
40+
## Deprecated Source Files
4641

4742
- ZIP to FIPS to HRR to states: `02_20_uszips.csv` comes from a version of the table [here](https://simplemaps.com/data/us-zips) modified by Jingjing to include population weights.
4843
- The `02_20_uszips.csv` file is based on the newest consensus data including 5-digit zipcode, fips code, county name, state, population, HRR, HSA (I downloaded the original file from [here](https://simplemaps.com/data/us-zips). This file matches best to the most recent (2020) situation in terms of the population. But there still exist some matching problems. I manually checked and corrected those lines (~20) with [zip-codes](https://www.zip-codes.com/zip-code/58439/zip-code-58439.asp). The mapping from 5-digit zipcode to HRR is based on the file in 2017 version downloaded from [here](https://atlasdata.dartmouth.edu/static/supp_research_data).
@@ -51,7 +46,3 @@ The rest of the crosswalk tables are derived from the mappings above. We provide
5146
- CBSA -> FIPS crosswalk from [here](https://data.nber.org/data/cbsa-fips-county-crosswalk.html) (the file is `cbsatocountycrosswalk.csv`).
5247
- MSA tables from March 2020 [here](https://www.census.gov/geographies/reference-files/time-series/demo/metro-micro/delineation-files.html). This file seems to differ in a few fips codes from the source for the 02_20_uszip file which Jingjing constructed. There are at least 10 additional fips in 03_20_msa that are not in the uszip file, and one of the msa codes seems to be incorrect: 49020 (a google search confirms that it is incorrect in uszip and correct in the census data).
5348
- MSA tables from 2019 [here](https://apps.bea.gov/regional/docs/msalist.cfm)
54-
55-
## Notes
56-
57-
- The NAs in the coding currently zero-fills.

_delphi_utils_python/data_proc/geomap/geo_data_proc.py

Lines changed: 2 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -1,10 +1,7 @@
11
"""
2-
Authors: Dmitry Shemetov @dshemetov, James Sharpnack @jsharpna
3-
4-
Intended execution:
2+
Authors: Dmitry Shemetov, James Sharpnack
53
64
cd _delphi_utils/data_proc/geomap
7-
chmod u+x geo_data_proc.py
85
python geo_data_proc.py
96
"""
107

@@ -19,7 +16,7 @@
1916

2017

2118
# Source files
22-
YEAR = 2019
19+
YEAR = 2020
2320
INPUT_DIR = "./old_source_files"
2421
OUTPUT_DIR = f"../../delphi_utils/data/{YEAR}"
2522
FIPS_BY_ZIP_POP_URL = "https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt?#"
@@ -41,7 +38,6 @@
4138
FIPS_HHS_FILENAME = "fips_hhs_table.csv"
4239
FIPS_CHNGFIPS_OUT_FILENAME = "fips_chng-fips_table.csv"
4340
FIPS_POPULATION_OUT_FILENAME = "fips_pop.csv"
44-
4541
CHNGFIPS_STATE_OUT_FILENAME = "chng-fips_state_table.csv"
4642
ZIP_HSA_OUT_FILENAME = "zip_hsa_table.csv"
4743
ZIP_HRR_OUT_FILENAME = "zip_hrr_table.csv"

_delphi_utils_python/delphi_utils/geomap.py

Lines changed: 132 additions & 56 deletions
Original file line numberDiff line numberDiff line change
@@ -18,54 +18,90 @@ class GeoMapper: # pylint: disable=too-many-public-methods
1818
1919
The GeoMapper class provides utility functions for translating between different
2020
geocodes. Supported geocodes:
21-
- zip: zip5, a length 5 str of 0-9 with leading 0's
22-
- fips: state code and county code, a length 5 str of 0-9 with leading 0's
23-
- msa: metropolitan statistical area, a length 5 str of 0-9 with leading 0's
24-
- state_code: state code, a str of 0-9
25-
- state_id: state id, a str of A-Z
26-
- hrr: hospital referral region, an int 1-500
27-
28-
Mappings:
29-
- [x] zip -> fips : population weighted
30-
- [x] zip -> hrr : unweighted
31-
- [x] zip -> msa : unweighted
32-
- [x] zip -> state
33-
- [x] zip -> hhs
34-
- [x] zip -> population
35-
- [x] state code -> hhs
36-
- [x] fips -> state : unweighted
37-
- [x] fips -> msa : unweighted
38-
- [x] fips -> megacounty
39-
- [x] fips -> hrr
40-
- [x] fips -> hhs
41-
- [x] fips -> chng-fips
42-
- [x] chng-fips -> state : unweighted
43-
- [x] nation
44-
- [ ] zip -> dma (postponed)
45-
46-
The GeoMapper instance loads crosswalk tables from the package data_dir. The
47-
crosswalk tables are assumed to have been built using the geo_data_proc.py script
48-
in data_proc/geomap. If a mapping between codes is NOT one to many, then the table has
49-
just two colums. If the mapping IS one to many, then a third column, the weight column,
50-
exists (e.g. zip, fips, weight; satisfying (sum(weights) where zip==ZIP) == 1).
21+
22+
- zip: five characters [0-9] with leading 0's, e.g. "33626"
23+
also known as zip5 or zip code
24+
- fips: five characters [0-9] with leading 0's, e.g. "12057"
25+
the first two digits are the state FIPS code and the last
26+
three are the county FIPS code
27+
- msa: five characters [0-9] with leading 0's, e.g. "90001"
28+
also known as metropolitan statistical area
29+
- state_code: two characters [0-9], e.g "06"
30+
- state_id: two characters [A-Z], e.g "CA"
31+
- state_name: human-readable name, e.g "California"
32+
- state_*: we use this below to refer to the three above geocodes in aggregate
33+
- hrr: an integer from 1-500, also known as hospital
34+
referral region
35+
- hhs: an integer from 1-10, also known as health and human services region
36+
https://www.hhs.gov/about/agencies/iea/regional-offices/index.html
37+
38+
Valid mappings:
39+
40+
From To Population Weighted
41+
zip fips Yes
42+
zip hrr No
43+
zip msa Yes
44+
zip state_* Yes
45+
zip hhs Yes
46+
zip population --
47+
zip nation No
48+
state_* state_* No
49+
state_* hhs No
50+
state_* population --
51+
state_* nation No
52+
fips state_* No
53+
fips msa No
54+
fips megacounty No
55+
fips hrr Yes
56+
fips hhs No
57+
fips chng-fips No
58+
fips nation No
59+
chng-fips state_* No
60+
61+
Crosswalk Tables
62+
================
63+
64+
The GeoMapper instance loads pre-generated crosswalk tables (built by the
65+
script in `data_proc/geomap/geo_data_proc.py`). If a mapping between codes
66+
is one to one or many to one, then the table has just two columns. If the
67+
mapping is one to many, then a weight column is provided, which gives the
68+
fractional population contribution of a source_geo to the target_geo. The
69+
weights satisfy the condition that df.groupby(from_code).sum(weight) == 1.0
70+
for all values of from_code.
71+
72+
Aggregation
73+
===========
74+
75+
The GeoMapper class provides functions to aggregate data from one geocode
76+
to another. The aggregation can be a simple one-to-one mapping or a
77+
weighted aggregation. The weighted aggregation is useful when the data
78+
being aggregated is a population-weighted quantity, such as visits or
79+
cases. The aggregation is done by multiplying the data columns by the
80+
weights and summing over the data columns. Note that the aggregation does
81+
not adjust the aggregation for missing or NA values in the data columns,
82+
which is equivalent to a zero-fill.
5183
5284
Example Usage
53-
==========
85+
=============
5486
The main GeoMapper object loads and stores crosswalk dataframes on-demand.
5587
56-
When replacing geocodes with a new one an aggregation step is performed on the data columns
57-
to merge entries (i.e. in the case of a many to one mapping or a weighted mapping). This
58-
requires a specification of the data columns, which are assumed to be all the columns that
59-
are not the geocodes or the date column specified in date_col.
88+
When replacing geocodes with a new one an aggregation step is performed on
89+
the data columns to merge entries (i.e. in the case of a many to one
90+
mapping or a weighted mapping). This requires a specification of the data
91+
columns, which are assumed to be all the columns that are not the geocodes
92+
or the date column specified in date_col.
6093
6194
Example 1: to add a new column with a new geocode, possibly with weights:
6295
> gmpr = GeoMapper()
63-
> df = gmpr.add_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id",
96+
> df = gmpr.add_geocode(df, "fips", "zip",
97+
from_col="fips", new_col="geo_id",
6498
date_col="timestamp", dropna=False)
6599
66-
Example 2: to replace a geocode column with a new one, aggregating the data with weights:
100+
Example 2: to replace a geocode column with a new one, aggregating the data
101+
with weights:
67102
> gmpr = GeoMapper()
68-
> df = gmpr.replace_geocode(df, "fips", "zip", from_col="fips", new_col="geo_id",
103+
> df = gmpr.replace_geocode(df, "fips", "zip",
104+
from_col="fips", new_col="geo_id",
69105
date_col="timestamp", dropna=False)
70106
"""
71107

@@ -113,7 +149,7 @@ def __init__(self, census_year: int = 2020):
113149
subkey
114150
for mainkey in self.CROSSWALK_FILENAMES
115151
for subkey in self.CROSSWALK_FILENAMES[mainkey]
116-
}.union(set(self.CROSSWALK_FILENAMES.keys())) - set(["state", "pop"])
152+
}.union(set(self.CROSSWALK_FILENAMES.keys())) - {"state", "pop"}
117153

118154
for from_code, to_codes in self.CROSSWALK_FILENAMES.items():
119155
for to_code, file_path in to_codes.items():
@@ -135,7 +171,6 @@ def _load_crosswalk_from_file(
135171
"weight": float,
136172
**{geo: str for geo in self._geos - set("nation")},
137173
}
138-
139174
usecols = [from_code, "pop"] if to_code == "pop" else None
140175
return pd.read_csv(stream, dtype=dtype, usecols=usecols)
141176

@@ -229,12 +264,7 @@ def add_geocode(
229264
):
230265
"""Add a new geocode column to a dataframe.
231266
232-
Currently supported conversions:
233-
- fips -> state_code, state_id, state_name, zip, msa, hrr, nation, hhs, chng-fips
234-
- chng-fips -> state_code, state_id, state_name
235-
- zip -> state_code, state_id, state_name, fips, msa, hrr, nation, hhs
236-
- state_x -> state_y (where x and y are in {code, id, name}), nation
237-
- state_code -> hhs, nation
267+
See class docstring for supported geocode transformations.
238268
239269
Parameters
240270
---------
@@ -303,7 +333,7 @@ def add_geocode(
303333
df = df.merge(crosswalk, left_on=from_col, right_on=from_col, how="left")
304334

305335
# Drop extra state columns
306-
if new_code in state_codes and not from_code in state_codes:
336+
if new_code in state_codes and from_code not in state_codes:
307337
state_codes.remove(new_code)
308338
df.drop(columns=state_codes, inplace=True)
309339
elif new_code in state_codes and from_code in state_codes:
@@ -345,12 +375,7 @@ def replace_geocode(
345375
) -> pd.DataFrame:
346376
"""Replace a geocode column in a dataframe.
347377
348-
Currently supported conversions:
349-
- fips -> chng-fips, state_code, state_id, state_name, zip, msa, hrr, nation
350-
- chng-fips -> state_code, state_id, state_name
351-
- zip -> state_code, state_id, state_name, fips, msa, hrr, nation
352-
- state_x -> state_y (where x and y are in {code, id, name}), nation
353-
- state_code -> hhs, nation
378+
See class docstring for supported geocode transformations.
354379
355380
Parameters
356381
---------
@@ -397,7 +422,7 @@ def replace_geocode(
397422
df[data_cols] = df[data_cols].multiply(df["weight"], axis=0)
398423
df.drop("weight", axis=1, inplace=True)
399424

400-
if not date_col is None:
425+
if date_col is not None:
401426
df = df.groupby([date_col, new_col]).sum(numeric_only=True).reset_index()
402427
else:
403428
df = df.groupby([new_col]).sum(numeric_only=True).reset_index()
@@ -575,8 +600,7 @@ def get_geos_within(
575600
Return all contained regions of the given type within the given container geocode.
576601
577602
Given container_geocode (e.g "ca" for California) of type container_geocode_type
578-
(e.g "state"), return:
579-
- all (contained_geocode_type)s within container_geocode
603+
(e.g "state"), return all (contained_geocode_type)s within container_geocode.
580604
581605
Supports these 4 combinations:
582606
- all states within a nation
@@ -627,3 +651,55 @@ def get_geos_within(
627651
"must be one of (state, nation), (state, hhs), (county, state)"
628652
", (fips, state), (chng-fips, state)"
629653
)
654+
655+
def aggregate_by_weighted_sum(
656+
self, df: pd.DataFrame, to_geo: str, sensor_col: str, time_col: str, population_col: str
657+
) -> pd.DataFrame:
658+
"""Aggregate sensor, weighted by time-dependent population.
659+
660+
Note: This function generates its own population weights and excludes
661+
locations where the data is NA, which is effectively an extrapolation
662+
assumption to the rest of the geos. This is in contrast to the
663+
`replace_geocode` function, which assumes that the weights are already
664+
present in the data and does not adjust for missing data (see the
665+
docstring for the GeoMapper class).
666+
667+
Parameters
668+
---------
669+
df: pd.DataFrame
670+
Input dataframe, assumed to have a sensor column (e.g. "visits"), a
671+
to_geo column (e.g. "state"), and a population column (corresponding
672+
to a from_geo, e.g. "wastewater collection site").
673+
to_geo: str
674+
The column name of the geocode to aggregate to.
675+
sensor: str
676+
The column name of the sensor to aggregate.
677+
population_column: str
678+
The column name of the population to weight the sensor by.
679+
680+
Returns
681+
---------
682+
agg_df: pd.DataFrame
683+
A dataframe with the aggregated sensor values, weighted by population.
684+
"""
685+
# Don't modify the input dataframe
686+
df = df.copy()
687+
# Zero-out populations where the sensor is NA
688+
df["_zeroed_pop"] = df[population_col] * df[sensor_col].abs().notna()
689+
# Weight the sensor by the population
690+
df["_weighted_sensor"] = df[sensor_col] * df["_zeroed_pop"]
691+
agg_df = (
692+
df.groupby([time_col, to_geo])
693+
.agg(
694+
{
695+
"_zeroed_pop": "sum",
696+
"_weighted_sensor": lambda x: x.sum(min_count=1),
697+
}
698+
).assign(
699+
_new_sensor = lambda x: x["_weighted_sensor"] / x["_zeroed_pop"]
700+
).reset_index()
701+
.rename(columns={"_new_sensor": f"weighted_{sensor_col}"})
702+
.drop(columns=["_zeroed_pop", "_weighted_sensor"])
703+
)
704+
705+
return agg_df

0 commit comments

Comments
 (0)