Skip to content

Commit d52dcd5

Browse files
authored
Add discussion of patching functionality to indicator manual (#2031)
* basic patching info * custom_run * add hhs to geo section * de-emphasize R in contributing doc, and add links * ref/issue date intro * naming and force push standards * secret and params links
1 parent ad089cf commit d52dcd5

File tree

2 files changed

+88
-29
lines changed

2 files changed

+88
-29
lines changed

.github/CONTRIBUTING.md

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -12,7 +12,9 @@ The production branch is configured to automatically deploy to our production en
1212

1313
* everything else
1414

15-
All other branches are development branches. We don't enforce a naming policy.
15+
All other branches are development branches. We don't enforce a naming policy, but it is recommended to prefix all branches you create with your name, username, or initials (e.g. `username/branch-name`).
16+
17+
We don't forbid force-pushing, but please keep to a minimum and be careful of using when modifying a branch at the same time as others.
1618

1719
## Issues
1820

@@ -29,7 +31,7 @@ So, how does one go about developing a pipeline for a new data source?
2931
### tl;dr
3032

3133
1. Create your new indicator branch from `main`.
32-
2. Build it using the appropriate template, following the guidelines in the included README.md and REVIEW.md files.
34+
2. Build it using the [indicator template](https://github.com/cmu-delphi/covidcast-indicators/tree/main/_template_python), following the guidelines in the included README.md, REVIEW.md, and INDICATOR_DEV_GUIDE.md files.
3335
3. Make some stuff!
3436
4. When your stuff works, push your development branch to remote, and open a PR against `main` for review.
3537
5. Once your PR has been merged, consult with a platform engineer for the remaining production setup needs. They will create a deployment workflow for your indicator including any necessary production parameters. Production secrets are encrypted in the Ansible vault. This workflow will be tested in staging by admins, who will consult you about any problems they encounter.
@@ -50,7 +52,7 @@ git checkout -b dev-my-feature-branch
5052

5153
### Creating your indicator
5254

53-
Create a directory for your new indicator by making a copy of `_template_r` or `_template_python` depending on the programming language you intend to use. If using Python, add the name of the directory to the list found in `jobs > build > strategy > matrix > packages` in `.github/workflows/python-ci.yml`, which will enable automated checks for your indicator when you make PRs. The template copies of `README.md` and `REVIEW.md` include the minimum requirements for code structure, documentation, linting, testing, and method of configuration. Beyond that, we don't have any established restrictions on implementation; you can look at other existing indicators see some examples of code layout, organization, and general approach.
55+
Create a directory for your new indicator by making a copy of `_template_python`. (We also make a `_template_r` available, but R should be only used as a last resort, due to complications using it in production.) Add the name of the directory to the list found in `jobs > build > strategy > matrix > packages` in `.github/workflows/python-ci.yml`, which will enable automated checks for your indicator when you make PRs. The template copies of `README.md` and `REVIEW.md` include the minimum requirements for code structure, documentation, linting, testing, and method of configuration. Beyond that, we don't have any established restrictions on implementation; you can look at other existing indicators see some examples of code layout, organization, and general approach.
5456

5557
* Consult your peers with questions! :handshake:
5658

@@ -62,7 +64,7 @@ Once you have something that runs locally and passes tests you set up your remot
6264
git push -u origin dev-my-feature-branch
6365
```
6466

65-
You can then draft public API documentation for people who would fetch this
67+
You can then draft [public API documentation](https://cmu-delphi.github.io/delphi-epidata/) for people who would fetch this
6668
data from the API. Public API documentation is kept in the delphi-epidata
6769
repository, and there is a [template Markdown
6870
file](https://github.com/cmu-delphi/delphi-epidata/blob/main/docs/api/covidcast-signals/_source-template.md)
@@ -104,7 +106,8 @@ We use a branch-based git workflow coupled with [Jenkins](https://www.jenkins.io
104106
* Package - Tar and gzip the built environment.
105107
* Deploy - Trigger an Ansible playbook to place the built package onto the runtime host, place any necessary production configuration, and adjust the runtime envirnemnt (if necessary).
106108

107-
There are several additional Jenkins-specific files that will need to be created for each indicator, as well as some configuration additions to the runtime host. It will be important to pair with a platform engineer to prepare the necessary production environment needs, test the workflow, validate on production, and ultimately sign off on a production release.
109+
There are several additional Jenkins-specific files that will need to be created for each indicator, as well as some configuration additions to the runtime host.
110+
It will be important to pair with a platform engineer to prepare the necessary production environment needs, test the workflow, validate on production, and ultimately sign off on a production release.
108111

109112
### Preparing container images of indicators
110113

_template_python/INDICATOR_DEV_GUIDE.md

Lines changed: 80 additions & 24 deletions
Original file line numberDiff line numberDiff line change
@@ -50,6 +50,7 @@ This is the general extract-transform-load procedure used by all COVIDcast indic
5050
7. Deliver the CSV output files to the `receiving/` directory on the API server.
5151

5252
Adding a new indicator typically means implementing steps 1-3. Step 4 is included via the function ` create_export_csv`. Steps 5 (the validator), 6 (the archive differ) and 7 (acquisition) are all handled by runners in production.
53+
5354
## Step 0: Keep revision history (important!)
5455

5556
If the data provider doesn’t provide or it is unclear if they provide historical versions of the data, immediately set up a script (bash, Python, etc) to automatically (e.g. cron) download the data every day and save locally with versioning.
@@ -195,7 +196,14 @@ Generally, indicators have:
195196
Do other geo handling (e.g. finding and reporting DC as a state).
196197
* `constants.py`: Lists of geos to produce, signals to produce, dataset ids, data source URL, etc.
197198

198-
Your code should be _extensively_ commented! Especially note sections where you took an unusual approach (make sure to say why and consider briefly discussing alternate approaches).
199+
Your code should be _extensively_ commented! Especially note sections where you took an unusual approach (make sure to say why and consider briefly discussing alternate approaches that were discarded or could be useful in the future).
200+
201+
#### Development environment
202+
203+
Make sure you have a functional environment with python 3.8.15+.
204+
For local runs, the makefile’s make install target will set up a local virtual environment with necessary packages.
205+
206+
(If working in R (very much NOT recommended), local runs can be run without a virtual environment or using the [`renv` package](https://rstudio.github.io/renv/articles/renv.html), but production runs should be set up to use Docker.)
199207

200208
#### Function stubs
201209

@@ -235,12 +243,22 @@ def api_call(token: str):
235243

236244
After that, generalize your code to be able to be run on all geos of interest, take settings from params.json, use constants for easy maintenance, with extensive documentation, etc.
237245

238-
#### Development environment
246+
#### Testing
239247

240-
Make sure you have a functional environment with python 3.8.15+.
241-
For local runs, the makefile’s make install target will set up a local virtual environment with necessary packages.
248+
As a general rule, it helps to decompose your functions into operations for which you can write unit tests.
249+
To run the tests, use `make test` in the top-level indicator directory.
242250

243-
(If working in R (very much NOT recommended), local runs can be run without a virtual environment or using the [`renv` package](https://rstudio.github.io/renv/articles/renv.html), but production runs should be set up to use Docker.)
251+
Unit tests are required for all functions.
252+
Integration tests are highly desired, but may be difficult to set up depending on where the data is being fetched from.
253+
Mocking functions are useful in this case.
254+
255+
#### Dealing with dates
256+
257+
We keep track of two different date fields for each dataset. The first field is called "reference value" (field name `time_value`) and tracks the date that a value is reported _for_, that is, when the event happened. The second field is called "issue date" or "version" (field name `issue`) and tracks when a value was recorded, not when it happened.
258+
259+
For example, flu test positivity of 80% for a reference date of Jan 1 and an issue date of Jan 5 means that _on_ Jan 1, the test positivity rate was 80%. But we only received and recorded the value on Jan 5, 4 days later (AKA a lag of 4 days).
260+
261+
It's important to track issue date because many data sources are revised over time, and reported values can change substantially between issues.
244262

245263
#### Dealing with data-types
246264

@@ -255,14 +273,16 @@ E.g. which geo values are allowed, should every valid date be present in some wa
255273

256274
In an ideal case, the data exists at one of our [already covered geos](https://cmu-delphi.github.io/delphi-epidata/api/covidcast_geography.html):
257275

258-
* State: state_code (string, leftpadded to 2 digits with 0) or state_id (string)
276+
* Zip code
259277
* FIPS (state+county codes, string leftpadded to 5 digits with 0)
260-
* ZIP
261278
* MSA (metro statistical area, int)
262279
* HRR (hospital referral region, int)
280+
* State: state_code (string, leftpadded to 2 digits with 0) or state_id (string)
281+
* HHS ([Department of Health and Human Services-defined regions](https://www.hhs.gov/about/agencies/iea/regional-offices/index.html))
282+
* Nation
263283

264-
If you want to map from one of these to another, the [`delphi_utils.geomapper`](https://github.com/cmu-delphi/covidcast-indicators/blob/6912077acba97e835aff7d0cd3d64309a1a9241d/_delphi_utils_python/delphi_utils/geomap.py) utility covers most cases.
265-
A brief example of aggregating from states to hhs regions via their population:
284+
If you want to map from one of these to a higher level, the [`delphi_utils.geomapper`](https://github.com/cmu-delphi/covidcast-indicators/blob/6912077acba97e835aff7d0cd3d64309a1a9241d/_delphi_utils_python/delphi_utils/geomap.py) utility covers most cases.
285+
Here's a brief example of aggregating from states to hhs regions via their population:
266286

267287
```{python}
268288
from delphi_utils.geomap import GeoMapper
@@ -274,19 +294,6 @@ hhs_version = geo_mapper.replace_geocode(df, "state_code","hhs", new_col = "geo_
274294

275295
This example is taken from [`hhs_hosp`](https://github.com/cmu-delphi/covidcast-indicators/blob/main/hhs_hosp/delphi_hhs/run.py); more documentation can be found in the `geomapper` class definition.
276296

277-
#### Implement a Missing Value code system
278-
279-
The column is described [here](https://cmu-delphi.github.io/delphi-epidata/api/missing_codes.html).
280-
281-
#### Local testing
282-
283-
As a general rule, it helps to decompose your functions into operations for which you can write unit tests.
284-
To run the tests, use `make test` in the top-level indicator directory.
285-
286-
Unit tests are required for all functions.
287-
Integration tests are highly desired, but may be difficult to set up depending on where the data is being fetched from.
288-
Mocking functions are useful in this case.
289-
290297
#### Naming
291298

292299
Indicator and signal names need to be approved by [@RoniRos](https://www.github.com/RoniRos).
@@ -324,6 +331,52 @@ Using this tag dictionary, we can interpret the following signals as
324331
* `confirmed_admissions_influenza_1d_prop` = raw (unsmoothed) daily ("1d") confirmed influenza hospital admissions ("confirmed_admissions_influenza") per 100,000 population ("prop").
325332
* `confirmed_admissions_influenza_1d_prop_7dav` = the same as above, but smoothed with a 7-day moving average ("7dav").
326333

334+
#### Implement a Missing Value code system
335+
336+
The column is described [here](https://cmu-delphi.github.io/delphi-epidata/api/missing_codes.html).
337+
338+
#### Implement a patching method
339+
340+
After normal data reporting is restored following an outage, we would like to be able to easily reconstruct the version history of the data. To do so, implement a `patch` method that runs an indicator's main `run_module` for every issue date in a range. An [example patch module](https://github.com/cmu-delphi/covidcast-indicators/blob/b784f30/google_symptoms/delphi_google_symptoms/patch.py).
341+
342+
An outage can be external to Delphi, e.g. the data provider was unable to provide new data on the historically-expected schedule, or internal, e.g. Delphi code had a bug that caused our pipeline to fail. The goal of the patch feature is to recreate every missing issue _as if we had ingested it on the correct day_.
343+
344+
The patch feature should be easy to use. The only manual parts should be modifying `params.json`, and running the patch module and acquisition. Any setup that needs to be done (e.g. cache creation, dir creation) should be done automatically as part of the patch function.
345+
346+
All patch modules should expect settings from `params.json` of the form
347+
348+
```
349+
{
350+
"common": {
351+
...
352+
"custom_run": true
353+
},
354+
"validation": {
355+
...
356+
},
357+
"patch": {
358+
"patch_dir": "<path to dir>/<patch dir name>",
359+
"start_issue": "2024-04-20",
360+
"end_issue": "2024-04-21"
361+
}
362+
}
363+
```
364+
365+
The `custom_run` parameter should [default to false](https://github.com/cmu-delphi/covidcast-indicators/blob/d435bf0f0d5880ddf8905ea60f242976e6702342/nssp/delphi_nssp/run.py#L73), and [warn](https://github.com/cmu-delphi/covidcast-indicators/blob/d435bf0f0d5880ddf8905ea60f242976e6702342/nssp/delphi_nssp/run.py#L75-L83) if parameters and arguments disagree.
366+
367+
Patching should generate data for that range of issue dates, and store them in batch issue format:
368+
`<patch_dir as provided in the params>/issue_<issue date>/<indicator name as stored in our DB>/xxx.csv`.
369+
370+
Acquisition in `delphi-epidata` includes [code that allow files in this issue-specific structure](https://github.com/cmu-delphi/delphi-epidata/blob/694d89ad763fa85bd644e1f64552c9bc85f688ef/src/acquisition/covidcast/csv_to_database.py#L43C32-L43C61) to be added to the database. This output format is designed to match the `issue`-type acquisition format. The issue-specific mode is triggered with the flag `specific_issue_date`. [A Cronicle job](https://cronicle-prod-01.delphi.cmu.edu/#Schedule?sub=edit_event&id=elh59ynwobf) has already been set up to call acquisition using the flag; please use it to load patches into the database.
371+
372+
Sometimes source data is already versioned, and to reconstruct an issue we simply need to filter the source data to include only values that would have been available on that issue day. If we receive data drops directly, we can filter by the file creation date instead.
373+
374+
However, it is not always possible to reconstruct issues; many datasets aren't versioned by the provider. If a source has no revisions (for example, `google-symptoms`), then we can guess which dates of data would have been available that issue day based on the normal lag of the source. For example, `google-symptoms` normally has a lag of 4 days, i.e. "today" the most recent data we see in the source data is from 4 days ago. So to reconstruct data for issue 2024-01-10, we just need to report data with a `time_value` (reference date) from 2024-01-06 and earlier. (How much earlier depends on the behavior we normally expect from the indicator code; if we normally report 2 weeks of data, filter to 2024-01-06 - 14 days through 2024-01-06.)
375+
376+
Some datasets, such as those on healthdata.gov, provide metadata indicating when certain rows were updated.
377+
378+
In other cases (such as datasetes that both have revisions _and_ don't track revisions), please discuss with the indicator stakeholder and consider [what you know about how the data works](#step-1-exploratory-analysis).
379+
327380
### Statistical review
328381

329382
The data produced by the new indicator needs to be sanity-checked.
@@ -419,14 +472,17 @@ Refer to [this guide](https://docs.google.com/document/d/1Bbuvtoxowt7x2_8USx_JY-
419472

420473
* Add module name to the `build` job in `.github/workflows/python-ci.yml`.
421474
This allows github actions to run on this indicator code, which includes unit tests and linting.
422-
* Add top-level directory name to `indicator_list` in `Jenkinsfile`.
475+
* Add module name to the ["Copy version to indicator directory" step](https://github.com/cmu-delphi/covidcast-indicators/blob/f01185767a9847d8082baf4f1e17be50a39047c2/.github/workflows/create-release.yml#L64) in `.github/workflows/create-release.yml`.
476+
* Add top-level directory name to [`indicator_list` in `Jenkinsfile`](https://github.com/cmu-delphi/covidcast-indicators/blob/f01185767a9847d8082baf4f1e17be50a39047c2/Jenkinsfile#L13).
423477
This allows your code to be automatically deployed to staging after your branch is merged to main, and deployed to prod after `covidcast-indicators` is released.
424478
* Create `ansible/templates/{top_level_directory_name}-params-prod.json.j2` based on your `params.json.template` with some adjustment:
425479
* "export_dir": "/common/covidcast/receiving/{data-source-name}"
426480
* "log_filename": "/var/log/indicators/{top_level_directory_name}.log"
481+
* Define any sensitive variables as "secrets" in the [Ansible `vars.yaml`](https://github.com/cmu-delphi/covidcast-indicators/blob/main/ansible/vars.yaml) and [vault](https://github.com/cmu-delphi/covidcast-indicators/blob/main/ansible/vault.yaml).
482+
Refer to [this guide](https://docs.google.com/document/d/1Bbuvtoxowt7x2_8USx_JY-yTo-Av3oAFlhyG-vXGG-c/edit#heading=h.8kkoy8sx3t7f) for more vault info.
483+
* Add configs for Sir Complains-a-Lot ("sirCAL") alerting in sirCAL's [local](https://github.com/cmu-delphi/covidcast-indicators/blob/main/sir_complainsalot/params.json.template) and [Ansible](https://github.com/cmu-delphi/covidcast-indicators/blob/main/ansible/templates/sir_complainsalot-params-prod.json.j2) params templates.
427484

428485
Pay attention to the receiving/export directory, as well as how you can store credentials in vault.
429-
Refer to [this guide](https://docs.google.com/document/d/1Bbuvtoxowt7x2_8USx_JY-yTo-Av3oAFlhyG-vXGG-c/edit#heading=h.8kkoy8sx3t7f) for more vault info.
430486

431487
### Staging
432488

0 commit comments

Comments
 (0)