Fix problem with covid_hosp skipping state revisions. #1064

krivard · 2023-01-12T18:20:33Z

Prevents future recurrence of the HHS state hospital admissions outage active 2023-01-12.

Prerequisites:

Unless it is a documentation hotfix it should be merged against the dev branch
Branch is up-to-date with the branch to be merged with, i.e. dev
Build is successful
Code is cleaned up and formatted

Summary

In January 2023, we noticed that the HHS hospital admissions data had unusually high lag. On investigation, it turned out that timeseries datasets had not been fetched in over a week, even though new timeseries revisions were available from healthdata.gov. It turned out that successful import of daily revisions (which in Jan 2023 have 7 days of lag, while timeseries revisions have only 1-2) were masking the new timeseries revisions from the acquisition system.

This PR changes the usage of the dataset_name column in the covid_hosp_meta table from containing the data table name (for which covid_hosp_state_timeseries is shared by both the timeseries and daily revisions pipelines) to containing the healthdata.gov dataset ID (which is unique). This lets the acquisition system check for the last known timeseries file when pulling timeseries revisions, and the last known daily file when pulling daily revisions.

This PR also changes how we track metadata. Previously, each run of the pipeline collected together all revisions posted to healthdata.gov on a particular day, and recorded only one line in metadata for the whole batch -- preventing us humans from having any idea whether any particular file from healthdata.gov was actually ingested by the acquisition system. The proposed change records a line in metadata for each file from healthdata.gov which is included in the batch.

This PR includes a migration to run after deploy and before next acquisition, which will update the covid_hosp_meta table to tag state rows with their healthdata.gov ID based on the name of the revision file stored in the revision_timestamp column.

Things this PR DOES NOT include:

changing the older_than inequality to permit pulling revisions from the current day: we exclude current-day revisions on purpose to avoid a scenario where the initial data reported for an issue turns out to be incomplete and must be updated later, i.e., needing to version our versions in addition to versioning the reference dates. essentially, we wait until we're sure the issue from healthdata.gov is complete before ingesting it.
checking the metadata table and skipping acquisition of revisions that are already listed: this is potentially desirable as a mechanism for minimally backfilling the last two years of potentially-skipped revisions, or as a mechanism for relaxing our insistence on excluding current-day revisions if we think 12-24 hours of reduced lag is worth the consequences. we'd need to include a way to turn it off though, since our current method of pulling in new columns on past data is just to re-acquire the files.

Includes a migration to run after deploy and before next acquisition.

melange396

would it be worth adding a new column so that we dont lose the table_name that was stored before? i know its ultimately recoverable from the other information, but maybe it could mean potentially easier debugging of future issues? covid_hosp_meta isnt a big table so it should be "easy".

you might also wanna update the column description for dataset_name in the ddl file.

melange396 · 2023-01-18T18:29:26Z

src/acquisition/covid_hosp/common/database.py

@@ -19,6 +19,7 @@ class Database:
  def __init__(self,
               connection,
               table_name=None,
+               dataset_name=None,


perhaps add a docstring entry for this?

krivard · 2023-01-18T21:11:27Z

@melange396 done

melange396

looks great!

Fix problem with covid_hosp skipping state revisions.

5d3bce6

Includes a migration to run after deploy and before next acquisition.

krivard requested review from neul3, nolangormley and melange396 January 12, 2023 18:20

neul3 previously approved these changes Jan 13, 2023

View reviewed changes

melange396 reviewed Jan 18, 2023

View reviewed changes

[covid_hosp] add new column instead of repurpose old column

247a1e3

krivard dismissed neul3’s stale review via 247a1e3 January 18, 2023 21:11

krivard requested a review from melange396 January 18, 2023 21:11

melange396 approved these changes Jan 19, 2023

View reviewed changes

krivard merged commit 517df70 into dev Jan 19, 2023

krivard deleted the krivard/covid_hosp_datechecks branch January 19, 2023 15:20

krivard mentioned this pull request Jan 19, 2023

Release Delphi Epidata 0.4.5 #1070

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix problem with covid_hosp skipping state revisions. #1064

Fix problem with covid_hosp skipping state revisions. #1064

krivard commented Jan 12, 2023

melange396 left a comment •

edited

Loading

melange396 Jan 18, 2023

krivard commented Jan 18, 2023

melange396 left a comment

Fix problem with covid_hosp skipping state revisions. #1064

Fix problem with covid_hosp skipping state revisions. #1064

Conversation

krivard commented Jan 12, 2023

Summary

melange396 left a comment • edited Loading

Choose a reason for hiding this comment

melange396 Jan 18, 2023

Choose a reason for hiding this comment

krivard commented Jan 18, 2023

melange396 left a comment

Choose a reason for hiding this comment

melange396 left a comment •

edited

Loading