Fix problem with covid_hosp skipping state revisions. #1064
Includes a migration to run after deploy and before next acquisition.
Would it be worth adding a new column so that we don't lose the table_name that was stored before? I know it's ultimately recoverable from the other information, but it could make debugging future issues easier. covid_hosp_meta isn't a big table, so it should be "easy".
You might also want to update the column description for dataset_name in the DDL file.
@@ -19,6 +19,7 @@ class Database:
  def __init__(self,
               connection,
               table_name=None,
               dataset_name=None,
Perhaps add a docstring entry for this?
@melange396 done
looks great!
Prevents future recurrence of the HHS state hospital admissions outage active 2023-01-12.
Prerequisites:
dev
branchdev
Summary
In January 2023, we noticed that the HHS hospital admissions data had unusually high lag. On investigation, we found that timeseries datasets had not been fetched in over a week, even though new timeseries revisions were available from healthdata.gov. Successful imports of daily revisions (which as of January 2023 have 7 days of lag, while timeseries revisions have only 1-2) were masking the new timeseries revisions from the acquisition system.
This PR changes the usage of the dataset_name column in the covid_hosp_meta table from containing the data table name (for which covid_hosp_state_timeseries is shared by both the timeseries and daily revisions pipelines) to containing the healthdata.gov dataset ID (which is unique). This lets the acquisition system check for the last known timeseries file when pulling timeseries revisions, and the last known daily file when pulling daily revisions.

This PR also changes how we track metadata. Previously, each run of the pipeline collected together all revisions posted to healthdata.gov on a particular day, and recorded only one line in metadata for the whole batch -- preventing us humans from having any idea whether any particular file from healthdata.gov was actually ingested by the acquisition system. The proposed change records a line in metadata for each file from healthdata.gov which is included in the batch.
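The masking effect of a shared metadata key can be illustrated with a minimal sketch. This is not the acquisition code from the PR: the function names, the placeholder dataset IDs ("timeseries-id", "daily-id"), and the integer dates are all hypothetical, and the exact mechanism in the real pipeline may differ. It only shows why a high-water mark stored under a key shared by two pipelines can hide new files from one of them.

```python
from typing import List, Tuple

# Hypothetical metadata row: (value stored in dataset_name, revision date as YYYYMMDD).
MetaRow = Tuple[str, int]


def latest_known(meta: List[MetaRow], key: str) -> int:
    """Return the newest revision date recorded under `key`, or 0 if there is none."""
    return max((rev for name, rev in meta if name == key), default=0)


def new_revisions(available: List[int], meta: List[MetaRow], key: str) -> List[int]:
    """Return the available revisions that are newer than the last one recorded under `key`."""
    last = latest_known(meta, key)
    return [rev for rev in available if rev > last]


# Made-up timeseries revisions waiting to be fetched.
timeseries_available = [20230105, 20230106]

# Old scheme: both pipelines record rows under the shared table name, so a later
# row written by the daily pipeline raises the high-water mark for timeseries too.
shared_key_meta = [
    ("covid_hosp_state_timeseries", 20230104),  # written by the timeseries pipeline
    ("covid_hosp_state_timeseries", 20230110),  # written by the daily pipeline
]

# New scheme: each pipeline records rows under its own (placeholder) dataset ID.
per_dataset_meta = [
    ("timeseries-id", 20230104),
    ("daily-id", 20230110),
]

masked = new_revisions(timeseries_available, shared_key_meta, "covid_hosp_state_timeseries")
visible = new_revisions(timeseries_available, per_dataset_meta, "timeseries-id")
```

With the shared key, `masked` comes back empty even though two timeseries revisions are waiting; with one key per dataset, `visible` contains both.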
This PR includes a migration to run after deploy and before next acquisition, which will update the covid_hosp_meta table to tag state rows with their healthdata.gov ID based on the name of the revision file stored in the revision_timestamp column.

Things this PR DOES NOT include:
A change to the older_than inequality to permit pulling revisions from the current day. We exclude current-day revisions on purpose to avoid a scenario where the initial data reported for an issue turns out to be incomplete and must be updated later, i.e., needing to version our versions in addition to versioning the reference dates. Essentially, we wait until we're sure the issue from healthdata.gov is complete before ingesting it.
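As a rough illustration of why a strict older_than bound excludes current-day revisions, here is a minimal sketch. Only the name older_than comes from this PR description; the function `eligible_revisions`, the `newer_than` parameter, and the sample dates are assumptions made for the example and are not taken from the acquisition code.

```python
from datetime import date
from typing import List


def eligible_revisions(revision_dates: List[date],
                       newer_than: date,
                       older_than: date) -> List[date]:
    """Keep revisions strictly after `newer_than` and strictly before `older_than`.

    Passing today's date as `older_than` leaves out anything published today,
    so a revision is only picked up once its publication day is over.
    """
    return sorted(d for d in revision_dates if newer_than < d < older_than)


# Made-up example: three revisions, the last one published "today".
today = date(2023, 1, 12)
posted = [date(2023, 1, 10), date(2023, 1, 11), date(2023, 1, 12)]
picked = eligible_revisions(posted, newer_than=date(2023, 1, 9), older_than=today)
```

Here `picked` holds the revisions from January 10 and 11 only. Relaxing the upper bound to `d <= older_than` would also pull the January 12 revision, which is the change this PR deliberately leaves out.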