Skip to content

Fix problem with covid_hosp skipping state revisions. #1064

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 19, 2023

Conversation

krivard
Copy link
Contributor

@krivard krivard commented Jan 12, 2023

Prevents future recurrence of the HHS state hospital admissions outage active 2023-01-12.

Prerequisites:

  • Unless it is a documentation hotfix it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

Summary

In January 2023, we noticed that the HHS hospital admissions data had unusually high lag. On investigation, it turned out that timeseries datasets had not been fetched in over a week, even though new timeseries revisions were available from healthdata.gov. It turned out that successful import of daily revisions (which in Jan 2023 have 7 days of lag, while timeseries revisions have only 1-2) were masking the new timeseries revisions from the acquisition system.

This PR changes the usage of the dataset_name column in the covid_hosp_meta table from containing the data table name (for which covid_hosp_state_timeseries is shared by both the timeseries and daily revisions pipelines) to containing the healthdata.gov dataset ID (which is unique). This lets the acquisition system check for the last known timeseries file when pulling timeseries revisions, and the last known daily file when pulling daily revisions.

This PR also changes how we track metadata. Previously, each run of the pipeline collected together all revisions posted to healthdata.gov on a particular day, and recorded only one line in metadata for the whole batch -- preventing us humans from having any idea whether any particular file from healthdata.gov was actually ingested by the acquisition system. The proposed change records a line in metadata for each file from healthdata.gov which is included in the batch.

This PR includes a migration to run after deploy and before next acquisition, which will update the covid_hosp_meta table to tag state rows with their healthdata.gov ID based on the name of the revision file stored in the revision_timestamp column.

Things this PR DOES NOT include:

  • changing the older_than inequality to permit pulling revisions from the current day: we exclude current-day revisions on purpose to avoid a scenario where the initial data reported for an issue turns out to be incomplete and must be updated later, i.e., needing to version our versions in addition to versioning the reference dates. essentially, we wait until we're sure the issue from healthdata.gov is complete before ingesting it.
  • checking the metadata table and skipping acquisition of revisions that are already listed: this is potentially desirable as a mechanism for minimally backfilling the last two years of potentially-skipped revisions, or as a mechanism for relaxing our insistence on excluding current-day revisions if we think 12-24 hours of reduced lag is worth the consequences. we'd need to include a way to turn it off though, since our current method of pulling in new columns on past data is just to re-acquire the files.

Includes a migration to run after deploy and before next acquisition.
neul3
neul3 previously approved these changes Jan 13, 2023
Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would it be worth adding a new column so that we dont lose the table_name that was stored before? i know its ultimately recoverable from the other information, but maybe it could mean potentially easier debugging of future issues? covid_hosp_meta isnt a big table so it should be "easy".

you might also wanna update the column description for dataset_name in the ddl file.

@@ -19,6 +19,7 @@ class Database:
def __init__(self,
connection,
table_name=None,
dataset_name=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps add a docstring entry for this?

@krivard
Copy link
Contributor Author

krivard commented Jan 18, 2023

@melange396 done

@krivard krivard requested a review from melange396 January 18, 2023 21:11
Copy link
Collaborator

@melange396 melange396 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great!

@krivard krivard merged commit 517df70 into dev Jan 19, 2023
@krivard krivard deleted the krivard/covid_hosp_datechecks branch January 19, 2023 15:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants