incremental cache updating prototype #375

Open · wants to merge 11 commits into main

Conversation

@melange396 (Collaborator)

Re: issue #289

This is completely ignorant of re-issued data, so when a new issue of data arrives, the cache may be somewhat stale and/or inaccurate. The inaccuracies will probably be small as long as reissues do not differ wildly from the previous issue, and they will get washed out as data series continue to grow. A roughly weekly full repopulation of the cache should be run to clean up those inaccuracies.

This will break ingestion if new data is loaded before the cache has been fully populated with the existing data, because the code expects a new field in the cache: 'num_points'.

The code currently saves results to a 'test' cache table so they can be compared against the regular/main cache table. TODOs mark the lines that need to be reverted if/when testing is successful.

I have a feeling that the incremental changes could cause some loss of precision in the sd values, but I may just be paranoid; theoretically, the results would be identical if floats had infinite precision.
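
For reference, the standard way to fold a new batch into an existing (count, mean, variance) summary without rescanning old rows is the parallel-variance merge of Chan et al.; here is a minimal sketch of that technique (illustrative only, not the PR's actual code):

import math
from typing import NamedTuple

class Summary(NamedTuple):
    num_points: int  # n
    mean: float
    m2: float        # sum of squared deviations from the mean

def merge(a: Summary, b: Summary) -> Summary:
    # combine two summaries as if computed over the union of their rows
    n = a.num_points + b.num_points
    if n == 0:
        return Summary(0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.num_points / n
    # the delta-squared cross term is where floating-point error can creep
    # into the derived sd values
    m2 = a.m2 + b.m2 + delta * delta * a.num_points * b.num_points / n
    return Summary(n, mean, m2)

def stdev(s: Summary) -> float:
    return math.sqrt(s.m2 / s.num_points) if s.num_points else 0.0

In exact arithmetic this matches a full recomputation, which is the "infinite precision" point above; in float64 the drift is normally tiny and shrinks relative to the totals as the series grows.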

Renamed 'Database.get_covidcast_meta()' to 'Database.compute_covidcast_meta()' for clarity.

@melange396 (Collaborator, author)

Oops, I somehow ran the wrong tests on this (copy-pasta failed me)... fixing.

@melange396 (Collaborator, author)

Should be good now, though I should add some new tests for this.

@krivard (Contributor) commented Jan 14, 2021

Is this more relevant to #368 or #289?

@melange396 (Collaborator, author)

Yes.

@melange396 (Collaborator, author)

But seriously, I'm not sure; it sorta fits both. ¯\_(ツ)_/¯

@krivard (Contributor) left a comment

Looks largely reasonable, pending clarification on what the default values of the table_name parameters should be. It would also be good to write up the testing and/or deployment procedure, i.e.:

  1. run {SQL}
  2. run {python}
  3. check {data}
  4. run {SQL}
  5. etc

@melange396 (Collaborator, author)

Deployment:

  • halt any new CSV ingestion ( src/acquisition/covidcast/csv_to_database.py:main() )
  • execute this SQL:
  CREATE TABLE `covidcast_meta_cache_test` (
    `timestamp` int(11) NOT NULL,
    `epidata` LONGTEXT NOT NULL,
    PRIMARY KEY (`timestamp`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
  INSERT INTO covidcast_meta_cache_test VALUES (0, '[]');
  • run the full metadata update process ( src/acquisition/covidcast/covidcast_meta_cache_updater.py:main() )
  • check that the tables covidcast_meta_cache[_test] have matching entries and include the field 'num_points' (see the comparison sketch after this list)
  • resume new data file ingestion
  • after some new data is loaded and before another full cache refresh runs, check that the tables no longer have matching entries (but both should still hold sane, not-too-different values)
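
For the comparison step, something like the following could work; this is a hedged sketch that assumes a MySQL connection is available and that each cache table holds a single row of JSON (an array of per-series entries) in its 'epidata' column. The connection details are placeholders, not taken from this PR.

import json
import mysql.connector  # assumes the same driver used elsewhere in acquisition

cnx = mysql.connector.connect(user='user', password='pass', database='epidata')
cur = cnx.cursor()

def load_cache(table_name):
    # newest (and, during this test, only meaningful) row of the cache table
    cur.execute(f"SELECT epidata FROM {table_name} ORDER BY timestamp DESC LIMIT 1")
    (blob,) = cur.fetchone()
    return json.loads(blob)

main_meta = load_cache("covidcast_meta_cache")
test_meta = load_cache("covidcast_meta_cache_test")

assert main_meta == test_meta, "cache tables diverge"
assert all('num_points' in row for row in test_meta), "missing 'num_points' field"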


# use the temp table to compute meta just for new data,
# then merge with the global meta
meta_update = self.compute_covidcast_meta(table_name=tmp_table_name)
Contributor

If I understand correctly, the (temp) metadata will be updated on every write to the database, is that right? How many times do we expect that to happen each day? (i.e., is it once for each signal/source update, or do those happen all at once?)

Would a possible alternative be to scan through the database once each day and update then?

Collaborator (author)

It's pretty much one update to the cache for each input file (CSV). I believe the importer runs periodically as a cron job (multiple times daily?) and loads all new files in the data dump directory; I also believe the files do not arrive on a strict schedule. A file is uniquely identified by (source, signal, time_type (day|week), geo_type (state|county|metro), time_value (YYYYMMDD), issue_value (YYYYMMDD)).

We could do it once a day, but that requires some logic to identify which subset of data should be separated out; doing it per file gives us the aggregation we need for free.

Contributor

csv_to_database.py runs at :50 after every hour, with a no-op exit if there are no files in receiving. So, up to 24x a day, but the true frequency depends on how many of those periods include data drops.

The number of files handled during each run varies from 0 to 53k. About half of the 800 runs in the last log file handled 0 files, and three quarters handled fewer than 1500, but there is a long tail where we'd be merging metadata 10k times in a single run.

It might be performant as-is, but we should probably think about whether there's a way to efficiently merge a whole pile of these.
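
Since the merge is associative, one option would be to fold all per-file summaries within a run first and only touch the global cache once; a speculative sketch, where merge_two stands in for whatever pairwise merge the cache update uses (not this PR's API):

from collections import defaultdict
from functools import reduce

def merge_run(per_file_metas, merge_two):
    """Fold many per-file metadata summaries into one combined summary, so a
    10k-file run performs a single cache merge instead of 10k.

    per_file_metas: iterable of dicts mapping a series key, e.g.
      (source, signal, time_type, geo_type), to its running summary.
    merge_two: associative function combining two summaries.
    """
    grouped = defaultdict(list)
    for meta in per_file_metas:
        for key, summary in meta.items():
            grouped[key].append(summary)
    return {key: reduce(merge_two, summaries)
            for key, summaries in grouped.items()}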

Contributor

@melange396 Makes sense that this is simpler.

@krivard Thanks; so this means we will need to monitor csv_to_database for increased runtime once this is enabled.

What do you think about just enabling it and monitoring for a day? We can scrape that data as we are doing for the metadata update. Probably best to do it on a Monday when we are all around.

Contributor

Good idea. There are probably two ways to do it: on staging (easy, but it won't see as many data files) or in production (we'd either need to merge and then back out a PR, or coordinate with @korlaxxalrok to switch prod to sync to a different branch and then switch back).
