incremental cache updating prototype #375

Open · wants to merge 11 commits into main

Conversation

@melange396 (Collaborator)

Re: issue #289

This is completely ignorant of re-issued data, so when a new issue of data arrives, the cache may be somewhat stale and/or inaccurate. The inaccuracies will probably be small as long as reissues do not differ wildly from the previous issue, and they will get washed out as data series continue to grow. A roughly weekly full repopulation of the cache should be run to clean up those inaccuracies.

This will break ingestion if new data is loaded before the cache has been fully populated with the existing data, because the code expects a new field in the cache: 'num_points'.

The code currently saves results to a 'test' cache table so they can be compared against the regular/main cache table. TODOs mark the lines that need to be reverted if/when testing is successful.

I have a feeling that the incremental changes could cause some loss of precision in the sd values, but I may just be paranoid; theoretically, the results would be identical if floats had infinite precision.
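
For reference, the standard way to fold a new batch into an existing (count, mean, variance) summary without rescanning old rows is the parallel-variance merge of Chan et al.; here is a minimal sketch of that technique (illustrative only, not the PR's actual code):

import math
from typing import NamedTuple

class Summary(NamedTuple):
    num_points: int  # n
    mean: float
    m2: float        # sum of squared deviations from the mean

def merge(a: Summary, b: Summary) -> Summary:
    # combine two summaries as if computed over the union of their rows
    n = a.num_points + b.num_points
    if n == 0:
        return Summary(0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.num_points / n
    # the delta-squared cross term is where floating-point error can creep
    # into the derived sd values
    m2 = a.m2 + b.m2 + delta * delta * a.num_points * b.num_points / n
    return Summary(n, mean, m2)

def stdev(s: Summary) -> float:
    return math.sqrt(s.m2 / s.num_points) if s.num_points else 0.0

In exact arithmetic this matches a full recomputation, which is the "infinite precision" point above; in float64 the drift is normally tiny and shrinks relative to the totals as the series grows.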

Renamed 'Database.get_covidcast_meta()' to 'Database.compute_covidcast_meta()' for clarity.

@melange396 (Collaborator, author)

Oops, I somehow ran the wrong tests on this (copy-pasta failed me)... fixing.

@melange396 (Collaborator, author)

Should be good now, though I should add some new tests for this.

@krivard (Contributor) commented Jan 14, 2021

Is this more relevant to #368 or #289?

@melange396 (Collaborator, author)

Yes.

@melange396 (Collaborator, author)

But seriously, I'm not sure; it sorta fits both. ¯\_(ツ)_/¯

@krivard (Contributor) left a comment

Looks largely reasonable, pending clarification on what the default values of the table_name parameters should be. It would also be good to write up the testing and/or deployment procedure, i.e.:

  1. run {SQL}
  2. run {python}
  3. check {data}
  4. run {SQL}
  5. etc

@melange396 (Collaborator, author)

Deployment:

  • halt any new CSV ingestion ( src/acquisition/covidcast/csv_to_database.py:main() )
  • execute this SQL:
  CREATE TABLE `covidcast_meta_cache_test` (
    `timestamp` int(11) NOT NULL,
    `epidata` LONGTEXT NOT NULL,
    PRIMARY KEY (`timestamp`)
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
  INSERT INTO covidcast_meta_cache_test VALUES (0, '[]');
  • run the full metadata update process ( src/acquisition/covidcast/covidcast_meta_cache_updater.py:main() )
  • check that the tables covidcast_meta_cache[_test] have matching entries and include the field 'num_points' (see the comparison sketch after this list)
  • resume new data file ingestion
  • after some new data is loaded and before another full cache refresh runs, check that the tables no longer have matching entries (but both should still hold sane, not-too-different values)
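
For the comparison step, something like the following could work; this is a hedged sketch that assumes a MySQL connection is available and that each cache table holds a single row of JSON (an array of per-series entries) in its 'epidata' column. The connection details are placeholders, not taken from this PR.

import json
import mysql.connector  # assumes the same driver used elsewhere in acquisition

cnx = mysql.connector.connect(user='user', password='pass', database='epidata')
cur = cnx.cursor()

def load_cache(table_name):
    # newest (and, during this test, only meaningful) row of the cache table
    cur.execute(f"SELECT epidata FROM {table_name} ORDER BY timestamp DESC LIMIT 1")
    (blob,) = cur.fetchone()
    return json.loads(blob)

main_meta = load_cache("covidcast_meta_cache")
test_meta = load_cache("covidcast_meta_cache_test")

assert main_meta == test_meta, "cache tables diverge"
assert all('num_points' in row for row in test_meta), "missing 'num_points' field"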


# use the temp table to compute meta just for new data,
# then merge with the global meta
meta_update = self.compute_covidcast_meta(table_name=tmp_table_name)
Contributor

If I understand correctly, the (temp) metadata will be updated on every write to the database, is that right? How many times do we expect that to happen each day? (i.e., is it once for each signal/source update, or do those happen all at once?)

Would a possible alternative be to scan through the database once each day and update then?

Collaborator (author)

It's pretty much one update to the cache for each input file (CSV). I believe the importer runs periodically as a cron job (multiple times daily?) and loads all new files in the data dump directory; I also believe the files do not arrive on a strict schedule. A file is uniquely identified by (source, signal, time_type (day|week), geo_type (state|county|metro), time_value (YYYYMMDD), issue_value (YYYYMMDD)).

We could do it once a day, but that requires some logic to identify which subset of data should be separated out; doing it per file gives us the aggregation we need for free.

Contributor

csv_to_database.py runs at :50 after every hour, with a no-op exit if there are no files in receiving. So, up to 24x a day, but the true frequency depends on how many of those periods include data drops.

The number of files handled during each run varies from 0 to 53k. About half of the 800 runs in the last log file handled 0 files, and three quarters handled fewer than 1500, but there is a long tail where we'd be merging metadata 10k times in a single run.

It might be performant as-is, but we should probably think about whether there's a way to efficiently merge a whole pile of these.
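
Since the merge is associative, one option would be to fold all per-file summaries within a run first and only touch the global cache once; a speculative sketch, where merge_two stands in for whatever pairwise merge the cache update uses (not this PR's API):

from collections import defaultdict
from functools import reduce

def merge_run(per_file_metas, merge_two):
    """Fold many per-file metadata summaries into one combined summary, so a
    10k-file run performs a single cache merge instead of 10k.

    per_file_metas: iterable of dicts mapping a series key, e.g.
      (source, signal, time_type, geo_type), to its running summary.
    merge_two: associative function combining two summaries.
    """
    grouped = defaultdict(list)
    for meta in per_file_metas:
        for key, summary in meta.items():
            grouped[key].append(summary)
    return {key: reduce(merge_two, summaries)
            for key, summaries in grouped.items()}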

Contributor

@melange396 Makes sense that this is simpler.

@krivard Thanks; so this means we will need to monitor csv_to_database for increased runtime once this is enabled.

What do you think about just enabling it and monitoring for a day? We can scrape that data as we are doing for the metadata update. Probably best to do it on a Monday when we are all around.

Contributor

Good idea. There are probably two ways to do it: on staging (easy, but it won't see as many data files) or in production (we'd either need to merge and then back out a PR, or coordinate with @korlaxxalrok to switch prod to sync to a different branch and then switch back).
