incremental cache updating prototype #375
base: main
Conversation
Re: issue #289.

This is completely ignorant of re-issued data, so when a new issue of data arrives, the cache may be somewhat stale and/or inaccurate. The inaccuracies will probably be relatively small if the reissues are not wildly different from the previous issue, and those differences will get washed out as the data series continue to grow in size. A roughly weekly full population of the cache should probably be done to clean up those inaccuracies.

This will break ingestion if new data is loaded before fully populating the cache with the existing data, because it expects a new field in the cache: 'num_points'.

The code is currently saving results to a 'test' cache table for comparison with the regular/main cache table. TODOs have been included for lines that need to be reverted if/when testing is successful.

I have a feeling that the incremental changes could result in some loss of precision for the sd values, but I may just be paranoid; theoretically, the results should be identical if floats had infinite precision.

Changed 'Database.get_covidcast_meta()' to 'Database.compute_covidcast_meta()' for clarity.
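To make the precision concern concrete, here is a minimal sketch (not code from this PR) of the kind of incremental merge described above, assuming the cache keeps a count ('num_points'), a mean, and a sum of squared deviations per series; the names Summary and merge_summaries are illustrative only:

```python
import math
from typing import NamedTuple

class Summary(NamedTuple):
    num_points: int  # number of observations folded in so far
    mean: float
    m2: float        # sum of squared deviations from the mean

def merge_summaries(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries (Chan et al. parallel update).

    With exact arithmetic this matches a full recomputation; with floats,
    the delta term can introduce tiny rounding differences in the stdev.
    """
    n = a.num_points + b.num_points
    if n == 0:
        return Summary(0, 0.0, 0.0)
    delta = b.mean - a.mean
    mean = a.mean + delta * b.num_points / n
    m2 = a.m2 + b.m2 + delta * delta * a.num_points * b.num_points / n
    return Summary(n, mean, m2)

def stdev(s: Summary) -> float:
    # population standard deviation; use (n - 1) instead if the cache uses sample sd
    return math.sqrt(s.m2 / s.num_points) if s.num_points else 0.0

# e.g. fold the summary computed from a newly ingested file into the cached one:
# cached = merge_summaries(cached, new_file_summary)
```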
Oops, I somehow ran the wrong tests on this (copy pasta failed me)... fixing.
Should be good now, though I should introduce some new tests for this.
Yes.
But seriously, I'm not sure. It sorta fits both. ¯\_(ツ)_/¯
Looks largely reasonable, pending clarification on what the default values of the table_name parameters should be. It would also be good to write up the testing and/or deployment procedure, i.e.:
- run {SQL}
- run {python}
- check {data}
- run {SQL}
- etc
Co-authored-by: Katie Mazaitis <[email protected]>
deployment:
# use the temp table to compute meta just for new data,
# then merge with the global meta
meta_update = self.compute_covidcast_meta(table_name=tmp_table_name)
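For context, a rough sketch of what the "merge with the global meta" step could look like, assuming both the cached rows and the meta_update rows are dicts keyed by (source, signal, time_type, geo_type) and carry the new num_points field; the helper name merge_meta_rows and the exact field names are assumptions, not the PR's actual schema:

```python
def merge_meta_rows(cached_rows, update_rows):
    # hypothetical field names (num_points, mean_value, min_value, max_value,
    # max_issue); the real cache schema may differ
    key = lambda r: (r['data_source'], r['signal'], r['time_type'], r['geo_type'])
    merged = {key(r): dict(r) for r in cached_rows}
    for row in update_rows:
        k = key(row)
        if k not in merged:  # brand-new series: take the fresh row as-is
            merged[k] = dict(row)
            continue
        old = merged[k]
        n_old, n_new = old['num_points'], row['num_points']
        n = n_old + n_new
        # num_points-weighted mean; stdev would combine via the pooled formula
        old['mean_value'] = (old['mean_value'] * n_old + row['mean_value'] * n_new) / n
        old['min_value'] = min(old['min_value'], row['min_value'])
        old['max_value'] = max(old['max_value'], row['max_value'])
        old['max_issue'] = max(old['max_issue'], row['max_issue'])
        old['num_points'] = n
    return list(merged.values())
```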
If I understand correctly, the (temp) metadata will be updated on every write to the database, is that right? How many times do we expect that to happen each day? (i.e., is it for each signal/source update, or do those all happen at once?)
Would a possible alternative be to scan through the database once each day and update then?
It's pretty much one update to the cache for each input file (csv). I believe the importer runs every so often as a cron job (multiple times daily?) and loads all new files in the data dump directory. I also believe the files do not arrive on a strict schedule. One file is unique in terms of (source, signal, time_type (day|week), geo_type (state|county|metro), time_value (yyyymmdd), issue_value (yyyymmdd)).
We could do it once a day, but that involves some logic to identify which subset of data should be separated out. Doing it per file gives us the aggregation we need for free.
csv_to_database.py runs at :50 after every hour, with a no-op exit if there are no files in receiving. So, up to 24x a day, but the true frequency depends on how many of those periods include data drops.
The number of files handled during each run varies from 0 to 53k. About half of the 800 runs in the last log file handled 0 files, and 3/4 handled under 1500 files, but there is that long tail where we'd be merging metadata 10k times in a single run.
It might be performant as-is, but we should probably think about whether there's a way to efficiently merge a whole pile of these.
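One hedged idea (not something this PR implements): because the merge is associative, a run could fold all of its per-file summaries in memory and touch the cache table once, rather than once per file. A toy sketch with a simplified (num_points, mean) summary:

```python
from functools import reduce

def combine(a, b):
    # a, b are (num_points, mean) pairs; real summaries would carry more fields
    n = a[0] + b[0]
    return (n, (a[0] * a[1] + b[0] * b[1]) / n)

def merge_run(per_file_summaries):
    # one in-memory fold per run, followed by a single write to the cache table
    return reduce(combine, per_file_summaries)

# merge_run([(10, 2.0), (5, 8.0)])  ->  (15, 4.0)
```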
@melange396 makes sense that this is simpler.
@krivard thanks, so this means we will need to monitor for increased runtime of csv_to_database once this is enabled.
wdyt about just enabling it and monitoring for a day? We can scrape that data as we are doing for the metadata update. Probably best to do it on Monday when we are all around.
Good idea. Probably two ways to do it: on staging (easy, but won't see as many data files) or in production (either need to merge & back out a PR, or coordinate with @korlaxxalrok to switch prod to sync to a different branch and then switch back).