covid_hosp_facility lookup speedup #1016


Merged: 10 commits merged into dev on Nov 17, 2022

Conversation

melange396 (Collaborator) commented Oct 31, 2022

still TODO:

  • test db migration routine(s) a bit more, and plan for execution on prod (i.e., plan the rollout)
    • the choice of migration doesn't matter too much, because values will be updated within a week (during the next data acquisition)
  • add a test that verifies any updates to facility metadata (like a change in street address) will be persisted and will replace the previous metadata (see the sketch after this list)
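A minimal sketch of that metadata-update test, assuming hypothetical helpers (make_facility, as_dataframe, lookup_facility) in place of the module's real fixtures, and assuming acquisition goes through the database's insert_dataset() entry point:

    from datetime import date

    def test_facility_metadata_update_is_persisted(database):
        # acquire a facility, then re-acquire it with a changed street address
        # (the pk and addresses here are illustrative only)
        facility = make_facility(hospital_pk='450822', address='100 Main St')
        database.insert_dataset(date(2022, 10, 1), as_dataframe([facility]))
        facility['address'] = '200 New Ave'
        database.insert_dataset(date(2022, 10, 8), as_dataframe([facility]))
        # the aggregate key table should now hold only the most recent metadata
        row = lookup_facility(database, hospital_pk='450822')
        assert row['address'] == '200 New Ave'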

@melange396 melange396 requested a review from krivard October 31, 2022 22:21
Comment on lines 175 to 199
if hasattr(self, 'AGGREGATE_KEY_COLS'):
    ak_cols = self.AGGREGATE_KEY_COLS

    # restrict data to just the key columns and remove duplicate rows
    ak_data = dataframe[ak_cols].drop_duplicates()
    # cast types (bools and NULLables)
    for col in ak_cols:
        if col.startswith('is_'):  # TODO: this is hacky af
            ak_data[col] = (ak_data[col] == 'true')
    ak_data = ak_data.to_numpy(na_value=None).tolist()

    # create string of tick-quoted and comma-separated column list
    ak_cols_str = ','.join([ '`'+col+'`' for col in ak_cols ])
    # ...and ticked and comma-sep'd "column=column" list for ON UPDATE (to keep only the most recent values for each pk)
    ak_updates_str = ','.join([ '`'+col+'`=`'+col+'`' for col in ak_cols ])
    # ...and string of VALUES placeholders
    values_str = ','.join( ['%s'] * len(ak_cols) )
    # use aggregate key table alias
    ak_table = self.table_name + '_key'
    # assemble full SQL statement
    ak_insert_sql = f'INSERT INTO `{ak_table}` ({ak_cols_str}) VALUES ({values_str}) ON DUPLICATE KEY UPDATE {ak_updates_str}'

    # commit the data
    with self.new_cursor() as cur:
        cur.executemany(ak_insert_sql, ak_data)
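For concreteness, tracing that string assembly with a hypothetical two-column aggregate key on the covid_hosp_facility table:

    # illustration only; the real AGGREGATE_KEY_COLS for facilities has more columns
    ak_cols = ['hospital_pk', 'is_metro_micro']
    ak_cols_str = ','.join([ '`'+col+'`' for col in ak_cols ])
    ak_updates_str = ','.join([ '`'+col+'`=`'+col+'`' for col in ak_cols ])
    values_str = ','.join( ['%s'] * len(ak_cols) )
    assert ak_cols_str == '`hospital_pk`,`is_metro_micro`'
    assert values_str == '%s,%s'
    # the assembled statement becomes:
    #   INSERT INTO `covid_hosp_facility_key` (`hospital_pk`,`is_metro_micro`)
    #   VALUES (%s,%s)
    #   ON DUPLICATE KEY UPDATE `hospital_pk`=`hospital_pk`,`is_metro_micro`=`is_metro_micro`
    # (note: in MySQL, `x`=`x` in the UPDATE clause keeps the row's existing
    # value; overwriting with the incoming value would be `x`=VALUES(`x`))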
krivard (Contributor):

Would it make more sense to put this in the facility subclass? That's the only one that needs this step.

melange396 (Collaborator, Author):

It could go either way... it's not specific to the facility data, but it is true that the facility schema is the only one we currently have that needs it. If we leave it here, some future data set from covid_hosp that has composite keys can use it without modification.

-- SELECT
-- `hospital_pk`,MAX(`address`),MAX(`ccn`),MAX(`city`),MAX(`fips_code`),MAX(`geocoded_hospital_address`),MAX(`hhs_ids`),MAX(`hospital_name`),MAX(`hospital_subtype`),MAX(`is_metro_micro`),MAX(`state`),MAX(`zip`)
-- FROM `covid_hosp_facility`
-- GROUP BY `hospital_pk`;
krivard (Contributor):

All of the above methods assume we will fix #1007 before migration -- is that strictly necessary, or could we effect a migration by creating the table and then re-importing the most recent facility update?

melange396 (Collaborator, Author):

We can fix before migration, or we can TRUNCATE and re-run this INSERT/SELECT as part of the fix (after the other table has been rectified). We could also remove bad rows from this table as part of the fix, but that requires identifying them, which may or may not be easy depending on how the fix is implemented.
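For reference, the INSERT/SELECT form of that backfill, assuming the column set and MAX() aggregation from the commented query above (a sketch, not the final migration script):

    # inside the Database class, so new_cursor() is available;
    # assumes the key table's column order matches the SELECT list
    backfill_sql = '''
        INSERT INTO `covid_hosp_facility_key`
        SELECT
          `hospital_pk`,MAX(`address`),MAX(`ccn`),MAX(`city`),MAX(`fips_code`),
          MAX(`geocoded_hospital_address`),MAX(`hhs_ids`),MAX(`hospital_name`),
          MAX(`hospital_subtype`),MAX(`is_metro_micro`),MAX(`state`),MAX(`zip`)
        FROM `covid_hosp_facility`
        GROUP BY `hospital_pk`
    '''
    with self.new_cursor() as cur:
        cur.execute(backfill_sql)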

q.fields = ", ".join(
    [  # NOTE: fields `geocoded_hospital_address` and `hhs_ids` are available but not being provided by this endpoint.
krivard (Contributor):

💯

Comment on lines 180 to 184
# cast types (bools and NULLables)
for col in ak_cols:
    if col.startswith('is_'):  # TODO: this is hacky af
        ak_data[col] = (ak_data[col] == 'true')
ak_data = ak_data.to_numpy(na_value=None).tolist()
krivard (Contributor):

Maybe less hacky -- we already know the Python dtype of each column, including the aggregate key columns:

Suggested change

Before:

    # cast types (bools and NULLables)
    for col in ak_cols:
        if col.startswith('is_'):  # TODO: this is hacky af
            ak_data[col] = (ak_data[col] == 'true')
    ak_data = ak_data.to_numpy(na_value=None).tolist()

After:

    dataframe_typemap = {
        name: dtype
        for name, _, dtype in dataframe_columns_and_types
    }
    # cast types (bools and NULLables)
    for col in ak_cols:
        ak_data[col] = dataframe_typemap[col](ak_data[col])

melange396 (Collaborator, Author):

lol, maybe I should have left my "hacky" comment out of this... I did it that way because I didn't want to have to resort to using dataframe_columns_and_types when the boolean column was the only one that needed massaging.

Your suggestion is pretty good, but it needs an apply() and will probably have a few other hoops to jump through for NaN handling. I'll do that and add another commit.
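A quick demonstration of why the apply() is needed: calling a dtype like str on a whole pandas Series stringifies the Series itself rather than each element.

    import pandas as pd

    s = pd.Series(['a', None])
    str(s)         # one big string: the repr of the whole Series
    s.apply(str)   # element-wise: Series(['a', 'None']) -- note the 'None' problem, too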

melange396 (Collaborator, Author):

A less hacky alternative has been pushed.

@melange396 melange396 marked this pull request as ready for review November 10, 2022 00:39
krivard previously approved these changes Nov 10, 2022.

krivard (Contributor) left a comment:

Looks good; optional refactor

Comment on lines 187 to 191
def cast_but_sidestep_nans(i):
    # not the prettiest, but it works to avoid the NaN values that show up in many columns
    if isinstance(i, float) and math.isnan(i):
        return None
    return dataframe_typemap[col](i)
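(For context: since the helper closes over col, it presumably lives inside the per-column cast loop and is applied element-wise, roughly like this.)

    for col in ak_cols:
        def cast_but_sidestep_nans(i):
            if isinstance(i, float) and math.isnan(i):
                return None
            return dataframe_typemap[col](i)
        ak_data[col] = ak_data[col].apply(cast_but_sidestep_nans)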
krivard (Contributor):

We also do that here:

if isinstance(row[name], float) and math.isnan(row[name]):
    values.append(None)
else:
    values.append(dtype(row[name]))

Should this code get pulled into a function that can be used in both places?
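A sketch of what that shared helper might look like (hypothetical name; both call sites would delegate to it):

    def nan_safe_cast(dtype, value):
        # treat pandas' float NaN as SQL NULL, otherwise apply the declared dtype
        if isinstance(value, float) and math.isnan(value):
            return None
        return dtype(value)

    # row-loop call site:      values.append(nan_safe_cast(dtype, row[name]))
    # aggregate-key call site: ak_data[col].apply(lambda i: nan_safe_cast(dataframe_typemap[col], i))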

krivard (Contributor):

Also/alternatively: this is a touch on the YAGNI side, since I can't imagine a scenario where a float would become part of an aggregate key. Maybe the check is unnecessary?

melange396 (Collaborator, Author):

I considered treading down the slippery slope of refactoring/repairing this (and other) code in this method and module, and ultimately decided it was better to keep my changes separate and distinct. I can do something to remove the duplication you pointed out, but it felt like touching any of that code would be like opening a can of worms... My previous "hacky af" handling that only really called out the boolean column(s) was an attempt to produce a smaller and easily understandable diff.

Unfortunately, if we are going to apply the types as defined here, this check is needed in some form or another. The dataframe passed as an argument to this method has a NaN wherever the input CSV has an empty field, no matter which column that field is from. If we apply str() (for instance) from dataframe_typemap to such a NaN, we get the string 'nan' when we should really have None or even just an empty string. I almost did something like:

for col in ak_cols:
    ak_data[col] = list(map(dataframe_typemap[col], ak_data[col].to_numpy(na_value=None)))

but that still needs a guard to prevent (for instance) the string 'None' from appearing in the database where we now have a NULL. Instead of inventing a new and somewhat overly complicated process, I cribbed from what already existed.

I don't love where this ended up, but there were downsides to each of the options: like the "hacky af" thing that I had before, what we have here in the most recent commit, the incomplete thing I just described in the snippet above, or (as you suggested) altering the existing code to factor out the duplicated functionality. Maybe I should have just done the latter, as it would've taken me a lot less time than typing all this up (lolz).
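A quick illustration of the 'nan'/'None' failure modes described above, in plain Python:

    # a blank CSV cell surfaces as a float NaN; casting it with the declared
    # dtype stringifies it instead of producing a NULL
    str(float('nan'))   # -> 'nan'

    # converting NaN to None first (via to_numpy(na_value=None)) just moves
    # the problem: the cast then yields the literal string 'None'
    str(None)           # -> 'None'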

krivard (Contributor):

So a string column with one blank row in the CSV gets converted into a mixed-type column in the pandas df, where some rows are strings and some rows are floats?

melange396 (Collaborator, Author):

Sorta, except they're all mixed-type columns (more precisely, dtype: object) between when the CSV is read and here 😬
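A minimal standalone demonstration of that behavior (illustrative data):

    import io
    import pandas as pd

    csv = 'hospital_pk,address\n450822,100 Main St\n450823,\n'
    df = pd.read_csv(io.StringIO(csv))

    print(df['address'].dtype)      # object
    print(df['address'].tolist())   # ['100 Main St', nan]
    print(type(df['address'][1]))   # <class 'float'> -- the blank cell is a NaN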

melange396 (Collaborator, Author) commented:

When do we want to deploy this? Running the migration by hand is easy and will not break anything if done before deployment, but if another data acquisition is performed between migration and deployment, the new secondary table may* end up out of sync. Depending on the content of the data in such an acquisition, it may not be out of sync by much, and in fact, I think it is likely that it will not be out of sync at all. (Related to this comment.)

* namely, with:
  • mixed-facility-metadata inputs
  • facility-metadata updates
krivard (Contributor) left a comment:

👍

krivard merged commit 5d4c535 into dev on Nov 17, 2022.
krivard deleted the covid_hosp_lookup_speedup branch on Nov 17, 2022 at 18:09.