covid_hosp_facility lookup speedup #1016


Merged: 10 commits merged into dev on Nov 17, 2022

Conversation

melange396 (Collaborator) commented Oct 31, 2022

still TODO:

  • test db migration routine(s) a bit more, and plan for execution on prod (i.e., plan the rollout)
    • the choice of migration doesn't matter too much, because values will be updated within a week (during the next data acquisition)
  • add a test that verifies any updates to facility metadata (like a change in street address) will be persisted and will replace the previous metadata (see the sketch after this list)
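A minimal sketch of that metadata-update test, assuming hypothetical helpers (make_facility, as_dataframe, lookup_facility) in place of the module's real fixtures, and assuming acquisition goes through the database's insert_dataset() entry point:

    from datetime import date

    def test_facility_metadata_update_is_persisted(database):
        # acquire a facility, then re-acquire it with a changed street address
        # (the pk and addresses here are illustrative only)
        facility = make_facility(hospital_pk='450822', address='100 Main St')
        database.insert_dataset(date(2022, 10, 1), as_dataframe([facility]))
        facility['address'] = '200 New Ave'
        database.insert_dataset(date(2022, 10, 8), as_dataframe([facility]))
        # the aggregate key table should now hold only the most recent metadata
        row = lookup_facility(database, hospital_pk='450822')
        assert row['address'] == '200 New Ave'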

@melange396 melange396 requested a review from krivard October 31, 2022 22:21
Comment on lines 175 to 199
if hasattr(self, 'AGGREGATE_KEY_COLS'):
    ak_cols = self.AGGREGATE_KEY_COLS

    # restrict data to just the key columns and remove duplicate rows
    ak_data = dataframe[ak_cols].drop_duplicates()
    # cast types (bools and NULLables)
    for col in ak_cols:
        if col.startswith('is_'):  # TODO: this is hacky af
            ak_data[col] = (ak_data[col] == 'true')
    ak_data = ak_data.to_numpy(na_value=None).tolist()

    # create string of tick-quoted and comma-separated column list
    ak_cols_str = ','.join([ '`'+col+'`' for col in ak_cols ])
    # ...and ticked and comma-sep'd "column=column" list for ON UPDATE (to keep only the most recent values for each pk)
    ak_updates_str = ','.join([ '`'+col+'`=`'+col+'`' for col in ak_cols ])
    # ...and string of VALUES placeholders
    values_str = ','.join( ['%s'] * len(ak_cols) )
    # use aggregate key table alias
    ak_table = self.table_name + '_key'
    # assemble full SQL statement
    ak_insert_sql = f'INSERT INTO `{ak_table}` ({ak_cols_str}) VALUES ({values_str}) ON DUPLICATE KEY UPDATE {ak_updates_str}'

    # commit the data
    with self.new_cursor() as cur:
        cur.executemany(ak_insert_sql, ak_data)
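For concreteness, tracing that string assembly with a hypothetical two-column aggregate key on the covid_hosp_facility table:

    # illustration only; the real AGGREGATE_KEY_COLS for facilities has more columns
    ak_cols = ['hospital_pk', 'is_metro_micro']
    ak_cols_str = ','.join([ '`'+col+'`' for col in ak_cols ])
    ak_updates_str = ','.join([ '`'+col+'`=`'+col+'`' for col in ak_cols ])
    values_str = ','.join( ['%s'] * len(ak_cols) )
    assert ak_cols_str == '`hospital_pk`,`is_metro_micro`'
    assert values_str == '%s,%s'
    # the assembled statement becomes:
    #   INSERT INTO `covid_hosp_facility_key` (`hospital_pk`,`is_metro_micro`)
    #   VALUES (%s,%s)
    #   ON DUPLICATE KEY UPDATE `hospital_pk`=`hospital_pk`,`is_metro_micro`=`is_metro_micro`
    # (note: in MySQL, `x`=`x` in the UPDATE clause keeps the row's existing
    # value; overwriting with the incoming value would be `x`=VALUES(`x`))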
krivard (Contributor):

Would it make more sense to put this in the facility subclass? That's the only one that needs this step.

melange396 (Collaborator, Author):

It could go either way... it's not specific to the facility data, but it is true that the facility schema is the only one we currently have that needs it. If we leave it here, some future data set from covid_hosp that has composite keys can use it without modification.

-- SELECT
-- `hospital_pk`,MAX(`address`),MAX(`ccn`),MAX(`city`),MAX(`fips_code`),MAX(`geocoded_hospital_address`),MAX(`hhs_ids`),MAX(`hospital_name`),MAX(`hospital_subtype`),MAX(`is_metro_micro`),MAX(`state`),MAX(`zip`)
-- FROM `covid_hosp_facility`
-- GROUP BY `hospital_pk`;
krivard (Contributor):

All of the above methods assume we will fix #1007 before migration -- is that strictly necessary, or could we effect a migration by creating the table and then re-importing the most recent facility update?

melange396 (Collaborator, Author):

We can fix before migration, or we can TRUNCATE and re-run this INSERT/SELECT as part of the fix (after the other table has been rectified). We could also remove bad rows from this table as part of the fix, but that requires identifying them, which may or may not be easy depending on how the fix is implemented.
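For reference, the INSERT/SELECT form of that backfill, assuming the column set and MAX() aggregation from the commented query above (a sketch, not the final migration script):

    # inside the Database class, so new_cursor() is available;
    # assumes the key table's column order matches the SELECT list
    backfill_sql = '''
        INSERT INTO `covid_hosp_facility_key`
        SELECT
          `hospital_pk`,MAX(`address`),MAX(`ccn`),MAX(`city`),MAX(`fips_code`),
          MAX(`geocoded_hospital_address`),MAX(`hhs_ids`),MAX(`hospital_name`),
          MAX(`hospital_subtype`),MAX(`is_metro_micro`),MAX(`state`),MAX(`zip`)
        FROM `covid_hosp_facility`
        GROUP BY `hospital_pk`
    '''
    with self.new_cursor() as cur:
        cur.execute(backfill_sql)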

q.fields = ", ".join(
    [  # NOTE: fields `geocoded_hospital_address` and `hhs_ids` are available but not being provided by this endpoint.
krivard (Contributor):

💯

Comment on lines 180 to 184
# cast types (bools and NULLables)
for col in ak_cols:
    if col.startswith('is_'):  # TODO: this is hacky af
        ak_data[col] = (ak_data[col] == 'true')
ak_data = ak_data.to_numpy(na_value=None).tolist()
krivard (Contributor):

Maybe less hacky -- we already know the Python dtype of each column, including the aggregate key columns:

Suggested change

Before:

    # cast types (bools and NULLables)
    for col in ak_cols:
        if col.startswith('is_'):  # TODO: this is hacky af
            ak_data[col] = (ak_data[col] == 'true')
    ak_data = ak_data.to_numpy(na_value=None).tolist()

After:

    dataframe_typemap = {
        name: dtype
        for name, _, dtype in dataframe_columns_and_types
    }
    # cast types (bools and NULLables)
    for col in ak_cols:
        ak_data[col] = dataframe_typemap[col](ak_data[col])

melange396 (Collaborator, Author):

lol, maybe I should have left my "hacky" comment out of this... I did it that way because I didn't want to have to resort to using dataframe_columns_and_types when the boolean column was the only one that needed massaging.

Your suggestion is pretty good, but it needs an apply() and will probably have a few other hoops to jump through for NaN handling. I'll do that and add another commit.
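A quick demonstration of why the apply() is needed: calling a dtype like str on a whole pandas Series stringifies the Series itself rather than each element.

    import pandas as pd

    s = pd.Series(['a', None])
    str(s)         # one big string: the repr of the whole Series
    s.apply(str)   # element-wise: Series(['a', 'None']) -- note the 'None' problem, too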

melange396 (Collaborator, Author):

A less hacky alternative has been pushed.

@melange396 melange396 marked this pull request as ready for review November 10, 2022 00:39
krivard previously approved these changes Nov 10, 2022.

krivard (Contributor) left a comment:

Looks good; optional refactor

Comment on lines 187 to 191
def cast_but_sidestep_nans(i):
    # not the prettiest, but it works to avoid the NaN values that show up in many columns
    if isinstance(i, float) and math.isnan(i):
        return None
    return dataframe_typemap[col](i)
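(For context: since the helper closes over col, it presumably lives inside the per-column cast loop and is applied element-wise, roughly like this.)

    for col in ak_cols:
        def cast_but_sidestep_nans(i):
            if isinstance(i, float) and math.isnan(i):
                return None
            return dataframe_typemap[col](i)
        ak_data[col] = ak_data[col].apply(cast_but_sidestep_nans)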
krivard (Contributor):

We also do that here:

if isinstance(row[name], float) and math.isnan(row[name]):
    values.append(None)
else:
    values.append(dtype(row[name]))

Should this code get pulled into a function that can be used in both places?
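A sketch of what that shared helper might look like (hypothetical name; both call sites would delegate to it):

    def nan_safe_cast(dtype, value):
        # treat pandas' float NaN as SQL NULL, otherwise apply the declared dtype
        if isinstance(value, float) and math.isnan(value):
            return None
        return dtype(value)

    # row-loop call site:      values.append(nan_safe_cast(dtype, row[name]))
    # aggregate-key call site: ak_data[col].apply(lambda i: nan_safe_cast(dataframe_typemap[col], i))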

krivard (Contributor):

Also/alternatively: this is a touch on the YAGNI side, since I can't imagine a scenario where a float would become part of an aggregate key. Maybe the check is unnecessary?

melange396 (Collaborator, Author):

I considered treading down the slippery slope of refactoring/repairing this (and other) code in this method and module, and ultimately decided it was better to keep my changes separate and distinct. I can do something to remove the duplication you pointed out, but it felt like touching any of that code would be like opening a can of worms... My previous "hacky af" handling that only really called out the boolean column(s) was an attempt to produce a smaller and easily understandable diff.

Unfortunately, if we are going to apply the types as defined here, this check is needed in some form or another. The dataframe passed as an argument to this method has a NaN wherever the input CSV has an empty field, no matter which column that field is from. If we apply str() (for instance) from dataframe_typemap to such a NaN, we get the string 'nan' when we should really have None or even just an empty string. I almost did something like:

for col in ak_cols:
    ak_data[col] = list(map(dataframe_typemap[col], ak_data[col].to_numpy(na_value=None)))

but that still needs a guard to prevent (for instance) the string 'None' from appearing in the database where we now have a NULL. Instead of inventing a new and somewhat overly complicated process, I cribbed from what already existed.

I don't love where this ended up, but there were downsides to each of the options: like the "hacky af" thing that I had before, what we have here in the most recent commit, the incomplete thing I just described in the snippet above, or (as you suggested) altering the existing code to factor out the duplicated functionality. Maybe I should have just done the latter, as it would've taken me a lot less time than typing all this up (lolz).
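A quick illustration of the 'nan'/'None' failure modes described above, in plain Python:

    # a blank CSV cell surfaces as a float NaN; casting it with the declared
    # dtype stringifies it instead of producing a NULL
    str(float('nan'))   # -> 'nan'

    # converting NaN to None first (via to_numpy(na_value=None)) just moves
    # the problem: the cast then yields the literal string 'None'
    str(None)           # -> 'None'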

krivard (Contributor):

So a string column with one blank row in the CSV gets converted into a mixed-type column in the pandas df, where some rows are strings and some rows are floats?

melange396 (Collaborator, Author):

Sorta, except they're all mixed-type columns (more precisely, dtype: object) between when the CSV is read and here 😬
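A minimal standalone demonstration of that behavior (illustrative data):

    import io
    import pandas as pd

    csv = 'hospital_pk,address\n450822,100 Main St\n450823,\n'
    df = pd.read_csv(io.StringIO(csv))

    print(df['address'].dtype)      # object
    print(df['address'].tolist())   # ['100 Main St', nan]
    print(type(df['address'][1]))   # <class 'float'> -- the blank cell is a NaN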

melange396 (Collaborator, Author) commented:

When do we want to deploy this? Running the migration by hand is easy and will not break anything if done before deployment, but if another data acquisition is performed between migration and deployment, the new secondary table may* end up out of sync. Depending on the content of the data in such an acquisition, it may not be out of sync by much, and in fact, I think it is likely that it will not be out of sync at all. (Related to this comment.)

* namely, with:
  • mixed-facility-metadata inputs
  • facility-metadata updates
krivard (Contributor) left a comment:

👍

krivard merged commit 5d4c535 into dev on Nov 17, 2022.
krivard deleted the covid_hosp_lookup_speedup branch on Nov 17, 2022 at 18:09.