covid_hosp_facility lookup speedup #1016
Conversation
```python
if hasattr(self, 'AGGREGATE_KEY_COLS'):
  ak_cols = self.AGGREGATE_KEY_COLS

  # restrict data to just the key columns and remove duplicate rows
  ak_data = dataframe[ak_cols].drop_duplicates()
  # cast types (bools and NULLables)
  for col in ak_cols:
    if col.startswith('is_'): # TODO: this is hacky af
      ak_data[col] = (ak_data[col] == 'true')
  ak_data = ak_data.to_numpy(na_value=None).tolist()

  # create string of tick-quoted and comma-separated column list
  ak_cols_str = ','.join([ '`'+col+'`' for col in ak_cols ])
  # ...and ticked and comma-sep'd "column=column" list for ON UPDATE (to keep only the most recent values for each pk)
  ak_updates_str = ','.join([ '`'+col+'`=`'+col+'`' for col in ak_cols ])
  # ...and string of VALUES placeholders
  values_str = ','.join( ['%s'] * len(ak_cols) )
  # use aggregate key table alias
  ak_table = self.table_name + '_key'
  # assemble full SQL statement
  ak_insert_sql = f'INSERT INTO `{ak_table}` ({ak_cols_str}) VALUES ({values_str}) ON DUPLICATE KEY UPDATE {ak_updates_str}'

  # commit the data
  with self.new_cursor() as cur:
    cur.executemany(ak_insert_sql, ak_data)
```
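For context, a minimal sketch of what the string assembly above produces. The column names are assumptions for illustration (borrowed from the facility schema quoted later in this thread), not the actual `AGGREGATE_KEY_COLS` definition:

```python
# hypothetical aggregate key columns, mirroring the assembly logic above
ak_cols = ['hospital_pk', 'state', 'is_metro_micro']
ak_table = 'covid_hosp_facility' + '_key'
ak_cols_str = ','.join([ '`'+col+'`' for col in ak_cols ])
ak_updates_str = ','.join([ '`'+col+'`=`'+col+'`' for col in ak_cols ])
values_str = ','.join( ['%s'] * len(ak_cols) )
print(f'INSERT INTO `{ak_table}` ({ak_cols_str}) VALUES ({values_str}) ON DUPLICATE KEY UPDATE {ak_updates_str}')
# INSERT INTO `covid_hosp_facility_key` (`hospital_pk`,`state`,`is_metro_micro`) VALUES (%s,%s,%s)
#   ON DUPLICATE KEY UPDATE `hospital_pk`=`hospital_pk`,`state`=`state`,`is_metro_micro`=`is_metro_micro`
```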
Would it make more sense to put this in the facility subclass? That's the only one that needs this step.
It could go either way... it's not specific to the facility data, but it is true that the facility schema is the only one we currently have that needs it. If we leave it here, some future data set from covid_hosp that has composite keys can use it without modification.
```sql
-- SELECT
--   `hospital_pk`,MAX(`address`),MAX(`ccn`),MAX(`city`),MAX(`fips_code`),MAX(`geocoded_hospital_address`),MAX(`hhs_ids`),MAX(`hospital_name`),MAX(`hospital_subtype`),MAX(`is_metro_micro`),MAX(`state`),MAX(`zip`)
-- FROM `covid_hosp_facility`
-- GROUP BY `hospital_pk`;
```
All of the above methods assume we will fix #1007 before migration -- is that strictly necessary, or could we effect a migration by creating the table and then re-importing the most recent facility update?
We can fix before migration, or we can TRUNCATE and re-run this INSERT/SELECT as part of the fix (after the other table has been rectified). We could also remove bad rows from this table as part of the fix, but that requires identifying them, which may or may not be easy depending on how the fix is implemented.
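A hedged sketch of that TRUNCATE-and-reload option, reusing the commented-out migration SELECT above (assumes the same `Database` context as the acquisition code, with the `covid_hosp_facility_key` table name inferred from `table_name + '_key'`):

```python
# sketch only: re-derive the aggregate key table from the (rectified) primary table
with self.new_cursor() as cur:
  cur.execute('TRUNCATE TABLE `covid_hosp_facility_key`')
  cur.execute('''
    INSERT INTO `covid_hosp_facility_key`
    SELECT
      `hospital_pk`,MAX(`address`),MAX(`ccn`),MAX(`city`),MAX(`fips_code`),
      MAX(`geocoded_hospital_address`),MAX(`hhs_ids`),MAX(`hospital_name`),
      MAX(`hospital_subtype`),MAX(`is_metro_micro`),MAX(`state`),MAX(`zip`)
    FROM `covid_hosp_facility`
    GROUP BY `hospital_pk`''')
```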
```python
q.fields = ", ".join(
  [
    # NOTE: fields `geocoded_hospital_address` and `hhs_ids` are available but not being provided by this endpoint.
```
💯
```python
# cast types (bools and NULLables)
for col in ak_cols:
  if col.startswith('is_'): # TODO: this is hacky af
    ak_data[col] = (ak_data[col] == 'true')
ak_data = ak_data.to_numpy(na_value=None).tolist()
```
Maybe less hacky -- we already know the Python dtype of each column, including the aggregate key columns:
Suggested change:

```diff
-# cast types (bools and NULLables)
-for col in ak_cols:
-  if col.startswith('is_'): # TODO: this is hacky af
-    ak_data[col] = (ak_data[col] == 'true')
-ak_data = ak_data.to_numpy(na_value=None).tolist()
+dataframe_typemap = {
+  name: dtype
+  for name, _, dtype in dataframe_columns_and_types
+}
+# cast types (bools and NULLables)
+for col in ak_cols:
+  ak_data[col] = dataframe_typemap[col](ak_data[col])
```
lol, maybe I should have left my "hacky" comment out of this... I did it that way because I didn't want to have to resort to using `dataframe_columns_and_types` when the boolean column was the only one that needed massaging.

Your suggestion is pretty good, but it needs an `apply()` and will probably have a few other hoops to jump through for NaN handling. I'll do that and add another commit.
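For concreteness, a sketch of that suggestion with the element-wise `apply()` it needs (an assumption about the shape of the fix, not the commit itself; NaN handling is still unaddressed here):

```python
# build a column-name -> python-type map from the table definition
dataframe_typemap = {
  name: dtype
  for name, _, dtype in dataframe_columns_and_types
}
# apply each column's python dtype element-wise; calling dtype() on the
# whole Series at once would not cast the individual values
for col in ak_cols:
  ak_data[col] = ak_data[col].apply(dataframe_typemap[col])
```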
The less-hacky alternative has been pushed.
Looks good; optional refactor:
```python
def cast_but_sidestep_nans(i):
  # not the prettiest, but it works to avoid the NaN values that show up in many columns
  if isinstance(i, float) and math.isnan(i):
    return None
  return dataframe_typemap[col](i)
```
We also do that here:

`delphi-epidata/src/acquisition/covid_hosp/common/database.py`, lines 166 to 169 in 50e40ba:

```python
if isinstance(row[name], float) and math.isnan(row[name]):
  values.append(None)
else:
  values.append(dtype(row[name]))
```
Should this code get pulled into a function that can be used in both places?
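A hedged sketch of what such a shared helper might look like (the name `cast_or_none` is hypothetical; it just merges the two NaN guards quoted in this thread):

```python
import math

def cast_or_none(dtype, value):
  # empty CSV fields surface as float NaN no matter the column's intended type;
  # map those to None (SQL NULL) instead of casting them
  if isinstance(value, float) and math.isnan(value):
    return None
  return dtype(value)

# usage at the two call sites would then look something like:
#   values.append(cast_or_none(dtype, row[name]))
#   ak_data[col] = ak_data[col].apply(lambda v: cast_or_none(dataframe_typemap[col], v))
```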
Also/alternatively: this is a touch on the YAGNI side, since I can't imagine a scenario where a float would become part of an aggregate key. Maybe the check is unnecessary?
I considered treading down the slippery slope of refactoring/repairing this (and other) code in this method and module, and ultimately decided it was better to keep my changes separate and distinct. I can do something to remove the duplication you pointed out, but it felt like touching any of that code would be like opening a can of worms... My previous "hacky af" handling, which only really called out the boolean column(s), was an attempt to produce a smaller and easily understandable diff.

Unfortunately, if we are going to apply the types as defined here, this check is needed in some form or another. The dataframe passed as an argument to this method has a NaN wherever the input CSV has an empty field, no matter which column that field is from. If we apply `str()` (for instance) from `dataframe_typemap` to such a NaN, we get the string `'nan'` when we should really have `None` or even just an empty string. I almost did something like:

```python
for col in ak_cols:
  ak_data[col] = list(map(dataframe_typemap[col], ak_data[col].to_numpy(na_value=None)))
```

but that still needs a guard to prevent (for instance) the string `'None'` from appearing in the database where we now have a `NULL`. Instead of inventing a new and somewhat overly complicated process, I cribbed from what already existed.

I don't love where this ended up, but there were downsides to each of the options: the "hacky af" thing I had before, what we have here in the most recent commit, the incomplete thing I just described in the snippet above, or (as you suggested) altering the existing code to factor out the duplicated functionality. Maybe I should have just done the latter, as it would've taken me a lot less time than typing all this up (lolz).
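A quick demonstration of the failure mode described above (illustrative only):

```python
# an empty CSV cell arrives as a float NaN, regardless of the column's intended type
empty_field = float('nan')
print(str(empty_field))  # -> 'nan'  (a literal string would land in the DB instead of NULL)
print(str(None))         # -> 'None' (why the map()/na_value=None variant still needs a guard)
```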
So a string column with one blank row in the CSV gets converted into a mixed-type column in the pandas df, where some rows are strings and some rows are floats?
Sorta, except they're all mixed-type columns (more precisely, `dtype: object`) between when the CSV is read and here 😬
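A small demonstration of that (made-up CSV content, just to illustrate):

```python
import io
import pandas as pd

df = pd.read_csv(io.StringIO('hospital_name,city\nA Hospital,Pittsburgh\nB Hospital,\n'))
print(df['city'].dtype)               # object
print([type(v) for v in df['city']])  # [<class 'str'>, <class 'float'>] -- the blank cell is a float NaN
```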
When do we want to deploy this? Running the migration by hand is easy and will not break anything if done before deployment, but if another data acquisition is performed between migration and deployment, the new secondary table may* end up out-of-sync. Depending on the content of the data in such an acquisition, it may not be out-of-sync by much, and in fact, I think it is likely that it will not be out-of-sync at all. (Related to this comment.)
* mixed-facility-metadata inputs
* facility-metadata updates
…lookup
Add covid_hosp_facility lookup integration tests
👍
still todo: