nssp pipeline code #1952
Conversation
Ran the pipeline and it seems to be pulling correctly, and the fields that are generated make sense. Generally looks good, I have some minor cosmetic suggestions.
I guess the archive differ is the only part that really remains?
Ran these as suggested and things run fine. The linter is a little angry either way.
```python
limit = 50000  # maximum limit allowed by SODA 2.0
while True:
    page = client.get("rdmq-nq56", limit=limit, offset=offset)
    if not page:
        break  # exit the loop if no more results
    results.extend(page)
    offset += limit
```
Suggested change:

```python
limit = 50_000
max_ii = 100
for ii in range(100):
    page = client.get("rdmq-nq56", limit=limit, offset=offset)
    if not page:
        max_ii = ii
        break  # exit the loop if no more results
    results.extend(page)
    offset += limit
if max_ii == 100:
    raise ValueError("client has pulled 100x the socrata limit")
```
This is probably fine, but `while True` freaks me out. Feel free to use it or not.
@dsweber2 why did you choose 100 here for the limit in your rewrite? I believe 50k is the per-page item limit, not the total limit. So theoretically we could get an infinite-page result.
100 was maybe too low, though it would correspond to 5,000,000 items. If we're pulling more than that it should be quite a while down the road, or something has gone wrong. (Or I may be misunderstanding how item counts work.)
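The bounded paging discussed above can also be written with a `for`/`else`, so the cap check cannot be forgotten: the `else` branch only runs if the loop exhausts without ever seeing an empty page. This is a minimal sketch, not the PR's code; `FakeClient` is a hypothetical stand-in for the Socrata client.

```python
class FakeClient:
    """Hypothetical stand-in for sodapy's client: serves n_pages full pages, then an empty one."""

    def __init__(self, n_pages=3):
        self.n_pages = n_pages

    def get(self, dataset_id, limit, offset):
        if offset // limit < self.n_pages:
            return [{"row": offset + i} for i in range(limit)]
        return []


def pull_all(client, dataset_id, limit=50_000, max_pages=100):
    """Page through a dataset, failing loudly if max_pages is not enough."""
    results = []
    offset = 0
    for _ in range(max_pages):
        page = client.get(dataset_id, limit=limit, offset=offset)
        if not page:
            break  # no more results
        results.extend(page)
        offset += limit
    else:
        # loop exhausted without an empty page: the dataset is larger than expected
        raise ValueError(f"pulled {max_pages} pages without exhausting the dataset")
    return results


rows = pull_all(FakeClient(n_pages=3), "rdmq-nq56", limit=10)
print(len(rows))  # 30
```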
Co-authored-by: David Weber <[email protected]>
This file seems to be unused: nssp/tests/test_data/page.txt
Some nits and style suggestions. Question about which geos we're reporting.
Co-authored-by: nmdefries <[email protected]>
… available sources and signals"
Nssp staging
Oh, good! I thought I remembered having a discussion about that but wasn't sure if it was for this or something else.
I'm gonna do a deeper look at the core code, but I did a first pass to see if I could narrow down which files I needed to inspect more closely; I knocked off 29/46 files that way!
There are a few things that seem to be out of scope for this PR, like all the `max-line-length=120` in the pylint configs from other indicators... Are they going to break the lint steps in other indicators?
```r
local({

  # the requested version of renv
  version <- "1.0.7"
```
This file seems to be imported from some external source. Will we need to do anything to keep it synced/up-to-date? I presume (alongside `notebooks/renv.lock`) it should continue to "just work" until we need it to do something it can't already do?
I'm not 100% sure this is the right place to keep these notebooks to be honest, but I feel like the current set of notes about adding indicators are way too scattered. None of this folder is really intended to be automated.
`renv/` and `renv.lock` are maintained via a CLI in R, and allow for pinning versions of dependencies (in this case, of the notebooks).
Statistical review from Roni and followup on slack: I just finished reviewing this. Overall it looks great -- thank you all for your work on this! I do have some comments/answers/questions:
Was this resolved? If so, how?
Was this resolved? If so, how?
This is a known problem but one we cannot solve for now. I am fairly sure that CDC would want us to maintain all the values, even if they are e.g. 100% or 50%. But we should alert to this in our documentation.

RE "Covid Lag analysis", I think you have the wrong data viz here (it's a copy of the one above it, but should be correlation vs. lag). Not crucial for releasing the signal, but we should fix it wherever you end up storing the analysis. Speaking of which: where do we usually store these analyses for new signals? Are they linked to from the documentation?

As you showed, there is a good number of discrepancies between original state-level and state-level reconstructed from lower geo-level. Did you figure out the full reason(s) for that? Are these possibly rounding errors? Something else? This too should be discussed in the documentation.

That's all I could find. To the substance of the signal: the correlations look awesome, esp. for flu -- this was far from assured! ED and hosp admissions are very different things, and come from very different reporting systems. But this bodes extremely well for estimating hosp trends without hosp reporting.
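On the rounding-error question: one candidate mechanism (an assumption for illustration, not a confirmed cause, with made-up counts) is that county-level percentages are rounded before aggregation, so a state value reconstructed from them drifts from the state value computed on raw counts.

```python
# Made-up county counts; the mechanism, not the values, is the point.
county_visits = [101, 250, 649]  # total ED visits per county
county_flu = [7, 13, 41]         # flu-related ED visits per county

# True state percentage, computed directly from raw counts.
true_state_pct = 100 * sum(county_flu) / sum(county_visits)

# County percentages rounded to one decimal, as published values often are.
rounded_pct = [round(100 * f / v, 1) for f, v in zip(county_flu, county_visits)]

# Reconstruct the state value as a visit-weighted mean of rounded county values.
reconstructed = sum(p * v for p, v in zip(rounded_pct, county_visits)) / sum(county_visits)

# The gap is nonzero purely because of county-level rounding.
gap = abs(true_state_pct - reconstructed)
print(round(gap, 3))  # 0.014
```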
Looks good
```
@@ -0,0 +1,257 @@
---
title: "Correlation Analyses for COVID-19 Indicators"
```
note: I am not checking this or other notebooks
```python
from .pull import pull_nssp_data


def add_needed_columns(df, col_names=None):
```
suggestion (optional): to make this more robust, add an assert to make sure that our set of missing column names doesn't include important ones (like `geo_id` and `value`).
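A sketch of the suggested guard; the function body and the default column list are assumptions for illustration, not the PR's actual implementation (only the `add_needed_columns(df, col_names=None)` signature appears in the diff above).

```python
import pandas as pd

# Columns that must come from the data pull itself and can never be padded.
ESSENTIAL = {"geo_id", "value"}


def add_needed_columns(df, col_names=None):
    """Pad df with any missing output columns, filled with NA."""
    if col_names is None:
        # assumed output schema, purely for illustration
        col_names = ["geo_id", "value", "se", "sample_size"]
    missing = set(col_names) - set(df.columns)
    # the suggested assert: never silently pad an essential column
    assert not (missing & ESSENTIAL), f"essential columns missing: {missing & ESSENTIAL}"
    for col in missing:
        df[col] = pd.NA
    return df


df = add_needed_columns(pd.DataFrame({"geo_id": ["ca"], "value": [1.0]}))
print(sorted(df.columns))  # ['geo_id', 'sample_size', 'se', 'value']
```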
Next steps:

* add `make format` to the nssp Makefile
* run `make format` on nssp
From Roni: let's rename the signals by prefixing "visits" with "ed", e.g. This is ready to go whenever that is finished.
@nmdefries Cool. I'll get that going.
When merging this, please do not forget to do a squash.
Description
Add 8 signals from source `nssp`.