nssp pipeline code #1952


Merged: 71 commits into main on Jun 10, 2024

Conversation

minhkhul (Contributor)

Description

Add 8 signals from source nssp

@minhkhul requested review from dsweber2 and nmdefries on April 17, 2024

@dsweber2 (Contributor) left a comment:

Ran the pipeline and it seems to be pulling correctly, and the fields that are generated make sense. Generally it looks good; I have some minor cosmetic suggestions.

I guess the archive differ is the only part that really remains?

Contributor

Ran these as it suggests and things run fine. The linter is a little angry either way:

limit = 50000  # maximum limit allowed by SODA 2.0
while True:
    page = client.get("rdmq-nq56", limit=limit, offset=offset)
    if not page:
        break  # exit the loop if no more results
    results.extend(page)
    offset += limit

Contributor

Suggested change:
limit = 50000  # maximum limit allowed by SODA 2.0
while True:
    page = client.get("rdmq-nq56", limit=limit, offset=offset)
    if not page:
        break  # exit the loop if no more results
    results.extend(page)
    offset += limit
limit = 50_000
max_ii = 100  # sentinel: stays 100 if the loop never breaks
for ii in range(100):
    page = client.get("rdmq-nq56", limit=limit, offset=offset)
    if not page:
        max_ii = ii
        break  # exit the loop if no more results
    results.extend(page)
    offset += limit
if max_ii == 100:
    raise ValueError("client has pulled 100x the socrata limit")

This is probably fine, but while True freaks me out. Feel free to use it or not.

Contributor

@dsweber2 why did you choose 100 here for the limit in your rewrite? I believe 50k is the per-page item limit, not the total limit. So theoretically a result could span arbitrarily many pages.

Contributor

100 was maybe too low, though it would correspond to 5,000,000 items. If we're pulling more than that, it should be quite a while down the road, or something has gone wrong (or I may be misunderstanding how item counts work).
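
For reference, a more idiomatic sketch of the bounded loop (assuming the same client and dataset id as in the snippets above) uses Python's for/else, so the error fires only if all 100 pages are consumed without ever hitting an empty page:

results, offset, limit = [], 0, 50_000
for _ in range(100):  # cap at 100 pages of 50,000 rows = 5,000,000 items
    page = client.get("rdmq-nq56", limit=limit, offset=offset)
    if not page:
        break  # no more results
    results.extend(page)
    offset += limit
else:
    # all 100 iterations ran without ever seeing an empty page
    raise ValueError("client has pulled 100x the socrata limit")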

minhkhul and others added 3 commits on April 19, 2024 (Co-authored-by: David Weber <[email protected]>)

@nmdefries (Contributor) left a comment:

This file seems to be unused: nssp/tests/test_data/page.txt

Some nits and style suggestions, plus a question about which geos we're reporting.

@minhkhul requested a review from nmdefries on April 25, 2024

@melange396 (Contributor)

Oh, good! I thought I remembered having a discussion about that but wasn't sure if it was for this or something else.

@melange396 (Contributor) left a comment:

I'm gonna do a deeper look at the core code, but I did a first pass to see if I could narrow down which files I needed to inspect more closely; I knocked off 29/46 files that way!

There are a few things that seem to be out of scope for this PR, like all the max-line-length=120 settings in the pylint configs of the other indicators... Are they going to break the lint steps in other indicators?

local({

  # the requested version of renv
  version <- "1.0.7"

Contributor

This file seems to be imported from some external source. Will we need to do anything to keep it synced/up-to-date? I presume (alongside notebooks/renv.lock) it should continue to "just work" until we need it to do something it can't already do?

Contributor

I'm not 100% sure this is the right place to keep these notebooks, to be honest, but I feel like the current set of notes about adding indicators is way too scattered. None of this folder is really intended to be automated.

renv/ and renv.lock are maintained via a CLI in R, and allow for pinning versions of dependencies (in this case, the notebooks' dependencies).

@nmdefries (Contributor) commented Jun 4, 2024:

Statistical review from Roni and follow-up on Slack:

I just finished reviewing this. Overall it looks great -- thank you all for your work on this! I do have some comments/answers/questions:

Nat: Looks like there is data that is not getting included in the aggregation step -- we'll want to think about what to do with that.

Was this resolved? If so, how?

Nat: What is the meaning of a missing value in the incoming data? Censored for privacy? 0? Too small a sample size to accurately report?

Was this resolved? If so, how?

Nat: Think about whether we should do any filtering/cleaning, e.g. low sample size in covid tests causing high variability in test positivity rate.
Minh: There are some outliers in low-population counties, creating stats like 50-100% of ER visits being covid/flu related in those places. There are around 10 of these outliers spread among the signals. This makes sense because the signals are percentages: if county A has only one ER visit this week and that visit is covid-related, then the data shows 100% of all ER visits in county A as covid-related.

This is a known problem but one we cannot solve for now. I am fairly sure that CDC would want us to keep all the values, even extreme ones like 100% or 50%. But we should flag this in our documentation.

RE “Covid Lag analysis”, I think you have the wrong data viz here (it’s a copy of the one above it, but should be correlation vs. lag). Not crucial for releasing the signal, but we should fix it wherever you end up storing the analysis. Speaking of which: where do we usually store these analyses for new signals? Are they linked to from the documentation?

As you showed, there are a good number of discrepancies between the original state-level values and the state-level values reconstructed from lower geo levels. Did you figure out the full reason(s) for that? Are these possibly rounding errors? Something else? This too should be discussed in the documentation.

That's all I could find. To the substance of the signal: the correlations look awesome, esp. for flu -- this was far from assured! ED and hosp admissions are very different things, and come from very different reporting systems. But this bodes extremely well for estimating hosp trends without hosp reporting.

@nmdefries (Contributor) left a comment:

Looks good

@@ -0,0 +1,257 @@
---
title: "Correlation Analyses for COVID-19 Indicators"

Contributor

note: I am not checking this or other notebook items

from .pull import pull_nssp_data


def add_needed_columns(df, col_names=None):

@nmdefries (Contributor) commented Jun 4, 2024:

suggestion (optional): to make this more robust, add an assert to make sure that our set of missing column names doesn't include important ones (like geo_id and value).
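
A minimal sketch of that suggestion (the essential-column set and the default column list here are illustrative, not the indicator's actual ones):

ESSENTIAL_COLUMNS = {"geo_id", "value"}  # must already be present in the data

def add_needed_columns(df, col_names=None):
    """Add standard output columns that the source data does not provide."""
    if col_names is None:
        col_names = ["se", "sample_size"]  # illustrative defaults
    overlap = ESSENTIAL_COLUMNS.intersection(col_names)
    # refuse to fabricate essential columns like geo_id and value
    assert not overlap, f"refusing to fill essential columns: {overlap}"
    for col in col_names:
        df[col] = None
    return df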

@nmdefries requested a review from melange396 on June 5, 2024

@nmdefries (Contributor)

Next steps:

  • check that this runs and output looks correct on staging
  • loop in Brian to get this set up in production
  • I will review the docs

@nmdefries (Contributor)

From Roni, let's rename the signals by prefixing "visits" with "ed", e.g. pct_visits_covid to pct_ed_visits_covid.
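
A minimal sketch of the rename (the helper name is made up; only the example signal comes from the source):

def rename_signal(name: str) -> str:
    """Prefix "visits" with "ed", e.g. pct_visits_covid -> pct_ed_visits_covid."""
    return name.replace("pct_visits_", "pct_ed_visits_")

assert rename_signal("pct_visits_covid") == "pct_ed_visits_covid"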

This is ready to go whenever that is finished.

@minhkhul (Contributor, Author)

@nmdefries cool. I'll get that going.

@melange396 (Contributor)

When merging this, please do not forget to do a squash.

@minhkhul merged commit ae6f011 into main on Jun 10, 2024
16 checks passed