nssp pipeline code #1952 (Merged)
Changes from all commits (71 commits):
- d4ca5ca to make nssp run in staging (minhkhul)
- 11ff7d0 add nssp to Jenkinsfile (minhkhul)
- d76d6ce nssp_token name change (minhkhul)
- c85c5dd et code (minhkhul)
- 3014997 Update nssp/delphi_nssp/run.py (minhkhul)
- 4a90591 Update nssp/README.md (minhkhul)
- 638c51d Update nssp/DETAILS.md (minhkhul)
- 82367b4 Update nssp/delphi_nssp/__main__.py (minhkhul)
- 68a6154 Update nssp/delphi_nssp/pull.py (minhkhul)
- 8851bd6 Update nssp/delphi_nssp/run.py (minhkhul)
- 39e2cbd readme update (minhkhul)
- 95bdac1 column names mapping + signals name standardization to fit with other… (minhkhul)
- 583e24e improve readability (minhkhul)
- 971968e Add type_dict constant (minhkhul)
- de9ef62 more type_dict (minhkhul)
- 900fcc9 add more unit test pull (minhkhul)
- 0678564 data for unit test of pull (minhkhul)
- 38cd523 hrr + msa geos (minhkhul)
- 5807cdb use enumerate for clarity (minhkhul)
- b4ec831 Merge pull request #1950 from cmu-delphi/nssp_staging (minhkhul)
- e974afb set nssp sircal max_age to 13 days (minhkhul)
- 85e7b8b set nssp sircal max_age to 15 days, to account for nighttime run (minhkhul)
- 2247e1b set nssp sircal max_age to 15 days, to account for nighttime run (minhkhul)
- 2bfd5fc add validation to params (minhkhul)
- c65796f Update nssp/DETAILS.md (minhkhul)
- a7a869d Update nssp/delphi_nssp/constants.py (minhkhul)
- c074a45 et code (minhkhul)
- 1be5b28 Update nssp/delphi_nssp/run.py (minhkhul)
- bd5c782 Update nssp/README.md (minhkhul)
- 86acc03 Update nssp/DETAILS.md (minhkhul)
- 9a53923 Update nssp/delphi_nssp/__main__.py (minhkhul)
- 54094eb Update nssp/delphi_nssp/pull.py (minhkhul)
- 1552504 Update nssp/delphi_nssp/run.py (minhkhul)
- 6ccddfc readme update (minhkhul)
- e560393 column names mapping + signals name standardization to fit with other… (minhkhul)
- 309b6c7 improve readability (minhkhul)
- 813d289 Add type_dict constant (minhkhul)
- db1f8ae more type_dict (minhkhul)
- 2a357c8 add more unit test pull (minhkhul)
- d968789 data for unit test of pull (minhkhul)
- 24a638d hrr + msa geos (minhkhul)
- b11c528 use enumerate for clarity (minhkhul)
- 8169655 to make nssp run in staging (minhkhul)
- fde9264 add nssp to Jenkinsfile (minhkhul)
- bd545c8 nssp_token name change (minhkhul)
- 7a3807e set nssp sircal max_age to 15 days, to account for nighttime run (minhkhul)
- 3623b5f set nssp sircal max_age to 13 days (minhkhul)
- daee033 add validation to params (minhkhul)
- 566a826 Update nssp/DETAILS.md (minhkhul)
- cfa1b94 Update nssp/delphi_nssp/constants.py (minhkhul)
- 425b1fe nssp correlation rmd and general notebook folder (dsweber2)
- 5ce26f0 making Black happy (dsweber2)
- 9b87133 update to new geomapper function (dsweber2)
- d53ff83 following 120 line convention everywhere (dsweber2)
- 8eb6055 happy linter (dsweber2)
- e32cc55 happy black formatter in nssp (dsweber2)
- adf5df4 drop unneeded nssp tests (dsweber2)
- 3559664 updates borked old tests, caught by @dshemetov (dsweber2)
- 67601aa rebase woes and version consistency (dsweber2)
- a79cff8 Update nssp-params-prod.json.j2 min/max lag to 13 (minhkhul)
- 9c6f31b Update params.json.template min/max lag to 7 and 13 (minhkhul)
- 33a188e missed column renames for geo_mapper, unneeded index (dsweber2)
- 8a4cd18 Merge branch 'main' into nssp (dsweber2)
- eb2f000 Merge branch 'main' into nssp (dshemetov)
- 24b25dd lint+fix: update from linter changes (dshemetov)
- 8daefe6 ci: update ci to lint nssp (dshemetov)
- ec39773 lint: linter happy (dshemetov)
- 2e178a8 lint: pydocstyle happy (dshemetov)
- 90081df lint: pydocstyle happy (dshemetov)
- 91b759c Resolved merge conflicts by accepting all incoming changes (minhkhul)
- 355d65b pct_visits to pct_ed_visits (minhkhul)
Files changed:
@@ -0,0 +1,29 @@

```json
{
    "common": {
        "export_dir": "/common/covidcast/receiving/nssp",
        "log_filename": "/var/log/indicators/nssp.log",
        "log_exceptions": false
    },
    "indicator": {
        "wip_signal": true,
        "static_file_dir": "./static",
        "socrata_token": "{{ nssp_token }}"
    },
    "validation": {
        "common": {
            "data_source": "nssp",
            "api_credentials": "{{ validation_api_key }}",
            "span_length": 15,
            "min_expected_lag": {"all": "7"},
            "max_expected_lag": {"all": "13"},
            "dry_run": true,
            "suppressed_errors": []
        },
        "static": {
            "minimum_sample_size": 0,
            "missing_se_allowed": true,
            "missing_sample_size_allowed": true
        },
        "dynamic": {}
    }
}
```
@@ -50,6 +50,10 @@

```json
    "hhs": {
        "max_age": 15,
        "maintainers": []
    },
    "nssp": {
        "max_age": 13,
        "maintainers": []
    }
}
```
@@ -0,0 +1 @@

```r
source("renv/activate.R")
```
@@ -0,0 +1,257 @@
---
title: "Correlation Analyses for COVID-19 Indicators"
author: "Delphi Group"
date: "`r format(Sys.time(), '%B %d, %Y')`"
output:
  html_document:
    code_folding: hide
---

```{r, include = FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE, fig.width = 8,
                      fig.height = 7)
```
### Getting data

This requires that you've already run the nssp pipeline; see the `nssp` directory for instructions. First, we load some libraries and read the pipeline's CSV output:
```{r}
library(covidcast)
library(epidatr)
library(epiprocess) # for as_epi_df()
library(tidyverse)  # dplyr, purrr, readr, tidyr, ggplot2

# Read one exported CSV. The filename encodes the epiweek, geo level, and
# signal name, so we recover those from the path and attach them as columns.
read_row <- function(filename) {
  split_name <- filename %>%
    tools::file_path_sans_ext() %>%
    strsplit("/") %>% `[[`(1) %>% tail(n = 1) %>%
    strsplit("_") %>% `[[`(1)
  week_number <- split_name[[2]]
  geo_type <- split_name[[3]]
  col_name <- split_name[-(1:3)] %>% paste(collapse = "_")
  read_csv(filename, show_col_types = FALSE) %>%
    as_tibble() %>%
    mutate(signal = col_name,
           geo_type = geo_type,
           week_number = week_number) %>%
    mutate(across(geo_id, factor)) %>%
    rename(geo_value = geo_id, time_value = week_number) %>%
    select(-missing_se, -se, -sample_size, -missing_sample_size)
}

files <- list.files(here::here("nssp/receiving"), pattern = "\\.csv$", full.names = TRUE)
nssp_data <- map(files, read_row) %>% bind_rows()
nssp_state <- nssp_data %>%
  filter(geo_type == "state") %>%
  mutate(time_value = epidatr:::parse_api_week(time_value)) %>%
  as_epi_df(time_type = "week", geo_type = "state") %>%
  select(-missing_val, -geo_type)
unique(nssp_data$time_value)
```
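For reference, `read_row` assumes export filenames of the form `weekly_<epiweek>_<geotype>_<signal>.csv`. A quick sanity check of that parsing on a hypothetical name (illustrative only, not taken from the diff):

```r
# Hypothetical export filename following the pattern read_row expects.
fname <- "weekly_202413_state_pct_ed_visits_covid.csv"
parts <- strsplit(tools::file_path_sans_ext(fname), "_")[[1]]
parts[2]                              # "202413": epiweek, becomes time_value
parts[3]                              # "state": geo_type
paste(parts[-(1:3)], collapse = "_")  # "pct_ed_visits_covid": signal name
```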
And, for comparison, the corresponding hhs signals fetched via epidatr:
```{r}
eval_time <- epidatr::epirange(from = "2020-01-01", to = Sys.Date())
fetch_args <- epidatr::fetch_args_list(return_empty = TRUE, timeout_seconds = 300)

flu_hhs <- epidatr::pub_covidcast(
  source = "hhs",
  signals = "confirmed_admissions_influenza_1d_prop_7dav",
  geo_type = "state",
  time_type = "day",
  geo_values = "*",
  time_values = eval_time,
  fetch_args = fetch_args
) %>%
  select(-signal, -source, -time_type)

covid_hhs <- epidatr::pub_covidcast(
  source = "hhs",
  signals = "confirmed_admissions_covid_1d_prop_7dav",
  geo_type = "state",
  time_type = "day",
  geo_values = "*",
  time_values = eval_time,
  fetch_args = fetch_args
) %>%
  select(-signal, -source, -time_type)

nchs <- epidatr::pub_covidcast(
  source = "nchs-mortality",
  signals = "deaths_allcause_incidence_num",
  geo_type = "state",
  time_type = "week",
  geo_values = "*",
  time_values = epidatr::epirange(from = "202001", to = "202418"),
  fetch_args = fetch_args
)
```
# Flu

```{r}
nssp_flu_state <- nssp_state %>%
  filter(signal == "pct_ed_visits_influenza") %>%
  select(-signal) %>%
  drop_na() %>%
  rename(pct_flu_visits = val) %>%
  as_epi_df(time_type = "week", geo_type = "state")
week_starts <- nssp_flu_state$time_value %>% unique()
flu_hhs_weekly <- flu_hhs %>%
  select(geo_value, time_value, value) %>%
  filter(time_value %in% week_starts) %>%
  rename(conf_admission = value) %>%
  drop_na() %>%
  as_epi_df(time_type = "week", geo_type = "state")
joined <- nssp_flu_state %>% left_join(flu_hhs_weekly)
```
After the necessary joining, let's look at the overall correlation:
```{r}
cor(joined$pct_flu_visits, joined$conf_admission, method = "spearman")
```

So the overall correlation is pretty high.
## Correlations sliced by state

```{r}
correlations_space_flu <- epi_cor(joined, pct_flu_visits, conf_admission,
                                  cor_by = "geo_value", use = "complete.obs",
                                  method = "spearman")
library(maps)    # for state map polygons
library(viridis) # continuous color scale
states_map <- map_data("state")
mapped <- states_map %>%
  as_tibble() %>%
  mutate(geo_value = setNames(tolower(state.abb), tolower(state.name))[region]) %>%
  right_join(correlations_space_flu) %>%
  arrange(group, order)
ggplot(mapped, aes(x = long, y = lat, group = group, fill = cor)) +
  geom_polygon(colour = "black") +
  scale_fill_viridis(discrete = FALSE, option = "viridis", limits = c(0, 1)) +
  coord_map("polyconic") +
  labs(title = "Spearman correlations between flu ER visits and flu hospital admissions")
ggsave("flu_ER_admissions_state_correlations.pdf")
```
Over space, hospital admissions look highly correlated with ER visits, which makes sense: admissions frequently come in through the ER. Summarizing the per-state correlations:

```{r}
correlations_space_flu %>%
  summarize(across(where(is.numeric),
                   .fns = list(min = min, median = median, mean = mean, std = sd,
                               q25 = ~ quantile(., 0.25), q75 = ~ quantile(., 0.75),
                               max = max)))
```
### Lag evaluation

```{r}
lags <- 0:35

lagged_flu_state <- map_dfr(lags, function(lag) {
  epi_cor(joined, pct_flu_visits, conf_admission, cor_by = geo_value,
          dt1 = lag, use = "complete.obs", method = "spearman") %>%
    mutate(lag = .env$lag)
})

lagged_flu_state %>%
  group_by(lag) %>%
  summarize(mean = mean(cor, na.rm = TRUE)) %>%
  ggplot(aes(x = lag, y = mean)) +
  geom_line() +
  geom_point() +
  labs(x = "Lag", y = "Mean correlation",
       title = "Lag comparison for state Spearman correlations for flu ER and hospital admissions")
ggsave("flu_ER_admissions_state_lag_cor.pdf")
```
Somewhat unsurprisingly, the correlation is highest at zero lag.

## Correlations sliced by time
```{r}
correlations_time_flu <- epi_cor(joined, pct_flu_visits, conf_admission,
                                 cor_by = "time_value", use = "complete.obs",
                                 method = "spearman")
correlations_time_flu
ggplot(correlations_time_flu, aes(x = time_value, y = cor)) +
  geom_line() +
  lims(y = c(0, 1)) +
  labs(title = "Spearman correlations between flu ER visits and flu hospital admissions")
ggsave("flu_ER_admissions_time_correlations.pdf")
```
Strangely, sliced by time we get significantly lower correlations:

```{r}
correlations_time_flu %>%
  summarize(across(where(is.numeric),
                   .fns = list(min = min, median = median, mean = mean, std = sd,
                               q25 = ~ quantile(., 0.25), q75 = ~ quantile(., 0.75),
                               max = max)))
```
This looks like a Simpson's-paradox-adjacent result: within any given location the two signals are fairly well correlated over time, but at a fixed time, correlating across locations suggests they are not. If the typical explanation applies, there should be large differences in the number of points per group.
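To see how this kind of reversal can arise at all, here is a tiny synthetic illustration (made-up numbers, not nssp data) of one standard mechanism: each location's two signals move together over time, but the locations sit at opposite levels, so a fixed-time correlation across locations tells the opposite story.

```r
# Synthetic Simpson's-paradox-style example (illustrative data only).
library(dplyr)
toy <- bind_rows(
  tibble(geo = "aa", t = 1:10, visits = 1:10,       admissions = 1:10 + 100),
  tibble(geo = "bb", t = 1:10, visits = 1:10 + 50,  admissions = 1:10 + 50),
  tibble(geo = "cc", t = 1:10, visits = 1:10 + 100, admissions = 1:10)
)
# Within each location the series track perfectly over time (cor = 1):
toy %>% group_by(geo) %>% summarize(cor = cor(visits, admissions, method = "spearman"))
# Across locations at a fixed time, level differences dominate (cor = -1):
toy %>% group_by(t) %>% summarize(cor = cor(visits, admissions, method = "spearman"))
```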
So, getting the counts:

```{r}
joined %>% group_by(geo_value) %>% count() %>% arrange(n) %>% ungroup() %>%
  summarise(across(where(is.numeric), .fns = list(min = min, max = max)))
```

Each location has 82 points.
```{r}
joined %>% group_by(time_value) %>% count() %>% arrange(n) %>% ungroup() %>%
  summarise(across(where(is.numeric), .fns = list(min = min, max = max)))
```
# Covid

```{r}
nssp_data %>% pull(signal) %>% unique()
nssp_covid_state <- nssp_state %>%
  filter(signal == "pct_ed_visits_covid") %>%
  select(-signal) %>%
  drop_na() %>%
  rename(pct_covid_visits = val) %>%
  as_epi_df(time_type = "week", geo_type = "state")
week_starts <- nssp_covid_state$time_value %>% unique()
covid_hhs_weekly <- covid_hhs %>%
  select(geo_value, time_value, value) %>%
  filter(time_value %in% week_starts) %>%
  rename(conf_admission = value) %>%
  drop_na() %>%
  as_epi_df(time_type = "week", geo_type = "state")
joined_covid <- nssp_covid_state %>% left_join(covid_hhs_weekly)
```
After the necessary joining, let's look at the overall correlation:
```{r}
cor(joined_covid$pct_covid_visits, joined_covid$conf_admission, method = "spearman")
```

So the overall correlation is pretty high, but lower than for flu.
## Correlations sliced by state

```{r}
correlations_space_covid <- epi_cor(joined_covid, pct_covid_visits, conf_admission,
                                    cor_by = "geo_value", use = "complete.obs",
                                    method = "spearman")
states_map <- map_data("state")
mapped <- states_map %>%
  as_tibble() %>%
  mutate(geo_value = setNames(tolower(state.abb), tolower(state.name))[region]) %>%
  right_join(correlations_space_covid) %>%
  arrange(group, order)
ggplot(mapped, aes(x = long, y = lat, group = group, fill = cor)) +
  geom_polygon(colour = "black") +
  scale_fill_viridis(discrete = FALSE, option = "viridis", limits = c(0, 1)) +
  coord_map("polyconic") +
  labs(title = "Spearman correlations between covid ER visits and covid hospital admissions")
ggsave("covid_ER_admissions_state_correlations.pdf")
ggsave("covid_ER_admissions_state_correlations.png")
```
As with flu, hospital admissions look highly correlated with ER visits over space. Summarizing the per-state correlations:

```{r}
correlations_space_covid %>%
  summarize(across(where(is.numeric),
                   .fns = list(min = min, median = median, mean = mean, std = sd,
                               q25 = ~ quantile(., 0.25), q75 = ~ quantile(., 0.75),
                               max = max)))
```
### Lag evaluation

```{r}
lags <- 0:35

lagged_covid_state <- map_dfr(lags, function(lag) {
  epi_cor(joined_covid, pct_covid_visits, conf_admission, cor_by = geo_value,
          dt1 = -lag, use = "complete.obs", method = "spearman") %>%
    mutate(lag = .env$lag)
})

lagged_covid_state %>%
  group_by(lag) %>%
  summarize(mean = mean(cor, na.rm = TRUE)) %>%
  ggplot(aes(x = lag, y = mean)) +
  geom_line() +
  geom_point() +
  labs(x = "Lag", y = "Mean correlation",
       title = "Lag comparison for state Spearman correlations for covid ER and hospital admissions")
ggsave("covid_ER_admissions_state_lag_cor.pdf")
ggsave("covid_ER_admissions_state_lag_cor.png")
```
Somewhat unsurprisingly, the correlation is again highest at zero lag, though it's significantly lower than in the flu case.

## Correlations sliced by time
```{r}
correlations_time_covid <- epi_cor(joined_covid, pct_covid_visits, conf_admission,
                                   cor_by = "time_value", use = "complete.obs",
                                   method = "spearman")
correlations_time_covid
ggplot(correlations_time_covid, aes(x = time_value, y = cor)) +
  geom_line() +
  lims(y = c(0, 1)) +
  labs(title = "Spearman correlations between covid ER visits and covid hospital admissions")
ggsave("covid_ER_admissions_time_correlations.pdf")
ggsave("covid_ER_admissions_time_correlations.png")
```
Strangely, sliced by time we again get significantly lower correlations; some are even negative:

```{r}
correlations_time_covid %>%
  summarize(across(where(is.numeric),
                   .fns = list(min = min, median = median, mean = mean, std = sd,
                               q25 = ~ quantile(., 0.25), q75 = ~ quantile(., 0.75),
                               max = max)))
```
As in the flu case, this looks Simpson's-paradox-adjacent: within a location the signals correlate well over time, but across locations at a fixed time they do not. Again, getting the counts:
```{r}
joined_covid %>% group_by(geo_value) %>% count() %>% arrange(n) %>% ungroup() %>%
  summarise(across(where(is.numeric), .fns = list(min = min, max = max)))
```

Each location has 82 points.
```{r}
joined_covid %>% group_by(time_value) %>% count() %>% arrange(n) %>% ungroup() %>%
  summarise(across(where(is.numeric), .fns = list(min = min, max = max)))
```
Six binary files are not shown.
Review comment: "note: I am not checking this or other notebooks"