Skip to content

Commit 525618a

Browse files
authored
Merge pull request #392 from cmu-delphi/ds/fix-articles
doc: fix sliding article and verify others
2 parents 34cb6ed + 93a405e commit 525618a

File tree

3 files changed

+107
-112
lines changed

3 files changed

+107
-112
lines changed

vignettes/articles/sliding.Rmd

Lines changed: 87 additions & 93 deletions
Original file line numberDiff line numberDiff line change
@@ -25,27 +25,21 @@ library(purrr)
2525

2626
# Demonstrations of sliding AR and ARX forecasters
2727

28-
A key function from the epiprocess package is `epi_slide()`, which allows the
29-
user to apply a function or formula-based computation over variables in an
30-
`epi_df` over a running window of `n` time steps (see the following `epiprocess`
31-
vignette to go over the basics of the function: ["Slide a computation over
32-
signal values"](https://cmu-delphi.github.io/epiprocess/articles/slide.html)).
33-
The equivalent sliding method for an `epi_archive` object can be called by using
34-
the wrapper function `epix_slide()` (refer to the following vignette for the
35-
basics of the function: ["Work with archive objects and data
36-
revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html)). The
37-
key difference from `epi_slide()` is that it performs version-aware
38-
computations. That is, the function only uses data that would have been
39-
available as of time t for that reference time.
40-
41-
In this vignette, we use `epi_slide()` and `epix_slide()` for backtesting our
42-
`arx_forecaster` on historical COVID-19 case data from the US and from Canada.
43-
More precisely, we first demonstrate using `epi_slide()` to slide ARX
44-
forecasters over an `epi_df` object and compare the results obtained from using
45-
different forecasting engines. We then compare the results from version-aware
46-
and unaware forecasting, where the former is obtained from applying
47-
`epix_slide()` to the `epi_archive` object, while the latter is obtained from
48-
applying `epi_slide()` to the latest snapshot of the data.
28+
A key function from the epiprocess package is `epix_slide()` (refer to the
29+
following vignette for the basics of the function: ["Work with archive objects
30+
and data
31+
revisions"](https://cmu-delphi.github.io/epiprocess/articles/archive.html))
32+
which allows performing version-aware computations. That is, the function only
33+
uses data that would have been available as of time t for that reference time.
34+
35+
In this vignette, we use `epix_slide()` for backtesting our `arx_forecaster` on
36+
historical COVID-19 case data from the US and from Canada. We first examine the
37+
results from a version-unaware forecaster, comparing two different fitting
38+
engines and then we contrast this with version-aware forecasting. The former
39+
will proceed by constructing an `epi_archive` that erases its version
40+
information and then use `epix_slide()` to forecast the future. The latter will
41+
keep the versioned data and proceed similarly by using `epix_slide()` to
42+
forecast the future.
4943

5044
## Comparing different forecasting engines
5145

@@ -60,16 +54,16 @@ claims and the number of new confirmed COVID-19 cases per 100,000 population
6054

6155
<summary>Load a data archive</summary>
6256

63-
We process as before, with the
64-
modification that we use `sync = locf` in `epix_merge()` so that the last
65-
version of each observation can be carried forward to extrapolate unavailable
66-
versions for the less up-to-date input archive.
57+
We process as before, with the modification that we use `sync = locf` in
58+
`epix_merge()` so that the last version of each observation can be carried
59+
forward to extrapolate unavailable versions for the less up-to-date input
60+
archive.
6761

6862
```{r grab-epi-data}
6963
theme_set(theme_bw())
7064
71-
y <- readRDS("all_states_covidcast_signals.rds")
72-
y <- purrr::map(y, ~ select(.x, geo_value, time_value, version = issue, value))
65+
y <- readRDS("all_states_covidcast_signals.rds") %>%
66+
purrr::map(~ select(.x, geo_value, time_value, version = issue, value))
7367
7468
x <- epix_merge(
7569
y[[1]] %>% rename(percent_cli = value) %>% as_epi_archive(compactify = FALSE),
@@ -82,34 +76,36 @@ rm(y)
8276

8377
</details>
8478

85-
After obtaining the latest snapshot of the data, we produce forecasts on that
86-
data using the default engine of simple linear regression and compare to a
87-
random forest.
88-
89-
Note that all of the warnings about the forecast date being less than the most
90-
recent update date of the data have been suppressed to avoid cluttering the
91-
output.
79+
We then obtaining the latest snapshot of the data and proceed to fake the
80+
version information by setting `version = time_value`. This has the effect of
81+
obtaining data that arrives exactly at the day of the time_value.
9282

9383
```{r arx-kweek-preliminaries, warning = FALSE}
9484
# Latest snapshot of data, and forecast dates
95-
x_latest <- epix_as_of(x, version = max(x$versions_end))
96-
fc_time_values <- seq(from = as.Date("2021-11-01"), to = as.Date("2021-11-01"), by = "1 month")
85+
x_latest <- epix_as_of(x, version = max(x$versions_end)) %>%
86+
mutate(version = time_value) %>%
87+
as_epi_archive()
88+
fc_time_values <- seq(
89+
from = as.Date("2020-08-01"),
90+
to = as.Date("2021-11-01"),
91+
by = "1 month"
92+
)
9793
aheads <- c(7, 14, 21, 28)
9894
99-
forecast_k_week_ahead <- function(epi_df, outcome, predictors, ahead = 7, engine) {
100-
epi_slide(
101-
epi_df,
102-
~ arx_forecaster(
103-
.x, outcome, predictors, engine,
104-
args_list = arx_args_list(ahead = ahead)
105-
)$predictions %>%
106-
select(-geo_value),
107-
.window_size = 120,
108-
.ref_time_values = fc_time_values,
109-
.new_col_name = "fc"
110-
) %>%
111-
select(geo_value, time_value, starts_with("fc")) %>%
112-
mutate(engine_type = engine$engine)
95+
forecast_k_week_ahead <- function(epi_archive, outcome, predictors, ahead = 7, engine) {
96+
epi_archive %>%
97+
epix_slide(
98+
.f = function(x, gk, rtv) {
99+
arx_forecaster(
100+
x, outcome, predictors, engine,
101+
args_list = arx_args_list(ahead = ahead)
102+
)$predictions %>%
103+
mutate(engine_type = engine$engine) %>%
104+
pivot_quantiles_wider(.pred_distn)
105+
},
106+
.before = 120,
107+
.versions = fc_time_values
108+
)
113109
}
114110
```
115111

@@ -131,7 +127,6 @@ fc <- bind_rows(
131127
engine = rand_forest(mode = "regression")
132128
))
133129
)
134-
pivot_quantiles_wider(fc_.pred_distn)
135130
```
136131

137132
Here, `arx_forecaster()` does all the heavy lifting. It creates leads of the
@@ -148,18 +143,22 @@ sense of the model performance while keeping the graphic simple.
148143

149144
<summary>Code for plotting</summary>
150145
```{r plot-arx, message = FALSE, warning = FALSE}
151-
fc_cafl <- fc %>% filter(geo_value %in% c("ca", "fl"))
152-
x_latest_cafl <- x_latest %>% filter(geo_value %in% c("ca", "fl"))
153-
154-
p1 <- ggplot(fc_cafl, aes(fc_target_date, group = time_value, fill = engine_type)) +
146+
fc_cafl <- fc %>%
147+
tibble() %>%
148+
filter(geo_value %in% c("ca", "fl"))
149+
x_latest_cafl <- x_latest$DT %>%
150+
tibble() %>%
151+
filter(geo_value %in% c("ca", "fl"))
152+
153+
p1 <- ggplot(fc_cafl, aes(target_date, group = forecast_date, fill = engine_type)) +
155154
geom_line(
156155
data = x_latest_cafl, aes(x = time_value, y = case_rate),
157156
inherit.aes = FALSE, color = "gray50"
158157
) +
159158
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`), alpha = 0.4) +
160-
geom_line(aes(y = fc_.pred)) +
161-
geom_point(aes(y = fc_.pred), size = 0.5) +
162-
geom_vline(aes(xintercept = time_value), linetype = 2, alpha = 0.5) +
159+
geom_line(aes(y = .pred)) +
160+
geom_point(aes(y = .pred), size = 0.5) +
161+
geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
163162
facet_grid(vars(geo_value), vars(engine_type), scales = "free") +
164163
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
165164
scale_fill_brewer(palette = "Set1") +
@@ -216,51 +215,50 @@ linear regression with those from using boosted regression trees.
216215
can <- readRDS(system.file(
217216
"extdata", "can_prov_cases.rds",
218217
package = "epipredict", mustWork = TRUE
219-
))
220-
221-
can <- can %>%
218+
)) %>%
222219
group_by(version, geo_value) %>%
223220
arrange(time_value) %>%
224221
mutate(cr_7dav = RcppRoll::roll_meanr(case_rate, n = 7L)) %>%
225222
as_epi_archive(compactify = TRUE)
226223
227-
can_latest <- epix_as_of(can, max_version = max(can$DT$version))
224+
can_latest <- epix_as_of(can, version = max(can$DT$version)) %>%
225+
mutate(version = time_value) %>%
226+
as_epi_archive()
228227
229228
# Generate the forecasts, and bind them together
230229
can_fc <- bind_rows(
231230
map(
232231
aheads,
233232
~ forecast_k_week_ahead(can_latest, "cr_7dav", "cr_7dav", .x, linear_reg())
234-
) %>% list_rbind(),
233+
),
235234
map(
236235
aheads,
237236
~ forecast_k_week_ahead(
238237
can_latest, "cr_7dav", "cr_7dav", .x,
239238
boost_tree(mode = "regression", trees = 20)
240239
)
241-
) %>% list_rbind()
242-
) %>%
243-
pivot_quantiles_wider(fc_.pred_distn)
240+
)
241+
)
244242
```
245243

246244
The figures below shows the results for all of the provinces.
247245

248246
```{r plot-can-fc-lr, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 12}
249247
ggplot(
250248
can_fc %>% filter(engine_type == "lm"),
251-
aes(x = fc_target_date, group = time_value)
249+
aes(x = target_date, group = forecast_date)
252250
) +
253251
coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
254252
geom_line(
255-
data = can_latest, aes(x = time_value, y = cr_7dav),
253+
data = can_latest$DT %>% tibble(), aes(x = time_value, y = cr_7dav),
256254
inherit.aes = FALSE, color = "gray50"
257255
) +
258256
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
259257
alpha = 0.4
260258
) +
261-
geom_line(aes(y = fc_.pred)) +
262-
geom_point(aes(y = fc_.pred), size = 0.5) +
263-
geom_vline(aes(xintercept = time_value), linetype = 2, alpha = 0.5) +
259+
geom_line(aes(y = .pred)) +
260+
geom_point(aes(y = .pred), size = 0.5) +
261+
geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
264262
facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
265263
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
266264
labs(
@@ -273,19 +271,19 @@ ggplot(
273271
```{r plot-can-fc-boost, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 12}
274272
ggplot(
275273
can_fc %>% filter(engine_type == "xgboost"),
276-
aes(x = fc_target_date, group = time_value)
274+
aes(x = target_date, group = forecast_date)
277275
) +
278276
coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
279277
geom_line(
280-
data = can_latest, aes(x = time_value, y = cr_7dav),
278+
data = can_latest$DT %>% tibble(), aes(x = time_value, y = cr_7dav),
281279
inherit.aes = FALSE, color = "gray50"
282280
) +
283281
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
284282
alpha = 0.4
285283
) +
286-
geom_line(aes(y = fc_.pred)) +
287-
geom_point(aes(y = fc_.pred), size = 0.5) +
288-
geom_vline(aes(xintercept = time_value), linetype = 2, alpha = 0.5) +
284+
geom_line(aes(y = .pred)) +
285+
geom_point(aes(y = .pred), size = 0.5) +
286+
geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
289287
facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
290288
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
291289
labs(
@@ -313,9 +311,7 @@ have been available in real-time) to forecast the 7 day average of future
313311
COVID-19 case rates from current and past COVID-19 case rates and death rates
314312
for all states. That is, we can make forecasts on the archive, `x`, and compare
315313
those to forecasts on the latest data, `x_latest` using the same general set-up
316-
as above. For version-aware forecasting, note that `x` is fed into
317-
`epix_slide()`, while for version-unaware forecasting, `x_latest` is fed into
318-
`epi_slide()`. Note that in this example, we use a geo-pooled approach (using
314+
as above. Note that in this example, we use a geo-pooled approach (using
319315
combined data from all US states and territories) to train our model.
320316

321317
<details>
@@ -352,21 +348,19 @@ deaths_incidence_prop <- pub_covidcast(
352348
as_epi_archive(compactify = FALSE)
353349
354350
355-
x <- epix_merge(confirmed_incidence_prop, deaths_incidence_prop,
356-
sync = "locf"
357-
)
351+
x <- epix_merge(confirmed_incidence_prop, deaths_incidence_prop, sync = "locf")
358352
359353
x <- x %>%
360354
epix_slide(
361-
before = 365000L, ref_time_values = fc_time_values,
355+
.versions = fc_time_values,
362356
function(x, gk, rtv) {
363357
x %>%
364358
group_by(geo_value) %>%
365-
epi_slide_mean(case_rate, before = 6L) %>%
359+
epi_slide_mean(case_rate, .window_size = 7L) %>%
366360
rename(case_rate_7d_av = slide_value_case_rate) %>%
367-
epi_slide_mean(death_rate, before = 6L) %>%
368-
ungroup() %>%
369-
rename(death_rate_7d_av = slide_value_death_rate)
361+
epi_slide_mean(death_rate, ..window_size = 7L) %>%
362+
rename(death_rate_7d_av = slide_value_death_rate) %>%
363+
ungroup()
370364
}
371365
) %>%
372366
rename(version = time_value) %>%
@@ -419,14 +413,14 @@ epi archive and store it as `x_latest`.
419413

420414
```{r running-arx-forecaster}
421415
arx_preds <- x %>%
422-
epix_slide(~ forecaster(.x),
423-
before = 120, ref_time_values = fc_time_values,
424-
names_sep = NULL
416+
epix_slide(
417+
~ forecaster(.x),
418+
.before = 120, .versions = fc_time_values
425419
) %>%
426420
mutate(engine_type = quantile_reg()$engine) %>%
427421
mutate(ahead_val = target_date - forecast_date)
428422
429-
x_latest <- epix_as_of(x, max_version = max(x$versions_end))
423+
x_latest <- epix_as_of(x, version = max(x$versions_end))
430424
```
431425

432426
Now we plot both the actual and predicted 7 day average of the death rate for
@@ -443,15 +437,15 @@ fc_states <- arx_preds %>%
443437
444438
x_latest_states <- x_latest %>% filter(geo_value %in% states_to_show)
445439
446-
p2 <- ggplot(fc_states, aes(target_date, group = time_value)) +
440+
p2 <- ggplot(fc_states, aes(target_date, group = forecast_date)) +
447441
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value), alpha = 0.4) +
448442
geom_line(
449443
data = x_latest_states, aes(x = time_value, y = death_rate_7d_av),
450444
inherit.aes = FALSE, color = "gray50"
451445
) +
452446
geom_line(aes(y = .pred, color = geo_value)) +
453447
geom_point(aes(y = .pred, color = geo_value), size = 0.5) +
454-
geom_vline(aes(xintercept = time_value), linetype = 2, alpha = 0.5) +
448+
geom_vline(aes(xintercept = forecast_date), linetype = 2, alpha = 0.5) +
455449
facet_wrap(~geo_value, scales = "free_y", ncol = 1L) +
456450
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
457451
scale_fill_brewer(palette = "Set1") +

0 commit comments

Comments
 (0)