@@ -25,27 +25,21 @@ library(purrr)
25
25
26
26
# Demonstrations of sliding AR and ARX forecasters
27
27
28
- A key function from the epiprocess package is ` epi_slide() ` , which allows the
29
- user to apply a function or formula-based computation over variables in an
30
- ` epi_df ` over a running window of ` n ` time steps (see the following ` epiprocess `
31
- vignette to go over the basics of the function: [ "Slide a computation over
32
- signal values"] ( https://cmu-delphi.github.io/epiprocess/articles/slide.html ) ).
33
- The equivalent sliding method for an ` epi_archive ` object can be called by using
34
- the wrapper function ` epix_slide() ` (refer to the following vignette for the
35
- basics of the function: [ "Work with archive objects and data
36
- revisions"] ( https://cmu-delphi.github.io/epiprocess/articles/archive.html ) ). The
37
- key difference from ` epi_slide() ` is that it performs version-aware
38
- computations. That is, the function only uses data that would have been
39
- available as of time t for that reference time.
40
-
41
- In this vignette, we use ` epi_slide() ` and ` epix_slide() ` for backtesting our
42
- ` arx_forecaster ` on historical COVID-19 case data from the US and from Canada.
43
- More precisely, we first demonstrate using ` epi_slide() ` to slide ARX
44
- forecasters over an ` epi_df ` object and compare the results obtained from using
45
- different forecasting engines. We then compare the results from version-aware
46
- and unaware forecasting, where the former is obtained from applying
47
- ` epix_slide() ` to the ` epi_archive ` object, while the latter is obtained from
48
- applying ` epi_slide() ` to the latest snapshot of the data.
28
+ A key function from the epiprocess package is ` epix_slide() ` (refer to the
29
+ following vignette for the basics of the function: [ "Work with archive objects
30
+ and data
31
+ revisions"] ( https://cmu-delphi.github.io/epiprocess/articles/archive.html ) )
32
+ which allows performing version-aware computations. That is, the function only
33
+ uses data that would have been available as of time t for that reference time.
34
+
35
+ In this vignette, we use ` epix_slide() ` for backtesting our ` arx_forecaster ` on
36
+ historical COVID-19 case data from the US and from Canada. We first examine the
37
+ results from a version-unaware forecaster, comparing two different fitting
38
+ engines and then we contrast this with version-aware forecasting. The former
39
+ will proceed by constructing an ` epi_archive ` that erases its version
40
+ information and then use ` epix_slide() ` to forecast the future. The latter will
41
+ keep the versioned data and proceed similarly by using ` epix_slide() ` to
42
+ forecast the future.
49
43
50
44
## Comparing different forecasting engines
51
45
@@ -60,16 +54,16 @@ claims and the number of new confirmed COVID-19 cases per 100,000 population
60
54
61
55
<summary >Load a data archive</summary >
62
56
63
- We process as before, with the
64
- modification that we use ` sync = locf ` in ` epix_merge() ` so that the last
65
- version of each observation can be carried forward to extrapolate unavailable
66
- versions for the less up-to-date input archive.
57
+ We process as before, with the modification that we use ` sync = locf ` in
58
+ ` epix_merge() ` so that the last version of each observation can be carried
59
+ forward to extrapolate unavailable versions for the less up-to-date input
60
+ archive.
67
61
68
62
``` {r grab-epi-data}
69
63
theme_set(theme_bw())
70
64
71
- y <- readRDS("all_states_covidcast_signals.rds")
72
- y <- purrr::map(y, ~ select(.x, geo_value, time_value, version = issue, value))
65
+ y <- readRDS("all_states_covidcast_signals.rds") %>%
66
+ purrr::map(~ select(.x, geo_value, time_value, version = issue, value))
73
67
74
68
x <- epix_merge(
75
69
y[[1]] %>% rename(percent_cli = value) %>% as_epi_archive(compactify = FALSE),
@@ -82,34 +76,36 @@ rm(y)
82
76
83
77
</details >
84
78
85
- After obtaining the latest snapshot of the data, we produce forecasts on that
86
- data using the default engine of simple linear regression and compare to a
87
- random forest.
88
-
89
- Note that all of the warnings about the forecast date being less than the most
90
- recent update date of the data have been suppressed to avoid cluttering the
91
- output.
79
+ We then obtaining the latest snapshot of the data and proceed to fake the
80
+ version information by setting ` version = time_value ` . This has the effect of
81
+ obtaining data that arrives exactly at the day of the time_value.
92
82
93
83
``` {r arx-kweek-preliminaries, warning = FALSE}
94
84
# Latest snapshot of data, and forecast dates
95
- x_latest <- epix_as_of(x, version = max(x$versions_end))
96
- fc_time_values <- seq(from = as.Date("2021-11-01"), to = as.Date("2021-11-01"), by = "1 month")
85
+ x_latest <- epix_as_of(x, version = max(x$versions_end)) %>%
86
+ mutate(version = time_value) %>%
87
+ as_epi_archive()
88
+ fc_time_values <- seq(
89
+ from = as.Date("2020-08-01"),
90
+ to = as.Date("2021-11-01"),
91
+ by = "1 month"
92
+ )
97
93
aheads <- c(7, 14, 21, 28)
98
94
99
- forecast_k_week_ahead <- function(epi_df , outcome, predictors, ahead = 7, engine) {
100
- epi_slide(
101
- epi_df,
102
- ~ arx_forecaster(
103
- .x, outcome, predictors, engine,
104
- args_list = arx_args_list(ahead = ahead)
105
- )$predictions %>%
106
- select(-geo_value),
107
- .window_size = 120,
108
- .ref_time_values = fc_time_values,
109
- .new_col_name = "fc"
110
- ) %>%
111
- select(geo_value, time_value, starts_with("fc")) %>%
112
- mutate(engine_type = engine$engine )
95
+ forecast_k_week_ahead <- function(epi_archive , outcome, predictors, ahead = 7, engine) {
96
+ epi_archive %>%
97
+ epix_slide(
98
+ .f = function(x, gk, rtv) {
99
+ arx_forecaster(
100
+ x, outcome, predictors, engine,
101
+ args_list = arx_args_list(ahead = ahead)
102
+ )$predictions %>%
103
+ mutate(engine_type = engine$engine) %>%
104
+ pivot_quantiles_wider(.pred_distn)
105
+ },
106
+ .before = 120,
107
+ .versions = fc_time_values
108
+ )
113
109
}
114
110
```
115
111
@@ -131,7 +127,6 @@ fc <- bind_rows(
131
127
engine = rand_forest(mode = "regression")
132
128
))
133
129
)
134
- pivot_quantiles_wider(fc_.pred_distn)
135
130
```
136
131
137
132
Here, ` arx_forecaster() ` does all the heavy lifting. It creates leads of the
@@ -148,18 +143,22 @@ sense of the model performance while keeping the graphic simple.
148
143
149
144
<summary >Code for plotting</summary >
150
145
``` {r plot-arx, message = FALSE, warning = FALSE}
151
- fc_cafl <- fc %>% filter(geo_value %in% c("ca", "fl"))
152
- x_latest_cafl <- x_latest %>% filter(geo_value %in% c("ca", "fl"))
153
-
154
- p1 <- ggplot(fc_cafl, aes(fc_target_date, group = time_value, fill = engine_type)) +
146
+ fc_cafl <- fc %>%
147
+ tibble() %>%
148
+ filter(geo_value %in% c("ca", "fl"))
149
+ x_latest_cafl <- x_latest$DT %>%
150
+ tibble() %>%
151
+ filter(geo_value %in% c("ca", "fl"))
152
+
153
+ p1 <- ggplot(fc_cafl, aes(target_date, group = forecast_date, fill = engine_type)) +
155
154
geom_line(
156
155
data = x_latest_cafl, aes(x = time_value, y = case_rate),
157
156
inherit.aes = FALSE, color = "gray50"
158
157
) +
159
158
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`), alpha = 0.4) +
160
- geom_line(aes(y = fc_ .pred)) +
161
- geom_point(aes(y = fc_ .pred), size = 0.5) +
162
- geom_vline(aes(xintercept = time_value ), linetype = 2, alpha = 0.5) +
159
+ geom_line(aes(y = .pred)) +
160
+ geom_point(aes(y = .pred), size = 0.5) +
161
+ geom_vline(aes(xintercept = forecast_date ), linetype = 2, alpha = 0.5) +
163
162
facet_grid(vars(geo_value), vars(engine_type), scales = "free") +
164
163
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
165
164
scale_fill_brewer(palette = "Set1") +
@@ -216,51 +215,50 @@ linear regression with those from using boosted regression trees.
216
215
can <- readRDS(system.file(
217
216
"extdata", "can_prov_cases.rds",
218
217
package = "epipredict", mustWork = TRUE
219
- ))
220
-
221
- can <- can %>%
218
+ )) %>%
222
219
group_by(version, geo_value) %>%
223
220
arrange(time_value) %>%
224
221
mutate(cr_7dav = RcppRoll::roll_meanr(case_rate, n = 7L)) %>%
225
222
as_epi_archive(compactify = TRUE)
226
223
227
- can_latest <- epix_as_of(can, max_version = max(can$DT$version))
224
+ can_latest <- epix_as_of(can, version = max(can$DT$version)) %>%
225
+ mutate(version = time_value) %>%
226
+ as_epi_archive()
228
227
229
228
# Generate the forecasts, and bind them together
230
229
can_fc <- bind_rows(
231
230
map(
232
231
aheads,
233
232
~ forecast_k_week_ahead(can_latest, "cr_7dav", "cr_7dav", .x, linear_reg())
234
- ) %>% list_rbind() ,
233
+ ),
235
234
map(
236
235
aheads,
237
236
~ forecast_k_week_ahead(
238
237
can_latest, "cr_7dav", "cr_7dav", .x,
239
238
boost_tree(mode = "regression", trees = 20)
240
239
)
241
- ) %>% list_rbind()
242
- ) %>%
243
- pivot_quantiles_wider(fc_.pred_distn)
240
+ )
241
+ )
244
242
```
245
243
246
244
The figures below shows the results for all of the provinces.
247
245
248
246
``` {r plot-can-fc-lr, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 12}
249
247
ggplot(
250
248
can_fc %>% filter(engine_type == "lm"),
251
- aes(x = fc_target_date , group = time_value )
249
+ aes(x = target_date , group = forecast_date )
252
250
) +
253
251
coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
254
252
geom_line(
255
- data = can_latest, aes(x = time_value, y = cr_7dav),
253
+ data = can_latest$DT %>% tibble() , aes(x = time_value, y = cr_7dav),
256
254
inherit.aes = FALSE, color = "gray50"
257
255
) +
258
256
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
259
257
alpha = 0.4
260
258
) +
261
- geom_line(aes(y = fc_ .pred)) +
262
- geom_point(aes(y = fc_ .pred), size = 0.5) +
263
- geom_vline(aes(xintercept = time_value ), linetype = 2, alpha = 0.5) +
259
+ geom_line(aes(y = .pred)) +
260
+ geom_point(aes(y = .pred), size = 0.5) +
261
+ geom_vline(aes(xintercept = forecast_date ), linetype = 2, alpha = 0.5) +
264
262
facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
265
263
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
266
264
labs(
@@ -273,19 +271,19 @@ ggplot(
273
271
``` {r plot-can-fc-boost, message = FALSE, warning = FALSE, fig.width = 9, fig.height = 12}
274
272
ggplot(
275
273
can_fc %>% filter(engine_type == "xgboost"),
276
- aes(x = fc_target_date , group = time_value )
274
+ aes(x = target_date , group = forecast_date )
277
275
) +
278
276
coord_cartesian(xlim = lubridate::ymd(c("2020-12-01", NA))) +
279
277
geom_line(
280
- data = can_latest, aes(x = time_value, y = cr_7dav),
278
+ data = can_latest$DT %>% tibble() , aes(x = time_value, y = cr_7dav),
281
279
inherit.aes = FALSE, color = "gray50"
282
280
) +
283
281
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value),
284
282
alpha = 0.4
285
283
) +
286
- geom_line(aes(y = fc_ .pred)) +
287
- geom_point(aes(y = fc_ .pred), size = 0.5) +
288
- geom_vline(aes(xintercept = time_value ), linetype = 2, alpha = 0.5) +
284
+ geom_line(aes(y = .pred)) +
285
+ geom_point(aes(y = .pred), size = 0.5) +
286
+ geom_vline(aes(xintercept = forecast_date ), linetype = 2, alpha = 0.5) +
289
287
facet_wrap(~geo_value, scales = "free_y", ncol = 3) +
290
288
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
291
289
labs(
@@ -313,9 +311,7 @@ have been available in real-time) to forecast the 7 day average of future
313
311
COVID-19 case rates from current and past COVID-19 case rates and death rates
314
312
for all states. That is, we can make forecasts on the archive, ` x ` , and compare
315
313
those to forecasts on the latest data, ` x_latest ` using the same general set-up
316
- as above. For version-aware forecasting, note that ` x ` is fed into
317
- ` epix_slide() ` , while for version-unaware forecasting, ` x_latest ` is fed into
318
- ` epi_slide() ` . Note that in this example, we use a geo-pooled approach (using
314
+ as above. Note that in this example, we use a geo-pooled approach (using
319
315
combined data from all US states and territories) to train our model.
320
316
321
317
<details >
@@ -352,21 +348,19 @@ deaths_incidence_prop <- pub_covidcast(
352
348
as_epi_archive(compactify = FALSE)
353
349
354
350
355
- x <- epix_merge(confirmed_incidence_prop, deaths_incidence_prop,
356
- sync = "locf"
357
- )
351
+ x <- epix_merge(confirmed_incidence_prop, deaths_incidence_prop, sync = "locf")
358
352
359
353
x <- x %>%
360
354
epix_slide(
361
- before = 365000L, ref_time_values = fc_time_values,
355
+ .versions = fc_time_values,
362
356
function(x, gk, rtv) {
363
357
x %>%
364
358
group_by(geo_value) %>%
365
- epi_slide_mean(case_rate, before = 6L ) %>%
359
+ epi_slide_mean(case_rate, .window_size = 7L ) %>%
366
360
rename(case_rate_7d_av = slide_value_case_rate) %>%
367
- epi_slide_mean(death_rate, before = 6L ) %>%
368
- ungroup( ) %>%
369
- rename(death_rate_7d_av = slide_value_death_rate )
361
+ epi_slide_mean(death_rate, ..window_size = 7L ) %>%
362
+ rename(death_rate_7d_av = slide_value_death_rate ) %>%
363
+ ungroup( )
370
364
}
371
365
) %>%
372
366
rename(version = time_value) %>%
@@ -419,14 +413,14 @@ epi archive and store it as `x_latest`.
419
413
420
414
``` {r running-arx-forecaster}
421
415
arx_preds <- x %>%
422
- epix_slide(~ forecaster(.x),
423
- before = 120, ref_time_values = fc_time_values ,
424
- names_sep = NULL
416
+ epix_slide(
417
+ ~ forecaster(.x) ,
418
+ .before = 120, .versions = fc_time_values
425
419
) %>%
426
420
mutate(engine_type = quantile_reg()$engine) %>%
427
421
mutate(ahead_val = target_date - forecast_date)
428
422
429
- x_latest <- epix_as_of(x, max_version = max(x$versions_end))
423
+ x_latest <- epix_as_of(x, version = max(x$versions_end))
430
424
```
431
425
432
426
Now we plot both the actual and predicted 7 day average of the death rate for
@@ -443,15 +437,15 @@ fc_states <- arx_preds %>%
443
437
444
438
x_latest_states <- x_latest %>% filter(geo_value %in% states_to_show)
445
439
446
- p2 <- ggplot(fc_states, aes(target_date, group = time_value )) +
440
+ p2 <- ggplot(fc_states, aes(target_date, group = forecast_date )) +
447
441
geom_ribbon(aes(ymin = `0.05`, ymax = `0.95`, fill = geo_value), alpha = 0.4) +
448
442
geom_line(
449
443
data = x_latest_states, aes(x = time_value, y = death_rate_7d_av),
450
444
inherit.aes = FALSE, color = "gray50"
451
445
) +
452
446
geom_line(aes(y = .pred, color = geo_value)) +
453
447
geom_point(aes(y = .pred, color = geo_value), size = 0.5) +
454
- geom_vline(aes(xintercept = time_value ), linetype = 2, alpha = 0.5) +
448
+ geom_vline(aes(xintercept = forecast_date ), linetype = 2, alpha = 0.5) +
455
449
facet_wrap(~geo_value, scales = "free_y", ncol = 1L) +
456
450
scale_x_date(minor_breaks = "month", date_labels = "%b %y") +
457
451
scale_fill_brewer(palette = "Set1") +
0 commit comments