[![R-CMD-check](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/cmu-delphi/epipredict/actions/workflows/R-CMD-check.yaml)
<!-- badges: end -->

- Epipredict is a framework for building transformation and forecasting
+ `{epipredict}` is a framework for building transformation and forecasting
pipelines for epidemiological and other panel time-series datasets. In
addition to tools for building forecasting pipelines, it contains a
number of “canned” forecasters meant to run with little modification as
an easy way to get started forecasting.

It is designed to work well with
- [`epiprocess`](https://cmu-delphi.github.io/epiprocess/), a utility for
- handling various time series and geographic processing tools in an
+ [`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/), a utility for
+ time series handling and geographic processing in an
epidemiological context. Both of the packages are meant to work well
with the panel data provided by
- [`epidatr`](https://cmu-delphi.github.io/epidatr/).
+ [`{epidatr}`](https://cmu-delphi.github.io/epidatr/).
+ Pre-compiled example datasets are also available in [`{epidatasets}`](https://cmu-delphi.github.io/epidatasets/).

- If you are looking for more detail beyond the package documentation, see
- our [forecasting
- book](https://cmu-delphi.github.io/delphi-tooling-book/).
+ If you are looking for detail beyond the package documentation, see
+ our [forecasting book](https://cmu-delphi.github.io/delphi-tooling-book/).

## Installation

- To install (unless you’re planning on contributing to package
- development, we suggest using the stable version):
+ Unless you’re planning on contributing to package
+ development, we suggest using the stable version.
+ To install, run:

``` r
# Stable version
@@ -44,25 +45,32 @@ is at <https://cmu-delphi.github.io/epipredict/dev>.

## Motivating example

- To demonstrate the kind of forecast epipredict can make, say we’re
- predicting COVID deaths per 100k for each state on
+ To demonstrate the kind of forecast `{epipredict}` can make, say we want to
+ predict COVID-19 deaths per 100k people for each state on 2021-08-01.

``` r
+ library(epipredict)
+ library(epidatr)
+ library(epiprocess)
+ library(dplyr)
+ library(ggplot2)
+
forecast_date <- as.Date("2021-08-01")
```

Below the fold, we construct this dataset as an `epiprocess::epi_df`
- from JHU data.
+ from [Johns Hopkins Center for Systems Science and Engineering deaths data](https://cmu-delphi.github.io/delphi-epidata/api/covidcast-signals/jhu-csse.html).

<details>
<summary>
Creating the dataset using `{epidatr}` and `{epiprocess}`
</summary>

- This dataset can be found in the package as `covid_case_death_rates`; we
- demonstrate some of the typically ubiquitous cleaning operations needed
- to be able to forecast. First we pull both jhu-csse cases and deaths
- from [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:
+ This section is intended to demonstrate some of the ubiquitous cleaning operations needed
+ to be able to forecast.
+ The dataset prepared here is also included ready-to-go in `{epipredict}` as `covid_case_death_rates`.
+
+ First we pull both `jhu-csse` cases and deaths data from the [Delphi API](https://cmu-delphi.github.io/delphi-epidata/api/covidcast.html) using the [`{epidatr}`](https://cmu-delphi.github.io/epidatr/) package:

``` r
cases <- pub_covidcast(
@@ -87,7 +95,7 @@ deaths <- pub_covidcast(
```

Since visualizing the results on every geography is somewhat
- overwhelming, we’ll only train on a subset of 5.
+ overwhelming, we’ll only train on a subset of locations.

``` r
used_locations <- c("ca", "ma", "ny", "tx")
@@ -113,12 +121,11 @@ cases_deaths |>

<img src="man/figures/README-date-1.png" width="90%" style="display: block; margin: auto;" />

- As with basically any dataset, there is some cleaning that we will need
- to do to make it actually usable; we’ll use some utilities from
+ As with the typical dataset, we will need to do some cleaning to make it actually usable; we’ll use some utilities from
[`{epiprocess}`](https://cmu-delphi.github.io/epiprocess/) for this.

- First, to eliminate some of the noise coming from daily reporting, we do
- 7 day averaging over a trailing window[^1]:
+ First, to reduce the noise from daily reporting, we will compute a
+ 7 day average over a trailing window[^1]:

``` r
cases_deaths <-
@@ -134,7 +141,7 @@ cases_deaths <-
  rename(case_rate = cases_7dav, death_rate = death_rate_7dav)
```
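To illustrate what a trailing 7-day average does, here is a minimal base R sketch. It is not the `{epiprocess}` implementation used above; `trailing_mean()` and the input values are made up for illustration. Each smoothed value uses only the current day and the six days before it, so no future observations leak into the result.

``` r
# Illustrative helper (not part of {epiprocess}): right-aligned moving average.
# The value for day t averages days t - window + 1, ..., t.
trailing_mean <- function(x, window = 7) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (t in seq_len(n)) {
    if (t >= window) {
      out[t] <- mean(x[(t - window + 1):t])
    }
  }
  out
}

# Made-up daily counts with one reporting spike on day 4
daily_counts <- c(10, 12, 9, 30, 11, 10, 9, 8, 13)
smoothed <- trailing_mean(daily_counts)
smoothed
```

The first six entries are `NA` because a full week of history is not yet available; how to treat such incomplete windows is a design choice.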

- Then trimming outliers, most especially negative values:
+ Then we'll trim outliers, especially negative values:

``` r
cases_deaths <-
@@ -161,24 +168,25 @@ cases_deaths <-

</details>

- After having downloaded and cleaned the data in `cases_deaths`, we plot
- a subset of the states, noting the actual forecast date:
+ After downloading and cleaning the cases and deaths data, we can plot
+ a subset of the states, marking the desired forecast date:

<details>
<summary>
Plot
</summary>

``` r
+ used_locations <- c("ca", "ma", "ny", "tx")
forecast_date_label <-
  tibble(
    geo_value = rep(used_locations, 2),
    .response_name = c(rep("case_rate", 4), rep("death_rate", 4)),
    dates = rep(forecast_date - 7 * 2, 2 * length(used_locations)),
    heights = c(rep(150, 4), rep(0.75, 4))
  )
- processed_data_plot <-
-   covid_case_death_rates |>
+
+ covid_case_death_rates |>
  filter(geo_value %in% used_locations) |>
  autoplot(
    case_rate,
@@ -204,13 +212,13 @@ processed_data_plot <-

<img src="man/figures/README-show-processed-data-1.png" width="90%" style="display: block; margin: auto;" />

- To make a forecast, we will use a “canned” simple auto-regressive
+ To make a forecast, we will use a simple “canned” auto-regressive
forecaster to predict the death rate four weeks into the future using
- lagged[^2] deaths and cases
+ lagged[^2] deaths and cases.

``` r
four_week_ahead <- arx_forecaster(
-   cases_deaths |> filter(time_value <= forecast_date),
+   covid_case_death_rates |> filter(time_value <= forecast_date),
  outcome = "death_rate",
  predictors = c("case_rate", "death_rate"),
  args_list = arx_args_list(
@@ -221,31 +229,31 @@ four_week_ahead <- arx_forecaster(
)
four_week_ahead
#> ══ A basic forecaster of type ARX Forecaster ════════════════════════════════
- #>
+ #>
#> This forecaster was fit on 2025-02-10 12:09:58.
- #>
+ #>
#> Training data was an <epi_df> with:
#> • Geography: state,
#> • Time type: day,
#> • Using data up-to-date as of: 2022-01-01.
#> • With the last data available on 2021-08-01
- #>
+ #>
#> ── Predictions ──────────────────────────────────────────────────────────────
- #>
+ #>
#> A total of 4 predictions are available for
#> • 4 unique geographic regions,
#> • At forecast date: 2021-08-01,
#> • For target date: 2021-08-29,
- #>
+ #>
```

- In this case, we have used 0-3 days, a week, and two week lags for the
- case rate, while using only zero, one and two weekly lags for the death
- rate (as predictors). The result `four_week_ahead` is both a fitted
+ In our model setup, we are defining as our predictors case rate lagged 0-3 days, one week, and two weeks, and death rate lagged 0-2 weeks.
+ The result `four_week_ahead` is both a fitted
model object which could be used any time in the future to create
- different forecasts, as well as a set of predicted values (and
+ different forecasts, and a set of predicted values (and
prediction intervals) for each location 28 days after the forecast date.
- Plotting the prediction intervals on our subset above[^3]:
+
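The lag setup and the 28-day horizon described above come down to simple date arithmetic. The base R sketch below is an illustration only, not the package's internals; the named `lags` list is an assumed way of writing down the lags given in the prose. It lists which past dates feed the model at the forecast date and which date is being predicted.

``` r
forecast_date <- as.Date("2021-08-01")
ahead <- 28 # four weeks, in days

# Lags in days, as described in the text (the list layout is illustrative)
lags <- list(
  case_rate = c(0, 1, 2, 3, 7, 14),
  death_rate = c(0, 7, 14)
)

# Dates whose observations serve as predictors for a forecast made on forecast_date
predictor_dates <- lapply(lags, function(l) forecast_date - l)

# Date being predicted
target_date <- forecast_date + ahead
target_date
```

The resulting target date agrees with the `For target date: 2021-08-29` line in the printed forecaster output.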
+ Plotting the prediction intervals on the true values for our location subset[^3]:

<details>
<summary>
@@ -275,28 +283,29 @@ forecast_plot <-

<img src="man/figures/README-show-single-forecast-1.png" width="90%" style="display: block; margin: auto;" />

- And as a tibble of quantile level-value pairs:
+ And as a tibble of quantile-value pairs:

``` r
four_week_ahead$predictions |>
  select(-.pred) |>
  pivot_quantiles_longer(.pred_distn)
#> # A tibble: 20 × 5
#>   geo_value values quantile_levels forecast_date target_date
- #>   <chr>      <dbl>           <dbl> <date>        <date>
- #> 1 ca        0.199             0.1  2021-08-01    2021-08-29
- #> 2 ca        0.285             0.25 2021-08-01    2021-08-29
- #> 3 ca        0.345             0.5  2021-08-01    2021-08-29
- #> 4 ca        0.405             0.75 2021-08-01    2021-08-29
- #> 5 ca        0.491             0.9  2021-08-01    2021-08-29
- #> 6 ma        0.0285            0.1  2021-08-01    2021-08-29
+ #>   <chr>      <dbl>           <dbl> <date>        <date>
+ #> 1 ca        0.199             0.1  2021-08-01    2021-08-29
+ #> 2 ca        0.285             0.25 2021-08-01    2021-08-29
+ #> 3 ca        0.345             0.5  2021-08-01    2021-08-29
+ #> 4 ca        0.405             0.75 2021-08-01    2021-08-29
+ #> 5 ca        0.491             0.9  2021-08-01    2021-08-29
+ #> 6 ma        0.0285            0.1  2021-08-01    2021-08-29
#> # ℹ 14 more rows
```

- The black dot gives the median prediction, while the blue intervals give
+ The orange dot gives the predicted median, and the blue intervals give
the 25-75%, the 10-90%, and 2.5-97.5% inter-quantile ranges[^4]. For
this particular day and these locations, the forecasts are relatively
- accurate, with the true data being at least within the 10-90% interval.
+ accurate, with the true data being at worst within the 10-90% interval.
+
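As a small illustration of what "within the 10-90% interval" means, the base R sketch below checks an observed value against the California quantiles printed in the tibble. `in_interval()` is a hypothetical helper, and the observed value is invented for the example; it is not the true death rate.

``` r
# Predicted quantiles for "ca", copied from the tibble output
ca_quantiles <- c(
  "0.1" = 0.199, "0.25" = 0.285, "0.5" = 0.345, "0.75" = 0.405, "0.9" = 0.491
)

# Hypothetical helper: is `observed` inside the central interval [lower, upper]?
in_interval <- function(observed, q, lower, upper) {
  observed >= q[[as.character(lower)]] && observed <= q[[as.character(upper)]]
}

hypothetical_observed <- 0.45 # invented value, for illustration only
in_interval(hypothetical_observed, ca_quantiles, 0.25, 0.75)
in_interval(hypothetical_observed, ca_quantiles, 0.1, 0.9)
```

With this invented value, the observation falls outside the 25-75% interval but inside the 10-90% interval.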
A couple of things to note:

1.  Our methods are primarily direct forecasters; this means we don’t
@@ -310,12 +319,11 @@ A couple of things to note:
## Getting Help

If you encounter a bug or have a feature request, feel free to file an
- [issue on our github
+ [issue on our GitHub
page](https://github.com/cmu-delphi/epipredict/issues). For other
questions, feel free to reach out to the authors, either via this
- [contact
- form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform),
- email, or the Insightnet slack.
+ [contact form](https://docs.google.com/forms/d/e/1FAIpQLScqgT1fKZr5VWBfsaSp-DNaN03aV6EoZU4YljIzHJ1Wl_zmtg/viewform),
+ email, or the InsightNet Slack.

[^1]: This makes it so that any given day of the processed time-series
    only depends on the previous week, which means that we avoid leaking