library(tidyverse)
library(jsonlite)
library(MMWRweek)
Using Functions to Query an API
CSTE Data Science Team Training Program
Functions
It is much better practice to write functions to perform a task rather than copying and pasting code. This offers several advantages:
- Avoids duplication.
- Easier maintenance. If requirements change, you only need to update the function once, rather than updating it everywhere you’ve copied and pasted the code.
- Avoids copy-paste errors, or errors where some duplicates are updated but not others.
- Makes code modular. The function you write here could be used elsewhere.
Let’s take a look at an example task where we want to retrieve data from a web API, do some data manipulation, and write out results to a file.
ILI data from FluView
Let’s retrieve influenza-like illness (ILI) data from the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet) using the CMU Delphi FluView API. The base URL is https://delphi.cmu.edu/epidata/fluview/, which must be combined with required parameters specifying one or more regions and epidemiological weeks.
For example, if we want to retrieve the first five weeks of ILI data for Washington state for 2022, we could visit the following URL:
https://delphi.cmu.edu/epidata/fluview/?regions=wa&epiweeks=202201-202205.
The data that is returned through this API is JSON, which can be read into R using the jsonlite package.
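As a quick illustration (a minimal sketch, assuming an internet connection and the jsonlite package loaded as above), you could read the example URL directly and inspect the top-level structure of the response; the records we use below live in the epidata element:
res <- read_json("https://delphi.cmu.edu/epidata/fluview/?regions=wa&epiweeks=202201-202205",
                 simplifyVector = TRUE)
str(res, max.level = 1)  # the ILI records used below are in res$epidata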
First, let’s construct our base URL:
<- "https://delphi.cmu.edu/epidata/fluview/" baseurl
Next, let’s build up our URL by specifying what location and year we want. In this example, let’s get data for California for 2020. We start by specifying the region and year we want, then construct the epidemiological week string, and finally put it all together into a URL string.
Note that we could simply specify multiple years and multiple regions at once. For example, we could get data for all of 2019 and 2020 for Washington, California, and Oregon at https://delphi.cmu.edu/epidata/fluview/?regions=wa,or,ca&epiweeks=201901-202053. However, for the sake of demonstration, let’s imagine that the API only allows retrieval of one location, one year at a time.
# Specify the region and year
location <- "ca"
year <- 2020
# Construct the epiweek string
epiweeks <- paste0(year, "01", "-", year, "53")
epiweeks
[1] "202001-202053"
# Construct the URL suffix
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
urlsuffix
[1] "?regions=ca&epiweeks=202001-202053"
# Construct the entire URL
url <- paste0(baseurl, urlsuffix)
url
[1] "https://delphi.cmu.edu/epidata/fluview/?regions=ca&epiweeks=202001-202053"
Now we can retrieve data for all of 2020 for California. Let’s select the region, epiweek, and weighted ILI columns, then figure out the date from the epiweek. The numbered notes below the code explain each step of the pipeline.
ca2020 <-
  read_json(url, simplifyVector = TRUE)$epidata %>%
  tibble() %>%
  select(region, epiweek, wili) %>%
  mutate(year=as.numeric(substr(epiweek, 0, 4)),
         week=as.numeric(substr(epiweek, 5, 6)),
         date=MMWRweek2Date(year, week))
ca2020
1. This returns a data.frame object.
2. This turns the data frame into a tibble, which is easier to work with and prints nicely.
3. We only want the region, epiweek string, and weighted ILI.
4. Get the numeric year (the first four characters in the epiweek string).
5. Get the numeric week (the fifth and sixth characters in the epiweek string).
6. Use the MMWRweek package to get the date from the epidemiological year and week.
# A tibble: 53 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 ca 202001 4.87 2020 1 2019-12-29
2 ca 202002 4.15 2020 2 2020-01-05
3 ca 202003 4.30 2020 3 2020-01-12
4 ca 202004 4.80 2020 4 2020-01-19
5 ca 202005 4.98 2020 5 2020-01-26
6 ca 202006 5.06 2020 6 2020-02-02
7 ca 202007 4.80 2020 7 2020-02-09
8 ca 202008 4.51 2020 8 2020-02-16
9 ca 202009 4.23 2020 9 2020-02-23
10 ca 202010 3.99 2020 10 2020-03-01
# ℹ 43 more rows
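As a standalone check of the epiweek parsing above (a small illustrative sketch, not part of the original pipeline), we can pull the year and week out of a single epiweek string and convert it with MMWRweek2Date(); the result matches the first row of the table:
epiweek <- "202001"
yr <- as.numeric(substr(epiweek, 1, 4))  # numeric year: 2020
wk <- as.numeric(substr(epiweek, 5, 6))  # numeric week: 1
MMWRweek2Date(MMWRyear = yr, MMWRweek = wk)
[1] "2019-12-29"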
Copy and paste
Let’s do this for Washington state for 2018-2021. What are the problems with this code?
- First, there’s tons of duplication. What if we wanted to do this for 10 years of data for Washington state? What if we wanted to do this for multiple states?
- Related, what if we needed to change something about the code here? For instance, what if we wanted to get the unweighted ILI instead of weighted ILI (wili)?
- There’s lots of room for error here. Each time we copy and paste the code block, we’re creating new objects (wa2019, wa2020, etc.). What if we forget to modify the name of the object we’re assigning, which could result in writing 2020 data to the 2019 object (see annotation #1)? Or, what if we forget to change the filename we’re writing to, and accidentally write 2018 data to wa2020.csv (annotation #2)? There’s a lot to keep track of mentally, for what’s actually a relatively simple pipeline.
# First let's set the base URL if we haven't already
baseurl <- "https://delphi.cmu.edu/epidata/fluview/"

# Retrieve and write out 2018 data
location <- "wa"
year <- 2018
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2018 <-
  read_json(url, simplifyVector = TRUE)$epidata %>%
  tibble() %>%
  select(region, epiweek, wili) %>%
  mutate(year=as.numeric(substr(epiweek, 0, 4)),
         week=as.numeric(substr(epiweek, 5, 6)),
         date=MMWRweek2Date(year, week))
write_csv(wa2018, file="wa2018.csv")

# Retrieve and write out 2019 data
location <- "wa"
year <- 2019
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2019 <-
  read_json(url, simplifyVector = TRUE)$epidata %>%
  tibble() %>%
  select(region, epiweek, wili) %>%
  mutate(year=as.numeric(substr(epiweek, 0, 4)),
         week=as.numeric(substr(epiweek, 5, 6)),
         date=MMWRweek2Date(year, week))
write_csv(wa2019, file="wa2019.csv")

# Retrieve and write out 2020 data
location <- "wa"
year <- 2020
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2020 <-
  read_json(url, simplifyVector = TRUE)$epidata %>%
  tibble() %>%
  select(region, epiweek, wili) %>%
  mutate(year=as.numeric(substr(epiweek, 0, 4)),
         week=as.numeric(substr(epiweek, 5, 6)),
         date=MMWRweek2Date(year, week))
write_csv(wa2020, file="wa2020.csv")

# Retrieve and write out 2021 data
location <- "wa"
year <- 2021
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2021 <-
  read_json(url, simplifyVector = TRUE)$epidata %>%
  tibble() %>%
  select(region, epiweek, wili) %>%
  mutate(year=as.numeric(substr(epiweek, 0, 4)),
         week=as.numeric(substr(epiweek, 5, 6)),
         date=MMWRweek2Date(year, week))
write_csv(wa2021, file="wa2021.csv")
1. What if we forget to change the name of the object, wa2019, here?
2. What if we forget to change both the name of the object we’re saving, wa2019, and the filename, wa2019.csv?
Using functions
Let’s define a function, which will allow us to specify the inputs (location and year) arbitrarily.
get_wili <- function(location, year, write=FALSE) {

  # Set the base URL
  baseurl <- "https://delphi.cmu.edu/epidata/fluview/"

  # Construct the full API URL
  epiweeks <- paste0(year, "01", "-", year, "53")
  urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
  url <- paste0(baseurl, urlsuffix)

  # Get the results
  result <-
    read_json(url, simplifyVector = TRUE)$epidata %>%
    tibble() %>%
    select(region, epiweek, wili) %>%
    mutate(year=as.numeric(substr(epiweek, 0, 4)),
           week=as.numeric(substr(epiweek, 5, 6)),
           date=MMWRweek2Date(year, week))

  # Optional: if write=TRUE, write out to file
  if (write) {
    outfile <- paste0(location, year, ".csv")
    if (file.exists(outfile)) {
      warning(paste0("Skipping writing output file that already exists: ", outfile))
    } else {
      message(paste0("Writing results to file: ", outfile))
      write_csv(result, outfile)
    }
  }

  # Return the results of the function
  return(result)
}
1. location and year must be specified. By default, no files are written (default write=FALSE).
2. The year and location variables here are defined as function arguments.
3. If the function is called with write=TRUE, this block will be run.
4. Construct an output filename.
5. If the output file already exists, issue a warning, and do not overwrite the file. If the output file doesn’t exist, write it out to disk and message the output filename.
Now we can get data for any state or year with a single function call! Let’s get data for California in 2018:
get_wili(location="ca", year=2018)
# A tibble: 52 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 ca 201801 6.67 2018 1 2017-12-31
2 ca 201802 4.69 2018 2 2018-01-07
3 ca 201803 3.94 2018 3 2018-01-14
4 ca 201804 3.90 2018 4 2018-01-21
5 ca 201805 4.26 2018 5 2018-01-28
6 ca 201806 3.82 2018 6 2018-02-04
7 ca 201807 3.71 2018 7 2018-02-11
8 ca 201808 3.51 2018 8 2018-02-18
9 ca 201809 3.65 2018 9 2018-02-25
10 ca 201810 3.39 2018 10 2018-03-04
# ℹ 42 more rows
Or Virginia in 2011:
get_wili(location="va", year=2011)
# A tibble: 52 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 va 201101 2.49 2011 1 2011-01-02
2 va 201102 2.81 2011 2 2011-01-09
3 va 201103 3.97 2011 3 2011-01-16
4 va 201104 4.74 2011 4 2011-01-23
5 va 201105 5.35 2011 5 2011-01-30
6 va 201106 5.13 2011 6 2011-02-06
7 va 201107 5.14 2011 7 2011-02-13
8 va 201108 4.36 2011 8 2011-02-20
9 va 201109 3.47 2011 9 2011-02-27
10 va 201110 2.84 2011 10 2011-03-06
# ℹ 42 more rows
By default, write=FALSE. Let’s set write=TRUE, which will write a CSV to disk with a default filename constructed from the location and the year. Let’s do this for Washington state in 2019. Note that our function prints a message to the screen, notifying you where the data is being written.
get_wili(location="wa", year=2019, write=TRUE)
Writing results to file: wa2019.csv
# A tibble: 52 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 wa 201901 1.39 2019 1 2018-12-30
2 wa 201902 1.45 2019 2 2019-01-06
3 wa 201903 1.61 2019 3 2019-01-13
4 wa 201904 1.88 2019 4 2019-01-20
5 wa 201905 2.15 2019 5 2019-01-27
6 wa 201906 2.17 2019 6 2019-02-03
7 wa 201907 3.33 2019 7 2019-02-10
8 wa 201908 2.22 2019 8 2019-02-17
9 wa 201909 2.24 2019 9 2019-02-24
10 wa 201910 3.29 2019 10 2019-03-03
# ℹ 42 more rows
Note that if the file already exists, our function will skip over it and issue a warning.
get_wili(location="wa", year=2019, write=TRUE)
Warning in get_wili(location = "wa", year = 2019, write = TRUE): Skipping
writing output file that already exists: wa2019.csv
# A tibble: 52 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 wa 201901 1.39 2019 1 2018-12-30
2 wa 201902 1.45 2019 2 2019-01-06
3 wa 201903 1.61 2019 3 2019-01-13
4 wa 201904 1.88 2019 4 2019-01-20
5 wa 201905 2.15 2019 5 2019-01-27
6 wa 201906 2.17 2019 6 2019-02-03
7 wa 201907 3.33 2019 7 2019-02-10
8 wa 201908 2.22 2019 8 2019-02-17
9 wa 201909 2.24 2019 9 2019-02-24
10 wa 201910 3.29 2019 10 2019-03-03
# ℹ 42 more rows
Advanced: Mapping a function over multiple inputs
The get_wili()
function takes three arguments: a location, a year, and a logical TRUE/FALSE flag indicating whether output should be written to a CSV. We could run this code successively over every year we were interested in:
get_wili(location="wa", year=2012, write=TRUE)
get_wili(location="wa", year=2013, write=TRUE)
get_wili(location="wa", year=2014, write=TRUE)
get_wili(location="wa", year=2015, write=TRUE)
# ...etc
However, we can use the purrr package to map the get_wili() function over multiple year inputs. We create a years vector that holds 10 years spanning 2012 through 2021, then we apply purrr’s map() function to this years vector to run a function on each element of years. Here, the function we use is get_wili(), and the . in the function call is replaced by each year in years. In other words, this one line of code runs get_wili() over every year in 2012-2021 for Washington state. Additionally, we set write=TRUE so that CSV tables are written out for every year.
years <- 2012:2021
wa10yrs <- map(years, ~get_wili(location="wa", year=., write=TRUE))
Writing results to file: wa2012.csv
Writing results to file: wa2013.csv
Writing results to file: wa2014.csv
Writing results to file: wa2015.csv
Writing results to file: wa2016.csv
Writing results to file: wa2017.csv
Writing results to file: wa2018.csv
Writing results to file: wa2019.csv
Writing results to file: wa2020.csv
Writing results to file: wa2021.csv
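As an aside, the same call could be written with R’s native anonymous-function syntax (available in R 4.1 and later) instead of the ~ formula shorthand. This is purely a stylistic alternative, sketched here with write=FALSE since the files were already written above:
wa10yrs <- map(years, \(yr) get_wili(location = "wa", year = yr, write = FALSE))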
Take a look at the data that is returned by this operation. The map()
function always returns a list, where each element of the list is a tibble containing all that year’s data. The list has 10 years’ worth of data. Let’s only look at the first two elements (2012-2013).
wa10yrs[1:2]
[[1]]
# A tibble: 52 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 wa 201201 0.922 2012 1 2012-01-01
2 wa 201202 0.587 2012 2 2012-01-08
3 wa 201203 0.966 2012 3 2012-01-15
4 wa 201204 0.605 2012 4 2012-01-22
5 wa 201205 1.16 2012 5 2012-01-29
6 wa 201206 1.12 2012 6 2012-02-05
7 wa 201207 1.07 2012 7 2012-02-12
8 wa 201208 1.30 2012 8 2012-02-19
9 wa 201209 1.11 2012 9 2012-02-26
10 wa 201210 1.45 2012 10 2012-03-04
# ℹ 42 more rows
[[2]]
# A tibble: 52 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 wa 201301 2.78 2013 1 2012-12-30
2 wa 201302 2.30 2013 2 2013-01-06
3 wa 201303 2.94 2013 3 2013-01-13
4 wa 201304 3.59 2013 4 2013-01-20
5 wa 201305 2.47 2013 5 2013-01-27
6 wa 201306 2.11 2013 6 2013-02-03
7 wa 201307 1.67 2013 7 2013-02-10
8 wa 201308 1.39 2013 8 2013-02-17
9 wa 201309 1.34 2013 9 2013-02-24
10 wa 201310 1.15 2013 10 2013-03-03
# ℹ 42 more rows
We can collapse that list into one large tibble containing the entire range of data using dplyr::bind_rows().
wa10yrs <- bind_rows(wa10yrs)
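An equivalent one-step alternative (a sketch, not part of the original lesson) is purrr’s map_dfr(), which maps the function and row-binds the results in a single call; note that it would re-query the API rather than reuse the list we already have:
wa10yrs <- map_dfr(years, ~get_wili(location = "wa", year = .))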
Let’s take a look:
wa10yrs
# A tibble: 522 × 6
region epiweek wili year week date
<chr> <int> <dbl> <dbl> <dbl> <date>
1 wa 201201 0.922 2012 1 2012-01-01
2 wa 201202 0.587 2012 2 2012-01-08
3 wa 201203 0.966 2012 3 2012-01-15
4 wa 201204 0.605 2012 4 2012-01-22
5 wa 201205 1.16 2012 5 2012-01-29
6 wa 201206 1.12 2012 6 2012-02-05
7 wa 201207 1.07 2012 7 2012-02-12
8 wa 201208 1.30 2012 8 2012-02-19
9 wa 201209 1.11 2012 9 2012-02-26
10 wa 201210 1.45 2012 10 2012-03-04
# ℹ 512 more rows
Note the range of the dates extends from January 2012 through December 2021:
range(wa10yrs$date)
[1] "2012-01-01" "2021-12-26"
How might you map over multiple locations and multiple years? See the documentation for map2()
or purrr’s other map variants.
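One possible approach (a hedged sketch; the combos and all_ili names are made up for illustration) is to build every location/year combination with tidyr’s expand_grid(), then use map2() to iterate over both columns in parallel:
combos <- expand_grid(location = c("wa", "or", "ca"), year = 2019:2020)
all_ili <- map2(combos$location, combos$year,
                ~get_wili(location = .x, year = .y)) %>%
  bind_rows()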
Now that we have all the data in one tibble, let’s plot weighted ILI over the last 10 years for Washington state:
ggplot(wa10yrs, aes(date, wili)) +
geom_line() +
scale_x_date(date_breaks="1 year", date_labels="%Y")
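If you had mapped over multiple locations as in the map2() sketch above, one way to compare states (again a hedged sketch using the hypothetical all_ili tibble) would be to facet the same plot by region:
ggplot(all_ili, aes(date, wili)) +
  geom_line() +
  facet_wrap(~region, ncol = 1) +
  scale_x_date(date_breaks = "1 year", date_labels = "%Y")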