Using Functions to Query an API

CSTE Data Science Team Training Program

Authors

Stephen Turner

Pete Nagraj

Published

March 28, 2023

Functions

It is much better practice to write functions to perform a task rather than copying and pasting code. This offers several advantages:

  1. Avoids duplication.
  2. Easier maintenance. If requirements change, you only need to update the function once, rather than update everywhere you’ve copied and pasted the code.
  3. Avoids copy-paste errors, or errors where some duplicates are updated but not others.
  4. Makes code modular. The function you write here could be used elsewhere.

Let’s take a look at an example task where we want to retrieve data from a web API, do some data manipulation, and write out results to a file.

library(tidyverse)
library(jsonlite)
library(MMWRweek)

ILI data from FluView

Let’s retrieve influenza-like illness (ILI) data from the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet) using the CMU Delphi FluView API. The base URL is https://delphi.cmu.edu/epidata/fluview/, which must be combined with required parameters that specify one or more regions and epidemiological weeks.

For example, to retrieve the first five weeks of ILI data for Washington state for 2022, we could visit the following URL:
https://delphi.cmu.edu/epidata/fluview/?regions=wa&epiweeks=202201-202205.

The data that is returned through this API is JSON, which can be read into R using the jsonlite package.
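
To see what jsonlite does with a response of this kind, here is a minimal sketch that parses a small JSON string instead of calling the API. The string only mimics the epidata field used later in this section, and its wili values are made-up placeholders, not real FluView output.

```r
library(jsonlite)

# Toy JSON with an "epidata" field like the one the tutorial extracts.
# The wili values are invented placeholders, not real surveillance data.
toy_json <- '{"epidata": [
  {"region": "wa", "epiweek": 202201, "wili": 1.5},
  {"region": "wa", "epiweek": 202202, "wili": 2.5}
]}'

# fromJSON() simplifies an array of records into a data.frame by default
toy <- fromJSON(toy_json)$epidata
class(toy)
names(toy)
```

read_json() applies the same simplification to a file or URL when it is called with simplifyVector = TRUE.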

First, let’s construct our base URL:

baseurl <- "https://delphi.cmu.edu/epidata/fluview/"

Next, let’s build up our URL by specifying what location and year we want. In this example, let’s get data for California for 2020. We start by specifying the region and year we want, then construct the epidemiological week string, and finally put it all together into a URL string.

Note that we could simply specify multiple years and multiple regions at once. For example, we could get data for all of 2019 and 2020 for Washington, California, and Oregon at https://delphi.cmu.edu/epidata/fluview/?regions=wa,or,ca&epiweeks=201901-202053. However, for the sake of demonstration, let’s imagine that the API only allows retrieval of one location, one year at a time.
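
For reference, the combined multi-region URL mentioned above could be assembled along these lines. This is only a sketch: the variable names (regions, first_year, last_year, multi_url) are illustrative and are not used elsewhere in this section.

```r
# Base URL as given in the text
baseurl <- "https://delphi.cmu.edu/epidata/fluview/"

# Illustrative inputs (hypothetical variable names)
regions <- c("wa", "or", "ca")
first_year <- 2019
last_year <- 2020

# Collapse the regions into a single comma-separated string
region_string <- paste(regions, collapse = ",")

# Epiweek range from week 01 of the first year to week 53 of the last year
epiweek_range <- paste0(first_year, "01-", last_year, "53")

multi_url <- paste0(baseurl, "?regions=", region_string, "&epiweeks=", epiweek_range)
multi_url
```

The rest of this section goes back to one location and one year per request.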

# Specify the region and year
location <- "ca"
year <- 2020

# Construct the epiweek string
epiweeks <- paste0(year, "01", "-", year, "53")
epiweeks
[1] "202001-202053"
# Construct the URL suffix
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
urlsuffix
[1] "?regions=ca&epiweeks=202001-202053"
# Construct the entire URL
url <- paste0(baseurl, urlsuffix)
url
[1] "https://delphi.cmu.edu/epidata/fluview/?regions=ca&epiweeks=202001-202053"

Now we can retrieve data for all of 2020 for California. Let’s select the region, epiweek, and weighted ILI, then work out the date from the epiweek. The comments in the code block below explain each step.

ca2020 <- 
  # With simplifyVector = TRUE, the epidata element is returned as a data.frame
  read_json(url, simplifyVector = TRUE)$epidata %>%
  # Turn the data frame into a tibble, which is easier to work with and prints nicely
  tibble() %>%
  # We only want the region, epiweek string, and weighted ILI
  select(region, epiweek, wili) %>%
  # Numeric year: first four characters in the epiweek string
  mutate(year=as.numeric(substr(epiweek, 1, 4)),
         # Numeric week: fifth and sixth characters in the epiweek string
         week=as.numeric(substr(epiweek, 5, 6)),
         # Use the MMWRweek package to get the date from the epidemiological year and week
         date=MMWRweek2Date(year, week))
ca2020
# A tibble: 53 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 ca      202001  4.87  2020     1 2019-12-29
 2 ca      202002  4.15  2020     2 2020-01-05
 3 ca      202003  4.30  2020     3 2020-01-12
 4 ca      202004  4.80  2020     4 2020-01-19
 5 ca      202005  4.98  2020     5 2020-01-26
 6 ca      202006  5.06  2020     6 2020-02-02
 7 ca      202007  4.80  2020     7 2020-02-09
 8 ca      202008  4.51  2020     8 2020-02-16
 9 ca      202009  4.23  2020     9 2020-02-23
10 ca      202010  3.99  2020    10 2020-03-01
# ℹ 43 more rows
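
The first row above pairs epiweek 202001 with a date in December 2019, because epidemiological weeks do not line up with calendar years. The following sketch walks through the conversion for that single epiweek, using only functions from the MMWRweek package that is already loaded in this section.

```r
library(MMWRweek)

epiweek <- 202001L

# substr() coerces the integer to character before extracting
year <- as.numeric(substr(epiweek, 1, 4))
week <- as.numeric(substr(epiweek, 5, 6))

# With no day supplied, MMWRweek2Date() returns the first day of that week
MMWRweek2Date(year, week)

# MMWRweek() goes the other way, from a date back to MMWR year, week, and day
MMWRweek(as.Date("2019-12-29"))
```

Note also that 2020 returns 53 rows, while the other single-year results shown in this section have 52. Requesting weeks 01 through 53 simply returns however many weeks that year has.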

Copy and paste

Let’s do this for Washington state for 2018-2021. What are the problems with this code?

  1. First, there’s tons of duplication. Imagine we wanted to do this for 10 years of data for Washington state. What if we wanted to do this for multiple states?
  2. Relatedly, what if we needed to change something about the code here? For instance, what if we wanted to get the unweighted ILI instead of weighted ILI (wili)?
  3. There’s lots of room for error here. Each time we copy and paste the code block, we’re creating new objects (wa2019, wa2020, etc.). What if we forget to modify the name of the object we’re assigning, which could result in writing 2020 data to the 2019 object (see annotation #1)? Or, what if we forget to change the filename we’re writing to, and accidentally write 2018 data to wa2020.csv (annotation #2)? There’s a lot we have to keep track of mentally, for what’s actually a relatively simple pipeline.

# First let's set the base URL if we haven't already
baseurl <- "https://delphi.cmu.edu/epidata/fluview/"

# Retrieve and write out 2018 data
location <- "wa"
year <- 2018
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2018 <- 
  read_json(url, simplifyVector = TRUE)$epidata %>% 
  tibble() %>% 
  select(region, epiweek, wili) %>% 
  mutate(year=as.numeric(substr(epiweek, 1, 4)),
         week=as.numeric(substr(epiweek, 5, 6)), 
         date=MMWRweek2Date(year, week)) 
write_csv(wa2018, file="wa2018.csv")

# Retrieve and write out 2019 data
location <- "wa"
year <- 2019
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2019 <-
  read_json(url, simplifyVector = TRUE)$epidata %>% 
  tibble() %>% 
  select(region, epiweek, wili) %>% 
  mutate(year=as.numeric(substr(epiweek, 1, 4)),
         week=as.numeric(substr(epiweek, 5, 6)), 
         date=MMWRweek2Date(year, week)) 
write_csv(wa2019, file="wa2019.csv")

# Retrieve and write out 2020 data
location <- "wa"
year <- 2020
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2020 <- 
  read_json(url, simplifyVector = TRUE)$epidata %>% 
  tibble() %>% 
  select(region, epiweek, wili) %>% 
  mutate(year=as.numeric(substr(epiweek, 1, 4)),
         week=as.numeric(substr(epiweek, 5, 6)), 
         date=MMWRweek2Date(year, week)) 
write_csv(wa2020, file="wa2020.csv")

# Retrieve and write out 2021 data
location <- "wa"
year <- 2021
epiweeks <- paste0(year, "01", "-", year, "53")
urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
url <- paste0(baseurl, urlsuffix)
wa2021 <- 
  read_json(url, simplifyVector = TRUE)$epidata %>% 
  tibble() %>% 
  select(region, epiweek, wili) %>% 
  mutate(year=as.numeric(substr(epiweek, 1, 4)),
         week=as.numeric(substr(epiweek, 5, 6)), 
         date=MMWRweek2Date(year, week)) 
write_csv(wa2021, file="wa2021.csv")
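
Annotation 1 (the wa2019 <- assignment in the 2019 block): What if we forget to change the name of the object, wa2019 here?
Annotation 2 (the write_csv() call in the 2019 block): What if we forget to change both the name of the object we’re saving, wa2019, and the filename, wa2019.csv?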

Using functions

Let’s define a function that takes the location and year as arguments, so that the same pipeline can be run for any inputs.

get_wili <- function(location, year, write=FALSE) {
  
  # Set the base URL
  baseurl <- "https://delphi.cmu.edu/epidata/fluview/"
  
  # Construct the full API URL
  epiweeks <- paste0(year, "01", "-", year, "53")
  urlsuffix <- paste0("?regions=", location, "&epiweeks=", epiweeks)
  url <- paste0(baseurl, urlsuffix)
  
  # Get the results
  result <- 
    read_json(url, simplifyVector = TRUE)$epidata %>% 
    tibble() %>% 
    select(region, epiweek, wili) %>% 
    mutate(year=as.numeric(substr(epiweek, 1, 4)),
           week=as.numeric(substr(epiweek, 5, 6)), 
           date=MMWRweek2Date(year, week)) 

  # Optional: if write=TRUE, write out to file
  if (write) {
    outfile <- paste0(location, year, ".csv")
    if (file.exists(outfile)) {
      warning(paste0("Skipping writing output file that already exists: ", outfile))
    } else {
      message(paste0("Writing results to file: ", outfile))
      write_csv(result, outfile)
    }
  } 
  
  # Return the results of the function
  return(result)
}

Notes on the function above:

  1. Function signature: location and year must be specified. By default, no files are written (default write=FALSE).
  2. URL construction: the year and location variables used here are the function arguments.
  3. The if (write) block: this block runs only if the function is called with write=TRUE.
  4. outfile: construct an output filename.
  5. The file.exists() check: if the output file already exists, issue a warning and do not overwrite the file. If the output file doesn’t exist, write it out to disk and print a message with the output filename.
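
One way to make the function more modular still is to pull the URL construction into its own small helper, which can then be checked without calling the API. The sketch below assumes a hypothetical helper name, build_fluview_url(), and adds basic input checks that are not part of get_wili() as written above.

```r
# Hypothetical helper (not part of the tutorial's get_wili() function)
build_fluview_url <- function(location, year,
                              baseurl = "https://delphi.cmu.edu/epidata/fluview/") {
  # Basic input checks, added here for illustration only
  stopifnot(is.character(location), length(location) == 1)
  stopifnot(is.numeric(year), length(year) == 1)

  # Same construction as in get_wili(): weeks 01 through 53 of one year
  epiweeks <- paste0(year, "01", "-", year, "53")
  paste0(baseurl, "?regions=", location, "&epiweeks=", epiweeks)
}

build_fluview_url("ca", 2020)
```

Inside get_wili(), the three URL-building lines could then be replaced by a single call to this helper.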

Now we can get data for any state or year with a single function call! Let’s get data for California in 2018:

get_wili(location="ca", year=2018)
# A tibble: 52 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 ca      201801  6.67  2018     1 2017-12-31
 2 ca      201802  4.69  2018     2 2018-01-07
 3 ca      201803  3.94  2018     3 2018-01-14
 4 ca      201804  3.90  2018     4 2018-01-21
 5 ca      201805  4.26  2018     5 2018-01-28
 6 ca      201806  3.82  2018     6 2018-02-04
 7 ca      201807  3.71  2018     7 2018-02-11
 8 ca      201808  3.51  2018     8 2018-02-18
 9 ca      201809  3.65  2018     9 2018-02-25
10 ca      201810  3.39  2018    10 2018-03-04
# ℹ 42 more rows

Or Virginia in 2011:

get_wili(location="va", year=2011)
# A tibble: 52 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 va      201101  2.49  2011     1 2011-01-02
 2 va      201102  2.81  2011     2 2011-01-09
 3 va      201103  3.97  2011     3 2011-01-16
 4 va      201104  4.74  2011     4 2011-01-23
 5 va      201105  5.35  2011     5 2011-01-30
 6 va      201106  5.13  2011     6 2011-02-06
 7 va      201107  5.14  2011     7 2011-02-13
 8 va      201108  4.36  2011     8 2011-02-20
 9 va      201109  3.47  2011     9 2011-02-27
10 va      201110  2.84  2011    10 2011-03-06
# ℹ 42 more rows

By default, write=FALSE. Let’s set write=TRUE, which writes a CSV to disk with a default filename built from the location and the year. Let’s do this for Washington state in 2019. Note that our function prints a message to the screen telling you where the data is being written.

get_wili(location="wa", year=2019, write=TRUE)
Writing results to file: wa2019.csv
# A tibble: 52 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 wa      201901  1.39  2019     1 2018-12-30
 2 wa      201902  1.45  2019     2 2019-01-06
 3 wa      201903  1.61  2019     3 2019-01-13
 4 wa      201904  1.88  2019     4 2019-01-20
 5 wa      201905  2.15  2019     5 2019-01-27
 6 wa      201906  2.17  2019     6 2019-02-03
 7 wa      201907  3.33  2019     7 2019-02-10
 8 wa      201908  2.22  2019     8 2019-02-17
 9 wa      201909  2.24  2019     9 2019-02-24
10 wa      201910  3.29  2019    10 2019-03-03
# ℹ 42 more rows

Note that if the file already exists, our function will skip over it and issue a warning.

get_wili(location="wa", year=2019, write=TRUE)
Warning in get_wili(location = "wa", year = 2019, write = TRUE): Skipping
writing output file that already exists: wa2019.csv
# A tibble: 52 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 wa      201901  1.39  2019     1 2018-12-30
 2 wa      201902  1.45  2019     2 2019-01-06
 3 wa      201903  1.61  2019     3 2019-01-13
 4 wa      201904  1.88  2019     4 2019-01-20
 5 wa      201905  2.15  2019     5 2019-01-27
 6 wa      201906  2.17  2019     6 2019-02-03
 7 wa      201907  3.33  2019     7 2019-02-10
 8 wa      201908  2.22  2019     8 2019-02-17
 9 wa      201909  2.24  2019     9 2019-02-24
10 wa      201910  3.29  2019    10 2019-03-03
# ℹ 42 more rows
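
The skip-if-exists behavior can be tried out without touching the API or the working directory. The sketch below uses a hypothetical helper, write_if_new(), a temporary file, and a toy data frame. It uses base R’s write.csv() rather than write_csv() only to keep the example free of package dependencies.

```r
# Hypothetical helper mirroring the skip-if-exists logic in get_wili()
write_if_new <- function(x, outfile) {
  if (file.exists(outfile)) {
    warning("Skipping writing output file that already exists: ", outfile)
    return(invisible(FALSE))
  }
  message("Writing results to file: ", outfile)
  write.csv(x, outfile, row.names = FALSE)
  invisible(TRUE)
}

# tempfile() only generates a path; the file does not exist yet
tmp <- tempfile(fileext = ".csv")
toy <- data.frame(region = "wa", year = 2019)

first <- write_if_new(toy, tmp)                     # writes the file
second <- suppressWarnings(write_if_new(toy, tmp))  # skips; warning suppressed here
c(first, second)

# Clean up the temporary file
unlink(tmp)
```

Returning TRUE or FALSE invisibly is a design choice in this sketch so that callers can tell whether a file was written. get_wili() itself always returns the data.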

Advanced: Mapping a function over multiple inputs

The get_wili() function takes three arguments: a location, a year, and a logical TRUE/FALSE flag indicating whether output should be written to a CSV. We could run this code successively over every year we were interested in:

get_wili(location="wa", year=2012, write=TRUE)
get_wili(location="wa", year=2013, write=TRUE)
get_wili(location="wa", year=2014, write=TRUE)
get_wili(location="wa", year=2015, write=TRUE)
# ...etc

However, we can use the purrr package to map the get_wili() function over multiple year inputs. We create a years vector that holds 10 years spanning 2012 through 2021, then we apply purrr’s map() function to this years vector, to run a function on each element of years. Here, the function we use is get_wili(), and the . in the function call is replaced by the year in years. In other words, this one line of code runs get_wili() over every year in 2012-2021 for Washington state. Additionally, we set write=TRUE so that CSV tables are written out for every year.

years <- 2012:2021
wa10yrs <- map(years, ~get_wili(location="wa", year=., write=TRUE))
Writing results to file: wa2012.csv
Writing results to file: wa2013.csv
Writing results to file: wa2014.csv
Writing results to file: wa2015.csv
Writing results to file: wa2016.csv
Writing results to file: wa2017.csv
Writing results to file: wa2018.csv
Writing results to file: wa2019.csv
Writing results to file: wa2020.csv
Writing results to file: wa2021.csv
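
To see how the . placeholder works without calling the API, here is a small sketch that maps a toy expression over a few years. The demo_years vector and the filename expression are illustrative only.

```r
library(purrr)

demo_years <- 2012:2014

# Formula shorthand: `.` stands for the current element of demo_years
files_formula <- map(demo_years, ~ paste0("wa", ., ".csv"))

# Equivalent call with an explicit anonymous function
files_function <- map(demo_years, function(y) paste0("wa", y, ".csv"))

# Both approaches give the same list
identical(files_formula, files_function)

# map_chr() returns a character vector instead of a list
file_names <- map_chr(demo_years, ~ paste0("wa", ., ".csv"))
file_names
```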

Take a look at the data that is returned by this operation. The map() function always returns a list, where each element of the list is a tibble containing all that year’s data. The list has 10 years’ worth of data. Let’s only look at the first two elements (2012-2013).

wa10yrs[1:2]
[[1]]
# A tibble: 52 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 wa      201201 0.922  2012     1 2012-01-01
 2 wa      201202 0.587  2012     2 2012-01-08
 3 wa      201203 0.966  2012     3 2012-01-15
 4 wa      201204 0.605  2012     4 2012-01-22
 5 wa      201205 1.16   2012     5 2012-01-29
 6 wa      201206 1.12   2012     6 2012-02-05
 7 wa      201207 1.07   2012     7 2012-02-12
 8 wa      201208 1.30   2012     8 2012-02-19
 9 wa      201209 1.11   2012     9 2012-02-26
10 wa      201210 1.45   2012    10 2012-03-04
# ℹ 42 more rows

[[2]]
# A tibble: 52 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 wa      201301  2.78  2013     1 2012-12-30
 2 wa      201302  2.30  2013     2 2013-01-06
 3 wa      201303  2.94  2013     3 2013-01-13
 4 wa      201304  3.59  2013     4 2013-01-20
 5 wa      201305  2.47  2013     5 2013-01-27
 6 wa      201306  2.11  2013     6 2013-02-03
 7 wa      201307  1.67  2013     7 2013-02-10
 8 wa      201308  1.39  2013     8 2013-02-17
 9 wa      201309  1.34  2013     9 2013-02-24
10 wa      201310  1.15  2013    10 2013-03-03
# ℹ 42 more rows

We can collapse that list into one large tibble containing the entire range of data using dplyr::bind_rows().

wa10yrs <- bind_rows(wa10yrs)

Let’s take a look:

wa10yrs
# A tibble: 522 × 6
   region epiweek  wili  year  week date      
   <chr>    <int> <dbl> <dbl> <dbl> <date>    
 1 wa      201201 0.922  2012     1 2012-01-01
 2 wa      201202 0.587  2012     2 2012-01-08
 3 wa      201203 0.966  2012     3 2012-01-15
 4 wa      201204 0.605  2012     4 2012-01-22
 5 wa      201205 1.16   2012     5 2012-01-29
 6 wa      201206 1.12   2012     6 2012-02-05
 7 wa      201207 1.07   2012     7 2012-02-12
 8 wa      201208 1.30   2012     8 2012-02-19
 9 wa      201209 1.11   2012     9 2012-02-26
10 wa      201210 1.45   2012    10 2012-03-04
# ℹ 512 more rows
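
As a small illustration of what bind_rows() does with a list, the sketch below stacks two toy tibbles that stand in for two elements of a list returned by map(). The values are placeholders.

```r
library(dplyr)

# Toy stand-ins for two elements of a list returned by map()
toy_list <- list(
  tibble(year = 2012, week = 1:2),
  tibble(year = 2013, week = 1:3)
)

# bind_rows() stacks the tibbles, matching columns by name
combined <- bind_rows(toy_list)
nrow(combined)
```

The combined tibble has as many rows as the list elements have in total, which is why the ten yearly tibbles above add up to 522 rows.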

Note the range of the dates extends from January 2012 through December 2021:

range(wa10yrs$date)
[1] "2012-01-01" "2021-12-26"

Going further

How might you map over multiple locations and multiple years? See the documentation for map2() or purrr’s other map variants.
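
One possible approach, sketched under some assumptions: build a table of every location and year combination, then use map2() to walk the two columns in parallel. To keep the example runnable without calling the API, it uses a hypothetical stand-in function, describe_query(), in place of get_wili().

```r
library(purrr)
library(tidyr)
library(dplyr)

# All location-year combinations (2 locations x 3 years = 6 rows)
grid <- expand_grid(location = c("wa", "ca"), year = 2019:2021)

# Stand-in for get_wili() so the sketch runs without calling the API
describe_query <- function(location, year) {
  tibble(location = location, year = year,
         outfile = paste0(location, year, ".csv"))
}

# map2() iterates over two vectors in parallel: .x is the location, .y is the year
results <- map2(grid$location, grid$year, ~ describe_query(.x, .y))

# Collapse the list of one-row tibbles into a single tibble
all_results <- bind_rows(results)
all_results
```

Swapping describe_query(.x, .y) for get_wili(location=.x, year=.y) would issue one request per row of the grid.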

Now that we have all the data in one tibble, let’s plot weighted ILI over the last 10 years for Washington state:

ggplot(wa10yrs, aes(date, wili)) + 
  geom_line() + 
  scale_x_date(date_breaks="1 year", date_labels="%Y")