A helpful tip for handling issues in data queries from APIs
Published:
A short post about querying large amounts of data
Introduction
APIs give data scientists access to an incredibly diverse array of databases from inside a programming script. You can simply load a library, write a couple lines of code, and be ready to work with the data. However, not all APIs are equal: some time out, respond slowly, or are just a pain to work with. To be fair, it may not be the API's fault but rather the amount of data you're requesting or the traffic on the server.
The Issue
One particular API that I use frequently and am a huge fan of is the Water Quality Portal (WQP). Specifically, I use the R package dataRetrieval to interface with the WQP from an R script.
Usually, when I query data from the WQP, it's a lot. For one project in particular, it was all temperature, pH, and total alkalinity measurements in streams throughout the contiguous United States from 1990 through 2020. A query of that size (10-million-plus observations) was no small feat, and the only way I could get it to work was to partition the query into sections: by year for each state. The query was run with a for-loop in an R script, which iterated through each section (year, state) and saved the data to my local machine. It took hours and required me to set up the program, let it run, and work on something else.
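If you want a feel for what that partitioning looks like in practice, here is a minimal sketch. The states, characteristic name, and file naming are illustrative stand-ins, not my original script:

library(dataRetrieval)

# Build every (state, year) partition of the full query
partitions <- expand.grid(state = c("MN", "WI", "IA"), # illustrative subset of states
                          year  = 1990:2020,
                          stringsAsFactors = FALSE)

for (j in seq_len(nrow(partitions))) {
  chunk <- readWQPdata(statecode          = partitions$state[j],
                       siteType           = "Stream",
                       characteristicName = "Temperature, water",
                       startDateLo        = paste0(partitions$year[j], "-01-01"),
                       startDateHi        = paste0(partitions$year[j], "-12-31"))
  # Save each partition locally so a crash doesn't lose earlier chunks
  saveRDS(chunk, file = paste0("wqp_", partitions$state[j], "_", partitions$year[j], ".rds"))
}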
Now, the idea of partitioning a data query within a loop is certainly not novel, and it's even recommended by the WQP team. However, the issue I needed to solve was the server timing out. Specifically, when I was running my query scripts, most iterations would go smoothly, but every so often the server would time out and kill the program. So I would come back to an incomplete data query and errors in my console.
The Solution
To handle this issue, I implemented a nested while-loop, where the condition to end the nested loop was that the data had been extracted successfully. By implementing the nested loop within the main loop, the program wouldn't stop when the server timed out; it would just retry that iteration's query. In addition to the extraction condition, I added a maximum number of attempts: if the maximum was reached, the while-loop would end and the main loop would move on to the next iteration's query. To actually change the value of the variables used to check conditions within the nested loop, I added a tryCatch(), which allowed me to run lines of code in succession and run a handler function if an error was returned.
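If you haven't used tryCatch() before, here is a minimal, self-contained sketch of the mechanics, with nothing WQP-specific (flaky_call is a made-up stand-in for any call that can fail):

flaky_call <- function() {           # stand-in for any call that can fail
  if (runif(1) < 0.5) stop("server timed out")
  "data"
}

result <- tryCatch({
  flaky_call()                       # runs normally...
}, error = function(e) {             # ...unless an error is raised,
  message("caught: ", conditionMessage(e))
  NULL                               # in which case the handler's value is returned
})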
Of course, a description of the solution only goes so far, so here is a full example block of code:
Code
# The following code is a simplified example of my presented solution.
# It queries organic carbon data from 4 states for the year 2020, within all streams.
library(dataRetrieval)

query_iterations <- c("MN", "LA", "IA", "WI") # the main loop will iterate by state

# Next we set the arguments for the data query
args <- list(statecode = "",                        # state code blank (changed in loop)
             siteType = "Stream",                   # sites in streams
             sampleMedia = "Water",                 # samples in water
             characteristicName = "Organic carbon", # specified parameter
             startDateLo = "2020-01-01",            # begin date
             startDateHi = "2020-12-31")            # end date

data.list <- list() # empty list to store data

for (i in seq_along(query_iterations)) {  # one iteration per state
  args$statecode <- query_iterations[i]   # set the state for this iteration
  extracted <- FALSE                      # set extracted to FALSE
  attempts <- 0                           # set attempts to 0
  while (!extracted && attempts < 50) {   # runs while not extracted, up to 50 attempts
    tryCatch({
      data.list[[i]] <- readWQPdata(args) # query the data and store it in the list
      extracted <- TRUE                   # reached only if the query succeeds
    },
    error = function(e) {                 # if an error occurs, extracted stays FALSE
      extracted <<- FALSE                 # <<- because the handler has its own scope
    })
    if (!extracted) {                     # if not extracted, attempts goes up by one
      attempts <- attempts + 1
    }
  }
}
Conclusion
This solution might seem like a specific fix for a specific issue, but it can be generalized to almost any data query. I also think it shows a great use case for some powerful programming techniques (while-loops, tryCatch(), and APIs) in solving data science problems.
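For instance, the retry logic can be pulled out into a reusable helper. Here is a minimal sketch of that generalization; the function name retry_query and its defaults are my own invention, not from any package:

# Retry any query function up to max_attempts times; returns NULL on total failure
retry_query <- function(query_fun, max_attempts = 50) {
  result <- NULL
  extracted <- FALSE
  attempts <- 0
  while (!extracted && attempts < max_attempts) {
    tryCatch({
      result <- query_fun()   # any call that might error out
      extracted <- TRUE
    },
    error = function(e) {
      extracted <<- FALSE
    })
    if (!extracted) attempts <- attempts + 1
  }
  result
}

# Usage: wrap the WQP call for one state
# data.list[[i]] <- retry_query(function() readWQPdata(args))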
Hope you found this useful and thank you for reading!