A helpful tip for handling issues in data queries from APIs
Published:
A short post about querying large amounts of data
Introduction
APIs give data scientists access to an incredibly diverse array of databases from inside a programming script. You can simply load a library, write a couple lines of code, and be ready to work with the data. However, not all APIs are equal: some time out, respond slowly, or are just a pain to work with. To be fair, it may not be the API's fault but rather the amount of data you're requesting or the traffic on the server.
The Issue
One particular API that I use frequently and am a huge fan of is the Water Quality Portal (WQP). Specifically, I use the R package dataRetrieval to interface with the WQP from an R script.
Usually, when I query data from the WQP, it's a lot. For one project in particular, it was all temperature, pH, and total alkalinity measurements in streams throughout the contiguous United States from 1990 through 2020. A query of that size (10-million-plus observations) was no small feat, and the only way I could get it to work was to partition the query into sections: by year for each state. The query was run with a for-loop in an R script, which iterated through each section (year, state) and saved the data to my local machine. It took hours and required me to set up the program, let it run, and work on something else.
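If you want a feel for what that partitioning looks like in practice, here is a minimal sketch. The states, characteristic name, and file naming are illustrative stand-ins, not my original script:

library(dataRetrieval)

# Build every (state, year) partition of the full query
partitions <- expand.grid(state = c("MN", "WI", "IA"), # illustrative subset of states
                          year  = 1990:2020,
                          stringsAsFactors = FALSE)

for (j in seq_len(nrow(partitions))) {
  chunk <- readWQPdata(statecode          = partitions$state[j],
                       siteType           = "Stream",
                       characteristicName = "Temperature, water",
                       startDateLo        = paste0(partitions$year[j], "-01-01"),
                       startDateHi        = paste0(partitions$year[j], "-12-31"))
  # Save each partition locally so a crash doesn't lose earlier chunks
  saveRDS(chunk, file = paste0("wqp_", partitions$state[j], "_", partitions$year[j], ".rds"))
}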
Now, the idea of partitioning a data query within a loop is certainly not novel, and it's even recommended by the WQP team. However, the issue I needed to solve was the server timing out. Specifically, when I was running my query scripts, most iterations would go smoothly, but every so often the server would time out and kill the program. So I would come back to an incomplete data query and errors in my console.
The Solution
To handle this issue, I implemented a nested while-loop, where the condition to end the nested loop was that the data had been extracted successfully. By implementing the nested loop within the main loop, the program wouldn't stop when the server timed out; it would just retry that iteration's query. In addition to the extraction condition, I added a maximum number of attempts: if the maximum was reached, the while-loop would end and the main loop would move on to the next iteration's query. To actually change the value of the variables used to check conditions within the nested loop, I added a tryCatch(), which allowed me to run lines of code in succession and run a handler function if an error was returned.
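If you haven't used tryCatch() before, here is a minimal, self-contained sketch of the mechanics, with nothing WQP-specific (flaky_call is a made-up stand-in for any call that can fail):

flaky_call <- function() {           # stand-in for any call that can fail
  if (runif(1) < 0.5) stop("server timed out")
  "data"
}

result <- tryCatch({
  flaky_call()                       # runs normally...
}, error = function(e) {             # ...unless an error is raised,
  message("caught: ", conditionMessage(e))
  NULL                               # in which case the handler's value is returned
})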
Of course, a description of the solution only goes so far, so here is a full example block of code:
Code
# The following code is a simplified example of my presented solution.
# It queries organic carbon data from 4 states for the year 2020, within all streams.
library(dataRetrieval)

query_iterations <- c("MN", "LA", "IA", "WI") # the main loop will iterate by state

# Next we set the arguments for the data query
args <- list(statecode = "",                        # state code blank (changed in loop)
             siteType = "Stream",                   # sites in streams
             sampleMedia = "Water",                 # samples in water
             characteristicName = "Organic carbon", # specified parameter
             startDateLo = "2020-01-01",            # begin date
             startDateHi = "2020-12-31")            # end date

data.list <- list() # empty list to store data

for (i in seq_along(query_iterations)) {  # one iteration per state
  args$statecode <- query_iterations[i]   # set the state for this iteration
  extracted <- FALSE                      # set extracted to FALSE
  attempts <- 0                           # set attempts to 0
  while (!extracted && attempts < 50) {   # runs while not extracted, up to 50 attempts
    tryCatch({
      data.list[[i]] <- readWQPdata(args) # query the data and store it in the list
      extracted <- TRUE                   # reached only if the query succeeds
    },
    error = function(e) {                 # if an error occurs, extracted stays FALSE
      extracted <<- FALSE                 # <<- because the handler has its own scope
    })
    if (!extracted) {                     # if not extracted, attempts goes up by one
      attempts <- attempts + 1
    }
  }
}
Conclusion
This solution might seem like a specific fix for a specific issue, but it can be generalized to almost any data query. I also think it shows a great use case for some powerful programming techniques (while-loops, tryCatch(), and APIs) in solving data science problems.
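For instance, the retry logic can be pulled out into a reusable helper. Here is a minimal sketch of that generalization; the function name retry_query and its defaults are my own invention, not from any package:

# Retry any query function up to max_attempts times; returns NULL on total failure
retry_query <- function(query_fun, max_attempts = 50) {
  result <- NULL
  extracted <- FALSE
  attempts <- 0
  while (!extracted && attempts < max_attempts) {
    tryCatch({
      result <- query_fun()   # any call that might error out
      extracted <- TRUE
    },
    error = function(e) {
      extracted <<- FALSE
    })
    if (!extracted) attempts <- attempts + 1
  }
  result
}

# Usage: wrap the WQP call for one state
# data.list[[i]] <- retry_query(function() readWQPdata(args))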
Hope you found this useful and thank you for reading!