How to make predictions with time-series data using machine learning
I made an LSTM neural network model that uses 30+ years of weather and streamflow data to quite accurately predict what the streamflow will be tomorrow.
The problem with river forecasts
The main reason I practice data science is to apply it to real-world problems. As a kayaker, I have spent many, many hours poring over weather forecasts, hydrologic forecasts, and SNOTEL station data to predict a river's flow. There are good sources for these predictions- NOAA runs river forecast centers covering each major river basin in the country, including the South Fork Payette.
But these forecasts often fall short. In particular, I've noticed that they struggle during major rain events (flashy rivers in the Pacific Northwest are notoriously hard to predict), and they are typically issued only once or twice per day, which is often not frequent enough to keep up with rapidly changing mountain weather. NOAA also forecasts only a select group of rivers; if you want a forecast for a smaller or more remote drainage, even a gauged one, you're out of luck.
So I’m setting out to create a model that will meet or exceed NOAA’s forecasts, and build models for some drainages that are not covered by NOAA.
To start out, I’m benchmarking my model against an industry-standard model created by Upstream Tech.
The South Fork Payette is a great place to start, for several reasons:
- The South Fork above Lowman is undammed, so the confounding variables of reservoirs are avoided.
- The USGS operates a gauge on the South Fork, NOAA has weather stations and a river forecast, and there are SNOTEL sites in the basin. There is a lot of easily accessible data to start with.
- I used to teach kayaking on the Payette and I’ve paddled almost every section of the river system, so I know the region and its hydrology well!
The Upstream Tech model I'm benchmarking against uses both meteorological and remote-sensing data. I haven't incorporated any satellite imagery yet, although that is the next planned development for my model.
To start, I downloaded daily meteorological data from NOAA for a weather station on Banner Summit, at the headwaters of the South Fork. Eventually I will incorporate more stations into my forecast, but I wanted to keep this first iteration simple. The predictive features are:
- Temperature (min and max)
- Snow depth
- Snow water equivalent
- Day of year (derived from the date)
The data go back to 1987.
Next, I went to the USGS gauge at Lowman, Idaho, and grabbed the daily discharge for every day since 1987. In a more refined model, I might get hourly data, but I decided daily was good enough for this iteration.
I merged the two datasets using pandas, creating a dataframe with features and a target variable (discharge).
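The merge step can be sketched like this. The column names and the tiny in-memory frames are stand-ins for illustration- in practice the frames come from the NOAA and USGS CSV downloads, and the real column names may differ:

```python
import pandas as pd

# Tiny stand-in frames; in practice these come from the NOAA and USGS downloads.
met = pd.DataFrame({
    "date": pd.to_datetime(["1987-01-01", "1987-01-02", "1987-01-03"]),
    "tmin": [-12.0, -10.5, -8.0],
    "tmax": [-2.0, -1.0, 0.5],
    "snow_depth": [40.0, 42.0, 41.0],
})
flow = pd.DataFrame({
    "date": pd.to_datetime(["1987-01-02", "1987-01-03", "1987-01-04"]),
    "discharge_cfs": [310.0, 305.0, 300.0],
})

# Inner join on date keeps only days present in both records,
# and day-of-year becomes a feature in its own right.
df = met.merge(flow, on="date", how="inner")
df["day_of_year"] = df["date"].dt.dayofyear
```

An inner join silently drops days missing from either record, which is also a cheap way to surface gaps in the gauge data.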
There were a few missing values in the meteorological data, so I imputed replacements for the NaNs. I then built a correlation matrix to see whether any features were strongly correlated and could be dropped. Since min and max temperature were already present, I dropped the average temperature reading.
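A minimal sketch of that cleanup, on synthetic data standing in for the station record (the real data come from NOAA; forward-fill is one reasonable imputation choice among several):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the station record.
rng = np.random.default_rng(0)
n = 365
df = pd.DataFrame({"tmin": rng.normal(-5.0, 8.0, n)})
df["tmax"] = df["tmin"] + rng.normal(10.0, 2.0, n)
df["tavg"] = (df["tmin"] + df["tmax"]) / 2
df["snow_depth"] = rng.normal(30.0, 10.0, n)

# Mimic gaps in the record, then impute with a forward fill
# (interpolation or a seasonal mean would also be reasonable).
df.loc[[10, 50, 200], "snow_depth"] = np.nan
df["snow_depth"] = df["snow_depth"].ffill()

# The correlation matrix shows tavg is nearly redundant with tmin/tmax,
# so it can be dropped without losing information.
corr = df.corr()
df = df.drop(columns=["tavg"])
```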
With the data cleaned up, it was time to start modeling.
I started with just a baseline: what would happen if you guessed the average discharge of the South Fork (about 800 CFS) every time? It turns out the average error is about 600 CFS. That is unacceptably large- it's almost the typical flow of the river itself!
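The baseline is a one-liner. Here it is on a synthetic hydrograph (low winter base flow plus a snowmelt pulse)- the numbers are illustrative, not the South Fork record:

```python
import numpy as np

# Synthetic hydrograph standing in for the gauge record: base flow plus
# a snowmelt pulse centered on day 150, with noise.
rng = np.random.default_rng(42)
day = np.arange(365)
discharge = 300 + 2000 * np.exp(-((day - 150) ** 2) / (2 * 25.0 ** 2)) + rng.normal(0, 40, 365)

# Baseline: always predict the long-term mean discharge.
baseline = discharge.mean()
mae = np.abs(discharge - baseline).mean()
```

A constant guess fails badly on any strongly seasonal series, because the error is roughly the seasonal amplitude itself.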
I knew I could do better- a lot better.
Linear regressions are very simple, but not a bad place to start getting my hands dirty. I used one, then two, then all the features to see how well they would predict the flow of the South Fork. The answer is- pretty badly.
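One reason a line fits poorly: melt-driven flow is nonlinear in the features (melt only happens above freezing, and only while snowpack remains). A sketch on synthetic data- the feature names and the toy flow formula are illustrative assumptions, not the real record:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Toy data: flow depends on a threshold (tmax > 0) and an interaction
# (snowpack must be present), which a linear model cannot capture.
rng = np.random.default_rng(0)
n = 1000
tmax = rng.normal(5.0, 10.0, n)
swe = rng.uniform(0.0, 30.0, n)
flow = 200 + 40 * np.maximum(tmax, 0) * (swe > 5) + rng.normal(0, 30, n)

X = np.column_stack([tmax, swe])
linreg = LinearRegression().fit(X, flow)
mae = mean_absolute_error(flow, linreg.predict(X))
```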
OK, so linear regressions aren’t known to be the most powerful machine learning models out there. Time to bring out some more complicated stuff. I put all the features in a random forest model. I could have spent longer tweaking the hyperparameters, but I decided to just use the stock scikit-learn settings, with the exception of using 100 estimators.
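The random forest setup looks roughly like this- again on an illustrative synthetic hydrograph, with stock scikit-learn settings apart from `n_estimators=100` as described above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Five synthetic years of a seasonal hydrograph, indexed by day of year.
rng = np.random.default_rng(0)
doy = np.tile(np.arange(365), 5)
flow = 300 + 2000 * np.exp(-((doy - 150) ** 2) / (2 * 25.0 ** 2)) + rng.normal(0, 40, doy.size)

X = doy.reshape(-1, 1)
# Stock settings apart from n_estimators=100.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, flow)
rf_mae = mean_absolute_error(flow, rf.predict(X))
```

Because trees can split on day-of-year, the forest tracks the seasonal shape that defeated the linear model.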
The results were a striking improvement- the random forest didn’t quite capture the nuances of the runoff, but it did track the general seasonal trend much better than a linear regression.
LSTM neural network
Now time for the newest, biggest, and baddest model- the neural network. LSTM neural networks can be useful for time-series prediction, although they have limitations. I built mine with Keras's LSTM layer.
I trained the model on the period 1987–2015 and evaluated it on the years 2016–2020. In later iterations, I will look more into better validation techniques for time series data, such as nested cross-validation.
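The key preprocessing step for an LSTM is turning the daily table into overlapping lookback windows. A minimal sketch- the 30-day lookback, five features, and layer sizes below are assumptions for illustration, not the settings used in the actual model:

```python
import numpy as np

def make_windows(data, lookback):
    """Slice a (days, features) array into overlapping windows of `lookback`
    days, each paired with the next day's discharge (assumed column 0)."""
    X, y = [], []
    for i in range(len(data) - lookback):
        X.append(data[i : i + lookback])
        y.append(data[i + lookback, 0])
    return np.array(X), np.array(y)

# Hypothetical shapes: ~2000 days of record, 5 features, 30-day lookback.
data = np.random.default_rng(0).normal(size=(2000, 5))
X, y = make_windows(data, lookback=30)

# Keras model sketch (requires tensorflow; layer sizes are assumptions):
# from tensorflow.keras import Sequential
# from tensorflow.keras.layers import LSTM, Dense
# model = Sequential([LSTM(32, input_shape=X.shape[1:]), Dense(1)])
# model.compile(optimizer="adam", loss="mae")
# model.fit(X[:1500], y[:1500], validation_data=(X[1500:], y[1500:]), epochs=50)
```

Note the train/validation split is a simple chronological cut, mirroring the 1987–2015 / 2016–2020 split above- shuffling would leak future information into training.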
Eventually, I arrived at a model with an R² of 0.98 and a mean absolute error of only ~50 CFS! That is head and shoulders above the other (much simpler) models I tried.
The craziest part is that I haven’t even incorporated any other weather stations or remote sensing data into the neural network.
I suspect that the previous day’s flow is contributing most to the prediction because the predicted peaks seem to lag the actual peaks by about a day.
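One way to check that suspicion is a persistence baseline- predict tomorrow's flow as today's flow. If a model's error is close to this number, it is probably leaning mostly on yesterday's flow. A sketch on the same kind of synthetic hydrograph as above (illustrative, not the real record):

```python
import numpy as np

# Synthetic seasonal hydrograph with noise.
rng = np.random.default_rng(1)
doy = np.arange(365)
flow = 300 + 2000 * np.exp(-((doy - 150) ** 2) / (2 * 25.0 ** 2)) + rng.normal(0, 40, 365)

# Persistence: tomorrow's prediction is simply today's observation.
persistence_mae = np.abs(flow[1:] - flow[:-1]).mean()
```

On a smooth hydrograph persistence is a hard baseline to beat, which is exactly why a model that merely lags the truth by a day deserves scrutiny.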
I’d like to do more investigation into how exactly the LSTM is coming up with the prediction, and visualize the feature importances.
Although my model performed decently well a day in advance, I’d like to model the flow in a longer forecast range (2–10 days out). I’ve started doing this with the LSTM, but I need to spend some more time on it.
I also want to incorporate more weather stations. NOAA operates several more stations in the area, and it will be quite interesting to see how the position of the station in the watershed changes the prediction.
I also want to incorporate satellite imagery as a feature. This is quite a bit more complicated, due to the large file sizes and acquiring the images in the first place. I’ve started building a pipeline to ingest Google Earth Engine data into my machine learning models.
Finally, looking at the model, it predicts the falling limbs of the hydrograph very well- but so can I, just intuitively. It is less able to predict abrupt rises driven by rapid snowmelt or rain events, and those are exactly the events where prediction matters most for hydropower, flood control, and public safety.
As always, there is more work to do!
Thanks for reading, and stay tuned for Part 2, where I will go through some of these next steps, especially incorporating satellite imagery.
You can view the notebooks I used on GitHub here.