Predicting the flow of the South Fork Payette River using an LSTM neural network

How to make a prediction with time-series data using machine learning.


I built an LSTM neural network that uses 30+ years of weather and streamflow data to predict tomorrow's streamflow quite accurately.

The problem with river forecasts

Water meets Idaho granite. 📷 Will Stauffer-Norris

The main reason I practice data science is to apply it to real-world problems. As a kayaker, I have spent many, many hours poring over weather forecasts, hydrologic forecasts, and SNOTEL station data to make a prediction about a river’s flow. There are good sources for this kind of prediction: NOAA runs river forecast centers covering each major river basin in the country, including the South Fork.

But these forecasts often fall short. In particular, I’ve noticed that the forecasts struggle with major rain events (flashy rivers in the Pacific Northwest are notoriously hard to predict), and they are typically issued only once or twice per day, which is often not frequent enough to keep up with rapidly changing mountain weather. NOAA also issues forecasts for only a select group of rivers. If you want a forecast for a smaller or more remote drainage, even if it’s gauged, you’re out of luck.

So I’m setting out to create a model that will meet or exceed NOAA’s forecasts, and build models for some drainages that are not covered by NOAA.

To start out, I’m benchmarking my model against an industry-standard model created by Upstream Tech.

The South Fork Payette is a great place to start, for several reasons:

  1. The South Fork above Lowman is undammed, so the confounding effects of reservoir operations are avoided.
  2. The USGS operates a gauge on the South Fork, NOAA has weather stations and a river forecast, and there are SNOTEL sites in the basin. There is a lot of easily accessible data to start with.
  3. I used to teach kayaking on the Payette and I’ve paddled almost every section of the river system, so I know the region and its hydrology well!
The North Fork of the Payette is legendary among kayakers. 📷 Will Stauffer-Norris
Idaho’s rivers are always in flux. 📷 Will Stauffer-Norris

The data

The Upstream Tech model I’m benchmarking against uses meteorological as well as remote sensing data to build the model. I haven’t incorporated any satellite imagery yet, although this is the next development in my model.

To start, I downloaded daily meteorological data from NOAA from a weather station on Banner Summit, which is at the headwaters of the South Fork. Eventually, I will incorporate more stations into my forecast, but I wanted to keep it simple for this first iteration. The metrics measured are:

  • Precipitation
  • Temperature (min and max)
  • Snow Depth
  • Snow Water Equivalent
  • Day of year

These are my predictive features. The data go back to 1987.

Next, I went to the USGS gauge at Lowman, Idaho, and grabbed the daily discharge for every day since 1987. In a more refined model, I might get hourly data, but I decided daily was good enough for this iteration.
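Pulling the discharge record can be scripted against the USGS NWIS daily-values web service. Here's a minimal sketch of building the request URL; the site number is my assumption for the Lowman gauge, and parameter code 00060 is daily discharge in CFS — verify both against the USGS site before relying on them.

```python
from urllib.parse import urlencode

# USGS NWIS daily-values endpoint.
NWIS_DV = "https://waterservices.usgs.gov/nwis/dv/"

def nwis_dv_url(site, start, end, param="00060"):
    """Build a request URL for daily discharge from a USGS gauge.

    00060 is the parameter code for discharge in cubic feet per second.
    """
    query = urlencode({
        "format": "json",
        "sites": site,
        "startDT": start,
        "endDT": end,
        "parameterCd": param,
    })
    return f"{NWIS_DV}?{query}"

# 13235000 is assumed here to be the SF Payette at Lowman gauge.
url = nwis_dv_url("13235000", "1987-01-01", "2020-12-31")
```

The JSON response can then be fetched with any HTTP client and flattened into a daily series.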

Discharge in CFS at the South Fork Payette at Lowman, 1987–2020
Rocky Mountain rivers are used for recreation as well as hydropower and irrigation. 📷 Will Stauffer-Norris


I merged the two datasets using pandas, creating a dataframe with features and a target variable (discharge).
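The merge itself is a one-liner. A minimal sketch with made-up column names (the real ones come from the NOAA and USGS downloads):

```python
import pandas as pd

# Hypothetical feature columns from the weather station download.
weather = pd.DataFrame({
    "date": pd.to_datetime(["1987-01-01", "1987-01-02"]),
    "prcp": [0.1, 0.0],
    "tmax": [30, 34],
})

# Target variable from the USGS gauge.
flow = pd.DataFrame({
    "date": pd.to_datetime(["1987-01-01", "1987-01-02"]),
    "discharge": [410, 405],  # CFS
})

# Inner join on date keeps only the days present in both datasets.
df = weather.merge(flow, on="date", how="inner").set_index("date")
```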

There were a few missing values in the meteorological data, so I imputed values to replace the NaNs. I then created a correlation matrix to see whether any features were strongly correlated and could be dropped. I dropped the average temperature reading, since the min and max temperature features were already present.
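A sketch of that cleanup step, on a toy frame (linear interpolation is one reasonable gap-filler for time-ordered data; the column names are stand-ins):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "tmin": [20.0, np.nan, 25.0, 28.0],
    "tmax": [35.0, 40.0, np.nan, 50.0],
    "tavg": [27.5, 33.0, 34.0, 39.0],
})

# Fill gaps by interpolating between neighboring days.
df = df.interpolate(limit_direction="both")

# Pairwise correlations: tavg tracks tmin/tmax almost exactly,
# so it adds little information and can be dropped.
corr = df.corr()
df = df.drop(columns=["tavg"])
```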

With the data cleaned up, it was time to start modeling.

Water in the American West is measured down to the last drop. 📷 Will Stauffer-Norris

The model

I started with a baseline: what would happen if you simply guessed the average discharge of the South Fork, about 800 CFS, every time? It turns out the average error is about 600 CFS. This is unacceptably large, as it’s almost the flow of the river itself!
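The baseline is just the mean absolute error of a constant guess. On a toy series:

```python
import numpy as np

# Toy discharge series in CFS (the real series averages ~800 CFS).
y = np.array([300.0, 500.0, 1500.0, 900.0, 400.0])

baseline = np.full_like(y, y.mean())   # guess the mean every day
mae = np.mean(np.abs(y - baseline))    # mean absolute error of that guess
```

Any real model has to beat this number to be worth anything.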

I knew I could do better. A lot better.

The red line is the baseline prediction of about 800 CFS. Period is the year 2019.

Linear regression

Linear regressions are very simple, but not a bad place to start getting my hands dirty. I used one, then two, then all of the features to see how well they would predict the flow of the South Fork. The answer: pretty badly.

A single feature linear regression based on the “Day of year” feature is just a sloped line that resets each year. Not too useful.
A two feature linear regression (based on “Day of year” and “Temperature”) is slightly more nuanced.
Using all eight features in a linear regression isn’t that much better.
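The multi-feature fit is a few lines of scikit-learn. A sketch on synthetic stand-ins for two of the features (the real data frame slots in for `X` and `y`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 365

# Hypothetical stand-ins: day of year plus a seasonal temperature signal.
day_of_year = np.arange(1, n + 1)
tmax = 50 + 30 * np.sin(2 * np.pi * day_of_year / 365) + rng.normal(0, 3, n)
X = np.column_stack([day_of_year, tmax])

# A fake discharge target with a spring peak, just to exercise the fit.
y = 800 + 900 * np.sin(2 * np.pi * (day_of_year - 60) / 365)

model = LinearRegression().fit(X, y)
pred = model.predict(X)
```

Because the model is a single sloped plane over the features, it can't bend around a hydrograph's peak, which is why the predictions look so crude.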

Random forest

OK, so linear regressions aren’t known to be the most powerful machine learning models out there. Time to bring out some more complicated stuff. I put all the features in a random forest model. I could have spent longer tweaking the hyperparameters, but I decided to just use the stock scikit-learn settings, with the exception of using 100 estimators.

The results were a striking improvement: the random forest didn’t quite capture the nuances of the runoff, but it tracked the general seasonal trend much better than a linear regression.
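Swapping in the random forest is a near drop-in replacement for the regression, with stock settings apart from `n_estimators=100`. A sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))                  # six stand-in features
y = X[:, 0] * 100 + rng.normal(0, 5, size=200) # fake discharge target

# Default scikit-learn settings except for the estimator count.
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred = rf.predict(X)
```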

A random forest model: getting closer to a decent prediction!
The Sawtooth Mountains, headwaters of the South Fork Payette. 📷 Will Stauffer-Norris

LSTM neural network

Now it’s time for the newest, biggest, and baddest model: the neural network. LSTM neural networks can be useful for time series prediction, although they have some limitations. I used the Keras LSTM model.

The model has some quirks (you must wrangle the data into a very specific shape to make it fit), and I found a few tutorials that were invaluable (the Keras documentation and Machine Learning Mastery).
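The "very specific shape" Keras expects is a 3-D array of (samples, timesteps, features), with each sample being a lookback window and the target being the next day's flow. The windowing itself is plain NumPy; a sketch of one way to do it:

```python
import numpy as np

def make_windows(features, target, lookback):
    """Frame a time series for an LSTM: each sample is a `lookback`-day
    window of features, and its target is the value on the following day."""
    X, y = [], []
    for i in range(len(features) - lookback):
        X.append(features[i:i + lookback])
        y.append(target[i + lookback])
    return np.array(X), np.array(y)

feats = np.arange(20, dtype=float).reshape(10, 2)  # 10 days, 2 features
flow = np.arange(10, dtype=float)                  # daily discharge

X, y = make_windows(feats, flow, lookback=3)
# X.shape == (7, 3, 2): 7 samples, a 3-day lookback, 2 features per day
```

The resulting `X` feeds straight into `model.fit` for a Keras model whose first layer is an LSTM.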

I trained the model on the period 1987–2015 and evaluated it on the years 2016–2020. In later iterations, I will look more into better validation techniques for time series data, such as nested cross-validation.
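With time series you can't shuffle before splitting, so the split is chronological. A sketch of the cut on a dummy frame:

```python
import numpy as np
import pandas as pd

# Dummy daily series spanning the full record.
idx = pd.date_range("1987-01-01", "2020-12-31", freq="D")
df = pd.DataFrame({"discharge": np.arange(len(idx), dtype=float)}, index=idx)

# Chronological split: fit on 1987-2015, evaluate on 2016-2020.
train = df.loc[:"2015-12-31"]
test = df.loc["2016-01-01":]
```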

Eventually, I managed to get a model with an R² of 0.98 and a mean absolute error of only ~50 CFS! This is head and shoulders above the other (quite simple) models I tried.

My model performance over time. The LSTM is a clear winner!

The craziest part is that I haven’t even incorporated any other weather stations or remote sensing data into the neural network.

I suspect that the previous day’s flow is contributing most to the prediction because the predicted peaks seem to lag the actual peaks by about a day.

I’d like to do more investigation into how exactly the LSTM is coming up with the prediction, and visualize the feature importances.

My LSTM model for the 2019 spring runoff (lead time one day).
Like the Idaho backcountry, there is always something more to explore with machine learning. 📷 Will Stauffer-Norris

Next steps

Although my model performed decently well a day in advance, I’d like to model the flow in a longer forecast range (2–10 days out). I’ve started doing this with the LSTM, but I need to spend some more time on it.

I also want to incorporate more weather stations. NOAA operates several more stations in the area, and it will be quite interesting to see how the position of the station in the watershed changes the prediction.

I also want to incorporate satellite imagery as a feature. This is quite a bit more complicated, due to the large file sizes and acquiring the images in the first place. I’ve started building a pipeline to ingest Google Earth Engine data into my machine learning models.

Finally, the model predicts the falling limbs of the hydrograph very well, but so can I, just intuitively. It is less able to predict abrupt upswings driven by rapid snowmelt or a rain event. These are exactly the events where prediction is critically important for hydropower, flood control, and public safety.

As always, there is more work to do!

Thanks for reading, and stay tuned for Part 2, where I will go through some of these next steps, especially incorporating satellite imagery.

You can view the notebooks I used on GitHub here.
