Tuesday, August 21, 2018

Handling Forecast Data with SQLite

A timeseries is basically an array of values where every value is labeled with a timestamp or an interval. Forecast data is a bit different because there are several timestamps to consider. First there is the timestamp for which a value is valid, called the valid time. Then there is the timestamp at which the forecast was created, called the basetime. The time delta between the basetime and the valid time is the step, and it matters because forecasts usually get worse the further you go into the future. That is why, in order to evaluate forecasts, you often want to group data by the step. If your forecast is a combination of several forecasts, you may want to store their basetimes as well, which can lead to a lot of timestamp labels in your data.
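
To make these labels concrete, here is a minimal sketch using pandas (the timestamps are just illustrative values):

import pandas as pd

basetime = pd.Timestamp('2018-08-20 10:00:00')   # when the forecast was issued
step = pd.Timedelta('15min')                     # lead time of the forecast

# the valid time is the basetime shifted by the step
validtime = basetime + step
print(validtime)              # 2018-08-20 10:15:00

# conversely, the step can be recovered from the two timestamps
print(validtime - basetime)   # 0 days 00:15:00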

When we work with forecasts, we often store our data in netCDF files, since they are very well suited for multi-dimensional data and have great software support through packages like xarray and dask. However, if your application writes, deletes and changes data often, netCDF may not be the storage container you're looking for. Instead, we turned to a classic and one of the most widely used database engines in the world: SQLite.

Like all SQL databases, SQLite stores data in tables. For our example, let's assume we store forecast data using a schema with basetime, step, validtime and value columns. Here is what our data could look like:


   basetime             step      validtime               value
0  2018-08-20 10:00:00  00:15:00  2018-08-20 10:15:00   0.000000
1  2018-08-20 10:00:00  00:30:00  2018-08-20 10:30:00   3.826834
2  2018-08-20 10:00:00  00:45:00  2018-08-20 10:45:00   7.071068
3  2018-08-20 10:00:00  01:00:00  2018-08-20 11:00:00   9.238795
4  2018-08-20 10:00:00  01:15:00  2018-08-20 11:15:00  10.000000
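
For illustration, the sample values above happen to follow a scaled sine curve. A hypothetical snippet that builds such a table with pandas could look like this:

import numpy as np
import pandas as pd

basetime = pd.Timestamp('2018-08-20 10:00:00')
steps = pd.timedelta_range('15min', periods=5, freq='15min')

sample = pd.DataFrame({
    'basetime': basetime,
    'step': steps,
    'validtime': basetime + steps,
    'value': 10 * np.sin(np.arange(5) * np.pi / 8),
})
print(sample)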



We use Python's sqlite3 package to interact with our database: we write the SQL queries as strings and execute them via sqlite3's DB-API. The queries can then be wrapped in Python functions for easy use. Here is how we would connect to our database and wrap basic insert and update queries in Python functions:

from sqlite3 import dbapi2 as sqlite
import numpy as np

con = sqlite.connect('forecast_data.db')

con.execute(("create table if not exists forecast"
             "(basetime, step, validtime, value)"))
con.commit()

def insert_into_db(basetime, step, validtime, value):
    """
    Writes a value into the SQLite db.
    NaNs are written as NULLs.
    """
    value = value if not np.isnan(value) else 'NULL'
    arg = ("insert into forecast "
           "(basetime, step, validtime, value) "
           "values ('%s', '%s', '%s', %s)"
           %(basetime, step, validtime, value))
    con.execute(arg)
    con.commit()

def update_value(basetime, step, validtime, value):
    """
    Executes an update query.
    """
    value = value if not np.isnan(value) else 'NULL'
    arg = ("update forecast set "
           "(basetime, step, validtime, value) = "
           "('%s', '%s', '%s', %s) "
           "where basetime='%s' and step='%s"
           %(basetime, step, validtime, value, basetime, step))
    con.execute(arg)
    con.commit()
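
With these helpers in place, writing and later correcting a row could look like this (the values are taken from the sample table above):

# insert the first sample row
insert_into_db('2018-08-20 10:00:00', '00:15:00',
               '2018-08-20 10:15:00', 0.0)

# later, overwrite the value for that basetime and step
update_value('2018-08-20 10:00:00', '00:15:00',
             '2018-08-20 10:15:00', 0.5)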


When retrieving forecasts from the database, we use "select ... from ... where" queries, which allow us to select different forecasts or to group forecast data by the step as described above. We also usually transform the timeseries data into xarray.DataArrays or pandas.Series for further analysis. Here is how those queries can be wrapped:

import pandas as pd

def select_from_where(arg):
    """
    Executes a select query and returns
    a pd.Series with validtime as index.
    """
    retval = con.execute(arg).fetchall()

    # column order is (basetime, step, validtime, value)
    val = [row[3] for row in retval]
    time = [pd.Timestamp(row[2]) for row in retval]
    return pd.Series(val, index=time)


def retrieve_a_timeseries_by_basetime(basetime):
    """
    Retrieves forecasts.

    Parameters
    ----------
    basetime : string or pd.Timestamp
    """
    arg = "select * from forecast where basetime='%s'" % (basetime)
    return select_from_where(arg)


def retrieve_a_timeseries_by_step(step):
    """
    Retrieves forecasts.

    Parameters
    ----------
    step : string or pd.Timedelta
    """
    arg = "select * from forecast where step='%s'" % (step)
    return select_from_where(arg)
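
Tying this back to the evaluation use case from the beginning, these wrappers let us pull out either a single forecast run or all forecasts for a fixed lead time (assuming the step strings match the format they were written with):

# all values of the forecast issued at 10:00
fc_run = retrieve_a_timeseries_by_basetime('2018-08-20 10:00:00')

# all 15-minute-ahead values, regardless of when they were issued;
# comparing such a series against observations gives per-step error metrics
fc_15min = retrieve_a_timeseries_by_step('00:15:00')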

In conclusion, SQLite can be great for storing forecast data when you need an easy and dynamic storage container as a back-end for an application such as a monitoring site. Since everything gets stored in a single file, it is portable and easy to set up. The developers have a great when-to-use page where they lay out some general use cases for SQLite.

1 comment:

  1. Also I found this post about storing financial forecasts using RSQLite:
    https://www.r-bloggers.com/storing-forecasts-in-a-database/

    It's a bit dated but it's somewhat similar to what we do with our weather forecasts.
