I have some quarterly data that I need to convert to monthly in order to work with another data set. The data looks like this:
Date Value
1/1/2010 100
4/1/2010 130
7/1/2010 160
What I need to do is impute the values for the missing months so that it looks like this:
Date Value
1/1/2010 100
2/1/2010 110
3/1/2010 120
4/1/2010 130
5/1/2010 140
6/1/2010 150
7/1/2010 160
Couldn't find many previous questions on how to do this. Only the reverse (monthly to quarterly). I tried one of those methodologies in reverse, but it didn't work:
pd.PeriodIndex(df.Date, freq='M')
What would be the easiest way to go about doing this in Pandas?
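One possible approach, offered as a minimal sketch rather than a definitive answer: parse the dates, resample to month-start frequency so the missing months appear as empty rows, then interpolate linearly. The sketch assumes the dates are month/day/year strings and that straight-line interpolation between quarters is what is wanted. The names `quarterly` and `monthly` are illustrative only.

```python
import pandas as pd

# Sample data from the question (dates assumed to be month/day/year strings).
quarterly = pd.DataFrame({
    "Date": ["1/1/2010", "4/1/2010", "7/1/2010"],
    "Value": [100, 130, 160],
})
quarterly["Date"] = pd.to_datetime(quarterly["Date"], format="%m/%d/%Y")

monthly = (
    quarterly.set_index("Date")
    .resample("MS")                 # "MS" = month-start frequency
    .asfreq()                       # new months appear as NaN rows
    .interpolate(method="linear")   # fill NaNs on a straight line
    .reset_index()
)
print(monthly)
```

Because the months are treated as equally spaced steps, each quarter's change is split evenly across the intermediate months.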
Note: Pandas version 0.20.1 (May 2017) changed the grouping API. This post reflects the functionality of the updated version.
Anyone working with data knows that real-world data is often patchy and that cleaning it takes up a considerable amount of your time (80/20 rule, anyone?). Having recently moved from Pandas to PySpark, I was used to the conveniences that Pandas offers and that PySpark sometimes lacks due to its distributed nature. One feature I have learned to particularly appreciate is the straightforward way Pandas provides for interpolating (or in-filling) time series data. This post demonstrates this capability in a simple, easily understandable way, using the example of sensor read data collected in a set of houses. The full notebook for this post can be found in my GitHub.
Preparing the Data and Initial Visualization
First, we generate a pandas data frame df0 with some test data. We create a mock data set containing two houses and use a sin and a cos function to generate some sensor read data for a set of dates. To generate the missing values, we randomly drop half of the entries.
import random

import numpy as np
import pandas as pd

# Two houses, each with 31 daily reads: a sine wave and a cosine wave
data = {'datetime': pd.date_range(start='1/15/2018',
                                  end='02/14/2018',
                                  freq='D')
                      .append(pd.date_range(start='1/15/2018',
                                            end='02/14/2018',
                                            freq='D')),
        'house': ['house1' for i in range(31)]
                 + ['house2' for i in range(31)],
        'readvalue': [0.5 + 0.5*np.sin(2*np.pi/30*i)
                      for i in range(31)]
                     + [0.5 + 0.5*np.cos(2*np.pi/30*i)
                        for i in range(31)]}

# Column names must match the keys of the data dictionary
df0 = pd.DataFrame(data, columns=['datetime', 'house', 'readvalue'])

# Randomly drop half the reads
random.seed(42)
df0 = df0.drop(random.sample(range(df0.shape[0]),
                             k=int(df0.shape[0]/2)))
This is what the resulting table looks like:
Raw read data with missing values
The plot below shows the generated data: a sin and a cos function, both with plenty of missing data points.
We will now look at three different methods of filling in the missing read values: forward-filling, backward-filling and interpolating. Remember that it is crucial to choose an appropriate method for each task. Forecasting tasks require special care: we need to consider whether the data used for the fill will actually be available at the time of the forecast. For example, if you need to interpolate data to forecast the weather, you cannot interpolate today's weather using tomorrow's, since it is still unknown (logical, isn't it?).
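To make the forecasting caveat concrete, here is a small sketch on a single toy series (the values and the name `s` are made up for illustration). Forward-filling only ever looks at earlier reads, while backward-filling and linear interpolation both rely on a read that arrives later.

```python
import numpy as np
import pandas as pd

# Hypothetical daily reads with a two-day gap.
idx = pd.date_range("2018-01-01", periods=5, freq="D")
s = pd.Series([1.0, np.nan, np.nan, 4.0, 5.0], index=idx)

filled_forward = s.ffill()       # uses only the last known (past) value
filled_backward = s.bfill()      # uses the next known (future) value
filled_linear = s.interpolate()  # uses both the past and the future value

print(pd.DataFrame({"raw": s,
                    "forward": filled_forward,
                    "backward": filled_backward,
                    "linear": filled_linear}))
```

In a forecasting setting, only the forward-filled column could have been computed on the days inside the gap.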
Interpolation
To interpolate the data, we can use the groupby() function followed by resample(). First, however, we need to convert the read dates to datetime format and set them as the index of our dataframe:
df = df0.copy()
df['datetime'] = pd.to_datetime(df['datetime'])
df.index = df['datetime']
del df['datetime']
Since we want to interpolate for each house separately, we need to group our data by ‘house’ before we can use the resample() function with the option ‘D’ to resample the data to a daily frequency.
The next step is then to use mean-filling, forward-filling or backward-filling to determine how the newly generated grid is supposed to be filled.
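The grouping and resampling step can be tried on a tiny frame first. The sketch below uses made-up values and the illustrative names `toy` and `grid`. It selects the value column before resampling, which is a choice made here to keep the grouping key out of the result, not something the steps above require.

```python
import pandas as pd

# Toy reads (hypothetical values): house1 has no reads on Jan 2 and Jan 3.
toy = pd.DataFrame(
    {
        "house": ["house1", "house1", "house2", "house2"],
        "readvalue": [0.1, 0.4, 0.9, 0.7],
    },
    index=pd.to_datetime(
        ["2018-01-01", "2018-01-04", "2018-01-02", "2018-01-03"]
    ),
)
toy.index.name = "datetime"

# Each house gets its own daily grid, spanning its first to its last read.
grid = toy.groupby("house")["readvalue"].resample("D").mean()
print(grid)
```

The grid for each house only spans the dates between its first and last read, so days without a read show up as NaN inside that range and nothing is added outside it.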
mean()
Since we are strictly upsampling, calling mean() leaves every newly created slot without a read value as NaN:
df.groupby('house').resample('D').mean().head(4)
Filling using mean()
pad(): forward filling
Using pad() instead of mean() forward-fills the NaNs. In recent pandas releases, where pad() has been removed, ffill() is the equivalent call.
df_pad = df.groupby('house')\
           .resample('D')\
           .pad()\
           .drop('house', axis=1)
df_pad.head(4)
Filling using pad()
bfill(): backward filling
Using bfill() instead of mean() backward-fills the NaNs:
df_bfill = df.groupby('house')\
             .resample('D')\
             .bfill()\
             .drop('house', axis=1)
df_bfill.head(4)
Filling using bfill()
interpolate(): interpolating
If we want to interpolate the missing values, we need to do this in two steps. First, we generate the underlying data grid by using mean(). This generates the grid with NaNs as values. Afterwards, we fill the NaNs with interpolated values by calling the interpolate() method on the read value column, which interpolates linearly by default:
df_interpol = df.groupby('house')\
                .resample('D')\
                .mean()
df_interpol['readvalue'] = df_interpol['readvalue'].interpolate()
df_interpol.head(4)
Filling using interpolate()
Visualizing the Results
Finally, we can visualize the three different filling methods to get a better idea of their results. The opaque dots show the raw data; the transparent dots show the interpolated values.
We can see that in the top figure the gaps have been filled with the previously known value, in the middle figure with the next known value, and in the bottom figure with values linearly interpolated between the two. Note the kinks in the interpolated lines, which are due to the linearity of the interpolation. Depending on the task, we could use higher-order methods to avoid these kinks, but that would go too far for this post.
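The effect of the straight-line segments can also be checked numerically. The sketch below rebuilds the sine curve used for the first house, keeps only every fifth read (an assumed, regular gap pattern chosen for illustration, not the random drop used above), and measures how far the linear fill strays from the true curve. The names `truth`, `sparse` and `max_error` are illustrative.

```python
import numpy as np
import pandas as pd

days = pd.date_range("2018-01-15", periods=31, freq="D")
steps = np.arange(31)
truth = pd.Series(0.5 + 0.5 * np.sin(2 * np.pi / 30 * steps), index=days)

# Keep every fifth read and blank out the rest.
sparse = truth.copy()
sparse[steps % 5 != 0] = np.nan

# Straight lines between the remaining reads cut across the curve.
linear = sparse.interpolate(method="linear")
max_error = float((linear - truth).abs().max())
print(f"largest deviation from the true curve: {max_error:.3f}")
```

The deviation is largest near the peak and the trough of the wave, which is where the kinks are most visible.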
Summary
In this post we have seen how we can use Python’s Pandas module to interpolate time series data using either backfill, forward fill or interpolation methods.
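As a compact recap, the three fills can be placed side by side on a toy frame. This is a sketch under stated assumptions: the values are made up, the names `toy`, `grid`, `per_house` and `summary` are illustrative, and the fills are applied per house on the NaN grid via a second groupby on the index level, which is one way to do it rather than the exact calls used in the post.

```python
import pandas as pd

# Hypothetical reads: house1 misses Jan 2-3, house2 misses Jan 2.
toy = pd.DataFrame(
    {
        "house": ["house1", "house1", "house2", "house2"],
        "readvalue": [0.0, 0.3, 1.0, 0.8],
    },
    index=pd.to_datetime(
        ["2018-01-01", "2018-01-04", "2018-01-01", "2018-01-03"]
    ),
)

# Step 1: build the daily grid per house; missing days become NaN.
grid = toy.groupby("house")["readvalue"].resample("D").mean()

# Step 2: fill the grid house by house so values never cross houses.
per_house = grid.groupby(level="house")
summary = pd.DataFrame(
    {
        "raw": grid,
        "ffill": per_house.ffill(),
        "bfill": per_house.bfill(),
        "linear": per_house.transform(lambda s: s.interpolate()),
    }
)
print(summary)
```

Grouping again on the house level keeps each fill inside its own house, which matters as soon as a fill could otherwise reach across the boundary between two houses.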