Mastering Time Series Analysis and Forecasting in Alteryx
For those of you that missed my free webinar last week, “How to Create a Time Series Forecast in Alteryx,” (which you can watch the recording of here) I’ve written up a detailed step-by-step to walk you through time series analysis and forecasting.
A time series is a collection of observations taken over some duration of time. Usually, the observations are taken at equally spaced intervals, such as daily, monthly, or yearly. By sorting these observations chronologically, we can gain an understanding of how a variable has changed over time—its trends, seasonality, and other patterns.
A time series forecast is the act of attempting to predict future behavior based on past observations. Some time series data, like seasonal temperature variations, are relatively easy to predict. Other data, such as stock prices, are virtually unpredictable.
Through the rest of this post, we will briefly discuss important concepts in time series analysis before diving into using the time series tools in Alteryx Designer.
Time Series Components
Before we can proceed to building a time series model, let’s review some important terms and concepts.
A trend is a long-term increase or decrease in values. Trends do not need to be linear. Trends can reverse over time.
The chart above, showing Japan’s actual and projected population, shows an upward trend until about 2020, and then a downward trend between 2020 and 2100.
Seasonality is a pattern in the data that occurs at a fixed interval. Often, seasonal patterns are dictated by the time an observation occurs, such as time of the year or day of the week.
The chart above shows the average daily temperature for Yakutsk, Russia. The peaks and troughs in the chart occur every 12 months, corresponding to summer and winter.
A cyclic pattern occurs when observations rise and fall at non-fixed frequencies. Often, cycles last for multiple years, but we cannot know their duration in advance.
The chart above shows the number of Canadian lynx trapped between 1821 and 1934. At first glance, there appears to be a seasonal pattern. However, the peaks are not evenly spaced, with some being 3 to 5 years apart and others up to 9 or 10 years apart.
The last component to a time series is random noise. This is what is left over after trends, seasonality, and cycles have been accounted for. A time series with too much random noise will be very difficult to forecast.
The chart above depicts 10 random walks. Starting at step 0, each step has a 50% of moving up and a 50% of moving down. Notice that, for each of the lines, there are no discernible patterns. Unfortunately for stock traders, stock prices are similar to random walks, and basic time series forecasting is not helpful.
Building a Time Series Forecast in Alteryx
Alteryx comes with 7 time series tools out of the box. Using these tools, we can investigate and analyze a time series and then build a forecast model. Throughout the example problem below, we will use each of these tools, and explain their purpose and configuration.
The TS Plot tool is used to generate several time series plots, to aid in the understanding of the data and determine how to build a forecast model.
The TS Filler tool is used to fill in gaps in a date or datetime field. It is important that time series data not contain any gaps or missing data.
The ARIMA tool creates a univariate time series forecasting model using an autoregressive integrated moving average method.
The ETS tool creates a univariate time series forecasting model using an exponential smoothing method.
The TS Compare tool evaluates one or more time series forecast models created using either the ARIMA or ETS tools.
The TS Forecast tool generates forecasts for a set number of future periods from an ARIMA or ETS model.
TS Covariate Forecast
The TS Covariate Forecast tool generates forecasts from an ARIMA model that uses covariates. The number of periods to forecast is determined by the number of periods of covariate data provided.
You’re planning on taking a trip to South Florida’s lovely Everglades National Park. You want to go when the crowds are at their smallest, so you download data from the National Park Service. After some basic data cleaning, you wind up with this data set:
We have monthly visitation data between January 2020 and December 2016. Even from the snippet seen above, it looks like visitation spikes in the milder winter months and drops sharply during the heat of summer.
Step 1: Check for Missing Dates
Once the data is prepped, the first thing we should do is check that our data set has no missing dates. To do that, connect the data to a TS Filler tool. You will need to make sure that your date field is a Date or Datetime data type for the TS Filler tool to work.
The configuration menu is very simple. All you must do is select your date field—Visitation Date, here—and its interval and frequency. Since we have monthly data, we set the interval to “Month” and its increment is “1.” Run the workflow and the data set will now look like this:
The TS Filler tool creates two new columns: OriginalDateTime and FlagGeneratedRow. What you need to do is look through the data within the FlagGeneratedRow column. If you have any “True” values, those were missing from the original data set and will need to be addressed. In this example, however, we don’t have any missing records so we can proceed to the next step.
Tip: Any missing observations will appear as Null in the original data set. To replace these Null values with numbers, you have many options. You could use the Imputation tool to replace all Nulls with, say, the average value. A more accurate approach, however, may be to use the Multi-Row Formula tool to take the average of the previous and next values.
Step 2: Generate Time Series Plots
Once you have addressed any missing data, the next step is to use the TS Plot tool to generate plots to better understand the data. Connect your data to the TS Plot tool and then connect a Browse to its I output anchor. The first plot that you will see is the Time Series Plot. It is a line chart showing the raw values over time:
From this plot we can see what appears to be seasonality. There does not seem to be any upward or downward trend, but there are occasional spikes in visitation.
To get a better idea about seasonality, the next chart is perfect: the season plot.
Since we have monthly data, the season plot breaks the data down by year, with each year being a distinct line on the plot. From this view, we can easily see if there is any regular seasonality. Here, it looks like the winter months from November through March tend to have the most visitors, while June through September sees few visitors.
The next plot—really, a collection of plots—is the decomposition plot. It shows four cuts of the data:
The first plot shows the raw data. This is the same plot we saw earlier. The second plot, however, shows the seasonal component of the time series after any trend is mathematically removed. This view makes clarifies that the data is seasonal.
The third plot shows the trend in the data after any seasonality is removed mathematically. By removing any seasonality, we can more easily discern if the time series is trending up or down over a long term. In this example, there is no clear trend.
Finally, we have the remainder plot. This is what remains after any seasonality and trends are removed from the data. This is the random noise that we discussed earlier. Ideally, a time series will have a random noise component that is small and stays near 0 without large spikes.
The last plots are for autocorrelation and partial autocorrelation. The autocorrelation function plot (ACF) shows the relationship between an observation and all observations prior to it until some given point. For example, the autocorrelation between the last observation and 2 months ago would measure the relationship between the current month, one month prior, and two months prior.
The partial autocorrelation function plot (PACF) shows the relationship between two observations, but the calculation does not include intervening observations.
These plots help you understand the relationship between an observation and past observations, also known as lags. From the ACF plot, we can see a strong autocorrelation with the lag-1, lag-12, and lag-24. Conversely, there is strong negative autocorrelation with lag-6 and lag-18. These findings reinforce our conclusions about the seasonality of this data from earlier.
Step 3: Build Forecast Models
Now that we have filled in missing data and analyzed our time series, we can finally proceed to building some models. Alteryx gives us two time series models to play with: ARIMA and ETS.
ARIMA - Conceptual
ARIMA stands for AutoRegressive Integrated Moving Average.
- AutoRegressive: The variable of interest is compared against its past observations. These relationships drive predictions of future behavior.
- Integrated: Since ARIMA models are stationary, the data must be transformed by a process known as differencing if there is any trend or seasonality present.
- Moving Average: Shows that the variable of interest depends linearly on current and past values. That is, all past observations are given the same weight in the forecast model.
Non-seasonal ARIMA models are specified using three parameters:
- p: order of the autoregressive part
- d: degree of first differencing involved
- q: order of the moving average part
To further confuse matters, seasonal ARIMA models add P, D, Q, and m parameters to denote seasonal effects. The good thing is that Alteryx leverages a package in R that can assign all these parameters automatically.
ETS - Conceptual
ETS is an exponential smoothing time series model and stands for Error, Trend, Seasonality. All ETS models are non-stationary, meaning that they assume the time series includes some trend or seasonality (hence the name).
A major difference between ARIMA and ETS is that ETS models provide exponentially lower weights to observations as they get older (as opposed to linear weights, as in the ARIMA model).
Thankfully, as with the ARIMA model, Alteryx leverages packages that automatically find the best parameters so you do not need to assign them manually unless you want to.
ARIMA Model in Alteryx
To build an ARIMA model, simply connect the data set to an ARIMA tool. Configuring the tool is very easy: give it a name (no spaces or special characters allowed), identify the target field, and the target field frequency.
The model customization tab is entirely optional. Use the options on this tab if you want to either control how Alteryx automatically chooses parameter values or to manually configure the entire model.
If you are interested in learning how to determine the values for these parameters, we highly recommend reading the ARIMA chapter of Forecasting: Principles and Practice.
The other options tab allows you to set the specific starting period for weekly, monthly, quarterly, or yearly data.
ETS Model in Alteryx
Configuring an ETS model in Alteryx is nearly identical to the ARIMA tool. Connect an ETS tool to your data flow, give the model a name (remember, no space or special characters), identify the target field and the target field frequency.
The configuration differences between the ARIMA and ETS tools lie in the tab labeled Model Type. Unlike the ARIMA tool, which provides options for many parameters and other model attributes, the ETS tool is much simpler. By default, the error, trend, and seasonality are set to auto. Alteryx iterates through the parameters to find the best model fit, so auto is generally a good choice.
ARIMA and ETS Outputs
Like many predictive tools in Alteryx, the ARIMA and ETS tools have three output anchors: O, R, and I.
- O: Contains model information. This output is only used to connect to other predictive suite tools (and the Union tool, as we will see).
- R: Displays a static output of model parameters, error statistics, and a basic forecast.
- I: Shows an interactive forecast plot along with the same error statistics.
There are some slight differences in the output reports from the two models, but both show a series of error statistics, a 6-month forecast, and model parameters. Keep in mind that, for the most part, the error statistics presented don’t tell you much on their own.
Step 4: Compare Time Series Models
In general, the best practice when developing a time series forecast is to compare multiple models against one another. In this example, we will build both an ARIMA and an ETS model using the default settings, and compare them using the TS Compare tool.
The most important thing to remember with the TS Compare tool is that it needs to compare a model’s forecast against real data. That is, you must create a training set and a testing set to analyze which time series model is more suitable.
When dealing with time series data, the easiest way to create these training and testing sets is via the Filter tool. Since our data runs between the years 2000 and 2016, we will build the models on data through 2014 and compare the models with data from 2015 to 2016.
Once you have used the Filter tool to create these sets, connect the True output anchor to the time series models you want to compare. Then, you must connect their O output anchors to a Union tool. Finally, connect the Union tool to one input and the False output anchor from the Filter tool to the other input of the TS Compare tool. The order in which you connect them does not matter.
The best thing about the TS Compare tool is that it needs no configuration. The only options available are graphics options, which merely change the way the R output is displayed.
The three outputs for the TS Compare tool show the same information. The easiest output to read is the O output, seen below.
Each model that was connected to the Union tool occupies one row in the output. The remaining columns display a series of error statistics:
- ME: Mean Error
- RMSE: Root Mean Squared Error
- MAE: Mean Absolute Error
- MPE: Mean Percentage Error
- MAPE: Mean Absolute Percentage Error
- MASE: Mean Absolute Scaled Error
What you are looking for here is the model whose error statistics are closest to zero. In this example, the ETS model outperforms the ARIMA in four of the six statistics. To finish this example, we will use the ETS model.
Step 5: Create a Forecast
At this point, we have examined the data, made sure we had no missing observations, and compared models. Finally, we are ready to create a forecast with our chosen model.
Make sure you connect your full data set to your chosen model, and not the training set that we used in the previous step. Once you have connected the data to your model, connect your model to the TS Forecast tool.
The configuration menu for the TS Forecast tool is very simple. You provide a name for the forecast—again, no spaces or special characters—elect which confidence bounds you want to show, and the number of periods to create a forecast for.
The three outputs for the TS Forecast tool show the forecast and upper and lower confidence interval values for the specified number of time periods. The O output shows a basic data table, whereas the R and I outputs show a graphical representation of the forecast values.
In the image below, we have zoomed in on a subset of the data in the I output. The I output allows you to hover over the chart to reveal that date’s values and zoom in and out interactively. For time series with a large range of dates, being able to zoom in is very helpful.
Bonus: TS Covariate Forecasts
Up to this point, everything we have looked at concerned only one variable: monthly park visitation. The forecast was based entirely on patterns of past data. What if, however, you have other variables that you think may contribute to this pattern?
Enter: TS Covariate Forecast. This tool, when used with the ARIMA tool, allows us to build a model that uses covariates, or other explanatory variables. Note that this does not work with the ETS tool.
The tricky thing with the TS Covariate Forecast tool is that it only produces a forecast for periods in which you have covariate observations. Say, for example, you want to use monthly temperature averages between the years 2000 and 2020. Recalling that our parks data goes through 2016, we can produce a covariate forecast up through 2020. If we wanted to push that forecast outward, we would first need to produce forecast values for the temperatures.
In the workflow above, notice that we joined our park visitation data with weather data. We then created a filter at the year 2016. All data until 2016 was passed into the ARIMA tool (via the True output anchor) and all remaining data—from 2017 to 2020—went directly into the TS Covariate Forecast tool.
Tip: Similar to the TS Compare tool, it does not matter which data streams connect to which input anchors for the TS Covariate Forecast tool.
Taking a step back momentarily, the setup for the ARIMA tool only requires one change from how we used it earlier. You must select the “Use covariates in model estimation?” box and then select which columns contain your covariates.
The configuration for the TS Covariate Forecast tool could not be easier. All you must do is give it a name (no spaces or special characters!) and determine the bounds for your confidence intervals.
Like most other predictive tools, the TS Covariate Forecast tool offers three outputs: O, R, and I. Like we have already seen, the O output generates a data table that can be used with other Alteryx tools. The R and I output display a graphical forecast.
We provided the ARIMA model with data from 2000 to 2016. We then passed covariate data into the TS Covariate Forecast tool from 2017 to 2020. Thus, the tool will give us a forecast with 4 years—48 periods—of data.
Time series analysis and forecasting is a fascinating field. In it, we take a series of observations and, through relatively straightforward mathematics, produce predictions about future behavior. Of course, it is not a perfect science. As in any field of analysis, you must ensure that you understand the data and the problem prior to throwing a model together.
Throughout this article we have examined each of the built-in time series tools that Alteryx Designer currently offers. We hope that it will serve as a helpful guide for anybody looking to use these powerful and fun tools.
Lastly, if you’re interested in seeing time series analysis and forecasting in action, be sure to watch the recording of my most recent webinar: “How to Create a Time Series Forecast in Alteryx.”