How to detect the unexpected? Is the behaviour of some measured value normal or did something unexpected happen?
To answer these questions we need to detect anomalous behaviour in a time series. In this article I want to show you how we can do this with Prophet.
Prophet is a time-series forecasting library published by Facebook, available for Python and R.
So for anomaly detection we train our model on all known values except the last n. Then we predict the last n values and compare the predictions with the truth. If the truth falls outside the prediction's uncertainty interval, we call it an anomaly.
Here’s an example with some data of website usage.
First let’s have a look at the data.
The data consists of the weekly number of sessions of a website. Each week is stamped with the date of the Monday of the following week.
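The original post prints the data at this point. The data set itself isn't included here, so this is a stand-in with the same shape (the column names `week` and `sessions` are my assumption):

```r
library(dplyr)

# Stand-in for the real data: one row per week, stamped with the
# Monday of the following week.
set.seed(42)
df <- tibble(
  week     = seq(as.Date("2019-01-07"), as.Date("2021-01-04"), by = "1 week"),
  sessions = rpois(length(week), lambda = 1000)
)

head(df)
```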
You can clearly see the drop in website usage at the end of each year. During the summer there's also less usage. In spring 2020 there's an all-time high. (Guess that's when Covid-19 started to spread.)
Predictions with Prophet
Prophet predicts time series data. We want to predict the last number_of_weeks weeks and compare the predictions with the true values. So we have to compute the last date (end_date) we can use for training.
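A sketch of that computation, again with stand-in data (the names `df`, `number_of_weeks` and `end_date` follow the article's wording):

```r
# Stand-in data: weekly session counts, stamped on Mondays.
set.seed(42)
df <- data.frame(
  week     = seq(as.Date("2020-01-06"), by = "1 week", length.out = 52),
  sessions = rpois(52, lambda = 1000)
)

number_of_weeks <- 4

# Everything up to end_date is used for training; the last
# number_of_weeks weeks are held out and predicted afterwards.
end_date <- max(df$week) - 7 * number_of_weeks
```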
Prophet expects a data.frame with a column named ds containing the date and a column y containing the dependent value. So let's transform our data:
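A sketch of the transformation (with the same stand-in data as above):

```r
library(dplyr)

set.seed(42)
df <- data.frame(
  week     = seq(as.Date("2020-01-06"), by = "1 week", length.out = 52),
  sessions = rpois(52, lambda = 1000)
)
number_of_weeks <- 4
end_date <- max(df$week) - 7 * number_of_weeks

# Prophet wants the date in `ds` and the value in `y`.
df_prophet <- df %>% rename(ds = week, y = sessions)

# Training data: everything up to end_date.
df_train <- df_prophet %>% filter(ds <= end_date)
```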
So let’s train our model.
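Fitting might look like this; switching off the sub-weekly seasonalities for weekly data is my assumption, not necessarily the post's exact call:

```r
library(prophet)

# df_train: data.frame with columns ds (Mondays) and y (sessions),
# as built in the previous step.
m <- prophet(df_train,
             yearly.seasonality = TRUE,
             weekly.seasonality = FALSE,  # the data is weekly, so there is
             daily.seasonality  = FALSE)  # no sub-weekly signal to model
```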
Now that our model is fitted let’s use it to do some predictions.
First we need to generate a data.frame containing the dates we want to predict. Prophet provides a handy function for this: make_future_dataframe. We also only want to predict Mondays because our training data only consists of Mondays.
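With the fitted model `m` and the hold-out length `number_of_weeks` from above, a sketch:

```r
library(prophet)
library(dplyr)

# History plus the held-out weeks, in weekly steps.
future <- make_future_dataframe(m, periods = number_of_weeks, freq = "week")

# Keep only Mondays (wday == 1); as.POSIXlt avoids
# locale-dependent weekday names.
future <- future %>% filter(as.POSIXlt(ds)$wday == 1)
```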
The predict function returns the predictions for each row. The resulting forecast contains the point forecast (yhat) and an uncertainty interval (yhat_lower, yhat_upper).
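A sketch, continuing with `m` and `future` from the steps above:

```r
library(dplyr)

forecast <- predict(m, future)

# Point forecast plus the bounds of the uncertainty interval:
forecast %>%
  select(ds, yhat, yhat_lower, yhat_upper) %>%
  tail(4)
```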
Prophet provides a simple visualization of the prediction and the uncertainty interval:
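The built-in plot takes the model and the forecast:

```r
# Static plot of history, forecast line and uncertainty ribbon.
plot(m, forecast)

# There's also an interactive dygraphs version:
dyplot.prophet(m, forecast)
```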
But we can use ggplot2 as well:
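A minimal ggplot2 sketch, assuming `forecast` and the transformed data `df_prophet` from above:

```r
library(ggplot2)

ggplot() +
  geom_ribbon(data = forecast,
              aes(x = as.Date(ds), ymin = yhat_lower, ymax = yhat_upper),
              fill = "grey85") +
  geom_line(data = forecast,
            aes(x = as.Date(ds), y = yhat), colour = "steelblue") +
  geom_point(data = df_prophet, aes(x = ds, y = y), size = 1) +
  labs(x = NULL, y = "Sessions")
```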
We can even build a function to highlight good and bad predictions:
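One way such a function could look (the name `plot_forecast` and the colours are my choice): a prediction counts as "bad" when the truth leaves the uncertainty interval.

```r
library(dplyr)
library(ggplot2)

plot_forecast <- function(df, forecast) {
  plot_data <- forecast %>%
    mutate(ds = as.Date(ds)) %>%
    inner_join(df, by = "ds") %>%
    # Flag points outside the uncertainty interval.
    mutate(anomaly = y < yhat_lower | y > yhat_upper)

  ggplot(plot_data, aes(x = ds)) +
    geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper), fill = "grey85") +
    geom_line(aes(y = yhat), colour = "steelblue") +
    geom_point(aes(y = y, colour = anomaly), size = 1) +
    scale_colour_manual(values = c(`FALSE` = "darkgreen", `TRUE` = "red"))
}

plot_forecast(df_prophet, forecast)
```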
We can also get a visualization of the components:
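Prophet ships a components plot showing the trend and the seasonalities separately:

```r
prophet_plot_components(m, forecast)
```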
Prophet can also account for holidays, special dates which influence the dependent variable. It contains some predefined holidays.
But in our example we need to normalize them (or "mondify" them, as I call it) because our time series only consists of Mondays.
So here’s a lengthy function defining holidays in Germany:
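The post's full function is much longer; this is a condensed sketch of the idea, with two illustrative dates instead of the complete German holiday list:

```r
# Shift each holiday to the Monday on which its week is reported
# (the series stamps each week with the Monday of the following week).
mondify <- function(dates) {
  wd    <- as.POSIXlt(dates)$wday   # 0 = Sunday, 1 = Monday, ...
  shift <- (8 - wd) %% 7
  shift[shift == 0] <- 7            # a Monday holiday counts for the next stamp
  dates + shift
}

# Prophet expects a data.frame with columns `holiday` and `ds`.
holidays <- data.frame(
  holiday = c("new_year", "christmas"),
  ds      = mondify(as.Date(c("2020-01-01", "2020-12-25")))
)
```

The table is then passed to the model via prophet's holidays argument, e.g. `prophet(df_train, holidays = holidays)`.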
As we can see the drops at New Year are slightly better predicted.
Consent Layer or What happened in 2020?
In late 2020 the predictions are too high; the truth is much lower. So what happened in reality?
The answer is simple: because of the GDPR a consent layer was implemented, asking the user whether she accepts tracking via Google Analytics or declines it.
When she declined it she could still access the website, but she was no longer tracked. So it seemed there were fewer sessions.
So how can we adjust the model?
We can add an additional regressor which indicates whether the consent layer was active or not.
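A sketch of how this could look — the column name `consent_layer` and the go-live date are assumptions; with an extra regressor the model is built first, then fitted:

```r
library(prophet)
library(dplyr)

consent_start <- as.Date("2020-10-01")   # illustrative go-live date

# Flag the weeks in which the consent layer was active
# (df_train as built earlier).
df_train <- df_train %>%
  mutate(consent_layer = as.integer(ds >= consent_start))

m <- prophet()
m <- add_regressor(m, "consent_layer")
m <- fit.prophet(m, df_train)
```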
There are two ways to add the additional regressor:
- additive and
- multiplicative.
The difference is whether the effect of the regressor is added to or multiplied with the base prediction. In our use case I think multiplicative is a good choice because in reality a certain fraction of all users will decline the tracking pixel.
Additive Additional Regressor
Multiplicative Additional Regressor
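For the multiplicative variant, `mode = "multiplicative"` is passed to add_regressor. Note that the future data.frame needs the regressor column as well (`consent_start` as assumed above):

```r
library(prophet)
library(dplyr)

m <- prophet()
m <- add_regressor(m, "consent_layer", mode = "multiplicative")
m <- fit.prophet(m, df_train)

future <- make_future_dataframe(m, periods = number_of_weeks, freq = "week") %>%
  filter(as.POSIXlt(ds)$wday == 1) %>%
  mutate(consent_layer = as.integer(as.Date(ds) >= consent_start))

forecast <- predict(m, future)
```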
The Anomaly Detection
So is there any anomaly during the last number_of_weeks weeks?
Let’s pimp our plotting function:
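A sketch of such a function (name and styling are my choice): held-out weeks are drawn as triangles, anomalies in red.

```r
library(dplyr)
library(ggplot2)

plot_anomalies <- function(df, forecast, end_date) {
  plot_data <- forecast %>%
    mutate(ds = as.Date(ds)) %>%
    inner_join(df, by = "ds") %>%
    mutate(predicted = ds > end_date,                    # held-out weeks
           anomaly   = y < yhat_lower | y > yhat_upper)  # outside the corridor

  ggplot(plot_data, aes(x = ds)) +
    geom_ribbon(aes(ymin = yhat_lower, ymax = yhat_upper), fill = "grey85") +
    geom_line(aes(y = yhat), colour = "steelblue") +
    geom_point(aes(y = y, colour = anomaly, shape = predicted), size = 2) +
    # 16 = circle for training weeks, 17 = triangle for predicted weeks
    scale_shape_manual(values = c(`FALSE` = 16, `TRUE` = 17)) +
    scale_colour_manual(values = c(`FALSE` = "darkgreen", `TRUE` = "red"))
}
```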
So we've trained our model without the last number_of_weeks weeks. Now we predict these weeks. The predictions are shown as triangles.
As we can see, two weeks were slightly better than predicted, while the other two fall within the prediction corridor.
So there was no big anomaly within the last four weeks.
Originally published at https://rstats-tips.net/2021/01/01/anomaly-detection-with-prophet/.