I recently used this technique as a quick and dirty way to find anomalies in time series data and thought I would share.
Let’s say you have simple time series data, event counts grouped by day:
SELECT
TO_CHAR(DATE_TRUNC('week', updates.date),'YYYY-MM-DD') AS "updates.date_week",
COUNT(*) AS "updates.count"
FROM public.updates AS updates
To find big ‘jumps’ in this data, you could use the 3 sigma rule. Basically any observations that fall outside of three standard deviations from the mean is considered an outlier.
mu = mean of the data
std = standard deviation of the data
IF abs(x-mu) > 3*std THEN x is outlier
To model this in a Look, I used table calculations. The first thing we need is the Standard Deviation of the count field. This is calculated by taking the square root of the average of the squared deviations of the values from their mean value.
This is pretty easy to define as a table calculation:
The data points you consider for the Mean , STD calculations matters… I feel baseline average should be calculated by averaging the hourly rollups (Example - 9:00 AM to 10:00 AM hourly rollups ) for the three Mondays within the three-week period.
Similarly, The standard deviation is calculated from the baseline average for rollups
The standard deviation provides a statistical indicator of how much variability exists in the population data that factored into the baseline average calculations.
•For hourly reporting, the standard deviation is calculated for the polled values.
•For daily reporting, the standard deviation is calculated for hourly averages.
•For weekly reporting and beyond, the standard deviation is calculated for the daily averages.
Hi, thank you for your answer @Sai_Satish .
I’d like to compare each day to the previous week’s similar days of the week(Monday to the last x Mondays, Friday to the last X Fridays…etc). Let’s say I have multiple users each one has a number of transactions per day during the last 10 weeks period. Then, I want to compare this Mondaytransactions_count to the prior 10 Mondays data by averaging the number of transaction-during the whole past 10 Mondays(I used std_dev and mean too to detect outliers).
Now, if I understood you correctly, I have to compare my actual Monday data to the mean(hourly)data of the prior Mondays.!