Are High Frequency Trading Data Returns Normally Distributed?

In this post we are going to analyze high frequency trading data. High frequency trading data is basically intraday trading data, which includes 1 minute, 5 minute, 15 minute, 30 minute, 60 minute and 240 minute data. Daily and weekly data are statistically different from intraday high frequency data. We will be using Python to do the analysis and draw a few Q-Q plots to see whether high frequency returns are normally distributed or not. You should have Anaconda installed along with Spyder, as we will be using Spyder as our IDE.

In a previous post on a random walk with GBPUSD one fine morning, we tried to fit a time series model to GBPUSD intraday data. We found that the random walk model is the best model for a currency pair. We also found that returns are log normally distributed. In time series analysis, and for that matter in most statistical analysis, the normal distribution plays a very important part. So many quantitative trading models rely on this assumption of normality that we wanted to show you graphically that returns are in fact not normally distributed at all. Using log returns is very popular in the quantitative trading community even though the returns are not normally distributed.

A very important question that comes to mind is whether we should model returns or model price directly. By the end of this post we might be able to answer this question. Basically, we will calculate the skewness and kurtosis of the high frequency returns and compare them with those of the normal distribution. Before you begin, if you are not sure about skewness and kurtosis, watch this video below!

In time series analysis, most textbooks model log returns. The assumption is that when we model log returns we get a stationary series. Log returns are supposed to model price much better. But you will be surprised to see from the graphs that log returns are not normally distributed at all. The fallacious assumption that log returns are close to normally distributed has caused huge losses. When models based on such fallacious assumptions go wrong, you can imagine how things unravel. Did you read the post on Central Banks? So let’s start. We will start with 60 minute EURUSD data: read the data, then make a histogram and a Q-Q plot. Below is the code!

#import the libraries
import numpy as np
import pandas as pd
from scipy.stats import describe
import scipy.stats as stats
import matplotlib.pyplot as plt
#read the data from the csv file
data1 = pd.read_csv('E:/MarketData/EURUSD60.csv', header=None)
data1.columns=['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']
data1.shape
#show data
data1.head()
#calculate log returns
log_returns = np.log(data1['Close'] / data1['Close'].shift(1))
log_returns.head(10)
#plot histograms
log_returns.hist(bins=50, figsize=(9, 6))
#print statistics like standard deviation, skewness and kurtosis
log_data = np.array(log_returns.dropna())
describe(log_data, axis=0)
#draw a Q-Q Plot
f = plt.figure(figsize=(12,8))
ax = f.add_subplot(111)
stats.probplot(log_data, dist='norm', plot=ax)
plt.show();

When we run the above code in Spyder we get the following result:

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.stats import describe
>>> import scipy.stats as stats
>>> import matplotlib.pyplot as plt
>>> data1 = pd.read_csv('E:/MarketData/EURUSD60.csv', header=None)
>>> data1.columns=['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']
>>> data1.shape
(10881, 7)
>>> data1.head()
         Date   Time     Open     High      Low    Close  Volume
0  2015.03.18  07:00  1.06006  1.06006  1.05795  1.05795    1781
1  2015.03.18  08:00  1.05796  1.05962  1.05795  1.05930    2761
2  2015.03.18  09:00  1.05929  1.06190  1.05829  1.06119    7079
3  2015.03.18  10:00  1.06120  1.06133  1.05918  1.06085    9048
4  2015.03.18  11:00  1.06086  1.06142  1.05969  1.06095    8800
>>> log_returns = np.log(data1['Close'] / data1['Close'].shift(1))
>>> log_returns.head(10)
0         NaN
1    0.001275
2    0.001783
3   -0.000320
4    0.000094
5    0.001225
6   -0.002385
7   -0.000547
8    0.001839
9   -0.000226
Name: Close, dtype: float64
>>> log_returns.hist(bins=50, figsize=(9, 6))
<matplotlib.axes._subplots.AxesSubplot object at 0x000002308421DB70>
>>> log_data = np.array(log_returns.dropna())
>>> describe(log_data, axis=0)
DescribeResult(nobs=10880, minmax=(-0.020616511106809295, 0.014862239682094908), mean=5.9750651022161664e-07, variance=1.8141768606500153e-06, skewness=-0.44691463783206614, kurtosis=20.778529069037514)
>>> f = plt.figure(figsize=(12,8))
>>> ax = f.add_subplot(111)
>>> stats.probplot(log_data, dist='norm', plot=ax)
((array([-3.83140668, -3.60740668, -3.4844823 , ...,  3.4844823 ,
3.60740668,  3.83140668]), array([-0.02061651, -0.01574893, -0.01513318, ...,  0.01144143,
0.01275016,  0.01486224])), (0.0012401011632052561, 5.9750651022115226e-07, 0.9204352835421461))
>>> plt.show();

You can see in the above output that skewness for the 60 minute data is -0.45 and kurtosis is 20.78. A normal distribution has a skewness of 0. Note that scipy's describe reports excess (Fisher) kurtosis, which is 0 for a normal distribution; this is the same as a kurtosis of 3 under the Pearson definition. Negative skewness means the distribution is not symmetric like the normal distribution: the left tail is longer than the right. An excess kurtosis of 20.78 is far above the normal value of 0, so the tails of this 60 minute intraday data are much fatter than those of a normal distribution. Did you read the post on how to make custom timeframe candlestick charts? Below is the histogram of the log returns.
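The two kurtosis conventions are easy to mix up, so here is a minimal sketch of the difference. It uses a synthetic normal sample (an assumption for illustration, not market data) and compares scipy's Fisher and Pearson definitions with the value that describe returns.

```python
import numpy as np
from scipy import stats

# Synthetic, truly normal sample (an assumption for illustration, not market data)
rng = np.random.default_rng(42)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=200_000)

# Fisher (excess) kurtosis: a normal distribution gives a value near 0
excess_kurt = stats.kurtosis(normal_sample, fisher=True)
# Pearson kurtosis: a normal distribution gives a value near 3
pearson_kurt = stats.kurtosis(normal_sample, fisher=False)
# describe() reports the Fisher (excess) version
describe_kurt = stats.describe(normal_sample).kurtosis

print(f"excess (Fisher) kurtosis: {excess_kurt:.3f}")
print(f"Pearson kurtosis:         {pearson_kurt:.3f}")
print(f"describe() kurtosis:      {describe_kurt:.3f}")
```

The Pearson value is always the Fisher value plus 3, so the 20.78 reported for the 60 minute data would be about 23.78 on the Pearson scale.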

Histogram of Log Returns

Now we cannot visually judge the skewness and kurtosis from this histogram. So take a look at the Q-Q plot below and you can clearly see that the log returns are not normally distributed at all. If they were, the data would fall close to the red line, which is not the case here. The data falls away from the line at the two extremes of the Q-Q plot.

Q-Q Plot of Log Returns

You can clearly see that log returns are not normally distributed at all. A Q-Q plot is mostly used to check whether data is normally distributed or not. If the data falls on the red line, then it is approximately normally distributed. This is not the case here. Now let’s run the code on 1 minute data and measure the skewness and kurtosis of the 1 minute high frequency data.
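To get a feel for how a Q-Q plot separates normal from fat-tailed data, here is a small sketch on synthetic samples. The samples are assumptions for illustration: one drawn from a normal distribution and one from a Student-t distribution with 3 degrees of freedom, standing in for heavy-tailed returns. When called without a plot argument, probplot only returns the numbers behind the plot, including r, the square root of the coefficient of determination of the straight-line fit. This is the third value in the fit tuple of the probplot output shown earlier (about 0.92 for the 60 minute data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic samples (assumptions for illustration, not market data)
normal_sample = rng.normal(size=10_000)
heavy_tailed_sample = rng.standard_t(df=3, size=10_000)

# probplot returns ((theoretical quantiles, ordered data), (slope, intercept, r))
(_, _), (_, _, r_normal) = stats.probplot(normal_sample, dist='norm')
(_, _), (_, _, r_heavy) = stats.probplot(heavy_tailed_sample, dist='norm')

# r close to 1 means the points hug the fitted line; fat tails pull r down
print(f"r for normal sample:       {r_normal:.4f}")
print(f"r for heavy-tailed sample: {r_heavy:.4f}")
```

A value of r well below 1 is only a rough numerical hint; the plot itself shows where the departure happens, which for returns is in the tails.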

>>> import numpy as np
>>> import pandas as pd
>>> from scipy.stats import describe
>>> import scipy.stats as stats
>>> import matplotlib.pyplot as plt
>>> data1 = pd.read_csv('E:/MarketData/EURUSD1.csv', header=None)
>>> data1.columns=['Date', 'Time', 'Open', 'High', 'Low', 'Close', 'Volume']
>>> data1.shape
(65166, 7)
>>> data1.head()
         Date   Time     Open     High      Low    Close  Volume
0  2016.10.03  16:41  1.12267  1.12267  1.12255  1.12261      23
1  2016.10.03  16:42  1.12264  1.12267  1.12251  1.12258      16
2  2016.10.03  16:43  1.12259  1.12261  1.12259  1.12260      16
3  2016.10.03  16:44  1.12257  1.12262  1.12244  1.12262      23
4  2016.10.03  16:45  1.12261  1.12264  1.12238  1.12242      20
>>> log_returns = np.log(data1['Close'] / data1['Close'].shift(1))
>>> log_returns.head(10)
0         NaN
1   -0.000027
2    0.000018
3    0.000018
4   -0.000178
5   -0.000045
6   -0.000027
7    0.000009
8    0.000071
9   -0.000027
Name: Close, dtype: float64
>>> log_returns.hist(bins=50, figsize=(9, 6))
<matplotlib.axes._subplots.AxesSubplot object at 0x0000023084ECB160>
>>> log_data = np.array(log_returns.dropna())
>>> describe(log_data, axis=0)
DescribeResult(nobs=65165, minmax=(-0.0051586949893545966, 0.0028986949421451998), mean=-6.8887234075618529e-07, variance=2.4741456374521119e-08, skewness=-0.8421974705112031, kurtosis=52.469487809683066)
>>> f = plt.figure(figsize=(12,8))
>>> ax = f.add_subplot(111)
>>> stats.probplot(log_data, dist='norm', plot=ax)
>>> plt.show();

In the case of 1 minute high frequency data, skewness has grown in magnitude to -0.84 and kurtosis has increased to 52.47, which means that 1 minute data is more skewed and much more leptokurtic than 60 minute data. You can use the above code to check skewness and kurtosis for 5 minute and 15 minute high frequency data as well. So let’s get back to the basic question. Should we model log returns or should we model price directly? We think modelling price directly would be a better approach. More on that in later posts when we use deep learning to model price directly. Did you take a look at our Deep Learning For Traders Course? In this course on deep learning, we show you how to build deep learning models for predicting price.
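If you want to repeat the check on other timeframes, one option is to wrap the steps in a small function. The sketch below is only an illustration: summarize_log_returns is a hypothetical helper name, and the close prices are a synthetic random walk with normal increments rather than real EURUSD data, so both skewness and excess kurtosis should come out close to 0.

```python
import numpy as np
import pandas as pd
from scipy.stats import describe


def summarize_log_returns(close):
    """Hypothetical helper: nobs, skewness and excess kurtosis of log returns."""
    close = pd.Series(close, dtype=float)
    # Same log return calculation as in the code above
    log_returns = np.log(close / close.shift(1)).dropna()
    result = describe(log_returns.to_numpy())
    return {
        "nobs": result.nobs,
        "skewness": float(result.skewness),
        "excess_kurtosis": float(result.kurtosis),  # Fisher definition, 0 for normal
    }


# Synthetic close prices (an assumption for illustration, not market data):
# a random walk in log price with normally distributed steps
rng = np.random.default_rng(1)
steps = rng.normal(loc=0.0, scale=1e-4, size=50_000)
synthetic_close = 1.10 * np.exp(np.cumsum(steps))

summary = summarize_log_returns(synthetic_close)
print(summary)

# With real data, pass the Close column of the DataFrame instead, for example
# summarize_log_returns(data1['Close'])
```

Running the same function on the Close column of each timeframe gives numbers you can compare side by side with this normal baseline.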