#### 2.1.4 VPIN formula

VPIN is computed over n successive buckets, where n is the VPIN support. A buffer is defined as n successive buckets. The VPIN formula, approximating (1) at bucket number j (j ≥ n), is:

$$\text{VPIN}\_{j} = \frac{\sum\_{i=j-n+1}^{j} |V\_{bucket,i}^{b} - V\_{bucket,i}^{s}|}{nV\_{bucket}} \tag{4}$$

For a given bucket i:

$$\bullet \ V\_{bucket,i}^{\epsilon} = \sum\_{j \in bucket\_i} V\_j^{\epsilon}$$

$$\bullet \ V\_{bucket,i}^{b} = \sum\_{j \in bucket\_i} V\_j^{b}$$
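As a minimal sketch of equation (4), assuming the buy and sell volumes of each bucket are already available (for instance, obtained by summing bar volumes as in the definitions above), the rolling VPIN could be computed as follows. The function names `bucket_totals` and `rolling_vpin`, their parameters, and the sample volumes are illustrative assumptions and do not come from the original study.

```python
from typing import List, Sequence


def bucket_totals(bar_volumes: Sequence[float], bars_per_bucket: int) -> List[float]:
    """Sum consecutive bar volumes into per-bucket totals.

    Hypothetical helper: an incomplete trailing bucket is dropped.
    """
    if bars_per_bucket <= 0:
        raise ValueError("bars_per_bucket must be positive")
    n_full = len(bar_volumes) // bars_per_bucket
    return [
        sum(bar_volumes[k * bars_per_bucket:(k + 1) * bars_per_bucket])
        for k in range(n_full)
    ]


def rolling_vpin(buy: Sequence[float], sell: Sequence[float],
                 n: int, bucket_volume: float) -> List[float]:
    """Return VPIN_j of equation (4) for every bucket j >= n.

    Element k of the result corresponds to the window ending at
    bucket k + n (buckets numbered from 1).
    """
    if len(buy) != len(sell):
        raise ValueError("buy and sell must have the same length")
    if n <= 0 or bucket_volume <= 0:
        raise ValueError("n and bucket_volume must be positive")
    # Absolute order imbalance |V^b - V^s| of each bucket.
    imbalance = [abs(b - s) for b, s in zip(buy, sell)]
    return [
        sum(imbalance[end - n:end]) / (n * bucket_volume)
        for end in range(n, len(imbalance) + 1)
    ]


# Made-up example: four buckets of volume 100 each, support n = 2.
example_buy = [60, 50, 80, 30]
example_sell = [40, 50, 20, 70]
example_vpin = rolling_vpin(example_buy, example_sell, n=2, bucket_volume=100)
```

Recomputing each window sum keeps the sketch close to the formula; on long series a running sum would avoid the repeated work.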

To distribute all VPIN values between 0 and 1, VPIN is in practice normalized through a normal law (normal distribution). We thus consider VPIN<sub>normalized</sub> in the following.

#### 2.1.5 VPIN event

A VPIN event is declared when the following occurs:

$$\text{VPIN}\_{\text{normalized}} \ge \theta\_{\text{VPIN}} \tag{5}$$

where θ<sub>VPIN</sub> is a given decision threshold. In practice [5], θ<sub>VPIN</sub> = 0.99.
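The event rule of equation (5) can be illustrated with a short sketch. It assumes that "normalized through a normal law" means applying the cumulative distribution function of a normal distribution whose mean and standard deviation are estimated from the VPIN series itself; this interpretation, the function names, and the sample series are assumptions made for illustration, not details taken from the text.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def vpin_events(vpin: Sequence[float],
                theta: float = 0.99) -> Tuple[List[int], List[float]]:
    """Return (event indices, normalized VPIN values).

    Assumption: mu and sigma are estimated on the whole series, which
    uses future information; a live detector would need a trailing or
    expanding estimate instead.
    """
    mu = mean(vpin)
    sigma = pstdev(vpin)
    if sigma == 0:
        raise ValueError("VPIN series is constant; cannot normalize")
    normalized = [normal_cdf(v, mu, sigma) for v in vpin]
    # Equation (5): an event is declared when the normalized value
    # reaches the decision threshold theta.
    events = [j for j, z in enumerate(normalized) if z >= theta]
    return events, normalized


# Made-up series: fifty quiet buckets followed by one spike.
sample = [0.2] * 50 + [0.9]
sample_events, sample_normalized = vpin_events(sample)
```

With the threshold of 0.99 quoted above, only values far in the upper tail of the fitted distribution are flagged.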

#### 2.2 Defining flash crashes with MIR

#### 2.2.1 Formal definition

Let (p<sub>t</sub>)<sub>t</sub> be a time series (e.g., of prices). MIR is defined as follows:

$$\text{MIR}\_{t,\eta} = \max\_{i \neq j, i, j \in [t, t+\eta]} \frac{|p\_i - p\_j|}{p\_i} \tag{6}$$
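A direct, brute-force reading of equation (6) can be sketched as follows, assuming the prices are held in a sequence indexed by bar (or bucket) number and are strictly positive. The function name `mir` and the sample prices are illustrative assumptions.

```python
from typing import Sequence


def mir(prices: Sequence[float], t: int, eta: int) -> float:
    """Brute-force evaluation of equation (6) on the window [t, t + eta]."""
    if t < 0 or eta < 1 or t + eta >= len(prices):
        raise ValueError("window [t, t + eta] must lie inside the series")
    window = prices[t:t + eta + 1]
    if any(p <= 0 for p in window):
        raise ValueError("prices must be strictly positive")
    # Scan every ordered pair (i, j) with i != j, as in the formula.
    return max(
        abs(p_i - p_j) / p_i
        for i, p_i in enumerate(window)
        for j, p_j in enumerate(window)
        if i != j
    )


# Made-up prices for illustration.
sample_prices = [100.0, 102.0, 95.0, 101.0]
sample_mir = mir(sample_prices, t=0, eta=3)
```

For strictly positive prices the maximum is attained with p_i at the window minimum and p_j at the window maximum, so the scan agrees with (max − min)/min over the window; that shortcut costs linear rather than quadratic time per window.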

A flash crash will depend on two things here:


#### 2.2.2 Empiric definition

In this data set, we identified only one flash crash, that of May 6, 2010, which lasted approximately 10 minutes according to media and financial institutions. Our definition of a flash crash will therefore take this event into account.

#### 2.3 The data

#### 2.3.1 Futures used

In this work, we use a comprehensive set of liquid futures trading data to illustrate the techniques to be introduced. More specifically, we use 67 months' worth of tick data of the five most liquid futures traded across all asset classes. The data comes to us in the form of five comma-separated value (CSV) files, one for each futures contract traded. The source of our data is TickWrite, a data vendor that normalizes the data into a common structure after acquiring it directly from the relevant exchanges. The total size of the CSV files is about 45.1 GB. They contain millions of trades spanning from the beginning of January 2007 to the end of July 2012. Each of the five contracts has more than 100 million trades during this 67-month period. The most heavily traded contract, E-mini S&P 500 futures (symbol ES), has about 500 million trades involving a total of about 3 billion contracts. The second most heavily traded is Euro exchange rate futures (symbol EC), with 188 million trades. The next three are Nasdaq 100 (NQ), 173 million trades; light crude oil (CL), 165 million trades; and E-mini Dow Jones (YM), 110 million trades. Figure 2 shows the evolution of prices over time (here each tick corresponds to a bucket).

Figure 2. Bucket S&P 500 values with time.

#### 2.3.2 Definition of flash crash

We want to define a flash crash empirically using the tools of the VPIN framework, namely, bars and buckets. Because the volume-clock paradigm does not allow us to control the time needed to fill a fixed volume of trades, we summarize below the steps we followed to detect flash crashes using MIR. As the procedure is quite long and the main purpose of the study is the prediction results of the following section, we present the principles and do not go into technical details:

• To be sure not to miss a flash crash because a time bar or bucket is too long, we have chosen a reasonable granularity level as in [5] (200 buckets per day and 30 bars per bucket).

• For each financial instrument, we have recorded the number of bars necessary to capture the local 10 minutes of maximum fall of May 6, 2010, known as the "flash crash"; we refer to these numbers as "window lengths" below.

• As the window lengths defined above do not have a stable distribution in time (because of the volume-clock paradigm), we have arbitrarily filtered out all events in which the time difference between minimum and maximum within a window length is longer than 20 minutes, in order to capture only quick events. Indeed, a given window length may be too large at some dates, so that the measured time difference between the local minimum and maximum exceeds 10 minutes, whereas the event would be a true flash crash with a smaller window length.<sup>1</sup>

• For each instrument, we recorded the amplitude of the "flash crash" and the corresponding MIR values.

The results made it possible to classify the five financial instruments into two groups:

• Data sets where the "flash crash" and other flash crashes are significantly present: ES, NQ, and YM.

• Data sets where the "flash crash" and other flash crashes are not really present. More precisely, the "flash crash" is not a rare event in the data set, and generally the magnitude levels of flash crashes are low compared to other instruments.

<sup>1</sup> This is not perfect because we can still miss some crashes (although in this data set there will be few of them, and with a smaller probability). However, first, we do not want to change the definition in time of a flash crash too much (we will not increase the tolerance level to 1 day); second, this problem is inherent to the fact that fixing the volume of bars and of buckets prevents us from precisely controlling bar and bucket filling times. Finding a solution for this particular data set does not at all guarantee a general solution, whether for a data set or for a financial instrument.

#### 3. Assessing VPIN prediction quality

In this section, we first present our methodology for finding the optimal prediction quality of VPIN (the one for which recall and precision rates are maximal and most useful in practice). Second, we present all the results: best parameters, associated remarks, and prediction lengths.

#### 3.1 Methodology

#### 3.1.1 Parameters to test

Here are the parameters we will test:


