MSc:DataMining

From IU
Revision as of 14:26, 30 July 2021

Data Mining

  • Course name: Data Mining
  • Course number: 346

Course Characteristics

Key concepts of the class

  • The role, subject, problems and methods of Data Mining in the context of other Data Science and Data Engineering disciplines and activities.
  • Data Mining in Finance, in particular in Algorithmic Financial Trading: in-depth study of problems and methods.

What is the purpose of this course?

The main purpose of this course is two-fold:

  • to provide the students with a general understanding of the role of Data Mining, which is closely connected with, but not identical to, Data Acquisition on one side and Big Data Analysis / Machine Learning on the other;
  • to provide the students with hands-on experience in applying Data Mining methods and techniques in a real-life modern area such as the Financial Industry (more specifically, Algorithmic Financial Trading).


Course objectives based on Bloom’s taxonomy

  1. Understand the role, subject, problems and methods of Data Mining in relation to other Data Science and Data Engineering disciplines, such as Data Acquisition and Big Data Analysis / Machine Learning.
  2. Define and understand the key concepts of Data Science for Finance: Financial Assets, Instruments, Markets and Trading, Exchanges, Prices, Volumes.
  3. Define and understand the key concepts of Financial Markets micro-structure: Orders, Order Books, Bids, Asks, Matching Engines, Trades.
  4. Understand the principal problems and solution methods of Data Mining with application to Financial Trading:
    • construction of Order Books and Trades from Order Logs or incremental updates;
    • data cleaning (e.g. resolving Bid-Ask collisions, wrong trade sizes, stale orders);
    • construction of descriptive statistics for Order Books and Trades;
    • feature generation for subsequent Data Analysis (e.g. VWAPs, Volumes, Market Pressure etc).
  5. Develop hands-on experience with complex, industrial-strength Data Mining processes related to Financial Trading data.
  6. Understand and apply the classical and modern methods of statistical data analysis, such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA).
  7. Understand the methods of feature generation for Machine Learning methods.
  8. Understand and apply patterns of Machine Learning methods to analyze financial trading data.

What should a student remember at the end of the course?

By the end of the course, the students should be able to:

  • Know the differences between Data Mining and Data Acquisition (on one hand) and Big Data Analysis (on the other hand).
  • Know the principal definitions and terminology related to Financial Trading data.
  • Know the importance of data cleaning / pre-processing in Data Mining in general and for Financial Trading data in particular.
  • Know that Machine Learning methods of data analysis cannot be used effectively without proper data pre-processing and feature engineering.
  • Know the relationship between Statistical and Machine Learning-based methods of data analysis.

What should a student be able to understand at the end of the course?

By the end of the course, the students should be able to:

  • Understand the mechanisms of exchange-based financial trading.
  • Understand the micro-structure of financial markets (orders, order books, matching, trades etc).
  • Understand the methods of constructing Order Book and Trades data from the raw data in Finance.
  • Understand the typical errors which occur in Financial Trading data, and the methods for their detection and correction.
  • Understand the modern methods of statistical analysis of financial time series data, in particular, the method of ICA and its advantages over PCA.
  • Understand the methods of feature generation for analysis of Financial Trading data using Machine Learning methods.
  • Understand how the statistical and machine learning-based methods of data analysis can efficiently be integrated to achieve the best results.

What should a student be able to apply at the end of the course?

By the end of the course, the students should be able to:

  • Develop sufficiently complex, “industrial-strength” software solutions (e.g. in Python) for processing raw Financial Trading data into Order Books and Trades, providing:
    • adequate temporal and spatial efficiency;
    • necessary logic for data clean-up and resolution of data errors / inconsistencies.
  • Judiciously select and apply the appropriate Machine Learning methods for analysis of Financial Trading data and prediction of financial time series.

Course evaluation

The course is largely project-based, so Labs carry a higher weight, and a Final Project Presentation replaces the Final Exam:

Course grade breakdown

Component                        Default points   Proposed points
Labs                             20               40
Interim performance assessment   30               20
Final Project Presentation       50               40

Grades range

Course grading range

Grade             Default range   Proposed range
A. Excellent      90–100          85–100
B. Good           75–89           70–84
C. Satisfactory   60–74           50–69
D. Poor           0–59

Resources and reference material

  • M. Kolanovic, Big Data and AI Strategies, JP Morgan, 2017.
  • K. Kim, Electronic and Algorithmic Trading Technology: The Complete Guide, Elsevier, 2007.
  • M. Durbin, All About High-Frequency Trading, McGraw-Hill, 2010.
  • I. Aldridge, High-Frequency Trading, Wiley, 2010.

Course Sections

The main sections of the course and the approximate distribution of hours between them are as follows:

Course Sections

Section   Section Title                                                   Teaching Hours
1         Introduction to Data Mining                                     2
2         Data Mining and Data Analysis in Financial Trading              4
3         Micro-Structure of Financial Markets                            16
4         Feature Engineering for Machine Learning in Financial Trading   4
5         Descriptive Statistics of Financial Trading Data                4
6         Principal and Independent Component Analysis (PCA and ICA)      8
7         Machine Learning Methods for Prediction of Financial Data       16

Section 1

Section title:

Introduction to Data Mining

Topics covered in this section:

  • The subject, problems and methods of Data Mining. Data Mining as a Data Engineering discipline.
  • Relationships between Data Mining, Data Acquisition, Big Data Analysis and Machine Learning.
  • A typical data processing workflow.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    No
Homework and group projects                                 No
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  Yes
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. Is Data Mining a science or an engineering discipline?
  2. Explain a typical data processing workflow from Acquisition to Analysis and the role of Data Mining in this process.
  3. What are the main problems and methods of Data Mining?
  4. How is Data Mining related to Data Analysis and Machine Learning?

Typical questions for seminar classes (labs) within this section

None

Test questions for final assessment in this section

  1. Explain the main methods of detecting data outliers and gaps.
  2. Do you concur with the statement that Machine Learning algorithms are in general capable of discerning arbitrarily-complex relationships in data, provided that the training dataset is large enough?
  3. Explain how Data Mining can be used to facilitate efficient Machine Learning.

Section 2

Section title:

Data Mining and Data Analysis in Financial Trading

Topics covered in this section:

  • Financial Asset Classes: Equities, Currencies, Commodities, Interest Rate Products and others.
  • Financial Instruments, Trading and Exchanges.
  • Stochastic dynamics of prices and trading volumes of financial instruments. The notion of stochastic differential equations. Trends and Volatilities.
  • The nomenclature of professional specializations in Quantitative Finance related to data science and data engineering: Quantitative Analysts, Quantitative Researchers, Quantitative Developers, Research Analysts. Their relationship to Data Mining.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    No
Homework and group projects                                 No
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  Yes
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. What are Equities as a financial asset class?
  2. Explain the difference between the QA and QR professional specializations.
  3. Which of the professional specializations in quantitative finance is most concerned with Data Mining?

Typical questions for seminar classes (labs) within this section

None

Test questions for final assessment in this section

  1. Explain the components of stochastic dynamics of financial assets (HINT: Trends and Volatilities).
  2. Provide two definitions of the Trend, and explain how they are related to each other.
  3. What are the main differences between stochastic dynamics of Equity and IRP prices? In the price prediction problem, where would the role of data science be more important?
  4. Explain the differences between QA and QR specializations regarding the subject of their research.

Section 3

Section title:

Micro-Structure of Financial Markets

Topics covered in this section:

  • The “mechanics” of Exchange-based trading in financial instruments.
  • Orders, Order Books, Bids and Asks, Price Levels and Order Volumes.
  • Bid-Ask Spread, Passive and Aggressive Orders, Order Matching, Trades.
  • Limit and Market orders, semantics of order execution.
  • Formats of historical Market Data: L1, L2 and L3 (Full Orders Log) data.
  • Example: Market Data for Moscow Exchange (FX and Equities sections).
  • Compiling Order Book data from a stream of Full Orders Log data.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    Yes
Homework and group projects                                 Yes
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  No
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. Explain how the Matching Engine of a financial exchange works.
  2. Is it correct to say that a Passive order is always a limit one? Is the converse true?
  3. What happens to a limit order if it is not completely filled at its limit price?

Typical questions for seminar classes (labs) within this section

PROJECT WORK: Develop a Data Acquisition and Data Mining software solution in Python which reads historical market data of Moscow Exchange from files (in a Full Orders Log format) and:

  1. selects the applicable Instruments;
  2. for each applicable Instrument composes a sequence of Order Book snapshots;
  3. correctly applies New, Cancel and Modify (Trade) records from the Full Orders Log;
  4. efficiently detects and corrects data anomalies / errors (e.g. invalid trade prices or sizes, stale orders etc);
  5. provides hooks for outputting the Order Book features which can subsequently be used for machine learning purposes.
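
The project steps above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the course's reference solution: the record fields (`action`, `order_id`, `side`, `price`, `volume`), the action names `NEW` / `CANCEL` / `TRADE` and the side codes `B` / `S` are hypothetical and do not reproduce the actual MOEX Full Orders Log layout. The sketch keeps an aggregated book per instrument, rejects inconsistent records instead of silently applying them, and exposes a check for Bid-Ask collisions.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class OrderBook:
    """Aggregated book for one instrument: price -> total resting volume."""
    bids: dict = field(default_factory=lambda: defaultdict(int))
    asks: dict = field(default_factory=lambda: defaultdict(int))
    orders: dict = field(default_factory=dict)  # order_id -> (side, price, remaining)

    def apply(self, action, order_id, side=None, price=None, volume=0):
        """Apply one log record; return False if the record is inconsistent."""
        if action == "NEW":
            if volume <= 0 or price is None or order_id in self.orders:
                return False
            self.orders[order_id] = (side, price, volume)
            self._levels(side)[price] += volume
            return True
        if action not in ("CANCEL", "TRADE"):
            return False  # unknown record type (hypothetical action names)
        if order_id not in self.orders:
            return False  # stale or unknown order: flag it instead of guessing
        o_side, o_price, remaining = self.orders[order_id]
        # A cancel removes everything that is left; a trade removes its own size.
        reduce_by = remaining if action == "CANCEL" else volume
        if reduce_by <= 0 or reduce_by > remaining:
            return False  # e.g. a trade larger than the remaining order size
        levels = self._levels(o_side)
        levels[o_price] -= reduce_by
        if levels[o_price] == 0:
            del levels[o_price]
        if reduce_by == remaining:
            del self.orders[order_id]
        else:
            self.orders[order_id] = (o_side, o_price, remaining - reduce_by)
        return True

    def _levels(self, side):
        return self.bids if side == "B" else self.asks

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None

    def is_crossed(self):
        """True on a Bid-Ask collision (best bid >= best ask)."""
        bb, ba = self.best_bid(), self.best_ask()
        return bb is not None and ba is not None and bb >= ba
```

Each record costs O(1) dictionary work here, while `best_bid` and `best_ask` scan all price levels; a production solution would more likely keep the levels in a sorted structure so that the best prices are available without a scan.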

Test questions for final assessment in this section

  1. What are Market Orders and how are they recognized in MOEX Full Orders Log?
  2. What are Bid-Ask collisions, why could they occur and how are they resolved in the project software solution?
  3. What are the time and space complexities of the algorithm implemented in the project?

Section 4

Section title:

Feature Engineering for Machine Learning in Financial Trading

Topics covered in this section:

  • From order book snapshots to ML features.
  • The uniformity requirements in feature engineering.
  • Volume-Weighted Average Prices (VWAPs) over uniform Size bands. VWAP-based mid-prices and Bid-Ask spreads.
  • Order Volumes over uniform Price Step bands.
  • Trades and “market pressure” over uniformly-defined time intervals. Using Exponential Moving Averages (EMA filters).
  • Derived features (e.g. logarithms).
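
As an illustration of VWAPs over uniform Size bands, the sketch below walks one side of an order book snapshot and computes the volume-weighted average price of each consecutive band of `band_size` lots, then derives a VWAP-based mid-price and spread from the first band on each side. The function names, the `(price, volume)` level representation and the use of integer volumes are assumptions made for this example, not definitions taken from the course.

```python
def band_vwaps(levels, band_size, n_bands):
    """VWAP of each consecutive size band on one side of the book.

    levels: list of (price, volume) sorted from best to worst price,
            with integer volumes (lots).
    Returns n_bands values; a band the book cannot fill completely is None.
    """
    vwaps = []
    queue = list(levels)
    idx = 0
    left_at_level = queue[0][1] if queue else 0
    for _ in range(n_bands):
        need, cost = band_size, 0.0
        while need > 0 and idx < len(queue):
            take = min(need, left_at_level)
            cost += take * queue[idx][0]
            need -= take
            left_at_level -= take
            if left_at_level == 0:
                # Current price level is exhausted: move one level deeper.
                idx += 1
                left_at_level = queue[idx][1] if idx < len(queue) else 0
        vwaps.append(cost / band_size if need == 0 else None)
    return vwaps


def vwap_mid_and_spread(bid_levels, ask_levels, band_size):
    """Mid-price and spread built from the first-band VWAP of each side."""
    bid = band_vwaps(bid_levels, band_size, 1)[0]
    ask = band_vwaps(ask_levels, band_size, 1)[0]
    if bid is None or ask is None:
        return None
    return (bid + ask) / 2.0, ask - bid
```

Because every snapshot yields the same number of bands of the same size, the resulting feature vector has a fixed length regardless of how many price levels the book happens to contain, which is one way to meet the uniformity requirement listed above.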

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    Yes
Homework and group projects                                 Yes
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  No
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. Explain why we need VWAPs and how they are computed.
  2. Explain why we need to use a uniform grid of price levels in Volumes computation.
  3. Why may we need logarithms of features in addition to features themselves?

Typical questions for seminar classes (labs) within this section

PROJECT WORK: Based on the solution implemented in Section 3, provide generation of standard features from Order Book snapshots of financial instruments traded at Moscow Exchange.
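
For the time-based features in this project, one possible building block is an EMA filter applied to a quantity aggregated over uniform time intervals. The sketch below is an assumption-laden example: it uses net signed trade volume per interval (positive for buyer-initiated trades, negative for seller-initiated ones) as a stand-in input, which is only one plausible proxy and not necessarily the course's definition of "market pressure". The function names and the `(timestamp, signed_volume)` trade representation are hypothetical.

```python
def signed_volume_per_interval(trades, interval, n_intervals):
    """Sum signed trade volume in uniform time buckets.

    trades: iterable of (timestamp, signed_volume) with timestamps measured
            from the start of the first bucket, in the same units as interval.
    Trades falling outside the n_intervals buckets are ignored.
    """
    buckets = [0.0] * n_intervals
    for ts, signed_volume in trades:
        k = int(ts // interval)
        if 0 <= k < n_intervals:
            buckets[k] += signed_volume
    return buckets


def ema_filter(values, alpha):
    """Exponential moving average: s[0] = x[0], s[t] = alpha*x[t] + (1 - alpha)*s[t-1]."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    for x in values:
        prev = smoothed[-1] if smoothed else x
        smoothed.append(alpha * x + (1.0 - alpha) * prev)
    return smoothed
```

A smaller `alpha` gives a longer memory; choosing it relative to the bucket length is a modelling decision that the sketch leaves open.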

Test questions for final assessment in this section

  1. What are the potential (adverse) effects of order book data errors on feature generation?
  2. How do we manage the spatial complexity of storing the features generated from Order Books?
  3. Explain the “reciprocity” between VWAPs and Order Volumes, and why we may need both kinds of features.
  4. What is “market pressure” and how is it computed?

Section 5

Section title:

Descriptive Statistics of Financial Trading Data

Topics covered in this section:

  • Distribution of Order Volumes by price depth. Single-humped and two-humped distribution densities, their typical occurrences in different financial instruments.
  • Distribution of Aggressive orders by interval between arrival: exponential-type distribution.
  • Hypothesis testing on distributions: the χ² (chi-squared) and Kolmogorov–Smirnov criteria.
  • Correlation analysis of financial time series. Avoiding pitfalls:
    • the centricity and stationarity requirements;
    • using finite differences and fractional-order differentiation.
  • Lead-lag analysis and the Hayashi–Yoshida method.
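
The Hayashi–Yoshida method listed above estimates the covariance of two series observed at different, non-synchronized times by summing the products of all pairs of increments whose observation intervals overlap. The sketch below is a minimal plain-Python version of that estimator; the function name and the argument layout (parallel lists of strictly increasing timestamps and values) are assumptions made for this example.

```python
def hayashi_yoshida_cov(times_x, values_x, times_y, values_y):
    """Hayashi-Yoshida covariance estimate for two asynchronously sampled series.

    Sums dX_i * dY_j over all pairs of increments whose observation
    intervals (t[i-1], t[i]] and (s[j-1], s[j]] overlap.
    Both timestamp lists must be strictly increasing.
    """
    total = 0.0
    j_start = 1
    for i in range(1, len(times_x)):
        a, b = times_x[i - 1], times_x[i]
        dx = values_x[i] - values_x[i - 1]
        # Skip y-intervals that end at or before the start of this x-interval.
        while j_start < len(times_y) and times_y[j_start] <= a:
            j_start += 1
        j = j_start
        # Accumulate every y-interval that starts before this x-interval ends.
        while j < len(times_y) and times_y[j - 1] < b:
            total += dx * (values_y[j] - values_y[j - 1])
            j += 1
    return total
```

When both series share the same timestamps, the result reduces to the ordinary sum of products of simultaneous increments. For lead-lag analysis, a common approach is to shift one series' timestamps by a candidate lag, recompute the estimate (normalized to a correlation), and look for the lag that maximizes it.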

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    Yes
Homework and group projects                                 Yes
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  No
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. Explain the typical differences in Order Volumes distribution between Futures and Spot FX instruments.
  2. What are the typical pitfalls in correlation analysis?
  3. What are the purpose and methods of Lead-Lag analysis?

Typical questions for seminar classes (labs) within this section

PROJECT WORK: Based on the solutions implemented in Sections 3 and 4, perform:

  1. Orders Volume distribution analysis for Spot FX instruments
  2. distribution analysis of intervals between order book dates
  3. distribution analysis of intervals between trades
  4. lead-lag analysis between USD/RUB and EUR/RUB instruments
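
For the interval-distribution tasks above, one simple check is the Kolmogorov–Smirnov distance between the empirical distribution of the intervals and an exponential distribution. The sketch below computes only the KS statistic, with the exponential rate estimated from the same sample as 1 / mean; the function name is an assumption of this example. Because the rate is fitted to the data, the standard KS critical values are not strictly valid, so the statistic is best read as a descriptive measure of fit unless a corrected test is used.

```python
import math


def ks_statistic_exponential(intervals):
    """KS distance between the empirical CDF of the intervals and an
    exponential CDF with rate 1 / mean(intervals)."""
    xs = sorted(intervals)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one interval")
    mean = sum(xs) / n
    if mean <= 0:
        raise ValueError("intervals must have a positive mean")
    rate = 1.0 / mean
    d = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-rate * x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d
```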

Test questions for final assessment in this section

  1. Explain the hypothesis testing methods for distribution densities of random variables.
  2. Explain the purpose and the technique of fractional-order differentiation of time series.
  3. Explain the Hayashi–Yoshida method.
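
As a small illustration of fractional-order differentiation of a time series, the sketch below builds the weights of the operator (1 - B)^d, where B is the lag operator, from the binomial series and applies them with a fixed window. The function names and the fixed-width truncation are assumptions made for this example; for d = 1 the weights reduce to the ordinary first difference.

```python
def frac_diff_weights(d, n_weights):
    """First n_weights coefficients of (1 - B)^d, B being the lag operator.

    Recurrence: w[0] = 1, w[k] = -w[k-1] * (d - k + 1) / k.
    """
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(series, d, n_weights):
    """Fixed-width fractional difference of a series.

    The first n_weights - 1 points are dropped because they lack a full window.
    """
    w = frac_diff_weights(d, n_weights)
    return [
        sum(w[k] * series[t - k] for k in range(n_weights))
        for t in range(n_weights - 1, len(series))
    ]
```

With 0 < d < 1 the weights decay slowly instead of cutting off after one lag, which is why a fractional difference can remove much of the non-stationarity of a price series while retaining more of its memory than a full first difference.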

Section 6

Section title:

Principal and Independent Component Analysis (PCA and ICA)

Topics covered in this section:

  • The objectives of component analysis.
  • Risk Factors for prices of financial instruments.
  • PCA: an “empirical” method.
  • PCA: a rigorous method based on stochastic differential equations.
  • Hypotheses testing for residual Brownian motions.
  • From PCA to ICA.
  • ICA methods and interpretation of results.
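
To make the “empirical” PCA item concrete, the sketch below assumes it means an eigen-decomposition of the sample covariance matrix of instrument returns; that reading, the function names and the use of power iteration (chosen only to keep the example dependency-free) are assumptions of this example. It returns just the leading component, i.e. the direction that explains the most variance.

```python
def covariance_matrix(columns):
    """Sample covariance of equally long series (one list per instrument)."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
        for ci in centered
    ]


def leading_component(cov, n_iter=500):
    """Largest eigenvalue and a unit eigenvector of a covariance matrix.

    Uses power iteration from a fixed, non-symmetric start vector; it can
    fail in the rare case where that vector is orthogonal to the leading
    eigenvector, and it converges slowly when the top eigenvalues are close.
    """
    dim = len(cov)
    v = [float(i + 1) for i in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    v = [x / norm for x in v]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(dim) for j in range(dim))
    return eigenvalue, v
```

In practice a full decomposition from a numerical library would normally be used, and the input would be returns (differences) rather than raw prices, in line with the stationarity requirement discussed in Section 5.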

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    Yes
Homework and group projects                                 Yes
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  No
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. Explain the objectives of the component analysis methods.
  2. Explain the differences between the “empirical” and the SDE-based PCA methods.
  3. How is independence of components achieved in ICA?

Typical questions for seminar classes (labs) within this section

PROJECT WORK: Based on the solutions implemented in Sections 3–5, perform:

  1. “empirical” PCA of the time series for major FX and Equity instruments at MOEX;
  2. SDE-based PCA of the same instruments (assuming constant covariance matrix but non-constant reversion terms), and test the hypothesis for the residual Brownian motions;
  3. ICA under the same conditions as above, and try to interpret the risk factors obtained.

Test questions for final assessment in this section

  1. Explain the SDE-based approach to PCA and ICA.
  2. Explain how the residual stochastic innovations are constructed and tested for being Brownian motions.
  3. What are the advantages of ICA over PCA?

Section 7

Section title:

Machine Learning Methods for Prediction of Financial Data

Topics covered in this section:

  • Recap of Machine Learning concepts relevant to financial time series: Supervised Learning (SL) and Reinforcement Learning (RL).
  • Recap of SL methods: explicit regression models, SVMs, gradient boosting methods, Artificial Neural Nets (ANNs).
  • The problem of predicting financial time series in Algorithmic Trading.
  • The importance of feature engineering over the “best non-linear method” selection.
  • The danger of over-fitting and the methods of controlling it.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                          Used (Yes/No)
Development of individual parts of software product code    Yes
Homework and group projects                                 Yes
Midterm evaluation                                          Yes
Testing (written or computer based)                         No
Reports                                                     No
Essays                                                      No
Oral polls                                                  No
Discussions                                                 Yes


Typical questions for ongoing performance evaluation within this section

  1. Explain the differences between Reinforcement Learning and Supervised Learning.
  2. Explain why careful feature engineering is so important in Machine Learning in general, and for predicting financial time series in particular.
  3. Explain the objectives and potential time horizons of price prediction in Algorithmic Trading.

Typical questions for seminar classes (labs) within this section

PROJECT WORK: Based on the solutions implemented in Sections 3–6, implement price prediction methods for USD/RUB instruments at MOEX:

  1. apply the features constructed in Section 4;
  2. apply an explicit linear regression with Lasso regularization;
  3. then construct an ANN and compare the quality of predictions.
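
To show what the Lasso step in the list above does, here is a minimal, dependency-free sketch of linear regression with an L1 penalty solved by cyclic coordinate descent. It minimizes (1/2n)·Σ(y − x·w)² + lam·Σ|w| without an intercept, so features and targets are assumed to be centered beforehand; the function names and the fixed number of sweeps are assumptions of this example, and a lab project would more likely rely on an existing Machine Learning library.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; values inside the band become 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(rows, targets, lam, n_sweeps=200):
    """Minimise (1/2n)*sum((y - x.w)^2) + lam*sum(|w|), no intercept.

    rows: list of feature vectors (assumed centred); targets: list of y values.
    """
    n, dim = len(rows), len(rows[0])
    w = [0.0] * dim
    residual = list(targets)  # equals y - X w, with w = 0 initially
    for _ in range(n_sweeps):
        for j in range(dim):
            z = sum(r[j] * r[j] for r in rows) / n
            if z == 0.0:
                continue  # constant-zero feature carries no information
            # Correlation of feature j with the residual that excludes its own effect.
            rho = sum(r[j] * (res + r[j] * w[j]) for r, res in zip(rows, residual)) / n
            new_wj = soft_threshold(rho, lam) / z
            delta = new_wj - w[j]
            if delta != 0.0:
                residual = [res - r[j] * delta for r, res in zip(rows, residual)]
                w[j] = new_wj
    return w
```

The soft-thresholding step is what sets the weights of uninformative features exactly to zero, which makes Lasso a form of feature selection as well as a guard against over-fitting; the price is a downward bias in the surviving weights that grows with `lam`.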

Test questions for final assessment in this section

  1. What is over-fitting in ML and how can it be controlled?
  2. Explain the similarities and differences between SVMs and ANNs.
  3. Describe in detail the API of the Machine Learning library you used in your Lab Project.