MSc: High-Dimensional Data Analysis

From IU

High-Dimensional Data Analysis

  • Course name: High-Dimensional Data Analysis
  • Course number: DS-06
  • Area of instruction: Computer Science and Engineering

Administrative details

  • Faculty: Computer Science and Engineering
  • Year of instruction: 1st year of MSc
  • Semester of instruction: 1st semester
  • No. of Credits: 5 ECTS
  • Total workload on average: 180 hours overall
  • Frontal lecture hours: 2 hours per week.
  • Frontal tutorial hours: 0 hours per week.
  • Lab hours: 2 hours per week.
  • Individual lab hours: 2 hours per week.
  • Frequency: weekly throughout the semester.
  • Grading mode: letters: A, B, C, D.

Course outline

This course provides knowledge of data analysis and interpretation. It starts with the mathematical definition of distance and uses it to motivate the singular value decomposition (SVD) for dimension reduction and multi-dimensional scaling, as well as its connection to principal component analysis. It then describes principal component analysis and factor analysis and demonstrates how these concepts are applied to data visualization and the analysis of high-throughput experimental data. The course also gives a brief introduction to machine learning and applies it to high-throughput data. It presents the general idea behind cluster analysis, describes k-means and hierarchical clustering and demonstrates how they are used, and describes prediction algorithms such as k-nearest neighbors along with the concepts of training sets, test sets, error rates and cross-validation. Students are required to participate in the laboratory practicum and solve practical tasks using hardware and a Python environment.
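The connection between the SVD and principal component analysis mentioned above can be illustrated with a minimal sketch. It assumes NumPy is available; the function name `pca_via_svd` and the random data matrix are hypothetical and are not part of the course materials.

```python
import numpy as np

def pca_via_svd(X, n_components):
    """Project the rows of X onto the top principal components using the SVD."""
    Xc = X - X.mean(axis=0)                      # center each feature (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]               # principal directions (rows)
    scores = U[:, :n_components] * s[:n_components]  # sample coordinates in PC space
    # Squared singular values, scaled by n - 1, are the eigenvalues
    # of the sample covariance matrix.
    explained_variance = s**2 / (X.shape[0] - 1)
    return scores, components, explained_variance

# Hypothetical high-dimensional data: 50 samples, 1000 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1000))
scores, components, variances = pca_via_svd(X, n_components=2)
print(scores.shape)  # (50, 2)
```

The scores equal the centered data multiplied by the principal directions, which is why the SVD alone is enough to reduce the dimension without forming the 1000 x 1000 covariance matrix.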

Expected learning outcomes

  • Apply different data analysis methods for dimension reduction and multi-dimensional scaling
  • Be able to select the best data analysis approach for a particular problem
  • Be familiar with principal component analysis and factor analysis and understand how these concepts are applied to data visualization and data analysis of high-throughput experimental data

Required background knowledge

A strong mathematical background in Calculus, Linear Algebra, Differential Equations, Statistics and Numerical Methods, as well as programming experience in Python and C/C++.

Prerequisites

Students will benefit from prior knowledge of the following topics in mathematics and programming:

  • CSE202 — Analytical Geometry and Linear Algebra I and CSE204 — Analytical Geometry and Linear Algebra II: matrix multiplication, matrix decomposition (SVD, ALS) and approximation (matrix norms), sparse matrices, stability of solutions (decompositions), vector spaces, metric spaces, manifolds, eigenvectors and eigenvalues.
  • CSE206 — Probability And Statistics: probability, likelihood, probability density functions, conditional probability, Bayes' rule, the covariance matrix and its properties.
  • CSE132 — Software Design with Python
  • Numerical Analysis: discrete Fourier transform (DFT), (stochastic) gradient methods.

Recommendations for students on how to succeed in the course

References:

Materials for self-preparation may include these videos:

Detailed topics covered in the course

  • Mathematical Distance
  • Dimension Reduction
  • Singular Value Decomposition and Principal Component Analysis
  • Multi-Dimensional Scaling Plots
  • Factor Analysis
  • Dealing with Batch Effects
  • Clustering
  • Heatmaps
  • Basic Machine Learning Concepts
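
As an illustration of the machine learning concepts listed above (k-nearest neighbors, training and test sets, error rates, cross-validation), here is a minimal NumPy sketch. The helper names `knn_predict` and `cv_error_rate` and the simulated two-class data are assumptions made for this example, not course code.

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k):
    """Classify each test point by majority vote among its k nearest training points."""
    # Euclidean distance between every test point and every training point.
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]
    votes = y_train[nearest]  # labels of the k nearest neighbours
    return np.array([np.bincount(v).argmax() for v in votes])

def cv_error_rate(X, y, k, n_folds=5, seed=0):
    """Estimate the k-NN error rate by n_folds-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th form the training set.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        errors.append(np.mean(pred != y[test_idx]))
    return float(np.mean(errors))

# Hypothetical data: two well-separated Gaussian classes in 20 dimensions.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (40, 20)), rng.normal(3, 1, (40, 20))])
y = np.repeat([0, 1], 40)
error = cv_error_rate(X, y, k=5)
print(f"cross-validated error rate: {error:.3f}")
```

Each fold serves once as the test set while the remaining folds form the training set, so every sample is predicted by a model that never saw it.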

Textbook

  • T. Tony Cai, Xiaotong Shen, eds. (2011). High-dimensional data analysis. Frontiers of Statistics. Singapore: World Scientific.
  • Christophe Giraud (2015). Introduction to High-Dimensional Statistics. Philadelphia: Chapman and Hall/CRC.

Reference material

  • Peter Bühlmann and Sara van de Geer (2011). Statistics for high-dimensional data: methods, theory and applications. Heidelberg; New York: Springer
  • Slides will be provided during the course

Required computer resources

NA

Evaluation

  • Final Project (40%)
  • Assignments (40%)
  • Midterm Exam (20%)