MSc: High-Dimensional Data Analysis
High-Dimensional Data Analysis
- Course name: High-Dimensional Data Analysis
- Course number: DS-06
- Area of instruction: Computer Science and Engineering
Administrative details
- Faculty: Computer Science and Engineering
- Year of instruction: 1st year of MSc
- Semester of instruction: 1st semester
- No. of Credits: 5 ECTS
- Total workload on average: 180 hours overall
- Frontal lecture hours: 2 hours per week.
- Frontal tutorial hours: 0 hours per week.
- Lab hours: 2 hours per week.
- Individual lab hours: 2 hours per week.
- Frequency: weekly throughout the semester.
- Grading mode: letter grades (A, B, C, D).
Course outline
This course provides knowledge of data analysis and interpretation. It begins with the mathematical definition of distance and uses it to motivate the singular value decomposition (SVD) for dimension reduction and multi-dimensional scaling, together with its connection to principal component analysis (PCA). It then covers principal component analysis and factor analysis and demonstrates how these concepts are applied to data visualization and to the analysis of high-throughput experimental data. The course also gives a brief introduction to machine learning and applies it to high-throughput data. It presents the general idea behind cluster analysis, describes k-means and hierarchical clustering, and demonstrates how they are used. It also describes prediction algorithms such as k-nearest neighbors, along with the concepts of training sets, test sets, error rates, and cross-validation. Students are required to participate in the laboratory practicum and to solve practical tasks using hardware and a Python environment.
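As an illustration of the link between distance, the SVD, and principal component analysis described above, the following is a minimal sketch using NumPy. The data are synthetic and hypothetical (50 samples, 200 features, generated from 2 latent factors), and the names `pairwise_distances`, `scores`, and `explained_ratio` are chosen for this example only; they are not part of the course materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 samples (rows) measured on 200 features (columns),
# generated from 2 latent factors plus a small amount of noise.
n_samples, n_features, n_factors = 50, 200, 2
latent = rng.normal(size=(n_samples, n_factors))
loadings = rng.normal(size=(n_factors, n_features))
X = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_features))

# Center each feature; PCA is defined on centered data.
Xc = X - X.mean(axis=0)

# Thin SVD: Xc = U @ diag(s) @ Vt, singular values in decreasing order.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Principal component scores: coordinates of the samples along the
# right singular vectors (equal to Xc @ Vt.T).
scores = U * s

# Variance explained by each component (eigenvalues of the sample covariance).
explained_variance = s**2 / (n_samples - 1)
explained_ratio = explained_variance / explained_variance.sum()


def pairwise_distances(A):
    """Euclidean distance between every pair of rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=2)


# Distances in the full 200-dimensional space versus the first 2 components.
D_full = pairwise_distances(Xc)
D_reduced = pairwise_distances(scores[:, :2])

print("variance explained by 2 components:", explained_ratio[:2].sum())
print("largest distance lost by the reduction:", np.max(D_full - D_reduced))
```

Because the synthetic data have only two underlying factors, the first two components are expected to retain most of the variance, so distances between samples change little after the reduction. Real high-throughput data will generally need more components.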
Expected learning outcomes
- Apply different data analysis methods for dimension reduction and multi-dimensional scaling
- Be able to select the best data analysis approach for a particular problem
- Be familiar with principal component analysis and factor analysis and understand how these concepts are applied to data visualization and data analysis of high-throughput experimental data
Required background knowledge
A strong mathematical background in calculus, linear algebra, differential equations, statistics, and numerical methods, as well as programming experience in Python and C/C++.
Prerequisites
- CSE201 - Mathematical Analysis I
- CSE203 - Mathematical Analysis II
- CSE205 - Differential Equations
- Numerical Methods
- CSE331 - Advanced Statistics
Students will benefit from prior knowledge of the following topics in mathematics and programming:
- CSE202 - Analytical Geometry and Linear Algebra I and Analytical Geometry and Linear Algebra II: matrix multiplication, matrix decomposition (SVD, ALS) and approximation (matrix norms), sparse matrices, stability of solutions (decompositions), vector spaces, metric spaces, manifolds, eigenvectors and eigenvalues.
- CSE206 - Probability and Statistics: probability, likelihood, probability density functions, conditional probability, Bayes' rule, the covariance matrix and its properties.
- CSE132 - Software Design with Python
- Numerical Analysis: DFT, [stochastic] gradient.
Recommendations for students on how to succeed in the course
References:
- Linear Algebra
- Statistics for Applications
- Matrix Methods in Data Analysis, Signal Processing, and Machine Learning
Materials for self-preparation may include these videos:
- 3Blue1Brown playlist on linear algebra;
- Fourier Transform, Gilbert Strang classic lectures;
- This MIT course;
- A basic Python-based course on maths, and NumPy with the official quickstart guide.
Detailed topics covered in the course
- Mathematical Distance
- Dimension Reduction
- Singular Value Decomposition and Principal Component Analysis
- Multi-Dimensional Scaling Plots
- Factor Analysis
- Dealing with Batch Effects
- Clustering
- Heatmaps
- Basic Machine Learning Concepts
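To give a flavor of the basic machine learning concepts in this list (k-nearest neighbors, training and test sets, error rates, cross-validation), here is a minimal sketch using NumPy only. The functions `knn_predict` and `cross_validated_error` and the synthetic two-class data are hypothetical and written for illustration; they are not taken from the course materials, and the labs may use a library implementation instead.

```python
import numpy as np


def knn_predict(X_train, y_train, X_test, k):
    """Predict each test label by majority vote among the k nearest training points."""
    # Euclidean distances, shape (n_test, n_train).
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]  # labels of the k neighbors of each test point
    # Majority vote; labels are assumed to be non-negative integers.
    return np.array([np.bincount(row).argmax() for row in votes])


def cross_validated_error(X, y, k, n_folds=5, seed=0):
    """Mean misclassification rate of k-NN over n_folds held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        y_pred = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        errors.append(np.mean(y_pred != y[test_idx]))
    return float(np.mean(errors))


# Hypothetical data: two classes of 40 samples each in 100 dimensions,
# separated by a shift of the class mean in the first 20 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(80, 100))
y = np.repeat([0, 1], 40)
X[y == 1, :20] += 3.0

for k in (1, 3, 5, 7):
    err = cross_validated_error(X, y, k)
    print(f"k={k}: cross-validated error rate = {err:.3f}")
```

Each sample is used for testing exactly once and is never in the training set of its own fold, so the reported error rate estimates performance on unseen data rather than on the data used for fitting.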
Textbook
- T. Tony Cai, Xiaotong Shen, eds. (2011). High-Dimensional Data Analysis. Frontiers of Statistics. Singapore: World Scientific
- Christophe Giraud (2015). Introduction to High-Dimensional Statistics. Philadelphia: Chapman and Hall/CRC
Reference material
- Peter Bühlmann and Sara van de Geer (2011). Statistics for high-dimensional data: methods, theory and applications. Heidelberg; New York: Springer
- Slides will be provided during the course
Required computer resources
NA
Evaluation
- Final Project (40%)
- Assignments (40%)
- Midterm Exam (20%)