Difference between revisions of "MSc: Advanced Statistics"

 
(40 intermediate revisions by the same user not shown)
Line 6: Line 6:
   
 
== Short Description ==
 
== Short Description ==
This course in advanced statistics with a view toward applications in data sciences. It is intended for masters students who are looking to expand their knowledge of theoretical methods used in modern research in data sciences. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. This course places particular emphasis on random vectors, random matrices, and random projections. It teaches basic theoretical skills for the analysis of these objects, which include concentration inequalities, covering and packing arguments, decoupling and symmetrization tricks, chaining and comparison techniques for stochastic processes, combinatorial reasoning based on the VC dimension, and a lot more.
+
This course in advanced statistics with a view toward applications in data sciences. It is intended for masters students who are looking to expand their knowledge of theoretical methods used in modern research in data sciences. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. This course places particular emphasis on random vectors, random matrices, and random projections. It teaches basic theoretical skills for the analysis of these objects, which include concentration inequalities, covering and packing arguments, decoupling and symmetrization tricks, chaining and comparison techniques for stochastic processes, combinatorial reasoning based on the VC dimension, and a lot more. This course integrates theory with applications for covariance estimation, semidefinite programming, networks, elements of statistical learning, error correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.
   
 
== Prerequisites ==
 
== Prerequisites ==
   
 
=== Prerequisite subjects ===
 
=== Prerequisite subjects ===
  +
* Excellent knowledge of probability and statistics.
* CSE329 - Empirical Methods
 
   
 
=== Prerequisite topics ===
 
=== Prerequisite topics ===
Line 22: Line 22:
 
! Section !! Topics within the section
 
! Section !! Topics within the section
 
|-
 
|-
  +
| Concentration of sums of independent random variables ||
| Sampling Distributions Associated with the Normal Population ||
 
  +
# Hoeffding’s inequality
# Kolmogorov-Smirnov test
 
  +
# Chernoff ’s inequality
# Size of samples, Kolmogorov-Smirnov, Fisher exact
 
  +
# Sub-gaussian distributions
# Logistic regression
 
  +
# Sub-exponential distributions
|}
 
  +
|-
  +
| Random vectors in high dimensions ||
  +
# Concentration of the norm
  +
# Covariance matrices and principal component analysis
  +
|-
  +
| Random matrices ||
  +
# Nets, covering numbers and packing numbers
  +
# Covariance estimation and clustering
  +
|}
  +
 
== Intended Learning Outcomes (ILOs) ==
 
== Intended Learning Outcomes (ILOs) ==
   
 
=== What is the main purpose of this course? ===
 
=== What is the main purpose of this course? ===
The main purpose of this course is to present the fundamentals of inferential statistics to the future software engineers and data scientists, on one side providing the scientific fundamentals of the disciplines, and on the other anchoring the theoretical concepts on practices coming from the world of software development and engineering. The course covers the statistical analysis of data with limited assumptions on the distribution, with reference to testing hypotheses, measuring correlations, building samples, and performing regressions.
+
The main purpose of this course is to present the fundamentals of high-dimensional statistics with applications to data science. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. This course places particular emphasis on random vectors, random matrices, and random projections. This course integrates theory with applications for covariance estimation, semidefinite programming, networks, elements of statistical learning, error correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.
   
 
=== ILOs defined at three levels ===
 
=== ILOs defined at three levels ===
Line 36: Line 46:
 
==== Level 1: What concepts should a student know/remember/explain? ====
 
==== Level 1: What concepts should a student know/remember/explain? ====
 
By the end of the course, the students should be able to ...
 
By the end of the course, the students should be able to ...
  +
* Explain the difference between low-dimensional and high-dimensional data
* Remember the fundamentals of inferential statistics
 
  +
* Explain concentration inequalities and their application
* Remember the specifics and purpose of different hypothesis tests
 
  +
* Remember the main statistical properties of high-dimensional vectors and matrices
* Distinguish between parametric and non parametric tests
 
   
 
==== Level 2: What basic practical skills should a student be able to perform? ====
 
==== Level 2: What basic practical skills should a student be able to perform? ====
 
By the end of the course, the students should be able to ...
 
By the end of the course, the students should be able to ...
  +
* Perform basic Monte Carlo computations, such as Monte Carlo integration
* the basic concepts of inferential statistics
 
  +
* Obtain simple but accurate bounds of complex statistical metrics
* the fundamental laws in statistics
 
* the concept of null and alternative hypotheses
+
* Apply the median of means estimator
  +
* Investigate simple statistics of social networks
* the hypotheses test procedure
 
  +
* Exploit the thin-shell phenomenon when analysing data
  +
* Apply data clustering and dimension reduction
   
 
==== Level 3: What complex comprehensive skills should a student be able to apply in real-life scenarios? ====
 
==== Level 3: What complex comprehensive skills should a student be able to apply in real-life scenarios? ====
 
By the end of the course, the students should be able to ...
 
By the end of the course, the students should be able to ...
* To understand the problems related to analyse statistically data not distributed normally
+
* To understand the problems related to statistical analysis of data.
  +
* To apply theoretical statistics in real-life via computer simulations and thereby confirm or reject the correctness of the theoretical concepts.
* To know the more recent computationally-intensive techniques that can help to describe samples and to infer properties of populations in absence of normality
 
  +
* To identify the correct statistical methods that needs to be applied to data in order to solve the given tasks in real-life.
* To identify situations when the data is on nominal scales so alternative techniques should be use, and act accordingly.
 
  +
* To be able to generate and run experiments on random data samples.
* To be able to run experiment to evaluate hypotheses for situation of scarce data, distributed non normally, on different kinds of scales.
 
  +
 
== Grading ==
 
== Grading ==
   
Line 63: Line 76:
 
| A. Excellent || 85-100 || -
 
| A. Excellent || 85-100 || -
 
|-
 
|-
| B. Good || 65-84 || -
+
| B. Good || 70-84 || -
 
|-
 
|-
| C. Satisfactory || 51-64 || -
+
| C. Satisfactory || 50-69 || -
 
|-
 
|-
| D. Poor || 0-50 || -
+
| D. Poor || 0-49 || -
 
|}
 
|}
  +
  +
=== Minimum Requirements For Passing The Course ===
  +
  +
There are two requirements for passing this course:.
  +
# You must have at least 50% on the Final Exam.
  +
# You must have at least 50% of the overall grade.
   
 
=== Course activities and grading breakdown ===
 
=== Course activities and grading breakdown ===
Line 76: Line 95:
 
! Activity Type !! Percentage of the overall course grade
 
! Activity Type !! Percentage of the overall course grade
 
|-
 
|-
| Quiz during each lecture (weekly evaluations) || 15
+
| Quiz/Assignment during each lecture (weekly evaluations) || 20
 
|-
 
|-
| Labs classes (weekly evaluations) || 15
+
| Labs classes (weekly evaluations) || 20
 
|-
 
|-
 
| Midterm || 20
 
| Midterm || 20
 
|-
 
|-
| Final exam || 50
+
| Final exam || 40
 
|}
 
|}
   
   
  +
=== Plagiarism Rules ===
There are two constraints for passing the course:
 
  +
# You must attend all labs.
 
  +
* If a student submits a solution to a weekly assignment/quiz and/or lab that is identical to the one submitted from another student, then both students will obtain the maximum points for this task but with the negative sign.
# You must submit all lab reports.
 
# You must have at least 50% on the Final Exam.
 
   
 
=== Recommendations for students on how to succeed in the course ===
 
=== Recommendations for students on how to succeed in the course ===
 
* Watch the video lecture and read the lecture notes before coming to the onsite lectures and to the labs.
 
* Watch the video lecture and read the lecture notes before coming to the onsite lectures and to the labs.
* Attend the onsite lectures
+
* Attend the onsite lectures and questions related to parts of the material that you find unclear.
  +
* Submit solutions to the weekly quizzes.
* Ask questions and provide answers to the questions during the onsite lectures.
 
* Attend all of the labs and submit all of the lab reports.
+
* Submit the weekly lab reports.
 
* Prepare seriously for the midterm exam.
 
* Prepare seriously for the midterm exam.
 
* Prepare seriously for the final exam.
 
* Prepare seriously for the final exam.
Line 104: Line 122:
 
* The lecture notes and the video lectures provided via Moodle are sufficient for passing this course with grade A.
 
* The lecture notes and the video lectures provided via Moodle are sufficient for passing this course with grade A.
   
  +
=== Software and tools used within the course ===
=== Closed access resources ===
 
  +
* You can use any software by your choice to perform the lab tasks.
   
 
=== Software and tools used within the course ===
 
 
 
= Teaching Methodology: Methods, techniques, & activities =
 
= Teaching Methodology: Methods, techniques, & activities =
   
== Activities and Teaching Methods ==
 
{| class="wikitable"
 
|+ Activities within each section
 
|-
 
! Learning Activities !! Section 1
 
|-
 
| Testing (written or computer based) || 1
 
|-
 
| Discussions || 1
 
|}
 
 
== Formative Assessment and Course Activities ==
 
== Formative Assessment and Course Activities ==
   
 
=== Ongoing performance assessment ===
 
=== Ongoing performance assessment ===
  +
The performance will be assessed via weekly quizzes and weekly labs.
   
==== Section 1 ====
 
{| class="wikitable"
 
|+
 
|-
 
! Activity Type !! Content !! Is Graded?
 
|-
 
| Question || Let X1, X2, ..., X10 be a random sample from a distribution whose probability density function is <math>{\textstyle f(x)=1}</math> if <math>{\textstyle 0<x<1}</math>, and 0 otherwise. Based on the observed values 0.62, 0.36, 0.23, 0.76, 0.65, 0.09, 0.55, 0.26, 0.38, 0.24, test the hypothesis H0: X ~ UNIF(0, 1) against H1: X ≁ UNIF(0, 1) at a significance level α = 0.1. || 1
 
|-
 
| Question || If X1,X2, ...,Xn is a random sample from a distribution with density function <math>{\textstyle f(x)=((1-\theta )x^{\theta }\;if\;0<x<1}</math> , otherwise 0), what is the maximum likelihood estimator of <math>{\textstyle \theta }</math> ? || 1
 
|-
 
| Question || Let X1,X2, ...,Xn be a random sample of size n from a distribution with a probability density function <math>{\textstyle f(x)=((1-\theta )x^{\theta }\;if\;0<x<1,}</math> otherwise 0), where <math>{\textstyle 0<\theta }</math> is a parameter. Using the maximum likelihood method find an estimator for the parameter <math>{\textstyle \theta }</math> . || 1
 
|-
 
| Question || Suppose you are told that the likelihood of <math>{\textstyle \theta }</math> at <math>{\textstyle \theta =2}</math> is given by 1/4. Is this the probability that <math>{\textstyle \theta =2}</math> ? Explain why or why not. || 1
 
|-
 
| Question || If X1,X2, ...,Xn is a random sample from a distribution with density function<math>{\textstyle f(x)=({\frac {1}{\theta }}\;if\;0<x<1,}</math> otherwise 0), then what is the maximum likelihood estimator of <math>{\textstyle \theta }</math> ? || 0
 
|-
 
| Question || Let X1,X2, ...,Xn be a random sample from a normal population with mean <math>{\textstyle \mu }</math> and variance <math>{\textstyle \sigma ^{2}}</math> . What are the maximum likelihood estimators of <math>{\textstyle \mu }</math> and <math>{\textstyle \sigma ^{2}}</math> ? || 0
 
|-
 
| Question || Suppose that you have the following data points: 0.36, 0.32, 0.10, 0.13, 0.45, 0.11, 0.12, 0.09; compute Dn to determine if they come from the uniform distribution [0,0.5]. || 0
 
|-
 
| Question || The data on the heights of 12 infants are given below: 18.2, 21.4, 22.6, 17.4, 17.6, 16.7, 17.1, 21.4, 20.1, 17.9, 16.8, 23.1. Test the hypothesis that the data came from some normal population at a significance level α = 0.1. || 0
 
|}
 
 
=== Final assessment ===
 
=== Final assessment ===
  +
# To be added
 
  +
The final assessment is in a written form. You mast have at least 50% on the final exam to pass the course.
   
 
=== The retake exam ===
 
=== The retake exam ===
  +
The retake of the exam will be in a written form.
'''Section 1'''
 

Latest revision as of 12:12, 22 January 2024

Advanced Statistics

  • Course name: Advanced Statistics
  • Code discipline: DS-03
  • Subject area:

Short Description

This is a course in advanced statistics with a view toward applications in data science. It is intended for master's students who want to expand their knowledge of the theoretical methods used in modern data science research. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. It places particular emphasis on random vectors, random matrices, and random projections, and teaches basic theoretical skills for the analysis of these objects, including concentration inequalities, covering and packing arguments, decoupling and symmetrization tricks, chaining and comparison techniques for stochastic processes, combinatorial reasoning based on the VC dimension, and more. The course integrates theory with applications to covariance estimation, semidefinite programming, networks, elements of statistical learning, error-correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.

Prerequisites

Prerequisite subjects

  • Excellent knowledge of probability and statistics.

Prerequisite topics

Course Topics

Course Sections and Topics
Section Topics within the section
Concentration of sums of independent random variables
  1. Hoeffding’s inequality
  2. Chernoff’s inequality
  3. Sub-gaussian distributions
  4. Sub-exponential distributions
Random vectors in high dimensions
  1. Concentration of the norm
  2. Covariance matrices and principal component analysis
Random matrices
  1. Nets, covering numbers and packing numbers
  2. Covariance estimation and clustering
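
As an illustration of the first topic, here is a minimal sketch in standard-library Python. The function names `hoeffding_bound` and `empirical_tail` are hypothetical and are not taken from the course materials. For i.i.d. variables with values in [0, 1], Hoeffding's inequality states that P(|sample mean - mu| >= t) <= 2 exp(-2 n t^2); the code compares this bound with a simulated tail frequency for Uniform(0, 1) samples.

```python
import math
import random


def hoeffding_bound(n: int, t: float) -> float:
    """Two-sided Hoeffding bound for the mean of n i.i.d. variables in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t))


def empirical_tail(n: int, t: float, trials: int, rng: random.Random) -> float:
    """Fraction of trials in which the mean of n Uniform(0, 1) draws
    deviates from the true mean 0.5 by at least t."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if abs(mean - 0.5) >= t:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is reproducible
    n, t = 100, 0.1
    print("empirical tail frequency:", empirical_tail(n, t, 5_000, rng))
    print("Hoeffding bound:         ", hoeffding_bound(n, t))
```

The bound holds for every distribution supported on [0, 1], so for a specific distribution such as the uniform one the simulated frequency is typically far below it.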

Intended Learning Outcomes (ILOs)

What is the main purpose of this course?

The main purpose of this course is to present the fundamentals of high-dimensional statistics with applications to data science. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. This course places particular emphasis on random vectors, random matrices, and random projections. This course integrates theory with applications for covariance estimation, semidefinite programming, networks, elements of statistical learning, error correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.

ILOs defined at three levels

Level 1: What concepts should a student know/remember/explain?

By the end of the course, the students should be able to ...

  • Explain the difference between low-dimensional and high-dimensional data
  • Explain concentration inequalities and their application
  • Remember the main statistical properties of high-dimensional vectors and matrices
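
One way to see the contrast between low- and high-dimensional data is the concentration of the norm listed among the course topics: for a vector with n independent standard Gaussian coordinates, the Euclidean norm stays within a roughly constant distance of sqrt(n), so its relative fluctuations shrink as n grows. The following is a small illustrative sketch in standard-library Python; `gaussian_norm` is a hypothetical helper name, not something defined by the course.

```python
import math
import random


def gaussian_norm(n: int, rng: random.Random) -> float:
    """Euclidean norm of one vector with n independent N(0, 1) coordinates."""
    return math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the run is reproducible
    for n in (2, 100, 10_000):
        norms = [gaussian_norm(n, rng) for _ in range(200)]
        spread = max(norms) - min(norms)
        # Print the dimension, sqrt(n), and the spread of the observed
        # norms relative to sqrt(n).
        print(n, round(math.sqrt(n), 2), round(spread / math.sqrt(n), 3))
```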

Level 2: What basic practical skills should a student be able to perform?

By the end of the course, the students should be able to ...

  • Perform basic Monte Carlo computations, such as Monte Carlo integration
  • Obtain simple but accurate bounds on complex statistical metrics
  • Apply the median-of-means estimator
  • Investigate simple statistics of social networks
  • Exploit the thin-shell phenomenon when analysing data
  • Apply data clustering and dimension reduction
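
Two of the skills above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the names `mc_integrate` and `median_of_means` are hypothetical, the observations are assumed to be i.i.d., and the blocks of the median-of-means estimator are taken as consecutive slices of the data.

```python
import random
import statistics
from typing import Callable, Sequence


def mc_integrate(f: Callable[[float], float], a: float, b: float,
                 n: int, rng: random.Random) -> float:
    """Estimate the integral of f over [a, b] as (b - a) times the
    average of f at n points drawn uniformly from [a, b]."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


def median_of_means(data: Sequence[float], k: int) -> float:
    """Split data into k consecutive blocks of equal size (dropping any
    remainder), average each block, and return the median of the averages."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    size = len(data) // k
    means = [statistics.fmean(data[i * size:(i + 1) * size]) for i in range(k)]
    return statistics.median(means)


if __name__ == "__main__":
    rng = random.Random(0)
    # The exact value of the integral of x^2 over [0, 1] is 1/3.
    print(mc_integrate(lambda x: x * x, 0.0, 1.0, 50_000, rng))
    sample = [1.0] * 9 + [1000.0]  # one gross outlier
    print(statistics.fmean(sample), median_of_means(sample, 5))
```

The second printout contrasts the plain sample mean, which the single outlier pulls far from 1, with the median of block means, which is affected only through the one block that contains the outlier.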

Level 3: What complex comprehensive skills should a student be able to apply in real-life scenarios?

By the end of the course, the students should be able to ...

  • Understand the problems related to the statistical analysis of data.
  • Apply theoretical statistics to real-life settings via computer simulations, and thereby confirm or reject the correctness of the theoretical concepts.
  • Identify the correct statistical methods that need to be applied to data in order to solve given real-life tasks.
  • Generate and run experiments on random data samples.

Grading

Course grading range

Grade | Range | Description of performance
A. Excellent | 85-100 | -
B. Good | 70-84 | -
C. Satisfactory | 50-69 | -
D. Poor | 0-49 | -

Minimum Requirements For Passing The Course

There are two requirements for passing this course:

  1. You must have at least 50% on the Final Exam.
  2. You must have at least 50% of the overall grade.

Course activities and grading breakdown

Activity Type | Percentage of the overall course grade
Quiz/Assignment during each lecture (weekly evaluations) | 20
Lab classes (weekly evaluations) | 20
Midterm | 20
Final exam | 40
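
The arithmetic behind the breakdown and the two passing requirements can be sketched as follows. This is an illustrative Python sketch, not an official grading tool. The function names are hypothetical, every component is assumed to be scored on a 0-100 scale, and the lower end of each published grade range is assumed to act as the threshold for non-integer scores.

```python
# Weights from the grading breakdown, in percent of the overall grade.
WEIGHTS = {"quizzes": 20, "labs": 20, "midterm": 20, "final": 40}


def overall_score(quizzes: float, labs: float, midterm: float, final: float) -> float:
    """Weighted overall score in percent; each input is a score in [0, 100]."""
    parts = {"quizzes": quizzes, "labs": labs, "midterm": midterm, "final": final}
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS) / 100


def letter_grade(score: float) -> str:
    """Map an overall score to the published ranges.
    Assumption: the lower bound of each range is the cut-off."""
    if score >= 85:
        return "A"
    if score >= 70:
        return "B"
    if score >= 50:
        return "C"
    return "D"


def passes(quizzes: float, labs: float, midterm: float, final: float) -> bool:
    """Both requirements: at least 50% on the final exam and at least 50% overall."""
    return final >= 50 and overall_score(quizzes, labs, midterm, final) >= 50


if __name__ == "__main__":
    score = overall_score(quizzes=80, labs=90, midterm=70, final=60)
    print(score, letter_grade(score), passes(80, 90, 70, 60))
```

Note that a high overall score does not compensate for a final exam below 50%: in this sketch `passes(100, 100, 100, 40)` is false even though the weighted score is 76.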


Plagiarism Rules

  • If a student submits a solution to a weekly assignment, quiz, or lab that is identical to one submitted by another student, both students receive the maximum number of points for that task with a negative sign, that is, as a deduction.

Recommendations for students on how to succeed in the course

  • Watch the video lecture and read the lecture notes before coming to the onsite lectures and to the labs.
  • Attend the onsite lectures and ask questions about the parts of the material that you find unclear.
  • Submit solutions to the weekly quizzes.
  • Submit the weekly lab reports.
  • Prepare seriously for the midterm exam.
  • Prepare seriously for the final exam.

Resources, literature and reference materials

Open access resources

  • The lecture notes and the video lectures provided via Moodle are sufficient for passing this course with a grade of A.

Software and tools used within the course

  • You can use any software of your choice to perform the lab tasks.

Teaching Methodology: Methods, techniques, & activities

Formative Assessment and Course Activities

Ongoing performance assessment

The performance will be assessed via weekly quizzes and weekly labs.

Final assessment

The final assessment is in written form. You must have at least 50% on the final exam to pass the course.

The retake exam

The retake of the exam will be in written form.