Difference between revisions of "MSc: Advanced Statistics"

 
(48 intermediate revisions by 2 users not shown)
  +
 
= Advanced Statistics =
 
= Advanced Statistics =
  +
* '''Course name''': Advanced Statistics
  +
* '''Code discipline''': DS-03
  +
* '''Subject area''':
   
  +
== Short Description ==
* <span>'''Course name:'''</span> Advanced Statistics
 
  +
This is a course in advanced statistics with a view toward applications in data science. It is intended for master's students who want to expand their knowledge of the theoretical methods used in modern data science research. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. It places particular emphasis on random vectors, random matrices, and random projections. It teaches basic theoretical skills for the analysis of these objects, including concentration inequalities, covering and packing arguments, decoupling and symmetrization tricks, chaining and comparison techniques for stochastic processes, combinatorial reasoning based on the VC dimension, and more. The course integrates theory with applications to covariance estimation, semidefinite programming, networks, elements of statistical learning, error-correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.
* <span>'''Course number:'''</span> DS-03
 
   
  +
== Prerequisites ==
   
== Prerequisites ==
+
=== Prerequisite subjects ===
  +
* Excellent knowledge of probability and statistics.
This course will benefit from good English language skills. It also helps to have basic statistical skills from the perspective of [https://en.wikipedia.org/wiki/Empirical_research Empirical research]: statistical hypothesis testing, dependent and independent variables, distributions, etc.
 
* [https://eduwiki.innopolis.university/index.php/MSc:_Empirical_Methods CSE329 - Empirical Methods]
 
   
  +
=== Prerequisite topics ===
== Course Characteristics ==
 
=== Key concepts of the class ===
 
   
* Statistical inference
 
* Non parametric statistics
 
* Test of statistical hypotheses
 
* Simple linear regression and correlation analysis
 
* Meta-Analysis
 
   
  +
== Course Topics ==
=== What is the purpose of this course? ===
 
  +
{| class="wikitable"
 
  +
|+ Course Sections and Topics
The main purpose of this course is to present the fundamentals of inferential statistics to the future software engineers and data scientists, on one side providing the scientific fundamentals of the disciplines, and on the other anchoring the theoretical concepts on practices coming from the world of software development and engineering. The course covers the statistical analysis of data with limited assumptions on the distribution, with reference to testing hypotheses, measuring correlations, building samples, and performing regressions.
 
 
== Course Objectives Based on Bloom’s Taxonomy ==
 
 
=== What should a student remember at the end of the course? ===
 
 
By the end of the course, the students should be able to:
 
 
* Remember the fundamentals of inferential statistics
 
* Remember the specifics and purpose of different hypothesis tests
 
* Distinguish between parametric and non parametric tests
 
 
=== What should a student be able to understand at the end of the course? ===
 
 
By the end of the course, the students should be able to understand:
 
 
* the basic concepts of inferential statistics
 
* the fundamental laws in statistics
 
* the concept of null and alternative hypotheses
 
* the hypotheses test procedure
 
 
=== What should a student be able to apply at the end of the course? ===
 
 
By the end of the course, the students should be able to ...
 
 
* Understand the problems that arise in the statistical analysis of data that are not normally distributed

* Know the more recent computationally intensive techniques that help to describe samples and to infer properties of populations in the absence of normality

* Identify situations in which the data are on nominal scales, so that alternative techniques should be used, and act accordingly

* Run experiments to evaluate hypotheses in situations where data are scarce, not normally distributed, and measured on different kinds of scales
 
 
=== Course evaluation (Standard) ===
 
 
{| style="border-spacing: 2px; border: 1px solid darkgray;"
 
|+ Course grade breakdown
 
!
 
!
 
!align="center"| '''Points'''
 
 
|-
 
|-
  +
! Section !! Topics within the section
| Weekly quizzes
 
|
 
|align="center"| 10
 
 
|-
 
|-
  +
| Concentration of sums of independent random variables ||
| Midterm
 
  +
# Hoeffding’s inequality
|
 
  +
# Chernoff’s inequality
|align="center"| 20
 
  +
# Sub-gaussian distributions
  +
# Sub-exponential distributions
 
|-
 
|-
  +
| Random vectors in high dimensions ||
| Final oral exam
 
  +
# Concentration of the norm
|
 
  +
# Covariance matrices and principal component analysis
|align="center"| 35
 
 
|-
 
|-
  +
| Random matrices ||
| Final written exam
 
  +
# Nets, covering numbers and packing numbers
|
 
  +
# Covariance estimation and clustering
|align="center"| 30
 
|-
 
| Participation
 
|
 
|align="center"| 5
 
 
|}
 
|}
   
=== Course evaluation (Project Based) ===
+
== Intended Learning Outcomes (ILOs) ==
   
  +
=== What is the main purpose of this course? ===
{| style="border-spacing: 2px; border: 1px solid darkgray;"
 
  +
The main purpose of this course is to present the fundamentals of high-dimensional statistics with applications to data science. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. This course places particular emphasis on random vectors, random matrices, and random projections. This course integrates theory with applications for covariance estimation, semidefinite programming, networks, elements of statistical learning, error correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.
|+ Course grade breakdown
 
!
 
!
 
!align="center"| '''Points'''
 
|-
 
| Weekly quizzes
 
|
 
|align="center"| 15
 
|-
 
| Weekly Projects Review
 
|
 
|align="center"| 15
 
|-
 
| Mid of Semester Project Review
 
|
 
|align="center"| 20
 
|-
 
| Final Report
 
|
 
|align="center"| 30
 
|-
 
| Final Presentation with Q&A
 
|
 
|align="center"| 20
 
|}
 
   
=== Grades range ===
+
=== ILOs defined at three levels ===
   
  +
==== Level 1: What concepts should a student know/remember/explain? ====
{| style="border-spacing: 2px; border: 1px solid darkgray;"
 
  +
By the end of the course, the students should be able to ...
|+ Course grading range
 
  +
* Explain the difference between low-dimensional and high-dimensional data
!
 
  +
* Explain concentration inequalities and their application
!
 
  +
* Remember the main statistical properties of high-dimensional vectors and matrices
!align="center"| '''Range'''
 
|-
 
| A. Excellent
 
|
 
|align="center"| 95-100
 
|-
 
| B. Good
 
|
 
|align="center"| 75-94
 
|-
 
| C. Satisfactory
 
|
 
|align="center"| 55-74
 
|-
 
| D. Poor
 
|
 
|align="center"| 0-54
 
|}
 
   
  +
==== Level 2: What basic practical skills should a student be able to perform? ====
  +
By the end of the course, the students should be able to ...
  +
* Perform basic Monte Carlo computations, such as Monte Carlo integration
  +
* Obtain simple but accurate bounds of complex statistical metrics
  +
* Apply the median of means estimator
  +
* Investigate simple statistics of social networks
  +
* Exploit the thin-shell phenomenon when analysing data
  +
* Apply data clustering and dimension reduction
   
  +
==== Level 3: What complex comprehensive skills should a student be able to apply in real-life scenarios? ====
=== Cooperation policy and quotations ===
 
  +
By the end of the course, the students should be able to ...
We encourage vigorous discussion and cooperation in this class. You should feel free to discuss any aspects of the class with any classmates. However, we insist that any written material that is not specifically designated as a Team Deliverable be done by you alone. This includes answers to reading questions, individual reports associated with assignments, and labs. We also insist that if you include verbatim text from any source, you clearly indicate it using standard conventions of quotation or indentation and a note to indicate the source.
 
  +
* Understand the problems related to statistical analysis of data.
  +
* Apply theoretical statistics in real-life settings via computer simulations, and thereby confirm or reject the correctness of the theoretical concepts.
  +
* Identify the correct statistical methods that need to be applied to data in order to solve given real-life tasks.
  +
* Generate and run experiments on random data samples.
   
  +
== Grading ==
   
=== Resources and reference material ===
+
=== Course grading range ===
  +
{| class="wikitable"
 
  +
|+
* Wasserman L. (2006) All of Nonparametric Statistics. Springer
 
  +
|-
* Randles, R.H. and Wolfe, D.A. (1991). Introduction to the Theory of Nonparametric Statistics. Melbourne: Robert Krieger. (Ch.1‐Ch.4)
 
  +
! Grade !! Range !! Description of performance
* Hastie, T. Tibshirani, R. and Friedman, J. (2008) The Elements of Statistical Learning 2ed. Springer
 
  +
|-
* Hollander, M. and Wolfe, D.A. (1999). Nonparametric Statistical Methods, 2nd ed. New York: John Wiley.
 
  +
| A. Excellent || 85-100 || -
 
== Course Sections ==
 
 
The main sections of the course and the approximate distribution of hours between them are as follows:
 
 
{| style="border-spacing: 2px; border: 1px solid darkgray;"
 
|+ Course Sections
 
!align="center"| '''Section'''
 
! '''Section Title'''
 
!align="center"| '''Teaching Hours'''
 
 
|-
 
|-
  +
| B. Good || 70-84 || -
|align="center"| 1
 
| Sampling Distributions Associated with the Normal Population
 
|align="center"| 15
 
 
|-
 
|-
  +
| C. Satisfactory || 50-69 || -
|align="center"| 2
 
| Test of Statistical Hypotheses
 
|align="center"| 30
 
 
|-
 
|-
  +
| D. Poor || 0-49 || -
|align="center"| 3
 
| Simple Linear Regression and Correlation Analysis
 
|align="center"| 15
 
 
|}
 
|}
   
  +
=== Minimum Requirements For Passing The Course ===
=== Section 1 ===
 
   
  +
There are two requirements for passing this course:
==== Section title: ====
 
  +
# You must have at least 50% on the Final Exam.
  +
# You must have at least 50% of the overall grade.
   
  +
=== Course activities and grading breakdown ===
Sampling Distributions Associated with the Normal Population
 
  +
{| class="wikitable"
 
  +
|+
=== Topics covered in this section: ===
 
 
* Introduction to the course, toward inference
 
* Student’s t-distribution
 
* Bernoulli and binomial distribution
 
* Chi-square distribution
 
* Snedecor’s F-distribution
 
 
=== What forms of evaluation were used to test students’ performance in this section? ===
 
 
 
{| style="border-spacing: 2px; border: 1px solid darkgray;"
 
!align="center"| '''Evaluation'''
 
!align="center"| '''Yes/No'''
 
 
|-
 
|-
  +
! Activity Type !! Percentage of the overall course grade
| Development of individual parts of software product code
 
|align="center"| 0
 
 
|-
 
|-
  +
| Quiz/Assignment during each lecture (weekly evaluations) || 20
| Homework and group projects
 
|align="center"| 0
 
 
|-
 
|-
  +
| Labs classes (weekly evaluations) || 20
| Midterm evaluation
 
|align="center"| 1
 
 
|-
 
|-
  +
| Midterm || 20
| Testing (written or computer based)
 
|align="center"| 1
 
 
|-
 
|-
  +
| Final exam || 40
| Reports
 
|align="center"| 0
 
|-
 
| Essays
 
|align="center"| 0
 
|-
 
| Oral polls
 
|align="center"| 0
 
|-
 
| Discussions
 
|align="center"| 1
 
 
|}
 
|}
   
=== Typical questions for ongoing performance evaluation within this section ===
 
   
  +
=== Plagiarism Rules ===
# Derive the cumulative distribution function <math display="inline">P(X \leq k)</math> for a binomial distribution.
 
# Let X1,...,Xk be ''k'' independent random variables, each following a <math display="inline">\chi^2</math> distribution, with n1,...,nk degrees of freedom respectively.<br />

What is the distribution of Y=X1+...+Xk? Define it precisely and prove the answer formally.
 
# List at least 3 random variables that “tend to follow” a t distribution.

# If X has a chi-square distribution with 5 degrees of freedom, what is the probability that X is between 1.145 and 12.83?

# If X has a gamma distribution with parameters (1,1), what is the probability density function of the random variable 2X?
 
   
  +
* If a student submits a solution to a weekly assignment, quiz, or lab that is identical to one submitted by another student, both students will receive the maximum number of points for that task with a negative sign.
=== Typical questions for seminar classes (labs) within this section ===
 
   
  +
=== Recommendations for students on how to succeed in the course ===
# Define and provide examples of sample space, events and probability measure.
 
  +
* Watch the video lecture and read the lecture notes before coming to the onsite lectures and to the labs.
# Write the formula for the coefficients of the simple linear regression. Explain the mathematical procedure used to derive them, and carry out the derivation.
 
  +
* Attend the onsite lectures and ask questions about any parts of the material that you find unclear.
# Calculate the correlation between two functions and explain its meaning.
 
  +
* Submit solutions to the weekly quizzes.
# Calculate the Pearson coefficient for the given functions.
 
# Deduce the MGF for normal distribution.
+
* Submit the weekly lab reports.
  +
* Prepare seriously for the midterm exam.
# State and prove the Bonferroni inequality.
 
  +
* Prepare seriously for the final exam.
   
  +
== Resources, literature and reference materials ==
=== Test questions for final assessment in this section ===
 
   
  +
=== Open access resources ===
  +
* The lecture notes and the video lectures provided via Moodle are sufficient for passing this course with grade A.
   
  +
=== Software and tools used within the course ===
== Test of Statistical Hypotheses ==
 
  +
* You can use any software of your choice to perform the lab tasks.
 
=== Topics covered in this section: ===
 
 
* Z-test
 
* Student’s t-test
 
* Chi-square test
 
* Snedecor’s F-test
 
 
=== What forms of evaluation were used to test students’ performance in this section? ===
 
 
=== Typical questions for ongoing performance evaluation within this section ===
 
 
# Define the concept of power of a statistical test.
 
# Define the purpose of the F Test, its hypotheses, and its structure.
 
# Define the purpose of the t-Test, its hypotheses, and its structure.
 
# Define the purpose of the Chi square Test, its hypotheses, and its structure.
 
# Define the purpose of the Z Test, its hypotheses, and its structure.
 
# Provide concrete numeric examples with explanation on why the power of a test depends on:
 
## the size of the data sets.
 
## the magnitude of the effect.
 
## the level of statistical significance.
 
# Given a statistical test for which we have set a significance level <math display="inline">\alpha</math> and obtained a p-value p:

## if we can reject H0 <math display="inline">(p < \alpha)</math>, what do we typically say about H0 and H1?

## if we cannot reject H0 <math display="inline">(p \geq \alpha)</math>, what can we typically say about H0 and H1?
 
## when can we say that H0 holds?
 
## when can we say that H1 holds?
 
 
=== Typical questions for seminar classes (labs) within this section ===
 
 
# Provide a concrete example of a t test, detailing both H0 and H1.
 
# Present the structure of the F test for the analysis of the variance.
 
# Explain what H0 and H1 are in hypothesis testing.
 
# Explain the role of the Bonferroni inequality in hypothesis testing.
 
 
=== Test questions for final assessment in this section ===
 
 
 
== Simple Linear Regression and Correlation Analysis ==
 
 
==== Topics covered in this section: ====
 
 
* Kolmogorov-Smirnov test
 
* Size of samples, Kolmogorov-Smirnov, Fisher exact
 
* Logistic regression
 
 
=== What forms of evaluation were used to test students’ performance in this section? ===
 
 
 
{| style="border-spacing: 2px; border: 1px solid darkgray;"
 
!align="center"| '''Evaluation'''
 
!align="center"| '''Yes/No'''
 
|-
 
| Development of individual parts of software product code
 
|align="center"| 0
 
|-
 
| Homework and group projects
 
|align="center"| 0
 
|-
 
| Midterm evaluation
 
|align="center"| 0
 
|-
 
| Testing (written or computer based)
 
|align="center"| 1
 
|-
 
| Reports
 
|align="center"| 0
 
|-
 
| Essays
 
|align="center"| 0
 
|-
 
| Oral polls
 
|align="center"| 0
 
|-
 
| Discussions
 
|align="center"| 1
 
|}
 
   
  +
= Teaching Methodology: Methods, techniques, & activities =
=== Typical questions for ongoing performance evaluation within this section ===
 
   
  +
== Formative Assessment and Course Activities ==
# Let X1, X2, ..., X10 be a random sample from a distribution whose probability density function is <math display="inline">f(x) = 1</math> if <math display="inline">0 < x < 1</math>, and 0 otherwise. Based on the observed values 0.62, 0.36, 0.23, 0.76, 0.65, 0.09, 0.55, 0.26, 0.38, 0.24, test the hypothesis <math display="inline">H_0 : X \sim \mathrm{UNIF}(0, 1)</math> against <math display="inline">H_1 : X \not\sim \mathrm{UNIF}(0, 1)</math> at significance level <math display="inline">\alpha = 0.1</math>.

# If X1, X2, ..., Xn is a random sample from a distribution with density function <math display="inline">f(x) = (1-\theta)x^\theta</math> if <math display="inline">0 < x < 1</math>, and 0 otherwise, what is the maximum likelihood estimator of <math display="inline">\theta</math>?

# Let X1, X2, ..., Xn be a random sample of size n from a distribution with probability density function <math display="inline">f(x) = (1-\theta)x^\theta</math> if <math display="inline">0 < x < 1</math>, and 0 otherwise, where <math display="inline">0 < \theta</math> is a parameter. Using the maximum likelihood method, find an estimator for the parameter <math display="inline">\theta</math>.
 
# Suppose you are told that the likelihood of <math display="inline">\theta</math> at <math display="inline">\theta=2</math> is given by 1/4. Is this the probability that <math display="inline">\theta=2</math>? Explain why or why not.
 
   
  +
=== Ongoing performance assessment ===
=== Typical questions for seminar classes (labs) within this section ===
 
  +
The performance will be assessed via weekly quizzes and weekly labs.
   
  +
=== Final assessment ===
# If X1, X2, ..., Xn is a random sample from a distribution with density function <math display="inline">f(x) = \frac{1}{\theta}</math> if <math display="inline">0 < x < 1</math>, and 0 otherwise, then what is the maximum likelihood estimator of <math display="inline">\theta</math>?
 
# Let X1,X2, ...,Xn be a random sample from a normal population with mean <math display="inline">\mu</math> and variance <math display="inline">\sigma^2</math>. What are the maximum likelihood estimators of <math display="inline">\mu</math> and <math display="inline">\sigma^2</math>?
 
# Suppose that you have the following data points: 0.36, 0.32, 0.10, 0.13, 0.45, 0.11, 0.12, 0.09; compute Dn to determine if they come from the uniform distribution [0,0.5].
 
# The data on the heights of 12 infants are given below: 18.2, 21.4, 22.6, 17.4, 17.6, 16.7, 17.1, 21.4, 20.1, 17.9, 16.8, 23.1. Test the hypothesis that the data came from some normal population at significance level <math display="inline">\alpha = 0.1</math>.
 
   
  +
The final assessment is in written form. You must have at least 50% on the final exam to pass the course.
=== Test questions for final assessment in the course ===
 
   
  +
=== The retake exam ===
# Provide a full example of two sequences (if the computation becomes too heavy, you may round to the first decimal digit). Compute their:
 
  +
The retake of the exam will be in a written form.
## Covariance.
 
## Pearson’s correlation coefficient.
 
## Spearman’s Rank Correlation Coefficient.
 
## Kendall’s tau Correlation coefficient.
 
# What is an empirical distribution?
 
# Present, prove, and discuss the evaluation of the asymptotic confidence interval for the empirical distribution, detailing the role of the binomial.
 
# Prove, under the simplified hypotheses, the distribution free property of Dn.
 
# Write the Shannon Theorem and discuss its implications.
 
# Discuss how we could proceed to compute the confidence interval of the Kendall Tau correlation coefficient of the population.
 
# Suppose that you have the following datapoints: 0.4, 2, 0.6, 2.4, 2.2, 3.6, 3.8, 4; compute Dn to determine if they come from the uniform distribution [0,4].
 
# Prove that <math display="inline">\tilde{F}_n</math> is a consistent and unbiased estimator of F.
 

Latest revision as of 12:12, 22 January 2024

Advanced Statistics

  • Course name: Advanced Statistics
  • Code discipline: DS-03
  • Subject area:

Short Description

This is a course in advanced statistics with a view toward applications in data science. It is intended for master's students who want to expand their knowledge of the theoretical methods used in modern data science research. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. It places particular emphasis on random vectors, random matrices, and random projections. It teaches basic theoretical skills for the analysis of these objects, including concentration inequalities, covering and packing arguments, decoupling and symmetrization tricks, chaining and comparison techniques for stochastic processes, combinatorial reasoning based on the VC dimension, and more. The course integrates theory with applications to covariance estimation, semidefinite programming, networks, elements of statistical learning, error-correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.

Prerequisites

Prerequisite subjects

  • Excellent knowledge of probability and statistics.

Prerequisite topics

Course Topics

Course Sections and Topics
Section Topics within the section
Concentration of sums of independent random variables
  1. Hoeffding’s inequality
  2. Chernoff’s inequality
  3. Sub-gaussian distributions
  4. Sub-exponential distributions
Random vectors in high dimensions
  1. Concentration of the norm
  2. Covariance matrices and principal component analysis
Random matrices
  1. Nets, covering numbers and packing numbers
  2. Covariance estimation and clustering
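
As an illustration of the first topic, the following minimal sketch compares the two-sided Hoeffding bound for variables taking values in [0, 1], P(|mean − μ| ≥ t) ≤ 2·exp(−2nt²), with a simulated tail frequency for fair coin flips. The function names, the sample size, the threshold, and the seed are illustrative assumptions and are not taken from the course materials.

```python
import math
import random


def hoeffding_bound(n: int, t: float) -> float:
    """Two-sided Hoeffding bound for the mean of n independent variables in [0, 1]."""
    return min(1.0, 2.0 * math.exp(-2.0 * n * t * t))


def empirical_tail(n: int, t: float, trials: int, rng: random.Random) -> float:
    """Fraction of trials in which the mean of n fair coin flips is at least t away from 0.5."""
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() < 0.5 for _ in range(n)) / n
        # Small tolerance so that deviations equal to t are not lost to rounding.
        if abs(mean - 0.5) >= t - 1e-12:
            hits += 1
    return hits / trials


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
n, t = 100, 0.1
bound = hoeffding_bound(n, t)
freq = empirical_tail(n, t, trials=5000, rng=rng)
print(f"Hoeffding bound: {bound:.4f}, simulated frequency: {freq:.4f}")
```

The simulated frequency should fall below the bound, which is typically not tight: the bound only uses the fact that the variables are bounded and independent.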

Intended Learning Outcomes (ILOs)

What is the main purpose of this course?

The main purpose of this course is to present the fundamentals of high-dimensional statistics with applications to data science. The course presents some of the key probabilistic methods and results that may form an essential mathematical toolbox for a data scientist. This course places particular emphasis on random vectors, random matrices, and random projections. This course integrates theory with applications for covariance estimation, semidefinite programming, networks, elements of statistical learning, error correcting codes, clustering, matrix completion, dimension reduction, sparse signal recovery, sparse regression, and more.

ILOs defined at three levels

Level 1: What concepts should a student know/remember/explain?

By the end of the course, the students should be able to ...

  • Explain the difference between low-dimensional and high-dimensional data
  • Explain concentration inequalities and their application
  • Remember the main statistical properties of high-dimensional vectors and matrices

Level 2: What basic practical skills should a student be able to perform?

By the end of the course, the students should be able to ...

  • Perform basic Monte Carlo computations, such as Monte Carlo integration
  • Obtain simple but accurate bounds of complex statistical metrics
  • Apply the median of means estimator
  • Investigate simple statistics of social networks
  • Exploit the thin-shell phenomenon when analysing data
  • Apply data clustering and dimension reduction
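
Two of the skills listed above, Monte Carlo integration and the median of means estimator, can be sketched in a few lines using only the Python standard library. This is a hedged illustration rather than course code: the function names, the integrand, the Pareto test distribution, and the number of blocks are assumptions made for the example.

```python
import math
import random
import statistics


def monte_carlo_integral(f, a: float, b: float, n: int, rng: random.Random) -> float:
    """Estimate the integral of f over [a, b] as (b - a) times the mean of f at n uniform points."""
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n


def median_of_means(sample, k: int) -> float:
    """Split the sample into k equal blocks, average each block, and return the median of the block means.

    Observations beyond k * (len(sample) // k) are ignored.
    """
    block = len(sample) // k
    if block == 0:
        raise ValueError("need at least k observations")
    means = [statistics.fmean(sample[i * block:(i + 1) * block]) for i in range(k)]
    return statistics.median(means)


rng = random.Random(1)  # fixed seed, chosen arbitrarily

# The exact value of the integral of sin over [0, pi] is 2.
estimate = monte_carlo_integral(math.sin, 0.0, math.pi, 100_000, rng)
print(f"Monte Carlo estimate: {estimate:.4f}")

# Heavy-tailed sample: Pareto with shape 2.5 has mean 2.5 / 1.5.
heavy = [rng.paretovariate(2.5) for _ in range(10_000)]
print(f"median of means: {median_of_means(heavy, k=20):.4f}")
```

The median of means trades a small loss of efficiency for robustness: a single extreme observation can distort at most one block mean, and the median ignores it.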

Level 3: What complex comprehensive skills should a student be able to apply in real-life scenarios?

By the end of the course, the students should be able to ...

  • Understand the problems related to statistical analysis of data.
  • Apply theoretical statistics in real-life settings via computer simulations, and thereby confirm or reject the correctness of the theoretical concepts.
  • Identify the correct statistical methods that need to be applied to data in order to solve given real-life tasks.
  • Generate and run experiments on random data samples.
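
A small simulation of the kind described above might check the concentration of the norm: for a standard Gaussian vector in dimension n, the Euclidean norm is close to √n, while its fluctuations stay of order one as n grows. The sketch below is an assumption-laden illustration (the dimensions, sample count, seed, and function name are chosen here, not prescribed by the course).

```python
import math
import random
import statistics


def gaussian_norms(dim: int, samples: int, rng: random.Random) -> list:
    """Euclidean norms of `samples` independent standard Gaussian vectors in R^dim."""
    return [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim)))
        for _ in range(samples)
    ]


rng = random.Random(2)  # fixed seed, chosen arbitrarily
results = {}
for dim in (10, 100, 1000):
    norms = gaussian_norms(dim, 500, rng)
    mean_ratio = statistics.fmean(norms) / math.sqrt(dim)  # expected to approach 1
    spread = statistics.stdev(norms)                       # expected to stay of order one
    results[dim] = (mean_ratio, spread)
    print(f"dim={dim:5d}  mean norm / sqrt(dim) = {mean_ratio:.4f}  std of norm = {spread:.4f}")
```

If the theory holds, the first column of output moves toward 1 as the dimension increases while the second stays roughly constant, which is the thin-shell phenomenon mentioned in the Level 2 outcomes.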

Grading

Course grading range

Grade | Range | Description of performance
A. Excellent | 85-100 | -
B. Good | 70-84 | -
C. Satisfactory | 50-69 | -
D. Poor | 0-49 | -

Minimum Requirements For Passing The Course

There are two requirements for passing this course:

  1. You must have at least 50% on the Final Exam.
  2. You must have at least 50% of the overall grade.

Course activities and grading breakdown

Activity Type | Percentage of the overall course grade
Quiz/Assignment during each lecture (weekly evaluations) | 20
Labs classes (weekly evaluations) | 20
Midterm | 20
Final exam | 40
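
The arithmetic behind the grading breakdown, the grading range, and the two passing requirements can be sketched as follows. This is only an illustration under stated assumptions: each component score is taken to be a percentage between 0 and 100, the lower end of each range is used as the threshold for non-integer totals, and all function and key names are hypothetical.

```python
# Weights from the grading breakdown, in percent of the overall course grade.
WEIGHTS = {"quizzes": 20, "labs": 20, "midterm": 20, "final": 40}


def overall_grade(scores: dict) -> float:
    """Weighted average of the component scores, each given as a percentage in [0, 100]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100.0


def letter_grade(total: float) -> str:
    """Map an overall percentage to the grading range (assumed thresholds: 85, 70, 50)."""
    if total >= 85:
        return "A"
    if total >= 70:
        return "B"
    if total >= 50:
        return "C"
    return "D"


def passes(scores: dict) -> bool:
    """Both requirements: at least 50% on the final exam and at least 50% overall."""
    return scores["final"] >= 50 and overall_grade(scores) >= 50


example = {"quizzes": 100, "labs": 100, "midterm": 100, "final": 40}
total = overall_grade(example)
print(total, letter_grade(total), passes(example))
```

The example shows why the two requirements are listed separately: a student with full marks during the semester but 40% on the final exam reaches 76% overall and still does not pass.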


Plagiarism Rules

  • If a student submits a solution to a weekly assignment, quiz, or lab that is identical to one submitted by another student, both students will receive the maximum number of points for that task with a negative sign.

Recommendations for students on how to succeed in the course

  • Watch the video lecture and read the lecture notes before coming to the onsite lectures and to the labs.
  • Attend the onsite lectures and ask questions about any parts of the material that you find unclear.
  • Submit solutions to the weekly quizzes.
  • Submit the weekly lab reports.
  • Prepare seriously for the midterm exam.
  • Prepare seriously for the final exam.

Resources, literature and reference materials

Open access resources

  • The lecture notes and the video lectures provided via Moodle are sufficient for passing this course with grade A.

Software and tools used within the course

  • You can use any software of your choice to perform the lab tasks.

Teaching Methodology: Methods, techniques, & activities

Formative Assessment and Course Activities

Ongoing performance assessment

The performance will be assessed via weekly quizzes and weekly labs.

Final assessment

The final assessment is in written form. You must have at least 50% on the final exam to pass the course.

The retake exam

The retake of the exam will be in a written form.