Revision as of 13:21, 7 February 2022 by R.sirgalina (talk | contribs)

Data Mining

  • Course name: Data Mining
  • Course number: N/A

Course Characteristics

Key concepts of the class

  • Data Preparation
  • Association Pattern Mining
  • Cluster Analysis
  • Outlier Analysis
  • Data Classification
  • Mining Data Streams, Text Data and Discrete Sequences

What is the purpose of this course?

Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights from data. Real applications vary widely in their problem domains, formulations, and data representations, which creates unique challenges for processing and analysis; “data mining” is therefore a broad umbrella term covering these different aspects of data processing. This course aims to help students address large data volumes correctly using advanced tools and techniques.

Course objectives based on Bloom’s taxonomy

- What should a student remember at the end of the course?

By the end of the course, the students should remember:

  • The most common structures of distributed storage
  • Batch processing techniques
  • Stream processing techniques
  • Basic distributed data processing algorithms
  • Basic tools to address specific processing needs

- What should a student be able to understand at the end of the course?

By the end of the course, the students should be able to

  • Understand the entire chain of data processing
  • Understand principal theories, models, tools, and techniques
  • Analyze and apply adequate models to new problems
  • Understand new data mining tasks and provide solutions in different domains

- What should a student be able to apply at the end of the course?

By the end of the course, the students should be able to

  • Design an appropriate model to cope with new requirements
  • Follow the latest trends, algorithms, and technologies in big data
  • Determine appropriate approaches towards new challenges
  • Analyze data and evaluate performance proficiently
  • Apply models, combine multiple approaches, and adapt them to interdisciplinary fields

Course evaluation

Course grade breakdown
type points
Labs/seminar classes 30
Interim performance assessment 30
Exams 40

Grades range

Course grading range
grade low high
A 90 100
B 75 89
C 60 74
D 0 59

Resources and reference material

  • Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques (3rd Edition)
  • Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets

Course Sections

The main sections of the course and the approximate hour distribution between them are as follows:

Section 1

Section title

Introduction to Data Mining

Topics covered in this section

  • What is Data Mining
  • The Data Mining Process
  • Data Preparation
  • Similarity and Distances

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code No
Homework and group projects No
Midterm evaluation No
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. An analyst obtains medical notes from a physician for data mining purposes, and then transforms them into a table containing the medicines prescribed for each patient. What is the data type of (a) the original data, and (b) the transformed data? (c) What is the process of transforming the data to the new format called?
  2. An analyst sets up a sensor network in order to measure the temperature of different locations over a period. What is the data type of the data collected?

Typical questions for seminar classes (labs) within this section

  1. Design the structure of a DB to address a specific analytics type

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. An analyst processes Web logs in order to create records with the ordering information for Web page accesses from different users. What is the type of this data?
  2. Consider a data object corresponding to a set of nucleotides arranged in a certain order. What is this type of data?

Section 2

Section title

Association Pattern Mining

Topics covered in this section

  • Association Rule Generation Framework
  • Frequent Itemset Mining Algorithms
  • Pattern Summarization

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code No
Homework and group projects No
Midterm evaluation No
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Consider the transaction database in the table below:

     tid | Items
     1   | a, b, c, d
     2   | b, c, e, f
     3   | a, d, e, f
     4   | a, e, f
     5   | b, d, f

     Determine the absolute support of the itemsets {a, e, f} and {d, f}. Convert the absolute support to the relative support.
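A minimal sketch of the support computation in the question above (the transaction contents are copied from the table; the helper name is illustrative):

```python
# Transaction database from the question above.
transactions = {
    1: {"a", "b", "c", "d"},
    2: {"b", "c", "e", "f"},
    3: {"a", "d", "e", "f"},
    4: {"a", "e", "f"},
    5: {"b", "d", "f"},
}

def absolute_support(itemset, db):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for items in db.values() if itemset <= items)

for itemset in ({"a", "e", "f"}, {"d", "f"}):
    abs_sup = absolute_support(itemset, transactions)
    rel_sup = abs_sup / len(transactions)  # relative support = fraction of transactions
    print(sorted(itemset), abs_sup, rel_sup)
```

Both itemsets appear in 2 of the 5 transactions, so each has relative support 0.4.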

Typical questions for seminar classes (labs) within this section

  1. Write a computer program to implement the greedy algorithm for finding a representative itemset from a group of itemsets.
  2. Write a computer program to implement an inverted index on a set of market baskets. Implement a query to retrieve all itemsets containing a particular set of items.
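A possible starting point for the inverted-index lab above (function and variable names are illustrative, not prescribed by the course):

```python
from collections import defaultdict

def build_inverted_index(baskets):
    """Map each item to the set of basket ids that contain it."""
    index = defaultdict(set)
    for tid, items in baskets.items():
        for item in items:
            index[item].add(tid)
    return index

def query(index, items):
    """Return the ids of all baskets containing every item in the query set."""
    posting_lists = [index.get(item, set()) for item in items]
    if not posting_lists:
        return set()
    # Intersecting the posting lists leaves the baskets that hold all query items.
    return set.intersection(*posting_lists)

baskets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}}
index = build_inverted_index(baskets)
print(query(index, {"a"}))       # baskets 1 and 3
print(query(index, {"a", "c"}))  # basket 3 only
```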

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Write a computer program to implement a signature table on a set of market baskets. Implement a query to retrieve the closest market basket to a target basket on the basis of the cosine similarity.

Section 3

Section title

Cluster Analysis

Topics covered in this section

  • Feature Selection for Clustering
  • Representative-Based Algorithms
  • Probabilistic Model-Based Algorithms
  • Graph-Based Algorithms
  • Cluster Validation

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Consider the 1-dimensional data set with 10 data points {1, 2, 3, ..., 10}. Show three iterations of the k-means algorithm when k = 2 and the random seeds are initialized to {1, 2}.
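The iterations asked for above can be traced with a short script (a sketch, assuming ties in the nearest-centroid assignment go to the first centroid and no cluster becomes empty):

```python
def kmeans_1d(points, centroids, iterations):
    """Run k-means on 1-dimensional data, printing the centroids per iteration."""
    for it in range(1, iterations + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(members) / len(members) for members in clusters.values()]
        print(f"iteration {it}: centroids = {centroids}")
    return centroids

kmeans_1d(points=list(range(1, 11)), centroids=[1, 2], iterations=3)
```

With these seeds the centroids move to [1.0, 6.0], then [2.0, 7.0], then [2.5, 7.5].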

Typical questions for seminar classes (labs) within this section

  1. Write a computer program to implement the k-representative algorithm. Use a modular program structure, in which the distance function and centroid determination are separate subroutines. Instantiate these subroutines to the cases of (i) the k-means algorithm, and (ii) the k-medians algorithm.

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Implement the k-modes algorithm. Download the KDD CUP 1999 Network Intrusion Data Set from the UCI Machine Learning Repository, and apply the algorithm to the categorical attributes of the data set. Compute the cluster purity with respect to class labels.
  2. What changes would be required to the BIRCH algorithm to implement it with the use of the Mahalanobis distance, to compute distances between data points and centroids? The diameter of a cluster is computed as its RMS Mahalanobis radius.

Section 4

Section title

Outlier Analysis

Topics covered in this section

  • Extreme Value Analysis
  • Probabilistic Models
  • Clustering for Outlier Detection
  • Distance and Density-Based Outlier Detection
  • Information-Theoretic Models
  • Outlier Validity
  • Outlier Detection with Categorical Data

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Suppose a particular random variable has mean 3 and standard deviation 2. Compute the Z-number for the values -1, 3, and 9. Which of these values can be considered the most extreme?
  2. Consider the four 2-dimensional data points (0, 0), (0, 1), (1, 0), and (100, 100). Plot them using mathematical software such as MATLAB. Which data point visually seems like an extreme value? Which data point is reported by the Mahalanobis measure as the strongest extreme value? Which data points are reported by a depth-based measure?
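The Z-number question above reduces to a one-line formula (a sketch; the function name is illustrative):

```python
def z_number(x, mean, std):
    """Number of standard deviations by which x deviates from the mean."""
    return (x - mean) / std

for x in (-1, 3, 9):
    print(x, z_number(x, mean=3, std=2))
```

The values -1, 3, and 9 give Z-numbers of -2, 0, and 3 respectively, so 9 is the most extreme value (largest |Z|).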

Typical questions for seminar classes (labs) within this section

  1. Implement the EM algorithm for clustering, and use it to implement a computation of the probabilistic outlier scores.

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Suppose that algorithm A is designed for outlier detection in numeric data, whereas algorithm B is designed for outlier detection in categorical data. Show how you can use these algorithms to perform outlier detection in a mixed-attribute data set.
  2. Design an algorithm for categorical outlier detection using the Mahalanobis distance. What are the advantages of such an approach?

Section 5

Section title

Data Classification

Topics covered in this section

  • Feature Selection for Classification
  • Rule-Based Classifiers
  • Classification paradigms
  • Instance-Based Learning
  • Multiclass and Rare Class Learning

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Show how to construct a (possibly overfitting) rule-based classifier that always exhibits 100% accuracy on the training data when no two training instances are identical.
  2. Design a univariate decision tree with a soft maximum-margin split criterion borrowed from SVMs. Suppose that this decision tree is generalized to the multivariate case. How does the resulting decision boundary compare with SVMs? Which classifier can handle a larger variety of data sets more accurately?

Typical questions for seminar classes (labs) within this section

  1. Implement an associative classifier in which only maximal patterns are used for classification, and the majority consequent label of the rules fired is reported as the label of the test instance.

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Discuss some general meta-strategies for speeding up classifiers. Discuss some strategies that you might use to scale up (a) nearest-neighbor classifiers, and (b) associative classifiers.
  2. Describe the changes required to the dual formulation of the soft SVM classifier with hinge loss to make it a weighted classifier for rare-class learning.
  3. Describe the changes required to the primal formulation of the soft SVM classifier with hinge loss to make it a weighted classifier for rare-class learning.

Section 6

Section title

Mining Data Streams

Topics covered in this section

  • Frequent Pattern Mining in Data Streams
  • Clustering Data Streams
  • Streaming Outlier Detection
  • Streaming Classification

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Analyze the CAP theorem
  2. Define the kinds of data storage available
  3. Describe the characteristics of batch processing
  4. Describe the characteristics of stream processing
  5. Describe the usage patterns
  6. Compare NoSQL databases

Typical questions for seminar classes (labs) within this section

  1. Discuss scenarios in which both the Hoeffding inequality and the Chernoff bound can be used. Which one applies more generally?
  2. Suppose that you have a reservoir of size k = 1000, and you have a sample of a stream containing an exactly equal distribution of two classes. Use the upper-tail Chernoff bound to determine the probability that the reservoir contains more than 600 samples of one of the two classes. Can the lower tail be used?
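For the reservoir question above, one common form of the upper-tail Chernoff bound is P(X > (1+δ)μ) ≤ exp(−μδ²/4); a sketch of the computation under that form (textbooks state the bound with slightly different constants, so treat the exact exponent as an assumption):

```python
import math

def chernoff_upper_tail(mu, delta):
    """Upper-tail Chernoff bound P(X > (1 + delta) * mu) <= exp(-mu * delta**2 / 4)."""
    return math.exp(-mu * delta ** 2 / 4)

mu = 1000 / 2         # expected count of one class in a 1000-slot reservoir
delta = 600 / mu - 1  # 600 = (1 + 0.2) * 500, so delta = 0.2
print(chernoff_upper_tail(mu, delta))  # exp(-5), about 0.0067
```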

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Implement the CluStream algorithm.
  2. Extend the implementation of the previous exercise to the problem of classification with the microclustering method.
  3. Implement the Flajolet–Martin algorithm for distinct element counting.
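A minimal sketch of the Flajolet–Martin estimator from the last exercise (the multiplicative hash family and the median-of-estimates step are implementation choices, not prescribed by the exercise):

```python
import random

def trailing_zeros(x):
    """Count the trailing zero bits of x (0 is mapped to 0 by convention here)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def flajolet_martin(stream, num_hashes=32, seed=0):
    """Estimate the number of distinct elements in the stream.

    For each hash function, track the maximum number of trailing zeros R
    seen in any hash value; 2**R estimates the distinct count, and taking
    the median over several hash functions reduces the variance.
    """
    rng = random.Random(seed)
    mask = (1 << 32) - 1
    estimates = []
    for _ in range(num_hashes):
        a = rng.randrange(1, mask, 2)  # random odd multiplier
        b = rng.randrange(mask)
        r_max = 0
        for x in stream:
            h = (a * hash(x) + b) & mask
            r_max = max(r_max, trailing_zeros(h))
        estimates.append(2 ** r_max)
    estimates.sort()
    return estimates[len(estimates) // 2]

# A stream with 1000 distinct elements, each seen three times; duplicates
# do not change the estimate because equal elements hash identically.
stream = list(range(1000)) * 3
print(flajolet_martin(stream))  # estimate near 1000, within a small power of two
```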