Revision as of 13:21, 7 February 2022 by R.sirgalina (talk | contribs)

Data Mining

  • Course name: Data Mining
  • Course number: N/A

Course Characteristics

Key concepts of the class

  • Data Preparation
  • Association Pattern Mining
  • Cluster Analysis
  • Outlier Analysis
  • Data Classification
  • Mining Data Streams, Text Data and Discrete Sequences

What is the purpose of this course?

Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights from data. Real applications vary widely in their problem domains, formulations, and data representations, which creates unique challenges for processing and analysis; “data mining” is therefore a broad umbrella term covering these different aspects of data processing. This course aims to help students address large data volumes correctly using advanced tools and techniques.

Course objectives based on Bloom’s taxonomy

- What should a student remember at the end of the course?

By the end of the course, the students should remember:

  • The most common structures of distributed storage
  • Batch processing techniques
  • Stream processing techniques
  • Basic distributed data processing algorithms
  • Basic tools to address specific processing needs

- What should a student be able to understand at the end of the course?

By the end of the course, the students should be able to

  • Understand the entire chain of data processing
  • Understand principal theories, models, tools, and techniques
  • Analyze and apply adequate models to new problems
  • Understand new data mining tasks and provide solutions in different domains

- What should a student be able to apply at the end of the course?

By the end of the course, the students should be able to

  • Design an appropriate model to cope with new requirements
  • Follow the latest trends, algorithms, and technologies in big data
  • Determine appropriate approaches towards new challenges
  • Analyze data and evaluate performance proficiently
  • Apply models, combine multiple approaches, and adapt them to interdisciplinary fields

Course evaluation

Course grade breakdown
type points
Labs/seminar classes 30
Interim performance assessment 30
Exams 40

Grades range

Course grading range
grade low high
A 90 100
B 75 89
C 60 74
D 0 59

Resources and reference material

  • Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques (3rd Edition)
  • Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets

Course Sections

The main sections of the course and the approximate hour distribution between them are as follows:

Section 1

Section title

Introduction to Data Mining

Topics covered in this section

  • What is Data Mining
  • The Data Mining Process
  • Data Preparation
  • Similarity and Distances

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code No
Homework and group projects No
Midterm evaluation No
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. An analyst obtains medical notes from a physician for data mining purposes, and then transforms them into a table containing the medicines prescribed for each patient. What is the data type of (a) the original data, and (b) the transformed data? (c) What is the process of transforming the data to the new format called?
  2. An analyst sets up a sensor network in order to measure the temperature of different locations over a period. What is the data type of the data collected?

Typical questions for seminar classes (labs) within this section

  1. Design the structure of a DB to address a specific analytics type

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. An analyst processes Web logs in order to create records with the ordering information for Web page accesses from different users. What is the type of this data?
  2. Consider a data object corresponding to a set of nucleotides arranged in a certain order. What is this type of data?

Section 2

Section title

Association Pattern Mining

Topics covered in this section

  • Association Rule Generation Framework
  • Frequent Itemset Mining Algorithms
  • Pattern Summarization

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code No
Homework and group projects No
Midterm evaluation No
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Consider the transaction database in the table below:

     tid | Items
     1   | a, b, c, d
     2   | b, c, e, f
     3   | a, d, e, f
     4   | a, e, f
     5   | b, d, f

     Determine the absolute support of the itemsets {a, e, f} and {d, f}. Convert the absolute support to the relative support.
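A minimal sketch of the support computation in the question above (the transaction contents are copied from the table; the helper name is illustrative):

```python
# Transaction database from the question above.
transactions = {
    1: {"a", "b", "c", "d"},
    2: {"b", "c", "e", "f"},
    3: {"a", "d", "e", "f"},
    4: {"a", "e", "f"},
    5: {"b", "d", "f"},
}

def absolute_support(itemset, db):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for items in db.values() if itemset <= items)

for itemset in ({"a", "e", "f"}, {"d", "f"}):
    abs_sup = absolute_support(itemset, transactions)
    rel_sup = abs_sup / len(transactions)  # relative support = fraction of transactions
    print(sorted(itemset), abs_sup, rel_sup)
```

Both itemsets appear in 2 of the 5 transactions, so each has relative support 0.4.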

Typical questions for seminar classes (labs) within this section

  1. Write a computer program to implement the greedy algorithm for finding a representative itemset from a group of itemsets.
  2. Write a computer program to implement an inverted index on a set of market baskets. Implement a query to retrieve all itemsets containing a particular set of items.
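A possible starting point for the inverted-index lab above (function and variable names are illustrative, not prescribed by the course):

```python
from collections import defaultdict

def build_inverted_index(baskets):
    """Map each item to the set of basket ids that contain it."""
    index = defaultdict(set)
    for tid, items in baskets.items():
        for item in items:
            index[item].add(tid)
    return index

def query(index, items):
    """Return the ids of all baskets containing every item in the query set."""
    posting_lists = [index.get(item, set()) for item in items]
    if not posting_lists:
        return set()
    # Intersecting the posting lists leaves the baskets that hold all query items.
    return set.intersection(*posting_lists)

baskets = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"a", "c"}}
index = build_inverted_index(baskets)
print(query(index, {"a"}))       # baskets 1 and 3
print(query(index, {"a", "c"}))  # basket 3 only
```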

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Write a computer program to implement a signature table on a set of market baskets. Implement a query to retrieve the closest market basket to a target basket on the basis of the cosine similarity.

Section 3

Section title

Cluster Analysis

Topics covered in this section

  • Feature Selection for Clustering
  • Representative-Based Algorithms
  • Probabilistic Model-Based Algorithms
  • Graph-Based Algorithms
  • Cluster Validation

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Consider the 1-dimensional data set with 10 data points {1, 2, 3, ..., 10}. Show three iterations of the k-means algorithm when k = 2 and the random seeds are initialized to {1, 2}.
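The iterations asked for above can be traced with a short script (a sketch, assuming ties in the nearest-centroid assignment go to the first centroid and no cluster becomes empty):

```python
def kmeans_1d(points, centroids, iterations):
    """Run k-means on 1-dimensional data, printing the centroids per iteration."""
    for it in range(1, iterations + 1):
        # Assignment step: attach each point to its nearest centroid.
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(members) / len(members) for members in clusters.values()]
        print(f"iteration {it}: centroids = {centroids}")
    return centroids

kmeans_1d(points=list(range(1, 11)), centroids=[1, 2], iterations=3)
```

With these seeds the centroids move to [1.0, 6.0], then [2.0, 7.0], then [2.5, 7.5].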

Typical questions for seminar classes (labs) within this section

  1. Write a computer program to implement the k-representative algorithm. Use a modular program structure, in which the distance function and centroid determination are separate subroutines. Instantiate these subroutines to the cases of (i) the k-means algorithm, and (ii) the k-medians algorithm.

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Implement the k-modes algorithm. Download the KDD CUP 1999 Network Intrusion Data Set from the UCI Machine Learning Repository, and apply the algorithm to the categorical attributes of the data set. Compute the cluster purity with respect to class labels.
  2. What changes would be required to the BIRCH algorithm to implement it with the use of the Mahalanobis distance, to compute distances between data points and centroids? The diameter of a cluster is computed as its RMS Mahalanobis radius.

Section 4

Section title

Outlier Analysis

Topics covered in this section

  • Extreme Value Analysis
  • Probabilistic Models
  • Clustering for Outlier Detection
  • Distance and Density-Based Outlier Detection
  • Information-Theoretic Models
  • Outlier Validity
  • Outlier Detection with Categorical Data

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Suppose a particular random variable has mean 3 and standard deviation 2. Compute the Z-number for the values -1, 3, and 9. Which of these values can be considered the most extreme?
  2. Consider the four 2-dimensional data points (0, 0), (0, 1), (1, 0), and (100, 100). Plot them using mathematical software such as MATLAB. Which data point visually seems like an extreme value? Which data point is reported by the Mahalanobis measure as the strongest extreme value? Which data points are reported by a depth-based measure?
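The Z-number question above reduces to a one-line formula (a sketch; the function name is illustrative):

```python
def z_number(x, mean, std):
    """Number of standard deviations by which x deviates from the mean."""
    return (x - mean) / std

for x in (-1, 3, 9):
    print(x, z_number(x, mean=3, std=2))
```

The values -1, 3, and 9 give Z-numbers of -2, 0, and 3 respectively, so 9 is the most extreme value (largest |Z|).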

Typical questions for seminar classes (labs) within this section

  1. Implement the EM algorithm for clustering, and use it to implement a computation of the probabilistic outlier scores.

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Suppose that algorithm A is designed for outlier detection in numeric data, whereas algorithm B is designed for outlier detection in categorical data. Show how you can use these algorithms to perform outlier detection in a mixed-attribute data set.
  2. Design an algorithm for categorical outlier detection using the Mahalanobis distance. What are the advantages of such an approach?

Section 5

Section title

Data Classification

Topics covered in this section

  • Feature Selection for Classification
  • Rule-Based Classifiers
  • Classification paradigms
  • Instance-Based Learning
  • Multiclass and Rare Class Learning

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Show how to construct a (possibly overfitting) rule-based classifier that always exhibits 100% accuracy on the training data when no two training instances are identical.
  2. Design a univariate decision tree with a soft maximum-margin split criterion borrowed from SVMs. Suppose that this decision tree is generalized to the multivariate case. How does the resulting decision boundary compare with SVMs? Which classifier can handle a larger variety of data sets more accurately?

Typical questions for seminar classes (labs) within this section

  1. Implement an associative classifier in which only maximal patterns are used for classification, and the majority consequent label of the rules fired is reported as the label of the test instance.

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Discuss some general meta-strategies for speeding up classifiers. Discuss some strategies that you might use to scale up (a) nearest-neighbor classifiers, and (b) associative classifiers.
  2. Describe the changes required to the dual formulation of the soft SVM classifier with hinge loss to make it a weighted classifier for rare-class learning.
  3. Describe the changes required to the primal formulation of the soft SVM classifier with hinge loss to make it a weighted classifier for rare-class learning.

Section 6

Section title

Mining Data Streams

Topics covered in this section

  • Frequent Pattern Mining in Data Streams
  • Clustering Data Streams
  • Streaming Outlier Detection
  • Streaming Classification

What forms of evaluation were used to test students’ performance in this section?

Form Yes/No
Development of individual parts of software product code Yes
Homework and group projects Yes
Midterm evaluation Yes
Testing (written or computer based) Yes
Reports No
Essays No
Oral polls No
Discussions Yes

Typical questions for ongoing performance evaluation within this section

  1. Analyze the CAP theorem
  2. Define the kinds of data storage available
  3. Describe the characteristics of batch processing
  4. Describe the characteristics of stream processing
  5. Describe the usage patterns
  6. Compare NoSQL databases

Typical questions for seminar classes (labs) within this section

  1. Discuss scenarios in which both the Hoeffding inequality and the Chernoff bound can be used. Which one applies more generally?
  2. Suppose that you have a reservoir of size k = 1000, and you have a sample of a stream containing an exactly equal distribution of two classes. Use the upper-tail Chernoff bound to determine the probability that the reservoir contains more than 600 samples of one of the two classes. Can the lower tail be used?
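For the reservoir question above, one common form of the upper-tail Chernoff bound is P(X > (1+δ)μ) ≤ exp(−μδ²/4); a sketch of the computation under that form (textbooks state the bound with slightly different constants, so treat the exact exponent as an assumption):

```python
import math

def chernoff_upper_tail(mu, delta):
    """Upper-tail Chernoff bound P(X > (1 + delta) * mu) <= exp(-mu * delta**2 / 4)."""
    return math.exp(-mu * delta ** 2 / 4)

mu = 1000 / 2         # expected count of one class in a 1000-slot reservoir
delta = 600 / mu - 1  # 600 = (1 + 0.2) * 500, so delta = 0.2
print(chernoff_upper_tail(mu, delta))  # exp(-5), about 0.0067
```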

Tasks for midterm assessment within this section

Test questions for final assessment in this section

  1. Implement the CluStream algorithm.
  2. Extend the implementation of the previous exercise to the problem of classification with the microclustering method.
  3. Implement the Flajolet–Martin algorithm for distinct element counting.
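A minimal sketch of the Flajolet–Martin estimator from the last exercise (the multiplicative hash family and the median-of-estimates step are implementation choices, not prescribed by the exercise):

```python
import random

def trailing_zeros(x):
    """Count the trailing zero bits of x (0 is mapped to 0 by convention here)."""
    if x == 0:
        return 0
    return (x & -x).bit_length() - 1

def flajolet_martin(stream, num_hashes=32, seed=0):
    """Estimate the number of distinct elements in the stream.

    For each hash function, track the maximum number of trailing zeros R
    seen in any hash value; 2**R estimates the distinct count, and taking
    the median over several hash functions reduces the variance.
    """
    rng = random.Random(seed)
    mask = (1 << 32) - 1
    estimates = []
    for _ in range(num_hashes):
        a = rng.randrange(1, mask, 2)  # random odd multiplier
        b = rng.randrange(mask)
        r_max = 0
        for x in stream:
            h = (a * hash(x) + b) & mask
            r_max = max(r_max, trailing_zeros(h))
        estimates.append(2 ** r_max)
    estimates.sort()
    return estimates[len(estimates) // 2]

# A stream with 1000 distinct elements, each seen three times; duplicates
# do not change the estimate because equal elements hash identically.
stream = list(range(1000)) * 3
print(flajolet_martin(stream))  # estimate near 1000, within a small power of two
```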