Data Mining

Course name: Data Mining
Course number: N/A

Course Characteristics

Key concepts of the class

Data Preparation
Association Pattern Mining
Cluster Analysis
Outlier Analysis
Data Classification
Mining Data Streams, Text Data and Discrete Sequences

What is the purpose of this course?

Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights from data. A wide variation exists in terms of the problem domains, applications, formulations, and data representations that are encountered in real applications. Therefore, “data mining” is a broad umbrella term that is used to describe these different aspects of data processing. Large data volumes pose unique challenges from the perspective of processing and analysis, and this course aims to help students address them correctly using advanced tools and techniques.

Course objectives based on Bloom’s taxonomy

- What should a student remember at the end of the course?

By the end of the course, the students should be able to remember

The most common structures of distributed storage
Batch processing techniques
Stream processing techniques
Basic distributed data processing algorithms
Basic tools to address specific processing needs

- What should a student be able to understand at the end of the course?

By the end of the course, the students should be able to

Understand the entire chain of data processing
Understand principal theories, models, tools and techniques
Analyze and apply adequate models for new problems
Understand new data mining tasks and provide solutions in different domains

- What should a student be able to apply at the end of the course?

By the end of the course, the students should be able to

Design an appropriate model to cope with new requirements
Apply the latest trends, algorithms, and technologies in big data
Determine appropriate approaches to new challenges
Analyze data and evaluate performance proficiently
Apply models, combine multiple approaches, and adapt them to interdisciplinary fields

Course evaluation

Course grade breakdown
type	points
Labs/seminar classes	30
Interim performance assessment	30
Exams	40

Grades range

Course grading range
grade	low	high
A	90	100
B	75	89
C	60	74
D	0	59

Resources and reference material

Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques (3rd Edition)
Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman. Mining of Massive Datasets

Course Sections

The main sections of the course and the approximate hour distribution between them are as follows:

Section 1

Section title

Introduction to Data Mining

Topics covered in this section

What is Data Mining
The Data Mining Process
Data Preparation
Similarity and Distances

What forms of evaluation were used to test students’ performance in this section?


Form	Yes/No
Development of individual parts of software product code	0
Homework and group projects	0
Midterm evaluation	0
Testing (written or computer based)	1
Reports	0
Essays	0
Oral polls	0
Discussions	1

Typical questions for ongoing performance evaluation within this section

An analyst obtains medical notes from a physician for data mining purposes, and then transforms them into a table containing the medicines prescribed for each patient. What is the data type of (a) the original data, and (b) the transformed data? (c) What is the process of transforming the data to the new format called?
An analyst sets up a sensor network in order to measure the temperature of different locations over a period. What is the data type of the data collected?

Typical questions for seminar classes (labs) within this section

Design the structure of a database to support a specific type of analytics

Tasks for midterm assessment within this section

Test questions for final assessment in this section

An analyst processes Web logs in order to create records with the ordering information for Web page accesses from different users. What is the type of this data?
Consider a data object corresponding to a set of nucleotides arranged in a certain order. What is this type of data?

Section 2

Section title

Association Pattern Mining

Topics covered in this section

Association Rule Generation Framework
Frequent Itemset Mining Algorithms
Pattern Summarization

What forms of evaluation were used to test students’ performance in this section?


Form	Yes/No
Development of individual parts of software product code	0
Homework and group projects	0
Midterm evaluation	0
Testing (written or computer based)	1
Reports	0
Essays	0
Oral polls	0
Discussions	1

Typical questions for ongoing performance evaluation within this section

Consider the transaction database in the table below:

tid	Items
1	a, b, c, d
2	b, c, e, f
3	a, d, e, f
4	a, e, f
5	b, d, f

Determine the absolute support of the itemsets {a, e, f} and {d, f}. Convert the absolute support to the relative support.
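The support computation in this question can be checked with a short script. The following is a minimal sketch in Python; the function names absolute_support and relative_support are illustrative choices, not part of the course material.

```python
# Transaction database copied from the question above (tid -> items).
transactions = {
    1: {"a", "b", "c", "d"},
    2: {"b", "c", "e", "f"},
    3: {"a", "d", "e", "f"},
    4: {"a", "e", "f"},
    5: {"b", "d", "f"},
}


def absolute_support(itemset, db):
    """Count the transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for items in db.values() if wanted <= items)


def relative_support(itemset, db):
    """Absolute support divided by the number of transactions."""
    return absolute_support(itemset, db) / len(db)


for candidate in ({"a", "e", "f"}, {"d", "f"}):
    print(sorted(candidate),
          absolute_support(candidate, transactions),
          relative_support(candidate, transactions))
```

The relative support is simply the absolute support divided by the number of transactions, which is 5 for this table.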

Typical questions for seminar classes (labs) within this section

Write a computer program to implement the greedy algorithm for finding a representative itemset from a group of itemsets.
Write a computer program to implement an inverted index on a set of market baskets. Implement a query to retrieve all itemsets containing a particular set of items.
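As a starting point for the inverted-index exercise, one possible structure is sketched below in Python. The basket contents and the names build_inverted_index and query are hypothetical and chosen only for illustration. The index maps each item to the set of basket identifiers that contain it, and a query intersects the posting sets of the requested items.

```python
from collections import defaultdict


def build_inverted_index(baskets):
    """Map each item to the set of basket ids (list positions) containing it."""
    index = defaultdict(set)
    for basket_id, items in enumerate(baskets):
        for item in items:
            index[item].add(basket_id)
    return index


def query(index, items):
    """Return the ids of all baskets that contain every item in `items`."""
    items = list(items)
    if not items:
        return set()
    # index.get avoids creating empty entries for unknown items.
    postings = [index.get(item, set()) for item in items]
    return set.intersection(*postings)


# Hypothetical market baskets, used only to exercise the functions.
example_baskets = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs", "milk"},
]
example_index = build_inverted_index(example_baskets)
print(query(example_index, {"milk", "bread"}))
```

Intersecting posting sets keeps the query cost proportional to the sizes of the posting lists involved rather than to the total number of baskets.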

Tasks for midterm assessment within this section

Test questions for final assessment in this section

Write a computer program to implement a signature table on a set of market baskets. Implement a query to retrieve the closest market basket to a target basket on the basis of the cosine similarity.

Section 3

Section title

Cluster Analysis

Topics covered in this section

Feature Selection for Clustering
Representative-Based Algorithms
Probabilistic Model-Based Algorithms
Graph-Based Algorithms
Cluster Validation

What forms of evaluation were used to test students’ performance in this section?


Form	Yes/No
Development of individual parts of software product code	1
Homework and group projects	1
Midterm evaluation	1
Testing (written or computer based)	1
Reports	0
Essays	0
Oral polls	0
Discussions	1

Typical questions for ongoing performance evaluation within this section

Consider the 1-dimensional data set with 10 data points {1, 2, 3, ..., 10}. Show three iterations of the k-means algorithm when k = 2, and the random seeds are initialized to {1, 2}.
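The iterations asked for in this question can be traced with a small script. This is a minimal sketch of the standard assign-then-update k-means loop in one dimension; the function name kmeans_1d and the choice to keep an empty cluster's previous centroid are assumptions made for illustration.

```python
def kmeans_1d(points, centroids, iterations):
    """Run k-means on 1-D points and record (clusters, centroids) per iteration."""
    history = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster.
        # An empty cluster keeps its previous centroid (an assumption here).
        centroids = [sum(c) / len(c) if c else old
                     for c, old in zip(clusters, centroids)]
        history.append((clusters, centroids))
    return history


history = kmeans_1d(list(range(1, 11)), [1, 2], iterations=3)
for step, (clusters, centroids) in enumerate(history, start=1):
    print(step, clusters, centroids)
```

With the seeds {1, 2}, the centroids move from (1, 6) after the first iteration to (2, 7) after the second and (2.5, 7.5) after the third.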

Typical questions for seminar classes (labs) within this section

Write a computer program to implement the k-representative algorithm. Use a modular program structure, in which the distance function and centroid determination are separate subroutines. Instantiate these subroutines to the cases of (i) the k-means algorithm, and (ii) the k-medians algorithm.
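One way to organize the modular structure described in this exercise is sketched below. It is an outline under stated assumptions rather than a reference solution: the function names, the fixed iteration count, and the rule that an empty cluster keeps its previous representative are all illustrative choices. The distance function and the representative update are passed in as arguments, so k-means and k-medians differ only in the two subroutines supplied.

```python
from statistics import mean, median


def squared_euclidean(x, y):
    """Distance subroutine for k-means."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def manhattan(x, y):
    """Distance subroutine for k-medians."""
    return sum(abs(a - b) for a, b in zip(x, y))


def mean_center(cluster):
    """Representative for k-means: dimension-wise mean."""
    return tuple(mean(dim) for dim in zip(*cluster))


def median_center(cluster):
    """Representative for k-medians: dimension-wise median."""
    return tuple(median(dim) for dim in zip(*cluster))


def k_representative(points, seeds, distance, center, iterations=10):
    """Generic k-representative loop with pluggable distance and center."""
    reps = list(seeds)
    clusters = [[] for _ in reps]
    for _ in range(iterations):
        # Assign every point to its closest representative.
        clusters = [[] for _ in reps]
        for p in points:
            nearest = min(range(len(reps)),
                          key=lambda i: distance(p, reps[i]))
            clusters[nearest].append(p)
        # Recompute representatives; empty clusters keep the old one.
        reps = [center(c) if c else r for c, r in zip(clusters, reps)]
    return reps, clusters


# Hypothetical 2-D points and seeds, used only to exercise both variants.
demo_points = [(0, 0), (0, 2), (10, 10), (10, 12)]
demo_seeds = [(0, 0), (10, 10)]
means_reps, _ = k_representative(demo_points, demo_seeds,
                                 squared_euclidean, mean_center)
medians_reps, _ = k_representative(demo_points, demo_seeds,
                                   manhattan, median_center)
print(means_reps, medians_reps)
```

Pairing the Manhattan distance with the dimension-wise median follows the usual formulation of k-medians, just as the squared Euclidean distance pairs with the mean in k-means.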

Tasks for midterm assessment within this section

Test questions for final assessment in this section

Implement the k-modes algorithm. Download the KDD CUP 1999 Network Intrusion Data Set from the UCI Machine Learning Repository, and apply the algorithm to the categorical attributes of the data set. Compute the cluster purity with respect to class labels.
What changes would be required to the BIRCH algorithm to implement it with the use of the Mahalanobis distance, to compute distances between data points and centroids? The diameter of a cluster is computed as its RMS Mahalanobis radius.
