IU:TestPage
Data Mining
- Course name: Data Mining
- Course number: N/A
Course Characteristics
Key concepts of the class
- Data Preparation
- Association Pattern Mining
- Cluster Analysis
- Outlier Analysis
- Data Classification
- Mining Data Streams, Text Data and Discrete Sequences
What is the purpose of this course?
Data mining is the study of collecting, cleaning, processing, analyzing, and gaining useful insights from data. A wide variation exists in terms of the problem domains, applications, formulations, and data representations that are encountered in real applications. Therefore, “data mining” is a broad umbrella term that is used to describe these different aspects of data processing. This course aims to help students to correctly address large data volumes using advanced tools and techniques. This leads to unique challenges from the perspective of processing and analysis.
Course objectives based on Bloom’s taxonomy
- What should a student remember at the end of the course?
By the end of the course, the students should be able to
- The most common structures of distributed storage.
- Batch processing techniques
- Stream processing techniques
- Basic distributed data processing algorithms
- Basic tools to address specific processing needs
- What should a student be able to understand at the end of the course?
By the end of the course, the students should be able to
- Understand the entire chain of data processing
- Understand principle theories, models, tools and techniques
- Analyze and apply adequate models for new problems
- Understand new data mining tasks and provide solutions in different domains
- What should a student be able to apply at the end of the course?
By the end of the course, the students should be able to
- Design an appropriate model to cope with new requirements
- Latest trends, algorithms, technologies in big data
- Ability to determine appropriate approaches towards new challenges
- Proficiency in data analysis and performance evaluations
- Application of models, combination of multiple approaches, adaptation to interdisciplinary fields
Course evaluation
| type | points |
|---|---|
| Labs/seminar classes | 30 |
| Interim performance assessment | 30 |
| Exams | 40 |
Grades range
| grade | low | high |
|---|---|---|
| A | 90 | 100 |
| B | 75 | 89 |
| C | 60 | 74 |
| D | 0 | 59 |
Resources and reference material
- Jiawei Han, Micheline Kamber and Jian Pei. {\it Data Mining: Concepts and Techniques (3nd Edition)}
- Jure Leskovec, Anand Rajaraman and Jeffrey D. Ullman. {\it Mining of Massive Datasets}
Course Sections
The main sections of the course and approximate hour distribution between them is as follows:
Section 1
Section title
Introduction to Data Mining
Topics covered in this section
- What is Data Mining
- The Data Mining Process
- Data Preparation
- Similarity and Distances
What forms of evaluation were used to test students’ performance in this section?
| Form | Yes/No |
|---|---|
| Development of individual parts of software product code | 0 |
| Homework and group projects | 0 |
| Midterm evaluation | 0 |
| Testing (written or computer based) | 1 |
| Reports | 0 |
| Essays | 0 |
| Oral polls | 0 |
| Discussions | 1 |
Typical questions for ongoing performance evaluation within this section
- An analyst obtains medical notes from a physician for data mining purposes, and then transforms them into a table containing the medicines prescribed for each patient. What is the data type of (a) the original data, and (b) the transformed data? (c) What is the process of transforming the data to the new format called?
- An analyst sets up a sensor network in order to measure the temperature of different locations over a period. What is the data type of the data collected?
Typical questions for seminar classes (labs) within this section
- Design the structure of a DB to address a specific analytics type
Tasks for midterm assessment within this section
Test questions for final assessment in this section
- An analyst processes Web logs in order to create records with the ordering information for Web page accesses from different users. What is the type of this data?
- Consider a data object corresponding to a set of nucleotides arranged in a certain order. What is this type of data?
Section 2
Section title
Association Pattern Mining
Topics covered in this section
- Association Rule Generation Framework
- Frequent Itemset Mining Algorithms
- Pattern Summarization
What forms of evaluation were used to test students’ performance in this section?
| Form | Yes/No |
|---|---|
| Development of individual parts of software product code | 0 |
| Homework and group projects | 0 |
| Midterm evaluation | 0 |
| Testing (written or computer based) | 1 |
| Reports | 0 |
| Essays | 0 |
| Oral polls | 0 |
| Discussions | 1 |
Typical questions for ongoing performance evaluation within this section
- Consider the transaction database in the table below: \\ tid | Items \\ 1 | a, b, c, d \\ 2 | b, c, e, f \\ 3 | a, d, e, f \\ 4 | a, e, f \\ 5 | b, d, f \\ Determine the absolute support of itemsets {a, e, f}, and {d, f}. Convert the absolute support to the relative support.
Typical questions for seminar classes (labs) within this section
- Write a computer program to implement the greedy algorithm for finding a representative itemset from a group of itemsets.
- Write a computer program to implement an inverted index on a set of market baskets. Implement a query to retrieve all itemsets containing a particular set of items.
Tasks for midterm assessment within this section
Test questions for final assessment in this section
- Write a computer program to implement a signature table on a set of market baskets. Implement a query to retrieve the closest market basket to a target basket on the basis of the cosine similarity.
Section 3
Section title
Cluster Analysis
Topics covered in this section
- Feature Selection for Clustering
- Representative-Based Algorithms
- Probabilistic Model-Based Algorithms
- Graph-Based Algorithms
- Cluster Validation
What forms of evaluation were used to test students’ performance in this section?
| Form | Yes/No |
|---|---|
| Development of individual parts of software product code | 1 |
| Homework and group projects | 1 |
| Midterm evaluation | 1 |
| Testing (written or computer based) | 1 |
| Reports | 0 |
| Essays | 0 |
| Oral polls | 0 |
| Discussions | 1 |
Typical questions for ongoing performance evaluation within this section
- Consider the 1-dimensional data set with 10 data points \{1, 2, 3, . . . 10\}. Show three iterations of the k-means algorithms when k = 2, and the random seeds are initialized to \{1, 2\}.
Typical questions for seminar classes (labs) within this section
- Write a computer program to implement the k-representative algorithm. Use a modular program structure, in which the distance function and centroid determination are separate subroutines. Instantiate these subroutines to the cases of (i) the k-means algorithm, and (ii) the k-medians algorithm.
Tasks for midterm assessment within this section
Test questions for final assessment in this section
- Implement the k-modes algorithm. Download the KDD CUP 1999 Network Intrusion Data Set from the UCI Machine Learning Repository, and apply the algorithm to the categorical attributes of the data set. Compute the cluster purity with respect to class labels.
- What changes would be required to the BIRCH algorithm to implement it with the use of the Mahalanobis distance, to compute distances between data points and centroids? The diameter of a cluster is computed as its RMS Mahalanobis radius.