Information Retrieval for Data Science

  • Course name: Information Retrieval for Data Science
  • Course number: N/A

Course Characteristics

What subject area does your course (discipline) belong to?

Computer systems organization; Information systems; Real-time systems; Information retrieval; World Wide Web

Key concepts of the class

  • Data indexing
  • Relevance and ranking

What is the purpose of this course?

The course is designed to prepare students to understand and use contemporary tools of information retrieval systems. Students who will later dedicate their engineering or scientific careers to the implementation of search engines, social networks, recommender systems, and other content services will obtain the knowledge and skills necessary to design and implement the essential parts of such systems.

Course objectives based on Bloom’s taxonomy

- What should a student remember at the end of the course?

By the end of the course, the students should be able to remember and recognize

  • Terms and definitions used in the area of information retrieval,
  • The essential components of search engines and recommender systems,
  • Quality metrics for information retrieval systems,
  • Contemporary approaches to semantic data analysis,
  • Indexing strategies.

- What should a student be able to understand at the end of the course?

By the end of the course, the students should be able to describe and explain

  • How to design a recommender system from scratch,
  • How to evaluate the quality of a particular information retrieval system,
  • The core ideas behind system implementation and maintenance,
  • How to identify and fix problems in information retrieval systems.

- What should a student be able to apply at the end of the course?

By the end of the course, the students should be able to

  • Build a recommender service from scratch,
  • Implement a suitable index for an unstructured dataset,
  • Plan quality measures for a new recommender service,
  • Run initial data analysis and problem evaluation for a business task related to information retrieval.

Course evaluation

Course grade breakdown
Type                             Default points   Proposed points
Labs/seminar classes             20               0
Interim performance assessment   30               0
Assessments                      0                50
Exams                            50               50

If necessary, please describe any specific features of your course in terms of student performance assessment:

Labs are followed by homework assignments, which account for 50% of the grade. The remaining 50% of the grade falls to the exam session, which takes the form of a project defence.

Grades range

Course grading range
Grade             Default range   Proposed range
A. Excellent      90-100          80-100
B. Good           75-89           60-79
C. Satisfactory   60-74           40-59
D. Poor           0-59            0-39

Resources and reference material

Main textbook:

  • "An Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press (any edition)

Other reference material:

  • "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit", Steven Bird, Ewan Klein, and Edward Loper. [link]

Course Sections

The main sections of the course and the approximate distribution of hours between them are as follows:

Course Sections
Section Section Title Teaching Hours
1 Text processing (1, 3, 4, 7, 8, 9) 24
2 Text indexing and search (2, 5) 8
3 Web basics and algorithms for web (6, 10, 14) 12
4 Media types processing: images, video, sound (11, 12) 8
5 Quality assessment (13, 15) 8

Section 1

Section title:

Text processing

Topics covered in this section:

  • Introduction to information retrieval
  • Distributional semantics. Vector model. Dimension reduction
  • ML approaches to vector modelling
  • Spellchecking and query correction
  • Query expansion and suggest
  • Language model. Topic model. Clustering and classification

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Build a term-document matrix from a text.
  2. Visualize a dataset in 2D using PCA.
  3. Find the K most similar texts using an ANN-based language model (embeddings).
  4. Implement the Levenshtein distance function (see the sketch after this list).
  5. Discover the 5 major topics in a text dataset.
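
For item 4, a minimal dynamic-programming sketch of the Levenshtein distance in Python (the function name and test strings are illustrative, not part of the course materials):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions
        and substitutions needed to turn a into b."""
        prev = list(range(len(b) + 1))      # distances for the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]                      # distance from a[:i] to the empty string
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution (or match)
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3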

Typical questions for seminar classes (labs) within this section

  1. Build a term-document matrix for a set of web pages using TF-IDF (a sketch follows this list).
  2. Implement PCA-based latent semantic analysis for a set of texts with labels. Visualize the classes (labels) in 2D.
  3. Train a Doc2vec model and implement search with ranking using cosine similarity.
  4. Implement a spellchecking algorithm.
  5. Discover the optimal number of topics in a dataset using topic modelling.
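
For item 1, a minimal sketch of a TF-IDF term-document matrix built with scikit-learn; the toy documents stand in for the texts of crawled web pages:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "information retrieval for data science",
        "search engines rank documents by relevance",
        "recommender systems rank items for users",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)          # sparse matrix: documents x terms

    print(X.shape)
    print(vectorizer.get_feature_names_out())   # the vocabulary (matrix columns)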

Test questions for final assessment in this section

  1. What are the approaches to vector space modelling for text datasets?
  2. What are the most common techniques for processing text before indexing?
  3. Explain the problem statement of topic modelling (a minimal sketch follows this list).
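
One way to state question 3: given a corpus, find K latent topics (distributions over words) and, for each document, a mixture of those topics. A minimal sketch with scikit-learn's LatentDirichletAllocation, where the corpus and the number of topics are illustrative:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "neural networks learn vector representations of words",
        "word embeddings capture distributional semantics",
        "an inverted index speeds up full-text search",
        "search engines use an index to retrieve documents",
    ]

    counts = CountVectorizer().fit_transform(docs)   # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)           # documents x topics
    print(doc_topics.round(2))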

Section 2

Section title:

Text indexing and search

Topics covered in this section:

  • Building inverted index.
  • Language, tokenization, stemming, searching, scoring.
  • Indexing for vector model.
  • Kd-trees, Quad-trees, Annoy, FAISS, HNSW index.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Build an inverted index for a text (a minimal sketch follows this list).
  2. Tokenize a text.
  3. Write a kd-tree.
  4. Implement a small-world graph.
  5. Build an HNSW index.
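
For item 1, a minimal in-memory inverted index in pure Python; the tokenizer is deliberately naive and the documents are illustrative:

    import re
    from collections import defaultdict

    docs = {
        0: "Information retrieval for data science",
        1: "Data indexing and full-text search",
    }

    index = defaultdict(set)            # term -> set of document ids (postings)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)

    def search(term: str) -> set:
        return index.get(term.lower(), set())

    print(search("data"))  # {0, 1}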

Typical questions for seminar classes (labs) within this section

  1. Build an inverted index for a set of web pages.
  2. Build a distribution of stems/lexemes for a text.
  3. Choose and implement a persistent index for a given text collection.
  4. Compare two logarithmic index structures for a given task (a kd-tree sketch follows this list).
  5. Choose and implement an updateable index for a vector space of a given dimension.
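
For item 4, one of the two logarithmic structures could be a kd-tree; a minimal sketch with scipy.spatial.KDTree, where the dimensionality and data are illustrative:

    import numpy as np
    from scipy.spatial import KDTree

    rng = np.random.default_rng(0)
    points = rng.random((1000, 8))           # 1000 vectors in 8 dimensions

    tree = KDTree(points)                    # build once, O(n log n)
    dist, idx = tree.query(points[0], k=5)   # 5 nearest neighbours of one vector
    print(idx, dist)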

Test questions for final assessment in this section

  1. Explain how (and why) kd-trees work.
  2. What are the weak points of an inverted index?
  3. State the problem of approximate nearest neighbour search and give an overview of existing solutions.
  4. Discuss the architecture of the HNSW index.

Section 3

Section title:

Web basics and algorithms for web

Topics covered in this section:

  • Web basics. Internet crawling. XML and HTML processing. Processing of dynamic documents.
  • Newsfeeds.
  • Web-search-specific topics. PageRank. Duplicates. CTR.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Extract text from a web page.
  2. Extract text from a dynamic web page.
  3. Crawl a website while respecting robots.txt.
  4. Explain PageRank (a power-iteration sketch follows this list).
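
For item 4, a minimal power-iteration sketch of PageRank on a toy link graph; the damping factor 0.85 is the conventional choice and the graph is illustrative:

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0]}      # page -> pages it links to
    n, d = len(links), 0.85

    rank = np.full(n, 1.0 / n)
    for _ in range(50):                      # power iteration to (near) convergence
        new = np.full(n, (1 - d) / n)        # teleportation term
        for page, outs in links.items():
            for target in outs:
                new[target] += d * rank[page] / len(outs)
        rank = new

    print(rank)                              # importance scores, summing to 1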

Typical questions for seminar classes (labs) within this section

  1. Write a crawler for a dynamically generated website (a crawling sketch follows this list).
  2. Build a graph of a website.
  3. Run PageRank on the graph of a single website.
  4. Compute CTR from a given log and train a relevance model on this data.
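
For item 1, a dynamically generated site would additionally need a browser-automation tool such as Selenium to render JavaScript; the static skeleton below uses requests, BeautifulSoup and the standard urllib.robotparser (the start URL and user-agent string are illustrative):

    import urllib.robotparser
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    START, AGENT = "https://example.com/", "course-crawler"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(START, "/robots.txt"))
    rp.read()

    seen, frontier = set(), [START]
    while frontier:
        url = frontier.pop()
        if url in seen or not rp.can_fetch(AGENT, url):
            continue                          # skip visited or disallowed pages
        seen.add(url)
        html = requests.get(url, headers={"User-Agent": AGENT}).text
        soup = BeautifulSoup(html, "html.parser")
        print(url, soup.title.string if soup.title else "")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith(START):        # stay within the site
                frontier.append(link)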

Test questions for final assessment in this section

  1. Explain what the DOM (document object model) is and how it is used in document parsing.
  2. What is the difference between crawling static and dynamic pages?
  3. What are the weaknesses of PageRank?

Section 4

Section title:

Media types processing: images, video, sound

Topics covered in this section:

  • Image and video processing.
  • Image understanding
  • Image enhancing
  • Video understanding
  • Audio processing
  • Speech-to-text

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Extract the 4 dominant colors from an image (a k-means sketch follows this list).
  2. Split a video into shots.
  3. Identify a chord in an audio recording.
  4. Text-to-speech.
  5. Label objects in images.
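
For item 1, a minimal sketch that clusters pixels with k-means and reports the cluster centres as the dominant colors; Pillow and scikit-learn are assumed, and the file name is illustrative:

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    pixels = np.asarray(Image.open("photo.jpg").convert("RGB")).reshape(-1, 3)

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
    print(km.cluster_centers_.astype(int))   # 4 dominant colors as RGB triples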

Typical questions for seminar classes (labs) within this section

  1. Build a "search by color" feature.
  2. Extract scenes from video.
  3. Build a voice fingerprint and identify a person by voice.
  4. Write a voice-controlled search.
  5. Semantic search withing unlabelled image dataset.

Test questions for final assessment in this section

  1. What are the approaches to image understanding?
  2. How can a video be clustered into scenes and shots?
  3. How does speech-to-text technology work?
  4. How are audio fingerprints built?

Section 5

Section title:

Quality assessment

Topics covered in this section:

  • Search quality assessment.
  • Recommender quality assessment.
  • A/B testing.
  • Side-by-side (SBS) comparison.
  • pFound, DCG, nDCG.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Propose a strategy for A/B testing.
  2. Propose a quality metric for a recommender system.
  3. Implement the DCG metric (a sketch follows this list).
  4. Discuss a relevance metric.
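
For item 3, a minimal sketch of DCG and nDCG under one standard formulation (rel_1 + Σ rel_i / log2(i) for i ≥ 2); the example relevance grades are illustrative:

    import math

    def dcg(rels):
        # position 1 is not discounted; a log2(i) discount applies afterwards
        return sum(rel / math.log2(max(i, 2))
                   for i, rel in enumerate(rels, start=1))

    def ndcg(rels):
        ideal = dcg(sorted(rels, reverse=True))
        return dcg(rels) / ideal if ideal else 0.0

    print(ndcg([3, 2, 3, 0, 1, 2]))   # 1.0 would mean an ideally ordered list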

Typical questions for seminar classes (labs) within this section

  1. Provide a framework for accepting/rejecting A/B testing results.
  2. Compute DCG for an example query on an arbitrary search engine.
  3. Implement a metric for a recommender system.
  4. Implement pFound (a sketch follows this list).
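
For item 4, a minimal sketch of pFound under one common formulation of the cascade model: the user scans results top-down, is satisfied at position i with probability pRel_i, and otherwise abandons the session with a fixed pBreak (0.15 here); the relevance probabilities are illustrative:

    def pfound(p_rel, p_break=0.15):
        p_look, total = 1.0, 0.0               # p_look: user reaches this position
        for p in p_rel:
            total += p_look * p                # user finds the answer here
            p_look *= (1 - p) * (1 - p_break)  # user continues to the next result
        return total

    print(pfound([0.4, 0.3, 0.2]))   # probability the user finds an answer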

Test questions for final assessment in this section

  1. What is SBS (side-by-side) and how is it used in search engines?
  2. Compare pFound with CTR and with DCG.
  3. Explain how A/B testing works.