Information Retrieval for Data Science

  • Course name: Information Retrieval for Data Science
  • Course number: N/A

Course Characteristics

What subject area does your course (discipline) belong to?

Computer systems organization; Information systems; Real-time systems; Information retrieval; World Wide Web

Key concepts of the class

  • Data indexing
  • Relevance and ranking

What is the purpose of this course?

The course is designed to prepare students to understand and use contemporary tools of information retrieval systems. Students who will later dedicate their engineering or scientific careers to the implementation of search engines, social networks, recommender systems, and other content services will obtain the knowledge and skills necessary to design and implement the essential parts of such systems.

Course objectives based on Bloom’s taxonomy

- What should a student remember at the end of the course?

By the end of the course, the students should be able to remember and recognize

  • Terms and definitions used in the area of information retrieval,
  • The essential components of search engines and recommender systems,
  • Quality metrics for information retrieval systems,
  • Contemporary approaches to semantic data analysis,
  • Indexing strategies.

- What should a student be able to understand at the end of the course?

By the end of the course, the students should be able to describe and explain

  • How to design a recommender system from scratch,
  • How to evaluate the quality of a particular information retrieval system,
  • The core ideas behind system implementation and maintenance,
  • How to identify and fix problems in information retrieval systems.

- What should a student be able to apply at the end of the course?

By the end of the course, the students should be able to

  • Build a recommender service from scratch,
  • Implement a suitable index for an unstructured dataset,
  • Plan quality measures for a new recommender service,
  • Run initial data analysis and problem evaluation for a business task related to information retrieval.

Course evaluation

Course grade breakdown
Type                             Default points   Proposed points
Labs/seminar classes             20               0
Interim performance assessment   30               0
Assessments                      0                50
Exams                            50               50

If necessary, please describe any specific features of your course in terms of student performance assessment:

Labs are followed by homework assignments, which account for 50% of the grade. The remaining 50% of the grade falls to the exam session, which takes the form of a project defence.

Grades range

Course grading range
Grade             Default range   Proposed range
A. Excellent      90-100          80-100
B. Good           75-89           60-79
C. Satisfactory   60-74           40-59
D. Poor           0-59            0-39

Resources and reference material

Main textbook:

  • "An Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Cambridge University Press (any edition)

Other reference material:

  • "Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit", Steven Bird, Ewan Klein, and Edward Loper. [link]

Course Sections

The main sections of the course and the approximate distribution of hours between them are as follows:

Course Sections
Section Section Title Teaching Hours
1 Text processing (1, 3, 4, 7, 8, 9) 24
2 Text indexing and search (2, 5) 8
3 Web basics and algorithms for web (6, 10, 14) 12
4 Media types processing: images, video, sound (11, 12) 8
5 Quality assessment (13, 15) 8

Section 1

Section title:

Text processing

Topics covered in this section:

  • Introduction to information retrieval
  • Distributional semantics. Vector model. Dimension reduction
  • ML approaches to vector modelling
  • Spellchecking and query correction
  • Query expansion and suggest
  • Language model. Topic model. Clustering and classification

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Build a term-document matrix from a text.
  2. Visualize a dataset in 2D using PCA.
  3. Find the K most similar texts using an ANN-based language model (embeddings).
  4. Implement the Levenshtein distance function (see the sketch after this list).
  5. Discover the 5 major topics in a text dataset.
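
For item 4, a minimal dynamic-programming sketch of the Levenshtein distance in Python (the function name and test strings are illustrative, not part of the course materials):

    def levenshtein(a: str, b: str) -> int:
        """Minimum number of single-character insertions, deletions
        and substitutions needed to turn a into b."""
        prev = list(range(len(b) + 1))      # distances for the empty prefix of a
        for i, ca in enumerate(a, start=1):
            curr = [i]                      # distance from a[:i] to the empty string
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution (or match)
            prev = curr
        return prev[-1]

    print(levenshtein("kitten", "sitting"))  # 3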

Typical questions for seminar classes (labs) within this section

  1. Build a term-document matrix for a set of web pages using TF-IDF (a sketch follows this list).
  2. Implement PCA-based latent semantic analysis for a set of texts with labels. Visualize the classes (labels) in 2D.
  3. Train a Doc2vec model and implement search with ranking using cosine similarity.
  4. Implement a spellchecking algorithm.
  5. Discover the optimal number of topics in a dataset using topic modelling.
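
For item 1, a minimal sketch of a TF-IDF term-document matrix built with scikit-learn; the toy documents stand in for the texts of crawled web pages:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "information retrieval for data science",
        "search engines rank documents by relevance",
        "recommender systems rank items for users",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)          # sparse matrix: documents x terms

    print(X.shape)
    print(vectorizer.get_feature_names_out())   # the vocabulary (matrix columns)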

Test questions for final assessment in this section

  1. What are the approaches to vector space modelling for text datasets?
  2. What are the most common techniques for processing text before indexing?
  3. Explain the problem statement of topic modelling (a minimal sketch follows this list).
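
One way to state question 3: given a corpus, find K latent topics (distributions over words) and, for each document, a mixture of those topics. A minimal sketch with scikit-learn's LatentDirichletAllocation, where the corpus and the number of topics are illustrative:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = [
        "neural networks learn vector representations of words",
        "word embeddings capture distributional semantics",
        "an inverted index speeds up full-text search",
        "search engines use an index to retrieve documents",
    ]

    counts = CountVectorizer().fit_transform(docs)   # bag-of-words counts
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(counts)           # documents x topics
    print(doc_topics.round(2))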

Section 2

Section title:

Text indexing and search

Topics covered in this section:

  • Building inverted index.
  • Language, tokenization, stemming, searching, scoring.
  • Indexing for vector model.
  • Kd-trees, Quad-trees, Annoy, FAISS, HNSW index.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Build an inverted index for a text (a minimal sketch follows this list).
  2. Tokenize a text.
  3. Write a kd-tree.
  4. Implement a small-world graph.
  5. Build an HNSW index.
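
For item 1, a minimal in-memory inverted index in pure Python; the tokenizer is deliberately naive and the documents are illustrative:

    import re
    from collections import defaultdict

    docs = {
        0: "Information retrieval for data science",
        1: "Data indexing and full-text search",
    }

    index = defaultdict(set)            # term -> set of document ids (postings)
    for doc_id, text in docs.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(doc_id)

    def search(term: str) -> set:
        return index.get(term.lower(), set())

    print(search("data"))  # {0, 1}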

Typical questions for seminar classes (labs) within this section

  1. Build an inverted index for a set of web pages.
  2. Build a distribution of stems/lexemes for a text.
  3. Choose and implement a persistent index for a given text collection.
  4. Compare two logarithmic index structures for a given task (a kd-tree sketch follows this list).
  5. Choose and implement an updateable index for a vector space of a given dimension.
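
For item 4, one of the two logarithmic structures could be a kd-tree; a minimal sketch with scipy.spatial.KDTree, where the dimensionality and data are illustrative:

    import numpy as np
    from scipy.spatial import KDTree

    rng = np.random.default_rng(0)
    points = rng.random((1000, 8))           # 1000 vectors in 8 dimensions

    tree = KDTree(points)                    # build once, O(n log n)
    dist, idx = tree.query(points[0], k=5)   # 5 nearest neighbours of one vector
    print(idx, dist)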

Test questions for final assessment in this section

  1. Explain how (and why) kd-trees work.
  2. What are the weak points of an inverted index?
  3. State the problem of approximate nearest neighbour search and give an overview of existing solutions.
  4. Discuss the architecture of the HNSW index.

Section 3

Section title:

Web basics and algorithms for web

Topics covered in this section:

  • Web basics. Internet crawling. XML and HTML processing. Processing of dynamic documents.
  • Newsfeeds.
  • Web-search-specific topics. PageRank. Duplicates. CTR.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Extract text from a web page.
  2. Extract text from a dynamic web page.
  3. Crawl a website while respecting robots.txt.
  4. Explain PageRank (a power-iteration sketch follows this list).
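
For item 4, a minimal power-iteration sketch of PageRank on a toy link graph; the damping factor 0.85 is the conventional choice and the graph is illustrative:

    import numpy as np

    links = {0: [1, 2], 1: [2], 2: [0]}      # page -> pages it links to
    n, d = len(links), 0.85

    rank = np.full(n, 1.0 / n)
    for _ in range(50):                      # power iteration to (near) convergence
        new = np.full(n, (1 - d) / n)        # teleportation term
        for page, outs in links.items():
            for target in outs:
                new[target] += d * rank[page] / len(outs)
        rank = new

    print(rank)                              # importance scores, summing to 1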

Typical questions for seminar classes (labs) within this section

  1. Write a crawler for a dynamically generated website (a crawling sketch follows this list).
  2. Build a graph of a website.
  3. Run PageRank on the graph of a single website.
  4. Compute CTR from a given log and train a relevance model on this data.
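
For item 1, a dynamically generated site would additionally need a browser-automation tool such as Selenium to render JavaScript; the static skeleton below uses requests, BeautifulSoup and the standard urllib.robotparser (the start URL and user-agent string are illustrative):

    import urllib.robotparser
    from urllib.parse import urljoin

    import requests
    from bs4 import BeautifulSoup

    START, AGENT = "https://example.com/", "course-crawler"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(urljoin(START, "/robots.txt"))
    rp.read()

    seen, frontier = set(), [START]
    while frontier:
        url = frontier.pop()
        if url in seen or not rp.can_fetch(AGENT, url):
            continue                          # skip visited or disallowed pages
        seen.add(url)
        html = requests.get(url, headers={"User-Agent": AGENT}).text
        soup = BeautifulSoup(html, "html.parser")
        print(url, soup.title.string if soup.title else "")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith(START):        # stay within the site
                frontier.append(link)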

Test questions for final assessment in this section

  1. Explain what the DOM (document object model) is and how it is used in document parsing.
  2. What is the difference between crawling static and dynamic pages?
  3. What are the weaknesses of PageRank?

Section 4

Section title:

Media types processing: images, video, sound

Topics covered in this section:

  • Image and video processing.
  • Image understanding
  • Image enhancing
  • Video understanding
  • Audio processing
  • Speech-to-text

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Extract the 4 dominant colors from an image (a k-means sketch follows this list).
  2. Split a video into shots.
  3. Identify a chord in an audio recording.
  4. Text-to-speech.
  5. Label objects in images.
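
For item 1, a minimal sketch that clusters pixels with k-means and reports the cluster centres as the dominant colors; Pillow and scikit-learn are assumed, and the file name is illustrative:

    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    pixels = np.asarray(Image.open("photo.jpg").convert("RGB")).reshape(-1, 3)

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(pixels)
    print(km.cluster_centers_.astype(int))   # 4 dominant colors as RGB triples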

Typical questions for seminar classes (labs) within this section

  1. Build a "search by color" feature.
  2. Extract scenes from video.
  3. Build a voice fingerprint and identify a person by voice.
  4. Write a voice-controlled search.
  5. Semantic search withing unlabelled image dataset.

Test questions for final assessment in this section

  1. What are the approaches to image understanding?
  2. How can a video be clustered into scenes and shots?
  3. How does speech-to-text technology work?
  4. How are audio fingerprints built?

Section 5

Section title:

Quality assessment

Topics covered in this section:

  • Search quality assessment.
  • Recommender quality assessment.
  • A/B testing.
  • Side-by-side (SBS) comparison.
  • pFound, DCG, nDCG.

What forms of evaluation were used to test students’ performance in this section?

Form of evaluation                                         Yes/No
Development of individual parts of software product code   Yes
Homework and group projects                                Yes
Midterm evaluation                                         No
Testing (written or computer based)                        No
Reports                                                    No
Essays                                                     No
Oral polls                                                 No
Discussions                                                No

Typical questions for ongoing performance evaluation within this section

  1. Propose a strategy for A/B testing.
  2. Propose a quality metric for a recommender system.
  3. Implement the DCG metric (a sketch follows this list).
  4. Discuss a relevance metric.
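
For item 3, a minimal sketch of DCG and nDCG under one standard formulation (rel_1 + Σ rel_i / log2(i) for i ≥ 2); the example relevance grades are illustrative:

    import math

    def dcg(rels):
        # position 1 is not discounted; a log2(i) discount applies afterwards
        return sum(rel / math.log2(max(i, 2))
                   for i, rel in enumerate(rels, start=1))

    def ndcg(rels):
        ideal = dcg(sorted(rels, reverse=True))
        return dcg(rels) / ideal if ideal else 0.0

    print(ndcg([3, 2, 3, 0, 1, 2]))   # 1.0 would mean an ideally ordered list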

Typical questions for seminar classes (labs) within this section

  1. Provide a framework for accepting/rejecting A/B testing results.
  2. Compute DCG for an example query on an arbitrary search engine.
  3. Implement a metric for a recommender system.
  4. Implement pFound (a sketch follows this list).
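
For item 4, a minimal sketch of pFound under one common formulation of the cascade model: the user scans results top-down, is satisfied at position i with probability pRel_i, and otherwise abandons the session with a fixed pBreak (0.15 here); the relevance probabilities are illustrative:

    def pfound(p_rel, p_break=0.15):
        p_look, total = 1.0, 0.0               # p_look: user reaches this position
        for p in p_rel:
            total += p_look * p                # user finds the answer here
            p_look *= (1 - p) * (1 - p_break)  # user continues to the next result
        return total

    print(pfound([0.4, 0.3, 0.2]))   # probability the user finds an answer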

Test questions for final assessment in this section

  1. What is SBS (side-by-side) and how is it used in search engines?
  2. Compare pFound with CTR and with DCG.
  3. Explain how A/B testing works.