= Reinforcement Learning =

* Course name: Reinforcement Learning
* Course number: R-01

== Course Characteristics ==

=== Key concepts of the class ===

* Fundamentals of Reinforcement Learning
* Sample-based Learning Methods
* Prediction and Control with Function Approximation

=== What is the purpose of this course? ===

Harnessing the full potential of artificial intelligence requires adaptive learning systems. Reinforcement learning (RL) is one powerful paradigm for doing so, and it is relevant to an enormous range of tasks, including robotics, game playing, consumer modeling and healthcare.

=== Course objectives based on Bloom’s taxonomy ===

==== What should a student remember at the end of the course? ====

By the end of the course, the students should remember:

* Markov Decision Processes
* Exploration vs. Exploitation
* Value Functions
* Temporal-difference Learning
* Q-learning
* Expected Sarsa
* Actor-Critic

==== What should a student be able to understand at the end of the course? ====

By the end of the course, the students should understand:

* How to build an RL system for sequential decision making
* How to formalize a task as an RL problem
* The space of RL algorithms

==== What should a student be able to apply at the end of the course? ====

By the end of the course, the students should be able to apply:

* RL for solving real-world problems
* TD-algorithms for estimating value functions
* Expected Sarsa and Q-Learning
* The Actor-Critic method

=== Course evaluation ===

{| class="wikitable"
|+ Course grade breakdown
|-
! Type !! Points
|-
| Labs/seminar classes || 20
|-
| Interim performance assessment || 50
|-
| Exams || 30
|}

=== Grades range ===

{| class="wikitable"
|+ Course grading range
|-
! Grade !! Points
|-
| A. Excellent || [85, 100]
|-
| B. Good || [70, 84]
|-
| C. Satisfactory || [55, 69]
|-
| D. Poor || [0, 54]
|}

=== Resources and reference material ===

* Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition.
* Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo, Eds.

== Course Sections ==

The main sections of the course and the approximate hour distribution between them are as follows:

{| class="wikitable"
|+ Course Sections
|-
! Section !! Section Title !! Teaching Hours
|-
| 1 || Fundamentals of RL || …
|-
| 2 || Sample-Based Learning ||
|-
| 3 || Prediction and Control with Function Approximation ||
|}

=== Section 1 ===

==== Section title ====

Fundamentals of Reinforcement Learning

==== Topics covered in this section ====

* Sequential Decision Making
* Markov Decision Processes
* Value Functions & Bellman Equations
* Dynamic Programming for Value Functions

==== What forms of evaluation were used to test students’ performance in this section? ====

{| class="wikitable"
|-
! Form !! Yes/No
|-
| Development of individual parts of software product code || Yes
|-
| Homework and group projects || Yes
|-
| Midterm evaluation || No
|-
| Testing (written or computer based) || Yes
|-
| Reports || No
|-
| Essays || No
|-
| Oral polls || No
|-
| Discussions || Yes
|}

==== Typical questions for ongoing performance evaluation within this section ====

# What is sequential decision making?
# What is the exploration vs. exploitation trade-off in sequential decision making?
# What are Markov Decision Processes?
# What is the difference between episodic and continuing tasks?
# What are policies, value functions and Bellman equations? (A reference form of the Bellman equation is given below.)
# How to use dynamic programming to compute value functions and optimal policies?
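
For reference, the Bellman equation mentioned above can be written, in the notation of Sutton and Barto, as the following recursive relation for the state-value function of a policy <math>\pi</math>:

:<math>v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma \, v_\pi(s') \right]</math>

where <math>p(s', r \mid s, a)</math> is the environment's transition dynamics and <math>\gamma \in [0, 1]</math> is the discount factor. Dynamic programming methods such as iterative policy evaluation turn this equation into an update rule.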

==== Typical questions for seminar classes (labs) within this section ====

# What are the strengths and weaknesses of different exploration algorithms?
# What is an epsilon-greedy agent?
# How to translate a real-world problem into a Markov Decision Process?
# Why do we need Bellman equations?
# What is generalized policy iteration?

==== Tasks for midterm assessment within this section ====

# Suppose you are given two action-value functions, each corresponding to the same arbitrary, fixed policy but evaluated under a different reward function. Using the Bellman equation, explain whether or not it is possible to combine these value functions in a simple manner to obtain a new action-value function corresponding to a single reward function r.

==== Test questions for final assessment in this section ====

# How to implement incremental algorithms for estimating action-values?
# How to implement and test an epsilon-greedy agent? (A sketch is given below.)
# Create an example of your own that fits into the Markov Decision Process framework.
# How to use optimal value functions to get optimal policies?
# How to implement an efficient dynamic programming agent?
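
As a point of reference for the two implementation questions above, here is a minimal sketch (not the official lab solution) of an epsilon-greedy agent that estimates action-values incrementally with a sample-average update; the class and method names are illustrative only.

<syntaxhighlight lang="python">
import numpy as np


class EpsilonGreedyAgent:
    """Bandit-style agent: incremental action-value estimates plus epsilon-greedy selection."""

    def __init__(self, num_actions, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.q = np.zeros(num_actions)   # current action-value estimates Q(a)
        self.n = np.zeros(num_actions)   # number of times each action was taken
        self.rng = np.random.default_rng(seed)

    def select_action(self):
        # With probability epsilon explore uniformly at random, otherwise exploit the greedy action.
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(len(self.q)))
        return int(np.argmax(self.q))

    def update(self, action, reward):
        # Incremental sample-average update: Q(a) <- Q(a) + (1/N(a)) * (R - Q(a)).
        self.n[action] += 1
        self.q[action] += (reward - self.q[action]) / self.n[action]
</syntaxhighlight>

A quick way to test such an agent is to run it on a simulated multi-armed bandit with known true action-values and check that, over many steps, its estimates approach those values and the greedy action matches the best arm.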

=== Section 2 ===

==== Section title ====

Sample-Based Learning

==== Topics covered in this section ====

* Monte Carlo Methods for Prediction and Control
* Temporal Difference Learning
* Planning, Learning and Acting
* Expected Sarsa
* Q-Learning
* On-policy and Off-policy Control

==== What forms of evaluation were used to test students’ performance in this section? ====

{| class="wikitable"
|-
! Form !! Yes/No
|-
| Development of individual parts of software product code || Yes
|-
| Homework and group projects || Yes
|-
| Midterm evaluation || Yes
|-
| Testing (written or computer based) || Yes
|-
| Reports || No
|-
| Essays || No
|-
| Oral polls || No
|-
| Discussions || Yes
|}

==== Typical questions for ongoing performance evaluation within this section ====

# How to estimate value functions and optimal policies using only sampled experience from the environment?
# What are Monte Carlo methods?
# What is off-policy learning?
# What is Temporal Difference Learning?
# What is Q-Learning? (Reference update rules are given below.)
# What is Expected Sarsa?
# What is model-based RL?
# What is random-sample one-step tabular Q-planning?
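
For quick reference, the one-step tabular update rules behind several of the questions above, in their standard forms following Sutton and Barto, are:

:<math>V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]</math> (TD(0) prediction)

:<math>Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]</math> (Q-Learning, off-policy control)

:<math>Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \sum_{a} \pi(a \mid S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t) \right]</math> (Expected Sarsa)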

==== Typical questions for seminar classes (labs) within this section ====

# How to use Monte Carlo methods for prediction and for estimating action values?
# How to use Monte Carlo methods for generalized policy iteration?
# What is Batch RL and how does it work?
# How to implement Expected Sarsa and Q-Learning?
# What are the Dyna architecture and the Dyna algorithm? (A sketch of the planning loop is given below.)
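
As a minimal sketch of the planning loop in Dyna-Q (the canonical instance of the Dyna architecture) asked about above; this is illustrative only, and the function and variable names are assumptions rather than the course's reference implementation:

<syntaxhighlight lang="python">
import random
from collections import defaultdict

# q is a nested dict q[state][action] -> value; model maps (state, action) -> (reward, next_state).
q = defaultdict(lambda: defaultdict(float))
model = {}


def dyna_q_step(q, model, state, action, reward, next_state,
                alpha=0.1, gamma=0.95, planning_steps=10):
    """One Dyna-Q step: direct RL update, model learning, then n planning updates."""
    # (a) Direct RL: one-step tabular Q-Learning update from the real transition.
    best_next = max(q[next_state].values()) if q[next_state] else 0.0
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])

    # (b) Model learning: remember the observed transition (deterministic model).
    model[(state, action)] = (reward, next_state)

    # (c) Planning: replay randomly chosen previously seen (state, action) pairs from the model.
    for _ in range(planning_steps):
        s, a = random.choice(list(model.keys()))
        r, s_next = model[(s, a)]
        best = max(q[s_next].values()) if q[s_next] else 0.0
        q[s][a] += alpha * (r + gamma * best - q[s][a])
</syntaxhighlight>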

==== Tasks for midterm assessment within this section ====

# Given the Q-Learning algorithm:
## Draw the one-step backup diagram of the algorithm and write out its update rule.
## Is this algorithm on-policy or off-policy? Justify your answer.
## Write the two-step version of the algorithm.

==== Test questions for final assessment in this section ====

# Why does off-policy learning matter?
# How to learn from an agent’s interaction with the world?
# What is the difference between on-policy and off-policy control methods?
# How is Q-Learning off-policy?

=== Section 3 ===

==== Section title ====

Prediction and Control with Function Approximation

==== Topics covered in this section ====

* On-policy Prediction with Approximation
* Neural Networks and TD Learning
* Policy-Gradient Methods
* Actor-Critic

==== What forms of evaluation were used to test students’ performance in this section? ====

{| class="wikitable"
|-
! Form !! Yes/No
|-
| Development of individual parts of software product code || Yes
|-
| Homework and group projects || Yes
|-
| Midterm evaluation || Yes
|-
| Testing (written or computer based) || Yes
|-
| Reports || No
|-
| Essays || No
|-
| Oral polls || No
|-
| Discussions || Yes
|}

==== Typical questions for ongoing performance evaluation within this section ====

# How to estimate a value function for a given policy when the number of states is large?
# How to specify a parametric form of the value function?
# How to frame value estimation as a supervised learning problem?
# How can we learn features for RL?
# How can we learn policies directly?
# What are the advantages of policy parameterization?
# What is the Actor-Critic method?

==== Typical questions for seminar classes (labs) within this section ====

# How to define the Value Error objective?
# How to implement gradient Monte Carlo for policy evaluation?
# How to solve an infinite-state prediction task with a neural network and TD? (A linear semi-gradient sketch is given below.)
# How to estimate the policy gradient?
# How to implement Actor-Critic?
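
As a complement to the prediction questions above, here is a minimal sketch of semi-gradient TD(0) with a linear function approximator, used as a simpler stand-in for the neural-network case discussed in the labs. The feature function and the environment interface (reset/step returning state, reward, done) are assumptions, not part of the course material.

<syntaxhighlight lang="python">
import numpy as np


def semi_gradient_td0(env, policy, features, num_features,
                      episodes=100, alpha=0.01, gamma=0.99):
    """Estimate v_pi with a linear approximator v_hat(s, w) = w . x(s)."""
    w = np.zeros(num_features)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            x = features(state)                                    # feature vector x(s)
            x_next = np.zeros(num_features) if done else features(next_state)
            # Semi-gradient TD(0): the target R + gamma * v_hat(s') is treated as fixed,
            # so only the gradient of v_hat(s, w), which is x(s) here, enters the update.
            td_error = reward + gamma * np.dot(w, x_next) - np.dot(w, x)
            w += alpha * td_error * x
            state = next_state
    return w
</syntaxhighlight>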

==== Tasks for midterm assessment within this section ====

# What are policy learning and actor-critic networks?
# What is TD learning?
# Given two function approximators that discretize the state space in two different ways, we want to use them to estimate the value function of a fixed policy from on-policy data. Suppose you have a small number of samples. Explain the impact that you expect to see on the two algorithms when using TD with different values of lambda.

==== Test questions for final assessment in this section ====

# How to frame value estimation as a supervised learning problem?
# What is semi-gradient TD?
# What is the Policy-Gradient theorem? (Its standard statement is given below.)
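
For reference, the standard episodic statement of the policy gradient theorem (following Sutton and Barto) says that the gradient of the performance objective <math>J(\boldsymbol{\theta})</math> with respect to the policy parameters is proportional to:

:<math>\nabla J(\boldsymbol{\theta}) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a) \, \nabla \pi(a \mid s, \boldsymbol{\theta})</math>

where <math>\mu</math> is the on-policy state distribution under <math>\pi</math>. This result is what allows policy-gradient and Actor-Critic methods to estimate the gradient from sampled experience without differentiating through the state distribution.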
