Multi-Agent Learning Lab, Part 2
Purpose:
The purposes of this lab are:
- To introduce you to some of the issues involved in multi-agent learning.
- To give me more information on which to base a grade.
- To see how concurrent versus staggered learning affects convergence rates and
  experiment outcomes.
- To let you play around with Q-learning in a multi-agent (and therefore
  non-Markovian) domain.
Description:
Consider the world shown in the figure below. In the figure, black
squares represent the outer boundaries of the world and dark blue squares
represent a wall (with two openings). The red square with the A
indicates the starting position of the red agent, and the red square with the
letter G represents the goal for the red agent. A similar encoding
applies for the green agent. Use Q-learning to find a path from each agent's
starting position to its goal. The following information
will be helpful:
- There are four actions that you can choose from: {N,S,E,W}.
- From each square, there is an 85% chance that you will step in the direction
  you choose, and a 5% chance (15%/3) that you will go in each of the other
  three directions.
- When you hit a wall, you get a penalty of 0.5 units.
- When you reach the goal, you get a reward of 1.0 units.
- When two agents hit each other (i.e., occupy the same square), they get a
  penalty of 0.25 units.
- When a collision occurs, the agents pass through each other and can continue
  on their way; they do not "bounce" off each other.
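
The movement rule above (85% intended direction, 5% for each of the other three) could be sampled roughly as in the sketch below. The enum `Action` and the functions `executedAction` and `sampleAction` are hypothetical names chosen for illustration; they are not taken from the provided lab code.

```cpp
#include <cassert>
#include <random>

// Hypothetical action encoding; the lab only names the set {N,S,E,W}.
enum Action { N = 0, S = 1, E = 2, W = 3 };

// Returns the action actually executed, given a uniform draw u in [0,1):
// the intended action with probability 0.85, otherwise one of the three
// other actions, each with probability 0.05.
Action executedAction(Action intended, double u) {
    assert(u >= 0.0 && u < 1.0);
    if (u < 0.85) return intended;
    // Split the remaining 0.15 of probability mass into slots 0, 1, 2.
    int slot = static_cast<int>((u - 0.85) / 0.05);
    if (slot > 2) slot = 2;  // guard against floating-point rounding
    int k = 0;
    for (int a = 0; a < 4; ++a) {
        if (a == intended) continue;  // skip the intended direction
        if (k == slot) return static_cast<Action>(a);
        ++k;
    }
    return intended;  // not reached; keeps the compiler satisfied
}

// Convenience wrapper that draws u from a caller-supplied generator.
Action sampleAction(Action intended, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    return executedAction(intended, uni(rng));
}
```

Keeping the random draw separate from the mapping makes the rule easy to check by hand with fixed values of `u`.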

Experiments:
You will conduct several learning experiments using this
world. We will separate these experiments into two broad categories, and
you will spend most of your time on the first category.
- Cooperative Agents:
  - Centralized Learning: When agents are disposed toward cooperation and when
    perfect cooperation can be enforced, a centralized learning scheme can be
    used. I want you to implement a centralized Q-learner. This scheme is
    equivalent to having only one learning entity, which has 16 different
    possible actions. This means that there is a single Q-function,
    Q(s,{a_1,a_2}), and the actions of both players are dictated by the arg max
    of this function; i.e., player 1 does the a_1 portion of
    arg max_{a_1,a_2} Q(s,{a_1,a_2}), and player 2 does the a_2 portion. Here
    are a couple of hints for you.
    - The state space is the tuple (x1,y1,x2,y2).
    - The action space is the pair (a1,a2).
    - Since the world is 8 by 8 and there are four actions, the size of your Q
      function is 8*8*8*8*4*4 = 65,536 entries. This is a big array, so
      consider how to keep memory usage small.
    - You will need to decide how to handle rewards. One way is to not give a
      reward until both agents are in their goal states, but to sum the
      penalties whenever penalties occur.
    - You will probably want to make the goal states absorbing states. This
      means that once an agent reaches the goal state, it cannot leave.
    - Make sure you document the decisions you make about how to handle
      multiple agents.
  - Decentralized Learning: When communication is limited or when no
    centralized control exists, a decentralized learning scheme can be used. I
    want you to implement a decentralized Q-learner, one for each agent. For
    each agent, the action space is its own single action (e.g., {a1} for
    agent 1). You will have four experiments to report for this section (two
    learning variants, each with two state representations). I want you to try
    two learning variants:
    - Concurrent learning: agents move at the same time, and agents update
      their Q-values at the same time.
    - Staggered learning: agents move at the same time, but agent 1 updates its
      Q-values for five trials while agent 2 holds its values constant; then
      agent 1 holds its Q-values constant for five trials while agent 2 updates
      its Q-values; etc. A "trial" is defined as one trip from the starting
      point to the goal.
    Try the following states for each agent:
    - (x1,y1): consider only my state (current position), and not the state of
      the other agent.
    - (x1,y1,x2,y2): consider both my state (current position) and the other
      agent's state.
- Something Else: In addition to the experiments outlined above, I want you to
  think of something interesting that you want to know about this world or
  its agents. Run some experiments to try to learn something, and
  report the problem and results to me. Consider such things as
  changing goals, starting points, reward structures, state spaces, etc.
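
For the centralized learner described above, the (x1,y1,x2,y2) state and (a1,a2) action pair could be stored in one flat array, as in this minimal sketch. The names `qIndex`, `greedyJointAction`, `JointAction`, and `jointReward` are hypothetical and not part of the provided code. The reward function shows only one possible reading of the reward hint: counting the collision penalty once per agent and paying the goal reward once are assumptions you would need to decide on and document yourself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Dimensions taken from the lab text: 8x8 world, four actions per agent.
const int SIZE = 8;
const int ACTIONS = 4;

// Flatten (x1,y1,x2,y2,a1,a2) into a single index in [0, 65536).
std::size_t qIndex(int x1, int y1, int x2, int y2, int a1, int a2) {
    assert(x1 >= 0 && x1 < SIZE && y1 >= 0 && y1 < SIZE);
    assert(x2 >= 0 && x2 < SIZE && y2 >= 0 && y2 < SIZE);
    assert(a1 >= 0 && a1 < ACTIONS && a2 >= 0 && a2 < ACTIONS);
    std::size_t state =
        ((static_cast<std::size_t>(x1) * SIZE + y1) * SIZE + x2) * SIZE + y2;
    return (state * ACTIONS + a1) * ACTIONS + a2;
}

struct JointAction { int a1; int a2; };

// Greedy joint action: scan all 16 (a1,a2) pairs for the given state and
// return the pair with the largest Q-value (first one wins on ties).
JointAction greedyJointAction(const std::vector<float>& q,
                              int x1, int y1, int x2, int y2) {
    JointAction best = {0, 0};
    float bestValue = q[qIndex(x1, y1, x2, y2, 0, 0)];
    for (int a1 = 0; a1 < ACTIONS; ++a1) {
        for (int a2 = 0; a2 < ACTIONS; ++a2) {
            float v = q[qIndex(x1, y1, x2, y2, a1, a2)];
            if (v > bestValue) {
                bestValue = v;
                best.a1 = a1;
                best.a2 = a2;
            }
        }
    }
    return best;
}

// One possible joint reward (an assumption, not the required scheme):
// penalties are summed on every step, and the goal reward of 1.0 is given
// only when both agents are in their goal states.
double jointReward(bool hitWall1, bool hitWall2, bool collided,
                   bool bothAtGoal) {
    double r = 0.0;
    if (hitWall1) r -= 0.5;
    if (hitWall2) r -= 0.5;
    if (collided) r -= 2 * 0.25;  // assumed: 0.25 counted for each agent
    if (bothAtGoal) r += 1.0;
    return r;
}
```

A table declared as `std::vector<float> q(65536, 0.0f)` takes 256 KiB on platforms where `float` is 4 bytes, half of what `double` would need.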
How to get there:
- Single-agent Q-learning code is available from the last lab. You will need
  to add in a second agent.
- Types of worlds: There are three worlds built into the program; you will
  find them in the constructor of the world class (world.cpp). One is labeled
  single-agent world and the others are labeled multi-agent worlds. You used
  the single-agent world while experimenting with the Q-learning variables in
  the last lab.
- You should probably use the 8x8 multi-agent worlds to decrease the time it
  takes to run experiments. You will need to run many more iterations than
  were needed in the single-agent experiments. This is due to the increase in
  the size of the Q-table.
- Recommended Roadmap
  a. Experiment with the single-agent Q-learning code that we've provided.
  b. Develop concurrent learning with state (x1,y1). This is basically running
     two individual Q-learning agents that have no knowledge about the other
     agent.
  c. Develop concurrent learning with state (x1,y1,x2,y2). This is basically
     running two individual Q-learning agents where each agent keeps track of
     its own position as well as the position of the other agent.
  d. Develop centralized learning. Use a single agent with a state space that
     represents the position of both agents and the movement possibilities of
     both agents. Basically combine all the data into one big Q-learner.
  e. Repeat steps b and c with staggered learning.
- Important notes
  - You will need to spend time with the code and make changes to it to get it
    to work with two agents. As distributed, it performs Q-learning with one
    agent and displays the start and end positions for that agent. It has some
    information about the second agent, but it will not move the second agent.
  - To have the code show the second agent, you will need to fill in the
    necessary methods and uncomment certain lines of code in RenderScene.
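
For the staggered runs in the roadmap above, the five-trial alternation could be driven by a small helper like the one sketched here. `learningAgent`, `qUpdate`, and the parameter names `alpha` (learning rate) and `gamma` (discount factor) are hypothetical names, not taken from the provided code; the update shown is the standard one-step Q-learning rule with an added flag that leaves the value unchanged while an agent is holding its Q-values constant.

```cpp
#include <cassert>

// Number of trials each agent learns before handing over (five in the lab).
const int BLOCK = 5;

// Returns which agent (1 or 2) updates its Q-values during `trial`
// (0-based): trials 0-4 -> agent 1, 5-9 -> agent 2, 10-14 -> agent 1, ...
int learningAgent(int trial) {
    assert(trial >= 0);
    return ((trial / BLOCK) % 2 == 0) ? 1 : 2;
}

// One-step Q-learning update for a single table entry. When `frozen` is
// true the old value is returned unchanged, so the agent still acts on its
// table but does not learn during that trial.
double qUpdate(double oldQ, double reward, double maxNextQ,
               double alpha, double gamma, bool frozen) {
    if (frozen) return oldQ;
    return oldQ + alpha * (reward + gamma * maxNextQ - oldQ);
}
```

Inside the trial loop, agent 1 would call `qUpdate` with `frozen = (learningAgent(trial) != 1)` and agent 2 with `frozen = (learningAgent(trial) != 2)`; passing `false` to both gives the concurrent case.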
What you'll turn in:
This lab is like the previous labs. Do the experiments
outlined above, report the results, and discuss your results. I'm much
more concerned about your analysis than anything else, but you should also pay
attention to your writing.
Here are some hints.