Multi-Agent Learning Lab

PART 1

Purpose:

The purposes of this lab are:

Description:


Consider the world shown in the figure below.  In the figure, black squares represent walls, the blue square is the agent, and the blue X is the goal for the blue agent.  Use Q-learning to find a path from the starting positions to the goal for this agent.

There are four actions available to the agent regardless of the agent’s current state.  These actions are from the set {N, S, E, W} which represent the cardinal directions.  The action actually taken isn’t always the one that was chosen by the agent.  This is due to the randomness of the world.  This randomness is one parameter you will be changing in the course of this lab.

The following is a table of some of the modifiable parameters and what variable they correspond to in the code supplied:

Parameter

Variable Name

Range, Default

Description

Random Coefficient

RCOEFF

[0,100], 85

The randomness in the world.  The percent of the time you take the action you choose.  The rest is evenly distributed among the remaining actions.

0 – Action actually taken is never the one chosen.

100 – The world has no randomness.

α

ALPHA

[0,1], 1

A measure of how much the agent modifies its internal policy when it gets new information.  Used in the Q-learning update equation.

This value is decayed overtime towards 0.

0 – The agent will never alter its policy.

1 – The agent only remembers what it got last time, but only for the first update.  Afterwards the value will start decreasing according to the alpha decay function chosen.

γ

GAMMA

(0,1), .99

Expected discounted reward.  Used in the Q-learning update equation.

0 – No expected reward.

1 – No discount for the expected reward.  (Q-values would keep climbing and never converge)

Goal Reward

GOAL_R

Any, 1

The reward received by the agent when they reach the goal.

Wall Penalty

WALL_P

Any, .5

The penalty received by the agent when they run in to a wall.

Strategy

STRATEGY

RND

SMAX

EXPLOIT

BOLTZ

The strategy the agent uses while learning.

RND – Purely random exploration

SMAX – Uses a threshold to determine when to choose the ‘best’ action and when to explore.

EXPLOIT – Pure exploitation

BOLTZ – From an equation in physics involving energy.  The idea is to go from near random exploration to near pure exploitation

Boltzmann Parameter

BOLTZ_PARAM

(0,∞), 1

Parameter used in the BOLTZ strategy.

1 – Nearly pure exploration

∞ – Nearly pure exploitation

Softmax Threshold

SOFTMAX_THRESH

[0,100], 80

How often the ‘best’ action is chosen.

0 – Never use the best action

25 – Random

100 – Pure exploitation

World

WORLD

SINGLE16

MULTI16

MULTI8

or read from a file

The world to use to run the experiments.

SINGLE16 – The world to use for this lab.  It’s dimensions are 16x16

MULTI8 and MULTI16 – To be used for the next lab.

Text file in the format of Makeworlds.  Makeworlds generates random worlds of various wall concentrations, and is supplied with the code. 

 

Determining the quality of a policy learned will be important in this lab.  Average path length and collisions will be helpful, but not necessarily sufficient.

The code given performs two different kinds of iterations. The one used most frequently is the "learning iteration." This is not shown, but the agent makes a trek to the goal and updates Q-values along the way. The learned best response for each state is shown as a blue arrow. These are on by default. Toggle these by pressing '1.' The agent will start at (0,0) unless RANDOM_START is set to true.

The other iteration is the "testing iteration" which actually shows the agent moving around in the world, and a history of where the agent has been. The agent will use the same strategy as the learning iteration (Except for RND, which uses exploit) unless told to use EXPLOIT by setting EXPLOIT_TEST to true. During these iterations, data is collected about the path length, collisions, and the reward ultimately received. These values will be averaged over the number of samples requested. The default number of samples is 1. The agent will always start at (0,0) during testing iterations.

The blue arrows displaying the policy have already been discussed, but there are a couple of other keys you may find useful. Pressing '2' will show red arrows corresponding to the second agents policy. These will have all directions showing because the second agent isn't learning yet. Pressing '3' will make the arrows bigger. Up will increase the delay during a testing iteration, and down will decrease it. Left/right will set the delay to 0. 'q' or 'x' will exit the program.

After all the samples have been run, the policy learned for player 1 during the last sample will be drawn in ascii art to the console window.

Here is a link to a compiled executable.

 

Experiments:

You will conduct several learning experiments using this world.

 

  1. Single agent Q-learning code to play with is available as a zip file.  Experiment with Q-learning by adjusting the parameters near the beginning of Main.cpp.  Start by using the values set in the zip file.  These values are also in the file initial.txt.  To use initial.txt as file input for parameter values, the command line would be “<exename> -file initial.txt”.  The policy should mostly converge fairly quickly.
  2. From this converging parameter configuration (for each of the subparts, make sure you start from the initial configuration):
    1. What effect does modifying alpha have?

                                                               i.      For an alpha value of 0.1 and a total number of iterations of 10000, does it look like the policy has more or less converged? Why or why not?

                                                             ii.      Run the above simulation a couple of times.  Do the policies learned act the same when the agent is in the upper right quarter of the world? Why or why not? 

                                                            iii.      How does the policy differ on the far right column when using alpha values of 0.1 and 1.0?  Which is the better solution?  Why?

                                                           iv.      What is the tradeoff associated with alpha decay? Try some different decay settings and discuss what happens.

    1. What effect does gamma have?

                                                               i.      Try setting gamma to .6.  What happened and why?

                                                             ii.      How about a value of 0?  For what states do the policies change? Try some different values for gamma and discuss what happens.

    1. What effect does the randomness in the world have?  (Remember that the random coefficient is the percent of the time the action chosen is actually taken)

                                                               i.      Try setting the random coefficient to 0.  What does the solution look like?  Why does this happen?

                                                             ii.      Try setting the random coefficient to 25.  How is this different than a value of 0?

                                                            iii.      What happens to the policy learned along the edges with a value of 50? Try setting the random coefficient to 99 and discuss what happens.

    1. What effect do the rewards and penalties have?

                                                               i.      What if the reward for reaching the goal was 2?  How about 200?

                                                             ii.      What effect do the same values have when used for the penalty for running into walls?

                                                            iii.      Try a reward of 200 and a penalty of 100.  Is the policy learned different than the one learned from the initial configuration of 1 and .5?  Why or why not?

                                                           iv.      From what you learned in part iii, what is the most important value when choosing rewards and penalties?

    1. What effect does the exploration policy have on convergence?

                                                               i.      Exploitation

1.      Try setting the random coefficient to 95.  What does the learned policy look like? Why?

2.      Try setting the variable RANDOM_START to true.  This will cause the agent to start at a random location at the beginning of each iteration.  Does this help?

                                                             ii.      Softmax

1.      What does the policy look like when everything else is set to the initial values?  Why does this happen?

2.      Does starting in random locations help?

3.      Try a threshold value of 50.  Is this policy better?

                                                            iii.      Boltzman

1.      What does the learned policy look like?

2.      Try setting the Boltzmann parameter to 1,000,000. What does it look like now?  Is the policy learned consistent?

  1. Something cool
    1. You will need to perform experiments beyond the ones outlined here.  Use your imagination.  Possible experiments could include:

                                                               i.      Playing with the Boltzmann update equation.

                                                             ii.      Comparing multiple parameter changes. 

                                                            iii.      Using different worlds.  Handmade or from Makeworlds.

                                                           iv.      Use a more objective convergence criterion.

1.      Policy convergence – The policy does not change

2.      Q-value convergence – The Q-values do not stray very far from where they were.

What you'll turn in:

This lab is like the previous labs.  Do the experiments outlined above, report the results, and discuss your results.  I'm much more concerned about your analysis than anything else, but you should also pay attention to your writing.

 

For the second half of the lab you will need to modify the code given to handle a second agent.  It may be useful to spend a little time familiarizing yourself with the code.
 
 

In case compiling proves difficult, here's an executable.

A couple of people asked for a linux port of the code. Well, here it is. Its a little stripped down, but should work.

Note: For those using Visual Studio 2005, open the .sln file, for any other version open .dsw and have it converted.