Adam White :: awhite@cs.ualberta.ca
In September 2008, immediately preceding RL-Glue 3.0 development, the RL-Glue Project was split into two projects: RL-Glue and RL-Glue Extensions.
RL-Glue now includes only the RL-Glue interface and support for directly compiled C/C++ agents, environments and experiment programs.
The RL-Glue Extensions Project contains codecs that provide cross language support for RL-Glue (C/C++, Java, Python, Matlab, Lisp, etc). This multi-language support was previously bundled with RL-Glue. The reason for the split was partially to separate the technical details of using RL-Glue with a particular language from the high level overview of what RL-Glue does.
This document is the high-level overview document: it contains NO implementation-specific technical details for writing programs.
Please refer to the RL-Glue technical manual and the codec-specific manuals for language-specific details on how to implement agents, environments and experiment programs.
This document has been divided to reflect the purposes described above. To learn about the major components of RL-Glue and a description of how those components interact see Section 2. To learn how to make environment and agent programs compatible with RL-Glue we recommend sections 3.1 and 4.1. Sections 3.1 and 4.1 describe only the mandatory functions that RL-Glue environments and agents must implement. Sections 3.2 and 4.2 describe advanced environment and agent functions. To learn about experiment programs and how they interact with RL-Glue see Section 5. For quick function reference see Section 6. Frequently asked questions can be found in Section 8. A summary of and explanations for all changes from RL-Glue 2.X to RL-Glue 3.0 can be found in Section 7.
RL-Glue uses naming conventions and definitions from Sutton and Barto's text: ``Reinforcement Learning: An Introduction". This text is available for free online: http://www.cs.ualberta.ca/~sutton/book/the-book.html.
In machine learning research, it is important to look at other work being done in the field, compare your own performance and then improve. One goal for RL-Glue is to provide a consistent tool for using and comparing agents and environments from diverse sources. A common problem for researchers arises when they try to compare their work with previously published results.
Before RL-Glue, the solution was often to reverse engineer code for the experiment based on the results and (often incomplete) implementation descriptions that had been published. Even when code was released to the public, it was often still a challenge to understand and adapt the original code. Now, you can make the necessary RL-Glue agent/environment/experiment programs available to the public so that another experimenter can reproduce your original experiment and easily experiment with their own code to compare performance. Several recent reinforcement learning competitions, at NIPS and ICML, have used RL-Glue for benchmarking participant submissions, further exemplifying the utility of RL-Glue to the research community.
RL-Glue is both a set of ideas and standards, as well as a software implementation. In theory, RL-Glue is a protocol for the reinforcement learning community to follow. Having this very simple standard of necessary functions facilitates the exchange and comparison of agents and environments without limiting their abilities. As software, RL-Glue is functionally a test harness to ``plug in'' agents, environments and experiment programs without having to continually rewrite the connecting code for these pieces. An experiment program is, very simply, code stating how many times to run an agent in an environment and what data should be extracted from this interaction. Provided the agent, environment, and experiment program follow the RL-Glue protocol, by implementing the few necessary functions, they can easily be plugged in with the RL-Glue code to have an experiment running quite effortlessly. Figure 1 is a diagram which shows how function calls work in RL-Glue.
The Experiment Program contains the ``main function'' which will make all of the requests for information through RL-Glue. These requests are usually related to setting up, starting and running the experiment and then gathering data about the agent's performance. The experiment program can never interact with the agent or environment directly: all contact goes through the RL-Glue interface. There is also no direct contact between the agent and the environment. Any information the agent or environment returns is passed through RL-Glue to the module which needs it.
In RL-Glue, the agent is both the learning algorithm and the decision maker. The agent decides which action to take at every step.
The environment is responsible for storing all the relevant details of the world, or problem of your experiment. The environment generates the observations/states/perceptions that are provided to the agent, and also determines the transition dynamics and rewards.
The experiment is the intermediary which (through RL-Glue) controls all communication between the agent and environment. This structured separation is by design: dividing the agent and environment both helps create modular code and captures our intuitions about how much the agent and environment should ``know'' about each other.
The experiment program will be familiar to anyone who has created reinforcement learning experiments. Akin to the typical main function in many reinforcement learning experiments, an RL-Glue experiment program is a control loop which runs the agent through the environment x number of times, perhaps doing y trials of these x episodes, all the while gathering data about how efficiently the agent has behaved or how quickly it has learned. RL-Glue provides several functions (Section 6) to assist in writing an experiment program.
In RL-Glue, the environment is defined by a set of parameterized functions that the RL-Glue interface queries on behalf of the experiment program. These functions define what the environment does before an experiment begins, at the beginning of an episode, on every remaining step of the episode and after the experiment is completed. The following sections describe the basic requirements of an RL-Glue environment and present a complete list of all environment functions.
Every RL-Glue environment must implement a number of functions. The most important functions are env_start and env_step.
We have found that most action and observation types can easily be captured with this structure. In a grid world, for example, the action can be an int list of length 1 (with valid values 0-3), corresponding to (N,S,E,W) and the observation can also be an int list of length 1 that maps to the agent's current state label (which is also the state of the environment, in this case). In a problem like Mountain Car, the actions are discrete (0-2) and the observation is the car's position and velocity (both real numbers). The action can be an int list of length 1 and the observation can be a double list of length two. Different implementation languages will use different structures to encode observations and actions: please refer to the codec specific manual for your programming language of choice for more details.
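As a rough illustration of this encoding (not a required part of any environment, and assuming the rl_abstract_type_t layout shown later in this document), here is how a Mountain Car observation might be packed in C. The helper function name is hypothetical and the allocation is done by hand for brevity; real environments would typically use the utilities in RLStruct_util.h.

#include <stdlib.h>
#include <rlglue/RL_common.h>   /* defines rl_abstract_type_t (see the C project section below) */

/* Hypothetical helper for this sketch: pack a Mountain Car observation
   (position and velocity) into the abstract type as a double list of length two. */
static void fill_mountain_car_observation(rl_abstract_type_t *obs,
                                          double position, double velocity) {
    obs->numInts     = 0;
    obs->numDoubles  = 2;
    obs->numChars    = 0;
    obs->intArray    = NULL;
    obs->charArray   = NULL;
    obs->doubleArray = (double *)calloc(obs->numDoubles, sizeof(double));
    obs->doubleArray[0] = position;   /* car position */
    obs->doubleArray[1] = velocity;   /* car velocity */
}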
1. env_start --> observation
2.     state = rand() * num_states
3.     set observation equal to state
4.     return observation
1. env_step(action) --> reward, observation, flag
2.     newState = updateState(action, state)
3.     flag = isTerminal(newState)
4.     reward = calculate reward for newState
5.     set observation equal to newState
6.     state = newState
7.     return reward, observation, flag

Here we assume the existence of a state update function and an isTerminal function that checks if the current state is a terminal state.
So that's it. Just fill in two functions and you have a valid RL-Glue environment. In later sections we will discuss advanced environment functions and how these additional functions can be used to write more complex experiment programs.
So far we have only scratched the surface of what you can do with RL-Glue environments. Additional environment functions can be used to initialize data structures and flexibly communicate with the environment from the experiment through ASCII messages.
More specifically, the task_spec string encodes a version number, the number of observation and action dimensions, the types of observations and actions, the ranges of the observations and actions and the min and max reward values.
The task_spec is constantly evolving to match the state of the art of learning algorithms and tasks being solved in reinforcement learning research; we expect that the task_spec will evolve much faster than the main RL-Glue protocol. To prevent this document from becoming quickly outdated, we have separated the task_spec documentation from the main RL-Glue documentation. Please see the online task_spec documentation for details about different task_spec versions.
The env_cleanup function usually deallocates or frees anything allocated in env_init.
1. env_message(inMessage) --> outMessage
2.     if inMessage == "turnOffRandomStarts"
3.         randStarts = false
4.     end
5.     if inMessage == "turnOnRandomStarts"
6.         randStarts = true
7.     end
8.     return ""
An agent program is fully compatible with RL-Glue if it initializes the action type and implements three functions: agent_start, agent_step and agent_end.
1. agent_start(observation) --> action
2.     lastObservation = observation
3.     for each action a
4.         if a is the highest valued action according to valueFunction(observation, a)
5.             then store a as lastAction
6.     return lastAction
1. agent_step(reward, observation) --> action
2.     update(valueFunction, lastObservation, lastAction, reward, observation)
3.     newAction = select_action(observation, valueFunction)
4.     lastObservation = observation
5.     lastAction = newAction
6.     return newAction

Notice that the agent program must explicitly store the observation and action from the previous time step. RL-Glue does not make the history of actions, observations and rewards available to the agent or environment.
Continuing with the SARSA example:
1. agent_end(reward)
2.     update(valueFunction, lastObservation, lastAction, reward)
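To connect this pseudocode to the math: for a SARSA agent, one common reading of the update(...) calls above is the usual temporal-difference rule from Sutton and Barto, where Q is the value function, \alpha the step size and \gamma the discount factor (symbols introduced here only for illustration):

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma\, Q(s', a') - Q(s, a) \right]

where (s, a) are the stored lastObservation and lastAction, r is the reward, and (s', a') are the new observation and the action the agent will take next. On the final transition (agent_end) there is no next state-action pair, so the target is just the reward:

Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r - Q(s, a) \right]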
The agent_end function does not receive the final observation from the environment. In many learning problems this is of no consequence because the agent does not make a decision in the terminal state. If, however, the agent were learning a model of the environment, information about the final transition would be important. In this case, it is recommended that the environment be augmented with a terminal state that has a reward of zero on the transition into it. This choice was made to keep the RL-Glue interface as minimal and light-weight as possible.
Here is a quick example of how you could query the current values of some parameters of an agent:
agent_message(inMessage) --> outMessage
    if inMessage == "getCurrentStepSize"
        return alpha
    end
    if inMessage == "getCurrentExplorationRate"
        return epsilon
    end
    return ""
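As a hedged sketch of what this could look like in C: the exact prototype of agent_message is defined by the C/C++ codec, so check Agent_common.h before copying this, and the alpha and epsilon variables are hypothetical agent parameters used only for this example. The main C-specific detail is that the returned string must outlive the function call, which is why a static buffer is used.

#include <stdio.h>
#include <string.h>
#include <rlglue/Agent_common.h>   /* declares the agent_ functions in the C codec */

static double alpha = 0.1;      /* hypothetical step-size parameter        */
static double epsilon = 0.05;   /* hypothetical exploration-rate parameter */

const char *agent_message(const char *inMessage) {
    /* Static buffer so the returned pointer remains valid after the call. */
    static char outMessage[32];

    if (strcmp(inMessage, "getCurrentStepSize") == 0) {
        snprintf(outMessage, sizeof(outMessage), "%f", alpha);
        return outMessage;
    }
    if (strcmp(inMessage, "getCurrentExplorationRate") == 0) {
        snprintf(outMessage, sizeof(outMessage), "%f", epsilon);
        return outMessage;
    }
    return "";
}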
At a minimum the experiment program must call RL_init and RL_cleanup and execute several time steps of agent-environment interaction. The following pseudo code represents a simple experiment program.
1. RL_init()
2. RL_start()
3. steps = 0
4. terminal = false
5. while steps < 100 and not terminal
6.     terminal, reward, observation, action = RL_step()
7.     steps = steps + 1
8. RL_cleanup()

This experiment program initializes the agent and environment (RL_init), calls the start functions of the agent and environment (RL_start) and then executes an episode of at most 100 steps.
The RL_step function calls the env_step function passing it the most recent agent action (in this case from agent_start). The env_step function returns the new observation, reward and terminal flag. If the flag is not set the agent_step function is called with the new observation and reward as input arguments. The action returned by agent_step is stored by RL-Glue until the next call to RL_step. If the flag is set, the agent_end function is called with the reward as input. This process continues until either the flag is set or 100 steps are completed.
Using the RL_step function gives the experiment designer access to all the data produced during an episode; however, it is often more convenient to use the RL_episode function when step-level control is not needed. Lines 5 through 7, in the above experiment program, can be replaced by a single call to RL_episode(100). If the input to RL_episode is zero, control will return to the experiment program if and only if the environment enters a terminal state (i.e., terminal flag from the env_step function is set to true).
The RL_step function allows the experiment program to record/sum/average the reward at each step, but the RL_episode function runs many (perhaps millions of) steps before returning control to the experiment program. The RL_return and RL_num_steps functions allow the experiment program to retrieve the cumulative reward and the number of steps used during the episode. Specifically, RL_return returns the sum of rewards accumulated during the current or most recently completed episode (it is reset to zero at the start of every episode). The RL_num_steps function returns the number of steps elapsed during the current or most recently completed episode (also reset to zero). The function reference in Section 6 provides pseudo code for each of the RL-Glue interface functions.
Putting these new functions together we can write a more useful experiment program:
1. RL_init()
2. theReturn = 0
3. for 1...100
4.     RL_episode(1000)
5.     theReturn += RL_return()
6. Print theReturn/100
7. RL_cleanup()

The above experiment program runs 100 episodes, each with a maximum length of 1000 steps, and computes the average cumulative reward per episode.
1. RL_init()
2. numSteps = 0
3. for 1...1000
4.     RL_episode(1000)
5. RL_agent_message("freezeAgentPolicy")
6. for 1...100
7.     RL_episode(1000)
8.     numSteps += RL_num_steps()
9. Print numSteps/100
10. RL_cleanup()
This experiment program has two phases. During the exploration phase (lines 3-5) the agent is allowed to interact with the environment without any penalty: the experiment does not measure the reward or number of steps taken during the exploration phase. The experiment program then informs the agent that the training phase is over (line 5). The agent then (presumably) stops learning so its policy can be evaluated on the same environment for 100 episodes (lines 6-8). The evaluation phase records the agent's performance by measuring the average number of steps the agent takes during each episode. Many results in the reinforcement learning literature are collected in a similar fashion.
Feel free to combine, mix and match the various RL-Glue interface functions. You will find that these functions allow you to write powerful experiment programs that are easy to read and understand.
Every agent must implement all of the following routines. Note that these functions are only accessed by RL-Glue. Experiment programs should not try to bypass the Glue to call these functions directly.
agent_init(task_specification)

This function will be called first, even before agent_start. The task_spec is a description of important experiment information, including but not limited to a description of the state and action space. The RL-Glue standard for writing task_spec strings is described in the online task_spec documentation. In agent_init, information about the environment is extracted from the task_spec and then used to set up any necessary resources (for example, initialize the value function).
agent_start(first_observation) --> first_action

Given first_observation (the observation of the agent in the start state), the agent must return the action it wishes to perform. This is called once if the task is continuing; otherwise it is called at the beginning of each episode.
agent_step(reward, observation) --> action

This is the most important function of the agent. Given the reward garnered by the agent's previous action and the resulting observation, choose the next action to take. Any learning (policy improvement) should be done through this function.
agent_end(reward)

If the agent is in an episodic environment, this function will be called after the terminal state is entered. This allows for any final learning updates. If the episode is terminated prematurely (i.e., RL_episode cuts the episode off before a terminal state is entered), agent_end is NOT called.
agent_cleanup()

This function is called at the end of a run/trial and can be used to free any resources which may have been allocated in agent_init. Calls to agent_cleanup should be in a one-to-one ratio with calls to agent_init.
agent_message(input_message) --> output_message

The agent_message function is a jack of all trades and master of none. Having no particular functionality, it is up to the user to determine what agent_message should implement. If there is any information which needs to be passed in or out of the agent, this message should do it. For example, if it is desirable that an agent's learning parameters be tweaked mid-experiment, the author could establish an input string that triggers this action. Likewise, if the author wished to extract a representation of the value function, they could establish an input string which would cause agent_message to return the desired information.
NOTE: Unlike the other functions, agent_message can be called at any time: including before agent_init and after agent_cleanup.
env_init() --> task_specification

This routine will be called exactly once for each trial/run. This function is an ideal place to initialize all environment information and allocate any resources required to represent the environment. It must return a task_spec which adheres to the task_spec language. A task_spec stores information regarding the observation and action space, as well as whether the task is episodic or continuing.
env_start() --> first_observation

For a continuing task this is called once. For an episodic task, it is called at the beginning of each episode. env_start assembles the first_observation given that the agent is in the start state. Note that the start state cannot also be a terminal state.
env_step(action) --> reward, observation, terminal

Complete one step in the environment. Take the action passed in and determine what the reward and next state are for that transition.
env_cleanup()

This can be used to release any allocated resources. It will be called once for every call to env_init.
env_message(input_string) --> output_string

Similar to agent_message, this function allows for any message passing to the environment required by the experiment program. This may be used to modify the environment mid-experiment. Any information that needs to be passed in or out of the environment can be handled by this function.
NOTE: Unlike the other functions, env_message can be called at any time: including before env_init and after env_cleanup.
The following built-in RL-Glue functions are provided primarily for experiment program writers. Using these functions, the experiment program gains access to the corresponding environment and agent functions. The implementation of these routines is standard across all RL-Glue users. To ensure that agents/environments/experiment programs can be exchanged between authors with no changes necessary, users should not change the RL-Glue interface code provided.
To understand the following, it is helpful to think of an episode as consisting of sequences of observations, actions, and rewards that are indexed by time-step as follows:
o_0, a_0, r_1, o_1, a_1, r_2, o_2, a_2, ..., r_T, terminal_observation

where the episode lasts T time steps (T may be infinite) and terminal_observation is a special, designated observation signaling the end of the episode.
RL_init() --> task_specification
    agent_init(env_init())

This initializes everything, passing the environment's task_spec to the agent. This should be called at the beginning of every trial.
RL_start() --> o_0, a_0
    o = env_start()
    a = agent_start(o)
    nextAction = a
    return o, a
Do the first step of a run or episode. The action is saved in nextAction so that it can be used on the next step.
RL_step() --> r_t, o_t, terminal, a_t
    r, o, terminal = env_step(nextAction)
    if terminal == true
        agent_end(r)
        return r, o, terminal
    else
        a = agent_step(r, o)
        nextAction = a
        return r, o, terminal, a

Take one step. RL_step uses the saved action and saves the returned action for the next step. The action returned from one call must be used in the next, so it is better to handle this implicitly so that the user doesn't have to keep track of the action. If the end-of-episode observation occurs, then no action is returned.
RL_episode(steps) --> terminal
    num_steps = 0
    o, a = RL_start()
    num_steps = num_steps + 1
    list = [o, a]
    while o != terminal_observation
        if steps != 0 and num_steps >= steps
            return 0
        else
            r, o, a = RL_step()
            list = list + [r, o, a]
            num_steps = num_steps + 1
    return 1
Do one episode until a terminal observation occurs or until the given number of steps has elapsed, whichever comes first. As you might imagine, this is done by calling RL_start, then RL_step until the terminal observation occurs (at which point RL_step calls agent_end). If steps is set to 0, there is no limit on the number of steps and RL_episode will continue until a terminal observation occurs. If no terminal observation is reached before the step limit, agent_end is not called; the episode simply stops.
RL_return() --> return

Return the cumulative total reward of the current or just-completed episode. The collection of all the rewards received in an episode (the return) is done within RL_return; however, any discounting of rewards must be done inside the environment or agent.
RL_num_steps() --> num_steps

Return the number of steps elapsed in the current or just-completed episode.
RL_cleanup()
    env_cleanup()
    agent_cleanup()

Provides an opportunity to reclaim resources allocated by RL_init.
RL_agent_message(input_message_string) --> output_message_string
    return agent_message(input_message_string)

This passes the input string to the agent and returns the reply string given by the agent. See agent_message for more details.
RL_env_message(input_message_string) --> output_message_string
    return env_message(input_message_string)

This passes the input string to the environment and returns the reply string given by the environment. See env_message for more details.
As usual, you don't need to change your code depending on how it will be used. The code for an agent, environment, or experiment is identical whether you run it over sockets or compile it directly together with the Glue. The only difference is which library you link against.
This project is written entirely in C and can be linked from C or C++ code.
When you download and install the RL-Glue project, you get a few artifacts:
They should be included in your agents/environments/experiments like this:
<rlglue/RL_common.h>                 /* Data structures */
<rlglue/RL_glue.h>                   /* (RL_) functions for experiments (includes RL_common) */
<rlglue/Agent_common.h>              /* Agent (agent_) functions (includes RL_common) */
<rlglue/Environment_common.h>        /* Environment (env_) functions (includes RL_common) */
<rlglue/utils/C/RLStruct_util.h>     /* Handy utility functions for copying/initing structs */
<rlglue/utils/C/TaskSpec_Parser.h>   /* Task Spec Parser functions */
Generally, each of agent/env/experiment should only have to include one of the glue or common files. You'll probably never include RL_common.h, but it is needed by the others.
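As a rough sketch of how this looks in practice (assuming C prototypes that mirror the Section 6 pseudocode; consult RL_glue.h and the codec manual for the exact return types), the averaging experiment from Section 5 might be written like this when compiled against the C project:

#include <stdio.h>
#include <rlglue/RL_glue.h>   /* RL_ functions for experiment programs */

int main(void) {
    double totalReturn = 0.0;
    int episode;

    RL_init();                           /* calls env_init and agent_init       */
    for (episode = 0; episode < 100; episode++) {
        RL_episode(1000);                /* one episode, capped at 1000 steps   */
        totalReturn += RL_return();      /* cumulative reward of that episode   */
    }
    printf("Average return per episode: %f\n", totalReturn / 100.0);
    RL_cleanup();                        /* calls env_cleanup and agent_cleanup */
    return 0;
}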
This project is written entirely in C and can be linked from C or C++ code.
The artifacts of the C Codec are:
So, to reduce the clutter and cruft of the API, we've removed Freeze. (How do you unfreeze, anyway?)
The rl_abstract_type_t now looks like:
typedef struct {
    unsigned int numInts;
    unsigned int numDoubles;
    unsigned int numChars;

    int* intArray;
    double* doubleArray;
    char* charArray;
} rl_abstract_type_t;
Keep in mind that charArray is an array of characters. It is not necessarily null terminated. We don't enforce null termination. Remember, 3 chars take up 3 array spots, but the C string ``123'' takes up 4 (a `\0' at the end).
If you do the following, bad things will probably happen if the char array is not null terminated:
printf("My char array is %s\n",observation.charArray);
The online FAQ may be more current than this document, which may have been distributed some time ago.
We're happy to answer any questions about RL-Glue. Of course, try to search through previous messages first in case your question has been answered before.
However, there is no reason that an implementation of an agent or environment shouldn't be designed using an object-oriented approach. In fact, many of the contributors to this project have their own object-oriented libraries of agents that they use with RL-Glue.
Some might argue that it makes sense to create a C/C++ or Java codec that supports an OO design directly. This would not be hard, it's just a matter of someone interested picking up the project and doing it. Personally, we've found it easy enough to write a small bridge between the existing codecs and our personal OO hierarchies.
Revision Number: $Rev: 962 $
Last Updated By: $Author: brian@tannerpages.com $
Last Updated: $Date: 2009-02-03 19:11:22 -0700 (Tue, 03 Feb 2009) $
$URL: https://rl-glue.googlecode.com/svn/trunk/docs/Glue-Overview.tex $