Politechnika Warszawska Wydział Elektroniki i Technik Informacyjnych Instytut Automatyki i Informatyki Stosowanej. Intensive Reinforcement Learning - PDF

Politechnika Warszawska Wydział Elektroniki i Technik Informacyjnych Instytut Automatyki i Informatyki Stosowanej Paweł Wawrzyński Intensive Reinforcement Learning Rozprawa Doktorska w Dziedzinie Informatyki

Please download to get full document.

View again

of 39
All materials on our website are shared by users. If you have any questions about copyright issues, please report us to resolve them. We are always happy to assist you.


Publish on:

Views: 5 | Pages: 39

Extension: PDF | Download: 0

Politechnika Warszawska Wydział Elektroniki i Technik Informacyjnych Instytut Automatyki i Informatyki Stosowanej Paweł Wawrzyński Intensive Reinforcement Learning Rozprawa Doktorska w Dziedzinie Informatyki Promotor Pracy prof. nzw. dr hab. inż. Andrzej Pacut Warszawa 2005 Warsaw University of Technology Department of Electronics and Information Technology Institute of Control and Computation Engineering Paweł Wawrzyński Intensive Reinforcement Learning Dissertation Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Computer Science Thesis Supervisor Professor Andrzej Pacut May 2005 Streszczenie W niniejszej pracy problem uczenia się przez wzmacnianie jest przedyskutowany w języku statystyki jako zagadnienie estymacji. Zaproponowana zostaje rodzina algorytmów uczenia się przez wzmacnianie które wyznaczają politykę sterowania w procesie obliczeniowym przetwarzającym całą dostępną historię interakcji między sterwnikiem a urządzeniem. Aproksymacja stochastyczna jako mechanizm odpowiedzialny za zbieżność algorytmów uczenia się przez wzmacnianie jest zastąpiona przez estymację opartą na całej próbie. Zaprezentowane badania symulacyjne pokazują, że uzyskane w ten sposób algorytmy są w stanie zidentifikować parametery nietrywialnych sterowników w ciągu kilkudziesięciu minut sterowania urządzeniem. Czyni je to wielokrotnie wydajniejszymi od istniejących odpowiedników. Słowa kluczowe: Sztuczna Inteligencja, Uczenie się Maszyn, Uczenie się przez Wzmocnianie, Przybliżone Programowanie Dynamiczne, Sieci neuronowe, Sterowanie adaptacyjne. Abstract The Reinforcement Learning (RL problem is analyzed in this dissertation in the language of statistics as an estimation issue. A family of RL algorithms is introduced. They determine a control policy by processing the entire known history of plant-controller interactions. Stochastic approximation as a mechanism that makes the classical RL algorithms converge is replaced with batch estimation. The experimental study shows that the algorithms obtained are able to identify parameters of nontrivial controllers within a few dozens of minutes of control. This makes them a number of times more efficient than their existing equivalents. Keywords: Artificial Inteligence, Machine Learning, Reinforcement Learning, Approximate Dynamic Programming, Neural Networks, Adaptive Control. ii Acknowledgements I wish to thank my advisor, Professor Andrzej Pacut for introducing me into the world of science. iii Contents 1 Introduction Reinforcement Learning Scope of the work The problem of Reinforcement Learning Problem definition Short-term-memory vs. long-term-memory solutions The approach to RL inspired by biology The prototype of randomized Actor-Critics Q-Learning The approach inspired by dynamic programming Heuristic Dynamic Programming Action-Dependent HDP Dual Heuristic Programming and its modifications The approach based on agnostic optimization Smoothly exploring distributions Short-term-memory optimization Long-term-memory optimization Two estimation problems The LONG-MEM-MAX algorithm Limit properties of estimators Behavior of LONG-MEM-MAX, distant exploration tendency Examples of ϕ Normal distribution Log-normal distribution Finite set of controls Ordered set of controls v 4 Short-term-memory Actor-Critic algorithm Actor Critic Actor-Critic interactions The RAWC algorithm Actor adaptation Critic adaptation Implementational issues RAWC with λ-estimators of future rewards Long-term-memory Actor-Critic algorithm Structure Policy evaluation Policy optimization Limit behavior, comparison to RAWC The algorithm at work Implementation of the internal loop Extensions IRAWC with λ-estimators of future rewards Sequential implementation of IRAWC(λ Between RAWC and IRAWC. Semi-Intensive algorithms Continuous time Experimental study Algorithms implementations RAWC and RAWC(λ IRAWC IRAWC(λ, and SIRAWC, SIRAWC(λ Cart-Pole Swing-Up Controller and Critic Reinforcements and training Experiments and discussion Comparison to Adaptive Critic Designs Robot Weightlifting Controller and Critic Reinforcements and training Experiments and discussion Narendra s MIMO Controller and Critic 7.4.2 Reinforcements and training Experiments and discussion Summary Conclusions Contributions Future work Appendix 107 A Is reinforcement learning without exploration possible? 107 B Plants 111 B.1 Cart-Pole Swing-Up B.2 Robot Weightlifting B.3 Narendra s MIMO Bibliography 117 List of Tables 2.1 The prototype of randomized Actor-Critics The Q-Learning algorithm Heuristic Dynamic Programming Action-Dependent HDP Dual Heuristic Programming The SHORT-MEM-MAX algorithm of maximization of (Hf(θ The LONG-MEM-MAX algorithm of maximization of (Hf(θ The RAWC algorithm The RAWC(λ algorithm The IRAWC algorithm The SIRAWC algorithm Parameters of the RAWC algorithm s implementation The implementation of the IRAWC algorithm Parameters of IRAWC implementation Parameters of randomized Actor-Critic algorithms confronted with the Cart-Pole Swing-Up The implementation of Action-Dependent HDP The implementation of On-Line Modeling HDP Parameters of Adaptive Critic Designs Parameters of ACDs applied to the Cart-Pole Swing-Up Parameters of algorithms applied to the Robot Weightlifting Parameters of algorithms applied to the MIMO ix Notational conventions Vectors and matrices are denoted with the use of standard mathematical font: x, w, etc. For the vector function f : R n R m, the matrix of its partial derivatives will be denoted by f 1 f x df(x 1 m x 1 dx = f 1 x n Parametric aproximators are denoted with the use of wide tilde. E.g. f(x; w is an approximator parameterised by the vector w whose input is x. Predicates in formulae [predicate] = { e.g. sign(x = [x 0] + [x 0]. f m x n 1 if the predicate is true 0 otherwise Parameterized distributions are denoted by ϕ(z; θ where ϕ is a density (or a probability in the discrete case and θ is the parameter. The caption Y ϕ( ; θ means that the random variable Y is drawn with the use of the distribution ϕ( ; θ. Let r be a predicate, f be a function, and they both depend on Y. When the probability of r(y and the expected value of f(y is calculated for Y ϕ( ; θ, the probability is denoted by P θ (r(y and the expected value is denoted by E θ f(y. The normal distribution with mean θ and covariance matrix C is denoted by N(θ, C. Control policies (or simply policies are denoted by π. If a policy depends on a parameter, it is denoted by π(w, where w is the parameter. xi Chapter 1 Introduction Bodies of living organisms constitute very complex mechanical systems. So far, no methods are known that enable to design control mechanisms for artificial systems characterized by such complexity. It is believed that living organisms learn with the use of trials and errors to control their bodies. The topic of this dissertation may be understood as biologically inspired methodology of automatic trial and error optimization of control. From the practical point of view, we deal with the following problem. Suppose there is a controller together with a plant characterized by unknown dynamics. The controller, in order to work properly, has to optimize its parameters. The only way to do so available to the controller is an interaction with the plant, its observation, and analysis of consequences of executed control actions. Modern methodology of control system design in most cases involves some preliminary knowledge of the controlled plant. One extremal setting is that the dynamics of the controlled plant is known entirely in the sense that its exact motion equations are available and there are no external disturbances. That may require some preliminary tests with the use of the working plant. However, before the design of the control system is begun, all the necessary knowledge about the plant s dynamics is gathered. This setting prevails in the practice of control system design. The field of automatic control provides satisfying tools for a controller design in this case. The designer does not have to engage the plant because everything he would like to know about it and its control he may check in derivations and/or simulations. Whenever he knows less than enough about the plant, the control system must be optimized while it is working. In this case methods of adaptive control [2, 40] must be in use. Usually the designer of the controller knows something about the controlled system. Usually some of its components have been produced by humans. For instance the designer may know motion equations of the system yet he may be uncertain about values of some parameters. In case that their values imply the controller s 1 2 CHAPTER 1. INTRODUCTION behavior (the control policy, they may be estimated when the plant is working and the control takes place. The estimates determine then the control policy. This methodology is called indirect adaptive control. Alternatively, observation of the control process may imply the controller s behavior directly without the model estimation. This methodology of control optimization is called direct adaptive control. In this dissertation we deal with direct adaptive control of a plant whose dynamics is initially unknown, yet we suspect that it is too complex to try to build its model. 1.1 Reinforcement Learning From the point of view of modern engineering the exhibited by natural organisms ability to control motion of their bodies is indeed impressive. Suppose we were to design a control system for a humanoid robot that is to dig pits in the ground with a shovel. Human s skeleton is, from the point of view of robotics, a manipulator whose number of degrees of freedom exceeds one hundred 1. We are completely unable to provide a model of its motion. In fact, the modeling of a manipulator whose number of degrees of freedom exceeds 10, constitutes an enormous challenge. Furthermore, even if we had the model, we would still have to describe the desired behavior of the robot. In robotics this means that we are to provide trajectories of each of 120 degrees of freedom. It is even difficult to imagine how such a task could be completed. Nevertheless, almost each adult human is able to learn to dig pits in the ground with a shovel. The biologically inspired methodology of control systems optimization discussed in the presented dissertation is reinforcement learning (RL. We shall understand here RL as follows. Let us consider a plant. At each moment it is in a certain state that characterizes a current position and dynamics of the plant as well as a current objective of its operation 2. At each moment a controller applies a control stimulus to the plant and receives a numeric reward (or reinforcement. The closer the plant is to attain its current objective, the bigger the reward is. The goal of controller s reinforcement learning is to find with the use of trials and errors the optimal way of determining control stimuli on the basis of states. Optimality means here that starting from each state the controller may expect to receive the biggest possible sum of future (discounted rewards. 1 There are 204 bones, 460 muscles, and over one hundred joints in human body; the number of joints is difficult to provide, it depends on how one defines a single joint. 2 Note that in control theory the state does not include the objective of the plant s operation. Here we understand the state as all the current information the controller may need to control the plant. 1.1. REINFORCEMENT LEARNING 3 Barto and Dietterich are perhaps the most appropriate people to explain relations between the field of reinforcement learning and other sciences. In [5] they write: The term reinforcement comes from studies of animal learning in experimental psychology, where it refers to the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation. The simplest reinforcement learning algorithms make use of the commonsense idea that if an action is followed by a satisfactory state of affairs, or an improvement in the state of affairs, then the tendency to produce that action is strengthened, that is, reinforced. This is the principle articulated by Thorndike in his famous Law of Effect [46]. Instead of the term reinforcement learning, however, psychologists use the terms instrumental conditioning, or operant conditioning, to refer to experimental situation in which what an animal actually does is a critical factor in determining the occurrence of subsequent events. These situations are said to include response contingencies, in contrast to Pavlovian, or classical, conditioning situations in which the animal s responses do not influence subsequent events, at least not those controlled by the experimenter. It is no wonder that the very idea of designing control systems that adapt like natural controllers do is sound and has existed in control systems community for a very long time; see e.g. [39, 21, 55]. However, the field of reinforcement learning [4, 44] seems to be founded outside this community and is driven by the idea of creation of artificial intelligent systems [24, 35, 48] that adapt to their environments. The perceived control engineering problems seem to be of less relevance for RL. This dissertation stands somewhere between these approaches. Our research is driven both by the idea of creation artificial intelligent systems and by perceived control problems. In fact, reinforcement learning, as well as other forms of direct adaptive control, introduces an approach to control system design that is alternative to the standard, model based methodology. We do not bother with the model of a process, but design a controller that learns to control. Potentially, we replace the procedure lasting for months and involving system modeling and identification, building of a simulator, and designing of the controller, with the process of controller learning, which lasts for maybe several hours and involves only human supervision. As far as the author is aware, at the beginning of 21 st century there is nobody, who makes money designing control systems that learn to control physical devices with the use of trials and errors. This dissertation belongs to publications intended to bring the moment of appearance of such people nearer. Obviously several questions emerge. The most important one concerns security. There are some devices, which are expensive, unique and easy to destroy by inappropriate control. Is it possible to learn to control such machines? On the other 4 CHAPTER 1. INTRODUCTION hand, if we think of cheap, mass-produced robots, the destruction of one of them during the controller training may not matter. 1.2 Scope of the work This dissertation deals with reinforcement learning methods applicable for control system training. The area of reinforcement learning is not homogenous. From variety of approaches, the spirit of the dissertation is closest to the one associated with randomized Actor-Critic algorithms [4, 16, 45, 17]. The theses of the presented dissertation are as follows. 1. A natural theoretic base for prevailing reinforcement learning algorithms is stochastic approximation. Such methods utilize information obtained from each consecutive control step to construct a gradient estimator of a certain global performance index and use it to alter the parameters of the control policy. We claim that one may also design an RL algorithm based on batch estimation. It is possible to construct an estimator of the global performance index parameterized by the policy parameters. We claim that optimization of this estimator with respect to the parameters leads to the optimization of the performance. 2. The objective of each reinforcement learning algorithm is to exploit observations of the control process to optimize the control policy. Let us analyze an RL algorithm as a procedure that receives certain input data (observations of a control process and produces certain output data (parameters of a control policy. All but few RL algorithms are so constructed that they process the input data sequentially, piece by piece. We claim that it is possible to design much faster algorithms when the above sequentiality constraint is discarded. Such algorithms may optimize the estimator of the performance index with respect to the policy parameters. They may have properties very similar to those of the methods based on stochastic approximation, yet they exploit available data much more efficiently. 3. We design a family of algorithms that put information of occurring control steps in a database and optimize the control policy in a computational process that is conducted simultaneously with the control process. The introduced algorithms have the form of certain optimization issues. We propose numerical methods to tackle these particular issues. We claim that the proposed algorithms are feasible and substantially faster than the comparable ones. When applied to problems of moderate complexity, they are able to optimize the control policy far before the stored data fill in the memory capacity of a typical computer. This document is organized as follows. Chapter 2 introduces the RL problem. It distinguishes short-term-memory and long-term-memory RL algorithms depending on whether they process the occurring control steps consecutively or not. The chapter also overviews main approaches in the field of RL. RL methods we introduce here are based on properties of certain probabilistic distributions. These are discussed in Chapter 3. Subsequent chapter discusses existing short-term-memory randomized Actor-Critic algorithms in the language that will be further utilized to present new methods. Chapter 5 constitutes the core of the dissertation. It introduces a basic long-term-memory Actor-Critic algorithm. Subsequent chapter extends the basic method to an entire family of algorithms. Chapter 7 contains an experimental study that compares existing methods to the ones proposed here. Chapter 8 concludes and overviews directions of possible future work. 5 6 Chapter 2 The problem of Reinforcement Learning We put the initial issue of control optimization into the framework of reinforcement learning (RL. The first formulation of RL comes from artificial intelligence (see [44] for exhaustive insight. It was then reformulated in the language of automatic control (e.g. [7] which is used below. This chapter has also the intention of presenting the background of methods introduced in the following chapters. There are three trends in RL we are aware of, and we try to borrow from all of them. In the first and probably the most influential approach, RL is viewed as a computer implementation of adaptation observed in natural organisms. It is mainly inspired by biology and psychology. Within the second approach the objective of a learning control system is to identify a model of the controlled plant and determine a policy by means of approximate Dynamic Programming. Within the third approach the RL problem is solved by direct search of the policy space that takes little advantage of the particular features of the problem. We overview compactly each of these approaches. We also refer an interested reader to [7, 44, 38] for deeper insight. 2.1 Problem definition In the RL framework, a learning controller interacts with a plant over a series of time steps t = 1, 2, 3,.... At each time t, the controller observes the plant s state, x t X, and chooses a control action, u t U, which causes the plant to transition to state x t+1. The controller receives then a reinforcement or reward, r t+1 R. The next state and reinforcement depend only on the preceding state and action, and 7 8 CHAPTER 2. THE PROBLEM OF REINFORCEMENT LEARNING they depend on these in a stochastic manner, namely x t+1 P x x,u( x t, u t
Related Search
We Need Your Support
Thank you for visiting our website and your interest in our free products and services. We are nonprofit website to share and download documents. To the running of this website, we need your help to support us.

Thanks to everyone for your continued support.

No, Thanks