Keep Doing What Worked: Behavioral Modelling Priors for Offline
**Reinforcement** **Learning**

… **reinforcement** **learning** algorithms promise to be applicable in settings where only a fixed dataset (batch) of environment interactions is available and no new experience can be acquired. This property makes these algorithms appealing for real-world problems such as robot control. In practice, however, standard off-policy algorithms fail in the batch setting for continuous control. In this paper, we propose a simple solution to this problem. It admits the use of data generated by arbitrary behavior policies and uses a **learned** prior, the advantage-weighted behavior model (ABM), to bias the RL policy towards actions that have previously been executed and are likely to be successful on the new task. Our method can be seen as an extension of recent work on batch RL that enables stable **learning** from conflicting data sources. We find improvements on competitive baselines in a variety of RL tasks, including standard continuous control benchmarks and multi-task **learning** for simulated and real-world robots.

10/10 relevant

arXiv
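The ABM prior described in the abstract above can be illustrated with a minimal sketch: behavior-cloning a policy from batch data, but weighting each transition by whether its estimated advantage is non-negative, so the prior concentrates on actions that previously worked. Everything here (the toy data, the state-independent softmax, the indicator weighting) is an illustrative assumption, not the paper's actual architecture.

```python
import numpy as np

# Toy batch of transitions gathered by arbitrary behavior policies.
# Actions are discrete (3 choices); advantages are hypothetical estimates
# of how well each action worked on the new task.
actions = np.array([0, 0, 1, 1, 2, 2, 2, 1])
advantages = np.array([0.5, -0.2, 1.0, 0.3, -0.8, -0.1, 0.2, -1.5])

# ABM-style filtering: up-weight transitions whose actions appear successful.
# The indicator f(A) = 1[A >= 0] is one simple choice of weighting function.
weights = (advantages >= 0).astype(float)

# Fit a (state-independent, purely illustrative) softmax prior over actions
# by advantage-weighted maximum likelihood, using plain gradient ascent.
logits = np.zeros(3)
for _ in range(500):
    probs = np.exp(logits) / np.exp(logits).sum()
    counts = np.bincount(actions, weights=weights, minlength=3)
    grad = counts - weights.sum() * probs   # d/dlogits of weighted log-lik
    logits += 0.1 * grad

prior = np.exp(logits) / np.exp(logits).sum()
print(prior)  # most mass on action 1, whose advantages were mostly positive
```

In the full method this prior would then regularize the RL policy (e.g. via a KL term) rather than be used directly.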

An anatomical substrate of credit assignment in **reinforcement** **learning**

**Learning** turns experience into better decisions. A key problem in **learning** is credit assignment: knowing how to change parameters, such as synaptic weights deep within a neural network, in order to improve behavioral performance. Artificial intelligence owes its recent bloom largely to the error-backpropagation algorithm [1], which estimates the contribution of every synapse to output errors and allows rapid weight adjustment. Biological systems, however, lack an obvious mechanism to backpropagate errors. Here we show, by combining high-throughput volume electron microscopy [2] and automated connectomic analysis [3-5], that the synaptic architecture of songbird basal ganglia supports local credit assignment using a variant of the node perturbation algorithm proposed in a model of songbird **reinforcement** **learning** [6,7]. We find that key predictions of the model hold true: first, cortical axons that encode exploratory motor variability terminate predominantly on dendritic shafts of striatal spiny neurons, while cortical axons that encode song timing terminate almost exclusively on spines. Second, synapse pairs that share a presynaptic cortical timing axon and a postsynaptic spiny dendrite are substantially more similar in size than expected, indicating Hebbian plasticity [8,9]. Combined with numerical simulations, these findings provide strong evidence for a biologically plausible credit assignment mechanism [6].

8/10 relevant

bioRxiv
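The node perturbation algorithm referenced in this abstract has a compact classical form: inject noise into a unit's activity, compare the resulting reward with a running baseline, and nudge the weights along the perturbation when the outcome improved. This toy single-unit sketch (fixed input, quadratic reward, all constants hypothetical) shows that structure, not the songbird model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# A single linear unit learning by node perturbation: inject noise into the
# unit's activity, compare the obtained reward with a running baseline, and
# move the weights along the perturbation when the outcome improved.
w = np.zeros(3)
x = np.array([1.0, 0.5, -0.5])   # a fixed input pattern (hypothetical)
target = 2.0                      # reward is highest when w @ x hits this
baseline = None

for _ in range(2000):
    noise = rng.normal(scale=0.5)           # exploratory variability
    y = w @ x + noise                       # perturbed activity
    reward = -(y - target) ** 2
    if baseline is None:
        baseline = reward
    # Three-factor update: perturbation x input x reward improvement.
    w += 0.05 * (reward - baseline) * noise * x
    baseline += 0.1 * (reward - baseline)   # running-average baseline

print(w @ x)  # approaches the target activity of 2.0
```

The correlation of noise with reward improvement yields an unbiased (if noisy) gradient estimate, which is why no backpropagated error signal is needed.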

Control Frequency Adaptation via Action Persistence in Batch
**Reinforcement** **Learning**

… **reinforcement** **learning** algorithms to learn a highly performing policy. In this paper, we introduce the notion of action persistence, which consists in the repetition of an action for a fixed number of decision steps and has the effect of modifying the control frequency. We start by analyzing how action persistence affects the performance of the optimal policy, and then we present a novel algorithm, Persistent Fitted Q-Iteration (PFQI), that extends FQI with the goal of **learning** the optimal value function at a given persistence. After providing a theoretical study of PFQI and a heuristic approach to identify the optimal persistence, we present an experimental campaign on benchmark domains to show the advantages of action persistence and to prove the effectiveness of our persistence selection method.

10/10 relevant

arXiv
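Action persistence as defined in this abstract is easy to sketch as an environment wrapper that repeats each chosen action for k base decision steps and accumulates the reward. The gym-like interface and the `CountEnv` toy environment are assumptions for illustration; PFQI itself (fitted Q-iteration at a given persistence) is not implemented here.

```python
class PersistenceWrapper:
    """Repeat every chosen action for k consecutive base steps."""
    def __init__(self, env, k):
        self.env, self.k = env, k

    def reset(self):
        return self.env.reset()

    def step(self, action):
        total_reward, done = 0.0, False
        for _ in range(self.k):          # persist the action for k base steps
            obs, reward, done = self.env.step(action)
            total_reward += reward
            if done:
                break
        return obs, total_reward, done


class CountEnv:
    """Toy environment: reward equals the action, episode ends after 10 steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, float(action), self.t >= 10


env = PersistenceWrapper(CountEnv(), k=4)
env.reset()
obs, r, done = env.step(1.0)
print(obs, r, done)   # 4 base steps elapsed, reward 4.0, not yet done
```

Training any value-based learner on the wrapped environment then corresponds to learning at the lower control frequency.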

First Order Optimization in Policy Space for Constrained Deep
**Reinforcement** **Learning**

In **reinforcement** **learning**, an agent attempts to learn high-performing behaviors through interacting with the environment; such behaviors are often quantified in the form of a reward function. However, some aspects of behavior, such as those which are deemed unsafe and are to be avoided, are best captured through constraints. We propose a novel approach called First Order Constrained Optimization in Policy Space (FOCOPS) which maximizes an agent's overall reward while ensuring the agent satisfies a set of cost constraints. Using data generated from the current policy, FOCOPS first finds the optimal update policy by solving a constrained optimization problem in the nonparameterized policy space. FOCOPS then projects the update policy back into the parametric policy space. Our approach provides a guarantee for constraint satisfaction throughout training and is first-order in nature, and therefore extremely simple to implement. We provide empirical evidence that our algorithm achieves better performance on a set of constrained robotics locomotion tasks compared to current state-of-the-art approaches.

10/10 relevant

arXiv

Reward Design for Driver Repositioning Using Multi-Agent **Reinforcement**
**Learning**

… multi-agent **reinforcement** **learning** (MARL) approach. Noticing that the direct application of MARL to the multi-driver system under a given reward mechanism will very likely yield a suboptimal equilibrium due to the selfishness of drivers, this study proposes a reward design scheme with which a more desirable equilibrium can be reached. To effectively solve the bilevel optimization problem, with the upper level as the reward design and the lower level as a multi-agent system (MAS), a Bayesian optimization algorithm is adopted to speed up the **learning** process. We then use a synthetic dataset to test the proposed model. The results show that the weighted average of order response rate and overall service charge can be improved by 4% using a simple platform service charge, compared with that of no reward design.

9/10 relevant

arXiv

Universal Value Density Estimation for Imitation **Learning** and
Goal-Conditioned **Reinforcement** Learning

… imitation **learning** and goal-conditioned **reinforcement** **learning**. In either case, effective solutions require the agent to reliably reach a specified state (a goal), or a set of states (a demonstration). Drawing a connection between probabilistic long-term dynamics and the desired value function, this work introduces an approach which utilizes recent advances in density estimation to effectively learn to reach a given state. As our first contribution, we use this approach for goal-conditioned **reinforcement** **learning** and show that it is both efficient and does not suffer from hindsight bias in stochastic domains. As our second contribution, we extend the approach to imitation **learning** and show that it achieves state-of-the-art demonstration sample-efficiency on standard benchmark tasks.

10/10 relevant

arXiv

Fast **Reinforcement** **Learning** for Anti-jamming Communications

… **reinforcement** **learning** algorithm for anti-jamming communications which chooses the previous action with probability $\tau$ and applies $\epsilon$-greedy with probability $(1-\tau)$. A dynamic threshold based on the average value of the previous several actions is designed, and the probability $\tau$ is formulated as a Gaussian-like function to guide the wireless devices. As a concrete example, the proposed algorithm is implemented in a wireless communication system against multiple jammers. Experimental results demonstrate that the proposed algorithm exceeds Q-learning, deep Q-networks (DQN), double DQN (DDQN), and prioritized experience replay based DDQN (PDDQN) in terms of signal-to-interference-plus-noise ratio and convergence rate.

10/10 relevant

arXiv
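The selection rule described in this abstract (repeat the previous action with probability τ, otherwise fall back to ε-greedy) can be sketched in a few lines. The exact shape of the Gaussian-like τ and of the dynamic threshold below are assumptions for illustration, not the paper's formulas.

```python
import math
import random

def select_action(q_values, prev_action, recent_values, epsilon=0.1, sigma=1.0):
    """Repeat the previous action with probability tau, otherwise epsilon-greedy.

    Assumptions: the dynamic threshold is the average of the values of the
    previous several actions, and tau is a Gaussian-like function of how far
    the previous action's value has fallen below that threshold.
    """
    threshold = sum(recent_values) / len(recent_values)
    gap = min(q_values[prev_action] - threshold, 0.0)
    tau = math.exp(-(gap ** 2) / (2.0 * sigma ** 2))
    if random.random() < tau:
        return prev_action                       # stick with what worked
    if random.random() < epsilon:
        return random.randrange(len(q_values))   # explore
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```

For example, `select_action([0.0, 5.0], 1, [1.0, 2.0])` always returns 1: the previous action's value sits above the threshold, so τ = 1 and the device keeps transmitting on the same choice.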

Effective **Reinforcement** **Learning** through Evolutionary Surrogate-Assisted
Prescription

… **reinforcement** **learning** (RL) benchmarks. Because the majority of evaluations are done on the surrogate, ESP is more sample-efficient, has lower variance, and lower regret than standard RL approaches. Surprisingly, its solutions are also better because both the surrogate and the strategy network regularize the decision-making behavior. ESP thus forms a promising foundation for decision optimization in real-world problems.

10/10 relevant

arXiv

A Framework for End-to-End **Learning** on Semantic Tree-Structured Data

… **learning** models are typically studied for inputs in the form of a fixed-dimensional feature vector, yet real-world data is rarely found in this form. In order to meet the basic requirement of traditional **learning** models, structured data generally has to be converted into fixed-length vectors in a handcrafted manner, which is tedious and may even incur information loss. A common form of structured data is what we term "semantic tree-structures", corresponding to data where rich semantic information is encoded in a compositional manner, such as those expressed in JavaScript Object Notation (JSON) and eXtensible Markup Language (XML). For tree-structured data, several **learning** models have been studied that allow working directly on raw tree-structured data. However, such **learning** models are limited to either a specific tree topology or a specific tree-structured data format, e.g., synthetic parse trees. In this paper, we propose a novel framework for end-to-end **learning** on generic semantic tree-structured data of arbitrary topology and heterogeneous data types, such as data expressed in JSON, XML and so on. Motivated by work on recursive and recurrent neural networks, we develop exemplar neural implementations of our framework for the JSON format. We evaluate our approach on several UCI benchmark datasets, including ablation and data-efficiency studies, and on a toy **reinforcement** **learning** task. Experimental results suggest that our framework yields performance comparable to the use of standard models with dedicated feature vectors in general, and even exceeds baseline performance in cases where the compositional nature of the data is particularly important. The source code for a JSON-based implementation of our framework along with experiments can be downloaded at https://github.com/EndingCredits/json2vec.

4/10 relevant

arXiv
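The core idea in this abstract, composing a fixed-size vector bottom-up from an arbitrary JSON tree, can be sketched with untrained random parameters. This is purely a structural illustration of recursive composition over heterogeneous JSON nodes; the paper's actual model learns its parameters end-to-end, and every embedding choice below is an assumption.

```python
import json
import numpy as np

DIM = 8

def _vec(seed_str):
    """Deterministic pseudo-embedding for a key/type name (illustrative only)."""
    rng = np.random.default_rng(abs(hash(seed_str)) % (2**32))
    return rng.normal(size=DIM)

def encode(node):
    """Recursively map an arbitrary JSON value to a fixed-size vector by
    composing child encodings bottom-up, loosely in the spirit of recursive
    neural networks (random untrained parameters, structure sketch only)."""
    if isinstance(node, dict):
        # Combine children, tagging each with an embedding of its key.
        return np.tanh(sum(_vec(k) + encode(v) for k, v in node.items()))
    if isinstance(node, list):
        if not node:
            return _vec("empty_list")
        return np.tanh(sum(encode(v) for v in node) / len(node))
    if isinstance(node, bool):          # check bool before int: True is an int
        return _vec(f"bool:{node}")
    if isinstance(node, (int, float)):
        return _vec("number") * np.tanh(float(node))
    return _vec(f"str:{node}")

doc = json.loads('{"name": "agent", "scores": [1, 2, 3], "active": true}')
v = encode(doc)
print(v.shape)  # (8,)
```

Any JSON document, regardless of topology, maps to the same fixed dimension, which is what lets a downstream model consume it directly.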

Provably Convergent Policy Gradient Methods for Model-Agnostic
Meta-**Reinforcement** **Learning**

… Model-Agnostic Meta-**Learning** (MAML) methods for **Reinforcement** **Learning** (RL) problems, where the goal is to find a policy (using data from several tasks represented by Markov Decision Processes (MDPs)) that can be updated by one step of stochastic policy gradient for the realized MDP. In particular, using stochastic gradients in the MAML update step is crucial for RL problems, since computation of exact gradients requires access to a large number of possible trajectories. For this formulation, we propose a variant of the MAML method, named Stochastic Gradient Meta-**Reinforcement** **Learning** (SG-MRL), and study its convergence properties. We derive the iteration and sample complexity of SG-MRL to find an $\epsilon$-first-order stationary point, which, to the best of our knowledge, provides the first convergence guarantee for model-agnostic meta-**reinforcement** **learning** algorithms. We further show how our results extend to the case where more than one step of the stochastic policy gradient method is used in the update at test time.

7/10 relevant

arXiv
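The one-step MAML structure with stochastic gradients that this abstract analyzes can be shown on a drastically simplified stand-in: quadratic per-task losses instead of MDPs, with noise injected into every gradient evaluation. This sketch shows only the update structure (inner adaptation step, then a meta-step through the chain rule); it is not SG-MRL itself, and all constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for SG-MRL's structure: each "task" is a quadratic loss
# f_c(theta) = (theta - c)^2, and the meta-objective is the post-adaptation
# loss after one inner SGD step, optimized with noisy gradients throughout.
centers = [-1.0, 0.0, 2.0]          # hypothetical task parameters
alpha, beta = 0.1, 0.05             # inner-adaptation and meta step sizes

def noisy_grad(theta, c):
    """Stochastic gradient of f_c at theta (noise mimics sampled trajectories)."""
    return 2.0 * (theta - c) + rng.normal(scale=0.1)

theta = 5.0
for _ in range(1000):
    meta_grad = 0.0
    for c in centers:
        adapted = theta - alpha * noisy_grad(theta, c)   # one inner SGD step
        # Chain rule through the inner step: d(adapted)/d(theta) = 1 - 2*alpha
        # (exact for quadratics; policy-gradient MAML estimates this term).
        meta_grad += (1.0 - 2.0 * alpha) * noisy_grad(adapted, c)
    theta -= beta * meta_grad / len(centers)

print(theta)  # settles near the mean of the task centers
```

For these symmetric quadratics the meta-optimum is the mean of the task centers, the initialization from which one adaptation step helps every task equally; in SG-MRL the same role is played by a policy that one stochastic policy-gradient step can specialize to the realized MDP.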