Improving Data-efficiency in Deep Reinforcement Learning using Human Feedback, Hierarchy and Language

Author/Creator

Author/Creator ORCID

Department

Computer Science and Electrical Engineering

Program

Computer Science

Citation of Original Publication

Rights

This item may be protected under Title 17 of the U.S. Copyright Law. It is made available by UMBC for non-commercial research and education. For permission to publish or reproduce, please see http://aok.lib.umbc.edu/specoll/repro.php or contact Special Collections at speccoll(at)umbc.edu
Distribution Rights granted to UMBC by the author.
Access limited to the UMBC community. Item may possibly be obtained via Interlibrary Loan through a local library, pending author/copyright holder's permission.

Abstract

Developing intelligent agents that can make optimal sequential decisions over long time horizons remains a significant challenge in machine learning. Reinforcement learning (RL) provides a promising framework for training decision-making agents through experience, but it often suffers from sample inefficiency and safety issues and fails to scale to complex, temporally-extended tasks. These limitations are compounded by the common practice of learning without prior knowledge (tabula rasa learning). In this thesis, we explore sequential decision-making agents that mitigate these limitations by leveraging the hierarchical structure of complex tasks and by incorporating prior domain knowledge through language and sparse human feedback signals. First, we introduce a hierarchical agent framework that uses semantic predicates to represent goals and states. This enables the agent to learn various skills efficiently without the need to manually write complex reward functions. Moreover, it allows us to integrate symbolic planners as high-level controllers to solve long-horizon tasks. Second, we introduce a hierarchical agent that uses large language models (LLMs) to solve long-horizon tasks. We leverage the planning and common-sense reasoning capabilities of LLMs to guide exploration and improve sample efficiency. Our approach has the added benefit of not requiring access to these large models during deployment. Empirical evaluations demonstrate the benefits of the proposed methods on a suite of temporally-extended decision-making tasks.