DOI: 10.5555/3600270.3601556

Policy optimization with linear temporal logic constraints

Published: 03 April 2024

Abstract

We study the problem of policy optimization (PO) with linear temporal logic (LTL) constraints. The language of LTL allows flexible specification of tasks that may be unnatural to encode as a scalar cost function. We consider LTL-constrained PO as a systematic framework that decouples task specification from policy selection, and as an alternative to the standard practice of cost shaping. With access to a generative model, we develop a model-based approach that admits a sample complexity guarantee for both task satisfaction and cost optimality, obtained through a reduction to a reachability problem. Empirically, our algorithm achieves strong performance even in low-sample regimes.
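To make the reachability reduction concrete, the sketch below is illustrative only: the toy MDP, its labeling, and the hand-built automaton are hypothetical, and the paper's actual algorithm additionally handles general LTL via limit-deterministic Büchi automata, optimizes cost among satisfying policies, and estimates the transition model from generative-model samples. The sketch tracks the simple co-safe constraint "(not hazard) U goal" ("avoid the hazard until the goal is reached") with a three-state deterministic automaton, forms the product with a known MDP, and computes the maximum probability of reaching an accepting product state by value iteration.

    # A minimal sketch (illustrative assumptions throughout, not the
    # paper's algorithm): reduce the co-safe constraint
    # "(not hazard) U goal" to reachability on a product MDP, then
    # compute the maximum satisfaction probability by value iteration.

    import numpy as np

    # Toy known MDP: 4 states, 2 actions; P[a][s, s'] is the probability
    # of moving from s to s' under action a. (Hypothetical dynamics.)
    S, A = 4, 2
    P = np.zeros((A, S, S))
    P[0] = [[0.9, 0.1, 0.0, 0.0],
            [0.0, 0.8, 0.2, 0.0],
            [0.0, 0.2, 0.5, 0.3],
            [0.0, 0.0, 0.0, 1.0]]
    P[1] = [[0.5, 0.0, 0.5, 0.0],
            [0.3, 0.0, 0.0, 0.7],
            [0.0, 0.4, 0.0, 0.6],
            [0.0, 0.0, 0.0, 1.0]]

    # Atomic-proposition labeling: state 1 is the hazard, state 3 the goal.
    label = {0: set(), 1: {"hazard"}, 2: set(), 3: {"goal"}}

    # Hand-built 3-state deterministic automaton for "(not hazard) U goal":
    # q=0 tracking, q=1 accepting (goal reached first), q=2 rejecting sink.
    def delta(q, props):
        if q != 0:                 # accepting/rejecting states are absorbing
            return q
        if "goal" in props:
            return 1
        if "hazard" in props:
            return 2
        return 0

    # Value iteration on the product MDP over pairs (s, q): V[s, q] is the
    # maximum probability of eventually reaching an accepting product
    # state, i.e., of satisfying the constraint from (s, q).
    V = np.zeros((S, 3))
    V[:, 1] = 1.0                  # accepting states have value 1
    for _ in range(1000):
        V_new = V.copy()
        for s in range(S):
            best = 0.0
            for a in range(A):
                val = sum(P[a, s, s2] * V[s2, delta(0, label[s2])]
                          for s2 in range(S))
                best = max(best, val)
            V_new[s, 0] = best     # only q = 0 is non-absorbing here
        if np.max(np.abs(V_new - V)) < 1e-12:
            V = V_new
            break
        V = V_new

    q0 = delta(0, label[0])        # automaton state after reading s0's label
    print("max satisfaction probability from state 0:", V[0, q0])

Under these toy dynamics the optimal choice trades a slow, safe action against a faster action that risks entering the hazard state. A constrained PO method would then optimize expected cost only over the policies that keep this satisfaction probability above the required threshold.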

Supplementary Material

Supplemental material (3600270.3601556_supp.pdf)



Information

Published In

NIPS '22: Proceedings of the 36th International Conference on Neural Information Processing Systems
November 2022
39114 pages

Publisher

Curran Associates Inc.

Red Hook, NY, United States


Qualifiers

  • Research article
  • Refereed limited
