# PPO x Family: An Introductory Open Course on Decision Intelligence

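Welcome to PPO x Family, an introductory open course series on decision intelligence. The series takes an in-depth look at the deep reinforcement learning algorithm PPO and shows how a single PPO algorithm can be flexibly applied to nearly all common decision-intelligence applications. It aims to help anyone interested in deep reinforcement learning build application prototypes quickly and efficiently, and get to know the powerful, easy-to-use PPO Family.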

Note: If you are passing by, please give the project a star. It has been continuously updated since December 2022.

# News

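2023.06.07: PPO x Family Chapter 8 (Breaking the Ultimate Limits of Agents) and the course final project will be officially released in late October
2023.06.01: [Bilibili] PPO x Family Chapter 7 (Unearthing Advanced Tricks) officially released
2023.04.06: [Bilibili] PPO x Family Chapter 6 (Coordinating Multiple Agents) officially released
2023.03.09: [Bilibili] PPO x Family Chapter 5 (Exploring Temporal Modeling) officially released
2023.02.23: [Bilibili] PPO x Family Chapter 4 (Decrypting Sparse Reward Spaces) officially released
2023.01.16: [Bilibili] PPO x Family Chapter 3 (Representing Multimodal Observation Spaces) officially released
2022.12.23: [Bilibili] PPO x Family Chapter 2 (Deconstructing Complex Action Spaces) officially released
2022.12.23: PPO x Family "algorithm-code" annotated documentation website launched (link)
2022.12.08: [Bilibili] PPO x Family Chapter 1 (Starting the Journey of Decision AI Exploration) officially released
2022.12.06: [Bilibili] PPO x Family Chapter 1 mini-lecture video: a 4-minute quick start on the master key of reinforcement learning
2022.12.05: [PaperWeekly] A PPO × Family course to hold up the entire decision AI universe
2022.12.01: [Bilibili] PPO x Family course promotional video
2022.11.30: [机器之心 (Synced)] Focus on one point, evolve without limit: the PPO × Family introductory open course on decision intelligence starts now
2022.11.30: [China Computer Federation (CCF)] [CCF Science Popularization Stars Program] The introductory open course on decision intelligence has begun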

# Course Outline

<div align="center"> <a href="https://github.com/opendilab/PPOxFamily"><img width="1000px" height="auto" src="https://yellow-cdn.veclightyear.com/0a4dffa0/b20a1424-d46e-4a37-89db-b445a0384935.png"></a> </div>

# Content Navigation

| Chapter (video lecture) | Algorithm theory materials | Supplementary materials | Exercises | Code examples | Application examples |
|------|-----|----------|-------|-------|-------|
| [Chapter 1: Starting the Journey of Decision AI Exploration](https://www.bilibili.com/video/BV1cG4y137dJ) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_lecture.pdf) [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_manuscript.pdf) | [Mini-lecture video](https://www.bilibili.com/video/BV1e841157Um) [Policy gradient](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_supp_pg.pdf) [A2C](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_supp_a2c.pdf) [TRPO](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_supp_trpo.pdf) [Notation table](https://github.com/opendilab/PPOxFamily/blob/main/common/notation.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/chapter1_hw_solution.pdf) | [PG example](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/pg_zh.py) [A2C example](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/a2c_zh.py) [PPO example](https://github.com/opendilab/PPOxFamily/blob/main/chapter1_overview/ppo_zh.py) | [Application montage](https://www.bilibili.com/video/BV1vW4y1M7cH/?spm_id_from=333.337.search-card.all.click) |
| [Chapter 2: Deconstructing Complex Action Spaces](https://www.bilibili.com/video/BV1wv4y167w2) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_lecture.pdf) [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_manuscript.pdf) | [Reparameterization](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_supp_reparameterization.pdf) [PPO vs. DDPG](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_supp_ppovsddpg.pdf) [HyAR](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_supp_hyar.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_hw_solution.pdf) | [Discrete action example](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/discrete_tutorial_zh.py) [Continuous action example](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/continuous_tutorial_zh.py) [Hybrid action example](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/hybrid_tutorial_zh.py) [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter2_action/chapter2_application_demo.py) | [Rocket recovery, etc.](https://github.com/opendilab/PPOxFamily/issues/4) |
| [Chapter 3: Representing Multimodal Observation Spaces](https://www.bilibili.com/video/BV1rK411r7Kg) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_lecture.pdf) [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_manuscript.pdf) | [Representation learning](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_supp_representation.pdf) [PPG](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_supp_ppg.pdf) [Invariance](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_supp_invariance.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_hw_solution.pdf) | [Encoding methods example](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/encoding.py) [Wrapper example](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/mario_wrapper.py) [Computational graph example](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/gradient.py) [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter3_obs/chapter3_application_demo.py) | [Soft robots, etc.](https://github.com/opendilab/PPOxFamily/issues/8) |
| [Chapter 4: Decrypting Sparse Reward Spaces](https://www.bilibili.com/video/BV15j411F7ni) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_lecture.pdf) [Lecture manuscript](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_manuscript.pdf) | [Inverse reinforcement learning](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_supp_irl.pdf) [Behavior cloning (BC)](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_supp_bc.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_hw_solution.pdf) | [ICM curiosity reward](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/curiosity_icm.py) [RND curiosity reward](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/curiosity_rnd.py) [Pop-Art example](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/popart.py) [Value rescaling](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/value_rescale.py) [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter4_reward/chapter4_application_demo.py) | [Autonomous driving, etc.](https://github.com/opendilab/PPOxFamily/issues/44) |
| [Chapter 5: Exploring Temporal Modeling](https://www.bilibili.com/video/BV1Uj411u7GA) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_lecture.pdf) | [Stochastic policies](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_supp_sto_det.pdf) [RWKV](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_supp_rwkv.pdf) [Belief MDP](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_supp_belief.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_hw_solution.pdf) | [LSTM example](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/lstm.py) [GTrXL example](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/gtrxl.py) [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter5_time/chapter5_application_demo.py) | [Memory-based decision making](https://github.com/opendilab/PPOxFamily/issues/48) |
| [Chapter 6: Coordinating Multiple Agents](https://www.bilibili.com/video/BV1dg4y1g7BC) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_lecture.pdf) | [HAPPO](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/chapter6_supp_happo.pdf) [ACE](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_supp_ace.pdf) [Value decomposition](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/chapter6_supp_value_dec.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_hw_solution.pdf) | [Independent policy gradient](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/independentpg.py) [Multi-agent policy gradient](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/mapg.py) [Multi-agent PPO](https://github.com/opendilab/PPOxFamily/tree/main/chapter6_marl/mappo.py) [HAPPO] [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter6_marl/chapter6_application_demo.py) | [Multi-agent cooperation](https://github.com/opendilab/PPOxFamily/issues/62) |
| [Chapter 7: Unearthing Advanced Tricks](https://www.bilibili.com/video/BV1ou4y1o7qY) | [Lecture slides](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_lecture.pdf) | [Advantage function estimation](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_supp_adv.pdf) [Off-policy PPO](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_supp_ppo_offpolicy.pdf) [Entropy](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_supp_entropy.pdf) [Q&A summary](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_qa.pdf) | [Exercises](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_homework.pdf) [Exercise solutions](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_hw_solution.pdf) | [Generalized advantage estimation (GAE)](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/gae.py) [Recompute](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/recompute.py) [Gradient clipping](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/grad_clip_norm.py) [Orthogonal initialization](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/orthogonal_init.py) [Dual clip](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/dual_clip.py) [Value clip](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/value_clip.py) [Application training code](https://github.com/opendilab/PPOxFamily/blob/main/chapter7_tricks/chapter7_application_demo.py) | [Academic benchmark environments](https://github.com/opendilab/PPOxFamily/issues/79) |
| Chapter 8: Breaking the Ultimate Limits | | Reinforcement learning from human feedback (RLHF) for large language models | | [Language model RL environment](https://github.com/opendilab/PPOxFamily/blob/main/chapter8_large/lm_env.py) | |

# Course Features

One algorithm for a wide range of applications (video link)

One-to-one correspondence between algorithm theory and code implementation (website link)

# Project Structure

.
├── LICENSE
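├── assets                       --> Related image assets (please credit the source when reposting)
├── chapter2_action              --> Chapter 2 course materials
└── chapter1_overview            --> Chapter 1 course materials
    ├── chapter1_manuscript.pdf  --> Chapter 1 lecture manuscript (supplementary notes to the slides)
    ├── chapter1_lecture.pdf     --> Chapter 1 lecture slides
    ├── chapter1_qa.pdf          --> Chapter 1 Q&A notes
    ├── chapter1_homework.pdf    --> Chapter 1 exercises
    ├── chapter1_hw_solution.pdf --> Chapter 1 exercise solutions
    ├── chapter1_supp_trpo.pdf   --> Chapter 1 supplementary material (algorithm theory derivations, etc.)
    └── chapter1_demo_code.py    --> Chapter 1 code implementations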