Playing CartPole with PPO (Proximal Policy Optimization)


 

This one is quite a bit harder. There are two policies: one is the policy being updated, the other supplies the training data. In practice they are the same network at different points in time: policy 1 collects a batch of data to train the new policy 2, policy 2 then collects data to train policy 3, and so on, until policy (N-1) collects data to train policy N. The data reuse feels a bit like DQN, but the model is an actor-critic architecture. By weighting each sample with the ratio between the new and old policies, the on-policy algorithm can reuse data collected by a slightly older policy (an off-policy-style correction), which lets each batch be used for several update epochs and makes training faster. A clipping strategy limits how far the policy can move in each update. PPO's objective is to maximize the policy's expected return while preventing any single policy update from being too large.
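
As a minimal sketch of the clipped surrogate objective described above (the tensor names match the ones used in the full script below; this standalone helper is not part of the original program):

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_param=0.2):
    # Importance-sampling ratio between the current policy and the policy that collected the data
    ratio = (new_log_probs - old_log_probs).exp()
    # Unclipped vs. clipped surrogate terms
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
    # Take the element-wise minimum (a pessimistic bound) and negate it, since optimizers minimize
    return -torch.min(surr1, surr2).mean()

Clamping the ratio to [1 - clip_param, 1 + clip_param] is what keeps the new policy from drifting too far from the data-collecting policy within one batch.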

 

import gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import pygame
import sys
from collections import deque

# Define the policy network
class PolicyNetwork(nn.Module):
    def __init__(self):
        super(PolicyNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 2),
            nn.Tanh(),
            nn.Linear(2, 2),  # CartPole has 2 discrete actions
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.fc(x)

# Define the value network
class ValueNetwork(nn.Module):
    def __init__(self):
        super(ValueNetwork, self).__init__()
        self.fc = nn.Sequential(
            nn.Linear(4, 2),
            nn.Tanh(),
            nn.Linear(2, 1)
        )

    def forward(self, x):
        return self.fc(x)

# Rollout (experience) buffer
class RolloutBuffer:
    def __init__(self):
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.log_probs = []

    def store(self, state, action, reward, done, log_prob):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)
        self.log_probs.append(log_prob)

    def clear(self):
        self.states = []
        self.actions = []
        self.rewards = []
        self.dones = []
        self.log_probs = []

    def get_batch(self):
        return (
            torch.tensor(self.states, dtype=torch.float),
            torch.tensor(self.actions, dtype=torch.long),
            torch.tensor(self.rewards, dtype=torch.float),
            torch.tensor(self.dones, dtype=torch.bool),
            torch.tensor(self.log_probs, dtype=torch.float)
        )

# PPO update function
def ppo_update(policy_net, value_net, optimizer_policy, optimizer_value, buffer, epochs=10, gamma=0.99, clip_param=0.2):
    states, actions, rewards, dones, old_log_probs = buffer.get_batch()
    returns = []
    advantages = []
    G = 0
    adv = 0
    dones = dones.to(torch.int)
    # print(dones)
    for reward, done, value in zip(reversed(rewards), reversed(dones), reversed(value_net(states))):
        if done:
            G = 0
            adv = 0
        G = reward + gamma * G  # Monte Carlo return, accumulated backwards
        delta = reward + gamma * value.item() * (1 - done) - value.item()  # TD error
        # adv = delta + gamma * 0.95 * adv * (1 - done)  # GAE-style alternative
        adv = delta + adv * (1 - done)
        returns.insert(0, G)
        advantages.insert(0, adv)
    returns = torch.tensor(returns, dtype=torch.float)  # value targets
    advantages = torch.tensor(advantages, dtype=torch.float)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # normalize advantages
    for _ in range(epochs):
        action_probs = policy_net(states)
        dist = torch.distributions.Categorical(action_probs)
        new_log_probs = dist.log_prob(actions)
        ratio = (new_log_probs - old_log_probs).exp()
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - clip_param, 1.0 + clip_param) * advantages
        actor_loss = -torch.min(surr1, surr2).mean()
        optimizer_policy.zero_grad()
        actor_loss.backward()
        optimizer_policy.step()
        value_loss = (returns - value_net(states)).pow(2).mean()
        optimizer_value.zero_grad()
        value_loss.backward()
        optimizer_value.step()

# Initialize environment and models
env = gym.make('CartPole-v1')
policy_net = PolicyNetwork()
value_net = ValueNetwork()
optimizer_policy = optim.Adam(policy_net.parameters(), lr=3e-4)
optimizer_value = optim.Adam(value_net.parameters(), lr=1e-3)
buffer = RolloutBuffer()

# Initialize pygame for rendering
pygame.init()
screen = pygame.display.set_mode((600, 400))
clock = pygame.time.Clock()
draw_on = False

# Training loop
state = env.reset()
for episode in range(10000):  # number of training episodes
    done = False
    state = state[0]
    step = 0
    while not done:
        step += 1
        state_tensor = torch.FloatTensor(state).unsqueeze(0)
        action_probs = policy_net(state_tensor)
        dist = torch.distributions.Categorical(action_probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        next_state, reward, done, _, _ = env.step(action.item())
        buffer.store(state, action.item(), reward, done, log_prob)
        state = next_state
        # Real-time display
        for event in pygame.event.get():
            if event.type == pygame.QUIT:
                pygame.quit()
                sys.exit()
        if draw_on:
            # Clear the screen and redraw
            screen.fill((0, 0, 0))
            cart_x = int(state[0] * 100 + 300)  # convert cart position to screen coordinates
            pygame.draw.rect(screen, (0, 128, 255), (cart_x, 300, 50, 30))
            pygame.draw.line(screen, (255, 0, 0), (cart_x + 25, 300),
                             (cart_x + 25 - int(50 * np.sin(state[2])), 300 - int(50 * np.cos(state[2]))), 5)
            pygame.display.flip()
            clock.tick(600)
        if step > 10000:
            draw_on = True  # only start rendering once the pole stays balanced for a long time
    ppo_update(policy_net, value_net, optimizer_policy, optimizer_value, buffer)
    buffer.clear()
    state = env.reset()
    print(f'Episode {episode} completed {step}.')

# Finish training
env.close()
pygame.quit()

 

Run results

 
