
Reinforcement Learning Algorithms: Actor-Critic Methods

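1. Technical Analysis

1.1 Actor-Critic Overview

Actor-Critic methods combine a learned policy with a learned value function:

Actor-Critic architecture:

  • Actor: the policy network; outputs action probabilities
  • Critic: the value network; estimates state values

Advantages over Monte Carlo policy-gradient methods such as REINFORCE:

  • Lower variance
  • Higher sample efficiency
  • Online learning (updates after every step rather than at the end of an episode)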

1.2 Actor-Critic Components

Component   Role               Update objective
Actor       Selects actions    Maximize the expected advantage
Critic      Evaluates value    Minimize the TD error
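
The two update objectives in the table can be written out for the one-step case. The notation here is an assumption chosen for illustration and does not come from the original text: $\pi_\theta$ is the actor, $V_\phi$ is the critic, $\gamma$ is the discount factor, and $\delta_t$ is the one-step TD error used as the advantage estimate.

```latex
\begin{aligned}
\delta_t &= r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \\
L_{\text{actor}}(\theta) &= -\log \pi_\theta(a_t \mid s_t)\,\delta_t \\
L_{\text{critic}}(\phi) &= \delta_t^{2}
\end{aligned}
```

In the actor loss, $\delta_t$ is treated as a constant weight, so no gradient flows through the critic. For a terminal transition the bootstrap term $\gamma V_\phi(s_{t+1})$ is dropped.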

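1.3 Actor-Critic Variants

Actor-Critic variants:

  • A2C: Advantage Actor-Critic (synchronous)
  • A3C: Asynchronous Advantage Actor-Critic
  • DDPG: Deep Deterministic Policy Gradient
  • TD3: Twin Delayed Deep Deterministic Policy Gradient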

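2. Core Implementation

2.1 The Actor-Critic Algorithm

import numpy as np


class ActorCritic:
    """One-step (TD(0)) Actor-Critic.

    Assumed interfaces: actor(state) returns a 1-D array of action
    probabilities, critic(state) returns a scalar value, and each
    optimizer exposes step(loss) that updates its own network.
    """

    def __init__(self, actor, critic, actor_optimizer, critic_optimizer, gamma=0.99):
        self.actor = actor
        self.critic = critic
        self.actor_optimizer = actor_optimizer
        self.critic_optimizer = critic_optimizer
        self.gamma = gamma

    def compute_advantage(self, state, reward, next_state, done):
        value = self.critic(state)
        # Do not bootstrap from the next state once the episode has ended.
        if done:
            target = reward
        else:
            target = reward + self.gamma * self.critic(next_state)
        # The one-step TD error serves as the advantage estimate.
        advantage = target - value
        return advantage, target

    def train_step(self, state, action, reward, next_state, done):
        advantage, target = self.compute_advantage(state, reward, next_state, done)
        # Policy-gradient loss; the advantage acts as a constant weight.
        # The small epsilon guards against log(0).
        actor_loss = -np.log(self.actor(state)[action] + 1e-8) * advantage
        # Squared TD error; the target is treated as fixed.
        critic_loss = (target - self.critic(state)) ** 2
        self.actor_optimizer.step(actor_loss)
        self.critic_optimizer.step(critic_loss)
        return actor_loss, critic_loss

    def train(self, env, episodes=1000):
        for episode in range(episodes):
            state = env.reset()
            done = False
            while not done:
                action_probs = self.actor(state)
                action = np.random.choice(len(action_probs), p=action_probs)
                # Assumes a simplified env whose step() returns
                # (next_state, reward, done).
                next_state, reward, done = env.step(action)
                self.train_step(state, action, reward, next_state, done)
                state = next_state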
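
To make the advantage computation concrete, here is a minimal standalone sketch of the same one-step TD rule applied to plain numbers. The function name `td_advantage` and the sample values are hypothetical, chosen only for illustration.

```python
def td_advantage(value, next_value, reward, gamma=0.99, done=False):
    """Return (advantage, target) for one transition using the one-step TD rule."""
    # A terminal transition has no future value to bootstrap from.
    target = reward if done else reward + gamma * next_value
    return target - value, target


# Non-terminal step: the target bootstraps from the next state's value.
adv, target = td_advantage(value=1.0, next_value=2.0, reward=0.5, gamma=0.9)
print("non-terminal:", adv, target)

# Terminal step: the target is just the reward.
adv_end, target_end = td_advantage(value=1.0, next_value=2.0, reward=0.5,
                                   gamma=0.9, done=True)
print("terminal:", adv_end, target_end)
```

A positive advantage means the action turned out better than the critic expected, so the actor update raises its probability. A negative advantage lowers it.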

2.2 The A2C Algorithm

class A2C(ActorCritic):
    """Synchronous advantage Actor-Critic: collect a batch, then update once."""

    def __init__(self, actor, critic, actor_optimizer, critic_optimizer,
                 gamma=0.99, num_workers=4):
        super().__init__(actor, critic, actor_optimizer, critic_optimizer, gamma)
        self.num_workers = num_workers

    def train(self, env, episodes=1000):
        for episode in range(episodes):
            states, actions, rewards, next_states, dones = [], [], [], [], []

            # Each "worker" is simulated by one sequential rollout in the same
            # env. A full A2C setup would step several env copies in parallel.
            for _ in range(self.num_workers):
                state = env.reset()
                done = False
                while not done:
                    action_probs = self.actor(state)
                    action = np.random.choice(len(action_probs), p=action_probs)
                    # Assumes env.step() returns (next_state, reward, done).
                    next_state, reward, done = env.step(action)
                    states.append(state)
                    actions.append(action)
                    rewards.append(reward)
                    next_states.append(next_state)
                    dones.append(done)
                    state = next_state

            # Accumulate the losses over every collected transition.
            total_actor_loss = 0.0
            total_critic_loss = 0.0
            for i in range(len(states)):
                advantage, target = self.compute_advantage(
                    states[i], rewards[i], next_states[i], dones[i]
                )
                prob = self.actor(states[i])[actions[i]]
                total_actor_loss += -np.log(prob + 1e-8) * advantage
                total_critic_loss += (target - self.critic(states[i])) ** 2

            # One synchronized update from the batch-averaged losses.
            self.actor_optimizer.step(total_actor_loss / len(states))
            self.critic_optimizer.step(total_critic_loss / len(states))
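
The per-transition loop over the batch can also be expressed as a single vectorized computation. The sketch below assumes the critic values for the batch have already been evaluated into arrays. The helper name `batch_advantages` and the sample numbers are hypothetical.

```python
import numpy as np


def batch_advantages(rewards, values, next_values, dones, gamma=0.99):
    """Vectorized one-step TD targets and advantages for a batch of transitions."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    # 1.0 where the episode continues, 0.0 where it ended, so terminal
    # transitions do not bootstrap from the next state.
    not_done = 1.0 - np.asarray(dones, dtype=float)
    targets = rewards + gamma * not_done * next_values
    return targets - values, targets


advantages, targets = batch_advantages(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    next_values=[2.0, 2.0],
    dones=[False, True],
    gamma=0.5,
)
print(advantages, targets)
```

The `dones` mask plays the same role as the `if done` branch in `compute_advantage`.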

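2.3 Actor and Critic Networks

class ActorNetwork:
    """Two-layer policy network: state -> softmax action probabilities.

    Forward pass only; parameter updates are left to the optimizer.
    """

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, action_dim) * 0.01
        self.b2 = np.zeros(action_dim)

    def forward(self, state):
        h = np.maximum(0, state @ self.W1 + self.b1)  # ReLU hidden layer
        logits = h @ self.W2 + self.b2
        # Subtract the max logit for numerical stability.
        exp_logits = np.exp(logits - np.max(logits))
        return exp_logits / np.sum(exp_logits)

    # Allow actor(state), which is how ActorCritic calls the network.
    __call__ = forward


class CriticNetwork:
    """Two-layer value network: state -> scalar state value."""

    def __init__(self, state_dim, hidden_dim=64):
        self.W1 = np.random.randn(state_dim, hidden_dim) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 1) * 0.01
        self.b2 = np.zeros(1)

    def forward(self, state):
        h = np.maximum(0, state @ self.W1 + self.b1)  # ReLU hidden layer
        value = h @ self.W2 + self.b2
        return value[0]  # Unwrap the single output into a scalar.

    # Allow critic(state), which is how ActorCritic calls the network.
    __call__ = forward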
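
The networks above define only a forward pass, so any actor update still needs the gradient of the log-probability. For a softmax output, the gradient with respect to the logits has a simple closed form: the one-hot vector of the chosen action minus the probabilities. The sketch below checks that identity against finite differences. The function names and the sample logits are hypothetical.

```python
import numpy as np


def softmax(logits):
    z = np.exp(logits - np.max(logits))
    return z / np.sum(z)


def grad_log_prob_wrt_logits(logits, action):
    """Analytic gradient of log softmax(logits)[action] w.r.t. the logits."""
    probs = softmax(logits)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    return one_hot - probs


def numerical_grad(logits, action, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grad = np.zeros_like(logits)
    for i in range(len(logits)):
        bump = np.zeros_like(logits)
        bump[i] = eps
        hi = np.log(softmax(logits + bump)[action])
        lo = np.log(softmax(logits - bump)[action])
        grad[i] = (hi - lo) / (2 * eps)
    return grad


logits = np.array([0.2, -1.0, 0.5])
analytic = grad_log_prob_wrt_logits(logits, action=2)
numeric = numerical_grad(logits, action=2)
print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

Multiplying this gradient by the advantage and backpropagating it through `W2` and `W1` would yield the actor's parameter gradients. That backward pass is not part of the original code.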

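3. Performance Comparison

3.1 Comparison of Actor-Critic Variants

Method   Parallelism    Stability   Sample efficiency
A2C      Synchronous
A3C      Asynchronous
DDPG     Synchronous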

3.2 Actor-Critic vs. Policy Gradient

Metric              REINFORCE   Actor-Critic
Variance            Higher      Lower
Bias                Unbiased    Biased
Sample efficiency   Lower       Higher
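
The variance row can be illustrated with a tiny example that needs no sampling. The sketch below enumerates a hypothetical two-action, one-step problem and computes the exact mean and variance of the policy-gradient estimate for the first logit, with and without subtracting a baseline. The probabilities, rewards, and baseline value are made-up numbers.

```python
import numpy as np


def grad_estimate_stats(probs, rewards, baseline):
    """Exact mean and variance of the score-function gradient estimate
    for the first logit of a softmax policy in a one-step problem."""
    probs = np.asarray(probs, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    # d/d(logit_0) log pi(a) = 1[a == 0] - pi(0)
    score = (np.arange(len(probs)) == 0).astype(float) - probs[0]
    samples = score * (rewards - baseline)
    mean = np.sum(probs * samples)
    var = np.sum(probs * (samples - mean) ** 2)
    return mean, var


probs = [0.5, 0.5]
rewards = [10.0, 11.0]
mean_raw, var_raw = grad_estimate_stats(probs, rewards, baseline=0.0)
mean_base, var_base = grad_estimate_stats(probs, rewards, baseline=10.5)
print("no baseline:  ", mean_raw, var_raw)
print("with baseline:", mean_base, var_base)
```

Both estimators have the same mean (-0.25), so subtracting the baseline adds no bias, while the variance drops from 27.5625 to 0 in this example. The bias listed for Actor-Critic comes from a different source: bootstrapping the target from an imperfect critic.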

3.3 Effect of Network Size

Hidden layer size   Performance   Training time   Overfitting risk
32
64
128
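
One concrete way to compare the three hidden sizes is to count trainable parameters, since per-step compute grows with them. The sketch below does this for a two-layer network shaped like the ActorNetwork in section 2.3. The dimensions `state_dim=4` and `action_dim=2` are hypothetical and chosen only for illustration.

```python
def mlp_param_count(state_dim, hidden_dim, out_dim):
    """Parameters in a two-layer MLP: two weight matrices plus two bias vectors."""
    return state_dim * hidden_dim + hidden_dim + hidden_dim * out_dim + out_dim


# Hypothetical problem size: 4 state features, 2 discrete actions.
counts = {h: mlp_param_count(4, h, 2) for h in (32, 64, 128)}
for hidden_dim, count in counts.items():
    print(hidden_dim, count)
```

With a single hidden layer, the parameter count grows roughly linearly with the hidden size.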

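4. Best Practices

4.1 Choosing an Actor-Critic Method

def choose_actor_critic_method(environment_type):
    # DDPG targets continuous action spaces; A2C is the default otherwise.
    if environment_type == 'continuous':
        return 'DDPG'
    return 'A2C'


class ActorCriticSelector:
    # Only A2C is implemented in this article. A3C and DDPG classes must be
    # defined and added to this registry before they can be selected.
    METHODS = {'a2c': A2C}

    @staticmethod
    def select(config):
        method = config['method']
        if method not in ActorCriticSelector.METHODS:
            raise ValueError(f"Unsupported method: {method}")
        return ActorCriticSelector.METHODS[method](**config.get('params', {}))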

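4.2 Training Tips

class ActorCriticTrainingTips:
    @staticmethod
    def separate_learning_rates(actor_lr=0.001, critic_lr=0.005):
        # The critic gets a larger learning rate than the actor.
        return {'actor_lr': actor_lr, 'critic_lr': critic_lr}

    @staticmethod
    def target_networks():
        # Soft target updates with a small tau, as used by DDPG and TD3.
        return {'use_target': True, 'tau': 0.001}

    @staticmethod
    def gradient_clipping(max_norm=1.0):
        # Cap the gradient norm to avoid destabilizing updates.
        return {'max_norm': max_norm}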
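
The settings above only return configuration values. As a rough sketch of what `tau` and `max_norm` control, the two helpers below implement a soft (Polyak) target update and clipping by global gradient norm over lists of NumPy arrays. The function names and the list-of-arrays parameter layout are assumptions and do not appear in the original code.

```python
import numpy as np


def soft_update(target_params, source_params, tau=0.001):
    """Polyak averaging: target <- tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t
            for t, s in zip(target_params, source_params)]


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so that their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # A scale of 1.0 leaves gradients that are already small enough untouched.
    scale = min(1.0, max_norm / (total_norm + 1e-8))
    return [g * scale for g in grads], total_norm


# The target moves 10% of the way toward the source parameters.
new_target = soft_update([np.zeros(2)], [np.ones(2)], tau=0.1)
print(new_target[0])

# A gradient of norm 5 is scaled down to norm 1.
clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
print(norm, np.linalg.norm(clipped[0]))
```

A small `tau` such as 0.001 makes the target network track the online network slowly, which keeps the critic's bootstrap targets from shifting abruptly.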

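5. Summary

Actor-Critic is a mainstream approach in reinforcement learning:

  1. Actor: the policy network, which selects actions
  2. Critic: the value network, which evaluates them
  3. A2C: synchronous training; stable and reliable
  4. DDPG: the standard method for continuous action spaces

Key takeaways from the comparison:

  • A2C is more sample-efficient than REINFORCE
  • DDPG is a common first choice for continuous control
  • Separate learning rates for the actor and the critic are recommended
  • Target networks can improve stability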