
Deep Q-Network (DQN) Reinforcement Learning: Principles and Practice

Author: wx62088446a1f70

Deep Q-learning combines deep neural networks with reinforcement learning, overcoming the limitations of traditional Q-learning in high-dimensional state spaces. With techniques such as experience replay and a target network, DQN can learn effective policies in complex environments. This article walks through a complete implementation in the CartPole environment to illustrate DQN's core ideas and implementation details.

DQN (Deep Q-Network) is an algorithm that combines deep learning and reinforcement learning. Proposed by DeepMind, it targets Markov decision process (MDP) problems with discrete action spaces and was one of the first algorithms to successfully apply deep learning to reinforcement learning tasks. In short, DQN is Q-learning with the Q-function represented by a deep neural network.

1. Reinforcement Learning Basics

Reinforcement learning (RL) is an important branch of machine learning. Its core idea is to learn an optimal policy by interacting with an environment. Unlike supervised learning, it does not require pre-prepared input-output pairs; instead, learning is guided by reward signals obtained through trial and error.

1.1 Core Concepts

• Agent: the entity that does the learning and acting
• Environment: what the agent interacts with
• State: the current situation of the environment
• Action: the behavior taken by the agent
• Reward: the environment's feedback on an action
• Policy: a mapping from states to actions
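
These concepts come together in the agent-environment interaction loop. Below is a minimal sketch of that loop using the same (older) Gym API as the rest of this article, with a random placeholder policy standing in for a learned one:

import gym

env = gym.make('CartPole-v1')
state = env.reset()                                 # initial state
done = False
while not done:
    action = env.action_space.sample()              # placeholder: random action
    next_state, reward, done, _ = env.step(action)  # environment feedback
    state = next_state                              # move to the next state
env.close()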

1.2 Markov Decision Process

A reinforcement learning problem is usually modeled as a Markov decision process (MDP), defined by the five-tuple (S, A, P, R, γ):
• S: the set of states
• A: the set of actions
• P: the state-transition probabilities
• R: the reward function
• γ: the discount factor (0 ≤ γ < 1)
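
To make the five-tuple concrete, here is a tiny hypothetical MDP written out in Python; every state, action, probability, and reward below is invented purely for illustration:

# A hand-made two-state MDP. For each (state, action) pair, the table lists
# (transition probability, next state, reward) triples; probabilities sum to 1.
states = ['s0', 's1']                 # S
actions = ['stay', 'move']            # A
gamma = 0.9                           # discount factor γ
transitions = {                       # P and R folded into one table
    ('s0', 'stay'): [(1.0, 's0', 0.0)],
    ('s0', 'move'): [(0.8, 's1', 1.0), (0.2, 's0', 0.0)],
    ('s1', 'stay'): [(1.0, 's1', 2.0)],
    ('s1', 'move'): [(1.0, 's0', 0.0)],
}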

2. Q-Learning and Deep Q-Networks

2.1 The Q-Learning Algorithm

Q-learning is a classic reinforcement learning algorithm. It maintains a table of Q-values that estimate the long-term return of taking a given action in a given state, updated with the rule Q(s, a) ← Q(s, a) + α[r + γ·max_a' Q(s', a') - Q(s, a)]:

import numpy as np

# Initialize the Q-table (state_space_size and action_space_size are
# assumed to be known for the environment at hand)
q_table = np.zeros((state_space_size, action_space_size))

# Q-learning update parameters
alpha = 0.1   # learning rate
gamma = 0.99  # discount factor

for episode in range(total_episodes):
    state = env.reset()
    done = False
    
    while not done:
        action = select_action(state)  # ε-greedy policy (see the sketch below)
        next_state, reward, done, _ = env.step(action)
        
        # Q-value update
        q_table[state, action] = q_table[state, action] + alpha * (
            reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
        )
        state = next_state
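
The select_action helper above is referenced but not defined in the original. A minimal ε-greedy sketch, assuming a fixed exploration rate epsilon and the q_table and action_space_size from above (it would need to be defined before the training loop runs):

import random

epsilon = 0.1  # assumed exploration rate (illustrative value)

def select_action(state):
    # Explore with probability epsilon, otherwise act greedily w.r.t. the Q-table
    if random.random() < epsilon:
        return random.randint(0, action_space_size - 1)
    return int(np.argmax(q_table[state]))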

2.2 Deep Q-Network (DQN)

When the state space is large, a Q-table becomes impractical. DQN instead uses a neural network to approximate the Q-function:

import torch
import torch.nn as nn
import torch.optim as optim

class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, output_dim)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)
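
As a quick, hypothetical smoke test (not in the original article), the network maps a state vector to one Q-value per action, and the greedy action is simply the argmax of that output. The dimensions match CartPole, which is used later:

net = DQN(input_dim=4, output_dim=2)  # 4-dimensional state, 2 discrete actions
state = torch.randn(1, 4)             # a batch containing one random state
q_values = net(state)                 # shape (1, 2): one Q-value per action
greedy_action = q_values.argmax(dim=1).item()
print(q_values, greedy_action)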

3. DQN Improvement Techniques

3.1 Experience Replay

Addresses the problems of correlated samples and non-stationary data distributions:

from collections import deque
import random

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
    
    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))
    
    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
    
    def __len__(self):
        return len(self.buffer)

3.2 Target Network

Stabilizes training: TD targets are computed with a separate, periodically refreshed copy of the network rather than the constantly changing policy network:

# The target network starts as a copy of the policy network
# (policy_net and device are assumed, as defined in Section 4)
target_net = DQN(input_dim, output_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()

# Periodically sync the target network with the policy network
if steps_done % TARGET_UPDATE == 0:
    target_net.load_state_dict(policy_net.state_dict())
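
Besides the periodic hard copy above, a soft (Polyak) update is a commonly used variant. The sketch below is not part of the original article; TAU is an assumed interpolation coefficient:

TAU = 0.005  # assumed soft-update coefficient (illustrative value)
# Blend each target parameter a small step toward the corresponding policy parameter
for target_param, policy_param in zip(target_net.parameters(), policy_net.parameters()):
    target_param.data.copy_(TAU * policy_param.data + (1.0 - TAU) * target_param.data)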

4. Complete DQN Implementation (CartPole Environment)

import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import random
from collections import deque
import matplotlib.pyplot as plt

# Hyperparameters
BATCH_SIZE = 128
GAMMA = 0.99
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200
TARGET_UPDATE = 10
LR = 0.001

# Run on GPU if available, otherwise on CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Initialize the environment (older Gym API: reset() returns the state,
# step() returns four values)
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

# Neural network definition (same architecture as Section 2.2, with 64-unit hidden layers)
class DQN(nn.Module):
    def __init__(self, input_dim, output_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(input_dim, 64)
        self.fc2 = nn.Linear(64, 64)
        self.fc3 = nn.Linear(64, output_dim)
        
    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return self.fc3(x)

# Initialize the networks, optimizer, and replay buffer
# (ReplayBuffer is the class defined in Section 3.1)
policy_net = DQN(state_dim, action_dim).to(device)
target_net = DQN(state_dim, action_dim).to(device)
target_net.load_state_dict(policy_net.state_dict())
target_net.eval()
optimizer = optim.Adam(policy_net.parameters(), lr=LR)
memory = ReplayBuffer(10000)

# Training step: sample a minibatch and perform one gradient update
def train():
    if len(memory) < BATCH_SIZE:
        return
    transitions = memory.sample(BATCH_SIZE)
    batch = list(zip(*transitions))
    
    state_batch = torch.FloatTensor(np.array(batch[0])).to(device)
    action_batch = torch.LongTensor(np.array(batch[1])).to(device)
    reward_batch = torch.FloatTensor(np.array(batch[2])).to(device)
    next_state_batch = torch.FloatTensor(np.array(batch[3])).to(device)
    done_batch = torch.FloatTensor(np.array(batch[4])).to(device)
    
    # Q(s, a) for the actions actually taken
    current_q = policy_net(state_batch).gather(1, action_batch.unsqueeze(1))
    # max_a' Q_target(s', a'), detached so no gradients flow into the target network
    next_q = target_net(next_state_batch).max(1)[0].detach()
    # TD target; (1 - done) zeroes out the bootstrap term for terminal states
    expected_q = reward_batch + (1 - done_batch) * GAMMA * next_q
    
    loss = nn.MSELoss()(current_q.squeeze(), expected_q)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Main training loop
episode_rewards = []
for episode in range(500):
    state = env.reset()
    total_reward = 0
    done = False
    
    while not done:
        # ε-greedy action selection with an exponentially decaying threshold
        eps_threshold = EPS_END + (EPS_START - EPS_END) * \
            np.exp(-1. * episode / EPS_DECAY)
        if random.random() > eps_threshold:
            with torch.no_grad():
                action = policy_net(
                    torch.FloatTensor(state).to(device)).argmax().item()
        else:
            action = random.randint(0, action_dim - 1)
        
        next_state, reward, done, _ = env.step(action)
        memory.push(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        
        train()
    
    # Periodically sync the target network with the policy network
    if episode % TARGET_UPDATE == 0:
        target_net.load_state_dict(policy_net.state_dict())
    
    episode_rewards.append(total_reward)
    if episode % 10 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")

# Plot the training curve
plt.plot(episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('DQN Training Progress')
plt.show()
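
After training, the learned policy can be sanity-checked by acting greedily with no exploration. This short evaluation loop is not in the original article but uses only names defined above:

# Greedy evaluation of the trained policy (no ε-exploration)
state = env.reset()
done = False
eval_reward = 0
while not done:
    with torch.no_grad():
        action = policy_net(torch.FloatTensor(state).to(device)).argmax().item()
    state, reward, done, _ = env.step(action)
    eval_reward += reward
print(f"Greedy evaluation reward: {eval_reward}")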

5. Limitations and Extensions of DQN

  1. Overestimation bias: Double DQN addresses this by decoupling action selection from Q-value evaluation (see the sketch after this list)
  2. Prioritized experience replay: gives important transitions a higher sampling probability
  3. Dueling architecture: Dueling DQN separates the state-value function from the advantage function
  4. Distributional reinforcement learning: learns the full return distribution rather than only its expectation
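
To illustrate the first point, Double DQN changes only how the TD target in the train() function of Section 4 is formed: the policy network selects the next action and the target network evaluates it. A hedged sketch, reusing the batch tensors defined there:

# Double DQN target: action selection with policy_net, evaluation with target_net
next_actions = policy_net(next_state_batch).argmax(1, keepdim=True)
next_q = target_net(next_state_batch).gather(1, next_actions).squeeze(1).detach()
expected_q = reward_batch + (1 - done_batch) * GAMMA * next_q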

6. Summary

Deep Q-learning combines deep neural networks with reinforcement learning and overcomes the limitations of traditional Q-learning in high-dimensional state spaces. With techniques such as experience replay and a target network, DQN can learn effective policies in complex environments. This article demonstrated DQN's core ideas and implementation details through a complete CartPole implementation. Going forward, combined with the improvements above and more powerful network architectures, deep reinforcement learning will play an even larger role in areas such as robot control and game AI.

