蒙特卡洛方法（MC Exploring Starts算法例子）

本文章中使用的算法和例子来源于bilibili中西湖大学赵世钰老师的【强化学习的数学原理】课程。网址：第5课-蒙特卡洛方法（MC Exploring Starts算法）_哔哩哔哩_bilibili

一、算法简介

二、相关定义

1、策略评估

2、visit定义

3、episode定义

三、算法流程

四、代码演示

一、算法简介

"Monte Carlo Exploring Starts"（MC Exploring Starts）是一种强化学习算法，用于估计策略下的状态-动作值函数 Q(s,a)。该方法适用于有限状态和动作空间的问题，主要用于确定性策略的优化。它属于蒙特卡洛（Monte Carlo）方法的一类，使用从起始状态到终止状态的完整回合（episode）进行学习。

二、相关定义

1、策略评估

①每次访问（Every-Visit）蒙特卡洛策略评估：每次一个（状态，动作）对出现在一个 episode 中时，都会对其进行一次回报更新。

②首次访问（First-Visit）蒙特卡洛策略评估：在一个 episode 中，只对某个（状态，动作）对的首次访问进行回报更新。即在一个 episode 中，如果某个（状态，动作）对已经更新过，就不再更新。

Every-Visit 和 First-Visit 的对比

策略评估方法	是否更新重复访问的状态	数据量	收敛速度	适用场景
每次访问	是	多	较快	适用于频繁访问的状态，数据充分时
首次访问	否	少	较慢	适用于减少计算量和数据冗余的情况

2、visit定义

对于从棋盘格中的一个位置出发，每一个状态-动作组成的一个元组称为一个visit，如(s1,a2)，(s2,a4)，(s1,a2)……，如此定义。

3、episode定义

从一个位置出发，所有的visit组成一个episode，至于步数可以自己自行设置。

三、算法流程

四、代码演示

import numpy as np
import random
from collections import defaultdictclass BehrmanMC_EGreedy:# Attributesn, m = None, None  # Chessboard dimensionsxx, yy = [-1, 0, 1, 0, 0], [0, 1, 0, -1, 0]  # 动作（上、右、下、左、停留）s = "↑→↓←O"  # 动作符号_lambda = 0.9  # 折扣因子# 初始化棋盘格和策略chessboard = [["***"],["**#"],["#*x"]]policy = [[1, 3, 2],[2, 3, 3],[2, 2, 0]]states = NonePunishment = Nonedef __init__(self):self.n, self.m = len(self.chessboard), len(self.chessboard[0][0])self.init_matrix()self.returns = defaultdict(list)  # 用于记录每个 (状态, 动作) 对的累积回报self.Q = defaultdict(lambda: [0] * 5)  # 用来存储每个 (状态, 动作) 对的 Q 值def init_matrix(self):"""Initialize chessboard matrix, rewards, states, and policy."""self.chessboard = [list(row[0]) for row in self.chessboard]self.states = [[0] * self.m for _ in range(self.n)]self.Punishment = [[0 if cell == "*" else (-10 if cell == "#" else 1)for cell in row] for row in self.chessboard]def next_point(self, x, y, i):"""Calculate the next position coordinates, returns current position if out of bounds."""wall = 0nx, ny = x + self.xx[i], y + self.yy[i]if 0 <= nx < self.n and 0 <= ny < self.m:return nx, ny, wallelse:wall = 1return x, y, walldef action(self, x, y, i):"""Return reward for the given action."""next_pos = self.next_point(x, y, i)if next_pos[2] == 1:return -1  # 撞墙else:return self.Punishment[next_pos[0]][next_pos[1]]def generate_episode(self,x,y,action):"""Generate an episode using Exploring Starts."""episode = []for t in range(10):  # episode中visit的数量reward = self.action(x, y, action)episode.append(((x, y), action, reward))x, y, _ = self.next_point(x, y, action)return episodedef update_policy(self):"""Update Q values and policy based on episode results."""for x in range(self.n):for y in range(self.m):for z in range(5):episode_data = self.generate_episode(x,y,z)G = 0visited = set()# 使用首次访问for (x, y), action, reward in reversed(episode_data):G = self._lambda * G + rewardif (x, y, action) not in visited:  # Every-visit MC updateself.returns[(x, y, action)].append(G)self.Q[(x, y)][action] = np.mean(self.returns[(x, y, action)])visited.add((x, y, action))for (x, y) in self.Q.keys():best_action = np.argmax(self.Q[(x, y)])self.policy[x][y] = best_action  # 将策略更新为最佳动作def show_policy(self):"""Display the final policy matrix."""policy_display = "\n".join("\t".join(self.s[self.policy[x][y]] for y in range(self.m)) for x in range(self.n))print(policy_display)if __name__ == '__main__':behrman_mc = BehrmanMC_EGreedy()behrman_mc.update_policy()  # 实施MC Exploring Starts算法behrman_mc.show_policy()  # 输出最终策略

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.xdnf.cn/news/8060.html

如若内容造成侵权/违法违规/事实不符，请联系一条长河网进行投诉反馈，一经查实，立即删除！