by Thomas Simonini
This article is part of the Deep Reinforcement Learning Course with Tensorflow. Check the syllabus here.
Today we’ll learn about Q-Learning. Q-Learning is a value-based Reinforcement Learning algorithm.
This article is the second part of a free series of blog posts about Deep Reinforcement Learning. For more information and more resources, check out the syllabus.
In this article you’ll learn:
Let’s say you’re a knight and you need to save the princess trapped in the castle shown on the map above.
You can move one tile at a time. The enemy can’t move, but if you land on the same tile as the enemy, you will die. Your goal is to reach the castle by the fastest route possible. This can be evaluated using a “points scoring” system.
You lose 1 point at each step (losing points at each step helps our agent to be fast).
The question is: how do you create an agent that will be able to do that?
Here’s a first strategy. Let’s say our agent tries to go to each tile, and then colors each tile: green for “safe,” and red if not.
Then, we can tell our agent to take only green tiles.
But the problem is that this isn’t really helpful. We don’t know the best tile to take when green tiles are adjacent to each other. So our agent can fall into an infinite loop while trying to find the castle!
Here’s a second strategy: create a table where we’ll calculate the maximum expected future reward, for each action at each state.
Thanks to that, we’ll know what’s the best action to take for each state.
Each state (tile) allows four possible actions. These are moving left, right, up, or down.
In terms of computation, we can transform this grid into a table.
This is called a Q-table (“Q” for “quality” of the action). The columns will be the four actions (left, right, up, down). The rows will be the states. The value of each cell will be the maximum expected future reward for that given state and action.
Each Q-table score will be the maximum expected future reward that I’ll get if I take that action at that state with the best policy given.
Why do we say “with the best policy given”? It’s because we don’t implement a policy. Instead, we just improve our Q-table to always choose the best action.
Think of this Q-table as a game “cheat sheet.” Thanks to that, we know for each state (each line in the Q-table) what’s the best action to take, by finding the highest score in that line.
Yeah! We solved the castle problem! But wait… How do we calculate the values for each element of the Q-table?
To learn each value of this Q-table, we’ll use the Q learning algorithm.
The Action Value Function (or “Q-function”) takes two inputs: “state” and “action.” It returns the expected future reward of that action at that state.
We can see this Q function as a reader that scrolls through the Q-table to find the line associated with our state, and the column associated with our action. It returns the Q value from the matching cell. This is the “expected future reward.”
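That lookup is literally a single indexing operation in NumPy. Here is a tiny sketch (the table size and the stored value are made up for illustration):

```python
import numpy as np

q_table = np.zeros((16, 4))   # hypothetical 16 states x 4 actions
q_table[3, 2] = 0.7           # made-up Q value for illustration

def q(state, action):
    # Scroll to the row for the state and the column for the action.
    return q_table[state, action]

print(q(3, 2))   # 0.7
```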
But before we explore the environment, the Q-table gives the same arbitrary fixed value (most of the time 0). As we explore the environment, the Q-table will give us a better and better approximation by iteratively updating Q(s,a) using the Bellman Equation (see below!).
Step 1: Initialize Q-values. We build a Q-table with m columns (m = number of actions) and n rows (n = number of states). We initialize the values at 0.
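Step 1 is just a few lines of NumPy. The state and action counts below are placeholders for whatever environment you use:

```python
import numpy as np

n_actions = 4   # m columns: left, right, up, down
n_states = 16   # n rows: e.g. a hypothetical 4x4 grid of tiles

# The Q-table starts at 0 for every (state, action) pair.
q_table = np.zeros((n_states, n_actions))
```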
Step 2: For life (or until learning is stopped). Steps 3 to 5 will be repeated until we reach a maximum number of episodes (specified by the user) or until we manually stop the training.
Step 3: Choose an action. Choose an action a in the current state s based on the current Q-value estimates.
But… what action can we take in the beginning, if every Q-value equals zero?
That’s where the exploration/exploitation trade-off that we spoke about in the first article will be important.
The idea is that in the beginning, we’ll use the epsilon greedy strategy:
We generate a random number. If this number > epsilon, then we will do “exploitation” (this means we use what we already know to select the best action at each step). Else, we’ll do exploration.
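As a sketch, the epsilon-greedy choice looks like this (the table size, seed, and epsilon value are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
q_table = np.zeros((16, 4))   # hypothetical 16 states x 4 actions
epsilon = 1.0                 # start fully exploratory; decay it over time

def choose_action(state, epsilon):
    if rng.random() > epsilon:
        # Exploitation: pick the action with the highest Q-value in this state.
        return int(np.argmax(q_table[state]))
    # Exploration: pick a random action.
    return int(rng.integers(4))
```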
Steps 4–5: Evaluate! Take the action a and observe the outcome state s’ and reward r. Now update the function Q(s,a).
We take the action a that we chose in step 3, and performing this action returns a new state s’ and a reward r (as we saw in the Reinforcement Learning process in the first article).
Then, to update Q(s,a) we use the Bellman equation:
The idea here is to update our Q(state, action) like this:
New Q value = Current Q value + lr * [Reward + discount_rate * (highest Q value between possible actions from the new state s’ ) — Current Q value ]
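In code, that update is a one-liner. Here lr is the learning rate and gamma is the discount rate; the table size and hyperparameter values below are placeholders:

```python
import numpy as np

q_table = np.zeros((16, 4))   # hypothetical 16 states x 4 actions
lr, gamma = 0.1, 0.9          # made-up hyperparameters

def update(state, action, reward, new_state):
    # New Q = current Q + lr * [reward + gamma * max_a' Q(s', a') - current Q]
    td_target = reward + gamma * np.max(q_table[new_state])
    q_table[state, action] += lr * (td_target - q_table[state, action])
```

For example, with an all-zero table, finding the cheese (+1) updates the starting cell from 0 to 0 + 0.1 * (1 + 0.9 * 0 - 0) = 0.1.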
Let’s take an example:
Step 1: We init our Q-table
Step 2: Choose an action. From the starting position, you can choose between going right or down. Because we have a big epsilon rate (since we don’t know anything about the environment yet), we choose randomly. For example… move right.
We found a piece of cheese (+1), and we can now update the Q-value of being at the start and going right. We do this by using the Bellman equation.
Steps 4–5: Update the Q-function
Think of the learning rate as a way of controlling how quickly a network abandons the former value for the new one. If the learning rate is 1, the new estimate will be the new Q-value.
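We can check that claim numerically. With lr = 1 the old value is discarded entirely and the new estimate is exactly the TD target (all the numbers below are made up):

```python
old_q = 0.5
reward, gamma = 1.0, 0.9
max_next_q = 2.0
td_target = reward + gamma * max_next_q   # 1.0 + 0.9 * 2.0 = 2.8

lr = 1.0
new_q = old_q + lr * (td_target - old_q)
# With lr = 1, new_q equals the TD target: the old value is fully abandoned.
print(new_q)   # 2.8
```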
Good! We’ve just updated our first Q value. Now we need to do that again and again until the learning is stopped.
We made a video where we implement a Q-learning agent that learns to play Taxi-v2 with Numpy.
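The notebook uses Gym’s Taxi-v2; as a self-contained sketch of the same loop, here is the whole algorithm on a tiny hand-rolled corridor environment. The environment, seed, episode count, and hyperparameters are all made up for illustration:

```python
import numpy as np

# Toy corridor: states 0..4, the cheese sits at state 4.
# Actions: 0 = left, 1 = right. Reaching the cheese gives +1 and ends
# the episode; every other move costs 1 point, as in the knight example.
N_STATES, N_ACTIONS, GOAL = 5, 2, 4

def step(state, action):
    new_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if new_state == GOAL else -1.0
    return new_state, reward, new_state == GOAL

rng = np.random.default_rng(42)
q_table = np.zeros((N_STATES, N_ACTIONS))   # Step 1: init at 0
lr, gamma = 0.5, 0.9
epsilon, eps_decay = 1.0, 0.99

for episode in range(500):                  # Step 2: repeat for many episodes
    state, done = 0, False
    while not done:
        # Step 3: epsilon-greedy action choice.
        if rng.random() > epsilon:
            action = int(np.argmax(q_table[state]))
        else:
            action = int(rng.integers(N_ACTIONS))
        # Steps 4-5: act, observe s' and r, apply the Bellman update.
        new_state, reward, done = step(state, action)
        q_table[state, action] += lr * (
            reward + gamma * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
    epsilon *= eps_decay                    # explore less as we learn more

# After training, the greedy policy in states 0..3 should walk right
# toward the cheese.
print(np.argmax(q_table[:GOAL], axis=1))
```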
Now that we know how it works, we’ll implement the Q-learning algorithm step by step. Each part of the code is explained directly in the Jupyter notebook below.
You can access it in the
Or you can access it directly on Google Colaboratory:
The Q comes from the “quality” of a certain action in a certain state.
That’s all! Don’t forget to implement each part of the code by yourself — it’s really important to try to modify the code I gave you.
Try to add epochs, change the learning rate, and use a harder environment (such as Frozen Lake with 8x8 tiles). Have fun!
Next time we’ll work on Deep Q-learning, one of the biggest breakthroughs in Deep Reinforcement Learning in 2015. And we’ll train an agent that plays Doom and kills enemies!
If you liked my article, please click the 👏 below as many times as you liked the article so other people will see this here on Medium. And don’t forget to follow me!
If you have any thoughts, comments, or questions, feel free to comment below or send me an email: hello@simoninithomas.com, or tweet me.
Keep learning, stay awesome!
Part 1:
Part 2:
Part 3:
Part 3+:
Part 4:
Part 5:
Part 6:
Part 7: