David Silver - Mastering the game of Go without human knowledge (2017)
Created: October 20, 2017 / Updated: November 2, 2024 / Status: finished / 3 min read (~485 words)
- How is AlphaGo Zero different from natural selection/evolution?
- The architecture from AlphaGo is simplified, merging the policy and value networks into a single network
- A deep neural network $f_\theta$ with parameters $\theta$
- The neural network takes as an input the raw board representation $s$ of the position and its history, and outputs both move probabilities and a value, $(p, v) = f_\theta(s)$
- The vector of move probabilities $p$ represents the probability of selecting each move $a$, $p_a = \Pr(a|s)$
- The value $v$ is a scalar evaluation, estimating the probability of the current player winning from position $s$
- Combines both the roles of policy network and value network into a single architecture
- The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities (a minimal sketch of such a dual-headed network is given after this list)
- In each position $s$, an MCTS search is executed, guided by the neural network $f_\theta$
- The MCTS search outputs probabilities $\pi$ of playing each move
- The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the neural network's parameters are updated to make the move probabilities and value $(p, v) = f_\theta(s)$ more closely match the improved search probabilities and self-play winner $(\pi, z)$ (the loss behind this update is written out after this list)
- Each edge $(s, a)$ in the search tree stores a prior probability $P(s, a)$, a visit count $N(s, a)$, and an action value $Q(s, a)$
- Each simulation starts from the root state and iteratively selects moves that maximize an upper confidence bound $Q(s, a) + U(s, a)$, where $U(s, a) \propto P(s, a) / (1 + N(s, a))$, until a leaf node $s'$ is encountered
- MCTS may be viewed as a self-play algorithm that, given neural network parameters $\theta$ and a root position $s$, computes a vector of search probabilities recommending moves to play, $\pi = \alpha_\theta(s)$, proportional to the exponentiated visit count for each move, $\pi_a \propto N(s, a)^{1/\tau}$, where $\tau$ is a temperature parameter (see the selection and visit-count sketch after this list)
- First, the neural network is initialized to random weights $\theta_0$. At each subsequent iteration $i \ge 1$, games of self-play are generated
- At each time-step $t$, an MCTS search $\pi_t = \alpha_{\theta_{i-1}}(s_t)$ is executed using the previous iteration of neural network $f_{\theta_{i-1}}$ and a move is played by sampling the search probabilities $\pi_t$
- A game terminates at step $T$ when both players pass, when the search value drops below a resignation threshold, or when the game exceeds a maximum length (a sketch of the resulting training update appears after this list)
- A pure reinforcement learning approach is fully feasible
- Compared to training on human expert data, a pure reinforcement learning approach requires only a few more hours of training and achieves much better asymptotic performance
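The architecture bullets above name the ingredients without showing how they fit together. Below is a minimal PyTorch sketch of such a dual-headed residual network, assuming the board is encoded as a stack of binary feature planes for the current position and its history; the plane count, channel width and number of blocks are illustrative placeholders, not the paper's configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with batch norm, a skip connection, and ReLUs."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # skip connection, then rectifier

class PolicyValueNet(nn.Module):
    """f_theta(s) -> (p, v): move logits and a scalar value in [-1, 1]."""
    def __init__(self, planes=17, channels=64, blocks=4, board=19):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(planes, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.tower = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        # Policy head: one logit per board point plus one for passing.
        self.policy = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.BatchNorm2d(2), nn.ReLU(),
            nn.Flatten(), nn.Linear(2 * board * board, board * board + 1))
        # Value head: scalar squashed to [-1, 1] to match game outcomes z.
        self.value = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.BatchNorm2d(1), nn.ReLU(),
            nn.Flatten(), nn.Linear(board * board, channels), nn.ReLU(),
            nn.Linear(channels, 1), nn.Tanh())

    def forward(self, s):
        h = self.tower(self.stem(s))
        return self.policy(h), self.value(h)
```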
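Concretely, the paper realizes the policy-iteration update by gradient descent on a single loss that sums the mean-squared value error, the cross-entropy between the search probabilities and the network policy, and an L2 regularization term with constant $c$:

$$(p, v) = f_\theta(s), \qquad l = (z - v)^2 - \pi^\top \log p + c \lVert \theta \rVert^2$$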
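For the search itself, the sketch below spells out the per-edge statistics $P(s, a)$, $N(s, a)$, $Q(s, a)$ and the two formulas quoted above: selection by $Q(s, a) + U(s, a)$ with $U(s, a) \propto P(s, a) / (1 + N(s, a))$, and search probabilities $\pi_a \propto N(s, a)^{1/\tau}$. The exploration constant `c_puct` and the $\sqrt{\sum_b N(s, b)}$ factor are the usual PUCT form used by AlphaGo-style programs; treat their exact values here as assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored on one edge (s, a) of the search tree."""
    P: float        # prior probability from the network
    N: int = 0      # visit count
    W: float = 0.0  # accumulated value of simulations through this edge

    @property
    def Q(self) -> float:
        return self.W / self.N if self.N else 0.0  # mean action value

def select_move(edges: dict[int, Edge], c_puct: float = 1.5) -> int:
    """Pick the move maximizing Q(s, a) + U(s, a) at one tree node."""
    total_n = sum(e.N for e in edges.values())
    def ucb(e: Edge) -> float:
        return e.Q + c_puct * e.P * math.sqrt(total_n) / (1 + e.N)
    return max(edges, key=lambda a: ucb(edges[a]))

def search_probabilities(edges: dict[int, Edge], tau: float = 1.0) -> dict[int, float]:
    """pi_a proportional to N(s, a)^(1/tau); small tau approaches the argmax."""
    weights = {a: edges[a].N ** (1.0 / tau) for a in edges}
    total = sum(weights.values()) or 1.0
    return {a: w / total for a, w in weights.items()}
```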
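Finally, a hedged sketch of one optimization step for the self-play training loop described above, assuming mini-batches of $(s, \pi, z)$ sampled from stored self-play games and a network like the `PolicyValueNet` sketched earlier; the batch shapes, the regularization constant, and the explicit L2 term (instead of optimizer weight decay) are illustrative choices rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

def train_step(net, optimizer, states, target_pi, target_z, c=1e-4):
    """One gradient step pushing (p, v) = f_theta(s) toward (pi, z).

    states: (B, planes, 19, 19), target_pi: (B, 362), target_z: (B, 1).
    """
    p_logits, v = net(states)
    value_loss = F.mse_loss(v, target_z)                                # (z - v)^2
    policy_loss = -(target_pi * F.log_softmax(p_logits, dim=1)).sum(1).mean()
    l2 = sum((w ** 2).sum() for w in net.parameters())                  # ||theta||^2
    loss = value_loss + policy_loss + c * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In use, `optimizer` could be plain momentum SGD, e.g. `torch.optim.SGD(net.parameters(), lr=1e-2, momentum=0.9)`.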
- Silver, David, et al. "Mastering the game of Go without human knowledge." Nature 550.7676 (2017): 354.