Thomas Anthony - Thinking Fast and Slow with Deep Learning and Tree Search (2017)
Created: October 20, 2017 / Updated: November 2, 2024 / Status: finished / 3 min read (~486 words)
- According to dual-process theory, human reasoning consists of two different kinds of thinking
- System 1: a fast, unconscious and automatic mode of thought (also known as intuition or heuristic process)
- System 2: an evolutionarily recent process unique to humans; a slow, conscious, explicit and rule-based mode of reasoning
- At a low level, Expert Iteration (ExIt) can be viewed as an extension of Imitation Learning (IL) methods to domains where the best known experts are unable to achieve satisfactory performance
- In IL an apprentice is trained to imitate the behavior of an expert policy
- Within ExIt, we iteratively re-solve the IL problem
- Between each iteration, we perform an Expert Improvement step, where we bootstrap the (fast) apprentice policy to increase the performance of the (comparatively slow) expert
- In a typical implementation of ExIt, the apprentice is implemented as a deep neural network, and the expert as a tree search algorithm (a minimal sketch of the loop is given below)
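
A minimal sketch of this loop, under toy assumptions: the states, the tabular apprentice, and the stand-in "expert search" below are hypothetical placeholders, not the paper's neural network and MCTS on Hex.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2]

def apprentice_policy(theta, state):
    # Toy "network": normalised positive weights per (state, action) pair
    w = [max(theta[(state, a)], 1e-3) for a in ACTIONS]
    z = sum(w)
    return [x / z for x in w]

def expert_search(theta, state, rollouts=50):
    # Stand-in "tree search": sample actions guided by the apprentice and
    # return the empirically best one as the expert's move
    scores = defaultdict(float)
    for _ in range(rollouts):
        probs = apprentice_policy(theta, state)
        a = random.choices(ACTIONS, weights=probs)[0]
        scores[a] += random.random()   # toy payoff in place of a real game outcome
    return max(ACTIONS, key=lambda a: scores[a])

def expert_iteration(n_iterations=5, states=range(10), lr=0.5):
    theta = defaultdict(float)                        # apprentice parameters
    for _ in range(n_iterations):
        # Expert Improvement: the (slow) expert searches, guided by the (fast) apprentice
        dataset = [(s, expert_search(theta, s)) for s in states]
        # Imitation Learning: restart from scratch and fit the apprentice to the expert's moves
        theta = defaultdict(float)
        for s, a in dataset:
            theta[(s, a)] += lr
    return theta

print(dict(expert_iteration()))
```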
- In Imitation Learning, we attempt to solve the MDP by mimicking an expert policy $\pi^*$ that has been provided
- The simplest approach is to ask the expert to name an optimal move $\pi^*(a|s)$ at each sampled state
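
As a hypothetical concrete instance of this imitation step, a small policy network can be fit to expert-labelled moves with a cross-entropy loss; the network shape and the random placeholder data below are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Placeholder apprentice: a small policy network over 16 state features and 4 moves
policy_net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimiser = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Placeholder dataset of (state, expert move) pairs gathered from the expert
states = torch.randn(256, 16)
expert_moves = torch.randint(0, 4, (256,))

for _ in range(100):
    optimiser.zero_grad()
    loss = loss_fn(policy_net(states), expert_moves)  # imitate the expert's chosen moves
    loss.backward()
    optimiser.step()
```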
- To achieve Expert Improvement, we bootstrap the most recent apprentice policy learned in the Imitation Learning step
- If the expert is able to find policies much stronger than the bootstrapped apprentice policy, each iteration will result in a large improvement in the apprentice's play strength
- The Imitation Learning step is analogous to a human improving their intuition for the task by studying example problems, while the Expert Improvement step is analogous to a human using their improved intuition to guide future analysis
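
One plausible form of this bootstrapping is to add an apprentice-prior bonus to the tree policy's selection rule, as sketched below; the exact bonus form and the weight `w_a` are assumptions in the spirit of the paper's neural MCTS, not a verbatim reproduction of it.

```python
import math
from dataclasses import dataclass

@dataclass
class ChildStats:
    visits: int = 0
    mean_value: float = 0.0

def select_action(children, apprentice_probs, c_uct=1.0, w_a=10.0):
    """UCT selection with an apprentice-prior bonus that decays with visit count."""
    total_n = max(sum(c.visits for c in children.values()), 1)
    def score(a):
        c = children[a]
        uct = c.mean_value + c_uct * math.sqrt(math.log(total_n) / (c.visits + 1))
        return uct + w_a * apprentice_probs[a] / (c.visits + 1)
    return max(children, key=score)

# The apprentice prior steers the search toward move 1 before it has any visits
children = {0: ChildStats(3, 0.2), 1: ChildStats(), 2: ChildStats()}
print(select_action(children, apprentice_probs={0: 0.1, 1: 0.7, 2: 0.2}))  # -> 1
```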
- The learning rate of ExIt is controlled by two factors:
    - the size of the performance gap between the apprentice policy and the improved expert
        - Induces an upper bound on the new apprentice's performance at each iteration
    - how close the performance of the new apprentice is to the expert it learns from
        - Describes how closely we approach the upper bound
- The role of the expert is to perform exploration from a single position, and thereby accurately determine strong move sequences
- The role of the apprentice is to generalize the policies that the expert discovers across the whole state space, and to provide rapid access to that strong policy for bootstrapping
- The canonical choice of expert is a tree search algorithm
- Possible tree search algorithms include Monte Carlo Tree Search, $\alpha$-$\beta$ search, and greedy search
- The canonical apprentice is a deep neural network parametrization of the policy
- In each step of ExIt, Imitation Learning is restarted from scratch
- DQN can be seen as a special case of ExIt where the expert performs a single-sample single-step lookahead
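
To make that analogy concrete (in standard DQN notation, with $\theta^-$ the target-network parameters): the one-step lookahead target plays the role of the expert's improved estimate, and the Q-network apprentice regresses onto it,

$$y_t = r_t + \gamma \max_{a'} Q_{\theta^-}(s_{t+1}, a'), \qquad \mathcal{L}(\theta) = \big(Q_\theta(s_t, a_t) - y_t\big)^2$$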
- Anthony, Thomas, Zheng Tian, and David Barber. "Thinking Fast and Slow with Deep Learning and Tree Search." arXiv preprint arXiv:1705.08439 (2017).