The Future of Artificial Intelligence
Created: March 18, 2016 / Updated: November 2, 2024 / Status: finished / 3 min read (~551 words)
If you have already seen Demis Hassabis's presentation General Learning Algorithms, you may skip to 27:25, since most of the content before that point is already covered in that other presentation. That content is not covered here; please refer to the article on the other presentation.
- Pattern recognition with planning
- 3000 years old
- 40M players
- $10^{170}$ positions
- 2 rules
- Rule #1: The capture rule: Stones are captured when they have no liberties (a minimal sketch of this rule follows this list)
- Rule #2: The "Ko" Rule: A repeated board position is not allowed
- The complexity makes brute force exhaustive search intractable
- Two main challenges
- The branching factor is huge
- Writing an evaluation function to determine who is winning is thought to be impossible
- Go is a game of intuition
- Computers are not very good at replicating intuition
- In chess, one can use heuristics to roughly determine who is winning
- In Go, this is not possible since every stone has the same value
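To make the capture rule concrete, here is a minimal sketch of liberty counting, assuming a board stored as a dictionary mapping (row, col) to a colour; this is a generic flood fill for illustration, not AlphaGo's board representation:

```python
# Minimal sketch of Go's capture rule (Rule #1), assuming a simple board
# representation: a dict mapping (row, col) -> "black" or "white" for occupied
# points (empty points are absent). Illustrative only, not AlphaGo's code.

def neighbors(pos, size=19):
    r, c = pos
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < size and 0 <= nc < size:
            yield (nr, nc)

def group_and_liberties(board, pos, size=19):
    """Return the connected group containing `pos` and its set of liberties."""
    color = board[pos]
    group, liberties, stack = {pos}, set(), [pos]
    while stack:
        current = stack.pop()
        for n in neighbors(current, size):
            if board.get(n) is None:                   # empty point = a liberty
                liberties.add(n)
            elif board.get(n) == color and n not in group:
                group.add(n)
                stack.append(n)
    return group, liberties

def captured(board, pos, size=19):
    """A group is captured exactly when it has no liberties left."""
    _, liberties = group_and_liberties(board, pos, size)
    return len(liberties) == 0
```

The "Ko" rule (Rule #2) is typically enforced separately, by remembering previous board positions and rejecting any move that would recreate one.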
- Download 100K games from amateur play
- Train a supervised learning policy network to try and copy those "expert" players
- Train a reinforcement learning policy network by having it play against itself
- Generate new data (30M positions) which will help produce an evaluation function
- Build a value network which indicates, based on the current board, who's winning the game and estimates by how much
- Policy network: Probability distribution over moves
- Value network: A real number between 0 and 1, where 0 means white is winning, 1 means black is winning, and 0.5 means the game is even
- The policy network can be thought of as reducing the breadth of the search tree
- The value network can be thought of as reducing the depth of the search tree
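As a rough sketch of the two network interfaces described above (output shapes and ranges only; the tiny architectures, layer sizes, and the 4 input feature planes are placeholders, not AlphaGo's actual networks):

```python
# Illustrative sketch of the policy and value network interfaces.
# The architectures below are tiny placeholders, not AlphaGo's networks.
import torch
import torch.nn as nn

BOARD = 19  # 19x19 board

class PolicyNet(nn.Module):
    """Maps a board encoding to a probability distribution over moves."""
    def __init__(self, in_planes=4):
        super().__init__()
        self.conv = nn.Conv2d(in_planes, 32, kernel_size=3, padding=1)
        self.head = nn.Linear(32 * BOARD * BOARD, BOARD * BOARD)

    def forward(self, x):
        h = torch.relu(self.conv(x)).flatten(1)
        return torch.softmax(self.head(h), dim=1)   # sums to 1 over all moves

class ValueNet(nn.Module):
    """Maps a board encoding to a single number in [0, 1]: 0 = white winning,
    1 = black winning, 0.5 = even (following the convention in these notes)."""
    def __init__(self, in_planes=4):
        super().__init__()
        self.conv = nn.Conv2d(in_planes, 32, kernel_size=3, padding=1)
        self.head = nn.Linear(32 * BOARD * BOARD, 1)

    def forward(self, x):
        h = torch.relu(self.conv(x)).flatten(1)
        return torch.sigmoid(self.head(h))          # scalar in [0, 1]

# Usage: a batch of one board with 4 hypothetical feature planes.
board = torch.zeros(1, 4, BOARD, BOARD)
move_probs = PolicyNet()(board)   # shape (1, 361): breadth reduction
win_prob = ValueNet()(board)      # shape (1, 1): depth reduction
```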
- Q - Action value of a move
- Do a little bit of searching in the tree
- Find a couple of promising moves
- Find the moves that have the maximum Q values
- Follow the trajectory that has a high Q value until we hit a node that hasn't been explored yet
- First, we call the policy network to expand the tree at that point, but only for the moves with the highest prior probability (P)
- Second, we call the value network to evaluate that position and give an estimate of who is winning
- If there is time, we also do rollouts to the end of the game (maybe a few thousand) to collect "true" statistics about who ends up winning the game from that position
- Then, we combine these two estimates (from the value network and the rollouts) to give a final estimation
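Putting the steps above together, here is a compressed sketch of one search simulation: follow high-Q moves to an unexplored node, expand it with policy-network priors (P), evaluate it with the value network and a rollout, mix the two estimates, and back the result up the path. The names are illustrative, the stubbed functions stand in for the real networks and game logic, and the exploration bonus and mixing weight `lam` are assumptions rather than the exact formulation used by AlphaGo:

```python
import random

def policy_net(state):
    """Stub: prior probability P for each candidate move."""
    return {m: 1.0 / 5 for m in range(5)}

def value_net(state):
    """Stub: estimated probability of winning from this position, in [0, 1]."""
    return random.random()

def rollout_winner(state):
    """Stub: fast random game to the end; 1.0 if we win, 0.0 otherwise."""
    return float(random.random() > 0.5)

def play(state, move):
    """Stub: the position reached after playing `move` in `state`."""
    return (state, move)

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior   # P: prior from the policy net
        self.children = {}                      # move -> Node
        self.visits, self.total_value = 0, 0.0  # N and W

    @property
    def q(self):                                # Q = W / N, the action value
        return self.total_value / self.visits if self.visits else 0.0

def select_child(node, c=1.0):
    """Pick the child with the best Q plus an exploration bonus driven by P."""
    return max(node.children.values(),
               key=lambda ch: ch.q + c * ch.prior / (1 + ch.visits))

def simulate(root, lam=0.5):
    """One simulation: select -> expand -> evaluate -> back up."""
    path, node = [root], root
    while node.children:                            # 1. follow high-Q moves
        node = select_child(node)
        path.append(node)
    for move, p in policy_net(node.state).items():  # 2. expand with priors
        node.children[move] = Node(play(node.state, move), prior=p)
    v = value_net(node.state)                       # 3a. value-network estimate
    z = rollout_winner(node.state)                  # 3b. rollout to the end
    leaf = (1 - lam) * v + lam * z                  # 4. mix the two estimates
    for n in path:                                  # 5. back up along the path
        n.visits += 1
        n.total_value += leaf

# Run a few thousand simulations, then pick the most visited move at the root.
root = Node(state="start")
for _ in range(2000):
    simulate(root)
best_move, best_child = max(root.children.items(), key=lambda kv: kv[1].visits)
```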
- Internal testing: Different versions of AlphaGo playing against each other 24/7
- Calibration: Play against external programs such as Zen and CrazyStone
- In April 2015: Winning 99% of games against Zen and CrazyStone
- In October 2015: Could beat the April 2015 version 100% of the time and beat Fan Hui 5 out of 5
- In March 2016: No estimate of its strength against the October 2015 version. Lee Sedol beats Fan Hui about 97% of the time
- General Learning Algorithms (covers some of the same content as this presentation)