Two weeks ago SmartGo sponsored the Core Intuition podcast, and I encouraged listeners to learn Go before computers conquer the game. Little did I know that Google’s AlphaGo had already beaten a professional player in a five-game match back in October; they were just waiting for their paper to get published.
I had expected this day to be at least five years away, more likely a decade; this is an amazing accomplishment by Google’s DeepMind team. In my previous thoughts on Facebook and Google entering the race, I was right about tree search being essential, wrong about more Go-specific approaches being necessary.
So how did Google do it? They build on Monte Carlo Tree Search (MCTS), but refine and extend it, and combine it with neural networks. Their paper is full of interesting details; here’s my initial take on the most important factors contributing to their success:
- Extending MCTS with position evaluation: At each node where they perform a Monte Carlo simulation (a rollout or playout), they also evaluate the position using a value function. The results from the simulations and the value function are combined; the combination performs better than either alone.
- Improved rollout policy: They create a deep neural network for the rollout policy using supervised learning, then refine it with reinforcement learning. A neural network implementing this rollout policy takes 3 ms to select an action; they also trained a rollout policy that is 1000 times faster but less precise (based on a linear combination of pattern features instead of a neural net). They use the more accurate one during learning, the faster one during actual play.
- Using rollouts to learn value function: Position evaluation has not worked well for computer Go in the past, but here they just need a reasonably good guess at how likely a position will lead to a win. They trained it with 30 million distinct positions (each sampled from a separate game to avoid overfitting), and used their best rollout policy to play to the end and determine the winner. This resulting value function is more accurate than a playout using their fast rollout policy, and while it’s less accurate than using their best rollouts using neural nets, it takes 15000 times less computation. (The paper also mentions truncated rollouts, not quite clear on those yet.)
The AlphaGo team has also found good solutions for an asynchronous MCTS algorithm, for combining fast MCTS operations with slow neural networks, and for hashing local moves during the playouts. Massive amounts of hardware are helpful too.
The match against Lee Sedol in March is going to be tremendously exciting. There’s still time to learn Go now so you can follow along and share in the excitement.