# Paper: “Mastering the game of Go with deep neural networks and tree search”

Recently, a paper titled “Mastering the game of Go with deep neural networks and tree search” was published in Nature. It describes a neural network architecture that, for the first time, won against a 2-dan professional Go player (Fan Hui) without handicaps, earning the title of the strongest existing Go engine. A match between the engine, named AlphaGo, and Lee Se-dol is scheduled for the 9th through 15th of March 2016. AlphaGo is backed by Google.

Lacking experience in machine learning, I found it difficult to follow the formal details. Here I would just like to cite the most interesting quotes:

[…] These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game’s breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80) and especially Go (b ≈ 250, d ≈ 150), exhaustive search is infeasible, but the effective search space can be reduced by two general principles. First, […] position evaluation […] Second […] sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s.
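To get a sense of scale, the quoted breadth/depth figures can be plugged into b^d directly; a quick back-of-the-envelope check:

```python
import math

# Back-of-the-envelope game-tree sizes from the figures quoted above:
# chess (b ~ 35, d ~ 80) and Go (b ~ 250, d ~ 150).
chess = 35 ** 80
go = 250 ** 150

print(f"chess: ~10^{math.floor(math.log10(chess))} sequences")
print(f"go:    ~10^{math.floor(math.log10(go))} sequences")
```

Go's tree is roughly 10^236 times larger than chess's, which is why brute-force search was never an option.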

[…] evaluating positions using a value network, and sampling actions using a policy network.

We trained a 13-layer policy network […] from 30 million positions from the KGS Go Server.
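The supervised training objective here is maximizing the log-likelihood of the expert move. As a toy sketch only (a single softmax layer over a made-up 4-move action space and fake "expert" data, nothing like the paper's 13-layer convolutional network):

```python
import numpy as np

# Toy sketch of supervised policy training by gradient ascent on log p(a|s).
# All data and dimensions here are invented for illustration.
rng = np.random.default_rng(0)
W = np.zeros((4, 8))                  # weights: 4 "moves" x 8 board features

def policy(s):
    z = W @ s
    e = np.exp(z - z.max())
    return e / e.sum()                # p(a|s) as a softmax over moves

# Fake "expert" data: the correct move is the index of the largest of the
# first four features.
states = rng.normal(size=(500, 8))
moves = np.argmax(states[:, :4], axis=1)

for s, a in zip(states, moves):       # SGD step on d log p(a|s) / dW
    p = policy(s)
    grad = -np.outer(p, s)
    grad[a] += s
    W += 0.1 * grad

acc = np.mean([policy(s).argmax() == a for s, a in zip(states, moves)])
print(f"training accuracy after one pass: {acc:.2f}")
```

Even this linear model learns the toy mapping quickly; the paper's contribution is doing the analogous thing with a deep network over 30 million real positions.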

We also tested against the strongest open-source Go program, Pachi.

The final version of AlphaGo used 40 search threads, 48 CPUs, and 8 GPUs. We also implemented a distributed version of AlphaGo that exploited multiple machines, 40 search threads, 1202 CPUs and 176 GPUs. […] The distributed version of AlphaGo was significantly stronger, winning 75% of games against single-machine AlphaGo and 100% of its games against other programs.

A policy p(a|s) is a probability distribution over legal actions a ∈ A(s). A value function is the expected outcome if all actions for both players are selected according to policy p, that is, v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p].
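The expectation v^p(s) can be approximated by Monte Carlo rollouts: play the game out many times under policy p and average the outcomes z. A toy illustration (a trivial random-walk "game" with a uniform policy, not AlphaGo's actual rollout code):

```python
import random

# Toy Monte Carlo value estimation: the "game" adds +1/-1 moves to a score
# for a fixed number of steps; the outcome z is the sign of the final score.
def rollout(state, steps=10):
    for _ in range(steps):
        state += random.choice([-1, 1])   # uniform policy p
    return 1 if state > 0 else -1 if state < 0 else 0

def estimate_value(state, n=10_000):
    # v^p(s) is approximated by the average outcome over n sampled rollouts.
    return sum(rollout(state) for _ in range(n)) / n

print(estimate_value(0))   # near 0: a symmetric position
print(estimate_value(5))   # clearly positive: a favorable position
```

AlphaGo does the same averaging, but with a learned rollout policy and, crucially, mixes these estimates with the value network's prediction inside the tree search.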

Our rollout policy p_π(a|s) contains less handcrafted knowledge than state-of-the-art Go programs.

Time controls for formal games were 1h main time plus three periods of 30s byoyomi.

The neural network architecture itself is described in the Methods section. The game results (formal and informal) are provided as a data sheet. And in Methods/Evaluation a closing parenthesis is missing in the denominator of the logistic function 😉
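For reference, the logistic function in question maps an Elo rating difference to a win probability. In its standard Elo form (the paper's exact constants may differ), with both closing parentheses in the denominator:

```python
# Standard Elo-style logistic win probability (the paper's evaluation uses
# a formula of this shape; constants here are the conventional ones).
def win_probability(elo_a, elo_b):
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# A ~191-point Elo gap corresponds to a 75% win rate, matching the quoted
# result of distributed AlphaGo vs. single-machine AlphaGo.
print(win_probability(191, 0))
```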