The first version of AlphaGo used two neural networks, a policy network and a value network, to choose its moves. Both were convolutional neural networks. The policy network takes board positions as input and decides the best next move to make. It learned from millions of moves of human Go gameplay during training, and then used reinforcement learning to make its move selection stronger. The network examined thousands of possible moves before making a decision. It was later changed so that, rather than searching the entire board, it looks at a smaller window around the opponent's previous move and the new move it is considering, which makes computing the best next move about a thousand times faster. The value network estimates the chance of winning the game from a given board position. These deep neural networks are integrated with Monte Carlo Tree Search, which narrows the search within Go's enormous search space and gives faster search and better move predictions. The algorithm started out with a bootstrapping process in which it was shown thousands of games that were used to learn the basics of Go; this version was based on supervised learning from human input and instructions.
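To make the two-network design concrete, here is a minimal sketch of a pair of separate convolutional policy and value models (assuming PyTorch; the class names, layer sizes, and the 17-plane board encoding are illustrative placeholders, not AlphaGo's actual configuration):

```python
import torch
import torch.nn as nn

BOARD, IN_PLANES = 19, 17  # board size and an illustrative number of input feature planes

class PolicyNet(nn.Module):
    """Maps an encoded board position to a probability distribution over moves."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD + 1)  # +1 for pass

    def forward(self, x):
        return torch.softmax(self.head(self.conv(x).flatten(1)), dim=-1)

class ValueNet(nn.Module):
    """Maps an encoded board position to a scalar estimate of the chance of winning."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(64 * BOARD * BOARD, 1)

    def forward(self, x):
        return torch.tanh(self.head(self.conv(x).flatten(1)))
```

Keeping the two models separate is the key contrast with AlphaGo Zero, which merges them into a single network, as described below.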

AlphaGo Zero is another variant of AlphaGo that is not given any human-played games in its first phase and instead learns by itself. It has no prior knowledge of the game and starts completely from scratch, with only the basic rules of the game and the board positions as input.

AlphaGo Zero is based on self-play reinforcement learning. In this version, the two neural networks are fused into one, which can be trained more efficiently: the policy network and the value network are combined into a single deep neural network that evaluates positions. This network uses a simpler tree search (look-ahead) for faster search and better move prediction, without the Monte Carlo rollouts that other Go software uses to choose moves and to decide which player will win from the current board position; instead, it depends on its high-quality deep neural network as a position evaluator. The board representation and its history are given as input to AlphaGo Zero, which uses a single residual neural network. It uses Monte Carlo Tree Search (MCTS) look-ahead within policy iteration to keep learning stable and to obtain better evaluations [8, 9].
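The following is a minimal sketch of such a rollout-free look-ahead (assuming PyTorch, a network that maps an encoded position to (move probabilities, value), and a hypothetical `game` interface with `play`, `legal_moves`, `is_terminal`, `outcome`, `encode`, and `num_moves`; values are assumed to be from the perspective of the player to move). The point to notice is that leaf positions are scored by the value output rather than by playing random rollouts to the end of the game:

```python
import math
import torch

class Node:
    """One edge of the search tree: prior P, visit count N, total value W."""
    def __init__(self, prior):
        self.prior, self.visits, self.value_sum = prior, 0, 0.0
        self.children = {}  # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts_search(net, game, state, n_sims=100, c_puct=1.0):
    """Look-ahead search whose leaves are evaluated by the network, not by rollouts."""
    root = Node(prior=1.0)
    for _ in range(n_sims):
        node, s, path = root, state, []
        # selection: descend the tree, maximizing Q(s, a) + U(s, a)
        while node.children:
            total = math.sqrt(sum(ch.visits for ch in node.children.values()) + 1)
            move, node = max(
                node.children.items(),
                key=lambda kv: kv[1].q()
                + c_puct * kv[1].prior * total / (1 + kv[1].visits))
            s = game.play(s, move)
            path.append(node)
        # expansion and evaluation: the value output replaces a Monte Carlo rollout
        if game.is_terminal(s):
            v = game.outcome(s)
        else:
            p, value = net(game.encode(s).unsqueeze(0))
            v = value.item()
            for move in game.legal_moves(s):
                node.children[move] = Node(prior=p[0, move].item())
        # backup: propagate the evaluation, flipping perspective at every ply
        for n in reversed(path):
            v = -v
            n.visits += 1
            n.value_sum += v
    visit_counts = torch.tensor(
        [root.children[m].visits if m in root.children else 0
         for m in range(game.num_moves())], dtype=torch.float)
    return visit_counts / visit_counts.sum()  # search probabilities pi
```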

AlphaGo Zero uses a deep neural network f_θ with parameters θ. This neural network takes the position s of the board representation, together with its history, as input and outputs both move probabilities and a value, (p, v) = f_θ(s) [9]. The value v is a scalar evaluation that estimates the chance of the current player winning from position s [50]. For each move, MCTS returns the search probabilities π. This neural network combines the roles of both the policy network and the value network [12] in a single architecture. The network progressively improves at predicting the moves as well as the winner of the games [51], and this increases AlphaGo Zero's strength over AlphaGo. The neural network consists of many residual blocks [4] of convolutional layers [16, 17] with batch normalization [18] and rectifier nonlinearities [19].
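As a rough illustration of that architecture, here is a minimal sketch of a small residual tower with batch normalization and rectifier (ReLU) nonlinearities feeding a policy head and a value head (assuming PyTorch; the channel width, number of blocks, and input encoding are illustrative placeholders, far smaller than the published network):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

BOARD, IN_PLANES, CH = 19, 17, 64  # illustrative sizes, not the published configuration

class ResBlock(nn.Module):
    """Residual block: conv -> BN -> ReLU -> conv -> BN, plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.c1, self.b1 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)
        self.c2, self.b2 = nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch)

    def forward(self, x):
        h = F.relu(self.b1(self.c1(x)))
        return F.relu(self.b2(self.c2(h)) + x)

class DualNet(nn.Module):
    """Single network f_theta(s) -> (move probabilities p, value v)."""
    def __init__(self, n_blocks=4):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, CH, 3, padding=1), nn.BatchNorm2d(CH), nn.ReLU())
        self.tower = nn.Sequential(*[ResBlock(CH) for _ in range(n_blocks)])
        self.policy_head = nn.Linear(CH * BOARD * BOARD, BOARD * BOARD + 1)  # +1 for pass
        self.value_head = nn.Sequential(
            nn.Linear(CH * BOARD * BOARD, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh())

    def forward(self, s):
        h = self.tower(self.stem(s)).flatten(1)
        p = torch.softmax(self.policy_head(h), dim=-1)
        v = self.value_head(h).squeeze(-1)
        return p, v
```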

For each position s, an MCTS [13-15] search is executed, guided by the neural network f_θ. Self-play with this improved, MCTS-based policy updates the neural network to predict every move and the winner as a value; it acts as a powerful policy evaluation operator. The updated parameters are then used in the next iteration of self-play to make the search more powerful, in a policy iteration procedure [22, 23].
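A condensed sketch of one self-play game in this loop (assuming the `mcts_search` and `game` interface from the sketches above, plus an `initial_state` method; move sampling and sign conventions are simplified for illustration):

```python
import torch

def self_play_game(net, game):
    """Play one game against itself and return (state, pi, z) training triples."""
    records, state = [], game.initial_state()
    while not game.is_terminal(state):
        pi = mcts_search(net, game, state)       # search probabilities from MCTS
        records.append((game.encode(state), pi))
        move = torch.multinomial(pi, 1).item()   # sample the next move from pi
        state = game.play(state, move)
    z = game.outcome(state)  # assumed to be from the perspective of the player to move
    samples = []
    for s, pi in reversed(records):
        z = -z                # the previous ply was played by the other player
        samples.append((s, pi, z))
    samples.reverse()
    return samples            # each position is labelled with the eventual winner z
```

The resulting triples are exactly the targets used in the loss below: π supervises the move probabilities and z supervises the value.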

The neural network (p, v) = f_θ(s) is trained to minimize the error between the predicted value v and the self-play winner z, and to maximize the similarity of the neural network's move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the mean-squared error and cross-entropy losses, respectively:

(p, v) = f_θ(s)  and  l = (z − v)^2 − π^T log p + c‖θ‖^2

where c is a parameter controlling the level of L2 weight regularization (to prevent overfitting).
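Written out in code, a minimal sketch of this loss (assuming PyTorch, the `DualNet` sketch above, and batched tensors of states, search probabilities π, and outcomes z gathered from self-play; the value of c is illustrative):

```python
import torch

def alphazero_loss(net, states, pis, zs, c=1e-4):
    """l = (z - v)^2 - pi^T log p + c * ||theta||^2, averaged over the batch."""
    p, v = net(states)                                             # (p, v) = f_theta(s)
    value_loss = ((zs - v) ** 2).mean()                            # mean-squared error
    policy_loss = -(pis * torch.log(p + 1e-8)).sum(dim=1).mean()   # cross-entropy
    l2 = sum((w ** 2).sum() for w in net.parameters())             # L2 regularization
    return value_loss + policy_loss + c * l2
```

In practice the L2 term is often handled through the optimizer's weight-decay setting rather than added to the loss explicitly.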