
The first version of AlphaGo used two neural networks, a policy network and a value network, to choose its moves. Both were convolutional neural networks. The policy network takes the board position as input and decides the best next move to make. It learnt from millions of Go game positions during training and used reinforcement learning to find better moves and grow stronger. Initially the network examined thousands of possible moves before making a decision; it was then changed so that it no longer searched the entire board but looked at a smaller window around the opponent's previous move and the new move under consideration, which made computing the best next move roughly a thousand times faster. The value network estimates the chance of winning the game from a given board position. These deep neural networks were integrated with Monte Carlo Tree Search to narrow down the enormous search space, giving a faster search and better move predictions. The algorithm started out with a bootstrapping process in which it was shown thousands of games, from which it learned the basics of Go, so this version was based on supervised learning through human input and instructions.
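As a rough illustration of this two-network design, the sketch below builds a small policy network and a small value network from convolutional layers, one mapping a position to move probabilities and the other to a win estimate. It is a minimal, hypothetical PyTorch example rather than DeepMind's implementation; the board size, the number of input feature planes and the layer widths are assumptions chosen for readability.

import torch
import torch.nn as nn

BOARD = 19        # assumed Go board size
IN_PLANES = 4     # assumed number of input feature planes

class PolicyNet(nn.Module):
    """Maps a board position to a probability distribution over moves."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * BOARD * BOARD, BOARD * BOARD)

    def forward(self, x):
        return torch.softmax(self.head(self.conv(x).flatten(1)), dim=1)

class ValueNet(nn.Module):
    """Maps a board position to an estimated outcome in [-1, 1]."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(IN_PLANES, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(64 * BOARD * BOARD, 1)

    def forward(self, x):
        return torch.tanh(self.head(self.conv(x).flatten(1)))

position = torch.zeros(1, IN_PLANES, BOARD, BOARD)   # dummy encoded position
print(PolicyNet()(position).shape, ValueNet()(position).shape)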
AlphaGo Zero is another variant of AlphaGo which is not given any human-played games in its first phase and learns by itself. It has no prior knowledge of the game, starting completely from scratch with only the basic rules of the game and the board positions as input.
AlphaGo Zero is based on self-play reinforcement learning. In this version the two neural networks are fused into one, which can be trained more efficiently: the policy network and the value network are combined in a single deep neural network that evaluates positions. This network uses a simpler tree search (look-ahead) for better predictions and a faster search, without the fast Monte Carlo rollouts that other Go software relies on to pick better moves and to decide which player will win from the current board position; instead, it depends on its high-quality deep neural network as a position evaluator. The board representation and its history are given as input to AlphaGo Zero, which uses a single residual neural network. It uses Monte Carlo Tree Search (MCTS) look-ahead inside policy iteration to keep learning stable and obtain better evaluations [8, 9].
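To make the "board representation and history" input concrete, the sketch below encodes a position in the style the AlphaGo Zero paper describes: a 19 x 19 x 17 stack of binary feature planes (eight planes of the current player's stones over the last eight time steps, eight planes of the opponent's stones, and one constant plane indicating the colour to play). The helper name and the toy history used here are illustrative assumptions.

import numpy as np

BOARD = 19
HISTORY = 8   # past time steps kept for each player

def encode_position(own_history, opp_history, black_to_play):
    """Stack 8 own-stone planes, 8 opponent-stone planes and 1 colour plane
    into a (17, 19, 19) array, padding missing history with empty planes."""
    planes = []
    for history in (own_history, opp_history):
        for t in range(HISTORY):
            planes.append(history[t] if t < len(history) else np.zeros((BOARD, BOARD)))
    planes.append(np.full((BOARD, BOARD), 1.0 if black_to_play else 0.0))
    return np.stack(planes).astype(np.float32)

# Toy example: black played one stone at (3, 3) and now white is to move.
black_planes = [np.zeros((BOARD, BOARD))]
black_planes[0][3, 3] = 1.0
white_planes = []   # white has no stones yet
s = encode_position(own_history=white_planes, opp_history=black_planes, black_to_play=False)
print(s.shape)   # (17, 19, 19)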
AlphaGo Zero uses a deep neural network f_θ with parameters θ. This network takes the board representation s of the position and its history as input and outputs both move probabilities and a value, (p, v) = f_θ(s) [9]. The value v is a scalar evaluation that estimates the chance of the current player winning from position s [50]. For each move, MCTS gives the search probabilities π. The neural network combines the roles of the policy network and the value network [12] in a single architecture, and it improves itself to predict the moves as well as the winner of the games [51], which increases AlphaGo Zero's strength over AlphaGo. The network consists of many residual blocks [4] of convolutional layers [16, 17] with batch normalization [18] and rectifier nonlinearities [19]. For each position s, an MCTS search [13-15] is executed, guided by the neural network f_θ.
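The sketch below shows one way such an architecture fits together: a shared tower of residual blocks (convolution, batch normalization, ReLU) feeding a policy head that outputs move probabilities p and a value head that outputs a scalar v. It is a minimal PyTorch illustration; the number of blocks, the channel width and the 17-plane input are assumptions, not DeepMind's exact configuration.

import torch
import torch.nn as nn

BOARD, IN_PLANES, CH, BLOCKS = 19, 17, 64, 4   # assumed sizes

class ResidualBlock(nn.Module):
    """Conv -> BN -> ReLU -> Conv -> BN, plus a skip connection."""
    def __init__(self, ch):
        super().__init__()
        self.c1 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b1 = nn.BatchNorm2d(ch)
        self.c2 = nn.Conv2d(ch, ch, 3, padding=1, bias=False)
        self.b2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        h = torch.relu(self.b1(self.c1(x)))
        h = self.b2(self.c2(h))
        return torch.relu(x + h)

class PolicyValueNet(nn.Module):
    """Single network f_theta(s) -> (move probabilities p, value v)."""
    def __init__(self):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(IN_PLANES, CH, 3, padding=1, bias=False),
            nn.BatchNorm2d(CH), nn.ReLU(),
        )
        self.tower = nn.Sequential(*[ResidualBlock(CH) for _ in range(BLOCKS)])
        self.policy_head = nn.Linear(CH * BOARD * BOARD, BOARD * BOARD + 1)  # moves + pass
        self.value_head = nn.Sequential(
            nn.Linear(CH * BOARD * BOARD, 64), nn.ReLU(), nn.Linear(64, 1), nn.Tanh(),
        )

    def forward(self, s):
        h = self.tower(self.stem(s)).flatten(1)
        p = torch.softmax(self.policy_head(h), dim=1)   # move probabilities
        v = self.value_head(h)                          # value in [-1, 1]
        return p, v

s = torch.zeros(1, IN_PLANES, BOARD, BOARD)   # dummy encoded position
p, v = PolicyValueNet()(s)
print(p.shape, v.shape)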
Self-play with this improved MCTS-based policy is used to update the neural network so that it predicts every move and the winner as a value; it acts as a powerful policy evaluation operator. The updated parameters are then used in the next iteration of self-play to make the search even more powerful, in a policy iteration procedure [22, 23]. The neural network (p, v) = f_θ(s) is trained to minimize the error between the predicted value v and the self-play winner z, and to maximize the similarity of the neural network move probabilities p to the search probabilities π. Specifically, the parameters θ are adjusted by gradient descent on a loss function l that sums over the mean-squared error and cross-entropy losses, respectively:

(p, v) = f_θ(s)   and   l = (z − v)² − πᵀ log p + c‖θ‖²

where c is a
parameter controlling the level of L2 weight regularization (to prevent overfitting).
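As a concrete reading of this loss, the sketch below computes l = (z − v)² − πᵀ log p + c‖θ‖² for a single self-play example and takes one gradient step. It is a hedged illustration in PyTorch: the tiny stand-in network, the value of c and the optimizer settings are assumptions, not the training setup reported for AlphaGo Zero.

import torch
import torch.nn as nn

BOARD = 19
MOVES = BOARD * BOARD + 1          # board points plus a pass move
C = 1e-4                           # assumed L2 regularization strength c

# Tiny stand-in for f_theta; any network producing (p, v) could be used here.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(17 * BOARD * BOARD, 128)
        self.policy = nn.Linear(128, MOVES)
        self.value = nn.Linear(128, 1)

    def forward(self, s):
        h = torch.relu(self.body(s.flatten(1)))
        return torch.softmax(self.policy(h), dim=1), torch.tanh(self.value(h))

net = TinyNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)

# One self-play training example: position s, MCTS search probabilities pi,
# and the game outcome z from the current player's point of view.
s = torch.zeros(1, 17, BOARD, BOARD)
pi = torch.full((1, MOVES), 1.0 / MOVES)   # dummy uniform search probabilities
z = torch.tensor([[1.0]])                  # the current player won this game

p, v = net(s)
mse = (z - v).pow(2).mean()                                 # (z - v)^2
cross_entropy = -(pi * torch.log(p + 1e-8)).sum(1).mean()   # -pi^T log p
l2 = sum(param.pow(2).sum() for param in net.parameters())
loss = mse + cross_entropy + C * l2                         # l = (z-v)^2 - pi^T log p + c ||theta||^2
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))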