# Game-changer: AlphaGo Revolutionizes Artificial Intelligence

SPRING 2018 — *A Writing & Culture (WRIT-015) class assignment: Explaining Complexity*

An uncharacteristically anxious Lee Sedol glances up at his opponent across the Go board. Out of desperation, he searches for a look of unease, a flicker of doubt—a sign of weakness. Force of habit. It’s unnecessary in this game; his competitor is not human.

In March 2016, South Korean Lee Sedol, winner of 18 international titles and arguably one of the greatest Go players of all time, faltered in a match against AlphaGo, an artificial intelligence (AI) program created by DeepMind. Millions of eyes scrutinized the televised encounter between Lee Sedol and AlphaGo—human versus machine—and millions of eyes widened in disbelief when Lee Sedol appeared befuddled by an opponent for the first time in his 20-year career. AlphaGo ultimately emerged victorious, 4 games to 1, marking the first time AI defeated a world-champion Go player—a feat experts had predicted was at least 10 years away.

In a game of Go, two players take turns placing stones of their color, black or white, on the 361 intersections of a 19-by-19 grid. A group of stones is “captured” and removed from the board when it is completely surrounded by the opponent’s stones. The game ends when both players pass, declining to make a move. The player whose stones enclose more territory wins.

Despite its relatively simple rules, Go is extraordinarily complex. Two moves into a game of chess, there are 400 possible next moves. In Go, there are about 130,000. Due to the sheer number of possible Go board configurations, traditional AI programs have struggled to play at a level any higher than a capable amateur. The numbers attest to the immense challenge Go poses to AI, but its gameplay also relies on an innate sense that a move “just feels right.” How did AlphaGo replicate that human intuition? It takes the most creative and strategic minds to master Go, so how did a computer program manage to defeat the champion of a game dominated by humans for the last two millennia?

AlphaGo combines three machine learning algorithms: **Monte Carlo tree search**, a **policy network**, and a **value network**. DeepMind’s 2016 paper explains the full technical details.

Prior to AlphaGo, traditional AI relied solely on Monte Carlo tree search. These AI programs simulate a game of Go by selecting moves at random and expanding from a particular state of the board—a node. When an AI program reaches the end of that simulated game, it retraces its steps to the starting node and assigns the node a value based on whether the playout resulted in a win or a loss—a process called backpropagation. The assigned value helps the program determine which moves are more favorable. As the program plays more and more games, the tree of possible moves grows larger and larger. In subsequent games, it can search the tree to choose moves that are more likely to result in a victory.

However, there is a handicap to this method: the tree search algorithm mindlessly samples arbitrary moves among legal plays, but they are not always the best plays. Essentially, Monte Carlo tree search employs a “brute force” method of playing all possible combinations of moves and finding the winning one. Given the vast number of possible positions in Go, it is neither efficient nor effective to play through and evaluate every potential move.
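The random-playout idea at the heart of Monte Carlo tree search can be sketched in a few lines. The example below is a deliberately tiny illustration, not Go: a toy “subtraction game” where players alternately take 1 to 3 stones from a pile and whoever takes the last stone wins. Estimating a position’s worth by averaging the outcomes of many random playouts is exactly the sampling step the text describes:

```python
import random

def random_playout(pile, player):
    """Play random legal moves until the game ends; return the winner.
    Toy subtraction game: players alternately take 1-3 stones, and
    whoever takes the last stone wins."""
    while pile > 0:
        take = random.randint(1, min(3, pile))
        pile -= take
        if pile == 0:
            return player  # this player took the last stone
        player = 1 - player

def estimate_win_rate(pile, player, n_playouts=2000):
    """Monte Carlo estimate: fraction of random playouts won by `player`."""
    wins = sum(random_playout(pile, player) == player
               for _ in range(n_playouts))
    return wins / n_playouts

random.seed(0)
# With 1 stone left, the player to move always wins.
print(estimate_win_rate(1, 0))   # 1.0
print(estimate_win_rate(10, 0))  # the win rate from 10 stones under random play
```

Even on this toy game, thousands of playouts are needed for a stable estimate of a single position, which hints at why pure random sampling collapses under Go’s enormous search space.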

## How is AlphaGo different from other AI?

This is where the **policy network** and **value network** come into play; together, they tackle the inefficiencies of the tree search algorithm. Rather than picking moves at random, the policy network only selects moves that expert Go players would choose, narrowing the choices to those most likely to win. The value network evaluates the probability of winning associated with those select moves so that AlphaGo doesn’t have to play through an entire game to determine the outcome. AlphaGo then executes the move with the highest winning probability. Thus, these two neural networks reduce the breadth and depth of the tree search, producing a more efficient model. Let’s break down how each of these neural networks functions.

#### Policy Network

The policy network uses a machine learning algorithm called **supervised learning**. In supervised learning, you train a model based on a set of labelled data that consists of input values x and their corresponding output values y. The trained model can derive a relationship between x and y, which is represented by the function f so that y = f(x). Once the model has learned this relationship, it can predict the output y based solely on test data inputs. Humans process the world around them in a similar fashion: we make inferences based on what we know and what we have experienced.
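The y = f(x) idea can be made concrete with the simplest possible supervised model. The sketch below (my illustration, not AlphaGo’s network) fits a straight line to labelled (x, y) pairs by least squares, then uses the learned f to predict an output for an input it never saw during training:

```python
# Minimal supervised learning sketch: learn f from labelled (x, y) pairs,
# then predict y for an unseen x. The training data secretly follows y = 2x + 1.
train_x = [1.0, 2.0, 3.0, 4.0]
train_y = [3.0, 5.0, 7.0, 9.0]

n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
# Closed-form least-squares fit of a line y = slope * x + intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def f(x):
    """The learned relationship y = f(x)."""
    return slope * x + intercept

print(f(10.0))  # predicts 21.0 for an input the model never saw
```

AlphaGo’s policy network does the same thing at a vastly larger scale: x is a board position, y is the expert’s move, and f is a deep neural network rather than a line.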

A single unit where the process of supervised learning occurs is called a neuron. The neuron takes an input and predicts an output based on a learned relationship between input and output. A collection of neurons in an interconnected web forms an **artificial neural network**. These neural networks—as the name suggests—are inspired by our own biological neural networks, or the systems of cell connections in our brains. It’s this concept of neural networks that allows AlphaGo to imitate the ingenuity of the human mind.

AlphaGo’s policy network follows this two-step supervised learning process: it learns from training data and uses what it learned to make a prediction from the test data. Before its match against Lee Sedol, AlphaGo learned from a set of training data that contained 30 million positions from 160,000 expert-level Go matches in an online database. At any point in the match, AlphaGo processes an image of the current board configuration—the test data—and takes each position on the board as an input. But the simple neural networks described above can’t process all the attributes of an image through a single function f. To analyze the more complex inputs of an image, AlphaGo utilizes what is known as a **deep neural network**, characterized by hidden layers.

Instead of feeding directly into the output layer, the input layer sends information to an intermediary layer of neurons—a hidden layer—each with its own set of functions. This allows AlphaGo to perform a long cascade of calculations as the information travels from hidden layer to hidden layer. The final output is a set of probability values, one for each position on the board.
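A forward pass through such a network can be sketched compactly. The toy below is my illustration, not AlphaGo’s architecture: it uses a 3-by-3 board instead of 19-by-19, a single hidden layer, and random untrained weights (a real policy network learns its weights from data). A softmax at the end turns the raw scores into a probability for each board position:

```python
import math
import random

random.seed(42)
BOARD = 9    # toy 3x3 board instead of AlphaGo's 361 intersections
HIDDEN = 16  # one small hidden layer

# Hypothetical random weights; a trained network would learn these.
w1 = [[random.uniform(-0.5, 0.5) for _ in range(BOARD)] for _ in range(HIDDEN)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(HIDDEN)] for _ in range(BOARD)]

def policy(board):
    """Forward pass: board features -> hidden layer -> probability per position."""
    # Hidden layer: weighted sums passed through a ReLU nonlinearity.
    hidden = [max(0.0, sum(w * x for w, x in zip(row, board))) for row in w1]
    # Output layer: one raw score (logit) per board position.
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    # Softmax: exponentiate and normalize so the scores become probabilities.
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Encode the board: 1 = our stone, -1 = opponent's stone, 0 = empty.
board = [1, 0, -1, 0, 0, 0, -1, 0, 1]
probs = policy(board)
print(sum(probs))  # the nine probabilities sum to 1
```

AlphaGo’s real network is far deeper and convolutional, but the shape of the computation, inputs flowing through hidden layers into a probability per position, is the same.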

The positions with higher probability values—positions that AlphaGo learned from the supervised learning process—are moves that players from the training data were more likely to choose. For example, at a certain point in the game, AlphaGo determines that there is a 36% chance that a player would place a stone at position A3, while players are only 12% likely to play E8 next. Instead of blindly using Monte Carlo tree search to sample random moves, AlphaGo employs the deep neural network to purposefully pinpoint moves that human players prefer.

#### Value Network

AlphaGo narrows the search field with supervised learning, but out of those select moves, how does it know which one to pick? Even though experts prefer those moves based on the training data, not all of them will end in victory. To predict the position that yields the highest probability of winning, the value network—which has the same deep neural network structure as the policy network—utilizes another type of machine learning called **reinforcement learning**.

In reinforcement learning, the program learns without the labelled training data used in supervised learning. Think of supervised learning as having an answer key, while in reinforcement learning, you have to figure out everything yourself from scratch. As humans, we learn through experience: if an action produces a pleasurable response, we continue it. If an action produces pain, we stop. It’s the same in reinforcement learning: AlphaGo plays to maximize reward, or in this case, wins.

Prior to the match against Lee Sedol, the value network sampled 30 million positions from games that AlphaGo played against itself. From this self-training process, the value network can predict the winning probability—the value—of the current state of the board. It’s able to answer the question: what is the probability that black will win the game based on the current state? With this information, the value network can determine which board states yield the greatest likelihood of winning.
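The self-play idea can be sketched with a toy game (my illustration, not AlphaGo’s training pipeline). Below, a program plays thousands of random games of a subtraction game against itself, and after each game it nudges the value of every visited state toward the observed outcome: 1 for a win, 0 for a loss. Over many games, the values converge toward each state’s winning probability:

```python
import random

random.seed(0)

def play_self_game():
    """Random self-play of a toy subtraction game (take 1-3 stones; taking
    the last stone wins). Returns visited (pile, player) states and the winner."""
    pile, player = 10, 0
    history = []
    while pile > 0:
        history.append((pile, player))
        take = random.randint(1, min(3, pile))
        pile -= take
        if pile == 0:
            return history, player
        player = 1 - player

V, visits = {}, {}          # learned value and visit count per state
for _ in range(5000):
    history, winner = play_self_game()
    for state in history:
        pile, player = state
        outcome = 1.0 if player == winner else 0.0  # reward: win = 1, loss = 0
        visits[state] = visits.get(state, 0) + 1
        v = V.get(state, 0.0)
        # Incremental average: nudge the value toward the observed outcome.
        V[state] = v + (outcome - v) / visits[state]

# A pile of 1 stone is always a win for the player to move.
print(V.get((1, 0), 1.0))
```

AlphaGo’s value network generalizes this idea: instead of a lookup table of states, a deep neural network predicts the value of board positions it has never seen.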

It’s costly for the tree search to play through all the possible moves picked from the policy network, so the value network prioritizes positions with high winning probabilities. The search tree then invests more time in expanding and simulating those promising board positions. For example, if the policy network calculated that human players are more likely to choose position A3 and C6, Monte Carlo tree search would expand from each of those nodes and play a few moves. The value network assesses the probability of winning at those future board states and determines that the probability of winning at A3 is 54% and 23% at C6. The tree search algorithm stops playing from the C6 node, but continues to expand the board state with a higher winning probability, A3. Thus, AlphaGo finds the “best” move to play, A3, because that move produces a board state with the greatest chances of winning.
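The division of labor in that example can be written out directly. The numbers below are the hypothetical ones from the text, hard-coded for illustration; in AlphaGo both sets of numbers come from trained networks, and the pruning happens inside the tree search rather than in one pass:

```python
# Policy network's output: how likely a human expert is to play each move.
policy_probs = {"A3": 0.36, "C6": 0.12, "E8": 0.05}

def value_network(move):
    """Stand-in for the value network: winning probability of the board
    state after each move (hand-picked for illustration, not learned)."""
    return {"A3": 0.54, "C6": 0.23, "E8": 0.11}[move]

# Breadth reduction: keep only moves the policy network finds plausible...
candidates = [m for m, p in policy_probs.items() if p >= 0.10]
# ...then depth reduction: let the value network rank the candidates
# instead of playing every game out to the end.
best = max(candidates, key=value_network)
print(best)  # "A3", the move leading to the highest winning probability
```

The policy network trims the breadth of the search, and the value network trims its depth, which is exactly how the two networks make Monte Carlo tree search tractable.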

With these two deep neural networks—the policy network and the value network—AlphaGo optimizes the clumsy tree search algorithm. Choosing the next move becomes far more feasible and efficient, which ultimately allowed AlphaGo to triumph over Lee Sedol.

## What we gain from a loss against AlphaGo

Something emerged from AlphaGo that its creators could never have predicted. DeepMind sought to create a program capable of beating champion players. It accomplished that and much, much more.

One of AlphaGo’s moves in the historic match astounded Go experts: move 37 in game 2, a move many described as “beautiful.” It was a shot in the dark; using the policy network, AlphaGo deduced that there was only a one-in-ten-thousand chance that someone would play that move. But based on the intuition it gained from all those games against itself, AlphaGo played it anyway.

That move overturned thousands of years of conventional wisdom and will fundamentally change the way we play Go for generations to come. AlphaGo learned how to be creative. AlphaGo *taught us*.

In game 4, Lee Sedol reciprocated with a move on par with AlphaGo’s move 37. As it turns out, his move also had a one-in-ten-thousand chance of being played. Referred to as “God’s Touch,” that move turned the tables in Lee Sedol’s favor and ultimately rewarded him with his sole victory in the 5-game match.

Lee Sedol may have lost overall, but he walked away from that match with a redefined understanding of Go. “It made me rethink what it means to be creative,” he reflected. An encounter with a machine expanded an integral aspect of his humanity—the capacity to create, to innovate.

Maybe AI isn’t as disconnected from us as we think. Maybe the future of AI won’t only entail the progress of machines, but also the enhancement of human mentality. Maybe hidden behind the functions, the ones and zeros, and the pixels is a manifestation of what makes us human.

One thing’s for sure: AlphaGo rewrote the game.