
Implementing our Learning AI

Hey guys,

we’re very excited to introduce the initial version of our Machine Learning AI for Project Aleron. In this post we explain our architecture and how the AI works. We hope to provide an example for AI beginners such as ourselves; we only started our research a few months ago.

Here’s a video of the AI in action:

If you have no prior knowledge of Machine Learning, check out our article Machine Learning in Games that offers a friendly introduction.

Hierarchical AI

For Project Aleron, we decided to use a hierarchical AI structure, as used in the Killzone series of games and explained in detail in the following papers: Killzone 2, Killzone 3, and an Intel article.

The Hierarchical AI structure used in Project Aleron.

In Project Aleron we implemented three layers of AI:

  • The Strategic AI interprets the game mode and sets a goal. Currently this means “kill all enemies”.
  • The Tactical AI receives a goal from the Strategic AI, looks at the game state and determines a target for each Operational AI. Currently it only determines attack targets. Later on it could give orders to protect a friendly target, for example.
  • The Operational AI receives a target from the Tactical AI, looks at its game state and chooses an optimal action to achieve its goal. The AI can choose from four different actions. All actions are performed relative to the target.
    • Attack – Attack the target, moving into range if necessary. Cannot be used if the target cannot be reached in one turn.
    • Engage – Move into the attack range of the target, i.e. so that the target could attack the bot in the target’s next turn. If that is not possible, move towards the target.
    • Outrange – Move directly outside the attack range of the target. If that is not possible, move to the safest position.
    • Flee – Move next to or behind allies; if no allies remain, put maximum distance between the bot and the enemy team.
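To make the action set concrete, here is a minimal sketch in Python (the names are ours, not Project Aleron’s actual code) of how the Operational AI’s legal actions could be restricted, since Attack is only usable when the target can be reached this turn:

```python
from enum import Enum

class Action(Enum):
    ATTACK = "attack"
    ENGAGE = "engage"
    OUTRANGE = "outrange"
    FLEE = "flee"

def available_actions(can_reach_target_this_turn: bool) -> list[Action]:
    """Attack is only legal when the target can be reached this turn;
    the other three actions are always available."""
    actions = [Action.ENGAGE, Action.OUTRANGE, Action.FLEE]
    if can_reach_target_this_turn:
        actions.insert(0, Action.ATTACK)
    return actions
```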

The action is then converted into commands for the game server and executed there.

Right now we only want to train the Operational AI. We can’t train several layers at once, because they would influence each other and make training and debugging much harder. Once we have a stable learning AI for one layer, we could train another one in the future.

Influence Maps

Influence Maps are a form of stigmergy. Stigmergy is a very popular communication tool in the animal world. The communication happens indirectly: one animal leaves a trail behind, e.g. in the form of a scent, and other individuals can interpret these trails. The scent’s range is limited and it decays over time. This way fresh scents are automatically preferred and old ones disappear.
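As a toy illustration of the decay idea, fading the “scent” values on a grid each turn could look like this (the decay rate is purely hypothetical):

```python
def decay_influence(grid, rate=0.1):
    """Exponentially decay every influence value so that fresh
    'scents' dominate and stale ones fade away over time."""
    return [[value * (1.0 - rate) for value in row] for row in grid]
```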

We want to use this concept for our AI, so that every bot can act independently, but still have the possibility to communicate with others.

Visualization of an Influence Map for one character.

Damage Map for a Character

Influence Maps display the influence of a character on its environment. A damage map, for example, shows how much damage a character can deal to which tiles in its next turn. Tiles which cannot be attacked next turn are not included in the damage map. Because movement and actions both consume Action Points, the values in the map decrease with increasing distance.
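A damage map like this could be sketched as follows. The Manhattan-distance movement, the AP costs, and the weapon range here are simplifying assumptions for illustration, not Project Aleron’s actual rules:

```python
def damage_map(origin, max_ap, move_cost, attack_cost, damage_per_attack,
               weapon_range, width, height):
    """Sketch of a damage map: for each tile, pay AP to move within
    weapon range (Manhattan distance), then turn the remaining AP into
    attacks. Unattackable tiles are left out of the map, and damage
    values descend with distance because movement eats AP."""
    ox, oy = origin
    result = {}
    for x in range(width):
        for y in range(height):
            steps_needed = max(0, abs(x - ox) + abs(y - oy) - weapon_range)
            ap_left = max_ap - steps_needed * move_cost
            attacks = ap_left // attack_cost if ap_left > 0 else 0
            if attacks > 0:
                result[(x, y)] = attacks * damage_per_attack
    return result
```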

We generate multiple influence maps per character which contain different information. These maps can be combined with each other, e.g. to show the influence of a whole team.

Visualization of an Influence Map for two characters.


Threat Map for Two Characters

If you have trouble picturing this, take a look at the article at Game School Gems. It has very good and colorful visualizations of influence maps.

The AI works solely with the influence maps. Since the values are preprocessed, the AI doesn’t have to calculate much itself. The AI also doesn’t care about the size of the map, because it only operates on the tiles within the influence map.

We currently have the following influence maps:

  • Remaining Life – remaining effective health points. Defence and barrier values are converted to health and added to the character’s current health. This way all damage calculations consider defence and barrier.
  • Remaining Life in Percent – relative health points. The barrier is converted to health points and added to them; defence is a percentage value and needs no extra calculation here.
  • Movement – tiles the character can reach in its next turn. Each reachable tile holds the remaining Action Points after movement.
  • Damage – tiles the character can attack in its next turn. Each attackable tile holds the maximum damage that can be applied to it next turn.
  • Influence – this map represents the danger a character poses. The damage maps are taken as a foundation and multiplied by the character’s remaining health points in percent. This way a character with low health spreads less influence: a character with high health is much more dangerous than one with low health, and characters with very low damage are generally not dangerous.
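The influence map construction (damage map scaled by remaining health in percent) and the combination of several characters’ maps into a team map could be sketched like this; the function names are ours:

```python
def influence_map(damage_map, health_percent):
    """Scale a character's damage map by its remaining health (0.0-1.0):
    wounded characters spread less influence, and characters with
    negligible damage are barely threatening at all."""
    return {tile: dmg * health_percent for tile, dmg in damage_map.items()}

def team_influence(maps):
    """Stack several characters' influence maps into one team map
    by summing the values on each tile."""
    total = {}
    for m in maps:
        for tile, value in m.items():
            total[tile] = total.get(tile, 0) + value
    return total
```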

If you still want to read more about influence maps, here are two awesome sources:

  • A video from aiGameDev, which shows different types of influence maps and their uses.
  • A high-quality article on influence maps can be found in our beloved Game Programming Bible (a collection of scientific articles about game development topics).

State Representation

As mentioned in our previous article, the AI needs a state representation in order to learn. The state represents the situation on the game field. Today we take a closer look at our state representation.

The tricky part is to represent as much information as possible with as few values as possible. We don’t have to perform many calculations because we have access to the influence maps. We only have to read the correct values from the correct influence maps. It’s very important that the bot can distinguish different situations while only looking at the state representation.

Everything we cannot influence, and everything that cannot influence us, is irrelevant. To minimize our state representation we focus on our own influence and the enemy’s influence.

For the current state representation we extract seven values from the damage maps and influence maps:

  • Can Move – We calculate whether the bot can move at all. If this is not the case, no action apart from Attack is usable. With this information the bot can learn that.
  • Own Remaining Health – We read the bot’s current health from the current health map.
  • Damage Potential on the Target’s Position – We read our damage map’s value at the target’s position to determine how much damage we can deal to it. If we can’t deal damage to the target, it is out of range.
  • Own Endangerment Factor – This value is read from the influence maps. If the endangerment factor is high, then many enemies are near.
  • Own Safety Factor – This value is read from the influence maps as well. If the safety value is high, then many allies are near.
  • Target’s Endangerment Factor – This value is analogous to the bot’s own endangerment factor, but from the target’s point of view.
  • Target’s Safety Factor – This value is analogous to the bot’s own safety factor, but from the target’s point of view.

The final touch is the abstraction of these seven values into nothing, small, medium, and large. The bot can no longer distinguish as many situations this way, but it reduces the state space enormously. We are still searching for an optimal abstraction 😉
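A possible bucketing of a raw map value into the four abstraction levels might look like this; the thresholds are placeholders, since finding a good abstraction is exactly the open question:

```python
def abstract(value, small_max, medium_max):
    """Bucket a raw influence-map value into one of the four
    abstraction levels. The thresholds are hypothetical and would
    need tuning per value type."""
    if value <= 0:
        return "nothing"
    if value <= small_max:
        return "small"
    if value <= medium_max:
        return "medium"
    return "large"
```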

Reinforcement Learning

In our last article Machine Learning in Games, we gave an introduction to reinforcement learning and why we decided to use it. Now we look at the available reinforcement learning methods and explain the one we chose.

Reinforcement learning methods come in two types: some learn State-State combinations, while others learn State-Action combinations.
State-State methods look at the current state and all reachable states and choose the best one available. State-Action methods look at the game state and choose the best available action.

The big difference between the two is that State-State methods have to simulate each available action to see the resulting state.
State-Action methods, on the other hand, depend on the number and type of actions: when an action is changed or added, the AI must begin learning from zero again.

We chose the State-Action approach, because it is faster at choosing an action and therefore gets through battles quicker. Additionally, we want to use modern reinforcement learning algorithms, which commonly follow the State-Action paradigm.


We use the reinforcement learning algorithm known as SARSA.
SARSA stands for “State, Action, Reward, State, Action”.

The concept of SARSA is simple and effective:
The bot examines the current game state and chooses an action based on its ratings. If it has no ratings yet, or some actions are rated identically, it chooses among those at random. Note that the bot doesn’t learn anything here yet, even if the chosen action yields a reward.
The next turn is where the magic happens: the bot examines the new game state and chooses a new action. The bot has not forgotten its previous State-Action pair and can now update that pair’s rating based on its old rating, the reward granted, and the new pair’s rating. This establishes a connection between State-Action pairs. Plain SARSA only updates one State-Action pair per reward.

We implemented a variant called SARSA(Lambda), which remembers how relevant past state-action pairs were and updates the rating for all relevant state-action pairs.
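For the curious, here is a minimal tabular SARSA(Lambda) sketch with eligibility traces. The hyperparameters and names are illustrative, not the values we actually use:

```python
import random
from collections import defaultdict

class SarsaLambdaAgent:
    """Minimal tabular SARSA(lambda) sketch with eligibility traces."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
        self.q = defaultdict(float)       # (state, action) -> rating
        self.traces = defaultdict(float)  # (state, action) -> eligibility
        self.actions = actions
        self.alpha, self.gamma, self.lam, self.epsilon = alpha, gamma, lam, epsilon

    def choose(self, state):
        # Epsilon-greedy: mostly take the best-rated action, sometimes explore.
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state, next_action):
        # TD error: reward plus discounted next rating, minus current rating.
        delta = (reward + self.gamma * self.q[(next_state, next_action)]
                 - self.q[(state, action)])
        self.traces[(state, action)] += 1.0  # mark this pair as just visited
        for key in list(self.traces):
            # Every eligible past pair shares in the update, scaled by its trace.
            self.q[key] += self.alpha * delta * self.traces[key]
            self.traces[key] *= self.gamma * self.lam  # traces fade each step
```

With the traces, a reward propagates back along the whole chain of recent State-Action pairs in a single update instead of only touching the most recent pair.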

The following image visualizes the difference in rated state-action pairs for Sarsa and Sarsa(Lambda) after receiving a reward at the star’s location:

Pathfinding using SARSA(Lambda).

(a) Actions taken to reach goal    (b) Sarsa    (c) Sarsa(Lambda)

Here are links to implementation details for SARSA and SARSA(Lambda).

Further Reading

More Machine Learning examples in games:
Killzone 3
Quakebot by Imitation
Quakebot with Evolutionary Algorithms

Closing Words

While writing this article, we noticed that simulating a fight took a very long time. Good results require many fights: running 1000 fights at four seconds per fight takes about an hour, an hour of waiting for results. That was quite depressing 🙁

We then optimized a bit and are now at around 100 milliseconds per fight 🙂 We might write an article on that in the future.

Authors: Sergej Tihonov, Eike Wulff and Benjamin Justice
