our SARSA-based learning AI is making great progress! We have our first experiment results and have begun further tweaking of our experiment setups.
As our experiments all follow a common non-trivial setup, we thought it would be fun to share it with you 🙂
As we explained in our last AI article, SARSA needs to fight many battles before it learns how to exploit the enemy tactics. After fighting many battles, we then have to find out how well SARSA learned the exploit before we can think about optimizing it. After observing SARSA’s first learning scenarios, we discovered a great way to train and evaluate a learning scenario:
- SARSA participates in an amount of battles in order to learn
- SARSA’s learning capabilities are disabled
- SARSA participates in another 100 battles
- SARSA’s winrate in these 100 battles tells us how well it learned the exploit!
This setup is very basic and is not sufficient, in order to optimize SARSA. Tuning SARSA’s learning behavior by hand would take forever, as we have over 10 parameters which we can adjust in very small steps.
For this purpose, we used a few tricks used commonly in computer science 😉
The evolutionary algorithms (EA) are oriented towards the evolution theory of “survival of the fittest”. The first generation starts with some parents. These parents generate many children. The strongest children survive, become parents themselves and create the next generation, which is stronger than the previous one. This process lets each generation adapt better to the environment. Those who fail to do so, are discarded.
Die Evolutionären Algorithmen (EA) orientieren sich an der Evolutionären Theorie des überleben des Stärkeren. Die erste Generation startet mit einigen Eltern. Die Eltern zeugen viele Kinder. Die stärksten Kinder überleben, werden selber zu Eltern und zeugen eine neue Generation, die stärker und besser an die Umgebung angepasst ist. Durch diesen Prozess passen sich die Kinder immer besser an die Umgebung an. Kinder, die es nicht schaffen, werden verworfen.
The algorithm consists of four steps:
- Initial population – The initial population can be predefined or randomly generated.
- Fitness calculation and selection – The fitness value describes, how close the individual is to an optimal solution. The closer the fitness is to 0 (optimum), the better. The calculation is always problem-specific. The best individuals are selected for the next generation; the rest is discarded.
- Reproduction – The selected individuals are used to generate new children. Commonly two parents are combined to create a child.
- Mutation – Like in real-life evolution, the children are slightly mutated, so they differentiate from their parents and introduce new features. The mutation strength can vary throughout the generations.
- Repeat from step 2.
The special thing about evolutionary algorithms is, that the algorithm can be cancelled after any generation. If the generation’s fitness values are good enough to solve the problem at hand, we are done. If not, we can use the last generation as the new initial population, so that the algorithm can continue working.
General Experiment Setup
As mentioned, we evalutate the fitness of an individual by letting it learn for a certain number of battles and have it fight 100 battles without learning afterwards. The winrate of these 100 battles is used to determine the fitness of the individual. Battles are grouped into batches to make the handling easier. The EA simply creates a learning batch and an evaluation batch with the neccessary parameters and sizes and hands them of to the simulation. The simulation runs each battle in a batch and writes statistics such as the winner back into the batch. The EA is notified whenever a batch is finished. When an evaluation batch finishes the EA calculates the fitness of the individual. In testing randomness proved a huge problem because it leads to very unrealiable results. Therefore we changed to setup to repeating the learning and evaluation scenarios multiple times and calculating the fitness as an average of them.
Experiment: State Representation
Our goal is to reduce the state space while retaining 100% win rate. Therefore we want to know whether all seven values in the state respresentation are needed and which resolution each value needs. What special about this experiement setup is the termination condition for learning. It is based on the number of new state-action-pairs the AI found. Learning is terminated when the number of new state-action-pairs visited in the last 10 battles drops below a threshold, e.g. 3. In this case we assume the AI has learned enough.
Termination after a fixed number of battles does not work, because the AI can only visit so many states-action-pairs in one battle.
As you might remember SARSA’s always selects the best action in a state, except when a roll against the parameter epsilon decides it should take a random action. These random actions might uncover new paths to victory. The problem is now: how often should SARSA take random actions? To solve this, we will allow the EA to mutate both the parameter epsilion and the number of training battles. The fitness is the number of battles the individual trained with penalty for every battle not won.
SARSA has many learning parameters and rewards may be given in many different situations. We have setup a few intermediate rewards, namely for killing an enemy unit and a negative reward for the death of an allied unit. The EA mutates three learning parameters and the weighting of seven different rewards factors, as well as the number of training battles. The fitness is the number of battles the individual trained with penalty for every battle not won.
We hope this article gave you a nice overview of our work. In the upcomming articles we will go in depth into each experiment.
As always, feel free to ask us questions in the comments section below or in social media 🙂
Authors: Sergej Tihonov, Eike Wulff and Benjamin Justice