In our blog article Implementing our Learning AI, we mentioned that we optimized the calculation time for AI battles from 4 seconds to under 100 ms!
We want to show you how we achieved this and how you can optimize your own games if you develop with Unity or any Java-based engine.
When should you optimize?
Only optimize if you notice problems. If your game runs smoothly and without frame drops, you are done. Doing more won't earn you any medals, since nobody will notice.
Optimizing for performance usually hurts the readability of your code, so you always have to balance optimization against readable code.
We have been developing software together for more than three years and only recently had to optimize for performance for the first time. The reason: we simulate thousands of fights in our experiments, and saving fractions of a second per fight adds up quickly.
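To get a feel for why fractions of a second matter at this scale, here is a quick back-of-the-envelope calculation. The fight count of 10,000 is a hypothetical figure for illustration, not a number from our experiments:

```java
public class SavingsEstimate {
    // Hours saved by speeding up each fight, for a given number of fights.
    static double savedHours(int fights, double secondsBefore, double secondsAfter) {
        return fights * (secondsBefore - secondsAfter) / 3600.0;
    }

    public static void main(String[] args) {
        // 10,000 fights at 4 s each vs. 0.1 s each: roughly 10.8 hours saved
        System.out.printf("%.1f hours saved%n", savedHours(10_000, 4.0, 0.1));
    }
}
```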
Profilers are a powerful tool to analyze performance and find bottlenecks. They can show you which methods were called how often, and how much CPU time they required. We will have a look at the Unity Profiler and Oracle’s VisualVM Profiler for Java Virtual Machines.
As of Unity 5, the Profiler is included in the free version (documentation and tutorial video). Unity’s Profiler was designed for games, so it not only offers CPU and memory usage information, but rendering, physics, audio and networking information as well!
Unity’s Profiler running in the Linux Editor (Click to enlarge)
You can open the Unity Profiler by clicking on “Window->Profiler” in Unity’s Menu (hotkey Ctrl+7). When you have the window open and start your game, it will record all information. You can even start the profiler after starting your game in case you find a rare performance problem 😉
When your profiler is running, you can see the data per frame being recorded in the graphs in the top part. Let the profiler record some data and then pause your game in the editor. You can choose any frame the profiler has recorded by clicking on the CPU Usage Graph.
After doing so, you will see all methods called in that frame, including their CPU time in percent, CPU time in milliseconds, and how often they were called.
When you click on a method, you can see details about the method calls including the GameObjects responsible for the calls 🙂
On the top you can enable Deep Profile to show detailed method call stacks. Enabling Profile Editor is useful when you have editor extensions or when an asset is slowing down your editor. You can also profile remote players, such as a development build on a console or mobile device.
As mentioned before, Unity’s profiler can show you much more information. By scrolling down in the top part you can select frame information from the other categories and display rendering information.
If you have trouble with draw call performance, this article has some good info: Reducing Draw Calls (also named SetPass Calls) in Unity
For the most thrilling GPU profiling/debugging experience, check out the Frame Debugger (small button in the middle of the profiler window).
You can read about it and see it in action here: Frame Debugger (Unity Docs)
VisualVM is one of the many profilers for Java programs. It comes bundled with the JDK, which means you already have it installed if you develop in Java.
After starting VisualVM you need to connect to the process you want to analyze. If you are running your program from your IDE, keep in mind that IDEs usually use reflection to start your code. The screenshot below shows what this looks like for IntelliJ IDEA: the main class of your program is an argument to the IntelliJ process.
The Monitor tab offers an overview of your application. In the Threads tab you can see all threads in your code and take a thread dump. Thread dumps can be helpful if your application won't quit because a thread is waiting on a lock.
The next two tabs are the sampler and the profiler. Here you can measure CPU and memory usage; we will use CPU measuring for these explanations. While both methods share the same layout, the settings are different. That's because, while they answer the same questions, they calculate those answers in completely different ways. The sampler takes thread dumps at intervals and extrapolates the CPU usage from there. The profiler inserts timing code into every method of every class loaded. This leads to rather precise timings, but at a high performance cost. This is why you should generally use the sampler and profile only the slow parts. If you would like to know more about the technical aspects, read this Stack Overflow answer and dig further from there.
Enough of the tech talk, let's sample something!
The results for my example look like this:
The first thing to understand when reading these results is the difference between self time and total time. Self time is the time spent in the method itself, without any further method calls. Total time is the time spent in the method and all methods it called. Note that any time spent in methods that are not being measured is added to the calling method. The next distinction is between time and time (CPU). The first is all time spent in the method, e.g. waiting on a lock, waiting for the hard disk, and calculating things, while the second is just the time spent calculating. This matters because non-CPU bottlenecks need different solutions than CPU bottlenecks.
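The self/total relation can be written down in one line. A made-up example: if outer() shows 120 ms total time and its only callee inner() shows 90 ms total time, outer()'s self time is the difference (the method names and numbers here are purely illustrative):

```java
public class ProfilerTimes {
    // Self time = total time minus the total time of all methods it called.
    static long selfTime(long totalMs, long calleesTotalMs) {
        return totalMs - calleesTotalMs;
    }

    public static void main(String[] args) {
        long outerTotal = 120; // ms spent in outer(), including everything it called
        long innerTotal = 90;  // ms spent in inner(), outer()'s only callee
        System.out.println(selfTime(outerTotal, innerTotal) + " ms self time in outer()");
    }
}
```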
At the top of the results you see three methods from the jme3 lwjgl package that take a lot of CPU time, but none of it is their own. Somewhere deep inside these methods a lot of work is done, which is fine for us. The fifth method is update() in kryonet.Client. This method has a lot of self time, but no CPU time. Knowing that kryonet is a network library, it's safe to say that it is waiting for the network device to have new data. Further down we have LinuxContextImplementation.nSwapBuffers(). Now this is a hard-working method: all its time is self time (CPU). This pattern is what you usually look for when optimizing performance.
The memory sampler is rather self-explanatory and can be useful in case of a memory leak.
Optimizing Aleron’s Combat
First, we analyzed the process of running a fight in our game.
The first optimization was to deactivate the server-side frame limit. In simulation, the AI generates and executes one turn per frame. A frame limit of 30 fps therefore capped the turns per second at 30.
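A minimal sketch of how such a frame limit throttles simulation (the class and numbers are illustrative, not our actual server code): with one AI turn per frame and a 30 fps cap, every turn is padded to at least ~33 ms, no matter how fast the AI itself is.

```java
public class SimulationLoop {
    static final int FRAME_LIMIT_FPS = 30;

    // Minimum duration of one turn in ms; 0 when the frame limit is disabled.
    static long minTurnDurationMs(boolean frameLimitEnabled) {
        return frameLimitEnabled ? 1000 / FRAME_LIMIT_FPS : 0;
    }

    public static void main(String[] args) throws InterruptedException {
        for (int turn = 0; turn < 5; turn++) {
            // ... generate and execute one AI turn here ...
            Thread.sleep(minTurnDurationMs(false)); // limiter disabled: no padding
        }
    }
}
```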
Then we stumbled onto the second optimization. At the start of a fight, players have to wait one second before they can move to the position phase. This was implemented to make sure players did not accidentally start the fight without their teammates. It turned out that even games with only bots had to wait.
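The fix boils down to a one-line condition. A sketch, with names of our own choosing for illustration: apply the grace period only when human players are present.

```java
public class FightStart {
    // Pre-fight grace period so human players don't start without teammates;
    // bot-only simulation fights can skip it entirely.
    static long startDelayMs(boolean hasHumanPlayers) {
        return hasHumanPlayers ? 1000 : 0;
    }
}
```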
Now all was well, at least as long as we were at our university. But when we tried simulating at home, it was much slower. The database sits on a server at our university, where we have a LAN connection to it. At home the connection is slower, so database access takes longer. We fixed this by caching all database accesses occurring at the beginning of a fight. This means we load the data once and afterwards serve further requests from the copy we have in memory.
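The caching pattern is straightforward. Here is a hypothetical sketch (class name, key and value types are illustrative, not our actual code): the first lookup hits the slow remote database, every later lookup for the same key is served from memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical fight-data cache wrapping a slow database lookup.
public class FightDataCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> database; // stands in for the real DB call

    public FightDataCache(Function<String, String> database) {
        this.database = database;
    }

    public String get(String key) {
        // The database is only queried on a cache miss.
        return cache.computeIfAbsent(key, database);
    }
}
```

With this in place, a fight pays the network latency once per record at the start instead of on every access.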
These three optimizations already made fights sub-second. Now we used a profiler to take a closer look at the performance of our code. The following optimizations are specific to Java and might also be specific to our problem.
One bottleneck was copying collections. We create many copies of collections; some were unnecessary and could simply be removed. In most cases, collection copies were used to avoid the ConcurrentModificationException that occurs when removing objects from a collection while iterating over it. In these cases we added the objects slated for removal to a second collection and removed its contents from the main list after the iteration. This greatly reduces the number of objects copied.
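The removal pattern just described can be sketched like this (the class and the dead-unit example are ours for illustration): instead of iterating over a defensive copy of the whole list, collect only the elements to drop and remove them in one call after the loop.

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredRemoval {
    // Removes all non-positive entries without copying the whole list
    // and without a ConcurrentModificationException.
    static List<Integer> removeDead(List<Integer> healthPoints) {
        List<Integer> toRemove = new ArrayList<>();
        for (Integer hp : healthPoints) {
            if (hp <= 0) {
                toRemove.add(hp); // only the doomed elements are collected
            }
        }
        healthPoints.removeAll(toRemove); // safe: we are no longer iterating
        return healthPoints;
    }
}
```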
Another bottleneck was the creation of new collections. In many cases these collections were just read and then discarded; a filtered view on the original list would have been sufficient. Luckily, the Java 8 streaming API offers just that. A stream is a view on a collection or part of a collection. By passing streams instead of collections, we no longer need to instantiate three collections to return all allied players in a fight (getAlliedPlayers() calls getPlayers(), which calls getCharacters()).
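A sketch of that call chain with streams instead of intermediate lists (the Player record and field names are illustrative, not our actual classes): each method just adds a lazy filter to the view, and nothing is copied until a terminal operation runs.

```java
import java.util.List;
import java.util.stream.Stream;

public class Fight {
    record Player(String name, int team, boolean isBot) {}

    private final List<Player> characters;

    Fight(List<Player> characters) {
        this.characters = characters;
    }

    Stream<Player> getCharacters() {
        return characters.stream(); // a view, not a copy
    }

    Stream<Player> getPlayers() {
        return getCharacters().filter(p -> !p.isBot()); // adds a lazy filter
    }

    Stream<Player> getAlliedPlayers(int team) {
        return getPlayers().filter(p -> p.team() == team); // still no list created
    }
}
```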
We could go further, but this would mangle our code too much while increasing speed only a little, so we stopped here, at around 100 ms per fight.
Authors: Sergej Tihonov, Eike Wulff and Benjamin Justice