The Problem
Every few weeks we release a major version of our software. Throughout the release cycle, we have many processes in place to ensure we don’t introduce regressions into the latest code. Since we serve up millions of pages each day, regressions in scalability are just as important to catch as outright failures.
At code freeze, roughly one week prior to release, we start running performance tests. The tests themselves are pretty basic: we have scripts that hit a server running the new code with a large number of concurrent requests. We then compare the results with a baseline computed by executing the same requests against a server running the previous version’s code on the same hardware and database. The results are usually within a reasonable margin of error. This time, however, the new code was only able to process about 1/3 of the requests that the old code could.
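Our actual scripts aren’t shown here, but the idea is straightforward. As a rough sketch (the host name, paths, thread count, and duration below are made up for illustration, not our real configuration), something like the following fires a fixed number of concurrent workers at a server and reports throughput, which we then compare against the baseline run:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class SimpleLoadTest {
    public static void main(String[] args) throws Exception {
        // Hypothetical target and paths; real scripts would replay logged request URLs.
        String baseUrl = "http://perf-server.example.com";
        List<String> paths = Arrays.asList("/content/1", "/content/2", "/content/3");

        int concurrency = 50;          // number of simultaneous workers
        int durationSeconds = 60;      // how long to keep hitting the server
        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(durationSeconds);

        ExecutorService pool = Executors.newFixedThreadPool(concurrency);
        AtomicInteger completed = new AtomicInteger();

        for (int i = 0; i < concurrency; i++) {
            pool.submit(() -> {
                byte[] buf = new byte[8192];
                int n = 0;
                while (System.nanoTime() < deadline) {
                    String path = paths.get(n++ % paths.size());
                    try {
                        HttpURLConnection conn =
                                (HttpURLConnection) new URL(baseUrl + path).openConnection();
                        try (InputStream in = conn.getInputStream()) {
                            while (in.read(buf) != -1) { /* drain the response */ }
                        }
                        completed.incrementAndGet();
                    } catch (Exception e) {
                        // a real script would track failures separately
                    }
                }
            });
        }

        pool.shutdown();
        pool.awaitTermination(durationSeconds + 30, TimeUnit.SECONDS);
        System.out.printf("%d requests in %d s (%.1f req/s)%n",
                completed.get(), durationSeconds,
                completed.get() / (double) durationSeconds);
    }
}
```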
In our current product offering, public traffic falls into two main categories: content display and content submission. Either one could have been the culprit, since we typically run all types of requests through the system during the first pass. A subsequent run of display-only traffic proved that display was causing the majority of the slowdown, even without submissions in the mix. Another key piece of information was that the database didn’t appear to be the culprit: the processors on the app server were pegged, while the database servers were handling queries as normal. One of the usual suspects had an alibi! Of course, we weren’t ruling out the possibility that the database was still an issue, or possibly even Solr.
Finding the Bottleneck
The results so far indicated that we would probably need to profile the app and not just look for bad queries. To find the offending code, a Java profiler needs to connect to the JVM in a way that gives it access to low-level data. We use JProfiler for these types of issues, and it integrates very well with IntelliJ IDEA, but that requires reproducing the issue locally, which is not always easy when large systems with enormous databases are involved.
Running locally, I get different performance numbers than the performance environment does; in fact, my Mac Pro is sometimes faster than the performance environment. That meant I would need to run both the new and old code on my box to see how they compare, and where the breaking point is.
We have a few options for local testing. The first is to extract a large data set for use locally, which means importing the data into a local MySQL DB and reindexing the content into Solr. Unfortunately, that is a multi-hour process for a large data set. In addition, the request logs span all traffic for a particular server, so we’re talking about a multi-GB data set. I decided to start the extraction and import process in case it turned out to be necessary, and moved on to Option 2.
The next option was to access our lab databases. The lab databases don’t have as much data as the extractions, but they have enough for a quick-and-dirty performance test. The big issue is that they run on different hardware, so the results might not match perfectly. I decided to try it anyway, and as the original performance tests suggested, the new code performed about half as well as the old code on each request. I could see these results just by executing the requests serially, so I thought I had a good candidate for profiling.
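The serial test is even simpler than the load test: replay the same handful of requests one at a time against each build and compare average latency. A minimal sketch of that idea, again with hypothetical hosts and paths rather than our real ones:

```java
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;

public class SerialCompare {
    // Hypothetical servers running the old and new builds against the same lab DB.
    private static final List<String> HOSTS = Arrays.asList(
            "http://old-build.example.com", "http://new-build.example.com");
    private static final List<String> PATHS = Arrays.asList("/content/1", "/content/2");

    public static void main(String[] args) throws Exception {
        byte[] buf = new byte[8192];
        for (String host : HOSTS) {
            long totalNanos = 0;
            for (String path : PATHS) {
                long start = System.nanoTime();
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(host + path).openConnection();
                try (InputStream in = conn.getInputStream()) {
                    while (in.read(buf) != -1) { /* drain the response */ }
                }
                totalNanos += System.nanoTime() - start;
            }
            System.out.printf("%s: avg %.1f ms/request%n",
                    host, totalNanos / 1_000_000.0 / PATHS.size());
        }
    }
}
```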
Profiling, Take 1
Profiling a single request in the new version, I didn’t find any method call that really stood out:
The most expensive method above accounts for only 4% of the time, and it turns out to be in the HTML templating system we use. JProfiler has the ability to compare two snapshots, so I took a snapshot of the new code and then reran the request against the old code while profiling. Comparing the two, we see that there is really no change between the two codebases for these calls:
It looks like Freemarker could be the culprit, but the increasing and decreasing parts of the results seem to cancel each other out. That doesn’t make sense if all of the extra time is supposedly spent in Java code.
DB Slowdown?
Looking more closely at the JProfiler snapshot output, I found an option to show which methods and threads are being filtered out. The threads in the output above were all in the “Running” state. That makes sense if the problem is in running Java code, but what if the problem is in the database? In that case, our Java code would be spending a lot of time waiting. It turns out that JProfiler has a specific option to show JDBC calls. In the following output, we can see a few calls taking 15% of the time while running the new code:
The nice thing about saving off a snapshot for each code version is that I can keep looking at the data long after the server has been shut down. Looking at the comparison between new and old code again, we see a few queries that take 10x longer to execute in the new system:
This is a problem, but at that point I realized it was not likely to be our main problem. The invocation counts show that we don’t call these queries any more often in the new code than in the old; the exact same queries are simply taking longer. That could mean we changed the indexes on these tables (nope!), or that there is much more data in the database (probably not the case), or (most likely) that the hardware is drastically affecting DB query performance.
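One quick way to check whether hardware (rather than indexes or data volume) is to blame is to time the exact same query directly over JDBC against both databases. A hedged sketch along those lines, with a made-up query and connection URLs, assuming the MySQL driver is on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class QueryTimer {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URLs: the perf-test DB vs. the lab DB, same schema.
        String[] urls = {
                "jdbc:mysql://perf-db.example.com/app?user=reader&password=secret",
                "jdbc:mysql://lab-db.example.com/app?user=reader&password=secret"
        };
        // A stand-in for one of the slow queries from the JProfiler snapshot.
        String sql = "SELECT id, title FROM content "
                + "WHERE site_id = ? ORDER BY created DESC LIMIT 50";

        for (String url : urls) {
            try (Connection conn = DriverManager.getConnection(url);
                 PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setLong(1, 42L);
                long start = System.nanoTime();
                try (ResultSet rs = ps.executeQuery()) {
                    int rows = 0;
                    while (rs.next()) rows++;   // pull the full result set over the wire
                    System.out.printf("%s: %d rows in %.1f ms%n",
                            url, rows, (System.nanoTime() - start) / 1_000_000.0);
                }
            }
        }
    }
}
```

If the lab DB is consistently several times slower for the same query, the hardware explanation holds, and the profiler’s JDBC numbers can’t be compared across environments.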
Switching DBs
Back to square one. I needed to use the same DB hardware for these comparisons: running against different hardware skews the results too much, mainly because DB query time overshadows processing time. Also, since we weren’t seeing heavy load on the DB servers during these tests, we still suspected that the problem was in the Java code.
At this point I didn’t yet have a reliable local DB environment, so I decided to run against the performance-testing DBs directly. We had to shut down performance testing, but otherwise this was surprisingly easy. While trying it out, I ran into an unexpected bonus: both the new and old code were able to run against the new schema! That meant truly testing against the exact same DB was possible.
I ran the serial tests again against both the new and old code, and found that their performance was actually quite similar. Profiling both versions confirmed this: there were really no differences between the new and old code outside a normal margin of error, and the new code definitely wasn’t taking 3x longer to process requests. However, when handling requests serially there is a lot of processor dead time on the app server, because so much of each request is spent waiting for the DB. I wasn’t able to reproduce the negative results running requests one at a time.
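To make that dead time concrete: if, hypothetically, a request spends 80 ms waiting on the database and only 20 ms on the CPU, the app server’s processor sits idle 80% of the time, and even a large CPU regression barely moves the serial numbers. The sketch below (with simulated timings, not our real request path) shows one way to quantify the gap by comparing a thread’s wall-clock time to its CPU time:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class CpuVsWallTime {
    public static void main(String[] args) throws Exception {
        // Assumes thread CPU time measurement is supported on this JVM.
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        long wallStart = System.nanoTime();
        long cpuStart = threads.getCurrentThreadCpuTime();

        // Stand-in for one serial request: mostly waiting on the DB,
        // with a little rendering work on the CPU.
        Thread.sleep(80);                        // simulated DB wait (~80 ms)
        long busyUntil = System.nanoTime() + 20_000_000L;
        while (System.nanoTime() < busyUntil) {  // simulated CPU work (~20 ms)
            Math.sqrt(Math.random());
        }

        long wallMs = (System.nanoTime() - wallStart) / 1_000_000;
        long cpuMs = (threads.getCurrentThreadCpuTime() - cpuStart) / 1_000_000;

        System.out.printf("wall=%d ms, cpu=%d ms, cpu busy %.0f%%%n",
                wallMs, cpuMs, 100.0 * cpuMs / wallMs);
    }
}
```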
In the conclusion of this post, we’ll get past the red herrings and drive to a solution.