Author Archives: Michael Norman

About Michael Norman

Michael Norman was a Senior Software Engineer at Bazaarvoice.

MongoDB Arrays and Atomicity

Over time, even technologies that are tried and true begin to show their age. This is especially true for data stores as the sheer amount of data explodes and site traffic increases. Because of this, we are continually evaluating new technologies to determine whether they have a place in our primary stack.

To that end, we began using MongoDB for one of our new internal systems several months ago. Its quick setup and ease of use make it a great data store for starting a new project that’s constantly in flux. Its robustness and scalability make it a worthy contender for our primary data store, if it should prove capable of handling our use cases.

Being traditionally a MySQL shop, our team is used to working with databases in a certain way. When we write code that interacts with the DB, we use transactions and locks rather liberally. Using MongoDB is therefore a significant paradigm shift for our developers.

Though MongoDB doesn’t have true ACID compliance, it does support atomic updates to an individual object. And because a “single” object in Mongo can actually be quite complex, this is sufficient for many operations: setting values, incrementing numbers, and adding and removing items from an array can all be done at the same time, atomically.
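
For illustration, here is a hypothetical employee document of the shape the examples below assume:

{
  _id: ObjectId("4e5e4c0e945a66e30892e4e3"),
  firstName: "Robert",
  lastName: "Smith",
  yearsOfService: 3,
  position: {title: "Software Engineer", id: 1014},
  employmentHistory: [
    {company: "Google", title: "Intern"},
    {company: "Initech", title: "Consultant"}
  ],
  badges: [{type: "TOP_EMPLOYEE", date: ISODate("2011-08-01")}]
}

Setting top-level values is a single atomic call: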

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$set: {firstName: "Bobby", lastName: "Schmidt"}}
);
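
Several update operators can also be combined in one call, and the whole set of changes is applied atomically to the document. A minimal sketch, using the hypothetical fields above:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {
    $set: {firstName: "Bobby"},                         // set a value
    $inc: {yearsOfService: 1},                          // increment a number
    $push: {badges: {type: "MENTOR", date: new Date()}} // append to an array
  }
);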

Even setting values of child objects can be done atomically:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$set: {"position.title": "Software Developer", "position.id": 1015}}
);

In fact, individual items in an array can be changed:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$set: {"employmentHistory.0.company": "Google, Inc."}}
);

However, you may notice that the code above requires us to know the index of the array item we wish to modify. That seems simple: pull down the object, find the index of the item, and change it. The difficulty is that other operations on an array do not guarantee that the order of its elements will stay the same. For example, if an item is removed from an array and added back later, it will always be appended to the end. The only way to keep an array in a particular order is to $set it (replacing all the items in it), which for a large array may not perform well.

This problem is best demonstrated by a race condition: client A reads the document and finds the item it wants at, say, index 1; before A issues its update, client B removes a different element, shifting A’s item to a new position; A’s update to index 1 then silently modifies the wrong element.
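
A sketch of that interleaving, using the hypothetical document above:

// Client A reads the document and finds the entry it wants to edit,
// {company: "Initech", ...}, at employmentHistory index 1.

// Client B then pulls the first entry, shifting "Initech" down to index 0:
db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$pull: {employmentHistory: {company: "Google"}}}
);

// Client A's update by index now lands on the wrong (here, nonexistent) element:
db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$set: {"employmentHistory.1.company": "Initech, LLC"}}
);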

One solution is to $pull the item in question from the array, change it, and then $push it back. We decided this was the safer way to avoid the race condition above. After all, these two operations should be doable atomically.

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$pull: {badges: {type: "TOP_EMPLOYEE"}}, $push: {badges: {type: "TOP_EMPLOYEE", date: null}}}
);

So, what’s the problem? MongoDB doesn’t allow multiple operations on the same property in a single update call, so the $pull and the $push must happen as two separate (individually atomic) updates. Therefore it’s possible for two clients to interleave: client A pulls the badge; client B’s pull then matches nothing; both clients push their own versions back, and the array ends up with duplicate badges.
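
Split apart, the two individually atomic updates look like this:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$pull: {badges: {type: "TOP_EMPLOYEE"}}}
);
db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$push: {badges: {type: "TOP_EMPLOYEE", date: null}}}
);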

That is bad enough, but it can be dealt with by updating an object only if it is in the expected state, making that state part of the update’s query. Unfortunately, the following can happen as well: a reader’s query can land in the window between the $pull and the $push and see the document with no badge at all.

That looks very odd to the reader: one moment it sees the item, the next read loses it, and then it comes back. To an end user that’s bad enough, but to another system taking actions based on the data, the results can be inconsistent at best, and deleterious at worst.
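
The state-guarded update mentioned above might look like the following sketch, which pushes the badge only if no badge of that type is already present:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3"), "badges.type": {$ne: "TOP_EMPLOYEE"}},
  {$push: {badges: {type: "TOP_EMPLOYEE", date: null}}}
);

This closes the duplicate-badge race, but it does nothing for readers who query in the window between the $pull and the $push.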

What we really want is the ability to modify an item (or items) in an array by query, effectively $set with $pull semantics:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$update: {badges: {$query: {type: "TOP_EMPLOYEE"}, $set: {date: null}}}}
);

Since that’s not yet supported by MongoDB, we decided to use a map. This makes a lot of sense for our use case since the structure happened to be a map in Java. Because MongoDB uses JSON-like structures, the map is an object just like any other, and the keys are simply the property names of the object.

This means that individual elements can be accessed and updated from the map:

db.employees.update(
  {_id: ObjectId("4e5e4c0e945a66e30892e4e3")},
  {$set: {"badges.TOP_EMPLOYEE.date": null}}
);

This seems like a rather elegant solution, so why didn’t we start out this way? One pretty major problem: if you don’t know all of the possible keys to the map, there is no good way to index the fields for faster querying. Indexes on child objects in MongoDB are created using dot notation, just like the query syntax. This works really well with arrays because MongoDB supports indexing fields of objects in arrays just like any other child object field. But for a map, the key is part of the path to the child field, so each key requires its own index:

// Simple fields
db.employees.ensureIndex({lastName: 1, firstName: 1})

// Child object fields
db.employees.ensureIndex({"position.id": 1})

// Array object fields
db.employees.ensureIndex({"employmentHistory.company": 1})

// Map fields, treating map values like child objects
db.employees.ensureIndex({"badges.TOP_EMPLOYEE.date": 1})

For our purposes, we are actually able to index every key individually, and treat the map as an object with known properties.

The sinister part of all of this is that you won’t normally run into any problems while doing typical testing of the system. And in the case of the pull/read/push race described above, even typical load testing wouldn’t necessarily exhibit the problem unless the reader checks every response, looking for inconsistencies between subsequent reads. In a complex system, where data is expected to change constantly, this can be a difficult bug to notice, and its root cause even more difficult to track down.

There are ways to make MongoDB work really well as the primary store, especially since there are so many ways to update individual records atomically. However, working with MongoDB and other NoSQL solutions requires a shift in how developers think about their data storage and the way they write their query/update logic. So far it’s been a joy to use, and getting better all the time as we learn to avoid these types of problems in the future.

Ember.js — the framework formerly known as SproutCore 2.0 (Amber.js)

Update: It turns out the name Amber conflicted with another JavaScript project, so the SproutCore folks have graciously changed the name to Ember.js. Check out the details on Yehuda’s blog.

The following is the original blog post:

Yehuda Katz has just officially announced the rebranding of SproutCore 2.0 to Amber.js.

This is refreshing for all the reasons he mentions in his post, especially when attempting to learn new things about the framework. It gets confused with SproutCore 1.0 quite a bit, and I can imagine trying to find answers about Amber.js will be a much easier experience!

Bazaarvoice’s newest product, Customer Intelligence, has been using SproutCore 2.0 since well before its release, and one of the primary developers on that team, Daniel Marcotte, spent a fair bit of time over several months working with the nascent framework and helping to get Beta 2 out the door. Look for an upcoming post where Daniel goes into much more detail on that process.

Daniel Marcotte on SproutCore 2.0 (now Ember.js)

Daniel Marcotte is one of the top developers on our latest product offering, Customer Intelligence (CI). He spent quite a bit of time evaluating SproutCore 2.0 for the complex user interface requirements of CI, helping to work out some of the kinks in the product as it moved toward beta.

Daniel has recently moved to San Francisco to work on a new project — still for Bazaarvoice, of course! Before he left, I had a chance to sit down with him to chat about his experience working with the framework.

MDN: Welcome, Daniel! Thanks for letting me take some time out of your hectic schedule to get you on record, so to speak.

DM: Ah, my pleasure, Michael. Thank you for doing all the hard work.

MDN: No, no, not at all. You did all the hard work already — I’m just gonna put my name on it!

DM: I’m a team player; you go right ahead!

MDN: Awesome! So let’s get right down to it. You switch seamlessly between front-end and back-end development, and you’re very passionate about the user experience. A lot of developers in the industry want to work on back-end systems and don’t seem to get as excited about building great UIs. So, what do you think contributes to your passion? I’m really interested to hear what drives you because it also tends to drive the Web UI frameworks people choose.

DM: Yeah, that’s fair. I’ve actually given this a little bit of thought because my new manager asked me something about this recently. Something like, ‘Wow, you know, you’re really a UI guy. That’s great; how did you get into that?’ And I had kind of a funny moment, thinking it might be fruitful to start thinking of myself as a UI guy because then it’ll focus me on taking time to even go the extra level. Maybe read some stuff about user experience and even more wacky technologies. Because, I hadn’t traditionally done that. So how did I even end up in this position where someone would think, ‘of course you’re a UI guy’?

I think it’s because I don’t wanna do a shitty job. So if you’ve been tasked with building a UI feature, it needs to work. And it needs to be attractive. It needs to be usable. You need to put yourself in the user’s shoes.

MDN: It’s amazing, but most people don’t think that way. It seems like you actually know a little bit more. Have you actually done any reading and learning about any of this?

DM: I have, yeah…some. But nothing extensive. It’s a bit intuitive which I think gets people relatively far. And I think it has a lot to do with just being honest with yourself, of trying to put yourself in someone’s shoes maybe.

I have a bit of a teaching background. It’s one of my great pleasures. (Let’s ramble a little bit.) And one of my favorite things was being able to find a way to explain something to someone who it was completely new and novel to. To never forget what that was like to learn it for the first time. ‘Cause you can get lost and be an expert and think, ‘oh, it’s obvious, it’s fine.’ And I think that bleeds into UI sometimes. ‘Oh, I know how to use this.’ And that’s not the bar we’re trying to meet here.

MDN: You were instrumental in choosing SproutCore 2.0 for CI. What other Web UI frameworks did you and the team look at before deciding on SproutCore?

DM: I looked at what Dojo was offering because it has the same full app/infrastructure stuff. It was a touch inscrutable, but really appealing. But it kind of seemed to go down the Ext-JS packaged-widget route. And I knew that Alex Zelenak, our UX designer, would come in and do a lot of custom stuff. And the other big driver, though I’m perceived as choosing this, I was really just the person who made it possible. It was really Alex Sexton that said, ‘this is the one; this is it for these reasons.’ So that kind of focused it. And it sort of became the one, and it needed to lose for some reason as opposed to beat something else.

MDN: Did Alex talk about Handlebars at all, because I know that he’s using it in other places.

DM: Oh, that’s interconnected. You use Handlebars templating inside SproutCore.

MDN: I’m just showing my ignorance!

DM: It’s like a specialized version of Handlebars that’s bindings-aware. So as your values change, your templates re-render.

MDN: You talked about this a little bit, but does SproutCore have widgets too?

DM: No. They’d like to get there, but it’d be like a UI layer. This is more like the structure/spine of the app.

MDN: So other than Alex’s stamp of approval, what are the other reasons you might use it in the future?

DM: I like the way it makes me think. I like the flavor of thinking it lends itself to. This disconnect with the bindings…. (Let me see if I can package this a little bit.) Trying to wrangle web UI using HTML, CSS, JavaScript without making the classic disaster, the classic mess, has been a mission of mine for a while. What’s the answer? What’s the secret? How do we make this maintainable and obvious to the next developer? Extensible, that sort of stuff.

And one of the big realizations I had a while ago was, with Don’t Repeat Yourself, the DOM, it’s the boss in so many things. I have a list of things I’m showing the user, and I don’t wanna duplicate in JavaScript having another list of things and try to synchronize those two. It’s much more a matter of saying something like… DOM: you’re the boss. You’re displaying it. You have these nodes that represent my things. So if I need to reevaluate, I’m just gonna ask you, ‘Hey dude, what’s my list look like?’ OK, I’m gonna modify it inside you ’cause I wanna display it. ‘Cause you’re the boss. And so in practice this works out really well.

MDN: ‘Cause that’s kind of how jQuery works, because you have this separate set of code over here and it’s overlaid on top of the DOM.

DM: Yeah exactly. But for some richer apps and much more complex apps, asking the DOM those questions gets more complicated ’cause stuff is kinda all over the place. And so you end up with these really bonkers selectors. And it’s slow to be constantly asking the DOM these questions to see what’s going on with it. So then you start to pull back and say, ‘Oh, crap, I wish I had an object representation of this in my hand!’ Is it worth writing this connective tissue? And so now, enter SproutCore, and the app is sort of pure JS and then you bind it to the UI and you don’t maintain that connection; SproutCore does. Which is cool.

So the classic collections example that ties into the list example earlier: where before the DOM was the boss, now JavaScript’s the boss. I have this list and I’m gonna monkey with it. Using observers and bindings, SproutCore’s gonna hold onto the fact that the list corresponds to something in the DOM that’s rendered by a template. When the data changes, the template is called to re-render that piece.

MDN: So SproutCore owns the DOM then.

DM: Effectively, and it can do smart stuff like buffer your updates. So it’ll just be like manually updating the DOM, but will be a lot faster. Then you kinda get into their flavor of MVC. SproutCore MVC gets quite appealing where you can really build an app that does its thing and then you can write unit tests against that. And when you bind the UI on it, that’s just another view. But independently, you have this app that you can validate does its thing and is independent of how it’s displayed. You do have to make sure you structure it carefully enough that you’re not getting into circular references in your bindings, but it’s super-appealing that your templates are HTML and you just bind them in with the app and there’s a lot of really great ideas. It’s technically very satisfying.

MDN: And do you just bind it on top of a div element, because you’re going to have to put it in somewhere, right? So there’s some ID where you say, hey put this over here in this spot?

DM: You have a script block that has a type “text/html” or “text/x-handlebars”. And one of the first things SproutCore does is it looks at the page and any script block of that type, if it has an ID on it, then it assumes you’re gonna reference this later and so it will just pull it out of the DOM and store it compiled. Otherwise, it’ll replace it inline. You get to write very naturally, right inline, if you’re not reusing that component. It’s got a nice flow to it where it will just replace that piece with the markup it corresponds to.

MDN: Did you run into any problems with SproutCore’s architecture as you were building out the system at all?

DM: So, to even decide if we wanted to do this, I threw together a prototype that recreated what the pilot’s UI — the interesting parts — did. And I was really able to see the value because we’re managing different views on the same data. And the user gets to toggle displayed data and interact with what they’re seeing. And so this idea of an app that will hold that state and where you can just swap out these views, it’s great because you can just say “app, here’s some data, do your thing” and now you’re not fetching from the server again and again if you just wanna re-slice something or have a different look at it; you have all that power easily on the client side.

So that was all going really well, but then I’d run into mundane stuff. Like: you’ve sent in a collection and you’re running over it and the designer wants to number each row. Perfectly reasonable request, but the context that you have in the collection rendering is just the collection item that you’re on. So do you artificially start adding indexes to your objects externally? Or do you say we should really have access to the index? And so that was kinda fun; I submitted a patch which exposed the current index when rendering a collection. Another example is that Tapestry [the server side web framework in this project] hosed us because you have almost no access to the head of the documents that it generates, which in general isn’t a problem, but in one of the early versions of SproutCore, it was only looking for stored templates in the head, which was an artificial restriction.

MDN: But not really any big pains or roadblocks. Nothing you couldn’t work around or fix.

DM: There was naturally a bit of instability in a few of the alpha versions I was using. And then there’s a tougher learning curve that goes with that: you can imagine being unable to distinguish between a flaw in understanding and a flaw in the library. But that’s come a long way.

There are also things that it really super-facilitates incredibly, but as with a lot of these things, it can mean limiting you in other ways, like there being hoops to jump through to integrate with say a select control.

And then we got into trouble where other developers started building out what I had there, and it wasn’t obvious that this code worked differently than they were used to. I hadn’t quite gotten it polished to the point of “this is how it’s done, and you grow it from here and follow these patterns.” The schedule comes in and you end up with this problem. So there was a bit of: “I wanna do this thing, and I can picture that vision in the ‘jQuery way.’ Well now, what’s the ‘SproutCore way’ of doing exactly that?” And you end up kind of cheating and creating artificial properties that you can bind to, kind of tricking it into doing things the ‘jQuery way’ instead of doing what is really the ‘SproutCore way’. And that’s really challenging.

MDN: Well, it’s the speed of sort of doing it versus doing it in a way that it’s gonna be good later.

DM: Yeah. Do you want the speed now, or more of it later? And that’s a really hard choice to make. And I’ve been ruminating on whether that’s a flaw on our part, or on the library’s part. How do we keep pushing that stuff forward and make it even better? I’m not sure.

MDN: You made a pretty concerted effort to get SproutCore out of beta and into mainstream for release 2.0. So, you actually fixed a couple of issues. How did that process go? How did you work with the team and get them to take your patches?

DM: Contributing to open source had been mysterious to me for a long time. And I’ve met a lot of people who are also like, ‘oh I want to contribute to open source, but I kinda like don’t know how. And IRC makes me cross-eyed.’ What do you do? In a lot of ways, this was just sort of really kinda lucky — a nice sort of confluence of events. And in my mind some of the things that made it possible was actually meeting Tom Dale and Yehuda Katz (who are great, by the way) and shaking their hands [at TXJS]. And they don’t have to remember it, but that’s kinda your way in later to politely say, ‘Hey, remember when I expressed interest in your framework? I’m using it and I don’t need anything from you. I’m just sharing good news. I’m not after you for anything.’

And then, GitHub makes a lot of this brilliantly easier. You make your fork, you monkey around, you learn something. Figuring out how GitHub works is pretty straightforward. You still see some trolling and joke pulls and that sort of stuff, but it’s kind of neat and friendly once you get your head around it. So then to contribute to a project: keep it small, keep it focused, explain why, and write a unit test for whatever you’re putting up. And then you just post it, and then you wait.

I also started a bit of not-constant (it was a real priority for me not to bother these guys) but still regular e-mail correspondence with Tom Dale, so I think for my initial patches, the SproutCore guys were able to connect them with me. I think that sort of gets you in the door, so you’re not just some random unknown pull request. And then it was pretty easy. It was super satisfying to get even simple messages like, ‘Hey, thanks!’ And then they merge in your code, and that’s pretty cool.

Being able to take some time to work on that at BV is brilliant. People knew that I was investigating some of this stuff and there was a lot of support. It was cool. And John [Daniel’s manager] got a lot of satisfaction seeing my patches going in and that sort of stuff. So yeah, it was really cool.

I still think navigating all this is kind of hard. It’s social. You get self-conscious someone’s probably going to be a dick to you. Someone’s probably going to ignore you ’cause you’re nobody. But this experience was really good so it’s worth trying more. And I would encourage other people too: find something you’re passionate about, some little tool you use you wish worked a bit different. You’re probably bright enough at this point; you can actually figure out how to make a useful change, even just for yourself initially.

MDN: What kind of support did you get from the rest of the R&D organization here? You mentioned you could do some of it during work hours, which is great.

DM: I just tried to not be stupid about it. As long as I was being productive and watching my diminishing returns level, I was free to put in the time I needed. The VP of Engineering said we should at least try and do something interesting. He wanted to know “Is that feasible?” And so that’s the kind of support. I’m like yeah, sure, I believe that it’s feasible and I have a vision for walling it off so we’re not betting the whole product on it. I can try, but you’ve just introduced some risk that we may have to pay for, we may have to mitigate. And he was fully supportive, and so when that directive to try something fresh comes from that high up, that’s kind of the definition of support. You know what I mean? And I felt like I had the freedom to at any time make the call that we were going to bail on that course of action (the diminishing returns thing). I don’t think anyone would have turned on me and been like, ‘what the hell, Daniel, you didn’t even use that technology. You spent all that time on it.’ I think they would have been like, ‘Thank you for actually evaluating this and making a technical decision.’

MDN: So, would you recommend SproutCore 2.0 for everybody?

DM: I have no idea. It was still in its infancy during this project, so “everybody” seems a bit strong. But I really enjoyed using it, and I’m looking forward to using it more and watching it grow.

MDN: Thanks Daniel, for taking the time to chat with me! It was very educational.

Daniel brought in his laptop and showed me https://github.com/dmarcotte/ember-chess. Specifically, we walked through his very informative presentation in just a few minutes: https://github.com/dmarcotte/ember-chess/blob/master/presentation.js.

Thank you for sharing this interview. I’m evaluating ember now, and it really excites me that we are finally seeing an appeal to structure and proper separation of concerns on the client-side in JavaScript.

I too am a “hybrid” developer of sorts, with the user experience at the forefront of my passion and enjoyment. It’s a good place to be nowadays; I remember not too long ago when the art of UI wasn’t getting the respect it deserved. I think Steve Jobs did a good job of helping correct that.

Anyway, having a fellow UX aficionado come off a successful project with Ember is pretty encouraging. I’m now pretty confident I’m heading down the right path.

P.S. ctrl + h, sproutcore, tab, ember, enter 😉

Fixing a performance problem a few days before release (Part 2)

In the previous post, we discussed a performance issue we were facing, and some of the steps we tried while investigating it to no avail.

Profiling Concurrent Requests

Running concurrent requests a few times, I could indeed reproduce the 3x slowdown for new code that we were seeing in the performance environment. The next step was to profile each version during the concurrent test run.

With the default JProfiler settings, the server was only able to process a few requests per minute. That wasn’t enough to see the performance effects, and it was clear that profiling itself was taking far too much time and would skew the results considerably. After talking to a few folks, I was shown the CPU profiling options. There are two primary types of CPU profiling: Dynamic Instrumentation and Sampling.

[Screenshot: JProfiler’s CPU profiling options, offering Dynamic Instrumentation or Sampling]

Dynamic Instrumentation, as the name implies, instruments every class as it is loaded by the VM. Every method call logs data into JProfiler, which makes small, fast methods appear far more expensive than they truly are and can skew the results considerably when those methods are called many, many times.

Sampling instead captures the state of the JVM every few milliseconds, recording which methods are executing at each instant and extrapolating how much time each method takes overall. This approach can’t report overall call counts, so we can’t know the average time per call, but it requires far fewer resources than instrumentation, especially for small methods.
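
To make the distinction concrete, here is a toy sketch of the sampling idea in plain Java (not JProfiler’s actual mechanism): periodically grab a worker thread’s stack and tally the topmost frame.

import java.util.HashMap;
import java.util.Map;

public class ToySampler {
    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> hits = new HashMap<>();
        Thread worker = new Thread(ToySampler::busyWork);
        worker.setDaemon(true);
        worker.start();

        // Sample the worker's stack every 10 ms for about 5 seconds.
        for (int i = 0; i < 500; i++) {
            StackTraceElement[] stack = worker.getStackTrace();
            if (stack.length > 0) {
                String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(frame, 1, Integer::sum);
            }
            Thread.sleep(10);
        }

        // Frames that appear in the most samples are, statistically,
        // where the thread spends its time.
        hits.entrySet().stream()
            .sorted((a, b) -> b.getValue() - a.getValue())
            .limit(5)
            .forEach(e -> System.out.println(e.getValue() + " samples  " + e.getKey()));
    }

    private static void busyWork() {
        double sink = 0; // keep the loop busy; the value itself is unused
        while (true) {
            sink += Math.sqrt(Math.random());
        }
    }
}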

After switching to sampling, the server started up much faster under the profiler, and the number of requests handled while profiling was much more reasonable. Looking at the results after several requests were processed in parallel, the method consuming the most processor time was quite obvious:

[Screenshot: JProfiler hot spots view, with the top method at 42% of overall processing time]

The first method in the list, using 42% of the processing time, definitely appeared to be the culprit. To identify the specific method, JProfiler normally shows the class name along with the method name. Here, however, the class name was wrong: JProfiler attributed the method to a JVM-generated method accessor. I never determined the cause of that, but the offending method was primarily called from one other method, so it was easy enough to find.

Taking the wrong path

The method itself should have been very simple: it just returned the product ID from an XML object. Commenting out the call to this method and rerunning the tests confirmed that it was indeed the problem. With no indication of where the actual cost was, my first guess was that the problem had to do with XML processing, specifically the XPath accessors. I attempted to fix it by using other methods of property retrieval, but nothing I tried solved the problem in the concurrent test run.

I reran the profiler with the newest code to see if any new information would show up. It should be noted that each test run takes several minutes to set up and execute, so this had already been quite a few hours of trial and error. Running the application under the profiler takes even longer, which is why I had avoided doing it again up to this point.

After running the application again, the top hit was StringBuilder.append. That was odd, because the method in question didn’t use StringBuilder at all. It did have a simple String concatenation, but that should have been cheap:

Assert.isNotNull(idAttribute, "Product is missing required 'id' attribute: " + product);

Behind the scenes, the Assert.isNotNull method is doing something like the following:

public static void isNotNull(Object value, String message) {
    if (value == null) {
        throw new IllegalArgumentException(message);
    }
}

The more interesting part is that, though idAttribute is null very rarely (and hopefully never), the String concatenation happens every single time the method is called, which is millions of times! (The Java compiler implements String concatenation with a hidden StringBuilder, which is why StringBuilder.append showed up in the profiler.) Plus, the product variable is rather large, and its toString() method generates a large XML String from the data. This simple line turned out to be the entire slowdown of the app.
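
Behind the scenes, the compiler turns that call into something roughly like the following sketch (not an exact decompilation):

// The message is built eagerly, on every call, whether or not the assertion fails:
Assert.isNotNull(idAttribute,
        new StringBuilder("Product is missing required 'id' attribute: ")
                .append(product) // invokes product.toString(), producing a large XML String
                .toString());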

The quick solution, since we were so close to release, was to replace the line with the isNotNull logic inlined. With the String concatenation guarded by the null check, it was never actually executed. In addition, we looked through the codebase for similar concatenations and removed them in the same way.
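
Inlined, the quick fix looks something like this:

if (idAttribute == null) {
    throw new IllegalArgumentException("Product is missing required 'id' attribute: " + product);
}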

The long-term solution was to change the Assert class methods so that they used MessageFormat:

public static void isNotNull(Object value, String message, Object... messageParams) {
    if (value == null) {
        throw new IllegalArgumentException(MessageFormat.format(message, messageParams));
    }
}

The isNotNull calls look like the following:

Assert.isNotNull(idAttribute, "Product is missing required \"id\" attribute: {0}", product);

The String concatenation and the product.toString() call never happen unless the assertion fails, which should be rare. MessageFormat.format is more expensive than a String concatenation, but since it is called so infrequently, the savings far outweigh the per-call cost. Plus, the code is clearer and much more expressive than the concatenation version.

Lessons Learned

There were quite a few lessons learned throughout this process. The first and likely most important is that simple, seemingly innocuous code can sometimes have a dramatic impact on overall performance of the system. In addition, I (re)learned quite a bit about performance debugging in general, and learned a lot about the performance testing tools in particular.

Debugging performance problems tends to take quite a bit of time, simply because of the sheer volume of data or requests that causes the problem in the first place. Getting reproducible scenarios and tests that produce the same results time after time for comparison is an added challenge, but it is extremely important for determining where the true bottlenecks are. By replaying production request logs to simulate traffic, the requests become highly reproducible and much closer to real-world scenarios. Failing to get reproducible scenarios will lengthen the entire process.

These types of problems don’t come up often, but when they do, it’s really important to get them identified and fixed quickly. Hopefully this experience will help other developers crack open any performance problems and come to a quick solution.

Fixing a performance problem a few days before release (Part 1)

The Problem

Every few weeks we release a major version of our software. Throughout the release, we have many processes in place to ensure we don’t cause regressions in our latest code. Since we serve up millions of pages each day, regressions in scalability are just as important as failures.

At code freeze, approximately one week prior to release, we start running performance tests. The tests themselves are pretty basic: we have scripts that hit a server running the new code with a large number of concurrent requests. We then compare the results with a baseline computed by executing the same requests against a server running the previous version’s code on the same hardware and database. The results are usually within a reasonable margin of error. This time, however, the new code was only able to process about 1/3 of the requests that the old code could.

In our current product offering, we have two main types of public traffic: content display and content submission. Either one could have been the culprit, since we typically run all types of requests through the system during the first run. A subsequent run of just display traffic proved that display was causing the majority of the slowdown, even without submissions in the mix. Another key piece of information was that the database didn’t appear to be the culprit: the processors on the app server were pegged, while the database servers were handling queries as normal. One of the usual suspects had an alibi! Of course, we weren’t ruling out the possibility that the database was still an issue, or possibly even Solr.

Finding the Bottleneck

The results so far indicated that we would probably need to profile the app and not just look for bad queries. In order to find the offending code, most Java profilers must connect to the JVM in a way that gives them access to low-level data. We use JProfiler for these types of issues, and it integrates very well with IntelliJ IDEA, but that requires reproducing the issue locally: not always an easy thing to do when large systems with enormous databases are involved.

Running locally, I would get different performance results than running in the performance environment. In fact, sometimes my Mac Pro is faster than the performance environment. Therefore, I would need to run both new and old code on my box to see how they compare – and where the breaking point is.

We have a few options for local testing. The first is to extract a large data set for use locally, which means importing the data into a local MySQL DB and reindexing the content into Solr. Unfortunately, that is a multi-hour process for a large data set. In addition, the request logs span all traffic for a particular server, so that means a multi-GB data set. I decided to start the extraction and import process in case it proved necessary, and moved on to option 2.

The next option was to access our lab databases. The databases in the lab don’t have as much data as the extractions, but they have enough for a quick-and-dirty performance test. The big issue is that they run on different hardware, so it was possible the comparison wouldn’t hold up perfectly. I decided to try it anyway, and as the original performance tests predicted, the new code performed about half as well as the old code on each request. I could see these results just by executing the requests serially, so I thought I had a good candidate for profiling.

Profiling, Take 1

Profiling a single request in the new version, I didn’t find any method call that really stood out.

The highest-cost method came in at only 4%, and it turned out to be in the HTML templating system we use. JProfiler has the ability to compare two snapshots, so I took a snapshot for the new code and then reran the request for the old code while profiling. Comparing the two showed essentially no change between the two codebases for these calls.

It looked like Freemarker could be the culprit, but the increasing and decreasing parts of the comparison seemed to cancel each other out. That doesn’t make sense if all of our time is supposedly spent in Java.

DB slowdown?

Looking more closely at the JProfiler snapshot output, there’s an option to show which methods and threads are being filtered out. The threads in the output were all in the “Running” state. That makes sense if the problem is in running Java code, but what if the problem is in the database? In that case, our Java code would be spending its time waiting. It turns out that JProfiler has a specific option to show JDBC calls, and with it enabled, a few calls were taking 15% of the time under the new code.

The nice thing about saving off a snapshot for each code version is that I could keep examining the data long after the server had been shut down. Comparing the new and old code again, a few queries were taking 10x longer to execute on the new system.

This was a problem, but at that point I realized it was not likely to be our main problem. Looking at the number of invocations, we weren’t calling these queries any more often in the new code than in the old; the exact same queries were simply taking longer. That could mean we changed the indexes on these tables (nope!), or there is much more data in the database (probably not the case), or, most likely, the hardware was drastically affecting DB query performance.

Switching DBs

Back to square one. I needed to use the same DB hardware on both sides of the comparison, because differing hardware skews the results too much: DB query time overshadows processing time. Also, since we weren’t seeing heavy load on the DB servers during these tests, we still suspected the problem was in the Java code.

At this point I didn’t yet have a reliable local DB environment, so I decided to run against the performance-testing DBs directly. We had to shut down performance testing, but otherwise this was surprisingly easy. While trying it out, I ran into an unexpected bonus: both the new and the old code were able to run against the new schema! That meant truly testing against the exact same DB was possible.

I ran the serial tests again with both the new and old code, and found that their performance was actually quite similar. Profiling both versions confirmed it: there were really no differences between the new and old code outside a normal margin of error, and certainly nothing taking 3x longer to process requests. However, there is a lot of processor dead time when handling requests serially, because so much time is spent waiting for the DB. I wasn’t able to reproduce the negative results running requests one at a time.

In the concluding post, we’ll get past the red herrings and drive to a solution.