Using the Cloud to Troubleshoot End-User Performance (Part 1)

Debugging performance issues is hard. Debugging end-user performance issues in a distributed production software stack is even harder, especially when you are a 3rd-party service provider and your client controls how your code is integrated into their site. There are plenty of articles on the web about performance best practices, but few, if any, discuss the tactics of actually applying and iterating on them.

The challenge stems primarily from the fact that troubleshooting performance issues is always iterative. If your production operations can handle deploying test code to production on a rapidly iterative schedule, then you can stop reading this post: you are already perfect, and worthy of a cold brewski for your impressive skills.

As our team and software stack continued to grow in size and complexity, we restructured our client-side (JS, CSS, and HTML) code a while back so that we deliver only the code actually needed by each client's site and configuration. We did this by reorganizing our code into modules that the browser loads asynchronously using require.js. Effectively, this took us from a monolithic JS and CSS include containing a lot of unused code and styles to a much smaller average deliverable, consumed in smaller chunks by the browser.
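As a rough illustration of the approach (the module names and functions below are hypothetical, not our actual code), a require.js setup looks something like this: each module declares its own dependencies, and require.js fetches only what a given page needs, asynchronously.

```js
// main.js - entry point delivered to the client page.
// require.js fetches each listed module asynchronously as a separate HTTP request,
// so a page only pays for the modules its configuration actually uses.
require(['widget/core', 'widget/comments'], function (core, comments) {
  core.init();
  comments.render(document.getElementById('comments'));
});

// widget/comments.js - an individual module, defined with its own dependencies.
define(['widget/core', 'widget/templates'], function (core, templates) {
  return {
    render: function (el) {
      el.innerHTML = templates.commentList(core.getConfig());
    }
  };
});
```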

This technique is a double-edged sword and, like most things, is best done in moderation. Loading multiple modules asynchronously in the browser results in multiple HTTP requests, and every HTTP request made by the browser incurs overhead in the following steps (see the sketch after this list):

  1. DNS Lookup – Only performed once per unique hostname that the browser encounters.
  2. Initial Connection – This is simply the time that the browser takes to establish a TCP socket connection with the web server.
  3. SSL Negotiation – This step is omitted if the connection is not secure. Otherwise, it is the SSL/TLS handshake, including the certificate exchange.
  4. Time to First Byte (TTFB) – The time from when the browser finishes sending the request to when it receives the first byte of the response.
  5. Content Download – This is the time spent downloading all the bytes of the response from the web server.
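For reference, modern browsers expose each of these phases through the Resource Timing API (not available in IE7, of course, but handy for seeing where the time goes). A minimal sketch:

```js
// Break each resource request into the phases listed above using the
// Resource Timing API. Cross-origin resources report zeros for the detailed
// timings unless the server sends a Timing-Allow-Origin header.
performance.getEntriesByType('resource').forEach(function (r) {
  console.log(r.name, {
    dns: r.domainLookupEnd - r.domainLookupStart,
    connect: r.connectEnd - r.connectStart,
    ssl: r.secureConnectionStart > 0 ? r.connectEnd - r.secureConnectionStart : 0,
    ttfb: r.responseStart - r.requestStart,
    download: r.responseEnd - r.responseStart
  });
});
```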

There are many great resources from Yahoo! and Google that discuss best practices for improving page performance. I won't re-hash that great information, nor dispute any of the recommendations their authors make. I will, however, discuss some tactics and tools that I have found useful for analyzing and iterating on performance enhancements to a distributed software stack.

Mission

A few months ago, we challenged ourselves to improve our page load performance in IE7 under a bandwidth-constrained (let’s call it “DSL-ish”) environment. DSL connections vary in speed and latency from ISP to ISP.

I won’t bore you with the details of the changes we ended up making, but I want to give you a flavor of the situation we were trying to solve before I talk about the tactics and tools we used to iterate on the best practices that are available all over the web.

The unique challenge here is that IE7 allows only 2 simultaneous connections to the same host at a time. Since our software delivers multiple, very small modules of JS, CSS, and images, we were hitting this 2-connections-per-hostname limit with a vengeance. Commonly accepted solutions involve image spriting, file concatenation, and distributing or “sharding” requests for static resources across multiple domains. The sharding tactic made us realize the other constraining factor we were dealing with: the longer latency of HTTP requests on a DSL connection, which is exacerbated when making DNS lookups for a larger set of distinct hostnames.
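To give a flavor of the sharding tactic (the hostnames and hash below are purely illustrative, not our production setup), the key is that a given resource always maps to the same shard, so the browser cache stays effective while requests spread across more parallel connections:

```js
// Illustrative domain sharding: map each static resource path to one of a small,
// fixed set of hostnames so IE7 can open more than 2 parallel connections,
// while any given path always resolves to the same shard (cache-friendly).
var SHARDS = ['static1.example-cdn.com', 'static2.example-cdn.com'];

function shardUrl(path) {
  var hash = 0;
  for (var i = 0; i < path.length; i++) {
    hash = (hash * 31 + path.charCodeAt(i)) % SHARDS.length;
  }
  return 'http://' + SHARDS[hash] + path;
}

// usage: scriptTag.src = shardUrl('/js/widget/comments.js');
```

Note the trade-off the text describes: every additional shard hostname is another DNS lookup, which is exactly the cost that hurts most on a high-latency DSL connection, so the shard count should stay small.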

Tools

The tools we used to measure and evaluate our changes shaped the tactics we used, so I’ll discuss them first.

Charles Proxy

Charles Proxy is a cross-platform tool that provides some key features that really aided our analysis. Most importantly, its built-in bandwidth throttling let us simulate specific latency and upload/download conditions from our local machines, so we used Charles Proxy for rough, on-the-spot analysis of changes. It also let us quickly see aggregate numbers for the specific metrics we were interested in: the total number of requests, the total duration of all requests, and the total response size of all requests. Since these numbers are affected by the rest of the code on our client’s site (code that isn’t ours), Charles let us filter out the resources that were not ours while still showing how our software behaved in the presence of our client’s code.
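When Charles isn’t handy, a rough stand-in for those aggregate numbers can be pulled from the browser itself. This is only a sketch under assumptions: the hostname filter below is a placeholder for whatever domains serve your code, and transferSize requires a browser with Resource Timing Level 2.

```js
// Rough equivalent of the aggregate view Charles gave us: request count,
// total duration, and total bytes for just *our* resources.
var OUR_HOSTS = /(^|\.)example-cdn\.com$/;   // placeholder for your own domains

var ours = performance.getEntriesByType('resource').filter(function (r) {
  return OUR_HOSTS.test(new URL(r.name).hostname);
});

console.log({
  requests: ours.length,
  totalDurationMs: ours.reduce(function (sum, r) { return sum + r.duration; }, 0),
  totalBytes: ours.reduce(function (sum, r) { return sum + (r.transferSize || 0); }, 0)
});
```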

However, since we had multiple developers working on the project, each making isolated changes, we wanted a way to run a sort of “integration” test of all the changes at once, in a manner that more closely matched how our software is delivered from our production servers. This led us to our next tool of choice, one we had never used before.

WebPageTest.org

In its own words:

WebPagetest is a tool that was originally developed by AOL for use internally and was open-sourced in 2008 under a BSD license. The platform is under active development by several companies and community contributors on Google code. The software is also packaged up periodically and available for download if you would like to run your own instance.

In our case, WebPageTest provided two key things:

  • It’s free
  • It’s a useful 3rd-party mediator between ourselves and others for spot-checking page performance

At a high level, WebPageTest.org controls a set of compute agents in various geographic locations across the US that can simulate bandwidth conditions according to your specifications (under the hood it uses DummyNet). You ask one of its agents to load your page, optionally interact with your site by simulating link clicks, and it monitors and captures the results for detailed analysis later. This is a great way to have an external entity verify your changes and to get a consistent before-and-after benchmark of your page’s performance.

Of course, having some random machine on the web poke your site means that your changes must be publicly accessible over the web. Password protection is fine, since you can use WPT to script the login, but IMHO that is non-ideal because logging in is not part of the normal end-user experience.
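For example, a WebPageTest script along these lines can log in before the measured navigation (commands are tab-separated; the URLs and form field names here are placeholders for your own site):

```
logData	0
navigate	https://www.example.com/login
setValue	name=username	testuser
setValue	name=password	s3cret
submitForm	name=loginForm
logData	1
navigate	https://www.example.com/page-under-test
```

Turning logData off during the login steps keeps them out of the captured results, so the benchmark reflects only the page you actually care about.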

Tactics

Now that we have a good handle on the tools we used, we can discuss how we put them to work. Stay tuned for Part 2, where we will explore the tactics for using these tools together effectively.