Using the Cloud to Troubleshoot End-User Performance (Part 2)

This is the second of a two-part article outlining how we used a variety of tools to improve our page load performance. If you haven't read part 1, go ahead and give it a read before continuing.

Tactics

We opted not to use our normal staging environment for this project, since it doesn't run experimental code.

To iterate rapidly on our changes and to provide a location that is publicly accessible over the web (so that WebPageTest can reach it), we set up an Amazon EC2 instance running a complete copy of all of our software. It effectively behaved exactly like a local developer instance, except that it could be hit from any external resource on the web. (Heh… I make this sound really easy.)

Now that we had a server on the web running a customized version of our software, the next problem was getting requests that normally go to our production datacenter redirected to our EC2 instance, without redirecting real end users. In my opinion, this is where WebPageTest really shined and began to flex its muscles.

Let's say that under normal conditions your production application runs at foo.com (123.456.789.1), and that the EC2 instance you created, running a customized version of your app, lives at ec2-123-456-789-2.aws.com (123.456.789.2). WebPageTest will allow you to override DNS resolution for foo.com to 123.456.789.2. This works in a similar manner to a hosts-file override, except that WPT will still have the browser perform a DNS lookup of your production host, so you still get accurate timings for the DNS resolution.

To take advantage of this, you need to provide the following “script” to your test execution:

navigate http://www.external-site-that-includes-your-code.com
setDns foo.com 123.456.789.2

The other cool thing about WebPageTest is that test execution and results parsing can be scripted via its REST-like APIs. In fact, check out the Java class gist I wrote (embedded below) that makes use of this API to run some aggregated stats on the usage of Twitter on AOL.com. This class lets you view aggregated statistics for a narrow set of the resources that are actually used by a page, assuming you only care about the resources being delivered by your servers into another page.

package com.bazaarvoice.mbogner.utils;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpMethod;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.methods.PostMethod;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import java.io.ByteArrayInputStream;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

/**
 * @author: Matthew Bogner
 * @date: 10/25/11
 */
public class WebPageTestRunner {

    public static HttpClient client = new HttpClient();
    public static XPath xpath = XPathFactory.newInstance().newXPath();
    public static Set<String> hostnameFilter = new HashSet<String>();

    public static void main(String[] args) throws Exception {
        hostnameFilter.add("cdn.api.twitter.com");
        hostnameFilter.add("platform.twitter.com");

        PostMethod post = new PostMethod("http://www.webpagetest.org/runtest.php");
        post.addParameter("url", "www.aol.com");
        post.addParameter("runs", "1"); // Only request the page once
        post.addParameter("fvonly", "1"); // Only look at the FirstView
        post.addParameter("location", "Dulles_IE7.DSL"); // Dulles, VA with IE7 and DSL connection
        post.addParameter("f", "xml"); // Respond with XML
        post.addParameter("k", args[0]); // API key from WebPageTest.org

        // Now we need to (optionally) send over a script so that we can
        // override the DNS entries for certain hosts that the page will
        // attempt to reference.
        //post.addParameter("script",
        //        "navigate http://www.aol.com\n" +
        //        "setDns platform.twitter.com 123.456.789.2\n" +
        //        "setDns cdn.api.twitter.com 123.456.789.2");

        String responseBody = executeHttpMethod(post);
        System.out.println(responseBody);

        Node statusCodeNode = (Node) xpath.evaluate("/response/statusCode", getXmlSrc(responseBody), XPathConstants.NODE);
        String statusCode = statusCodeNode.getTextContent();
        System.out.println("StatusCode = " + statusCode + "\n");

        if ("200".equals(statusCode)) {
            // Request was successful. Wait for the test to complete.
            Node testIdNode = (Node) xpath.evaluate("/response/data/testId", getXmlSrc(responseBody), XPathConstants.NODE);
            waitForTestCompletion(testIdNode.getTextContent());
        }
    }

    private static InputSource getXmlSrc(String content) throws Exception {
        return new InputSource(new ByteArrayInputStream(content.getBytes("UTF-8")));
    }

    public static String executeHttpMethod(HttpMethod method) throws Exception {
        int responseCode;
        String responseBody;
        try {
            responseCode = client.executeMethod(method);
            responseBody = IOUtils.toString(method.getResponseBodyAsStream());
        } finally {
            method.releaseConnection();
        }

        if (responseCode != 200) {
            throw new Exception("Invalid server response. \nResponse code: " + responseCode + "\nResponse body: " + responseBody);
        }

        return responseBody;
    }

    private static void waitForTestCompletion(String testId) throws Exception {
        PostMethod post = new PostMethod("http://www.webpagetest.org/testStatus.php");
        post.addParameter("f", "xml"); // Respond with XML
        post.addParameter("test", testId);

        String responseBody = executeHttpMethod(post);
        Node statusCodeNode = (Node) xpath.evaluate("/response/statusCode", getXmlSrc(responseBody), XPathConstants.NODE);
        String statusCode = statusCodeNode.getTextContent();

        // 200 indicates the test is completed. 1XX means the test is still in progress. 4XX indicates some error.
        if (statusCode.startsWith("4")) {
            System.err.println(responseBody);
            throw new Exception("Error getting test results.");
        } else if (statusCode.startsWith("1")) {
            System.out.println("Test not completed. Waiting for 30 seconds and retrying...");
            Thread.sleep(30000); // Wait for 30sec
            waitForTestCompletion(testId);
        } else if ("200".equals(statusCode)) {
            obtainTestResults(testId);
        } else {
            System.err.println(responseBody);
            throw new Exception("Unknown statusCode in response");
        }
    }

    private static void obtainTestResults(String testId) throws Exception {
        GetMethod get = new GetMethod("http://www.webpagetest.org/xmlResult/" + testId + "/");
        String responseBody = executeHttpMethod(get);

        Node statusCodeNode = (Node) xpath.evaluate("/response/statusCode", getXmlSrc(responseBody), XPathConstants.NODE);
        String statusCode = statusCodeNode.getTextContent();
        if (!"200".equals(statusCode)) {
            System.err.println(responseBody);
            throw new Exception("Unable to obtain raw test results");
        }

        NodeList requestsDataUrlNodes = (NodeList) xpath.evaluate("/response/data/run/firstView/rawData/requestsData",
                getXmlSrc(responseBody),
                XPathConstants.NODESET);
        for (int nodeCtr = 0; nodeCtr < requestsDataUrlNodes.getLength(); ++nodeCtr) {
            Node requestsDataNode = requestsDataUrlNodes.item(nodeCtr);
            String requestsDataUrl = requestsDataNode.getTextContent().trim();
            analyzeTestResult(requestsDataUrl);
        }
    }

    private static void analyzeTestResult(String requestsDataUrl) throws Exception {
        System.out.println("\n\nAnalyzing results for " + requestsDataUrl);

        /*
            Things we want to track for each hostname:
                Total # requests
                Total # of requests for each content type
                Total # of bytes for each content type
                Total Time to First Byte
                Total DNS Time
                Total bytes
                Total connection time
        */
        HashMap<String, Integer> numRequestsPerHost = new HashMap<String, Integer>();
        HashMap<String, HashMap<String, Integer>> numRequestsPerHostPerContentType = new HashMap<String, HashMap<String, Integer>>();
        HashMap<String, Integer> totalTTFBPerHost = new HashMap<String, Integer>();
        HashMap<String, Integer> totalDNSLookupPerHost = new HashMap<String, Integer>();
        HashMap<String, Integer> totalInitialCnxnTimePerHost = new HashMap<String, Integer>();
        HashMap<String, HashMap<String, Integer>> totalBytesPerHostPerContentType = new HashMap<String, HashMap<String, Integer>>();
        HashMap<String, Integer> totalBytesPerHost = new HashMap<String, Integer>();

        String responseBody = executeHttpMethod(new GetMethod(requestsDataUrl)); // Unlike the rest, this response will be tab-delimited

        String[] lines = StringUtils.split(responseBody, "\n");
        for (int lineCtr = 1; lineCtr < lines.length; ++lineCtr) {
            String line = lines[lineCtr];
            String[] columns = StringUtils.splitPreserveAllTokens(line, "\t");

            String hostname = columns[5];
            String contentType = columns[18];
            String ttfb = StringUtils.isBlank(columns[9]) ? "0" : columns[9];
            String dns = StringUtils.isBlank(columns[47]) ? "0" : columns[47];
            String cnxn = StringUtils.isBlank(columns[48]) ? "0" : columns[48];
            String bytes = StringUtils.isBlank(columns[13]) ? "0" : columns[13];

            if ("0".equals(bytes) || (!hostnameFilter.isEmpty() && !hostnameFilter.contains(hostname))) {
                continue;
            }

            // Track total # requests per host
            if (!numRequestsPerHost.containsKey(hostname)) {
                numRequestsPerHost.put(hostname, new Integer(1));
            } else {
                numRequestsPerHost.put(hostname, numRequestsPerHost.get(hostname) + 1);
            }

            // Track total # requests per host per content-type
            if (!numRequestsPerHostPerContentType.containsKey(hostname)) {
                HashMap<String, Integer> tmp = new HashMap<String, Integer>();
                tmp.put(contentType, new Integer(1));
                numRequestsPerHostPerContentType.put(hostname, tmp);
            } else if (!numRequestsPerHostPerContentType.get(hostname).containsKey(contentType)) {
                numRequestsPerHostPerContentType.get(hostname).put(contentType, new Integer(1));
            } else {
                numRequestsPerHostPerContentType.get(hostname).put(contentType, numRequestsPerHostPerContentType.get(hostname).get(contentType) + 1);
            }

            // Track total # bytes per host per content-type
            if (!totalBytesPerHostPerContentType.containsKey(hostname)) {
                HashMap<String, Integer> tmp = new HashMap<String, Integer>();
                tmp.put(contentType, Integer.valueOf(bytes));
                totalBytesPerHostPerContentType.put(hostname, tmp);
            } else if (!totalBytesPerHostPerContentType.get(hostname).containsKey(contentType)) {
                totalBytesPerHostPerContentType.get(hostname).put(contentType, Integer.valueOf(bytes));
            } else {
                totalBytesPerHostPerContentType.get(hostname).put(contentType, totalBytesPerHostPerContentType.get(hostname).get(contentType) + Integer.valueOf(bytes));
            }

            // Track total TTFB for host
            if (!totalTTFBPerHost.containsKey(hostname)) {
                totalTTFBPerHost.put(hostname, Integer.valueOf(ttfb));
            } else {
                totalTTFBPerHost.put(hostname, totalTTFBPerHost.get(hostname) + Integer.valueOf(ttfb));
            }

            // Track total DNS lookup time for host
            if (!totalDNSLookupPerHost.containsKey(hostname)) {
                totalDNSLookupPerHost.put(hostname, Integer.valueOf(dns));
            } else {
                totalDNSLookupPerHost.put(hostname, totalDNSLookupPerHost.get(hostname) + Integer.valueOf(dns));
            }

            // Track total initial connection time for host
            if (!totalInitialCnxnTimePerHost.containsKey(hostname)) {
                totalInitialCnxnTimePerHost.put(hostname, Integer.valueOf(cnxn));
            } else {
                totalInitialCnxnTimePerHost.put(hostname, totalInitialCnxnTimePerHost.get(hostname) + Integer.valueOf(cnxn));
            }

            // Track total bytes for host
            if (!totalBytesPerHost.containsKey(hostname)) {
                totalBytesPerHost.put(hostname, Integer.valueOf(bytes));
            } else {
                totalBytesPerHost.put(hostname, totalBytesPerHost.get(hostname) + Integer.valueOf(bytes));
            }
        }

        printMap("Total # requests per host", numRequestsPerHost);
        printMap("Total # requests per host per content-type", numRequestsPerHostPerContentType);
        printMap("Total # bytes per host per content-type", totalBytesPerHostPerContentType);
        printMap("Total TTFB per host", totalTTFBPerHost);
        printMap("Total DNS lookup per host", totalDNSLookupPerHost);
        printMap("Total Initial Connection Time per host", totalInitialCnxnTimePerHost);
        printMap("Total Bytes per host", totalBytesPerHost);
    }

    private static void printMap(String title, HashMap stats) {
        System.out.println("\t" + title);
        Iterator keyItr = stats.keySet().iterator();
        while (keyItr.hasNext()) {
            Object key = keyItr.next();
            Object value = stats.get(key);
            System.out.println("\t\t" + key.toString() + ": " + value.toString());
        }
    }
}
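
If you want to try the class yourself: it expects your WebPageTest API key as its only command-line argument (that's the args[0] passed as the "k" parameter above), so once it's compiled with the Commons HttpClient, Commons IO and Commons Lang jars on the classpath, a run looks like java com.bazaarvoice.mbogner.utils.WebPageTestRunner YOUR_API_KEY (where YOUR_API_KEY is a placeholder for your own key).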

Summary

Let’s recap what we have available to us…

  • EC2 instance running a set of our code that can be recompiled and relaunched on a whim
  • The Charles Proxy tool for a cursory look at the whole transaction of the page from your desktop
  • WebPageTest tool for a third-party view of your app’s performance
  • and lastly, a custom Java class that can invoke your WebPageTest runs and consolidate/aggregate the results programmatically

With these tools and tactics we had an externally facing environment that we could use to iterate on new ideas quickly.

You’ve lost that startup feeling…

Huge opportunity for impact. A sense of ownership. Collaboration with people across every function in the organization. An understanding of the big picture and a real opportunity to shape the future.

These are some of the best and most exciting qualities of working at a software start-up. In my personal experience, both at a number of start-ups and during my nearly four years at Google, environments with these qualities attract some of the highest-quality software engineers, and subsequently allow them to be maximally effective once they're on board.

Now that Bazaarvoice is six years old and technically a “big company” (in fact the best big company to work for in Austin), one of the biggest challenges we face from an engineering team culture standpoint is maintaining these qualities through growth of both the team and the scope of business. All too often in technology companies, especially those that experience fast success and growth, the culture is sacrificed at the altar of the business. This can be deadly to companies that thrive in large part on the strength of their culture, and highlights an important and often overlooked question about organizational culture: Is your culture scalable?

As VP of Engineering, a big part of my job is shepherding our culture and making sure that we create the kind of environment that allows us to attract and retain the best engineering talent. This brings us to the main topic of this blog post: Bazaarvoice engineering’s approach to building a scalable engineering culture and organization.

The central question I try to tackle is this: how can I build an engineering organization and culture that scales to hundreds of engineers where every engineer experiences, on a daily basis, those key qualities of a start-up? To answer this question, we focus on four major areas.

1. Team Size

This one seems obvious, but it never ceases to amaze me how many organizations miss the boat on this one. The bigger the team is, the less significant each individual is going to feel. The math is simple: it’s hard to feel like there’s a big opportunity for impact when you’re one of fifteen people on a team as compared to when you’re one of three people on a team. Also often overlooked is how much easier it is to hide an underperformer on a larger team than on a smaller team, and having to work with (and make up for) underperforming teammates can be toxic to an engineering culture.

In my experience, some of the most effective and high-powered engineering teams were 3- and 4-person teams working on Really Big Problems. I like to create environments populated by small teams, led by high-potential people, tackling problems that on their face are just slightly bigger than the team should, on paper, be able to handle. When done properly, this creates a sense of urgency and creates an opportunity for growth and self-discovery for the team members. (When done improperly, it makes people feel like you’re asking the impossible, which is obviously very demotivating, and it’s a fine line to walk between the two. This highlights how important it is that, as a leader, you know your team well and are very in tune with their state of mind.)

2. Cross-Functional Structure

Every organization has to choose how it’s going to fundamentally structure itself: functionally or cross-functionally (sort of like choosing your clustered index in SQL Server). I believe in a functional organization structure in the sense that every engineer reports to an engineering manager and on up through a VP of Engineering or CTO, for individual performance management and career development reasons.

However, on a daily basis, I think the organization should operate logically and physically as if it were cross-functionally structured. What I mean by this is that the teams that work together daily should consist of members from different functions (e.g. engineering, product management, design, operations, etc.), that those teams should be physically co-located (i.e. sit together), and that the success of the team as a whole should be weighed more heavily than individual success.

The reason for this is simple: I feel that every engineer should have visibility into why they're building what they're building, as well as input into what gets built and how it gets built. Engineers, by and large, are creative people, and they have a lot to bring to the table when it comes to creative problem solving. That stuff shouldn't be solely the domain of the product managers; it should be a highly collaborative process. It's similarly important that engineers get visibility into how their products are being used out in the wild. All of these inputs are hugely valuable to making sure the engineering team builds the right product for the market. As an important side benefit, exposure to these other areas gives the engineers themselves an opportunity to learn, and makes them smarter, stronger and more likely to make a good decision when faced with uncertainty in the future.

3. Power! Unlimited Power!

(Not really, I just wanted an excuse to link to this video.) Seriously though, this bullet would be better titled: “Uncapped Potential”, or “Roles, Not Titles”. Every engineer ultimately wants to feel like they’re in control of their own destiny, and I want to create an environment where that’s largely true.

This boils down to one major question: can an engineer with talent, potential and ambition take the ball and run with it as far as her talent will let her? If not, what’s getting in her (or his) way? Maybe it’s an overbearing management structure, or lack of visibility into how the product is used in the field, or how it’s sold. Whatever it is, find it and get it out of the way as quickly as you can. We want to create an environment in which engineers with talent, potential and ambition can be as effective as possible, and nothing is more frustrating than an artificial barrier.

One simple organization structure change that helps this along is organizing teams around roles, not titles. First, some definitions. In my opinion, a job “title” should do one thing and one thing only: define the expectations by which someone’s work should be judged, vis-a-vis the organization’s overall values. For instance, the job title of “senior software engineer” may entail the expectation that an individual be capable of owning moderate-to-large components of a big system and operating in a self-directed manner. Notably, it says nothing about the function of that individual, or his relationship to his peers.

A job “role”, on the other hand, is all about the function of the individual in some specific context. For instance, a technical lead (TL) for a team may be responsible for the technical delivery of a project, including the day-to-day work assignments for the engineers on that team. Notably, it says nothing about the title or level of the individual. Every team can and should have a TL, regardless of whether the TL is the only individual on the team or if the team has dozens of members (in which case there may be multiple TLs).

What this means, as an extreme example, is that you can have situations where a more junior engineer is the TL of a team that includes much more senior engineers. Under the right circumstances, this can be exactly what you want (I've seen it work out spectacularly well on multiple occasions in my career).

More important, though, is that your organization and culture support this kind of flexibility, because it provides an opportunity for high-potential engineers, regardless of seniority, to take ownership over something and stretch themselves.

4. A Voice Heard

The last bullet point is in many ways the most important, and really runs as an undercurrent to each of the others. Ultimately, the engineers on the team need to feel like they have a voice, and that that voice will be heard when they’ve got something to say. Whether that means having a voice in the design of the product, the technology direction of the team, the company’s core values, or anything else, engineers are ultimately people, and they want to feel like their opinion matters. As team leaders, it’s our job to make sure both that they feel this way and that it is actually true!

Conclusion

As always with these types of philosophical discussions, your mileage may vary, and there are always exceptions to the rules. However, I’ve seen these approaches and techniques be successful repeatedly throughout my career, perhaps most notably during my time at Google, and I believe that they are generally applicable.

What are your thoughts? I’d love to hear them and discuss them in the comments section below.

I like the four main areas you are focusing on for building a healthy organization culture with efficient productivity. I would also add that a balanced amount of feedback, including both criticism and appreciation, must be given to each and every employee so that they feel their work is being noticed.

Platform API Release Notes, Version 5.1

We are pleased to announce that the following functionality has been developed for version 5.1:

  • ReviewStats updated
  • Moderator Codes for user-generated content (UGC) exposed
  • Wildcard character in ContentLocale filter enabled
  • IP address in Content Display exposed
  • API key creation and management added to the client portal

More detailed information on each of these items is listed below. For complete documentation, refer to the Platform API Documentation, version 5.1.

ReviewStats

Review statistics for Products now return ReviewStats, which account for product family data in the statistics calculations. Prior to version 5.1, ReviewStats requests returned NativeStats, which do not take product families into account.

Moderator Codes

Moderator codes (content codes) can now be explicitly requested and filtered. A new filter (ModeratorCode) has been added and can accept any moderator code values. For a list of all moderator codes, see the API Basics page. In addition, a new value for the attribute parameter (ModeratorCodes) has been added and must be requested in order to filter by ModeratorCode.
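
As a purely illustrative example (the value shown is a placeholder; the actual moderator code values are listed on the API Basics page), a request that includes the ModeratorCodes attribute value could then narrow results with a filter such as &Filter=ModeratorCode:eq:<code>, following the same Filter syntax shown for ContentLocale below.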

Wildcard character in ContentLocale filter

The ContentLocale filter now accepts the asterisk ('*') as a wildcard character representing zero or more characters. The wildcard enables filtering for multiple locales without requiring each one to be listed. For example, &Filter=ContentLocale:eq:en* would return all the English locales (en_US, en_CA, en_GB, etc.). Setting the ContentLocale filter to the wildcard character alone is the equivalent of requesting all locales. Note that the filter value cannot begin with the wildcard character; for example, &Filter=ContentLocale:eq:*_US is not a valid filter.

IP address in Content Display

When the API key is configured to allow access to IP addresses, the IpAddress element will contain either the IP address or null. If you are interested in getting access to IP addresses within an application, contact technical support.

API key creation and management in the client portal

Clients that have signed the API Data Agreement can now create and manage their Developer API keys via the client portal. This functionality allows you to self-manage the API keys that you use to build applications with the Developer Platform APIs and tools. A detailed list of the prerequisites and procedures for creating, viewing and editing your API keys can be found in the Bazaarvoice Release Notes, version 5.1. (You must log in to the Spark portal to view them.)

The addition of the wildcard character in the ContentLocale filter is going to be very useful, as it is needed quite often. All of the new functionality should make the platform easier to access in the new version.

Engineers Giving Back at Random Hacks of Kindness

On December 3rd and 4th Bazaarvoice was the lead sponsor of an event in Austin called Random Hacks of Kindness (RHoK), a coordinated worldwide hackathon for social good. The event started Friday night with a reception for all of the hackers at the Volstead Lounge, where over 60 people celebrated, heard a few quick thoughts on how technology could solve some big problems, and of course had a drink. Representatives of the Chicago Community Emergency Response Team, Williamson County Office of Emergency Management, NASA (a global sponsor for RHoK) and others gave a quick preview of the projects they hoped would be completed during the hackathon, and our own Scott Bonneau (VP of Engineering) spoke about the power that engineers have to change the world. Scott reminded us all that only a few short years ago, creating a new technology meant years of R&D by large teams of highly educated scientists, and that now anyone with a laptop and a credit card can launch an application over the course of days.

Saturday morning kicked off with coffee and presentations by each subject matter expert, who described the problems they had been researching to the more than 50 hackers eager to get started. Additionally, NASA had a special surprise for the Austin attendees: they had arranged for astronaut Ron Garan to speak about how his time on the International Space Station had given him a unique view into the power RHoK hackers have and the need for greater collaboration on the world's biggest problems. The hackers organized themselves into teams based on their skillsets and interests, and before lunch the product design had begun. Teams were well fed throughout the weekend with great local food from P. Terry's, The Peached Tortilla, and Freebirds, ensuring that no one ever went hungry or was lacking caffeine. Several teams coded late into the evening at the Capital City Events space, lounging on couches, and some even stayed the night.

By the end of the hackathon Sunday afternoon, the teams had built a number of amazing applications, and you can read more about the applications on the RHoK web site. All of the teams presented their work, and the top teams as selected by our judges received some amazing prizes (iPads, Kindles, and Buckyballs). Overall we’re very proud to have helped support this amazing opportunity, and we couldn’t have done it without the generous help of our sponsors and partners Homeaway.com, Freebirds, Capital City Events Center, and Github, as well as the numerous volunteers from tech companies throughout the community.

Our event didn't go perfectly, and we certainly learned a few lessons along the way:

  1. Lead the leaders. You cannot run something of this size on your own. You need a team of leaders who’ll run alongside you and carry the ball for you in specific areas.
  2. It takes an army. You can never have enough volunteers. Find them early and have roles clearly lined out.
  3. It’s all about the network. It’s not about who you know, it’s about who they know. Find the connectors in your target demographic and pursue them. They’ll connect you with the masses.

To find out more about the next RHoK Austin, just follow the Pixadillo @rhokaustin. Want to help run RHoK Austin in the future? Send us a message @rhokaustin and we'll connect you with the steering team for the next RHoK Austin.

Using the Cloud to Troubleshoot End-User Performance (Part 1)

Debugging performance issues is hard. Debugging end-user performance issues in a distributed production software stack is even harder, especially if you are a third-party service provider and your client controls how your code is integrated into their site. There are lots of articles on the web about performance best practices, but few, if any, discuss the tactics of actually developing and improving on them.

Primarily, the challenges stem from the fact that troubleshooting performance issues is always iterative. If your production operations can handle deploying test code to production on a rapidly iterative schedule, then you can stop reading this post — you are already perfect, and worthy of a cold brewski for your impressive skills.

As our team and software stack continued to grow in both size and complexity, we restructured our client-side (JS, CSS and HTML) code a while back so that we deliver only the code actually needed by a client's site and configuration. We did this by reorganizing our code into modules that can be loaded asynchronously by the browser using require.js. Effectively, this took us from a monolithic JS and CSS include containing a bunch of unused code and styles to a much smaller deliverable consumed in smaller chunks by the browser.

This technique is a double-edged sword, and like all things, is best when done in moderation. Loading multiple modules asynchronously in the browser results in multiple http requests. Every HTTP request made by the browser results in some overhead spent doing the following:

  1. DNS Lookup – Only performed once per unique hostname that the browser encounters.
  2. Initial Connection – This is simply the time that the browser takes to establish a TCP socket connection with the web server.
  3. SSL Negotiation – This step is omitted if the connection is not intended to be secure. Otherwise, it is the SSL handshake, including the certificate exchange.
  4. Time to First Byte (TTFB) – This is the time from when the browser sends the request to when the first byte of the response is received by the browser.
  5. Content Download – This is the time spent downloading all the bytes of the response from the web server.

There are many great resources from Yahoo! and Google which discuss the details of best-practices for improving page performance. I won’t re-hash that great information, nor dispute any of the recommendations that the respective authors make. I will, on the other hand, discuss some tactics and tools that I have found beneficial in analyzing and iterating on performance-related enhancements to a distributed software stack.

Mission

A few months ago, we challenged ourselves to improve our page load performance with IE7 in a bandwidth constrained (let’s call it “DSL-ish”) environment. DSL connections vary in speed and latency with each ISP.

I won’t bore you with the details of the changes we ended up making, but I want to give you a flavor of the situation we were trying to solve before I talk about the tactics and tools we used to iterate on the best practices that are available all over the web.

The unique challenge here is that IE7 allows only 2 simultaneous connections to the same host at a time. Since our software delivers multiple, very small modules of JS, CSS and images, we were running into this 2-connections-per-hostname limit with a vengeance. Commonly accepted solutions to this involve image spriting, file concatenation, and distributing or "sharding" requests for static resources across multiple domains. The sharding tactic made us realize the other constraining factor we were dealing with: the longer latency of HTTP requests on a DSL connection, which is exacerbated when the browser must perform DNS lookups for a larger set of distinct host names.
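
To make that tradeoff concrete, here is a minimal back-of-envelope sketch (all of the numbers are made-up assumptions for illustration, not measurements from our testing) that models how the two-connections-per-host limit and the per-host DNS cost pull against each other as you shard small resources across more hostnames:

public class ShardingEstimate {
    public static void main(String[] args) {
        int resources = 24;          // hypothetical number of small js/css/image files
        int roundTripMs = 90;        // hypothetical DSL round-trip latency per "wave" of requests
        int dnsMs = 120;             // hypothetical DNS lookup cost per unique hostname
        int connectionsPerHost = 2;  // the IE7 limit discussed above

        for (int hosts = 1; hosts <= 4; hosts++) {
            int perHost = (int) Math.ceil(resources / (double) hosts);
            // Requests to one host are serialized into "waves" of size connectionsPerHost.
            int waves = (int) Math.ceil(perHost / (double) connectionsPerHost);
            int estimateMs = hosts * dnsMs + waves * roundTripMs;
            System.out.println(hosts + " host(s): ~" + estimateMs + " ms");
        }
    }
}

The point is not the exact numbers but the shape of the tradeoff: each additional hostname buys more parallel connections at the price of another DNS lookup, which is exactly the tension we had to balance.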

Tools

The tools that we used to measure and evaluate our changes affected the tactics we used – so I’ll discuss them first.

Charles Proxy

Charles Proxy is a tool that runs on all platforms and provides some key features that really aided our analysis. Primarily, it has built-in bandwidth throttling, which allowed us to simulate specific latency and upload/download conditions from our local machines. We used Charles Proxy for rougher, on-the-spot analysis of changes. It also let us easily and quickly see aggregate numbers for the specific metrics we were interested in; in particular, the total number of requests, the total duration of all requests, and the total response size of all requests. Since these numbers are affected by the rest of the code (not ours) on our client's site, Charles allowed us to filter out the resources that were not ours while still letting us see how our software behaved in the presence of our client's code.

However, since we had multiple developers working on the project — each making isolated changes — we wanted a way to run a sort of “integration” test of all the changes at once in a manner that more closely aligned with how our software is delivered from our production servers. This led us to our next tool of choice – one that we’d never used until now.

WebPageTest.org

In its own words:

WebPagetest is a tool that was originally developed by AOL for use internally and was open-sourced in 2008 under a BSD license. The platform is under active development by several companies and community contributors on Google code. The software is also packaged up periodically and available for download if you would like to run your own instance.

In our case, WebPageTest provided two key things:

  • It’s Free
  • It is a useful 3rd party mediator between ourselves and others for spot-checking page performance

At a high level, WebPageTest.org controls a set of compute agents in various geographic locations around the US that can simulate bandwidth conditions according to your specifications (under the hood it uses DummyNet). It allows you to request that one of its agents load your page, interact with your site by simulating link clicks (if necessary), and monitor and capture the results for detailed analysis later. This tool is a great way to use an external entity to verify your changes and keep a consistent pre- and post-change benchmark of your page's performance.

Of course, having some random machine on the web poke at your site means that your changes must be publicly accessible over the web. Password protection is fine, since you can use WPT to script the login, but IMHO it's non-ideal because that extra step is not part of the normal end-user experience.

Tactics

Now that we have a good handle on the tools we used – we should discuss how we put them to work. Stay tuned for part 2, where we will explore the tactics for using these tools together effectively.

Grilling up an API

BBQ is a religion in Austin. Everyone has their opinion on who serves up the best BBQ. Debates between people defending their choices have been known to last into the wee hours of the night. Friendships have been ruined, and neighbors turned into enemies (okay, I might have made that last bit up…but you get my point).

APIs are also like a religion to many in the developer community. Developers spend their precious time using the tools and APIs that companies create. The easiest tools to use will be the ones they turn to consistently – and tell their friends about. At Bazaarvoice, we are hyper-focused on how to make our API and Platform the best set of tools around.

But how do you "serve up" a good API? To answer that, let's borrow an analogy from the world of BBQ.

Imagine you wanted to make a BBQ dinner, and you came to Bazaarvoice for help. We could help you in a few different ways:

Method #1: We can provide you with the raw ingredients & materials you need – e.g. spices to make your sauce, sticks to build your fire, and of course – a cow.

Method #2: We can provide you with some pre-packaged ingredients & items – e.g. a bottle of BBQ sauce, a grill, and some prime cuts of meat.

Method #3: We can provide you with a menu from the Salt Lick, as well as the number to their delivery service.

So what does this translate to (aside from a yummy BBQ dinner)?

Method #1: High innovation, high support costs, and low adoption.

Method #2: Medium innovation, medium support costs, and medium adoption.

Method #3: Low innovation, low support costs, and high adoption.

At Bazaarvoice, we aim to provide the developer community with tools that support all three of the methods above:

Method #1: Our API gives developers fine-grained control over the information they can request, the filters they can specify, etc. However, this flexibility comes at a cost. Developers will have to understand our object model and syntax to take full advantage of the API, and we at Bazaarvoice need to provide training and documentation to help with this process.

Method #2: Our API documentation always starts off with example API calls and popular use cases. These “pre-packaged” examples can help you skip straight to the API calls that will get the job done.

Method #3: We have reference apps available to download, and we will continue to add more over time. These reference apps serve two purposes. First, you can download the apps, enter your API credentials, and be off and running (just like BBQ takeout!). Second, you can use these apps as a learning tool to help you get familiar with the Bazaarvoice API faster.

Like grilling up BBQ, it is hard to satisfy everyone. But when you get your product just right, you can turn customers into dedicated fans. So in conclusion – when you see an employee of Bazaarvoice feasting away at one of the many popular local BBQ joints, feel confident knowing that we are hard at work.

Extremely superb way of comparing APIs with BBQ, though at first I was quite confused while reading this article. So finally I can say I enjoyed both the Bazaarvoice API and Austin BBQ 😉

The Tools We Use to Innovate in Bazaarvoice Labs (Part 2)

In the previous post, I provided a rundown on what Bazaarvoice Labs is, our process and why it is important to have flexibility in our toolset choices. I now want to give you some tool examples in the following categories:

  • Operational Tools
  • Server-side Application Development Environments
  • Data Storage and Management
  • Client-side Tools
  • Measurement Tools

Operational Tools

  • Amazon EC2: Well, duh. I mentioned that we need to seamlessly transition from internal prototypes to live running pilots, and by using EC2, Elastic Load Balancer and a set of mostly standardized AMIs, we're able to get a machine up and running to demo a prototype, or scale out to support hundreds of thousands of requests, almost instantly. Key to our use of EC2 is the fact that it has a very robust API and tools like boto, so we can automate just about everything that we do (a rough sketch of what that kind of automation looks like follows this list). This is important since it's well documented that EC2 instances can go up and down without rhyme or reason. Which brings me to my next operational tool…
  • Cloudkick: We use Cloudkick for basic monitoring. Its UI is simple and it just plain works. Given how frequently we take services and applications up and down in EC2, it’s really nice to have an easily configurable, straightforward monitoring solution to rely on.
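
As a rough illustration of the EC2 automation mentioned above: our own scripts use boto, which is Python, but the equivalent idea expressed with the AWS SDK for Java looks roughly like the sketch below. The AMI ID, instance type, key pair and credential handling are all placeholders, not our actual configuration.

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.Instance;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

public class PilotInstanceLauncher {
    public static void main(String[] args) {
        // Credentials are passed on the command line here purely to keep the sketch short.
        AmazonEC2Client ec2 = new AmazonEC2Client(new BasicAWSCredentials(args[0], args[1]));

        // Launch one instance from a (placeholder) standardized AMI for a new pilot.
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId("ami-00000000")      // placeholder: your standardized AMI
                .withInstanceType("m1.small")     // placeholder instance size
                .withKeyName("labs-pilot-key")    // placeholder key pair
                .withMinCount(1)
                .withMaxCount(1);

        RunInstancesResult result = ec2.runInstances(request);
        for (Instance instance : result.getReservation().getInstances()) {
            System.out.println("Launched " + instance.getInstanceId());
        }
    }
}

The value isn't the particular snippet; it's that because everything is behind an API, bringing a pilot machine up (or replacing one that has died) can be a script rather than a manual step.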

Server-side Application Development Environments

  • Ruby on Rails and Django: While we've experimented with microframeworks like Flask, sometimes when you're moving fast and prototyping, you don't know exactly what you need or when you're going to need it. You may not want to think about which ORM or templating language to use, or to re-invent how user sessions are handled, and it's times like these that a nice full-stack web application framework comes in handy. Why both, though? Quite simply, some engineers on our team prefer Ruby and some (most) prefer Python. This is where our one-engineer, one-project approach comes in handy: we work with the tools that will make us fastest. Ultimately, if someone needs to step up and lend a hand on a project when someone else is on vacation, we're all polyglots and can get our hands dirty in any language or framework necessary. The Facebook apps referenced above were written in Rails, and the very, very high traffic pilot that we ran with TurboTax was written with Django (as was our Customer Intelligence product).
  • Node.js: The evented, asynchronous server built on Google's V8 JavaScript engine. Node is a great tool when you're building an application that needs to pull data in from multiple HTTP-based APIs and mash it together. Its performance is remarkable, and it allows a developer to work in the same language on both the client and the server. While some people think server-side JS is a fad, I think Node is leading a revolution in how people build and think about web applications. Note that Node is useful for much more than just building webapps; it can be used, for example, as a very effective proxy as well (see Joe Stump's answer about what technologies SimpleGeo uses on Quora). Data for Travelocity's Social Connect Discovery pilot is served from Node.js, backed by the Bazaarvoice Developer API and custom indices stored in Redis.

Data Storage and Management

  • ElasticSearch: We're no strangers to Lucene-based search and data stores at Bazaarvoice. Most of our core platform's displays are backed by queries made to SOLR. However, unlike SOLR, ElasticSearch is schema free and therefore really nice to use for prototyping and pilots where you're not sure what kinds of data you'll want to index (a brief indexing sketch follows this list). There are some gotchas with this approach, but for Labs projects we'll take the flexibility it offers. As a side note, it's amazing how often Lucene-based tools are left out of the NoSQL discussion (in fact, my colleague RC Johnson did a SXSWi presentation on this). The search functionality in our Ask and Answer for Facebook pilot with Nikon is driven out of ElasticSearch.
  • MongoDB: We’ve used MongoDB in any number of Labs pilots at this point. Most notably, it drives the leaderboard and newsfeed functionality in our Ratings and Reviews for Facebook pilot with Benefit Cosmetics and also the majority of our new product discovery pilot application that we’re running with Sam’s Club in Facebook.
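
To make the "schema free" point above a little more concrete, here is a minimal Java sketch that indexes a JSON document into a local ElasticSearch node over its HTTP API. The index name, type, document and the choice of Apache Commons HttpClient are all illustrative assumptions rather than what our pilots actually use:

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.PutMethod;
import org.apache.commons.httpclient.methods.StringRequestEntity;
import org.apache.commons.io.IOUtils;

public class ElasticSearchSketch {
    public static void main(String[] args) throws Exception {
        // No mapping has been defined ahead of time; ElasticSearch indexes whatever
        // fields appear in the document, which is what makes it convenient for
        // prototypes whose data shape is still in flux.
        String json = "{\"question\": \"Does this lens fit a D7000?\", \"votes\": 3}";

        PutMethod put = new PutMethod("http://localhost:9200/pilot-content/question/1");
        put.setRequestEntity(new StringRequestEntity(json, "application/json", "UTF-8"));

        HttpClient client = new HttpClient();
        int status = client.executeMethod(put);
        System.out.println(status + ": " + IOUtils.toString(put.getResponseBodyAsStream()));
        put.releaseConnection();
    }
}

Nothing about the document's fields had to be declared ahead of time, which is exactly the property that makes ElasticSearch comfortable while a prototype's data model is still settling.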

Client-side Tools

  • Dust: Dust is a Javascript templating library well suited to asynchronous applications. We like Dust because it’s a flexible and easy to use templating language, it integrates well with server-side JS tools like Node and allows you to pre-compile your templates for great performance.
  • Protovis: Protovis is an excellent visualization library. It's declarative and makes it very easy to build complex, interactive visualizations while still giving you a high degree of flexibility over how those visualizations are rendered. We use Protovis to create what I believe are visualizations that are way beyond typical for an analytics tool in our Customer Intelligence product.

Measurement Tools

  • Google Analytics: It’d be tough to tell where we’d be without Google Analytics. It’s got its obvious uses, but also has comprehensive APIs that allow you to call custom events, set variables and then suck the data back out as necessary. This allows us to track specific actions that a user takes and to set up funnels based on those actions (even when the actions are clicks within a page vs. full page views).
  • Mixpanel: Mixpanel is a great alternative to Google Analytics. Many of our projects in Bazaarvoice Labs take the form of Javascript plugins or widgets that don’t conform to the traditional page-view-first mentality of most web analytics. Mixpanel focuses much more on tracking individual events that a user takes either in-page or across pages. Their API for doing this is very easy to use and it has the added benefit of being realtime which means you don’t have to wait a few hours to start seeing results from that code change that you just launched.

Of course, no project, prototype or pilot would get off the ground in Bazaarvoice Labs if we couldn't get at our customers' data. In order to maintain agility, all Bazaarvoice Labs projects are written as free-standing applications that are not part of our core application stack (a somewhat traditional J2EE application built on Spring MVC). Early on in Labs, even though we had direct access to our databases, we knew we needed to maintain separation between our core stack and Labs applications. Since we maintain a very complex set of business rules around content submission and display, configurable on a per-client basis, writing directly to the databases carried a high risk of compromising data integrity.

Generally, we'd use our existing XML API for submission (because it was obvious that trying to write data into the DBs from a separate application was a recipe for disaster), but we'd still use replicas of our core MySQL database clusters for display. That was okay, but there were still some business logic mistakes made in the display of content (unacceptable when your pilot clients are some of the biggest online retailers around). To get around this, we created a new API that supported a significantly higher degree of queryability, JSON and JSON-P data formats, and much lighter-weight responses. This allows Bazaarvoice Labs to talk to our core data sets in a much more efficient manner and be assured that business rules are followed. This new API has now been productized as the Bazaarvoice Developer API. We will often create new, experimental method calls or application-local data indexes, but every single Bazaarvoice Labs project leverages this API heavily.

I hope I've given you a good overview of how Bazaarvoice Labs operates and the tools that keep us humming. It's great to be able to work in an environment where exploration of new ideas and technologies is supported and encouraged. Operating the Bazaarvoice Labs team off-stack gives the Labs engineers a chance not only to give input into what new products get built, but also into what technologies get used to build them, all in a very low-risk way.

The Tools We Use to Innovate in Bazaarvoice Labs (Part 1)

Hi everyone! This is my first post to the Bazaarvoice Developer blog and I’d like to take this opportunity to shed some light on some of the tools Bazaarvoice Labs has recently found very useful in creating the pilots and prototypes that ultimately morph into new products and features on the Bazaarvoice platform. Before I talk about our toolset though, I’d like to give you a quick rundown on what Bazaarvoice Labs is, our process and why it’s important for us to be flexible in our toolset choices.

Bazaarvoice Labs is the new product research and development group at Bazaarvoice, with the emphasis on "new" and "research." We are a team of engineers who report to our Product Management team (rather than through the engineering group) and help our Product Managers realize their wildest (and potentially most game-changing) ideas. Every quarter we evaluate and prioritize new ideas proposed by our Product Management team, customers and Bazaarvoicers around the company in order to research and create prototypes. The ideas we prioritize highest are those that come with big hairy assumptions but could change our business if they work. By building prototypes, we're able to suss out where the trouble might lie if we were to introduce the new product or feature to our entire customer base. We currently have over one thousand of the world's biggest brands hosting their user-generated content in our platform, and a large services organization to boot. The introduction of even a small new feature can have very large consequences for our organization, so on the risky stuff, we like to know where the gotchas lie. Some of the products spawned out of this process include BrandAnswers and Ratings and Reviews for Facebook (part of our SocialConnect Suite).

In order to build a prototype, we assign an engineer to work directly with a Product Manager or Product Designer. These two work together in an agile manner (agile with a little a, not a capital A) to create a tangible prototype that demonstrates the Product Manager's idea, unencumbered by writing lots of requirements or unnecessary process. It's this one-to-one relationship that makes the process hum, gives the creative process a kick in the pants and really lets these ideas properly gestate. Once more people get involved in a project, managing it gets exponentially harder with each person you add (network effects at work), and the need for process increases as a way to mitigate risk. By imposing a one-to-one structure for our prototyping teams, we strip away any unnecessary obstacles to creativity and give real creative ownership to our Product Managers, Designers and Engineers. In a way, these teams become entrepreneurial cofounders as they attempt to prove their ideas. Additionally, by artificially constraining the initial project team to one engineer, the team is focused on building out the Minimum Viable Product needed to prove its assumptions and build a business case before further investment is required.

Another nice side effect of this style of working is that it allows the engineer working on the project to choose their own tool-chain for each new project. Since they're working alone, there's no need to constrain the tool choices to the lowest common denominator of what every team member might already know. Of course, it's up to the engineer's discretion to reuse code or tools that may already be in use at Bazaarvoice, but that choice ultimately lies with the engineer, and the engineer knows to optimize for speed of creation over other organizational considerations. A further benefit of an engineer being able to choose a new tool-chain with every project is that, in addition to proving business and product ideas, emerging technologies can be realistically evaluated and, where appropriate, integrated into our core engineering stack (this happened with requireJS, which has become an integral part of how we deploy JavaScript on our customers' sites).

Sometimes simply building a prototype may not answer the questions we have around the viability of the product, and we need to take further steps. For example, we needed to answer the following question for Ratings and Reviews for Facebook: "Will people be willing to read and write product reviews inside a Facebook app?" In this case, a prototype isn't enough. We needed to progress to the next phase of the process and actually pilot the application we had built with a couple of customers. For this reason, the "prototypes" we build need to be more robust than you might initially think. Yes, we're building concept cars in Labs, but our concept cars actually need to run. We generally launch these pilots with three to five customers and won't internationalize them. These restrictions keep us agile and help make sure we don't have to build too much customizability into the pilots. Even though these pilots will only be launched with a handful of customers, some of them will be placed in some very high profile, high traffic places (some getting over 100,000 hits per day). Examples of running pilots right now include Nikon's Ask and Answer for Facebook, Travelocity's Social Connect Discovery pilot (see the "Traveler Reviews" link) and TurboTax's People Like You review search tool. Of course, when we launch pilots we track the data and rapidly move to improve the product and build out a suitable business case for productization. In the pilot phase the engineers are free to launch new code whenever they choose and must play the roles of UX, server-side and operations engineer. Being chief cook and bottle washer on these projects frees the owning engineer (there's still only one per project) to push releases as frequently as necessary to build that business case and observe how changes affect the project's KPIs.

So what tools do we use to build software in Labs? Let's quickly review the two phases of our projects, since the tools we choose to build with in Bazaarvoice Labs must support both:

  • Prototyping: When the engineer needs to build a usable, tangible artifact targeted for internal consumption and demonstrations for clients.
  • Pilots: Where we launch our new ideas with a few select clients and measure results to build a business case. Pilots must be stable and able to scale, yet the engineer still has to iterate rapidly on the feature set.

Because our development cycles are so short at Bazaarvoice, projects must also be able to transition between the prototype and pilot phase seamlessly. The tools we select must therefore support the requirements mentioned above. Generally we can divide our tool-chain into a few broad categories:

  • Operational Tools: Tools that help us keep things up and running
  • Server-side Application Development Environments: Application containers, full-stack and micro frameworks. Tools to build web apps with.
  • Data Storage and Management: SQL, noSQL and whatever else you need
  • Client-side Tools: Because there’s a lot you can do with just a browser nowadays
  • Measurement Tools: Without the data to back up our hypotheses, there’s no science

In my next blog post, I'm going to step through each of these categories and talk about a couple of tools that we use and the projects we've used them in. This will not be an exhaustive list, since we're always evaluating new tools, but it should give you some insight into how and why we pick the tools that we do.