Category Archives: Uncategorized

Grilling up an API

BBQ is a religion in Austin. Everyone has their opinion on who serves up the best BBQ. Debates between people defending their choices have been known to last into the wee hours of the night. Friendships have been ruined, and neighbors turned into enemies (okay, I might have made that last bit up…but you get my point).

APIs are also like a religion to many in the developer community. Developers spend their precious time using the tools and APIs that companies create. The easiest tools to use will be the ones they turn to consistently – and tell their friends about. At Bazaarvoice, we are hyper-focused on how to make our API and Platform the best set of tools around.

But how do you “serve up” a good API? To answer that, let’s borrow an analogy from the world of BBQ.

Imagine you wanted to make a BBQ dinner, and you came to Bazaarvoice for help. We could help you in a few different ways:

Method #1: We can provide you with the raw ingredients & materials you need – e.g. spices to make your sauce, sticks to build your fire, and of course – a cow.

Method #2: We can provide you with some pre-packaged ingredients & items – e.g. a bottle of BBQ sauce, a grill, and some prime cuts of meat.

Method #3: We can provide you with a menu from the Salt Lick, as well as the number to their delivery service.

So what does this translate to (aside from a yummy BBQ dinner)?

Method #1: High innovation, high support costs, and low adoption.

Method #2: Medium innovation, medium support costs, and medium adoption.

Method #3: Low innovation, low support costs, and high adoption.

At Bazaarvoice, we aim to provide the developer community with tools that support all three of the methods above:

Method #1: Our API gives developers fine-grained control over the information they can request, the filters they can specify, etc. However, this flexibility comes at a cost. Developers will have to understand our object model and syntax to take full advantage of the API, and we at Bazaarvoice need to provide training and documentation to help with this process.

Method #2: Our API documentation always starts off with example API calls and popular use cases. These “pre-packaged” examples can help you skip straight to the API calls that will get the job done.

Method #3: We have reference apps available to download, and we will continue to add more over time. These reference apps serve two purposes. First, you can download the apps, enter your API credentials, and be off and running (just like BBQ takeout!). Second, you can use these apps as a learning tool to help you get familiar with the Bazaarvoice API faster.

Like grilling up BBQ, it is hard to satisfy everyone. But when you get your product just right, you can turn customers into dedicated fans. So in conclusion – when you see an employee of Bazaarvoice feasting away at one of the many popular local BBQ joints, feel confident knowing that we are hard at work.

An extremely good way of comparing APIs with BBQ, though at first I was quite confused while reading this article. In the end, I can say I enjoyed both the Bazaarvoice API and Austin BBQ 😉

The Tools We Use to Innovate in Bazaarvoice Labs (Part 2)

In the previous post, I provided a rundown on what Bazaarvoice Labs is, our process and why it is important to have flexibility in our toolset choices. I now want to give you some tool examples in the following categories:

Operational Tools
Server-side Application Development Environments
Data Storage and Management
Client-side Tools
Measurement Tools

Operational Tools

Amazon EC2: Well, duh. I mentioned that we need to transition seamlessly from internal prototypes to live running pilots. By using EC2 and Elastic Load Balancer, and by creating a set of mostly standardized AMIs, we’re able to get a machine up and running to demo a prototype, or scale out to supporting hundreds of thousands of requests, almost instantly. Key to our use of EC2 is the fact that it has a very robust API and tools like boto, so we can automate just about everything that we do. This is important since it’s well documented that EC2 instances can go up and down without rhyme or reason. Which brings me to my next operational tool…
Cloudkick: We use Cloudkick for basic monitoring. Its UI is simple and it just plain works. Given how frequently we take services and applications up and down in EC2, it’s really nice to have an easily configurable, straightforward monitoring solution to rely on.

Server-side Application Development Environments

Ruby on Rails and Django: While we’ve experimented with microframeworks like Flask, sometimes when you’re moving fast and prototyping, you don’t know exactly what you need or when you’re going to need it. You may not want to think about which ORM or templating language to use, or to re-invent how user sessions are handled. It’s times like these that a nice full-stack web application framework comes in handy. Why both, though? Quite simply, some engineers on our team prefer Ruby and some (most) prefer Python. This is where our one-engineer, one-project approach comes in handy: we work with the tools that will make us fastest. Ultimately, if someone needs to step up and lend a hand on a project while someone else is on vacation, we’re all polyglots and can get our hands dirty in any language or framework necessary. The Facebook apps referenced above were written in Rails and the very, very high traffic pilot that we ran with TurboTax was written with Django (as was our Customer Intelligence product).
Node.js: The evented asynchronous server built on Google’s V8 JavaScript engine. Node is a great tool to use when you’re building an application that needs to pull data in from multiple HTTP-based APIs and mash it together. Its performance is remarkable and it allows a developer to work in the same language on both the client and the server. While some people think server-side JS is a fad, I think Node is leading a revolution in how people build and think about web applications. Note that Node is useful for much more than building web apps. It can also be used, for example, as a very effective proxy (see Joe Stump’s answer about what technologies SimpleGeo uses on Quora). Data for Travelocity’s Social Connect Discovery pilot is served from Node.js backed with the Bazaarvoice Developer API and custom indices stored in Redis.

Data Storage and Management

ElasticSearch: We’re no strangers to Lucene-based search and data stores at Bazaarvoice. Most of our core platform’s displays are backed by queries made to SOLR. However, unlike SOLR, ElasticSearch is schema free and therefore really nice to use for prototyping and pilots where you’re not sure of the kinds of data that you’ll be wanting to index. There are some gotchas with this approach but for Labs projects, we’ll take the flexibility it offers. As a side note, it’s amazing how often Lucene-based tools are left out of the NoSQL discussion (In fact, my colleague RC Johnson did a SXSWi presentation on this). The search functionality in our Ask and Answer for Facebook pilot with Nikon is driven out of ElasticSearch.
MongoDB: We’ve used MongoDB in any number of Labs pilots at this point. Most notably, it drives the leaderboard and newsfeed functionality in our Ratings and Reviews for Facebook pilot with Benefit Cosmetics and also the majority of our new product discovery pilot application that we’re running with Sam’s Club in Facebook.

Client-side Tools

Dust: Dust is a JavaScript templating library well suited to asynchronous applications. We like Dust because it’s a flexible and easy-to-use templating language, it integrates well with server-side JS tools like Node, and it allows you to pre-compile your templates for great performance.
Protovis: Protovis is an excellent visualization library. It’s declarative and makes it very easy to build complex, interactive visualizations while still giving you a high degree of flexibility over how those visualizations are rendered. We use Protovis in our Customer Intelligence product to create visualizations that I believe go well beyond what is typical for an analytics tool.

Measurement Tools

Google Analytics: It’d be tough to tell where we’d be without Google Analytics. It’s got its obvious uses, but also has comprehensive APIs that allow you to call custom events, set variables and then suck the data back out as necessary. This allows us to track specific actions that a user takes and to set up funnels based on those actions (even when the actions are clicks within a page vs. full page views).
Mixpanel: Mixpanel is a great alternative to Google Analytics. Many of our projects in Bazaarvoice Labs take the form of JavaScript plugins or widgets that don’t conform to the traditional page-view-first mentality of most web analytics. Mixpanel focuses much more on tracking individual actions that a user takes, either in-page or across pages. Its API for doing this is very easy to use, and it has the added benefit of being realtime, which means you don’t have to wait a few hours to start seeing results from that code change you just launched.

Of course, no project, prototype or pilot would get off the ground in Bazaarvoice Labs if we couldn’t get at our customers’ data. In order to maintain agility, all Bazaarvoice Labs projects are written as free-standing applications that are not part of our core application stack (a somewhat traditional J2EE application built on Spring MVC). Early on in Labs, even though we had direct access to our databases, we knew we needed to maintain separation between our core stack and Labs applications. Since we maintain a very complex set of business rules around content submission and display that are configurable on a per-client basis, writing directly to the databases would have carried a high risk of compromising data integrity. Generally, we’d use our existing XML API for submission (because it was obvious that trying to write data into the DBs from a separate application was a recipe for disaster) but we’d still use replicas of our core MySQL database clusters for display. That was okay, but there were still some business logic mistakes made in the display of content (unacceptable when your pilot clients are some of the biggest online retailers around). In order to get around this, we created a new API that supported a significantly higher degree of queryability, offered JSON and JSON-P data formats and had much lighter weight responses. This allows Bazaarvoice Labs to talk to our core data sets in a much more efficient manner and be assured that business rules are followed. This new API has now been productized as the Bazaarvoice Developer API. We will often create new, experimental method calls or create application-local data indexes, but every single Bazaarvoice Labs project leverages this API heavily.
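For readers unfamiliar with the difference between the JSON and JSON-P formats mentioned above, here is a minimal Python sketch. It is not the Bazaarvoice Developer API's implementation; the function name `render`, the callback name and the payload are all hypothetical. It only shows the general idea: JSON-P wraps the same JSON body in a call to a function named by the client, so a browser can load it with a script tag.

```python
import json
import re

# Only allow simple (optionally dotted) JavaScript identifiers as callbacks,
# so a caller cannot inject arbitrary script through the callback name.
_CALLBACK_RE = re.compile(r"^[A-Za-z_$][A-Za-z0-9_$]*(\.[A-Za-z_$][A-Za-z0-9_$]*)*$")


def render(payload, callback=None):
    """Return (content_type, body) as plain JSON, or JSON-P if a callback is given."""
    body = json.dumps(payload, separators=(",", ":"))
    if callback is None:
        return "application/json", body
    if not _CALLBACK_RE.match(callback):
        raise ValueError("invalid callback name")
    return "application/javascript", f"{callback}({body});"


# Hypothetical payload, purely for illustration.
json_type, json_body = render({"total": 2})
jsonp_type, jsonp_body = render({"total": 2}, callback="handleReviews")
```

Validating the callback name is a design choice in this sketch; the post does not say how the real API handles it.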

I hope I’ve given you a good overview of how Bazaarvoice Labs operates and the tools that keep us humming. It’s great to be able to work in an environment where exploration of new ideas and technologies is supported and encouraged. Operating the Bazaarvoice Labs team off-stack gives the Labs engineers a chance to give input, in a very low risk way, not only into what new products get built but also into what technologies get used to build them.

The Tools We Use to Innovate in Bazaarvoice Labs (Part 1)

Hi everyone! This is my first post to the Bazaarvoice Developer blog and I’d like to take this opportunity to shed some light on some of the tools Bazaarvoice Labs has recently found very useful in creating the pilots and prototypes that ultimately morph into new products and features on the Bazaarvoice platform. Before I talk about our toolset though, I’d like to give you a quick rundown on what Bazaarvoice Labs is, our process and why it’s important for us to be flexible in our toolset choices.

Bazaarvoice Labs is the new product research and development group at Bazaarvoice, with an emphasis on the “new” and the “research”. We are actually a team of engineers who report to our Product Management team (rather than through the engineering group) and who help our Product Managers realize their wildest (and potentially most game-changing) ideas. Every quarter we evaluate and prioritize new ideas proposed by our Product Management team, customers and Bazaarvoicers around the company in order to research and create prototypes. The ideas we prioritize highest are those that come with big hairy assumptions but could change our business if they work. By building prototypes we’re able to suss out where the trouble might lie if we were to introduce the new product or feature to our entire customer base. We currently have over one thousand of the world’s biggest brands hosting their user-generated content in our platform and many large services organizations to boot. The introduction of even a small new feature can have very large consequences for our organization. So on the risky stuff, we like to know where the gotchas lie. Some of the products spawned out of this process include BrandAnswers and Ratings and Reviews for Facebook (part of our SocialConnect Suite).

In order to build a prototype, we assign an engineer to work directly with a Product Manager or Product Designer. These two work together in an agile manner (agile with a little-a, not a capital A) in order to create a tangible prototype that demonstrates the Product Manager’s idea unencumbered by writing lots of requirements or unnecessary process. It’s this one-to-one relationship that makes this process hum, gives the creative process a kick in the pants and really lets these ideas properly gestate. Once more people get involved in the project, due to the network effect, managing the project gets exponentially harder with each person you add and the need for process increases as a way to mitigate risk. By imposing a one-to-one structure for our prototyping teams we strip away any unnecessary obstacles to creativity and give real creative ownership to our Product Managers, Designers and Engineers. In a way, these teams become entrepreneurial cofounders as they attempt to prove their ideas. Additionally, by artificially constraining the initial project team to one engineer, the team is focused on building out the Minimum Viable Product needed to prove their assumptions and build a business case before further investment is required. Another nice side effect of this style of working is that it allows the engineer working on the project to choose their own tool-chain for each new project. Since they’re working alone on a project, there’s no need to constrain the tool choices to the lowest common denominator of what every team member might already know or be familiar with. Of course, it’s up to the engineer’s discretion to reuse code or tools that may already be in use at Bazaarvoice, but that choice ultimately lies with the engineer, and the engineer knows to optimize around speed of creation vs. other organizational considerations.
One nice side effect of an engineer being able to choose a new tool-chain with every project is that, in addition to proving business and product ideas, emerging technologies can be realistically evaluated and, where appropriate, integrated into our core engineering stack (this happened with RequireJS, which has become an integral part of how we deploy JavaScript on our customers’ sites).

Sometimes simply building a prototype may not answer the questions that we have around the viability of the product and we need to take further steps to answer them. For example, we needed to answer the following question for Ratings and Reviews for Facebook: “Will people be willing to read and write product reviews inside a Facebook app?” In this case, a prototype wasn’t enough. We needed to progress to the next phase of the process and actually pilot the application we had built with a couple of customers. For this reason, the “prototypes” we build need to be more robust than what you might initially think. Yes, we’re building concept cars in Labs but our concept cars actually need to run. We generally launch these pilots with three to five customers and generally won’t internationalize them. These restrictions keep us agile and help make sure we don’t have to build too much customizability into the pilots. Even though these pilots will only be launched with a handful of customers, some of these will be placed in some very high profile, high traffic places (some getting over 100,000 hits per day). Examples of running pilots right now include Nikon’s Ask and Answer for Facebook, Travelocity’s Social Connect Discovery Pilot (see the “Traveler Reviews” link) and TurboTax’s People Like You review search tool. Of course, when we launch pilots we track the data and rapidly move to improve the product and build out a suitable business case for productization. In the pilot phase the engineers are free to launch new code whenever they choose and must play the roles of UX, server side and operations engineer. Being chief cook and bottle washer on these projects frees the owning engineer (there’s still only one per project) to push releases as frequently as necessary to build that business case and observe how changes affect the project’s KPIs.

So what tools do we use to build software in Labs? The tools we choose to build with in Bazaarvoice Labs must support two phases of a project:

Prototyping: When the engineer needs to build a usable, tangible artifact targeted for internal consumption and demonstrations for clients.
Pilots: Where we launch our new ideas with a few select clients and measure results to build a business case. Pilots must be stable and scale yet the engineer still has to rapidly iterate on the feature set.

Because our development cycles are so short at Bazaarvoice, projects must also be able to transition between the prototype and pilot phase seamlessly. The tools we select must therefore support the requirements mentioned above. Generally we can divide our tool-chain into a few broad categories:

Operational Tools: Tools that help us keep things up and running
Server-side Application Development Environments: Application containers, full-stack and micro frameworks. Tools to build web apps with.
Data Storage and Management: SQL, NoSQL and whatever else you need
Client-side Tools: Because there’s a lot you can do with just a browser nowadays
Measurement Tools: Without the data to back up our hypotheses, there’s no science

In my next blog post, I’m going to step through each of these categories and talk about a couple of tools that we use and the projects that we’ve used them in. This will not be an exhaustive list since we’re always evaluating new tools, but it should give you some insight into how and why we pick the tools that we do.

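// Throws IllegalArgumentException with a formatted message when value is null.
// MessageFormat here is java.text.MessageFormat.
public static void isNotNull(Object value, String message, Object... messageParams) {
    if (value == null) {
        throw new IllegalArgumentException(MessageFormat.format(message, messageParams));
    }
}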

Near real time analytics on a large data set = hard!

About two years ago we found ourselves adding a lot of analytical features to the Bazaarvoice Workbench, our client facing analytics product. We were implementing various new dashboards, reports, and alerts. Dashboards needed to be near real time, so we were forced to either find a way to run required SQL queries in real time, or pre-compute required data. As the main data store was MySQL, we found it impossible to aggregate large amounts of data on the fly in real time, so we were forced to pre-compute a lot of aggregates. Pre-computation happens to be error prone, so we eventually went looking for other solutions which would not require us to do as much pre-calculation.

As we looked at the queries which we were running to pre-compute aggregate data, we spotted a common theme: most of the queries filtered and aggregated data. In the select clause of the query there would normally be a small number of columns, with most values being aggregates such as counts, sums and averages. So we started brainstorming about what type of database might be a better fit for such queries than a traditional row oriented relational DBMS.
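To illustrate the query shape described above, here is a small sketch using Python's built-in sqlite3 module. The `reviews` table, its columns and its rows are hypothetical and are not taken from the Workbench schema. The point is only the pattern: a filter in the WHERE clause, a grouping column, and a select list made up mostly of aggregates.

```python
import sqlite3

# In-memory database with a hypothetical fact table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (product_id TEXT, rating INTEGER, submitted_on TEXT)"
)
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?)",
    [
        ("p1", 5, "2011-01-05"),
        ("p1", 3, "2011-02-10"),
        ("p2", 4, "2011-02-11"),
        ("p2", 2, "2010-12-30"),
    ],
)

# Filter, then aggregate: few columns selected, most of them aggregates.
query = """
    SELECT product_id, COUNT(*) AS review_count, AVG(rating) AS avg_rating
    FROM reviews
    WHERE submitted_on >= ?
    GROUP BY product_id
    ORDER BY product_id
"""
results = conn.execute(query, ("2011-01-01",)).fetchall()
```

Note that the query only ever touches three columns, no matter how wide the table is.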

We looked at a number of different data stores which could work for the problem we had. One of the alternatives considered was Apache SOLR which is a great NoSQL search server we had been using for a long time already with great success. SOLR supports filtering and faceting which allows us to implement most things we were doing with MySQL using SQL. However, there is quite a bit of dev work which needs to go into indexing data in SOLR and querying it.

Enter the world of column oriented databases

Imagine a database which stores the data by column rather than by row. For example, we might want to store the following employee data in a table:

ID, Name, Department, Salary

1, Chuck Norris, DEV, 10000

2, Gate Bills, SALES, 20000

3, Bart Simpson, HR, 15000

Traditional row oriented databases would store column values for each row sequentially, hence the data is stored like this:

1, Chuck Norris, DEV, 10000

2, Gate Bills, SALES, 20000

3, Bart Simpson, HR, 15000

A column oriented database serializes values of each column together:

1,2,3

Chuck Norris, Gate Bills, Bart Simpson

DEV, SALES, HR

10000, 20000, 15000
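The two layouts can be sketched in a few lines of Python. This is a toy in-memory illustration under the assumption that a "column" is just a list. It says nothing about how a real database lays out bytes on disk, and the helper name `to_columns` is made up for this example.

```python
# The employee rows from the example, stored row by row.
rows = [
    (1, "Chuck Norris", "DEV", 10000),
    (2, "Gate Bills", "SALES", 20000),
    (3, "Bart Simpson", "HR", 15000),
]
column_names = ["id", "name", "department", "salary"]


def to_columns(rows, column_names):
    """Pivot a list of row tuples into a dict of per-column lists."""
    return {name: [row[i] for row in rows] for i, name in enumerate(column_names)}


columns = to_columns(rows, column_names)

# An aggregate over salary only needs the salary column;
# ids, names and departments are never touched.
average_salary = sum(columns["salary"]) / len(columns["salary"])
```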

This approach has some advantages over row oriented databases:

A columnar database does not need to read the whole row when a query only needs a small subset of columns. This is a typical pattern you see in queries performing aggregations. Not having to read the whole row from disk minimizes disk I/O, which really matters since databases are normally disk I/O bound these days.
Another important factor is data compression. If data is stored by column, it’s normally possible to achieve a high compression ratio for each column. This in turn helps to minimize the disk footprint of the database, so disk I/O is reduced as well compared to uncompressed databases. It often makes sense to take the hit of decompressing the data on the fly rather than reading more data from disk.
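One way to see why columns compress well is run-length encoding, sketched below in Python. This is only an illustration of the general idea. The post does not say which compression algorithms any particular database uses, and the sample column is made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    return [[value, len(list(group))] for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand [value, count] pairs back into the original list."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out


# A hypothetical low-cardinality column: many rows, few distinct values.
department = ["DEV"] * 4 + ["SALES"] * 3 + ["HR"] * 2
encoded = run_length_encode(department)
```

Nine stored values shrink to three pairs here. In a row layout the same values would be interleaved with ids, names and salaries, so there would be no runs to collapse.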

Getting to know Infobright

One of the column oriented databases we evaluated is Infobright, an open source database built on MySQL by changing the storage engine to be column oriented. Of course, a significant change to storage layout also means that Infobright had to replace MySQL’s query execution engine with a custom engine capable of taking advantage of the columnar storage model. So, let’s jump straight to the test results we got when evaluating Infobright using a data set with 100 million records in the main fact table.

In our tests, analytical queries ran on average 20x faster than on MySQL.

Additionally, the disk footprint was over 10x smaller compared to MySQL due to data compression.

Also, we found that Infobright supported most of the SQL syntax we were already using in our queries including outer joins, sub-queries and unions. We found that a little tweaking was still required to make some queries perform well.

The benchmark numbers were impressive enough that we ended up using Infobright to implement near real time analytics in one of our products. It does a great job at calculating aggregates on the fly so we no longer needed to maintain as many aggregate tables (for instance, daily, and monthly aggregates).

Infobright – the secret sauce

What is the secret sauce in Infobright? First, its column oriented storage model, which leads to less disk I/O. Second, its “knowledge grid”, which holds aggregate data Infobright calculates during data loading. Data is stored in Data Packs of 65K values each. Data Pack nodes in the knowledge grid contain a set of statistics about the data that is stored in each of the Data Packs. For instance, Infobright can pre-calculate min, max, and avg value for each column in the pack during the load, as well as keep track of distinct values for columns with low cardinality. Such metadata can really help when executing a query since it’s possible to ignore data packs which have no data matching filter criteria. If a data pack can be ignored, there is no penalty associated with decompressing the data pack.
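The pack-skipping idea can be sketched as follows. This is a simplified Python illustration, not Infobright's actual data structures: packs hold 4 values instead of 65K, the only statistics kept are min and max, and the function names are invented for the example.

```python
PACK_SIZE = 4  # tiny on purpose; the real packs described above hold 65K values


def build_packs(values, pack_size=PACK_SIZE):
    """Split one column into packs and record min/max for each at load time."""
    packs = []
    for start in range(0, len(values), pack_size):
        chunk = values[start:start + pack_size]
        packs.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return packs


def count_greater_than(packs, threshold):
    """Count values above threshold, skipping packs the statistics rule out."""
    matched = 0
    packs_scanned = 0
    for pack in packs:
        if pack["max"] <= threshold:
            continue  # nothing in this pack can match, so it is never opened
        packs_scanned += 1
        matched += sum(1 for v in pack["values"] if v > threshold)
    return matched, packs_scanned


# Hypothetical column values, loaded into three packs.
column = [1, 2, 3, 4, 10, 12, 11, 13, 5, 20, 6, 7]
packs = build_packs(column)
matched, packs_scanned = count_greater_than(packs, 9)
```

In this run the first pack (max 4) is skipped outright, so only two of the three packs are read. With compressed packs, each skipped pack is also a decompression that never happens.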

Compared to our MySQL implementation, Infobright eliminated the need to create and manage indexes, as well as to partition tables.

Infobright – Limitations

Infobright Community Edition is the open source version of Infobright. Unfortunately, it does not support DML (inserts, updates, and deletes), so the only way to get data loaded is a bulk load using the “LOAD DATA INFILE …” command. It’s still possible to append data to the table, however there is no way to update or delete existing data without having to re-load the table. For some types of data (such as log files or call detail records), this is not a significant issue since data is rarely edited or deleted. However, for other projects this limitation may be a show stopper unless the data set is small enough that it can be periodically re-loaded into Infobright.

Infobright chose to take the SMP (Symmetric MultiProcessing) approach, so there is no built-in support for MPP (Massively Parallel Processing). This certainly limits the size of the databases for which query performance would be acceptable. However, it’s possible to shard the data across multiple Infobright instances and then aggregate the results. Shard-Query is an open source project which makes it possible to query a data set partitioned across many database instances, and our test results confirm that this approach works really well.
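The shard-then-aggregate approach can be sketched like this. The Python below is a generic illustration and does not use or mirror Shard-Query's API. The shards are plain lists standing in for separate database instances, and the data is made up.

```python
def partial_aggregate(shard_rows):
    """What each shard would compute locally: (sum, count) of its ratings."""
    ratings = [rating for _, rating in shard_rows]
    return sum(ratings), len(ratings)


def combine(partials):
    """Merge per-shard (sum, count) pairs into one overall average."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count if count else None


# Three hypothetical shards, each holding a slice of (product_id, rating) rows.
shards = [
    [("p1", 5), ("p2", 3)],
    [("p3", 4)],
    [("p4", 2), ("p5", 1), ("p6", 3)],
]
overall_average = combine([partial_aggregate(shard) for shard in shards])
```

Each shard returns a sum and a count rather than its own average, because averaging the per-shard averages (4, 4 and 2 here) would weight small shards too heavily and give the wrong answer.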

Summary

Column oriented databases proved to be a great fit for analytical workloads. We managed to eliminate the need to maintain a large number of aggregate tables by migrating our data warehouse from MySQL to Infobright. This in turn let us support near real time visualizations in our analytics products.

Thanks, this is useful information about column-oriented databases.


bazaarvoice: engineering

The official blog of Bazaarvoice R&D
