BVIO 2015 Summary and Presentations

Every year Bazaarvoice R&D throws BVIO, an internal technical conference followed by a two-day hackathon. These conferences are an opportunity for us to focus on unlocking the power of our network, data, APIs, and platforms, as well as to have some fun in the process. We invite keynote speakers from within BV, from companies who use our data in inspiring ways, and from companies who are successfully using big data to solve cool problems. After a full day of learning we engage in an intense, two-day hackathon to create new applications, visualizations, and insights into our extensive data.

Continue reading for pictures of the event and videos of the presentations.

bvio-logo

This year we held the conference at the palatial Omni Barton Creek Resort in one of their well-appointed ballrooms.

omni

Participants arrived around 9am (some of us a little later). After breakfast, provided by Bazaarvoice, we got started with the speakers followed by lunch, also provided by Bazaarvoice, followed by more speakers.

bvio2015_presentation2 bvio2015_presentation

After the speakers came a “pitchfest” during which our Product team presented hackathon ideas and participants started forming teams and brainstorming.

bvio2015_bigidea bvio2015_bigidea2

Finally it was time for 48 hours of hacking, eating, and gaming (not necessarily in that order) culminating in project presentations and prizes.

bvio2015_hacking bvio2015_hacking2 bvio2015_gaming bvio2015_eating bvio2015_demo bvio2015_demo2

Presentations

Sephora: Consumer Targeted Content

Venkat Gopalan
Director of Architecture & Devops @ Sephora.com

Venkat presented on the work Sephora is doing around serving relevant, targeted content to their consumers in both the mobile and in-store space. It was a fascinating talk, and we love to see how our clients are innovating with us. Unfortunately, due to technical difficulties we don’t have a recording 🙁

Philosophy & Design of The BV System of Record

John Roesler & Fahd Siddiqui
Bazaarvoice Engineers

This talk was about the overarching design of Bazaarvoice’s innovative data architecture. According to them there are aspects to it that may seem unexpected at first glance (especially not coming from a big data background), but are actually surprisingly powerful. The first innovation is the separation of storage and query, and the second is choosing a knowledge-base-inspired data model. By making these two choices, we guarantee that our data infrastructure will be robust and durable.

Realtime Bidding: Predicting the future, 10,000 times per second

Ian Clarke
Co-Founder and CTO at OneSpot

Ian has built and manages a team of world-class software engineers and data scientists at OneSpot. In his presentation he discusses how he applied machine learning and game theory to architect a sophisticated realtime bidding engine for OneSpot that is capable of predicting the behavior of tens of thousands of people per second.

New Amazon Machine Learning and Lambda architectures

Jeff Nun
Amazon Solutions Architect

In his presentation Jeff discusses the history of Amazon Machine Learning and the Lambda architecture, how Amazon uses them, and how you can use them too. This isn’t just a presentation; Jeff walks us through the AWS UI for building and training a model.

Thanks to Sharon Hasting, Dan Heberden, and the presenters for contributing to this post.

Partner Integrations: Do’s and Don’ts

In this blog post, a Senior Product Manager on our Product team discusses the challenges of building and maintaining technical partnerships between organizations and offers advice on how to overcome those challenges.

Every company comes to a point, early or late, where it realizes that it must partner with other companies to drive value in the market. Partnerships always start with conversations, handshakes, and NDAs. At some point, unlocking the value of partnership may hinge upon establishing a formal integration between the two companies. These integrations constitute a technical “bridge” between companies. They can unlock otherwise inaccessible value, allow for one company to OEM the other, and/or can accelerate work that otherwise is “re-invented” each time the companies engage each other.

Integrations can be amazing vehicles to create value that only comes from combining capabilities from separate entities, while simultaneously allowing each entity to focus on what it does best. They can be the perfect manifestation of the all too often promised “complementary” value. Integrations can offer consistency, repeatability, and reduced friction in the activities involved in unlocking that value.

Unfortunately, integrations are often approached in a manner in which the parties involved are not set up for success. Why?

Integrations aren’t just some “code.” They are product. They require an organized effort to build, involving both technical and non-technical staff (engineers, architects, project managers, product managers, partnership managers). They require support, assigned owners, subject matter experts, marketing, documentation, and a proper roadmap vision. Integrations demand the same attention and focus that any first-class “product” requires.

Integrations require both more and different types of communication. Because the value of the integration is typically not front-and-center to the core value of each org, there must be additional effort to communicate the existence of the integration and the value it brings within each org. Sales, onboarding, and post-live support organizations all need ways to communicate with the other integrated party (who calls whom when something stops working?). The two product organizations must communicate ahead of any changes to dependent technologies such as APIs. A classic communication gap happens when one entity changes their APIs and doesn’t let the other party know soon enough, or at all; the problem is only discovered when something breaks.

Integrations are usually birthed by the wrong part of the org. The challenge with integrations is that the impetus to create them usually originates from one or both companies’ business development/partnerships teams – groups that typically have little appreciation for the discipline of product management. Their priority is on “relationships,” which historically focus on non-technical efforts. Additionally, the ADD-like attention span of most partnerships teams results in a great desire to create an “integration” with a partner for marketing and sales-driven reasons, but very little attention, effort, and commitment to the long-term requirements of a properly supported product. It is quite easy to just stop communicating with a partner who is no longer deemed valuable, but such an about-face cannot be made when an integration is in place with paying customers. Most often, partnerships orgs do not have technical resources within their structure, but rather borrow technical resources from wherever they can be found (“Hey, I have a partner company who is just trying to do this thing, and they have a quick technical question…”). This is a great approach for proof-of-concept initiatives, but certainly not for something that companies and customers must trust to deliver value. The product organizations at each company must be responsible for bringing an integration to life. Regardless of whether the product org has enough resources to service an integration like a first-class product citizen, at least the owner will understand what is and isn’t being handled properly and can mitigate the potentially negative outcomes that arise from under-served products.

Correctly structured incentives are crucial to the short and long-term success of integrations. There must be something in it for all concerned parties. Direct compensation and rev share are two good options. You should be cautious of such benefits as “stickiness” (as in, the assumption that giving an integration free-of-charge to an existing customer makes that customer less likely to debook your core service) or the halo effect associated with integrating with a company (i.e. “Do you know we’re integrated with Facebook?”). Many integrations have been built on the promise of return. Once that promise begins to fade (from any one or more of the parties), so does the motivation of the affected party to keep up their end of the technical bargain. The technology world’s version of “he’s just not that into you (anymore).” Once an integration is no longer properly attended to from one party, the integration becomes a liability. It’s not enough for the bridge to be secured to just one side of the river.

People love to build stuff. But they hate to support it. There must be something in it for the platform to properly prioritize integration maintenance efforts. Be wary of agreements that lack commitments, SLAs, etc. (often framed as doing any needed work on a “best efforts” basis), as these agreements allow the company responsible for the integration (“code”) to elect not to invest in support and roadmap development should their interest wane. If the agreement lacks commitments, then the partnership likely will as well. They will acknowledge the maintenance effort, but it will always get pushed to the next dev cycle. Which leads us to…

The Challenge of Opportunity Cost

The assumption here is that the companies contemplating an integration are predominantly product organizations. Their company mandate is to bring products to market at scale. This is dramatically different from a service organization, which essentially trades dollars for hours.
This means that the cost of technical/engineering effort at a product organization is different from that at a service organization – not because engineers get paid more at product organizations, but because the additional opportunity cost of engineering effort at a product organization often introduces an impossibly high hurdle rate for putting those engineers on non-core “integration work.” Even the mere existence of opportunity cost, albeit uncalculated, is all that a dissenting product or engineering leader needs to de-prioritize seemingly less important “integration work” that doesn’t deliver core value.

One innovative approach to solving this dilemma is to use outsourced engineering resources from a service organization, avoiding the challenges that come with opportunity cost. It makes good business sense: let your in-house engineering staff concentrate on doing the things that drive core value at scale. The downside of this approach is that there is a very clear and visible cost (hours * hourly rate) attached to all effort associated with the integration. A similar cost analysis is rarely done when utilizing internal resources, so the integration product manager should be prepared: getting things done is always more expensive than you thought.

Of course, another solution is to simply make integration work of the same perceived class of value as that of the core product org’s core solution. However, as we describe above, this can be a big challenge.

The technical approach must sit at the convergence of correctly structured incentives and technical viability. How open or closed a platform is can dictate how an integration can be executed. The associated partnership incentive structure can dictate how an integration should be executed. The final integration will be found at the intersection of these two perspectives.

Closed platforms force the work on that platform. Open platforms allow for more options – either or both entities, possibly even a third-party, can contribute to the integration development.

Let’s look at a few scenarios.

Scenario 1: B is a “closed” platform

b_is_closed_platform

“Closed” here means that the platform does not allow for integration (read: code) to be hosted by that platform and that the platform does not have externally accessible APIs to utilize from outside the platform. The closed platform may have internally accessible APIs, but those do an external party little good.

Closed platforms force that platform to do the integration work. Thus, there must be incentives for the closed platform to both build and support the integration long-term. The effort to build the integration is often simply the result of the opportunistic convergence of both parties being sold on (at least) the promise of value and some available engineering capacity within the closed platform. Without the proper incentives for platform B, this becomes a classic example of the Challenge of Opportunity Cost discussed above. The engineer who had some free time to build the integration is suddenly no longer available to fix a bug or build a new feature. There must be motivation in some form to continue to maintain the integrity of the integration.

Scenario 2: B is open

b_is_open_platform

Open platforms present more options. In Scenario 2, B is no longer the only entity who can develop the integration: A, B, or a third-party org can build it. There are more alternative incentive structures as well. Since the engineering effort can be executed by a non-B entity, there doesn’t need to be much in it for B (there can be, but it is not nearly as necessary). The developing entity will certainly need knowledge of the B platform (documentation, sandboxes, API keys, deployment directions, etc.), but providing this has a much lower hurdle rate for B than getting something onto B’s engineering roadmap. Typically, B will have some form of partner “program” whereby such assets and knowledge are available for a predetermined fee. Even in the absence of such a program, the needs are significantly less than if the development effort required engineers from platform B to do the build work.

Scenario 3: Middle-ware Solution

middleware_solution

Scenario 3 is a derivative of Scenario 2, and options are abundant. A, B, or a third party can build the integration, and in most cases any of those entities can bring it to market. A major decision will be how and where to host the middleware solution and how to provide production-ready support, specifically beyond the initial build phase (which can just leverage cloud hosting services like Amazon to quickly get up and running). In exchange for that hosting and support burden, a middleware solution removes the challenges that come with needing to host the integration within the B platform, which can range from simple plug-and-play effort to per-instance customizations required for each integration incarnation.

Incentive options are very similar to Scenario 2. One exception is that there is a clear opportunity for a third-party to bring the integration to market with an associated price tag.

Summary

Integrations are powerful and often hugely valuable, but their success is directly tied to the ability to structure them for the long term. Integrations are a special kind of “product”: they require different types of communication, and they can benefit from the use of outsourced resources to execute and maintain them.

A successful integration is the result of a technical and non-technical relationship structured so that the benefit to each party adequately compensates for the often underestimated level of involvement required across both organizations.

Automating a Git Rebase Workflow

When I started on the Firebird team at Bazaarvoice, I was happy to learn that they host their code on GitHub and review and land changes via pull requests. I was less happy to learn that they merged pull requests with the big green button. I was able to convince the team to try out a new, rebase-oriented, workflow that keeps the mainline branch linear and clean. While the new workflow was a hit with the team, it was much more complicated than just clicking a button, so I automated the workflow with a simple git extension, git land, which we have released as an open source tool.

What’s Wrong With the Big Green Button?

The big green button is the “Merge pull request” button that GitHub provides to merge pull requests. Clicking it prompts the user to enter a commit message (or accept a default provided by GitHub) and then confirm the merge. When the user confirms the merge, the pull request branch is merged using the --no-ff option, which always creates a merge commit. Finally, GitHub closes the pull request.
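On GitHub’s side, that is roughly equivalent to running something like the following (the branch name here is a placeholder):

# approximately what the big green button does
git checkout master
git merge --no-ff feature-branch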

For example, given a master branch like this:

git log of an example master branch with three commits

An example master branch

 

…and a feature branch that diverges from the second commit:

An example feature branch started from the second commit

A feature branch started from the second commit

 

…this is the result of doing a --no-ff merge:

The result of merging the examples with the --no-ff option

The result of merging the examples with the --no-ff option. Note that the third commit on master is interleaved with the merge commit and the feature branch commits.

 

Merging with the big green button is frowned upon by many; for detailed discussions of why this is, see Isaac Z. Schlueter and Benjamin Sandofsky. In addition to the problems with merge commits that Isaac and Benjamin point out, the big green button has another downside: it merges the pull request without an opportunity to squash commits or otherwise clean up the branch.

This caused a couple of problems. First, because only the pull request author can clean up the PR branch, merging often became a tedious and drawn-out process as reviewers cajoled the author into updating their branch to a state that would keep `master`’s history relatively clean. Worse, messy pull requests were sometimes hastily or mistakenly merged.

As a result, the team was encouraged to keep their pull requests squashed into one or two clean commits at all times. This solved one problem, but introduced another: when an author responds to comments by pushing up a new version of the pull request, the latest changes are squashed together into one or two commits. As a result, reviewers had to hunt through the entire diff to ensure that their comments were fully addressed.

An Alternate Workflow

After some lively discussion, the team adopted a new workflow centered on fast-forward merging squashed and rebased pull request branches. Developers create topic branches and pull requests as before, but when updating their pull request, they never squash commits. This preserves detailed history of the changes the author makes in response to review feedback.

When the PR is ready to be merged, the merger interactively rebases it on the latest master, squashes it down to one or two commits, and does a fast-forward merge. The result is a clean, linear, and atomic history for `master`.

The result of merging the example feature branch into master by rebasing and doing a --ff-only merge

The result of merging the example feature branch into master using the described workflow.

One hiccup is that GitHub can’t easily tell that the rebased and squashed commit contains the changes in the pull request, so it doesn’t close the PR automatically. Fortunately, GitHub will close pull requests whose commits contain special keywords. So the merger has a final task: adding “[closes #<PR number>]” to the message of one of the squashed commits.

git-land

The biggest downside to the new workflow is that it transformed merging a PR from a simple operation (pushing a button) into a somewhat tricky multi-step process (sketched as commands after the list):

  • update local master to latest from upstream
  • check out pull request branch
  • do an interactive rebase on top of master, squashing down to one or two commits
  • add “[closes #<PR number>]” to the commit message of the most recent squashed commit
  • do a fast-forward merge of the pull request branch into master
  • push local master to upstream
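As a rough sketch in plain git commands (the branch name, remote name, and PR number are placeholders):

# update local master to the latest from upstream
git checkout master
git pull --ff-only upstream master

# check out the pull request branch and rebase it interactively onto master,
# squashing down to one or two commits; while editing, add "[closes #123]"
# to the message of the most recent squashed commit
git checkout feature-branch
git rebase -i master

# fast-forward merge the squashed branch into master and push it upstream
git checkout master
git merge --ff-only feature-branch
git push upstream master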

This process was too lengthy and error-prone to be reliable unless automated. To address this problem, I created a simple git extension: git-land. The Firebird team has been using this tool for a little over a year with very few problems. In fact, it has spread to other teams at Bazaarvoice. We are excited to release it as an open source tool for the public to use.

Front End Application Testing with Image Recognition

One of the many challenges of software testing has always been cross-browser testing. Despite the web’s overall move to more standards-compliant browser platforms, we still struggle with the fact that certain CSS values or JavaScript operations don’t translate well in some browsers (cough, cough, IE 8).

In this post, I’m going to show how the Curations team has upgraded their existing automation tools to allow us to automate spot-checking the visual display of the Curations front end across multiple browsers, saving us time while helping to build a better product for our clients.

The Problem: How to save time and test all the things

The Curations front end is a highly configurable product that allows our clients to implement the display of moderated UGC made available through the API from a Curations instance.

This flexibility, combined with BV’s browser support guidelines, means there are a very large number of ways Curations content can be rendered on the web.

Initially, rather than attempt to test ‘all the things’, we’ve codified a set of possible configurations that represent general usage patterns of how Curations is implemented. Functionally, we can test that content can be retrieved and displayed; however, when it comes to whether the end result has the right look and feel in Chrome, Firefox, and other browsers, our testing is largely manual (and time consuming).

How can we better automate this process without sacrificing consistency or stability in testing?

Our solution: Sikuli API

Sikuli is an open-source Java-based application and API that allows users to automate web, mobile and OS applications across multiple platforms using image recognition. It’s platform based and not browser specific, so it enables us to circumvent limitations with screen capture and compare features in other automation tools like Webdriver.

Imagine writing a test script that starts with clicking the home button within an iOS simulator, simply by providing the script a .png of the home button itself. That’s what Sikuli can do.

You can read more about Sikuli here. You can check out their project here on github.

Installation:

Sikuli provides two different products for your automation needs: a stand-alone scripting engine and an API. For our purposes, we’re interested in the Sikuli API, with the goal of implementing it within our existing Saladhands test framework, which uses both Webdriver and Cucumber.

Assuming you have Java 1.6 or greater installed on your workstation, follow the link to the standalone setup JAR from Sikuli.org’s download page:

http://www.sikuli.org/download.html

Download the JAR file and place it in your local workstation’s home directory, then open it.

Here, you’ll be prompted by the installer to select an installation type. Select option 3 if you wish to use Sikuli in your Java or Jython project and also want access to its command line options. Select option 4 if you only plan on using Sikuli within the scope of your Java or Jython project.

Once the installation is complete, you should have a sikuli.jar file in your working directory. You will want to add this to your collection of external JARs for your installed JRE.

For example, if you’re using Eclipse, go to Preferences > Java > Installed JREs, select your JRE version, click Edit and add Sikuli.jar to the collection.

Alternately, if you are using Maven to build your project, you can add Sikuli’s API to your project by adding the following to your POM.XML file:

<dependency>
    <groupId>org.sikuli</groupId>
    <artifactId>sikuli-api</artifactId>
    <version>1.2.0</version>
</dependency>

Clean then build your project and now you’re ready to roll.

Implementation:

Ultimately, we wanted a method, controlled through Cucumber, that uses Webdriver to drive a web application (in this case, an instance of Curations), takes a screen shot of it, and compares that against a static screen shot of specific web elements (e.g. Ratings and Review stars within the Curations display).

This test method would then assert that we can find a match to the static screen element within the live web application, or have TestNG throw an exception (test failure) if no match can be found.

First, now that we have the ability to use Sikuli, we created a new helper class that instantiates an object from their API so we can compare screen output.

import org.sikuli.api.*;
import java.io.IOException;
import java.io.File;

/**
 * Created by gary.spillman on 4/9/15.
 */
public class SikuliHelper {

public boolean screenMatch(String targetPath) {

Once we import the Sikuli API, we create a simple class with a single class method. In this case, screenMatch accepts a path (relative to the Java project) to a static image that we are going to compare against the live browser window. True or false will be returned depending on whether or not we have a match.

//Sets the screen region Sikuli will try to match to the full screen
ScreenRegion fullScreen = new DesktopScreenRegion();

//Set the target image to compare against
Target target = new ImageTarget(new File(targetPath));

The main object type Sikuli works with is ScreenRegion. Here, we instantiate a new screen region relative to the entire desktop screen area of whatever OS our project runs on. By passing no arguments to DesktopScreenRegion(), we define the region’s dimensions as the entire viewable area of our screen.

double fuzzPercent = .9;

try {
    fuzzPercent = Double.parseDouble(PropertyLoader.loadProperty("fuzz.factor"));
}
catch (IOException e) {
    e.printStackTrace();
}

Sikuli allows you to define a fuzzing factor (if you’ve ever used ImageMagick, this should be a familiar concept). Essentially, rather than requiring a 1:1 exact match, you can define the minimum acceptable percentage you wish your screen comparison to match. For Sikuli, you can define this within a range from 0.1 to 1 (i.e. a 10% match up to a 100% match).

Here we are defining a default minimum match (or fuzz factor) of 90%. Additionally, we load a value from Saladhands’ test.properties file which, if present, can override the default 90% match, should we wish to increase or decrease the severity of the test criteria.

target.setMinScore(fuzzPercent);

Now that we know what fuzzing percentage we want to test with, we use target’s setMinScore method to set that property.

ScreenRegion found = fullScreen.find(target);

//According to the code examples, if the image isn't found, the screen region is undefined.
//So if it remains null at this point, we assume there's no match.

if(found == null) {
    return false;
}
else {
    return true;
}

This is where the magic happens. We create a new screen region called found, assigning it the result of fullScreen’s find method, to which we pass the target we built from our comparison image.

What happens here is that Sikuli will take the provided image (target) and attempt to locate any instance within the current visible screen that matches target, within the lower bound of the fuzzing percentage we set and up to a full, 100% match.

The find method either returns a new screen region object or returns nothing. Thus, if we are unable to find a match for target, found will remain undefined (null). So in this case, we simply return false if found is null (no match) or true if found is assigned a new screen region (we had a match).

Putting it all together:

To incorporate this behavior into our test framework, we write a simple Cucumber step definition that calls our Sikuli helper method and provides a local image file as an argument to compare against the current, active screen.

Here’s what the cucumber step looks like:

public class ScreenShotSteps {

    SikuliHelper sk = new SikuliHelper();

    //Given the image "X" can be found on the screen
    @Given("^the image \"([^\"]*)\" can be found on the screen$")
    public void the_image_can_be_found_on_the_screen(String arg1) {

        String screenShotDir = null;

        try {
            screenShotDir = PropertyLoader.loadProperty("screenshot.path").toString();
        }
        catch (IOException e) {
            e.printStackTrace();
        }

        Assert.assertTrue(sk.screenMatch(screenShotDir + arg1));
    }
}

We’re referring to the image file via the regex capture group. The step definition makes an assertion using TestNG that the value returned from our instance of SikuliHelper’s screenMatch method is true (Success!!!). If not, TestNG throws an exception and our test will be marked as having failed.

Finally, since we already have cucumber steps that let us invoke and direct Webdriver to a live site, we can write a test that looks like the following:

Feature: Screen Shot Test
As a QA tester
I want to do screen compares
So I can be a boss ass QA tester

Scenario: Find the nav element on BV's home page
Given I visit "http://www.bazaarvoice.com"
Then the image "screentest1.png" can be found on the screen

In this case, the image we are attempting to find is a portion of the nav element on BV’s home page:

screentest1

Considerations:

This is not a full-stop solution for cross-browser UI testing. Instead, we want to use Sikuli and tools like it to reduce overall manual testing as much as is reasonably possible by giving us the option to pre-warn product development teams of UI discrepancies. This can help us make better decisions on how to organize and allocate testing resources – manual and otherwise.

There are caveats to using Sikuli. The most significant is that tests designed with it cannot run headlessly – the test tool requires a real, visible screen to capture and manipulate.

Obviously, the other possible drawback is the required maintenance of the local image files you will need to check into your automation project as test artifacts. How deep you can go with this type of testing may be tempered by how large a file collection you can reasonably maintain or deploy.

Despite that, Sikuli seems to have a large number of powerful features, not limited to being able to provide some level of mobile device testing. Check out the project repository and documentation to see how you might be able to incorporate similar automation code into your project today.

Predictively Scaling EC2 Instances with Custom CloudWatch Metrics

One of the chief promises of the cloud is fast scalability, but what good is snappy scalability without load prediction to match? How many teams out there are still manually switching group sizes when load spikes? If you would like to make your Amazon EC2 scaling more predictive, less reactive and hopefully less expensive it is my intention to help you with this article.

Problem 1: AWS EC2 Autoscaling Groups can only scale in response to metrics in CloudWatch and most of the default metrics are not sufficient for predictive scaling.

For instance, by looking at the CloudWatch Namespaces reference page we can see that Amazon SQS queues, EC2 Instances and many other Amazon services post metrics to CloudWatch by default.

From SQS you get things like NumberOfMessagesSent and SentMessageSize. EC2 Instances post metrics like CPUUtilization and DiskReadOps. These metrics are helpful for monitoring. You could also use them to reactively scale your service.

The downside is that by the time you notice that you are using too much CPU or sending too few messages, you’re often too late. EC2 instances take time to start up and instances are billed by the hour, so you’re either starting to get a backlog of work while starting up or you might shut down too late to take advantage of an approaching hour boundary and get charged for a mostly unused instance hour.

More predictive scaling would start up the instances before the load became business critical or it would shut down instances when it becomes clear they are not going to be needed instead of when their workload drops to zero.

Problem 2: AWS CloudWatch default metrics are only published every 5 minutes.

In five minutes a lot can happen; with more granular metrics you could learn about your scaling needs quite a bit faster. Our team has instances that take about 10 minutes to come online, so 5 minutes can make a lot of difference to our responsiveness to changing load.

Solution 1 & 2: Publish your own CloudWatch metrics

Custom metrics can overcome both of these limitations: you can publish metrics related to your service’s needs, and you can publish them much more often.

For example, one of our services runs on EC2 instances and processes messages off an SQS queue. The load profile can vary over time; some messages can be handled very quickly and some take significantly more time. It’s not sufficient to simply look at the number of messages in the queue as the average processing speed can vary between 2 and 60 messages per second depending on the data.

We prefer that all our messages be handled within 2 hours of being received. With this in mind I’ll describe the metric we publish to easily scale our EC2 instances.

ApproximateSecondsToCompleteQueue = MessagesInQueue / AverageMessageProcessRate

The metric we publish is called ApproximateSecondsToCompleteQueue. A scheduled executor on our primary instance runs every 15 seconds to calculate and publish it.
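As a rough sketch of that scheduling and calculation (getQueueDepth, getAverageProcessRate, and publishMetric are hypothetical helpers standing in for however your service measures queue depth and throughput and wraps the CloudWatch call shown below):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// On the primary instance: calculate and publish the metric every 15 seconds.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(() -> {
    long messagesInQueue = getQueueDepth();              // e.g. from SQS GetQueueAttributes
    double messagesPerSecond = getAverageProcessRate();  // measured from our workers' own counters

    // Guard against a zero rate so we always publish a sane value.
    double approximateSecondsToCompleteQueue =
            messagesPerSecond > 0 ? messagesInQueue / messagesPerSecond : messagesInQueue;

    publishMetric(approximateSecondsToCompleteQueue);    // the PutMetricDataRequest call below
}, 0, 15, TimeUnit.SECONDS);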

private AmazonCloudWatchClient _cloudWatchClient = new AmazonCloudWatchClient();
_cloudWatchClient.setRegion(RegionUtils.getRegion("us-east-1"));

...

PutMetricDataRequest request = new PutMetricDataRequest()
  .withNamespace(CUSTOM_SQS_NAMESPACE)
  .withMetricData(new MetricDatum()
  .withMetricName("ApproximateSecondsToCompleteQueue")
  .withDimensions(new Dimension()
                    .withName(DIMENSION_NAME)
                    .withValue(_queueName))
  .withUnit(StandardUnit.Seconds)
  .withValue(approximateSecondsToCompleteQueue));

_cloudWatchClient.putMetricData(request);

In our CloudFormation template we have a parameter called DesiredSecondsToCompleteQueue, which defaults to 2 hours (7200 seconds). In the Auto Scaling Group we have a scale-up action triggered by an Alarm that fires when ApproximateSecondsToCompleteQueue exceeds DesiredSecondsToCompleteQueue.

"EstimatedQueueCompleteTime" : {
  "Type": "AWS::CloudWatch::Alarm",
  "Condition": "HasScaleUp",
  "Properties": {
    "Namespace": "Custom/Namespace",
    "Dimensions": [{
      "Name": "QueueName",
      "Value": { "Fn::Join" : [ "", [ {"Ref": "Universe"}, "-event-queue" ] ] }
    }],
    "MetricName": "ApproximateSecondsToCompleteQueue",
    "Statistic": "Average",
    "ComparisonOperator": "GreaterThanThreshold",
    "Threshold": {"Ref": "DesiredSecondsToCompleteQueue"},
    "Period": "60",
    "EvaluationPeriods": "1",
    "AlarmActions" : [{
      "Ref": "ScaleUpAction"
    }]
  }
}

 

Visualizing the Outcome

What’s a cloud blog without some graphs? Here’s what our load and scaling looks like after implementing this custom metric and scaling. Each of the colors in the middle graph represents a service instance. The bottom graph is in minutes for readability. Note that our instances terminate themselves when there is nothing left to do.

Screen Shot 2015-04-17 at 11.37.21 AM

I hope this blog has shown you that it’s quite easy to publish your own CloudWatch metrics and scale your EC2 AutoScalingGroups accordingly.

Upgrading Dropwizard 0.6 to 0.7

At Bazaarvoice we use Dropwizard for many of our Java-based SOA services. Recently I upgraded our Dropwizard dependency from 0.6 to the newer 0.7 version on a few different services. Based on this experience, I have some observations that might help other developers attempting the same upgrade.

Package Name Change
The first change to look at is the new package naming. The new io.dropwizard package replaces com.yammer.dropwizard. If you are using Coda Hale’s Metrics library as well, you’ll need to change com.yammer.metrics to com.codahale.metrics. I found that this was a good place to start the migration: if you remove the old dependencies from your pom.xml, you can start to track down all the places in your code that will need attention (if you’re using a sufficiently nosy IDE).

- com.yammer.dropwizard -> io.dropwizard
- com.yammer.dropwizard.config -> io.dropwizard.setup
- com.yammer.metrics -> com.codahale.metrics

Class Name Change
aka: where did my Services go?

Something you may notice quickly is that Service is gone; it has been renamed Application.

- Service -> Application
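For example, a minimal service class changes roughly like this (ExampleService, ExampleApplication, and ExampleConfiguration are placeholder names, not anything from a real project):

0.6

public class ExampleService extends Service<ExampleConfiguration> {
    @Override
    public void initialize(Bootstrap<ExampleConfiguration> bootstrap) {
    }

    @Override
    public void run(ExampleConfiguration configuration, Environment environment) {
    }
}

0.7

public class ExampleApplication extends Application<ExampleConfiguration> {
    @Override
    public void initialize(Bootstrap<ExampleConfiguration> bootstrap) {
    }

    @Override
    public void run(ExampleConfiguration configuration, Environment environment) {
    }
}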

Configuration Changes
The Configuration object hierarchy and YAML organization have also changed. The http section in the YAML has moved to server, with significant differences in how it works.

Here’s an old http configuration:

http:
  port: 8080
  adminPort: 8081
  connectorType: NONBLOCKING
  requestLog:
    console:
      enabled: true
    file:
      enabled: true
      archive: false
      currentLogFilename: target/request.log

and here is a new server configuration:

server:
  applicationConnectors:
    - type: http
      port: 8080
  adminConnectors:
    - type: http
      port: 8081
  requestLog:
    appenders:
      - type: console
      - type: file
        currentLogFilename: target/request.log
        archive: true

There are at least two major things to notice here:

  1. You can create multiple connectors for either the admin or application context. You can now serve several different protocols on different ports.
  2. Logging is now appender based, and you can configure a list of appenders for the request log.

Speaking of appender-based logging, the logging configuration has changed as well.

Here is an old logging configuration:

logging:
  console:
    enabled: true
  file:
    enabled: true
    archive: false
    currentLogFilename: target/diagnostic.log
  level: INFO
  loggers:
    "org.apache.zookeeper": WARN
    "com.sun.jersey.spi.container.servlet.WebComponent": ERROR

and here is a new one:

logging:
  level: INFO
  loggers:
    "org.apache.zookeeper": WARN
    "com.sun.jersey.spi.container.servlet.WebComponent": ERROR
  appenders:
    - type: console
    - type: file
      archive: false
      currentLogFilename: target/diagnostic.log

Now that you can configure a list of logback appenders, you can write your own or get one from a library. Previously this kind of logging configuration was not possible without significant hacking.

Environment Changes
The whole environment API has been re-designed for more logical access to different components. Rather than just making calls to methods on the environment object, there are now six component specific environment objects to access.

JerseyEnvironment jersey = environment.jersey();
ServletEnvironment servlets = environment.servlets();
AdminEnvironment admin = environment.admin();
LifecycleEnvironment lifecycle = environment.lifecycle();
MetricRegistry metrics = environment.metrics();
HealthCheckRegistry healthCheckRegistry = environment.healthChecks();

AdminEnvironment extends ServletEnvironment since it’s just the admin servlet context.

By treating the environment as a collection of libraries rather than a Dropwizard monolith, fine-grained control over several configurations is now possible and the underlying components are easier to interact with.

Here is a short rundown of the changes:

Lifecycle Environment
Several common methods were moved to the lifecycle environment, and the builder pattern for executor services has changed.

0.6:

     environment.manage(uselessManaged);
     environment.addServerLifecycleListener(uselessListener);
     ExecutorService service = environment.managedExecutorService("worker-%", minPoolSize, maxPoolSize, keepAliveTime, duration);
     ExecutorServiceManager esm = new ExecutorServiceManager(service, shutdownPeriod, unit, poolname);
     ScheduledExecutorService scheduledService = environment.managedScheduledExecutorService("scheduled-worker-%", corePoolSize);

0.7:

     environment.lifecycle().manage(uselessManaged);
     environment.lifecycle().addServerLifecycleListener(uselessListener);
     ExecutorService service = environment.lifecycle().executorService("worker-%")
             .minThreads(minPoolSize)
             .maxThreads(maxPoolSize)
             .keepAliveTime(Duration.minutes(keepAliveTime))
             .build();
     ExecutorServiceManager esm = new ExecutorServiceManager(service, Duration.seconds(shutdownPeriod), poolname);
     ScheduledExecutorService scheduledExecutorService = environment.lifecycle().scheduledExecutorService("scheduled-worker-%")
             .threads(corePoolSize)
             .build();

Other Miscellaneous Environment Changes
Here are a few more common environment configuration methods that have changed:

0.6

environment.addResource(Dropwizard6Resource.class);

environment.addHealthCheck(new DeadlockHealthCheck());

environment.addFilter(new LoggerContextFilter(), "/loggedpath");

environment.addServlet(PingServlet.class, "/ping");

0.7

environment.jersey().register(Dropwizard7Resource.class);

environment.healthChecks().register("deadlock-healthcheck", new ThreadDeadlockHealthCheck());

environment.servlets().addFilter("loggedContextFilter", new LoggerContextFilter()).addMappingForUrlPatterns(EnumSet.allOf(DispatcherType.class), true, "/loggedpath");

environment.servlets().addServlet("ping", PingServlet.class).addMapping("/ping");

Object Mapper Access

It can be useful to access the objectMapper for configuration and testing purposes.

0.6

ObjectMapper objectMapper = bootstrap.getObjectMapperFactory().build();

0.7

ObjectMapper objectMapper = bootstrap.getObjectMapper();

HttpConfiguration
This has changed a lot; it is much more configurable and not quite as simple as before.
0.6

HttpConfiguration httpConfiguration = configuration.getHttpConfiguration();
int applicationPort = httpConfiguration.getPort();

0.7

HttpConnectorFactory httpConnectorFactory = (HttpConnectorFactory) ((DefaultServerFactory) configuration.getServerFactory()).getApplicationConnectors().get(0);
int applicationPort = httpConnectorFactory.getPort();

Test Changes
The functionality provided by extending ResourceTest has been moved to ResourceTestRule.
0.6

import com.yammer.dropwizard.testing.ResourceTest;

public class Dropwizard6ServiceResourceTest extends ResourceTest {
  @Override
  protected void setUpResources() throws Exception {
    addResource(Dropwizard6Resource.class);
    addFeature("booleanFeature", false);
    addProperty("integerProperty", new Integer(1));
    addProvider(HelpfulServiceProvider.class);
  }
}

0.7

import io.dropwizard.testing.junit.ResourceTestRule;
import org.junit.Rule;

public class Dropwizard7ServiceResourceTest {

  @Rule
  public ResourceTestRule resources = setUpResources();

  protected ResourceTestRule setUpResources() {
    return ResourceTestRule.builder()
      .addResource(Dropwizard6Resource.class)
      .addFeature("booleanFeature", false)
      .addProperty("integerProperty", new Integer(1))
      .addProvider(HelpfulServiceProvider.class)
      .build();
  }
}

Dependency Changes

Dropwizard 0.7 has new dependencies that might affect your project. I’ll go over some of the big ones that I ran into during my migrations.

Guava
Guava 18.0 has a few API changes:

  • Closeables.closeQuietly only works on objects implementing InputStream instead of anything implementing Closeable.
  • All the methods on HashCodes have been migrated to HashCode.

Metrics
Metrics 3.0.2 is a pretty big revision of the old version; there is no longer a static Metrics object available as the default registry. Now MetricRegistries are instantiated objects that need to be managed by your application. Dropwizard 0.7 handles this by giving you a place to put the default registry for your application: bootstrap.getMetricRegistry().
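For example, a meter can now be created against that application-wide registry (the metric name here is just illustrative):

MetricRegistry metricRegistry = bootstrap.getMetricRegistry();
Meter messagesProcessed = metricRegistry.meter("messages-processed");
messagesProcessed.mark();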

Compatible library version changes
These libraries changed versions but required no other code changes. Some of them are changed to match Dropwizard dependencies, but are not directly used in Dropwizard.

  • Jackson 2.3.3
  • Jersey 1.18.1
  • Coursera Metrics-Datadog 1.0.2
  • Jetty 9.0.7.v20131107
  • Apache Curator 2.4.2
  • Amazon AWS SDK 1.9.21

Future Concerns
Dropwizard 0.8
The newest version of Dropwizard is now 0.8; once it is proven stable, we’ll start migrating. Hopefully I’ll find time to write another post when that happens.

Thank You For Reading
I hope this article helps.

Conversations API Deprecation for Versions 4.9, 5.0 and 5.1, and Custom Domains

This blog post only applies to the Conversations API and does not apply to any other Bazaarvoice product. You are able to identify the Bazaarvoice Conversations API by the following:

  • Path includes ‘data’: http://api.bazaarvoice.com/data/reviews.json?

Code related to the Bazaarvoice Hosted Display does not need modification. It can be identified by the following:

  • References ‘bvapi.js’: http://display.ugc.bazaarvoice.com/static/ClientName/en_US/bvapi.js

Still unsure if this applies to you? Learn more.

Today we are announcing two important changes to our Conversations API services:

  • Deprecation of Conversations API versions older than 5.2 (4.9, 5.0, 5.1)
  • Ending Conversations API service using custom domains

Both of these changes will go into effect on April 30, 2016.

Our newer APIs and universal domain system offer you important advantages in both features and performance. In order to best serve our customers, Bazaarvoice is focusing its API efforts on the latest, highest performing API services. By deprecating older versions, we can refocus our energies on the current and future API services, which we feel offer the most benefits to our customers. Please visit our Upgrade Guide to learn more about the Conversations API, our API versioning, and the steps necessary to support the upgrade.

We understand that this news may be surprising. This is your first notification of this change. In the months and weeks ahead, we will continue to remind you that this change is coming.

We also understand that this change will require effort on your part. Bazaarvoice is committed to making this transition easy for you. We are prepared to assist you in a number of ways:

  • Pre-notification: You have 12 months to plan for and implement the change.
  • Documentation: We have specific documentation to help you.
  • Support: Our support team is ready to address any questions you may have.
  • Services: Our services teams are available to provide additional assistance.

In summary, on April 30, 2016, Conversations API versions released before 5.2 will no longer be available. Applications and websites using versions before 5.2 will no longer function properly after April 30, 2016. In addition, all Conversations API calls, regardless of version, made to a custom domain will no longer respond. Applications and websites using custom domains (such as “ReviewStarsInc.ugc.bazaarvoice.com”) will no longer function properly after April 30, 2016. If your application or website is making API calls to Conversations API versions 4.9, 5.0 or 5.1, you will need to upgrade to the current Conversations API (5.4) and use the universal domain (“api.bazaarvoice.com”). Applications using Conversations API versions 5.2 and later (5.2, 5.3, 5.4) with the universal domain will continue to receive uninterrupted API service.

If you have any questions about this notice, please submit a case in Spark. We will periodically update this blog and our developer Twitter feed (@BazaarvoiceDev) as we move closer to the change of service date.

Thank you for your partnership,
Chris Kauffman
Sr. Product Manager

Automated Product Matching, Part II: Guidelines

This post continues the discussion from Automated Product Matching, Part I: Challenges.

System First, Algorithm Second

With each design iteration, I gradually came to appreciate how important it was to have an overall matching system that was well designed. The quality of the matching algorithm does not matter if its output is going to be undermined by failures in other parts of the system. Focusing on the algorithmic design first and the system design second means that you wind up with an interesting technical talk, but you have not really solved the problem for the end users and/or the business. An example from my past experience might help give a flavor for why the system view matters just as much as the algorithmic view.

In one of the earlier systems I worked on, after having successfully defined how to build a set of “canonical” products which would be used to match against all our incoming data, and having created a reasonably good matching algorithm, we were happy that we could now continually process and match all of our data each day and at scale. The problem was solved, but only in a static sense. We chose to ignore how new products would get into the canonical set. As time went on, this became more and more of a problem, until we finally had to address this omission. This was about the time when iPads first hit the market and the lack of freshness became glaringly obvious to anyone looking at iPads on our web site.

There was nothing algorithmically challenging about solving this: we knew how to create canonical products, but the code we built did not support adding new canonical products easily. Although the guts of the algorithmic logic could be re-used, the vast majority of the code that comprised the system we built around it needed to be redesigned. A little forethought here would have saved many months of additional work, not to mention all the bad user experiences that we were delivering due to our lack of matching newer products.

“A good algorithm in a bad system is indistinguishable from a bad algorithm.”

Keep it Simple

The difficulty of matching products ranges from easy to impossible. I recommend starting with an algorithm that focuses on matching the easiest things first and building a system around that. This allows you to start working on the important system issues sooner and get some form of product matching working faster. From a system design perspective, the product data needs to find its way to the matching algorithm; you will need a data model and data storage for the matches; you need some access layer for the matches; and you likely need some system for evaluating and managing the product matches.

The matching logic itself will be a very small percentage of the code you have to write and there are plenty of challenges in all these other areas. There are important lessons to be learned just in putting the system together, and even the simplest matching logic will lead to a greater understanding of how to build the appropriate data models you will need.

“For the human makers of things, the incompleteness and inconsistencies of our ideas become clear only during implementation.”

People cannot be Ignored

The topic name of “automated matching” implies that people will not be involved. Combine this with engineers who are conditioned to build systems that remove the rote, manual work from tasks and there is the risk of being completely blind to a few important questions.

Most fundamentally, you should ask whether you really need automated matching and whether it will be the most cost-effective solution. This is going to be determined by the scale of your problem. If your matching needs are on the order of only thousands of products, there are crowd-sourcing solutions that make manual matching a very viable option. If your scale is on the order of millions, manual matching is not out of the question, though it may take some time and money to get through all the work. Once you get into the range of tens of millions, you likely have little choice but to use some form of automated matching.

Another option is a hybrid approach that uses algorithms to generate candidate matches and has people assigned to accept or reject the matches. This puts less pressure on the accuracy requirements of your algorithms and makes the people process more efficient, so it can be viewed as an optimization of a manual matching system. An approach that scales slightly better is to automatically match the easy products and defer the harder ones to manual matching or verification.

The other question about human involvement depends on how the quality of the matching system will be measured. Building training and/or evaluation data sets will likely require some human input and tools to support this work. Considering how feedback will be used is important because it can have an impact on the matching system and algorithm designs. Evaluation will likely need to be an ongoing process, so make sure consideration is given to the longer term human resource requirements.

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” — Mark Twain

One Algorithm will not Rule Them All

Simply put: it is not possible for a single algorithm to do equally well matching all types of products. It is possible to use the same class of algorithm on some parts of the product category space, but you will need to parameterize the algorithm with category-specific information. Here are a couple of examples to illustrate this point.

Consider the color attribute of a product. For a product like a dishwasher, color is a secondary characteristic and would be unimportant for the purpose of review content. For some types of products, say a computer monitor, the color might not even matter for price comparisons. For a $50 savings, many people are not going to care if their monitor is black or silver. On the other hand, for products like cosmetics, the color is the most essential feature of the product. If your algorithm universally treats all attributes the same while matching, regardless of the type of product it is, then it will necessarily perform poorly in some categories.

To get better accuracy you will have to invest in human domain expertise in either encoding domain-specific information, or training algorithms for each product category. If you have ever taken a hard look at camera products, there is a lot of cryptic symbology in the lens specifications and the other camera accessories. Without encoding knowledge from a domain expert, it is not going to be possible to match these types of products well. There’s no silver bullet. You can decide to allocate your time to one set of categories over another, but you should expect limited accuracy in the areas you have not invested in.

Another example lies in the contrast between consumer electronics and books. The product titles for consumer electronics are descriptive in that they contain a list of product features. With a rich enough title, there are enough features to yield relatively high confidence in matches. However, titles for books are arbitrary words and phrases chosen by the author and may give you little understanding of the contents. Similarity between book titles is not correlated with the similarity of their content.

“Do not mistake knowing something for knowing everything.”

Products are Not Strings

String-based matching algorithms may suffice depending on your targets for accuracy and coverage, but there is a hard limit on how well they will perform without imparting semantics to the strings. Not all words in product titles are created equal, so it helps to do something that is akin to part of speech tagging (e.g., The product “noun” is much more important than a product’s adjective, such as its color). Showing two different dishwashers as being the same might be a data error, but it is a characteristically different user experience than showing a dishwasher and a shoe as being the same. A string comparison algorithm might match the shoe to the dishwasher because it had the same color plus a few other strings in common, but no understanding that the mismatch of nouns “shoe” and “dishwasher” should trump anything else that might be indicating that they are similar.

You will need more than just adjectives and nouns, though. There are many different types of adjectives used to describe products: colors, materials, dimensions, quantities, shapes, patterns, etc. Depending on the type of product, these may or may not matter in how you want to define product equivalence.

It is also true that two different strings do not necessarily refer to two different concepts. If you do not encode the knowledge that “loafer” and “shoe” are semantically similar, even though they have no string similarity, you will be limited in matching the variations that different data sources will provide. For more accurate results, it is important to semantically tokenize the strings so that your algorithms can work on a normalized, conceptual view of the products.

Algorithmic techniques might help in dealing with these synonyms, but if the domain vocabulary is restricted, it may even be feasible to manually curate the important variations and their parts of speech. Whether algorithmic or hand curated, you will need to encode this domain knowledge so that it depends on the product’s context. The string “apple” may refer to a popular brand, a deciduous fruit, or the scent of a hair care product. Category and peripheral information about the product will be needed to disambiguate “apple” and similar strings.
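
As a rough illustration of hand-curated semantic tokenization, here is a sketch; the synonym table and category senses are hypothetical examples, and a real vocabulary would be far larger:

# Hand-curated mapping from surface strings to canonical concepts
# (hypothetical entries for illustration).
SYNONYMS = {"loafer": "shoe", "loafers": "shoe", "sneaker": "shoe"}

# Context-dependent senses for ambiguous terms, keyed by product category.
AMBIGUOUS = {
    "apple": {
        "electronics": "brand:apple",
        "grocery": "fruit:apple",
        "hair care": "scent:apple",
    }
}

def semantic_tokens(title, category):
    tokens = []
    for word in title.lower().split():
        if word in AMBIGUOUS:
            tokens.append(AMBIGUOUS[word].get(category, word))
        else:
            tokens.append(SYNONYMS.get(word, word))
    return tokens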

“Algorithms are for people who don’t know how to buy RAM.”

NLP Will Not Save You

Product titles are not amenable to generic natural language processing (NLP) solutions. Product titles are not well-formed sentences and have their own structure that often varies by the person or company that crafted them. Thinking that product matching can be solved with some off-the-shelf NLP techniques is a mistake. There are some NLP techniques that can be applied, but they have to be carefully tailored to work in this domain.

Consider the relative importance of word order between product titles for consumer electronics and for books. For electronics, title word order does not really matter: “LCD TV 55 inch Sony” is not semantically different from “Sony 55 inch LCD TV”. Yet if you change the order of two words in a book’s title, you have something else entirely: “The Book of the Moon” and “The Moon Book” are two completely different books.
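
One way to encode that difference is to route titles to an order-insensitive or order-sensitive comparison based on category; the sketch below is a simplification (a real system would use tuned similarity measures rather than exact set or sequence matches):

def title_similarity(category, title_a, title_b):
    a, b = title_a.lower().split(), title_b.lower().split()
    if category == "books":
        # Order matters: require the same word sequence (a real system would
        # use an order-sensitive edit distance instead of exact equality).
        return 1.0 if a == b else 0.0
    # Order does not matter: Jaccard similarity over the token sets.
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

# "LCD TV 55 inch Sony" vs "Sony 55 inch LCD TV"  -> 1.0
# "The Book of the Moon" vs "The Moon Book"       -> 0.0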

Product descriptions offer the best opportunity for NLP techniques, since they tend to be natural language. Unfortunately, all sorts of peripheral concepts are included in descriptions, which makes them hard to use for product matching. It is also true that the descriptions for similar, but not identical, products tend to look very similar. The best use of descriptions is in helping to determine the product’s category, which can help with matching, but do not expect them to provide a strong signal for matching.

“If you find a solution and become attached to it, the solution may become your next problem.”

Design for Errors

Neither the input product data nor your matching algorithm will be 100% accurate. Make sure your algorithms do not rigidly expect consistent data, and that they can compensate for and/or correct bad data when it is detected. This is easier said than done, especially because we have all been conditioned to prefer more elegant and understandable code. Software can look quite poetic when you do not have to litter it with sanity checks for edge cases and all the exception handling they lead to. Unfortunately, the real world of crawl and feed data is not very elegant, nor will your algorithms produce flawless results.

This assumption about imperfect data should not be limited to the technical side of the product. Product designers need to work closely with the algorithm designers to understand the characteristics of the data and the nature of the errors, because those characteristics shape what a good user experience looks like. Different algorithmic choices result in different types of errors, and only by working together can the trade-offs be evaluated and choices made that will shape how users perceive the product matching accuracy.

As a simple example of how designers and engineers can work together to make a better solution, suppose the engineers build a matching algorithm that outputs some measure of confidence. In isolation, the engineers will have to find a threshold that balances accuracy and coverage, then declare matches for everything above it. In this scenario, the user interface designer only knows whether or not there are matches, so the interface is designed with wording that says “matching products”. If those products sit at the lower end of the confidence range and are bad matches, the result is a bad user experience.

Alternatively, if the designers are aware that there is a spectrum of match confidence, they could agree to expose those confidence values and instead of having to declare “matching products”, when the confidence is lower, they might opt to use softer wording like “similar products”, maybe even positioning them differently on the page. A user will not be quite as disappointed in the matching if they were only promised “similar” products.
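
A minimal sketch of that shared agreement between engineering and design might look like the following; the thresholds and labels are illustrative and would need to be tuned against real evaluation data:

def match_presentation(confidence):
    # Thresholds are illustrative; engineering and design would choose them
    # together based on evaluation results.
    if confidence >= 0.9:
        return "matching products"
    if confidence >= 0.6:
        return "similar products"
    return None  # too uncertain to show anything at all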

“There are two ways to write error-free programs; only the third one works.” — Alan J. Perlis

Choose the Right Metrics

Suppose you have built your matching system, along with an evaluation system to go with it, and you find that the accuracy rate is 95%. Assuming your system is giving reasonably good coverage, and in the presence of bad and missing data, this is definitely an impressive achievement. But what if the currently most popular products lie within that 5% of errors? If you weight error frequency by page views, the effective accuracy rate is going to be much, much lower. All the people viewing those mismatched products are not going to be impressed with your algorithm.

Even without weighting by page views, consider a situation where you display 20 products at a time on a page. With a 5% error rate, on average every page you show contains an error. Measured per page, your error rate is not 5% but 100%.
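
The arithmetic behind a traffic-weighted accuracy metric is simple to sketch; the numbers in the comment are made up for illustration:

def weighted_error_rate(errors, page_views):
    # errors[i] is 1 if product i is mismatched, 0 otherwise;
    # page_views[i] is the traffic that product receives.
    bad_views = sum(e * v for e, v in zip(errors, page_views))
    return bad_views / sum(page_views)

# If 5% of products are mismatched but those products draw 40% of the
# traffic, the error rate users actually experience is 40%, not 5%.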

Matching algorithms will not be perfect, and even near-perfect algorithms will need help. This help usually comes in the form of tools that allow human intervention to influence the overall quality or the algorithmic output. When you are making an error on a highly visible product, someone should be able to quickly override the algorithmic results to fix the problem.

“Williams and Holland’s Law: If enough data is collected, anything may be proven by statistical methods.”

Are We Done Yet?

There is no shortage of other product matching topics to discuss and interesting details to dive into. These first two blog posts have tried to capture some of the higher-level considerations. Future articles will provide more detailed examinations of these topics and some of the approaches we have taken in Bazaarvoice’s product matching systems.

Full Consistency Lag for Eventually Consistent Systems

A distributed data system consisting of several nodes is said to be fully consistent when all nodes have the same state of the data they own. So, if record A is in state S on one node, then we know that it is in state S in all of its replicas and data centers.

[Figure: a fully consistent state vs. an inconsistent state in which some deltas have not yet replicated to all nodes]

Full consistency sounds great. The catch is the CAP theorem, which states that it is impossible for a distributed system to simultaneously guarantee consistency (C), availability (A), and partition tolerance (P). At Bazaarvoice, we have sacrificed full consistency to get an AP system and contend with an eventually consistent data store. One way to define eventual consistency is that there is a point in time in the past before which the system is fully consistent (the full consistency timestamp, or FCT). The duration between the FCT and now is the full consistency lag (FCL).

An eventually consistent system may never be in a fully consistent state given a massive write throughput. However, what we really want to know deterministically is the last time before which we can be assured that all updates were fully consistent on all nodes. So, in the figure above, in the inconsistent state, we would like to know that everything up to Δ2 has been replicated fully, and is fully consistent. Before we get down to the nitty-gritty of this metric, I would like to take a detour to set up the context of why it is so important for us to know the full consistency lag of our distributed system.

At Bazaarvoice, we employ an eventually consistent system of record that is designed to span multiple data centers, using multi-master conflict resolution. It relies on Apache Cassandra for persistence and cross-data-center replication.

One of the salient properties of our system of record is immutable updates. That essentially means that a row in our data store is simply a sequence of immutable updates, or deltas. A delta can be a creation of a new document, an addition, modification, or removal of a property on the document, or even a deletion of the entire document. For example, a document is stored in the following manner in Cassandra, where each delta is a column of the row.

Δ1 { "rating": 4,
"text": "I like it."}
Δ2 { .., "status": "APPROVED" }
Δ3 { .., "client": "Walmart" }

So, when a document is requested, the reader process resolves all the above deltas (Δ1 + Δ2 + Δ3) in that order, and produces the following document:

{ "rating": 4,
"text": "I like it.",
"status": "APPROVED",
"client": "Walmart",
"~version": 3 }

Note that these deltas are stored as key-value pairs whose keys are time UUIDs. Cassandra will therefore always present them in increasing order of insertion, ensuring the last-write-wins property. Storing rows in this manner allows us massive non-blocking global writes: writes to the same row from different data centers across the globe will eventually converge to a consistent state without any cross-data-center calls. This point alone warrants a separate blog post, but it will have to suffice for now.

To recap, rows are nothing but a sequence of deltas. Writers simply append deltas to a row without caring about its existing state. When a row is read, the deltas are resolved in ascending order to produce a JSON document.
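
In rough Python pseudocode, the reader's resolution step looks something like this sketch; it assumes deltas arrive as dictionaries already sorted by their time UUIDs, and the "~deleted" marker is a hypothetical representation of a delete delta:

def resolve(deltas):
    # Deltas are assumed to be sorted by time UUID, so later writes win.
    document = {}
    for delta in deltas:
        if delta.get("~deleted"):
            document = {}          # a delete delta wipes the document
            continue
        document.update(delta)     # additions and modifications overwrite
    document["~version"] = len(deltas)
    return document

resolve([{"rating": 4, "text": "I like it."},
         {"status": "APPROVED"},
         {"client": "Walmart"}])
# -> {"rating": 4, "text": "I like it.", "status": "APPROVED",
#     "client": "Walmart", "~version": 3}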

There is one problem with this: over time a row will accrue many updates, causing it to become very wide. Writes will still be fine, but reads can become too slow as the system tries to consolidate all those deltas into one document. This is where compaction helps. As the name suggests, compaction resolves several deltas and replaces them with one “compacted” delta. Subsequent reads see only the compaction record, and the read slowness is resolved.

[Figure: several deltas on a row replaced by a single compaction record]

Great. However, a major challenge comes with compaction in a multi-data-center cluster: when is it OK to compact rows on a local node in a data center? Specifically, what if an older delta arrives after we are done compacting? If we arbitrarily decide to compact rows every five minutes, then we run the risk of losing deltas that may still be in flight from a different data center.

To solve this issue, we need to figure out which deltas are fully consistent on all nodes and compact only those deltas. In other words: “Find a time (t) in the past before which all deltas are available on all nodes.” This t, the full consistency timestamp, assures us that no delta will ever arrive with a time UUID before it. Thus, everything before the full consistency timestamp can be compacted without any fear of data loss.
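
A compaction pass can then be gated on that timestamp. The sketch below is a simplification of the idea, not our actual storage format: each delta is assumed to carry a "ts" field derived from its time UUID and a "content" dictionary:

def compact(deltas, full_consistency_timestamp):
    # Only deltas at or before the full consistency timestamp are safe to
    # fold together; newer deltas may still have in-flight peers.
    safe   = [d for d in deltas if d["ts"] <= full_consistency_timestamp]
    recent = [d for d in deltas if d["ts"] >  full_consistency_timestamp]
    if not safe:
        return deltas
    merged = {}
    for delta in safe:
        merged.update(delta["content"])    # resolve the safe deltas in order
    compacted = {"ts": safe[-1]["ts"], "content": merged, "compacted": True}
    return [compacted] + recent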

There is just one issue: this metric is absent from out-of-the-box AP systems such as Cassandra. To me, it is a vital metric for an AP system; it would be rare to find a business use case in which permanent inconsistency is tolerable.

Although Cassandra doesn’t provide the full consistency lag, we can still compute it in the following way:

Tf = the time at which no pending hints were found on any node
rpc_timeout = the maximum timeout Cassandra nodes use when communicating with each other

FCT = Full Consistency Timestamp = Tf - rpc_timeout
FCL = Full Consistency Lag = Tnow - FCT

The concept of hinted handoff was introduced in Amazon’s Dynamo paper as a way of handling failure, and it is what Cassandra leverages for fault-tolerant replication. Basically, if a write is destined for a replica node that is down, the coordinator node stores a “hint” and replays it to that replica once it comes back up, within a configured window.

We exploit this feature of Cassandra to get our full consistency lag. The main idea is to poll all the nodes to see whether they have any pending hints for other nodes. The time at which they all report zero (Tf) is when we know there are no failed writes and the only pending writes are those still in flight. Subtracting the Cassandra timeout (rpc_timeout) from Tf gives the full consistency timestamp, and from that the full consistency lag.

Now that we have our full consistency lag, this metric can be used to alert the appropriate people when the cluster is lagging too far behind.
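
Here is a rough sketch of that polling computation. The pending_hint_count() helper is hypothetical and stands in for however you expose per-node hint counts (for example via JMX), and the timeout constant is illustrative:

import time

RPC_TIMEOUT_SECONDS = 10   # illustrative stand-in for Cassandra's rpc timeout

def poll_full_consistency_lag(nodes, pending_hint_count, last_known_tf):
    # pending_hint_count(node) is a hypothetical helper returning how many
    # hints a node currently holds for its peers.
    now = time.time()
    if all(pending_hint_count(node) == 0 for node in nodes):
        last_known_tf = now                    # Tf: no failed writes outstanding
    fct = last_known_tf - RPC_TIMEOUT_SECONDS  # full consistency timestamp
    return now - fct, last_known_tf            # (FCL in seconds, updated Tf)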

Finally, you would want to graph this metric for monitoring.

FCTMonitor

Note that in the graph above we artificially added a five-minute lag to our rpc_timeout value to avoid excessively frequent compactions, and we poll for full consistency every 300 seconds (five minutes). You should tweak these values according to your needs. With our settings the expected lag is five minutes, but you can see it spike to ten minutes. All that really says is that one time we checked and found a few hints; the next time we checked (five minutes later, in our case) all the hints had been taken care of. You can now set an alert that wakes people up if this lag violates a threshold that makes sense for your business, perhaps several hours.

Automated Product Matching, Part I: Challenges

Bazaarvoice’s flagship product is a platform for our clients to accept, display and manage consumer generated content (CGC) on their web sites. CGC includes reviews, ratings, images, videos, social network content, etc. Over the last few years, syndicating CGC from one site to another has become increasingly important to our customers. When a user submits a television review on Samsung’s branded web site, it benefits Samsung, Target and the consumer when that review can be shown on Target’s retail web site.

Before syndicating CGC became important to Bazaarvoice, our content could be isolated for each individual client. There was never any need for us to consider the question of whether our clients had any overlap in their product catalogs. With syndication, it is now vital for us to be able to match products across all our clients’ catalogs.

The product matching problem is not unique to Bazaarvoice. Shopping comparison engines, travel aggregators and ticket brokers are among the other domains that require comprehensive and scalable automated matching. This is a common enough problem that there are even a number of companies trying to grow a business based on providing product matching as a service.

Overview

I have helped design and build product matching systems five times across two different domains, and I will share some of what I have learned about the characteristics of the problem and its solutions. This article is not about specific algorithms or technologies, but about the guidelines and requirements to consider when designing a technical solution. That means addressing not just the algorithmic challenges, but also the equally important issues of designing product matching as a system.

Blog posts are best kept to a modest length, and I have many more thoughts to share on this topic than would be polite to include in a single article, so I have divided this discussion into two parts. This blog post is about the characteristics that make this an interesting and challenging problem. The second posting will focus on guidelines to follow when designing a product matching system.

The focus here will be on retail product matching, since that is where my direct experience lies. I am sure that there are additional lessons to be learned in other domains, but I think many of these insights may be more broadly applicable.

“If at first you don’t succeed, you must be a programmer.”

Imprecise Requirements

Product matching is one of those problems that initially seems straightforward, but whose complexity is revealed only after having immersed oneself in it. Even the most enlightened product manager is not going to have the time to spell out, in detail, how to deal with every nuance that arises. Understanding problems in depth, and filling in the large number of unspecified items with reasonably good solutions, is why many software engineers are well paid. It is also what makes our jobs more interesting than most, since it allows us to invoke our problem solving and design skills, which we generally prefer to rote execution of tasks.

I am not proposing that the engineers should fill in all the details without consulting the product managers and designers; I only mean that the engineers should expect the initial requirements to need refinement. Ideally both will work to fill in the gaps, but the engineers should expect to be the ones uncovering and explaining them.

“I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated.” — Poul Anderson

What is a “Product”?

Language is inherently imprecise. The same word can refer to completely different concepts at different times, yet it causes no confusion when the people conversing share the same contextual information. Software engineers creating a data model, on the other hand, have to explicitly enumerate, encode, and name all the concepts in the system. This is a fundamental difference between how engineers and others view the problem, and it can be a source of frustration when engineers begin injecting questions like “What is a product?” into the requirements process. Those who are not accustomed to diving into the concepts underlying their use of a word can feel like this is a time-wasting philosophical discussion.

I’ve run across 8 distinct concepts where the word “product” has been used. The most basic difference lies between those “things” that you are trying to match and the “thing” you are using as the basis of the match. Suppose you get a data feed from Acme, Inc. which includes a thing called an “Acme Giant Rubber Band” and that you also crawled the Kwik-E-Mart web site, which yielded a thing called an “Acme Giant Green Rubber Band”. You then ask the question, are these the same “product”? Here we have an abstract notion of a specific rubber band in our mind and we are asking the question of whether these specific items from those two data sources match this concept.

Now let us also suppose that the “Acme Giant Rubber Band” item in the Acme data feed lists 6 different UPC values, corresponding to the 6 different colors they manufacture for the product. This means the “thing” in the feed is really a set of individual items, while the “Acme Giant Green Rubber Band” we saw on the Kwik-E-Mart web site is just a single item. These are similar, but not identical, product-related concepts.

With just this simple example, there are 3 different concepts floating around, yet “product” is the word people will often use for each of them. For most domains, once you really start to explore the data model that is required, more than three product-related concepts will likely be needed.
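
As a sketch of how those concepts might be kept distinct in a data model (the class and field names are hypothetical):

from dataclasses import dataclass, field
from typing import List

@dataclass
class FeedProduct:
    # A "product" as it appears in a manufacturer feed: one entry, many UPCs.
    name: str
    upcs: List[str] = field(default_factory=list)   # e.g., one UPC per color

@dataclass
class CrawledOffer:
    # A "product" as it appears on a crawled retail page: a single item.
    title: str
    url: str

@dataclass
class CanonicalProduct:
    # The abstract "product" that feed entries and crawled offers match to.
    canonical_name: str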

Software designers must carefully consider how many different “product” concepts they need to model, and those helping to define the requirements should appreciate the importance of, and invest time in, understanding the differences between those concepts. The importance of getting this data model correct from the start cannot be stressed enough.

“If names are not correct, then language is not in accord with the truth of things. If language is not in accord with the truth of things, then affairs cannot be carried out successfully.” — Confucius

Equality for All?

You should start with the most basic of questions: What is a “match”? My experience working on product matching in different domains and varying use cases is that there is not a single definition of product equality that applies everywhere. For those that have never given product matching much thought beyond their intuition, this might seem like an odd statement: two products are either the same or they are not, right? By way of example, here is an illustration of why different use cases require different notions of equality.

Suppose you are shopping for some 9-volt batteries and you are interested in seeing which brands tend to last longer based on direct user experience. You do a search, you navigate through some web site and then will likely need to make a choice at some point: are you looking to buy the 2-pack, the 4-pack or the 8-pack?

Having to choose a quantity at this point may be premature, but you usually have to do so to get at the review content. However, the information you are looking for, and likely the bulk of the review content, is independent of the size of the box in which the batteries are packaged. Requiring a quantity choice to get at review content may just be a bad user experience, but regardless, you certainly would not want to miss out on relevant review content simply because you chose the wrong quantity at this point in your research.

The conclusion here is that reviews posted to the web page for the 2-pack and reviews posted for the page of an 8-pack should probably not be fragmented. Therefore, for the purposes of review content, these two products, which would have different UPC and/or EAN values, should be considered equivalent.

Now suppose you have decided which brand of battery to buy and are looking for the best price on an 8-pack. For a price comparison, you most definitely do not want the 2-pack prices mixed in with those of its 8-pack equivalent. Here, for price comparisons, these two products should definitely not be considered equivalent.
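
One way to capture this in code is to make equivalence a function of the use case rather than a single global predicate; the record fields below are a simplified, hypothetical model:

def equivalent_for_reviews(a, b):
    # Review content applies across pack sizes: compare brand and model only.
    return (a["brand"], a["model"]) == (b["brand"], b["model"])

def equivalent_for_pricing(a, b):
    # Price comparison must also respect quantity and packaging.
    return equivalent_for_reviews(a, b) and a["pack_size"] == b["pack_size"]

two_pack   = {"brand": "Acme", "model": "9V Alkaline", "pack_size": 2}
eight_pack = {"brand": "Acme", "model": "9V Alkaline", "pack_size": 8}

equivalent_for_reviews(two_pack, eight_pack)   # True
equivalent_for_pricing(two_pack, eight_pack)   # False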

Understanding that product equivalence varies by context is not only important for designing algorithms and software systems, but has a lot of implications for creating better user experiences. For the companies looking to offer product matching as a service, the flexibility they offer in tailoring the definition of equality for their clients will be an important factor in how broadly applicable their solutions will be.

“It is more important to know where you are going than to get there quickly. Do not mistake activity for achievement.” — Isocrates

Imperfect Data Sources

If all the products you need to match have been assigned a globally unique identifier, such as a UPC, EAN or ISBN, and you have access to that data, and the data can be trusted, then product matching could be trivial. However, not all products get assigned such a number and for those that do, you do not always have access to those values. As discussed, it is also true that a “match” cannot always be defined simply by the equality of unique identifiers.

Those that crawl the web for product data tend to think that a structured data feed is the answer to getting better data. However, the companies that create product feeds vary greatly in their competency. Even when competent, they may build their feed from one system’s database, while more useful information may be stored in another system. Further, the competitive business landscape can result in companies wanting to deliberately suppress or obfuscate identifying information. You also have the ubiquitous issues of software bugs and data entry errors to contend with. All these realities add up to the fact that data feeds are not a panacea for product matching.

So while the web crawling folks wish for feed data, the feed processing folks simultaneously wish for crawled data to fill in their gaps. The first piece of advice for building a product matching system is to assume you will need to accept data from a variety of data sources. The ability to fill in data gaps with alternative sources will give you the best of both worlds. This also means you may not only be matching products between different sites, but also matching products within the same site and merging the data from different sources to form a single view of a product at that site. I know of one very large shopping comparison site that did not design for this case and found itself unable to support particular types of new business opportunities.
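
A sketch of what merging sources into a single per-site view might look like; the field names and the simple source-priority rule are hypothetical:

# Hypothetical rule: trust the structured feed over the crawl when both
# provide the same field.
SOURCE_PRIORITY = ["feed", "crawl"]

def merge_product_records(records):
    # records maps source name -> {field: value} for one product at one site.
    merged = {}
    for source in reversed(SOURCE_PRIORITY):
        merged.update(records.get(source, {}))   # higher priority applied last
    return merged

merge_product_records({
    "crawl": {"title": "Acme Giant Green Rubber Band", "price": 4.99},
    "feed":  {"title": "Acme Giant Rubber Band", "upc": "012345678905"},
})
# -> {"title": "Acme Giant Rubber Band", "price": 4.99, "upc": "012345678905"}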

“If you think the problem is bad now, just wait until we’ve solved it.” — Arthur Kasspe

Look Before You Leap

The specific algorithms and technologies one chooses for an automated product matching system should not be the primary focus. It is very tempting for us information scientists and engineers to dive right into the algorithmic and technical solutions. After all, this is predominantly what universities have trained us to focus on and, in some sense, is the more interesting part of the problem. You can choose almost any one of a host of algorithms and get some form of product matching fairly quickly. Depending on your specific quality requirements, a simple system may be enough, but if there are higher expectations for a matching system, you will need a lot more than just a fancy algorithm.

When more than simple matching is needed, it will not be the algorithm you use, but how you use the algorithm that will matter. This means really understanding the characteristics of the problem in the context of your domain. It is also important not to define the problem too narrowly. There are a bunch of seemingly tangential issues in product matching that are very easy to put into the bucket of “we can deal with that later”, but which turn out to be very hard to deal with after the fact. It is how well you handle all of these practical details that will most influence the overall success of the project.

A simplistic data model is an example of something that may seem like a good starting approach. However, it will wind up so deeply ingrained in the software that it becomes nearly impossible to change, and you end up with either serious capability limitations or a series of kludges that both complicate your software and lead to unintended side effects. I learned this from experience.

“A doctor can bury his mistakes but an architect can only advise his clients to plant vines.” — Frank Lloyd Wright

Up Next

This posting covers some of the important characteristics of the product matching problem. In the sequel, there will be some more specific guidelines for building matching systems.