Automated Product Matching, Part I: Challenges

Bazaarvoice’s flagship product is a platform for our clients to accept, display and manage consumer generated content (CGC) on their web sites. CGC includes reviews, ratings, images, videos, social network content, etc. Over the last few years, syndicating CGC from one site to another has become increasingly important to our customers. When a user submits a television review on Samsung’s branded web site, it benefits Samsung, Target and the consumer when that review can be shown on Target’s retail web site.

Before syndicating CGC became important to Bazaarvoice, our content could be isolated for each individual client. There was never any need for us to consider the question of whether our clients had any overlap in their product catalogs. With syndication, it is now vital for us to be able to match products across all our clients’ catalogs.

The product matching problem is not unique to Bazaarvoice. Shopping comparison engines, travel aggregators and ticket brokers are among the other domains that require comprehensive and scalable automated matching. This is a common enough problem that there are even a number of companies trying to grow a business based on providing product matching as a service.

Overview

I have helped design and build product matching systems five different times across two different domains, and I will share some of what I have learned about the characteristics of the problem and its solutions. This article is not about specific algorithms or technologies, but about the guidelines and requirements to keep in mind when designing a technical solution. It addresses not just the algorithmic challenges, but also the equally important issues of designing product matching as a system.

Blog posts are best kept to a modest length, and I have many more thoughts to share on this topic than would be polite to include in a single article, so I have divided this discussion into two parts. This blog post is about the characteristics that make this an interesting and challenging problem. The second posting will focus on guidelines to follow when designing a product matching system.

The focus here will be on retail product matching, since that is where my direct experience lies. I am sure that there are additional lessons to be learned in other domains, but I think many of these insights may be more broadly applicable.

“If at first you don’t succeed, you must be a programmer.”

Imprecise Requirements

Product matching is one of those problems that initially seems straightforward, but whose complexity is revealed only after having immersed oneself in it. Even the most enlightened product manager is not going to have the time to spell out, in detail, how to deal with every nuance that arises. Understanding problems in depth, and filling in the large number of unspecified details with reasonably good solutions, is why many software engineers are well paid. It is also what makes our jobs more interesting than most, since it lets us exercise our problem-solving and design skills, which we generally prefer to rote execution of tasks.

I am not proposing that engineers should fill in all the details without consulting the product managers and designers; I only mean that engineers should expect the initial requirements to need refinement. Ideally both sides will work to fill in the gaps, but the engineers should expect to be the ones uncovering and explaining them.

“I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated.” — Poul Anderson

What is a “Product”?

Language is inherently imprecise. The same word can refer to completely different concepts at different times and yet it causes no confusion when the people conversing share the same contextual information. On the other hand, software engineers creating a data model have to explicitly enumerate, encode, and give names to all the concepts in the system. This is a fundamental difference between how the engineers and others view the problem and can be a source of frustration when engineers begin to inject questions into the requirements process such as: “What is a product?”. Those that are not accustomed to diving into the concepts underlying their use of a word can often feel like this is a time-wasting, philosophical discussion.

I’ve run across 8 distinct concepts where the word “product” has been used. The most basic difference lies between those “things” that you are trying to match and the “thing” you are using as the basis of the match. Suppose you get a data feed from Acme, Inc. which includes a thing called an “Acme Giant Rubber Band” and that you also crawled the Kwik-E-Mart web site, which yielded a thing called an “Acme Giant Green Rubber Band”. You then ask the question, are these the same “product”? Here we have an abstract notion of a specific rubber band in our mind and we are asking the question of whether these specific items from those two data sources match this concept.

Now let us also suppose that the “Acme Giant Rubber Band” item in the Acme data feed lists 6 different UPC values, which correspond to 6 different colors they manufacture for the product. This means that the “thing” in the feed is really a set of individual items, while the “Acme Giant Green Rubber Band” we saw on the Kwik-E-Mart web site is just a single item. These are similar, but not identical, product-related concepts.

With just this simple example, there are 3 different concepts floating around, yet “product” is the word people will often use for each of them. For most domains, when you really start to explore the data model that is required, more than three product-related concepts will likely be needed.
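
To make the distinction concrete, here is a purely illustrative sketch of three separate concepts from the rubber band example; the names and fields are hypothetical and not an actual data model:

// 1. A feed listing: one entry in Acme's data feed that actually denotes
//    a set of sellable items (one UPC per color).
var feedListing = {
  source: 'acme-feed',
  name: 'Acme Giant Rubber Band',
  upcs: ['<upc-red>', '<upc-green>' /* ...one per color, 6 in total */]
};

// 2. A crawled item: a single thing seen on one page of one web site.
var crawledItem = {
  source: 'kwik-e-mart-crawl',
  name: 'Acme Giant Green Rubber Band',
  url: 'http://kwik-e-mart.example/acme-giant-green-rubber-band'
};

// 3. The abstract product concept that both of the above are matched against.
var productConcept = {
  id: 'concept-42',
  description: 'Acme giant rubber band (any color)'
};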

Software designers must carefully consider how many different “product” concepts they need to model, and those helping to define the requirements should appreciate the importance of, and invest time in, understanding the differences between the concepts. The importance of getting this data model correct from the start cannot be stressed enough.

“If names are not correct, then language is not in accord with the truth of things. If language is not in accord with the truth of things, then affairs cannot be carried out successfully.” — Confucius

Equality for All?

You should start with the most basic of questions: What is a “match”? My experience working on product matching in different domains and varying use cases is that there is not a single definition of product equality that applies everywhere. For those that have never given product matching much thought beyond their intuition, this might seem like an odd statement: two products are either the same or they are not, right? By way of example, here is an illustration of why different use cases require different notions of equality.

Suppose you are shopping for some 9-volt batteries and you are interested in seeing which brands tend to last longer based on direct user experience. You do a search, you navigate through some web site and then will likely need to make a choice at some point: are you looking to buy the 2-pack, the 4-pack or the 8-pack?

Having to make a quantity choice at this point may be premature, but you usually have to make this choice to get at the review content. However, the information you are looking for, and likely the bulk of the review content, is independent of the size of the box in which it is packaged. Requiring a quantity choice to get at review content may just be a bad user experience, but regardless of that, you certainly would not want to miss out on relevant review content simply because you had chosen the wrong quantity at this point in your research.

The conclusion here is that reviews posted to the web page for the 2-pack and reviews posted to the page for the 8-pack should probably not be fragmented. Therefore, for the purposes of review content, these two products, which would have different UPC and/or EAN values, should be considered equivalent.

Now suppose you have decided which brand of battery to buy and are looking for the best price on an 8-pack. For a price comparison, you most definitely do not want 2-pack prices mixed in with the 8-pack prices. Here, for price comparisons, these two products should definitely not be considered equivalent.
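
One way to picture this is as two different equivalence tests over the same records. A rough sketch, with made-up fields rather than a real matching implementation:

// Two hypothetical battery listings that differ only in pack size.
var twoPack   = { brand: 'Acme', model: '9V-Ultra', packQuantity: 2, upc: '<upc-2pack>' };
var eightPack = { brand: 'Acme', model: '9V-Ultra', packQuantity: 8, upc: '<upc-8pack>' };

// For review content, pack size is irrelevant: match on brand and model.
function reviewEquivalent(a, b) {
  return a.brand === b.brand && a.model === b.model;
}

// For price comparison, the exact item matters: match on the unique identifier.
function priceEquivalent(a, b) {
  return a.upc === b.upc;
}

reviewEquivalent(twoPack, eightPack); // true  -- share one pool of reviews
priceEquivalent(twoPack, eightPack);  // false -- never compare their prices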

Understanding that product equivalence varies by context is not only important for designing algorithms and software systems, but has a lot of implications for creating better user experiences. For the companies looking to offer product matching as a service, the flexibility they offer in tailoring the definition of equality for their clients will be an important factor in how broadly applicable their solutions will be.

“It is more important to know where you are going than to get there quickly. Do not mistake activity for achievement.” — Isocrates

Imperfect Data Sources

If all the products you need to match have been assigned a globally unique identifier, such as a UPC, EAN or ISBN, and you have access to that data, and the data can be trusted, then product matching could be trivial. However, not all products get assigned such a number and for those that do, you do not always have access to those values. As discussed, it is also true that a “match” cannot always be defined simply by the equality of unique identifiers.

Those that crawl the web for product data tend to think that a structured data feed is the answer to getting better data. However, the companies that create product feeds vary greatly in their competency. Even when competent, they may build their feed from one system’s database, while more useful information may be stored in another system. Further, the competitive business landscape can result in companies wanting to deliberately suppress or obfuscate identifying information. You also have the ubiquitous issues of software bugs and data entry errors to contend with. All these realities add up to the fact that data feeds are not a panacea for product matching.

So while we have the web crawling folks wishing for feed data, we simultaneously have the feed processing folks wishing for crawled data to fill in their gaps. The first piece of advice for building a product matching system is to assume you will need to accept data from a variety of data sources. The ability to fill in data gaps with alternative sources will allow you to get the best of both worlds. This also means you may not only be trying to match products between different sites, but you may need to match products within the same site and merge the data from different sources to form a single view of a product at a site. I know of one very large shopping comparison site that did not design for this case and found itself unable to support particular types of new business opportunities.
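
As a simple illustration of filling gaps across sources, a feed record and a crawled record for the same item on the same site might be merged roughly like this (hypothetical records and precedence rules, not a real pipeline):

// A feed record with identifiers but little descriptive content, and a
// crawled record with page content but no identifiers, for the same item.
var fromFeed  = { site: 'kwik-e-mart', sku: 'KEM-1138', upc: '<upc>', name: 'Giant Rubber Band' };
var fromCrawl = { site: 'kwik-e-mart', name: 'Acme Giant Green Rubber Band',
                  description: 'One giant green rubber band.', imageUrl: 'http://kwik-e-mart.example/img/1138.jpg' };

// Merge into a single view of the product at that site, preferring the feed
// for identifiers and the crawl for descriptive content.
function mergeSiteProduct(feed, crawl) {
  return {
    site: feed.site,
    sku: feed.sku,
    upc: feed.upc,
    name: crawl.name || feed.name,
    description: crawl.description,
    imageUrl: crawl.imageUrl
  };
}

var singleView = mergeSiteProduct(fromFeed, fromCrawl);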

“If you think the problem is bad now, just wait until we’ve solved it.” — Arthur Kasspe

Look Before You Leap

The specific algorithms and technologies one chooses for an automated product matching system should not be the primary focus. It is very tempting for us information scientists and engineers to dive right into the algorithmic and technical solutions. After all, this is predominantly what universities have trained us to focus on and, in some sense, is the more interesting part of the problem. You can choose almost any one of a host of algorithms and get some form of product matching fairly quickly. Depending on your specific quality requirements, a simple system may be enough, but if there are higher expectations for a matching system, you will need a lot more than just a fancy algorithm.

When more than simple matching is needed, it will not be the algorithm you use, but how you use the algorithm that will matter. This means really understanding the characteristics of the problem in the context of your domain. It is also important not to define the problem too narrowly. There are a bunch of seemingly tangential issues in product matching that are very easy to put into the bucket of “we can deal with that later”, but which turn out to be very hard to deal with after the fact. It is how well you handle all of these practical details that will most influence the overall success of the project.

Choosing a simplistic data model is an example of something that may seem like a good starting approach. However, it will wind up so deeply ingrained in the software that it becomes nearly impossible to change. You end up with either serious capability limitations or a series of kludges that both complicate your software and lead to unintended side effects. I learned this from experience.

“A doctor can bury his mistakes but an architect can only advise his clients to plant vines.” — Frank Lloyd Wright

Up Next

This posting covers some of the important characteristics of the product matching problem. The sequel will offer more specific guidelines for building matching systems.

Scoutfile: A module for generating a client-side JS app loader

A couple of years ago, my former colleague Alex Sexton wrote about the techniques that we use at Bazaarvoice to deploy client-side JavaScript applications and then load those applications in a browser. Alex went into great detail, and it’s a good, if long, read. The core idea, though, is pretty simple: an application is bootstrapped by a “scout” file that lives at a URL that never changes, and that has a very short TTL. Its job is to load other static resources with long TTLs that live at versioned URLs — that is, URLs that change with each new deployment of the application. This strategy balances two concerns: the bulk of application resources become highly cacheable, while still being easy to update.

In order for a scout file to perform its duty, it needs to load JavaScript, load CSS, and host the config that says which JS and CSS to load. Depending on the application, other functionality might be useful: the ability to detect old IE; the ability to detect DOM ready; the ability to queue calls to the application’s methods, so they can be invoked for real when the core application resources arrive.
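
The call-queueing piece is the same trick many analytics snippets use: expose a stub object that records calls until the real application arrives, then replay them. A generic sketch of the pattern (not scoutfile's actual API):

// Before the application bundle loads, the scout exposes a stub that only
// records calls made against the namespace.
window.MyApp = window.MyApp || {
  _queue: [],
  render: function () { this._queue.push(['render', arguments]); },
  configure: function () { this._queue.push(['configure', arguments]); }
};

// Once the versioned application bundle has loaded, it replaces the stub
// and replays everything that was queued in the meantime.
function replayQueue(realApp, stub) {
  stub._queue.forEach(function (call) {
    realApp[call[0]].apply(realApp, call[1]);
  });
}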

At Bazaarvoice, we’ve been building a lot of new client-side applications lately — internal and external — and we’ve realized two things: one, it’s very silly for each application to reinvent this particular wheel; two, there’s nothing especially top secret about this wheel that would prevent us from sharing it with others.

To that end, I’m happy to release scoutfile as an NPM module that you can use in your projects to generate a scout file. It’s a project that Lon Ingram and I worked on, and it provides both a Grunt task and a Node interface for creating a scout file for your application. With scoutfile, your JavaScript application can specify the common functionality required in your scout file — for example, the ability to load JS, load CSS, and detect old IE. Then, you provide any code that is unique to your application that should be included in your scout file. The scoutfile module uses Webpack under the hood, which means you can use loaders like json! and css! for common tasks.

The most basic usage is to npm install scoutfile, then create a scout file in your application. In your scout file, you specify the functionality you need from scoutfile:

var App = require('scoutfile/lib/browser/application');
var loader = require('scoutfile/lib/browser/loader');

var config = require('json!./config.json');
var MyApp = App('MyApp');

MyApp.config = config;

loader.loadScript(config.appJS);
loader.loadStyleSheet(config.appCSS);

Next, you can generate your scout file using a simple Node script:

var scout = require('scoutfile');
scout.generate({
  appModules: [
    {
      name: 'MyApp',
      path: './app/scout.js'
    }
  ],

  // Specify `pretty` to get un-uglified output.
  pretty: true
}).then(function (scout) {
  console.log(scout);
});

The README contains a lot more details, including how to use flags to differentiate production vs. development builds; how to configure the Grunt task; how to configure the “namespace” that is occupied on window (a necessary evil if you want to queue calls before your main application renders); and more.

There are also several open issues to improve or add functionality. You can check out the developer README if you’re interested in contributing.

Analyzing our global shopper network (part one)

Every holiday season, the virtual doors of your favorite retailer are blown open by a torrent of shoppers who are eager to find the best deal, whether they’re looking for a Turbo Man action figure or a ludicrously discounted 4K flat screen. This series focuses on our Big Data analytics platform, which is used to learn more about how people interact with our network.

The challenge

Within the Reporting & Analytics group, we use Big Data analytics to help some of the world’s largest brands and retailers understand how to most effectively serve their customers, as well as provide those customers with the information they need to make informed buying decisions. The amount of clickstream traffic we see during the holidays – over 45,000 events per second, produced by 500 million monthly unique visitors from around the world – is tremendous.

In fact, if we reserved a seat at the Louisiana Superdome for each collected analytics event, we would fill it up in about 1.67 seconds. And, if we wanted to give each of our monthly visitors their own seat in a classic Beetle, we’d need about 4.64 times the total number produced between 1938 and 2003. That’s somewhere in the neighborhood of a hundred million cars!

Fortunately for us, we live in the era of Big Data and high scalability. Our platform, which is based on the principles outlined in Nathan Marz’s Lambda architecture design, addresses the requirements of ad-hoc, near real-time, and batch applications. Before we could analyze any data, however, we needed a way to reliably collect it. That’s where our in-house event collection service, which we named “Cookie Monster,” came into the picture.

Collecting the data

When investigating how clients would send events to us, our engineers knew that the payload had to fit within the query string of an HTTP GET request. They settled upon a lightweight serialization format called Rison, which expresses JSON data structures, but is designed to support URI encoding semantics. (Our Rison plugin for Jackson, which we leverage to handle the processing of Rison-encoded events, is available on GitHub.)
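
To give a feel for the format, here is the same small structure written as JSON and, approximately, as Rison (our own example, not an actual Cookie Monster payload):

// The same data as JSON and as Rison; in Rison, identifier-like strings are
// unquoted and arrays are written with !( ).
var asJson  = '{"client":"acme","ids":[1,2],"type":"PageView"}';
var asRison = "(client:acme,ids:!(1,2),type:PageView)";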

In addition, we decided to implement support for client-side batching logic, which would allow a web browser to send multiple events within the payload of a single request. By sending fewer requests, we reduced the amount of HTTP transaction overhead, which minimized the amount of infrastructure required to support a massive network audience. Meanwhile, as their browsers would only need to send one request, end-users also saw a performance uptick.
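
A much-simplified sketch of the client-side batching idea follows; the endpoint and field names are hypothetical, and it uses URL-encoded JSON rather than Rison to keep the example short:

// Accumulate events locally instead of sending one request per event.
var pending = [];

function trackEvent(event) {
  pending.push(event);
}

// Periodically flush everything collected so far as a single GET request,
// with the whole batch serialized into the query string.
function flushEvents() {
  if (pending.length === 0) { return; }
  var batch = encodeURIComponent(JSON.stringify(pending));
  new Image().src = 'http://events.example.com/collect?batch=' + batch;
  pending = [];
}

setInterval(flushEvents, 5000);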

Because the service itself needed a strong foundation, we chose the ubiquitous Dropwizard framework, which accelerated development by providing the basic ingredients needed to create a maintainable, scalable, and performant web service. Dropwizard glues together Jetty (a high-performance web server), Jersey (a framework for REST-ful web services), and Jackson (a JSON processor).

[Diagram: Cookie Monster event collection within the Big Data analytics platform]

Perhaps most importantly, we used the Disruptor library’s ring buffer implementation to facilitate very fast inter-thread messaging. When a new event arrives, it is submitted to the EventQueue by the EventCollector. Two event handler classes, which listen for ring events, ensure that the event is delivered properly. The first event handler acts as a producer for Kafka, publishing the event to the appropriate topic. (Part two of this series will discuss Kafka in further detail.)

The second is a “fan out” logging sink, which uses specific event metadata to route each event to the appropriate logger. At the top of every hour, the previous hour’s batch logs are delivered to S3, and then consumed by downstream processes.

In the real world

When building Cookie Monster, we knew that our service would need to maintain as little state as possible, and accommodate the volatility of cloud infrastructure.

Because EC2 is built on low-cost, commodity hardware, we knew that we couldn’t “cheat” with sophisticated hardware RAID – everything would run on machines that were naturally prone to failure. In our case, we deemed those trade-offs acceptable, as our design goals for a distributed system aligned perfectly with the intent of EC2 auto-scaling groups.

Even though the service was designed for EC2, there were a few hiccups along the way, and we’ve learned many valuable lessons. For example, the Elastic Load Balancer, which distributes HTTP requests to instances within the Auto Scaling group, must be “pre-warmed” before accepting a large volume of traffic. Although that’s by design, it means that good communication with AWS prior to deployment must be a crucial part of our process.

Also, Cookie Monster was designed prior to the availability of EBS-optimized instances and provisioned IOPS, which allow for more consistent performance of an I/O-bound process when using EBS volumes. Even in today’s world, where both of those features could be enabled, ephemeral (i.e. host-local) volumes remain a fiscally compelling – if brittle – alternative for transient storage. (AWS generally discourages the use of ephemeral storage where data loss is a concern, as such volumes are prone to failure.)

Ultimately, our choice to deploy into EC2 paid off, and it allowed us to scale the service to stratospheric heights without a dedicated operations team. Today, Cookie Monster remains an integral service within our Big Data analytics platform, successfully collecting and delivering many billions of events from all around the world.

Open sourcing cloudformation-ruby-dsl

Cloudformation is a powerful tool for building large, coordinated clusters of AWS resources. It has a sophisticated API, capable of supporting many different enterprise use-cases and scaling to thousands of stacks and resources. However, there is a downside: the JSON interface for specifying a stack can be cumbersome to manipulate, especially as your organization grows and code reuse becomes more necessary.

To address this and other concerns, Bazaarvoice engineers have built cloudformation-ruby-dsl, which turns your static Cloudformation JSON into dynamic, refactorable Ruby code.

https://github.com/bazaarvoice/cloudformation-ruby-dsl

The DSL closely mimics the structure of the underlying API, but with enough syntactic sugar to make building Cloudformation stacks less painful.

We use cloudformation-ruby-dsl in many projects across Bazaarvoice. Now that it has proven its value and gained some degree of maturity, we are releasing it to the larger world as open source, under the Apache 2.0 license. It is still an early-stage project, and may undergo some further refactoring prior to its v1.0 release, but we don’t anticipate major API changes. Please download it, try it out, and let us know what you think (in comments below, or as issues or pull requests on GitHub).

A big thanks to Shawn Smith, Dave Barcelo, Morgan Fletcher, Csongor Gyuricza, Igor Polishchuk, Nathaniel Eliot, Jona Fenocchi, and Tony Cui, for all their contributions to the code base.

Output from bv.io

Looks like everyone had a blast at bv.io this year! Thank yous go out to the conference speakers and hackathon participants for making this year outstanding. Here are some tweets and images from the conference:


https://twitter.com/bentonporter/status/451362916181090304


HTTP/RESTful API troubleshooting tools

As a developer I’ve used a variety of APIs, and as a Developer Advocate at Bazaarvoice I help developers use our APIs. As a result I am keenly aware of the importance of good tools and of using the right tool for the right job. The right tool can save you time and frustration. With the recent release of the Conversations API Inspector, an in-house web app built to help developers use our Conversations API, it seemed like the perfect time to survey tools that make using APIs easier.

The tools

This post is a survey covering several tools for interacting with HTTP based APIs. In it I introduce the tools and briefly explain how to use them. Each one has its advantages and all do some combination of the following:

  • Construct and execute HTTP requests
  • Make requests other than GET, like POST, PUT, and DELETE
  • Define HTTP headers, cookies and body data in the request
  • See the response, possibly formatted for easier reading

Firefox and Chrome

Yes, a web browser can be a tool for experimenting with APIs, so long as the API request only requires basic GET operations with query string parameters. At our developer portal we embed sample URLs in our documentation where possible to make seeing examples super easy for developers.

Basic GET

http://api.example.com/resource/1?passkey=12345&apiversion=2

Some browsers don’t necessarily present the response in a format easily readable by humans. Firefox users already get nicely formatted XML. To see similarly formatted JSON there is an extension called JSONView, and to see the response headers Live HTTP Headers will do the trick. Chrome also has a version of JSONView, and for XML there’s XML Tree. Both browsers offer built-in consoles that provide network information like headers and cookies.

CURL

The venerable cURL is possibly the most flexible of these tools, while at the same time being the least usable. As a command line tool, some developers will balk at using it, but cURL’s simplicity and portability (*nix, PC, Mac) make it an appealing choice. cURL can make just about any request, assuming you can figure out how. These tutorials provide some easy to follow examples and the man page has all the gory details.

I’ll cover a few common usages here.

Basic GET

Note the use of quotes.

$ curl "http://api.example.com/resource/1?passkey=12345&apiversion=2"

Basic POST

Much more useful is making POST requests. The following submits data the same as if a web form were used (the default Content-Type: application/x-www-form-urlencoded). Note that the argument to -d is the data sent in the request body.

$ curl -d "key1=some value&key2=some other value" http://api.example.com/resource/1

POST with JSON body

Many APIs expect data formatted in JSON or XML instead of encoded key=value pairs. This cURL command sends JSON in the body by using -H 'Content-Type: application/json' to set the appropriate HTTP header.

$ curl -H 'Content-Type: application/json' -d '{"key": "some value"}' http://api.example.com/resource/1

POST with a file as the body

The previous example can get unwieldy quickly as the size of your request body grows. Instead of adding the data directly to the command line you can instruct cURL to upload a file as the body. This is not the same as a “file upload.” It just tells cURL to use the contents of a file as the request body.

$ curl -H 'Content-Type: application/json' -d @myfile.json http://api.example.com/resource/1

One major drawback of cURL is that the response is displayed unformatted. The next command line tool solves that problem.

HTTPie

HTTPie is a Python-based command line tool similar to cURL in usage. According to its GitHub page, “Its goal is to make CLI interaction with web services as human-friendly as possible.” This is accomplished with “simple and natural syntax” and “colorized responses.” It supports Linux, Mac OS X and Windows, JSON, uploads and custom headers, among other things.

The documentation seems pretty thorough so I’ll just cover the same examples as with cURL above.

Basic GET

$ http "http://api.example.com/resource/1?passkey=12345&apiversion=2"

Basic POST

HTTPie assumes JSON as the default content type. Use --form to indicate Content-Type: application/x-www-form-urlencoded

$ http --form POST api.example.com/resource/1 key1='some value' key2='some other value'

POST with JSON body

The = is for strings and := indicates raw JSON.

$ http POST api.example.com/resource/1 key='some value' parameter2:=2 parameter3:=false parameter4:='["http", "pies"]'

POST with a file as the body

HTTPie looks for a local file to include in the body after the < symbol.

$ http POST api.example.com/resource/1 < resource.json

Postman Chrome extension

My personal favorite is the Postman extension for Chrome. In my opinion it hits the sweet spot between functionality and usability by providing most of the HTTP functionality needed for testing APIs via an intuitive GUI. It also offers built-in support for several authentication protocols, including OAuth 1.0. There are a few things it can’t do because of restrictions imposed by Chrome, although there is a Python-based proxy to get around that if necessary.

Basic GET

The column on the left stores recent requests so you can redo them with ease. The results of any request will be displayed in the bottom half of the right column.

[Screenshot: Postman basic GET request]

Basic POST

It’s possible to POST files, application/x-www-form-urlencoded data, and your own raw data.

[Screenshot: Postman POST request]

POST with JSON body

Postman doesn’t support loading a body from a local file, but doing so isn’t necessary thanks to its easy-to-use interface.

[Screenshot: Postman POST with JSON body]

RunScope.com

Runscope is a little different from the others, but no less useful. It’s a web service rather than a standalone tool, and it’s not open source, although they do offer a free option. It can be used much like the other tools to manually create and execute various HTTP requests, but that is not what makes it so useful.

Runscope acts as a proxy for API requests. Requests are made to Runscope, which passes them on to the API provider and then passes the responses back. In the process Runscope logs the requests and responses. At that point, to use their words, “you can view the request/response details, share requests with others, edit and retry requests from the web.”

Below is a quick example of what a Runscopeified request looks like. Read their official documentation to learn more.

before: $ curl "http://api.example.com/resource/1?passkey=12345&apiversion=2"
after: $ curl "http://api-example-com-bucket_key.runscope.net/resource/1?passkey=12345&apiversion=2"

Conclusion

If you’re an API consumer you should use some or all of these tools. When I’m helping developers troubleshoot their Bazaarvoice API requests, I use the browser when I can get away with it and switch to Postman when things start to get hairy. There are other tools out there; I know because I’ve omitted some of them. Feel free to mention your favorite in the comments.

(A version of this post was previously published at the author’s personal blog)