Category Archives: Software Architecture

Kedro 6 Months In

We build AI software in two modes: experimentation and productization. During experimentation, we are trying to see if modern technology will solve our problem. If it does, we move on to productization and build reliable data pipelines at scale.

This presents a circular dependency when it comes to data engineering. We need reliable and maintainable data engineering pipelines during experimentation, but we don’t know what that pipeline should do until after we’ve completed the experiments. In the past, I and many data scientists I know have used an ad-hoc combination of bash scripts and Jupyter Notebooks to wrangle experimental data. While this may have been the fastest way to get experimental results and build models, it really is technical debt that has to be paid down the road.

The Problem

Specifically, the ad-hoc approach to experimental data pipelines causes pain points around:

  • Reproducibility: An ad-hoc experimentation structure puts you at risk of producing results that others can’t reproduce, which can lead to product downtime if or when you need to update your approach. Simple mistakes like executing a notebook cell twice or forgetting to seed a random number generator can usually be caught. But other, more insidious problems can occur, such as behavior changes between dependency versions.
  • Readability: If you’ve ever come across another person’s experimental code, you know it’s hard to find where to start. Even documented projects might just say “run x script, y notebook, etc.”, and it’s often unclear where the data comes from and whether you’re on the right track. Similarly, code reviews for data science projects are often hard: it’s asking a lot of a reviewer to differentiate between notebook code that manipulates data and notebook code that only visualizes it.
  • Maintainability: It’s common during data science projects to do some exploratory analysis or generate early results, and then revise how your data is processed or gathered. This becomes difficult and tedious when all of these steps are an unstructured collection of notebooks or scripts. In other words, the pipeline is hard to maintain: updating or changing it requires you to keep track of the whole thing.
  • Shareability: Ad-hoc collections of notebooks and bash scripts are also difficult for a team to work on concurrently. Each member has to ensure their notebooks are up to date (version control on notebooks is less than ideal), and that they have the correct copy of any intermediate data.

Enter Kedro

A lot of the issues above aren’t new to the software engineering discipline and have been largely solved in that space. This is where Kedro comes in. Kedro is a framework for building data engineering pipelines whose structure forces you to follow good software engineering practices. By using Kedro in the experimentation phase of projects, we build maintainable and reproducible data pipelines that produce consistent experimental results.

Specifically, Kedro has you organize your data engineering code into one or more pipelines. Each pipeline consists of a number of nodes: functional units that take data sets and parameters as inputs and produce new data sets, models, or artifacts as outputs.

This simple but strict project structure is augmented by their data catalog: a YAML file that specifies how and where the input and output data sets are to be persisted. The data sets can be stored either locally or in a cloud data storage service such as S3.
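As a rough illustration (these node and data set names are made up, not taken from our actual project), a minimal pipeline looks something like this: plain Python functions wired together into nodes, with the input and output names referring to entries in the data catalog.

# A minimal sketch of a Kedro pipeline (hypothetical names).
# Each node is a plain Python function; its inputs and outputs are data set
# names that Kedro resolves through the data catalog (catalog.yml) at run time.
import pandas as pd
from kedro.pipeline import Pipeline, node


def clean_reviews(raw_reviews: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing text and normalize the rating column."""
    cleaned = raw_reviews.dropna(subset=["text"])
    return cleaned.assign(rating=cleaned["rating"].astype(float))


def build_features(cleaned_reviews: pd.DataFrame, parameters: dict) -> pd.DataFrame:
    """Keep only reviews longer than a configurable minimum length."""
    min_len = parameters["min_review_length"]
    return cleaned_reviews[cleaned_reviews["text"].str.len() >= min_len]


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            node(clean_reviews, inputs="raw_reviews", outputs="cleaned_reviews"),
            node(
                build_features,
                inputs=["cleaned_reviews", "parameters"],
                outputs="review_features",
            ),
        ]
    )

If raw_reviews and review_features are declared in catalog.yml (as, say, CSV files on disk or in S3), Kedro loads and persists them automatically; anything not declared is simply passed along in memory.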

I started using Kedro about six months ago, and since then have leveraged it for different experimental data pipelines. Some of these pipelines were for building models that eventually were deployed to production, and some were collaborations with team members. Below, I’ll discuss the good and bad things I’ve found with Kedro and how it helped us create reproducible, maintainable data pipelines.

The Good

  • Reproducibility: I can’t say enough good things here: they nailed it. Their dependency management took a bit of getting used to, but it forces a specific version on all dependencies, which is awesome. Also, the ability to just type kedro install and kedro run to execute the whole pipeline is fantastic. You still have to remember to seed random number generators, but even that is easy to remember if you put the seed in their params.yml file (see the sketch just after this list).
  • Function Isolation: Kedro’s fixed project structure encourages you to think about what logical steps are necessary for your pipeline, and write a single node for each step. As a result, each node tends to be short (in terms of lines of code) and specific (in terms of logic). This makes each node easy to write, test, and read later on.
  • Developer Parallelization: The small nodes also make it easier for developers to work together concurrently. It’s easy to spot nodes that won’t depend on each other, and they can be coded concurrently by different people.
  • Intermediate Data: Perhaps my favorite thing about Kedro is the data catalog. Just add the name of an output data set to catalog.yml and BOOM, it’ll be serialized to disk or your cloud data store. This makes it super easy to build up the pipeline: you work on one node, commit it, execute it, and save the results. It also comes in handy when working on a team. I can run an expensive node on a big GPU machine and save the results to S3, and another team member can simply start from there. It’s all baked in.
  • Code Re-usability: I’ll admit I have never re-used a notebook. At best I pulled up an old one to remind myself how I achieved some complex analysis, but even then I had to remember the intricacies of the data. The isolation of nodes, however, makes it easy to re-use them. Also, Kedro’s support for modular pipelines (i.e., packaging a pipeline into a pip package) makes it simple to share common code. We’ve created modular pipelines for common tasks such as image processing.
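As promised above, here is a sketch of how a seed can live in the parameters file and flow into a node. The parameter and data set names are hypothetical, and this assumes a Kedro version that supports the params: prefix for node inputs.

# Sketch: seeding a random number generator from the parameters file.
# The parameters file (what the post calls params.yml) would contain a line like:
#   random_seed: 42
import numpy as np
import pandas as pd
from kedro.pipeline import Pipeline, node


def split_data(reviews: pd.DataFrame, seed: int) -> pd.DataFrame:
    """Sample a reproducible training set using the configured seed."""
    rng = np.random.default_rng(seed)
    mask = rng.random(len(reviews)) < 0.8
    return reviews[mask]


def create_pipeline() -> Pipeline:
    return Pipeline(
        [
            # "params:random_seed" injects the single value from the parameters file
            node(
                split_data,
                inputs=["review_features", "params:random_seed"],
                outputs="train_reviews",
            ),
        ]
    )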

The Bad

While Kedro has solved many of the quality challenges in experimental data pipelines, we have noticed a few gotchas that required less-than-elegant workarounds:

  • Incremental Dataset: Support for this exists when reading data, but it’s lacking when writing data sets. This affected us a few times when we had a node that would take 8-10 hours to run: we lost work if the node failed partway through. Similarly, if the resulting data set didn’t fit in memory, there wasn’t a good way to save incremental results, since the writer in Kedro assumes all partitions are in memory. This GitHub issue may fix it if the developers address it, but for now you have to manage partial results on your own.
  • Pipeline Growth: Pipelines can quickly get hard to follow, since the inputs and outputs are just named variables that may or may not exist in the data catalog. Kedro Viz helps with this, but it’s a bit annoying to switch between the visualization and the code. We’ve also started enforcing name consistency between node names and their functions, as well as between the data set names in the pipeline and the argument names in the node functions. Finally, making more, smaller pipelines is also a good way to keep your sanity. All of these techniques help you keep track mentally, but it’s still the trade-off you make for wiring pipelines together by naming their inputs and outputs.
  • Visualization: This isn’t given much consideration in Kedro, and it’s the one thing I’d say notebooks still have a leg up on. Kedro does make it easy to load the Kedro context in a notebook, however, so you can still fire one up to do some visualization. Ultimately, though, I’d love to see better support within Kedro for producing a graphical report that gets persisted to the 08_reporting layer. Right now we work around this with a node that renders a notebook to disk, but it’s a hack at best. I’d love better support for generating final, highly visual reports that can be versioned in the data catalog much like the intermediate data.

Conclusion

So am I a Kedro convert? Yah, you betcha. It replaces the spider-web of bash scripts and Python notebooks I used to use for my experimental data pipelines and model training, and enables better collaboration among our teams. It won’t replace a fully productionalized stream-based data pipeline for me, but it absolutely makes sure my experimental pipelines are maintainable, reproducible, and shareable.

Creating a Realtime Reactive App for a collaborative domain

Sampling is a Bazaarvoice product that allows consumers to join communities and claim a limited number of free products. In return, consumers provide honest & authentic reviews of the products they sample. Products are released to consumers for review at the same time, which causes a rush to claim them. This is an example of a collaborative domain problem, where many users are trying to act on the same data (as discussed in Eric Evans’ book Domain-Driven Design).

 

Bazaarvoice runs a two-day hackathon twice a year. Employees are free to use this time to explore any technologies or ideas they are interested in. These hackathon events have produced significant new features and products, like our advertising platform and our personalization capabilities. For the Bazaarvoice 2017.2 hackathon, the Belfast team demonstrated a solution to this collaborative domain problem using near real-time state synchronisation.

Bazaarvoice uses React + Redux for our front end web development. These libraries are built around unidirectional data flows and immutable state management, which means there is always one source of truth, the store, and there is no confusion about how to mutate the application state. Typically, we use the side effect library redux-thunk to synchronise state between server and client via HTTP API calls. The problem is that this synchronisation is one way; it is not reactive. The client can tell the server to mutate state, but not vice versa. In a collaborative domain where the data is changing all the time, near real-time synchronisation is critical to ensure a good UX.

To solve this we decided to use Google’s Firebase platform. It provides many features that work seamlessly together, such as OAuth authentication, CDN hosting and the Realtime Database. One important thing to note about Firebase: it’s a backend as a service, so there was no backend code in this project.

The Realtime Database provides a pub/sub model on nodes of the database, which allows clients to always be up to date with the latest state. One important Firebase Realtime DB concept not to be overlooked is that data can only be accessed by its key (a point query).

You can think of the database as a cloud-hosted JSON tree. Unlike a SQL database, there are no tables or records. When you add data to the JSON tree, it becomes a node in the existing JSON structure with an associated key. (Reference)

Hackathon Goals

  1. Realtime configurable UI
  2. Realtime campaign administration and participation
  3. Live Demo to the whole company for both of the above

Realtime configurable UI

During the hackathon demo we updated the app’s style and content via the administration portal; this would allow clients to style the app to suit their branding. These updates were pushed in real time to 50+ client devices from Belfast to Austin (4,608 miles away).

The code to achieve this state synchronisation across clients was deceptively easy!

Given the nature of react, once a style config update was received, every device just ‘reacted’.
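The original post showed this code as a screenshot, so what follows is a rough, hypothetical reconstruction of the pattern rather than the hackathon code itself (the database path, action type and store wiring are invented): subscribe to a node in the Realtime Database and dispatch a Redux action whenever it changes.

// Hypothetical sketch of the pattern (not the original hackathon code):
// subscribe to a style-config node in the Realtime Database and push
// every change straight into the Redux store.
import firebase from 'firebase';
import { store } from './store';

firebase.initializeApp({
  apiKey: '<api-key>',
  databaseURL: 'https://<project>.firebaseio.com',
});

// Any write to /config/style (e.g. from the admin portal) fires this
// listener on every connected client in near real time.
firebase
  .database()
  .ref('config/style')
  .on('value', (snapshot) => {
    store.dispatch({ type: 'STYLE_CONFIG_UPDATED', payload: snapshot.val() });
  });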

 

Realtime campaign administration and participation

In the demo, we added 40 products to the live campaign. This pushed 40 products to the admin screen and to the 50+ mobile apps. Participants were then instructed to claim items.
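The post doesn’t show how claiming was implemented, but the Realtime Database’s transaction API is a natural fit for this kind of contention. Here is a purely hypothetical sketch (the data layout and field names are invented):

// Hypothetical sketch: handling the claim rush with a Realtime Database
// transaction. The update function re-runs if another client commits first,
// so two members racing for the last unit cannot both succeed.
function claimProduct(productId, memberId) {
  return firebase
    .database()
    .ref(`campaign/products/${productId}`)
    .transaction((product) => {
      if (product && product.remaining > 0) {
        product.remaining -= 1;
        product.claims = product.claims || {};
        product.claims[memberId] = true;
      }
      return product;
    });
}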

Admin view

Member view

All members were authenticated via OAuth providers (Facebook, Github or Gmail).

To my surprise, the live demo went off without a hitch. I’m pleased to add that my team won the hackathon in the ‘Technical’ category.

Conclusion

Firebase was a pleasure to work with: everything worked as expected, and it performed brilliantly in the live demo, even on their free tier. The patterns used in Firebase are a little unconventional for the more orthodox engineers, but if your objective is rapid development, Firebase is unparalleled by any other platform. The Firebase Realtime Database produced a great UX for running Sampling campaigns. While Firebase will not be used in the production product, it provided great context for discussions on the benefits and possibilities of realtime data synchronisation.

Some years ago, I would have maintained that web development was the wild west of software engineering; solutions were being developed without discipline and were lacking a solid framework to build on. It just didn’t seem to have any of the characteristics I would associate with good engineering practices. Fast forward to today, we now have a wealth of tools, libraries and techniques that make web development feel sane.

In recent years front-end developers have embraced concepts like unidirectional dataflows, immutability, pure functions (no hidden side effects), asynchronous coding and concurrency over threading. I’m curious to see whether these same concepts gain popularity in backend development as Node.js continues to grow as a backend language.

Telling Your Data To “Back Off!” (or How To Effectively Use Streams)

Foreword

Our Curations engineering team makes heavy use of serverless architecture. While this typically gives us the benefit of reduced costs, flexibility, and rapid development, it also requires us to ensure that our processes will run within the tight memory and lifecycle constraints of serverless instances.

In this article, I will describe an actual case where a scheduled job had started to fail, the discovery of the root cause and the refactor that resolved the issue. I will assume you have at least a rudimentary knowledge of Node JS and Amazon Web Services.

Content Export

Months previously, we built a scheduled job using AWS Lambdas that would kick off an export of our social content every 24 hours. In a nutshell, its purpose was to:

  1. Query for social content documents stored in a MongoDB server.
  2. Transform each document to a row in CSV format.
  3. Write out the CSV rows to a file in S3.

The resulting CSV files were meant for ingestion into an analytics tool to measure web site conversion impact.

It all worked fine for months. Then recently, this job began to fail.

Drowning in Content

For your viewing pleasure, here is an excerpted, condensed section of the main code:

function runExport(query) {
  MongoClient.connect(MONGODB_URL)
    .then((db) => {
      let exportCount = 0;
      db
        .collection('content')
        .find(query)
        .forEach(
          // iterator function to perform on each document
          (content) => {
            // transform content doc to a csv row, then push to S3 writer
            s3.write(csvTransform(content));
            exportCount++;
          },
          // done function, no more documents
          () => {
            db.close();
            s3.close();
            logger.info(`documents exported: ${exportCount}`);
          }
        );
    });
}

Each document from the database query is transformed into a CSV row, then pushed to our S3 writer, which buffers all content until s3.close() is called. At that point, the CSV data is flushed out to the S3 destination bucket.

It was obvious from this implementation that heap usage would grow unbounded, as documents from the database would be loaded into resident memory as CSV data. As we aggregated content over the previous months, the data pushed into our export process began to generate frequent “out of heap memory” errors in our logs.

Memory Profile

To gain visibility into the memory usage, I wrote a quick and dirty MemProfiler class whose purpose was to sample memory usage using process.memoryUsage(), collecting the maximum heap used over the process lifetime.

class MemProfiler {
  constructor(collectInterval) {
    this.heapUsed = {
      min : null,
      max : null,
    };
    if (collectInterval) {
      this.start(collectInterval);
    }
  }

  start(collectInterval = 500) {
    this.interval = setInterval(() => this.sample(), collectInterval);
  }

  stop() {
    if (this.interval) {
      clearInterval(this.interval);
    }
  }

  reset() {
    this.stop();
    this.heapUsed = {
      min : null,
      max : null,
    };
  }

  sample() {
    const mem = process.memoryUsage();
    if (this.heapUsed.min === null || mem.heapUsed < this.heapUsed.min) {
      this.heapUsed.min = mem.heapUsed;
    }
    if (this.heapUsed.max === null || mem.heapUsed > this.heapUsed.max) {
      this.heapUsed.max = mem.heapUsed;
    }
  }
}

I then added the MemProfiler to the runExport code:

function runExport(query) {
  memProf.start();
  MongoClient.connect(MONGODB_URL)
    .then((db) => {
      let exportCount = 0;
      db
        .collection('content')
        .find(query)
        .forEach(
          (content) => {
            // transform content doc to a csv row, then push to S3 writer
            s3.write(csvTransform(content));
            exportCount++;
            memProf.sample();
          },
          () => {
            db.close();
            s3.close();
            logger.info(`documents exported: ${exportCount}`);
            console.log('heapUsed:');
            console.log(` max: ${memProf.heapUsed.max/1024/1024} MB`);
            console.log(` min: ${memProf.heapUsed.min/1024/1024} MB`);
          }
        );
    });
}

Running the export against a small set of 16,900 documents yielded this result:

documents exported: 16900
heapUsed:
max: 30.32764434814453 MB
min: 8.863059997558594 MB

Then running against a larger set of 34,000 documents:

documents exported: 34000
heapUsed:
max: 36.204505920410156 MB
min: 8.962921142578125 MB

Testing with additional documents increased the heap memory usage linearly as we expected.

Enter the Stream

The necessary code rewrite had one goal in mind: ensure the memory profile was constant, whether a total of one document was processed or a million. Since this was a batch process, we could trade off memory usage for processing time.

Let’s reiterate the primary operations:

  1. Fetch a document
  2. Transform the document to a CSV row
  3. Write the CSV row to S3

If I may present an analogy, we have widgets moving down the assembly line from one station to the next. The conveyor belt on which the widgets are transported is moving at a constant rate, which means the throughput at each station is constant. We could increase the speed of that conveyor belt, but only as fast as can be handled by the slowest station.

In Node JS, streams are the assembly line stations, and pipes are the conveyor belt. Let’s examine the three primary types of streams:

  1. Readable streams: This is typically an upstream data source that feeds data to a writable or transform stream. In our case, this is the MongoDB find query.
  2. Transform streams: This typically takes upstream data, performs a calculation or transformation operation on it and feeds it out downstream. In our case, this would take a MongoDB document and transform it to a CSV row.
  3. Writable streams: This is typically the terminating point of a data flow. In our case this is the writing of a CSV row to S3.

Readable Stream

The MongoDB Node JS driver’s find() function returns a Cursor object that extends the Node JS Readable stream. How convenient!

const streamContentReader = db.collection('content').find(query);

Transform Stream

All we need to do is extend the Transform stream class, and override the _transform() method to transform the content to a CSV row, push it out the other end, and signal upstream that it is ready for the next content.

class CsvTransformer extends Transform {
  constructor(options) {
    super(options);
  }

  _transform(content, encoding, callback) {
    // write row
    const { _id, authoredAt, client, channel, mediaType, permalink, text, textLanguage } = content;
    const csvRow = `${_id},${authoredAt},${client},${channel},${mediaType},${permalink},${text},${textLanguage}\n`;
    this.push(csvRow);
    // signal upstream that we are ready to take more data
    callback();
  }
}

Writable Stream

Let’s purposely mock the S3 Writer for now. It does nothing here, but realistically, we would buffer up incoming data, then flush out the buffer in chunks over the network for best throughput.

class S3Writer extends Writable {
  constructor(options) {
    super(options);
  }

  _write(data, encoding, callback) {
    // TODO: probably use a fixed size buffer
    // that we write to and then flush out to S3 using
    // multipart data upload
    callback();
  }
}

Pipe ‘Em Up

We now create the three streams:

const streamContentReader = db.collection('content').find(query);
const streamCsvTransformer = new CsvTransformer({ objectMode: true });
const streamS3Writer = new S3Writer();

Streams alone do nothing. What makes them useful is connecting the output from one stream to the input of another. In stream terminology, this is called piping, and is accomplished using the stream’s pipe method:

streamContentReader.pipe(streamCsvTransformer).pipe(streamS3Writer)

Although the native pipe method is perfectly functional, I highly recommend using the pump library. In addition to being more readable by passing in an array of streams to pipe together, we can also invoke then() when the pipe has finished, as well as handle close or error events emitted by the streams:

pump([ streamContentReader, streamCsvTransformer, streamS3Writer ])
  .then(() => {
    console.log(`documents exported: ${exportCount}`);
  })
  .catch((err) => console.error(err));

CSV and S3 Write Streams

I purposely did not implement the S3Writer above, because npm is a treasure trove of solutions! The s3-write-stream library will take care of buffering, using multipart upload and handling retries, so we don’t have to get our hands too dirty. And wouldn’t you know it, there is also the csv-write-stream library that will generate properly escaped CSV rows.

As an exercise, you the reader may want to try using the s3-write-stream and csv-write-stream. Trust me, once you get the hang of streaming, you will enjoy it!

Back Off!

Let’s touch back on the issue of memory usage.

What if a write stream becomes blocked for a period of time? Maybe we are writing data to a third-party service over HTTP and network congestion or service throttling is slowing the write rate. The readable stream will happily pump data in as fast as it can. Won’t that cause increased heap usage as all that data gets backed up and buffered?

The quick answer is “no”. In the implementation of the _write() method of your writable stream, when done with the data chunk (after writing to a file for example), call callback(). This signals to the incoming stream that it is ready to receive the next chunk. You can also provide a highWaterMark to the writable stream constructor to give a specific number of objects to buffer before pausing the incoming stream. This is all handled internally by the Node JS streams implementation. No more unbounded buffering!
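As a small, hypothetical illustration (this is not our actual exporter), here is a deliberately slow object-mode writable. With a highWaterMark of 100, the upstream readable is paused once roughly 100 objects are queued and resumes as the writer drains them:

const { Writable } = require('stream');

// Hypothetical slow writer: backpressure is handled for us because we only
// call callback() once each chunk is truly done, and highWaterMark caps how
// many objects can pile up in the internal buffer before upstream is paused.
class SlowWriter extends Writable {
  constructor() {
    super({ objectMode: true, highWaterMark: 100 });
  }

  _write(chunk, encoding, callback) {
    // Simulate a slow downstream service (e.g. a throttled HTTP endpoint).
    setTimeout(() => callback(), 50);
  }
}

// e.g. streamContentReader.pipe(streamCsvTransformer).pipe(new SlowWriter());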

For a deep dive into the concept of backpressure control, read this great article on Backpressuring in Streams.

Stream ‘Em If You Got ’em!

Our team is now in the process of utilizing Node JS streams whenever we read in volumes of data, process that data, then write out the results. We are now seeing great gains in both stability and maintainability!

References

Node JS Streams API: https://nodejs.org/api/stream.html

Article: Backpressuring in Streams: https://nodejs.org/en/docs/guides/backpressuring-in-streams/

MongoDB Cursor Stream: http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html

Pump Module: https://www.npmjs.com/package/pump

CSV Write Stream Module: https://www.npmjs.com/package/csv-write-stream

S3 Write Stream Module: https://www.npmjs.com/package/s3-write-stream

 

Maintaining Test Data with the “someObject” Test Structure

Language: Scala
Test tool: ScalaTest

How did we get here?

When systems become reasonably complex, tests must manage cumbersome amounts of data. A test case that may test a small bit of functionality may start to require large amounts of domain knowledge about the system being tested. This is often done through the mock data used to set up the test. Maintenance of this data becomes cumbersome, monotonous and can feel Sisyphean. To solve these problems we created “someObject”, a modular system that allows us to maintain data in only one location while providing the flexibility to create specific data for our tests.

Let’s Do A Code Time!

To start this post, we’re going to build a system without the “someObject” test structure to provide context for its use. (To skip to the “someObject” structure, jump to here!). Suppose we are building a service that reports on advertising campaigns. We may create a class that describes an advertising campaign and call it `Campaign`.

case class Campaign(id: String,
    name: String,
    client: String,
    startDate: UTCDate,
    endDate: UTCDate,
    deleted: Boolean)

Now we are going to store this campaign in a database, and we need to write some integration tests to make sure this operation is performed properly. A test that confirms a campaign is stored properly might look like this:

it should "properly store a single campaign" in {
    Given("we have a proper campaign")
    val campaignToStore = Campaign(
        id = "someId",
        name = "someName",
        client = "someClient",
        startDate = UTCDate(1955, 10, 6),
        endDate = UTCDate(1956, 10, 6),
        deleted = false)

    And("we store this campaign in the database")
    database.storeCampaign(campaignToStore)

    When("we fetch the given campaign")
    val fetchedCampaign:Campaign = database.fetchCampaign("someId")

    Then("All the campaign values were stored properly")
    fetchedCampaign.id shouldBe campaignToStore.id
    fetchedCampaign.name shouldBe campaignToStore.name
    fetchedCampaign.client shouldBe campaignToStore.client
    fetchedCampaign.startDate shouldBe campaignToStore.startDate
    fetchedCampaign.endDate shouldBe campaignToStore.endDate
    fetchedCampaign.deleted shouldBe campaignToStore.deleted
}

BlogIntro.scala

Add some functionality:

Now we add the ability to update some values for this campaign in the database, and we need to test this new piece of functionality. That test might look something like this:

it should "properly update a single campaign" in {
    Given("we have a proper campaign")
    val campaignToStore = Campaign(
        id = "someId",
        name = "someName",
        client = "someClient",
        startDate = UTCDate(1955, 10, 6),
        endDate = UTCDate(1956, 10, 6),
        deleted = false)

    And("some update parameters for our campaign")
    val updateParameters = UpdateParameters(name = Some("someNewName"))

    And("we store this campaign in the database")
    database.storeCampaign(campaignToStore)

    When("we update this campaign")
    database.updateCampaign("someId", updateParameters)

    val fetchedCampaign:Campaign = database.fetchCampaign("someId")

    Then("All the campaign values were stored properly")
    //Unchanged values
    fetchedCampaign.id shouldBe campaignToStore.id
    fetchedCampaign.client shouldBe campaignToStore.client
    fetchedCampaign.startDate shouldBe campaignToStore.startDate
    fetchedCampaign.endDate shouldBe campaignToStore.endDate
    fetchedCampaign.deleted shouldBe campaignToStore.deleted
    //Changed values
    fetchedCampaign.name shouldBe updateParameters.name.get
}

BlogUpdateFunction.scala

But here we see the duplication of test boilerplate code in `campaignToStore`. We don’t want to have to copy over `campaignToStore` into every test, so we might want to abstract that out to be used all over the suite.

object MyCampaignTestObjects {
    val campaignToStore = Campaign(
        id = "someId",
        name = "someName",
        client = "someClient",
        startDate = UTCDate(1955, 10, 6),
        endDate = UTCDate(1956, 10, 6),
        deleted = false)
}

BlogAbstractToSuite.scala

Now we can use the same data in every test!

Add a Test that requires unique data:

Suppose we now write a function to fetch all the campaigns that are stored in the database. We might need a test that involves uniqueness in the data we store, such as the following example:

it should "properly fetch all stored campaigns" in {
    Given("we store several unique campaigns")
    val anotherCampaignToStore = Campaign(
        id = "someSecondId",
        name = "someSecondName",
        client = "someSecondClient",
        startDate = UTCDate(1955, 10, 6),
        endDate = UTCDate(1956, 10, 6),
        deleted = false)
    database.storeCampaign(campaignToStore)
    database.storeCampaign(anotherCampaignToStore)

    When("we fetch all campaigns")
    val allCampaigns = database.fetchAllCampaigns()

    Then("all campaigns are returned")
    allCampaigns shouldBe List(campaignToStore, anotherCampaignToStore)
}

BlogTestWithUniqueTestData.scala

In the example, we re-used the campaign we abstracted out earlier for conciseness, but this obscures the fact that `anotherCampaignToStore` needs to be unique. What if someone else comes along and changes `campaignToStore` so that it happens to match the data in `anotherCampaignToStore`? This test would then become flaky, and nobody likes flaky tests. We might decide to just make all data used in this test local to this test, but then we would need to maintain the test data both in this test and in `MyCampaignTestObjects`.

Add Some Arbitrary Data Constraints:

Suppose now that there is a new design constraint on how campaigns can be stored in the database. Now all client names must be lowercased in all campaigns:

object MyCampaignTestObjects {
    val campaignToStore = Campaign(
    id = "someId",
    name = "someName",
    //We change the client name to match our new requirement
    client = "some_client",
    startDate = UTCDate(1955, 10, 6),
    endDate = UTCDate(1956, 10, 6),
    deleted = false)
}

BlogNewConstraint.scala

Now we start to see the issue with maintaining test data across the whole suite we’ve been constructing. We need to find every mock campaign used in our suite and ensure that its client field is lowercased. Realistically, many of our tests (specifically, in this example, the `fetchAllCampaigns` test) don’t care about the client field of their campaign, so we shouldn’t need to care about the client field value while setting up our mock test data. Because this example is small, it’s not cumbersome to update the value directly to satisfy the new constraint. Now imagine a large set of suites, each containing hundreds of unique test cases. Suddenly this single change requires a large amount of work to refactor one field across every test case. Nobody wants to do that monotonous maintenance. To address this, our team adopted the “someObject” structure to minimize data maintenance within our tests.

someObject Test Structure:

When designing this test structure, we wanted to make our test data extendable for use anywhere it is needed. We used a Scala `trait` to mix the necessary test-data factory functions into objects like the `MyCampaignTestObjects` object above:

trait CampaignTestObjects {
    def someCampaign(id: String = "someId",
                     name: String = "someName",
                     client: String = "some_client",
                     startDate: UTCDate = UTCDate(1955, 10, 6),
                     endDate: UTCDate = UTCDate(1956, 10, 6),
                     deleted: Boolean = false): Campaign =
        Campaign(
            id = id,
            name = name,
            client = client,
            startDate = startDate,
            endDate = endDate,
            deleted = deleted)
}
 object MyCampaignTestObjects extends CampaignTestObjects {
    //Any other setup methods for test data
}

Now we can revisit our `fetchAllCampaigns` test example.

it should "properly fetch all stored campaigns" in {
    Given("we store several unique campaigns")
    val campaignToStore = someCampaign(id = "someId")
    val anotherCampaignToStore = someCampaign(id = "someNewId")
    database.storeCampaign(campaignToStore)
    database.storeCampaign(anotherCampaignToStore)

    When("we fetch all campaigns")
    val allCampaigns = database.fetchAllCampaigns()

    Then("all campaigns are returned")
    allCampaigns shouldBe List(campaignToStore, anotherCampaignToStore)
}

BlogBasicSomeObject.scala

Inside this test, we’ve set up two unique campaigns by calling the `someCampaign` method from our test data trait. This populates the returned campaign with dummy data that we don’t care about. All we need out of this method is “some campaign” with “some data”. Now, instead of obscuring the intent of the test case with cumbersome, overly-expressive mock data, we simply override the default values with only the data we need. For the unique campaigns in the `fetchAllCampaigns` test, we only really care about each campaign’s identifier. We don’t update the name, client, startDate, etc., because this test doesn’t care about any of that data; we only need the campaigns to be unique in our database. Under this test structure, when we receive the design change about client names being lowercased, we don’t need to update our `fetchAllCampaigns` test at all.

Another Example:

Let’s provide another example that our team encountered. Campaigns inside our database now need to also store the amount of money spent on each ad campaign. We’re adding a new column to our database, and changing the database schema.

case class Campaign(id: String,
    name: String,
    client: String,
    totalAdSpend: Int,
    startDate: UTCDate,
    endDate: UTCDate,
    deleted: Boolean)

Now, every test that has a campaign involved needs to be updated to include a new field; but under the “someObject” structure we only need to add two lines and all existing tests should be working fine again:

trait CampaignTestObjects {

    def someCampaign(id: String = "someId",
                     name: String = "someName",
                     client: String = "some_client",
                     totalAdSpend: Int = 123456,
                     startDate: UTCDate = UTCDate(1955, 10, 6),
                     endDate: UTCDate = UTCDate(1956, 10, 6),
                     deleted: Boolean = false): Campaign =
        Campaign(
            id = id,
            name = name,
            client = client,
            totalAdSpend = totalAdSpend,
            startDate = startDate,
            endDate = endDate,
            deleted = deleted)
}

BlogPostSchemaChange.scala

 

Behavior Driven Tests:

The purpose of the “someObject” structure is to minimize data maintenance within tests. We want to ensure that we’re disciplined about only setting data that the tests need to care about. There are cases where data might seem necessary for what we are testing, but we can abstract that data away to de-couple the test’s reliance on hard coded values. For example, suppose we have a function that returns the sum of all the `totalAdSpend` across our database.

def sumAllSpend(campaignsToSum: List[Campaign]): Int =
    campaignsToSum.map(_.totalAdSpend).sum

To test this function, we might write a test like this:

it should "properly sum all totalAdSpend" in {
    Given("we store several unique campaigns")
    val campaignToStore = someCampaign(id = "someId", totalAdSpend = 123)
    val anotherCampaignToStore = someCampaign(id = "someNewId", totalAdSpend = 456)
    database.storeCampaign(campaignToStore)
    database.storeCampaign(anotherCampaignToStore)

    When("we sum all ad spend")
    val totalTotalAdSpend = sumAllSpend(database.fetchAllCampaigns())

    Then("the result is the sum of all ad spend")
    totalTotalAdSpend shouldBe 579
}

BlogPostNonBehaviorTest.scala

While this test does work, and it utilizes this “someObject” structure, it still forces data management at the test level.

`sumAllSpend` doesn’t really care about any one campaign’s `totalAdSpend` value. It only cares that we add all of the `totalAdSpend` values up correctly. We could instead write our test to assert on this behavior instead of doing the math ourselves and taking on the responsibility of managing more data.

it should "properly sum all totalAdSpend" in {
   Given("we store several unique campaigns")
   val campaignToStore = someCampaign(id = "someId")
   val anotherCampaignToStore = someCampaign(id = "someNewId")

   val allCampaignsToStore = List(campaignToStore,anotherCampaignToStore)
   allCampaignsToStore.foreach(campaign => database.storeCampaign(campaign))

   When("we sum all ad spend")
   val totalTotalAdSpend = sumAllSpend(database.fetchAllCampaigns())

   Then("the result is the sum of all ad spend")
   totalTotalAdSpend shouldBe allCampaignsToStore.map(_.totalAdSpend).sum
}

BlogPostBehaviorTest.scala

With this test, we don’t care what any campaign’s ad spend was, we don’t care how many campaigns are stored, and we don’t care about any constant value. This test simply asserts that the result is the sum of the `totalAdSpend` values of whatever campaigns we store in the database.

Conclusion:

In this introductory blog post, we explored the someObject testing structure in Scala, but the concept is not intended to be language specific. Scala makes it easy to implement through the use of Default Parameter Values, but in a future post I’ll show how it can be implemented through the Builder Pattern in a language like Java. Another unexplored “someObject” concept is the granularity of control in setting default data. This post introduces “global” and test-specific settings of default data, but doesn’t explore how to set suite-level data for our test objects, or the cases in which that might be useful. I’ll discuss that in a future post as well.

Code snippets:

BlogIntro.scala
BlogUpdateFunction.scala
BlogAbstractToSuite.scala
BlogTestWithUniqueTestData.scala
BlogNewConstraint.scala
BlogBasicSomeObject.scala
BlogPostSchemaChange.scala
BlogPostNonBehaviorTest.scala
BlogPostBehaviorTest.scala

Augment your pattern library with page types

Pattern libraries sometimes fall short of helping enterprise teams build different products the same way. These palettes of components (toolbars, pop-ins) and patterns (searching, navigating) can be assembled into any number of UIs, leading to too many right answers. While the public pattern libraries like Google Material must accommodate countless unimagined applications, our private libraries can serve us better.

We have special insight into our own users’ workflows. A page type is a layout and set of patterns packaged together according to the workflow they support. If your pattern library is a basket of ingredients, your page types are the recipes.

These starting points are immensely helpful in a few ways:

  1. Designers starting from 80% instead of from scratch are more likely to approach their design problems in the same way.
  2. Development teams without designers often have everything they need to start building.
  3. Teams have vocabulary that connects patterns to workflows.
  4. Page types make workflow-specific pattern definitions possible.

Defining page types

Workflow-specific pattern definitions

Many pattern libraries, especially the older ones like Yahoo Design Pattern Library, take a bottom-up approach. Documentation starts with the component, noting its general purpose, but focusing primarily on its interactive states. A handful of examples show the component used in various contexts. It is up to designers to imagine how this information relates to their own projects.

Page types are top-down: in this workflow, these components are used in this way.

The example below shows two Object Editor pages with different interpretations of “Toolbar.” The top applies Google Material’s definition of Toolbar:

Toolbar actions appear above the view affected by their actions.

The bottom applies a workflow-specific definition of Toolbar:

If the object is edited indirectly and previewed, configuration actions and preview actions are separated into panel and toolbar, respectively. In this way, the user is not led to believe temporary preview modes (like zoom) are saved with their configuration.

It takes a lot of design thinking to work through how components could best serve a workflow. It is highly valuable, then, to document your best solutions.

Essential page types

Page type documentation prescribes the layout, component arrangement, and interactive patterns used to achieve a desired workflow. Here’s a list of page types common to enterprise applications:

  • Manager
    Manipulate a collection of objects.
  • Editor
    Edit an object.
  • Detail
    Consume information through exploration.
  • Navigation
    Consume information by reading it.
Top: Object Manager, Bottom: Object Editor

Identifying page types

In application design, the layout and use of components on any given page create a workflow that serves the page’s central purpose. If the central purpose of your application page is not singular, your design is probably overcomplicated. Before you invent a new page type, reconsider your application architecture.

For example, the central purpose of a document editor is to edit a document. If the page is well-designed, its layout maximizes the editable area, and its buttons and tools all relate to editing. Notice that Google did not smash document editing, management, and publishing into one page.

Object Editor

Optimizing workflows

Sometimes different workflows serve the same central purpose. For example, direct editing and configuration are very different workflows even though their central purpose, Object Editing, is the same. In cases like these, it’s appropriate to offer variations.

Top: directly editing, Bottom: configuration with preview

While page types promote focused design, discipline should not compromise usability. In the examples above, direct editing is the easiest way to work with text. Configuration is the easiest way for non-technical users to change XML values, build an email template, or add filters to a photo, etc.

Optimizing content presentation

It’s important to differentiate between workflow and content presentation: tables emphasize data, lists emphasize titles, cards emphasize media, etc. If the workflow is the same, one page type can house various content types in their optimal formats. In the Object Manager examples below, relevant activities—finding, filtering, selecting, applying actions, etc.—are the same and can be accessed the same way.

Object Manager with tables

Object Manager with list

Similarly, let workflow define your page types, not content presentation or layout. “Table page” is not a good page type because your users do not want to table.

Using page types

Page type templates accelerate design and development projects by advancing their starting points. They also simplify the information architecture design process by providing constraint:

Each screen maps to a page type.

Therefore, each screen represents a workflow with singular purpose. Adhering to this principle influences decisions around how to group functionality into various pages.

Page type templates dropped into an IA map. Created in the Mindmeister app.

An IA design that uses your company’s set of established page types as illustrations is more tangible to stakeholders.

What are your page types?

Page types are not new; website designers—especially those who use template-based CMS’s like Joomla—have been using them all along. They are essential to Information Architecture.

We application designers have been somewhat distracted. Pattern libraries—especially when incorporated into UI development environments like Storybook—are incredibly useful. However, only we know what we want to help our users do. Our own private pattern libraries can be far more workflow-aware than the public libraries from which we draw our inspiration.

What are your page types?


This post was originally published on Medium.com

Three Takeaways from CSSConf 2016

This year Bazaarvoice sponsored CSSConf 2016 in beautiful Boston, MA, USA and I was able to attend!
Here are my three top takeaways from CSSConf 2016:

Flexy Flexy Flexbox

A little over a year ago, our application team wasn’t sure how “stable” Flexbox or its spec were: there was already an old syntax, a new syntax, and a weird IE10 “tweener” syntax.

The layout advantages Flexbox brought were strong enough (*cough* vertical centering) that we decided to move forward with it and prefix all the things. Now browser support is so good that if you can drop IE8 and work around some known IE11 bugs, there is no reason not to use Flexbox in your designs right now.

A great reference I keep going back to for Flexbox is this css-tricks guide. Here are some other tips and tricks from the conference:

  • Flexbox is now available in Bootstrap 4
  • Use CSS Grid (when it becomes available) for major page layout, and Flexbox for UI elements
  • For mobile / small screens: add a media query and set the flex-direction to column to stack your cells instead
  • Do as much as you can on the container to keep your code DRY (Don’t Repeat Yourself)
  • We can finally get rid of that pesky col-2, col-8, col-crapIAddedWrong grid system!

For more information, I recommend CSS4 Grid: True Layout Finally Arrives by Jen Kramer and It’s Time To Ditch The Grid System by Emily Hayman.

Stop Thinking In Pixels

The talk Stop Thinking In Pixels by Keith Grant was particularly enlightening.
The basic premise was not to micromanage your CSS:

Without fully understanding what CSS is doing for us, we try to push through it to control exactly what is going on in the browser.

Driving this point home, Keith recommended to stop thinking in pixels because

The pixels don’t matter. Let the browser do it.

You should instead be thinking in terms of the em and rem. Tools that simply convert px to em aren’t the answer either — you’re still thinking in terms of pixels and therefore missing out on some important benefits. Instead, learn from something like type scale and approach measurements with a fresh perspective.

I recommend watching the talk in full, but a quick cheatsheet follows:

  • font-size: rem
  • padding, margin, border-radius, etc.: em
  • border-width: px

When in doubt, use em.

To summarize,

Ems are the most powerful when you fully embrace them.

Apps vs Documents

In this day and age we are all used to thinking in terms of “apps”. But the trinity of HTML, CSS, and JS was not conceived in this day and age. Two great quotes I wrote down from Component-Based Style Reuse by Pete Hunt are

CSS is great for documents, maybe not 2016 Apps

and

If you sat down and created styling in 2016, you would not come up with CSS

Our newest applications are written in React, which encourages developers to think of things in terms of components — pieces of UI that are reusable in different contexts. The Cascading part of CSS interferes with that, however: depending on the context your component is dropped into, it may look drastically different across usages. When that is not what you want, Pete’s ideas center around reusing components, not CSS classes.

As you can imagine, this idea is largely controversial in a conference with a name like CSSConf, but I will continue to keep my eye on it. Pete’s thought leadership on this topic inspires me to challenge norms and dare to envision things differently. After all, if we’re constantly fighting with our tool (CSS), that tool may not be right for the job.

Thanks for reading! For a full list of talks and slides from the conference, check out https://2016.cssconf.com/#videos.

How to seamlessly move 300 Million shoppers to a highly scalable architecture, part 2

Divide and Conquer

As engineers, we often like nice clean solutions that don’t carry along what we call technical debt. Technical debt is the stuff we have to go back and fix or rewrite later, or that requires significant ongoing maintenance effort. In a perfect world, we fire up the new platform and move all the traffic over. If you find that perfect world, please send an Uber for me. Add to this the scale of traffic we serve at Bazaarvoice, and it’s obvious it would take time to harden the new system.

The secret to how we pulled this off lies in the architectural choice to break the challenge into two parts: frontend and backend. While we reengineered the front end into the new JavaScript solution, there were still thousands of customers using the template-based front end. So we took the original server-side rendering code and turned it into a service talking to our new Polloi service. This enabled us to handle requests from client sites exactly like the original Classic system did.

Also, we created a service that improved upon the original API but remained compatible from a specific version forward. We chose not to try to be compatible with every version for all time, as all APIs go through evolution and deprecation. We naturally chose the version that was compatible with the new JavaScript front end. With these choices made, we could independently decide when and how to move clients to the new backend architecture, irrespective of the front-end service they were using.

A simplified view of this architecture looks like this:

Divide_Conquer_arch

With the above in place, we can switch a JavaScript client to use the new version of the API just by changing the endpoint associated with the API key. For a template-based client, we can change the endpoint to the new rendering service through a configuration in our CDN, Akamai.

Testing for compatibility is a lot of work, though not particularly difficult. API compatibility is pretty straightforward, while testing whether a template page renders correctly is a little more involved, especially since those pages can be highly customized. Since it was a one-time event, we found the most effective way to accomplish the latter was manual inspection, making sure the pages rendered exactly the same on our QA clusters as they did in the production Classic system.

Early on, we found success by moving cohorts of customers to the new system together. At first we would move a few at a time, making absolutely sure the pages rendered correctly, monitoring system performance, and looking for any anomalies. If we saw a problem, we could move them back quickly by reversing the change in Akamai. Much of this was manual at first, so in parallel we had to build up tooling to handle the switching of customers, which even included working with Akamai to enhance their API so we could automate changes in the CDN.

From moving a few clients at a time, we progressed to moving tens of clients at a time. Through a tremendous engineering effort, in parallel we improved the scalability of our ElasticSearch clusters and other systems, which allowed us to move hundreds of clients at a time, then 500 clients at a time. As of this writing, we’ve moved over 5,000 sites, and 100% of our display traffic is now being served from our new architecture.

More than just serving the same traffic as before, we have been able to move over display traffic for new services like our Curations product, which takes in and processes millions of tweets, Instagram posts, and other social media feeds. That our new architecture could handle this additional, large-scale use case without change is a testament to innovative engineering and determination by our team over the last 2+ years. Our largest future opportunities are enabled because we’ve successfully been able to realize this architectural transformation.

Rearchitecting the Team

In addition to rearchitecting the service to scale, we also had to rearchitect our team. As we set out on this journey to rebuild our solution into a scalable, cloud based service oriented architecture, we had to reconsider the very way our teams are put together.  We reimagined our team structure to include all the ingredients the team needs to go fast.  This meant a big investment in devops – engineers that focus on new architectures, deployment, monitoring, scalability, and performance in the cloud.

A critical part of this was a cultural transformation in which the service is completely owned by the team, from understanding the requirements, to code, to automated tests, to deployment, to 24×7 operation. This means building out a complete monitoring and alerting infrastructure and rotating on-call duty through all members of the team. The result is that the team becomes 100% aligned around the success of the service and there is no “wall” to throw anything over – the commitment and ownership stay with the team.

For this team architecture to succeed, the critical element is to ensure the team has all the skills and team players needed to succeed.  This means platform services to support the team, strong product and program managers, talented QA automation engineers that can build on a common automation platform, gifted technical writers, and of course highly talented developers.  These teams are built to learn fast, build fast, and deploy fast, completely independent of other teams.

Supporting the service-oriented teams, a key element is our Platform Infrastructure team, which we created to provide a common set of cloud services to support all our teams. Platform Infrastructure is responsible for the virtual private cloud (VPC) supporting the new services running in Amazon Web Services. This team handles the overall concerns of security, network, service discovery, and other common services within the VPC. They also set up a set of best practices, such as ensuring all cloud instances are tagged with the name of the team that started them.

To ensure the best practices are followed, the Platform Infrastructure team created “Beavers” (a play on words describing an engineer at Bazaarvoice, a “BVer”). An idea borrowed from Netflix’s chaos monkeys, these are automated processes that run and examine our cloud environment in real time to ensure the best practices are followed. For example, the “Conformity Beaver” runs regularly and checks to make sure all instances and buckets are tagged with team names. If it finds one that is not, it infers the owner and emails the team alias about the problem. If the problem is not corrected, Conformity Beaver can terminate the instance. This is just one example of the many Beavers we have created to help maintain consistency in a world where we have turned teams loose to move as quickly as possible.
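As a rough sketch of the idea (this is not Bazaarvoice’s actual Beaver code, and the tag name is assumed), a conformity check like this can be written in a few lines against the AWS APIs:

# Minimal sketch of a "Conformity Beaver"-style check (hypothetical code):
# scan EC2 instances and report any that are missing a "team" tag.
import boto3


def find_untagged_instances(region="us-east-1"):
    ec2 = boto3.client("ec2", region_name=region)
    untagged = []
    for page in ec2.get_paginator("describe_instances").paginate():
        for reservation in page["Reservations"]:
            for instance in reservation["Instances"]:
                tags = {t["Key"]: t["Value"] for t in instance.get("Tags", [])}
                if "team" not in tags:
                    untagged.append(instance["InstanceId"])
    return untagged


if __name__ == "__main__":
    for instance_id in find_untagged_instances():
        # The real Beaver infers the owner, emails the team, and can eventually
        # terminate the instance; here we just print the offenders.
        print(f"Instance {instance_id} has no 'team' tag")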

An additional key common capability created by the Platform Infrastructure team is our Badger monitoring service. Badger enables teams to easily plug in a common healthcheck monitoring capability and can automatically discover nodes as they are started in the cloud. This service lets teams easily implement healthchecks that are captured in a common place and escalated through a notification system in the event of a service degradation.

The Proof is in the Pudding

The Black Friday and Holiday shopping season of 2015 was one of the smoothest ever in the history of Bazaarvoice while serving record traffic. From Black Friday to Cyber Monday, we saw over 300 million visitors.  At peak on Black Friday, we were seeing over 97,000 requests per second as we served up over 2.6 billion review impressions, a 20% increase over the year before.  There have been years of hard work and innovation that preceded this success and it is a testimony to what our new architecture is capable of delivering.

Keys to success

A few ingredients we’ve found to be important to successfully pull off a large scale rearchitecture such as described here:

  • Brilliant people. There is no replacement for brilliant engineers who are fearless in adopting new technologies and tackling what some will say can’t be done.
  • Strong leaders – and the right leaders at the right time. Often the leaders that sell the vision and get an undertaking like this going will need to be supplemented with those that can finish strong.
  • Perseverance and Determination – building a new platform using new technologies is going to be a much bigger challenge than you can estimate, requiring new skills, new approaches, and lots of mistakes. You must be completely determined and focused on the end game.
  • Tie back to business benefit – keep the business informed of the benefits and ensure that those benefits are delivered continuously rather than in a big bang. It will be a large investment, and it is important that the business sees some level of return as quickly as possible.
  • Make space for innovation – create room for engineers to learn and grow. We support this through organizing hackathons and time for growth projects that benefit the individual, team, and company.

Rearchitecture is a Journey

One piece of advice: don’t be too critical of yourself along the way; celebrate each step of the rearchitecture journey. As software engineers, we are driven to see things “complete”, wrapped up nice and neat, finished with a pretty bow. When replacing an existing system of significant complexity, this ideal is a trap, because in reality you will never be complete.  It has taken us over 3 years of hard work to reach this point, and there are more things we are in the process of moving to newer architectures. Once we complete the things in front of us now, there will be more steps to take, since we live in an ever-evolving landscape. It is important to remember that we can never truly be complete, as there will always be new technologies and new architectures that deliver more capabilities to customers, faster, and at a lower cost. It’s a journey.

Perhaps that is the reason many companies can never seem to get started. They understandably want to know “When will it be done?” and “What is it going to cost?”, and the unpopular answers are, of course, never and more than you could imagine.  The solution to this puzzle is to identify and articulate the business value to be delivered as a step in the larger design of a software platform transformation.  The trouble, of course, is that you may only realistically be able to design the first few steps of your platform rearchitecture, leaving a lot of technical uncertainty ahead. Get comfortable with it and embrace it as a journey.  Engineer solid solutions in a service-oriented way with clear interfaces, and your customers will be happy never knowing they were switched to the next generation of your service.

authored by Gary Allison

How to seamlessly move 300 Million shoppers to a highly scalable architecture, part 1

At Bazaarvoice, we’ve pulled off an incredible feat, one that is such an enormous task that I’ve seen other companies hesitate to take on. We’ve learned a lot along the way and I wanted to share some of these experiences and lessons in hopes they may benefit others facing similar decisions.

The Beginning

Our original Product Ratings and Review service served us well for many years, though it eventually encountered severe scalability challenges, and there were several aspects we wanted to change: a monolithic Java code base, fragile custom deployment, and server-side rendering. Creative use of tenant partitioning, data sharding, and horizontal read scaling of our MySQL/Solr-based architecture allowed us to scale well beyond our initial expectations. We’ve documented how we accomplished this scaling in several past posts on our developer blog if you’d like to understand more. Still, time marches on, and our clients have grown significantly in number and content over the years. New use cases have come along since the original design: an emphasis on the mobile user and responsive design, accessibility, a growing network of consumer generated content flowing between brands and retailers, and new social content that can come in floods from Twitter, Instagram, Facebook, etc.

As you can imagine, since the product ratings and reviews in our system are displayed on thousands of retailer and brand websites around the world, the read traffic from review display far outweighs the write traffic from new reviews being created. So, the addition of clusters of Solr servers that are highly optimized for fast queries was a great scalability addition to our solution.

A highly simplified diagram of our classic architecture:

[Figure: Highly simplified view of our Classic Architecture]

However, in addition to fast review display when a consumer visits a product page, another challenge started emerging from our growing network of clients. This network is composed of Brands like Adidas and Samsung who collect reviews on their websites from consumers who purchased the product and then want to “syndicate” those reviews to a set of retailer ecommerce sites where shoppers can benefit from them. Aside from the very interesting challenges of product matching, under the MySQL architecture this meant reviews could be copied over and over throughout the network. This approach worked for several years, but it was clear we needed a plan for the future.

As we grew, so did the challenge of an expanding volume of data in the master databases to serve across an expanding network of clients. This, together with the need to deliver more front-end web capability to our customers, drove us to what I hope you will find is a fascinating story of rearchitecture.

The Journey Begins

One of the first things we decided to tackle was to start moving analytics and reporting off the existing platform so that we could deliver new insights to our clients showing how reviews are used by shoppers in their purchase decisions. This choice also enabled us to decouple the architecture and spin up parallel teams to speed delivery. To deliver these capabilities, we adopted big data architectures based on Hadoop and HBase to be able to assimilate hundreds of millions of web visits into analytics that would paint the full shopper journey picture for our clients. By running map reduce over the large set of review traffic and purchase data, we are able to give our clients insight into these shopper behaviors and help our clients better understand the return on investment they receive from consumer generated content. As we built out this big data architecture, we also saw the opportunity to offload reporting from the review display engine. Now, all our new reporting and insight efforts are built off this data and we are actively working to move existing reporting functionality to this big data architecture.

On the front end, flexibility and mobile were huge drivers in our rearchitecture. Our original template-driven, server-side rendering can provide flexibility, but that ultimate flexibility is only required in a small number of use cases. For the vast majority, client-side rendering via javascript, with behavior that can be configured through a simple UI, would yield a better mobile-enabled shopping experience that’s easier for clients to control. We made the call early on not to try to force migration of clients from one front-end technology to another. For one thing, it’s not practical for the first version of a product to be 100% feature-equivalent to its predecessor. For another, there was simply no reason to make clients choose. Instead, as clients redesigned their sites and as new clients were onboarded, they opted in to the new front-end technology.

We attracted some of the top javascript talent in the country to this ambitious undertaking. There are some very interesting details of the architecture we built that have been described on our developer blog and that are available as open source projects in our bazaarvoice GitHub organization. Look for the post describing our Scoutfile architecture in March of 2015. The BV team is committed to giving back to the open source community, and we hope this innovation helps you in your rearchitecture journey.

On the backend, we took inspiration from both Google and Netflix. It was clear that we needed to build an elastic, scalable, reliable, cloud-based data store and query layer. We needed to reorganize our engineering team into autonomous, service-oriented teams that could move faster. We needed to hire and build new skills in new technologies. We needed to be able to roll this out as transparently as possible to our clients while serving live shopping traffic, so that no one would know it was happening at all. Needless to say, we had our work cut out for us.

For the foundation of our new architecture, we chose Cassandra, an open source NoSQL data store influenced by ideas from Google’s BigTable architecture. Cassandra had been battle-hardened at Netflix and was a great solution for a cloud-resilient, reliable storage engine. On this foundation we built a service we call Emo, originally intended for sentiment analysis. As we made progress towards delivery, we began to understand the full potential of Cassandra and its NoSQL-based architecture as our primary display storage.

With Emo, we have solved the potential data consistency issues of Cassandra and can guarantee ACID database operations. We can also seamlessly replicate and coordinate a consistent view of all the rating and review data across AWS availability zones worldwide, providing a scalable and resilient way to serve billions of shoppers. We can also be selective in the data that replicates, for example out of the European Union (EU), so that we can provide assurances of privacy for EU-based clients. In addition to this consistency capability, Emo provides a databus that allows any Bazaarvoice service to listen for the kinds of changes that service particularly needs, perfect for a new service-oriented architecture. For example, a service can listen for the event of a review passing moderation, which would mean that it should now be visible to shoppers.

While Emo/Cassandra gave us many advantages, its NoSQL query capability is limited to what Cassandra’s key-value paradigm supports. We learned from our experience with Solr that having a flexible, scalable query layer on top of the master datastore resulted in significant performance advantages for calculating on-demand results of what to display during a shopper visit. This query layer naturally had to provide distributed-scaling advantages to match Emo/Cassandra. We chose ElasticSearch for our architecture and implemented a flexible rules engine we call Polloi to abstract the indexing and aggregation complexities away from engineers on the teams that would use this service. Polloi hooks into the Emo databus and provides near-real-time visibility into changes flowing into Cassandra.
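Polloi itself is an internal rules engine, so the sketch below is only a generic illustration of the kind of near-real-time query a display service might issue against an Elasticsearch index that mirrors the primary store. The index name and field names here are assumptions for illustration, not Polloi’s actual schema.

```python
# Generic sketch of a display-time query against an Elasticsearch index that
# mirrors the primary store. Index and field names are illustrative only.
import requests

ES_URL = "http://localhost:9200"

def approved_reviews_for_product(product_id, page_size=20):
    query = {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"product_id": product_id}},
                    {"term": {"moderation_status": "approved"}},
                ]
            }
        },
        "sort": [{"submission_time": "desc"}],
        "size": page_size,
    }
    resp = requests.post(f"{ES_URL}/reviews/_search", json=query, timeout=5)
    resp.raise_for_status()
    return [hit["_source"] for hit in resp.json()["hits"]["hits"]]
```

A rules engine like Polloi would sit in front of something like this, so service teams do not have to manage index mappings and aggregation details themselves.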

The rest of the monolithic code base was reimplemented into services as part of our service oriented architecture. Since your code is a direct reflection of the team, as we took on this challenge we formed autonomous teams that owned everything full cycle from initial conception to operation in production. We built the teams with all the skills needed for success: product owners, developers, QA engineers, UX designers (for front end), DevOps engineers, and tech writers. We built services that managed the product catalog, UI Configuration, syndication edges, content moderation, review feeds, and many more. We have many of these rearchitected services now in production and serving live traffic. Some examples include services that perform the real time calculation of what Brands are syndicating consumer generated content to which Retailers, services that process client product catalog feeds for 100s of millions of products, new API services, and much more.

To make all of the above more interesting, we also created this service-oriented architecture to leverage the full power of Amazon’s AWS cloud. It was clear we had the uncommon opportunity to build the platform from the ground up to run in the cloud with monitoring, elastic resiliency, and security capabilities that were unavailable in previous data center environments. With AWS, we can take advantage of new hardware platforms with a push of a button, create multi datacenter failover capabilities, and use new capabilities like elastic MapReduce to deliver big data analytics to our clients. We build auto-scaling groups that allow our services to automatically add compute capacity as client traffic demands grow. We can do all of this with a highly skilled team that focuses on delivering customer value instead of hardware procurement, configuration, deployment, and maintenance.

So now, after two-plus years of hard work, we have a modern, scalable, service-oriented solution that exactly mirrors the original monolithic service. More importantly, we have a production-hardened new platform that we can scale horizontally for the next 10 years of growth. We can now deliver new services much more quickly by leveraging the platform investment we have made, and deliver customer value at scale faster than ever before.

So how did we actually move 300 million shoppers without them even knowing?  We’ll take a look at this in an upcoming post!

authored by Gary Allison

 

Automated Product Matching, Part II: Guidelines

This post continues the discussion from Automated Product Matching, Part I: Challenges.

System First, Algorithm Second

With each design iteration, I gradually came to appreciate how important it was to have an overall matching system that was well designed. The quality of the matching algorithm did not matter if its output was going to be undermined by failures in other parts of the system. Focusing on the algorithmic design first and the system design second means that you wind up with an interesting technical talk, but you have not really solved the problem for the end users or the business. An example from my past experience might help give a flavor of why the system view matters just as much as the algorithmic view.

In one of the earlier systems I worked on, after having successfully defined how to build a set of “canonical” products which would be used to match against all our incoming data, and having created a reasonably good matching algorithm, we were happy that we could now continually process and match all of our data each day and at scale. The problem was solved, but only in a static sense. We chose to ignore how new products would get into the canonical set. As time went on, this became more and more of a problem, until we finally had to address this omission. This was about the time when iPads first hit the market and the lack of freshness became glaringly obvious to anyone looking at iPads on our web site.

There was nothing algorithmically challenging about solving this: we knew how to create canonical products, but the code we built did not support adding new canonical products easily. Although the guts of the algorithmic logic could be re-used, the vast majority of the code that comprised the system we built around it needed to be redesigned. A little forethought here would have saved many months of additional work, not to mention all the bad user experiences that we were delivering due to our lack of matching newer products.

“A good algorithm in a bad system is indistinguishable from a bad algorithm.”

Keep it Simple

The difficulty of matching products ranges from easy to impossible. I recommend starting with an algorithm that focuses on matching the easiest things first and building a system around that. This allows you to start working on the important system issues sooner and get some form of product matching working faster. From a system design perspective, the product data needs to find its way to the matching algorithm; you will need a data model and data storage for the matches; you will need some access layer for the matches; and you will likely need a system for evaluating and managing the product matches.

The matching logic itself will be a very small percentage of the code you have to write and there are plenty of challenges in all these other areas. There are important lessons to be learned just in putting the system together, and even the simplest matching logic will lead to a greater understanding of how to build the appropriate data models you will need.
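To make “easiest things first” concrete, here is a minimal sketch of a baseline matcher: match on exact identifier agreement, fall back to an exact normalized-title match, and defer everything else. The record fields (upc, brand, title) are illustrative assumptions, not a prescribed schema.

```python
# Minimal "easy matches first" baseline: exact identifier match, then exact
# normalized-title match, otherwise defer to a smarter algorithm or a person.
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9 ]", " ", text.lower())).strip()

def easy_match(a, b):
    """Return 'match', 'no-match', or 'defer' for two product records."""
    if a.get("upc") and b.get("upc"):
        return "match" if a["upc"] == b["upc"] else "no-match"
    if normalize(a.get("brand", "")) == normalize(b.get("brand", "")) and \
       normalize(a["title"]) == normalize(b["title"]):
        return "match"
    return "defer"  # hand the hard cases off rather than guessing
```

Even a baseline this simple forces you to build the data flow, storage, and evaluation machinery around it, which is where most of the early lessons are.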

“For the human makers of things, the incompleteness and inconsistencies of our ideas become clear only during implementation.”

People cannot be Ignored

The topic name of “automated matching” implies that people will not be involved. Combine this with engineers who are conditioned to build systems that remove the rote, manual work from tasks and there is the risk of being completely blind to a few important questions.

Most fundamentally, you should ask whether you really need automated matching and whether it will be the most cost-effective solution. This is going to be determined by the scale of your problem. If your matching needs are on the order of only thousands of products, there are crowd-sourcing solutions that make manual matching a very viable option. If your scale is on the order of millions, manual matching is not out of the question, though it may take some time and money to get through all the work. Once you get into the range of tens of millions, you likely have little choice but to use some form of automated matching.

Another option is a hybrid approach that uses algorithms to generate candidate matches and has people assigned to accept or reject the matches. This puts less pressure on the accuracy requirements of your algorithms and makes the people process more efficient, so it can be viewed as an optimization of a manual matching system. An approach that scales slightly better is to automatically match the easy products and defer the harder ones to manual matching or verification.
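A sketch of that hybrid routing, with illustrative (not tuned) thresholds: confident scores are accepted automatically, clear non-matches are rejected, and the uncertain middle goes to a human review queue.

```python
# Sketch of the hybrid approach: auto-accept confident matches, auto-reject
# clear non-matches, and queue the uncertain middle for human review.
AUTO_ACCEPT = 0.95   # illustrative thresholds, not tuned values
AUTO_REJECT = 0.40

def route_candidate(candidate_pair, score, accepted, rejected, review_queue):
    if score >= AUTO_ACCEPT:
        accepted.append(candidate_pair)
    elif score < AUTO_REJECT:
        rejected.append(candidate_pair)
    else:
        review_queue.append((score, candidate_pair))  # humans accept or reject
```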

The other question about human involvement concerns how the quality of the matching system will be measured. Building training and/or evaluation data sets will likely require some human input and tools to support this work. Considering how feedback will be used is important because it can have an impact on the matching system and algorithm designs. Evaluation will likely need to be an ongoing process, so make sure consideration is given to the longer-term human resource requirements.

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” — Mark Twain

One Algorithm will not Rule Them All

Simply put: it is not possible for a single algorithm to do equally well at matching all types of products. It is possible to use the same class of algorithm on some parts of the product category space, but you will need to parameterize the algorithm with category-specific information. Here are a couple of examples to illustrate this point.

Consider the color attribute of a product. For a product like a dishwasher, color is a secondary characteristic and would be unimportant for the purpose of review content. For some types of products, say a computer monitor, the color might not even matter for price comparisons. For a $50 savings, many people are not going to care if their monitor is black or silver. On the other hand, for products like cosmetics, the color is the most essential feature of the product. If your algorithm universally treats all attributes the same while matching, regardless of the type of product it is, then it will necessarily perform poorly in some categories.

To get better accuracy, you will have to invest in human domain expertise, either to encode domain-specific information or to train algorithms for each product category. If you have ever taken a hard look at camera products, there is a lot of cryptic symbology in lens specifications and other camera accessories. Without encoding knowledge from a domain expert, it is not going to be possible to match these types of products well. There’s no silver bullet. You can decide to allocate your time to one set of categories over another, but you should expect limited accuracy in the areas you have not invested in.
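One simple way to express this is to keep a single scoring function but parameterize it with category-specific attribute weights, so color barely matters for dishwashers but dominates for cosmetics. The categories, attributes, and weights below are illustrative assumptions only.

```python
# Sketch of one scoring function parameterized by category-specific attribute
# weights. Categories, attributes, and weights are illustrative only.
CATEGORY_WEIGHTS = {
    "dishwashers": {"brand": 0.50, "model": 0.45, "color": 0.05},
    "cosmetics":   {"brand": 0.40, "product_line": 0.20, "color": 0.40},
}

def attribute_similarity(a, b, attr):
    """Crude equality check per attribute; real systems compare more carefully."""
    return 1.0 if a.get(attr) and a.get(attr) == b.get(attr) else 0.0

def match_score(a, b, category):
    weights = CATEGORY_WEIGHTS[category]
    return sum(w * attribute_similarity(a, b, attr) for attr, w in weights.items())
```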

Another example lies in the contrast between consumer electronics and books. The product titles for consumer electronics are descriptive in that they contain a list of product features. With a rich enough title, there are enough features to yield relatively high confidence in matches. However, titles for books are arbitrary words and phrases chosen by the author and may give you little understanding of the contents. Similarity between book titles is not correlated with the similarity of their content.

“Do not mistake knowing something for knowing everything.”

Products are Not Strings

String-based matching algorithms may suffice depending on your targets for accuracy and coverage, but there is a hard limit on how well they will perform without imparting semantics to the strings. Not all words in product titles are created equal, so it helps to do something akin to part-of-speech tagging (e.g., the product “noun” is much more important than a product adjective, such as its color). Showing two different dishwashers as being the same might be a data error, but it is a characteristically different user experience than showing a dishwasher and a shoe as being the same. A string comparison algorithm might match the shoe to the dishwasher because it had the same color plus a few other strings in common, but it has no understanding that the mismatch of the nouns “shoe” and “dishwasher” should trump anything else that might indicate they are similar.

You will need more than just adjectives and nouns, though. There are many different types of adjectives used to describe products: colors, materials, dimensions, quantities, shapes, patterns, etc. Depending on the type of product, these may or may not matter in how you want to define product equivalence.

It is also true that just because two strings are different, it is not necessarily the case that they are referring to two different concepts. If you do not encode the knowledge that “loafer” and “shoe” are semantically similar, even though they have no string similarity, you will be limited in matching the variations that different data sources will provide. For more accurate results, it is important to semantically tokenize the strings so that your algorithms can work on a normalized, conceptual view of the products.
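A toy sketch of semantic tokenization along these lines: map raw tokens to canonical concepts via a hand-curated synonym table, tag the product noun, and veto any match where the nouns disagree. The vocabulary here is a deliberately tiny assumption, not a real lexicon.

```python
# Sketch of semantic tokenization with a hand-curated synonym table and a
# "noun mismatch vetoes the match" rule. The vocabulary is a toy example.
SYNONYMS = {"loafer": "shoe", "loafers": "shoe", "sneaker": "shoe",
            "shoe": "shoe", "dishwasher": "dishwasher"}
PRODUCT_NOUNS = {"shoe", "dishwasher"}

def product_noun(title):
    for token in title.lower().split():
        concept = SYNONYMS.get(token, token)
        if concept in PRODUCT_NOUNS:
            return concept
    return None

def nouns_compatible(title_a, title_b):
    noun_a, noun_b = product_noun(title_a), product_noun(title_b)
    # If both titles declare a product noun and they disagree, no amount of
    # shared adjectives ("black", "large", ...) should make these a match.
    return not (noun_a and noun_b and noun_a != noun_b)
```

With this, “Black Leather Loafer” and “Black Dishwasher” are vetoed despite the shared color, while “Loafer” and “Shoe” are treated as the same concept despite having no string similarity.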

Some algorithmic techniques might help in dealing with these word synonyms, but if the domain vocabulary is restricted, it may even be feasible to manually curate the important variations and their parts of speech. Whether algorithmic or hand-curated, you will need to encode this domain knowledge so that it is dependent on the product’s context. The string “apple” may refer to a popular brand, a deciduous fruit, or the scent of a hair care product. Category and peripheral information about the product will be needed to disambiguate “apple” and similar strings.

“Algorithms are for people who don’t know how to buy RAM.”

NLP Will Not Save You

Product titles are not amenable to generic natural language processing (NLP) solutions. Product titles are not well-formed sentences and have their own structure that often varies by the person or company that crafted them. Thinking that product matching can be solved with some off-the-shelf NLP techniques is a mistake. There are some NLP techniques that can be applied, but they have to be carefully tailored to work in this domain.

Consider the relative importance of word order between product titles for consumer electronics and for books. For electronics, the title word order does not really matter: “LCD TV 55 inch Sony” is not semantically different from “Sony 55 inch LCD TV”. Yet if you change the order of two words in a book’s title, you now have something completely different. “The Book of the Moon” and “The Moon Book” are two completely different books.
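A tiny illustration of the point: an order-insensitive token comparison (with stopwords dropped, as matching pipelines commonly do) is exactly right for the electronics titles and exactly wrong for the book titles.

```python
# Toy illustration: order-insensitive token comparison works for electronics
# titles but collapses two different books into the same thing.
STOPWORDS = {"the", "of", "a", "an"}

def token_set(title):
    return frozenset(title.lower().split()) - STOPWORDS

electronics = ("LCD TV 55 inch Sony", "Sony 55 inch LCD TV")
books = ("The Book of the Moon", "The Moon Book")

print(token_set(electronics[0]) == token_set(electronics[1]))  # True: same product
print(token_set(books[0]) == token_set(books[1]))              # True: a false match
```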

Product descriptions offer the best opportunity for the use of NLP techniques, since they tend to be natural language descriptions. Unfortunately, all sorts of peripheral concepts are included in the descriptions and this makes it hard to use them for product matching. It is also true that the descriptions for similar, but not identical products tend to look very similar. The best use of descriptions is in helping to determine the product’s category, which can help with matching, but do not expect that it will provide a strong signal for matching.

“If you find a solution and become attached to it, the solution may become your next problem.”

Design for Errors

Neither the input product data nor your matching algorithm will be 100% accurate. You need to make sure your algorithms are not rigidly expecting consistent data. This includes being able to compensate and/or correct bad data when it is detected. This is easier said than done, especially because we have all been conditioned to prefer more elegant and/or understandable code. Software code can look quite poetic when you do not have to litter it with constant sanity checks for edge cases and all the required exception handling this leads to. Unfortunately, the real world of crawl and feed data is not very elegant, nor will your algorithms produce flawless results.

This assumption about imperfect data should not be limited to the technical side of the product. I believe it is critically important that product designers work closely with the algorithmic designers to understand the characteristics of the data and the nature of the errors since this can be critical in designing a good user experience. Different algorithmic choices can result in different types of errors, and only by working together can the trade-offs be evaluated and good choices made which will influence how the users will perceive the product matching accuracy.

As a simple example of how designers and engineers can work together to make a better solution, suppose the engineers build a matching algorithm that outputs some measure of confidence. In isolation, the engineers will have to find the right threshold to balance accuracy and coverage, then declare matches for those above the threshold. In this scenario, the user interface designer only knows whether or not there are matches, so the interface is designed with the wording that says “matching products”. If these products are on the lower end of the confidence range, and they are bad matches, it will be a bad user experience.

Alternatively, if the designers are aware that there is a spectrum of match confidence, they could agree to expose those confidence values and instead of having to declare “matching products”, when the confidence is lower, they might opt to use softer wording like “similar products”, maybe even positioning them differently on the page. A user will not be quite as disappointed in the matching if they were only promised “similar” products.
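A minimal sketch of what exposing that spectrum might look like, with band boundaries and labels that are purely illustrative and would be chosen jointly by designers and engineers:

```python
# Sketch of mapping match confidence to presentation wording instead of a
# bare yes/no. Band boundaries and labels are illustrative assumptions.
def presentation_label(confidence):
    if confidence >= 0.90:
        return "matching products"
    if confidence >= 0.60:
        return "similar products"   # softer promise for softer evidence
    return None                     # below this, don't show the match at all
```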

“There are two ways to write error-free programs; only the third one works.” — Alan J. Perlis

Choose the Right Metrics

Suppose you have built your matching system, and an evaluation system to go along with it, and you find out the accuracy rate is 95%. Assuming your system is giving reasonably good coverage, and in the presence of bad and missing data, this is definitely an impressive achievement. But what if within that 5% of errors lie the currently most popular products? If you weight error frequency by number of page views, the effective accuracy rate is going to be much, much lower. All the people viewing those mismatched products are not going to be impressed with your algorithm.

Even without considering weighting by page views, consider a situation where you display 20 products at a time on a page. With a 5% error rate, the average page you show contains an error. Measured per page rather than per product, your error rate is nowhere near 5%.
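The back-of-the-envelope arithmetic behind that claim, assuming errors are independent across the 20 products on a page:

```python
# Page-level view of a 5% per-product error rate, assuming independent errors.
per_product_error = 0.05
products_per_page = 20

expected_errors_per_page = per_product_error * products_per_page          # 1.0
prob_page_has_error = 1 - (1 - per_product_error) ** products_per_page    # ~0.64

print(expected_errors_per_page)           # on average one bad match per page
print(round(prob_page_has_error, 2))      # ~64% of pages show at least one error
```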

Matching algorithms will not be perfect, and even near-perfect algorithms will need help. This help usually comes in the form of tools that allow human intervention to influence the overall quality or the algorithmic output. When you are making an error on a highly visible product, someone should be able to quickly override the algorithmic results to fix the problem.

“Williams and Holland’s Law: If enough data is collected, anything may be proven by statistical methods.”

Are We Done Yet?

There is no shortage of other product matching topics to discuss and interesting details to dive into. These first two blog posts have tried to capture some of the higher-level considerations. Future articles will provide more detailed examinations of these topics and some of the approaches we have taken in Bazaarvoice’s product matching systems.

Automated Product Matching, Part I: Challenges

Bazaarvoice’s flagship product is a platform for our clients to accept, display and manage consumer generated content (CGC) on their web sites. CGC includes reviews, ratings, images, videos, social network content, etc. Over the last few years, syndicating CGC from one site to another has become increasingly important to our customers. When a user submits a television review on Samsung’s branded web site, it benefits Samsung, Target and the consumer when that review can be shown on Target’s retail web site.

Before syndicating CGC became important to Bazaarvoice, our content could be isolated for each individual client. There was never any need for us to consider the question of whether our clients had any overlap in their product catalogs. With syndication, it is now vital for us to be able to match products across all our clients’ catalogs.

The product matching problem is not unique to Bazaarvoice. Shopping comparison engines, travel aggregators and ticket brokers are among the other domains that require comprehensive and scalable automated matching. This is a common enough problem that there are even a number of companies trying to grow a business based on providing product matching as a service.

Overview

I have helped design and build product matching systems five different times across two different domains and will share some of what I have learned about the characteristics of the problem and its solutions. This article will not be about specific algorithms or technologies, but guidelines and requirements that are needed when considering how to design a technical solution. This will address not just the algorithmic challenges, but also the equally important issues with designing product matching as a system.

Blog posts are best kept to a modest length, and I have many more thoughts to share on this topic than would be polite to include in a single article, so I have divided this discussion into two parts. This blog post is about the characteristics that make this an interesting and challenging problem. The second posting will focus on guidelines to follow when designing a product matching system.

The focus here will be on retail product matching, since that is where my direct experience lies. I am sure that there are additional lessons to be learned in other domains, but I think many of these insights may be more broadly applicable.

“If at first you don’t succeed, you must be a programmer.”

Imprecise Requirements

Product matching is one of those problems that initially seems straightforward, but whose complexity is revealed only after having immersed oneself in it. Even the most enlightened product manager is not going to have the time to spell out, in detail, how to deal with every nuance that arises. Understanding problems in depth, and filling in the large number of unspecified items with reasonably good solutions, is why many software engineers are well paid. It is also what makes our jobs more interesting than most, since it allows us to invoke our problem-solving and design skills, which we generally prefer to rote execution of tasks.

I am not proposing that the engineers should fill in all the details without consulting the product managers and designers; I only mean that the engineers should expect the initial requirements will need to be refined. Ideally both will work to fill in the gaps, but the engineers should expect they will be the ones uncovering and explaining the gaps.

“I have yet to see any problem, however complicated, which, when you looked at it in the right way, did not become still more complicated.” — Poul Anderson

What is a “Product”?

Language is inherently imprecise. The same word can refer to completely different concepts at different times and yet it causes no confusion when the people conversing share the same contextual information. On the other hand, software engineers creating a data model have to explicitly enumerate, encode, and give names to all the concepts in the system. This is a fundamental difference between how the engineers and others view the problem and can be a source of frustration when engineers begin to inject questions into the requirements process such as: “What is a product?”. Those that are not accustomed to diving into the concepts underlying their use of a word can often feel like this is a time-wasting, philosophical discussion.

I’ve run across 8 distinct concepts where the word “product” has been used. The most basic difference lies between those “things” that you are trying to match and the “thing” you are using as the basis of the match. Suppose you get a data feed from Acme, Inc. which includes a thing called an “Acme Giant Rubber Band” and that you also crawled the Kwik-E-Mart web site, which yielded a thing called an “Acme Giant Green Rubber Band”. You then ask the question, are these the same “product”? Here we have an abstract notion of a specific rubber band in our mind and we are asking the question of whether these specific items from those two data sources match this concept.

Now let us also suppose that the “Acme Giant Rubber Band” item in the Acme data feed lists 6 different UPC values, which correspond to 6 different colors they manufacture for the product. This means that the “thing” in the feed is really a set of individual items, while the “Acme Giant Green Rubber Band” we saw on the Kwik-E-Mart web site is just a single item. These two items are similar, but not identical, product-related concepts.

With just this simple example, there are three different concepts floating around, yet “product” is often the word people will use for each of them. For most domains, when you really start to explore the data model that is required, more than three product-related concepts will likely be needed.

Software designers must carefully consider how many different “product” concepts they need to model and those helping to define the requirements should appreciate the importance of, and invest time in understanding the differences between the concepts. The importance of getting this data model correct from the start cannot be stressed enough.
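One way to keep the concepts from the rubber band example distinct is simply to give each its own type in the data model: the abstract product you match against, a feed item that may bundle several sellable variations, and a single crawled offer. The class and field names below are illustrative assumptions, not a recommended schema.

```python
# Sketch of keeping the "product" concepts distinct in the data model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CanonicalProduct:
    """The abstract product we match against, e.g. 'Acme Giant Rubber Band'."""
    canonical_id: str
    title: str

@dataclass
class FeedItem:
    """An entry in a manufacturer feed; may bundle several UPCs (one per color)."""
    source: str
    title: str
    upcs: List[str] = field(default_factory=list)

@dataclass
class CrawledItem:
    """A single concrete item as seen on one retailer page."""
    source_url: str
    title: str
    upc: Optional[str] = None
```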

“If names are not correct, then language is not in accord with the truth of things. If language is not in accord with the truth of things, then affairs cannot be carried out successfully.” — Confucius

Equality for All?

You should start with the most basic of questions: What is a “match”? My experience working on product matching in different domains and varying use cases is that there is not a single definition of product equality that applies everywhere. For those that have never given product matching much thought beyond their intuition, this might seem like an odd statement: two products are either the same or they are not, right? By way of example, here is an illustration of why different use cases require different notions of equality.

Suppose you are shopping for some 9-volt batteries and you are interested in seeing which brands tend to last longer based on direct user experience. You do a search, you navigate through some web site and then will likely need to make a choice at some point: are you looking to buy the 2-pack, the 4-pack or the 8-pack?

Having to make a quantity choice at this point may be premature, but you usually have to make this choice to get at the review content. However, the information you are looking for, and likely the bulk of the review content, is independent of the size of the box in which it is packaged. Requiring a quantity choice to get at review content may just be a bad user experience, but regardless of that, you certainly would not want to miss out on relevant review content simply because you had chosen the wrong quantity at this point in your research.

The conclusion here is that reviews posted to the web page for the 2-pack and reviews posted for the page of an 8-pack should probably not be fragmented. Therefore, for the purposes of review content, these two products, which would have different UPC and/or EAN values, should be considered equivalent.

Now suppose you have made your decision on the brand of battery to buy and now you are looking for the best price on an 8-pack. For a price comparison, you most definitely do not want to be comparing the 2-pack prices along with its 8-pack equivalent. Here, for price comparisons, these two products should definitely not be considered equivalent.

Understanding that product equivalence varies by context is not only important for designing algorithms and software systems, but has a lot of implications for creating better user experiences. For the companies looking to offer product matching as a service, the flexibility they offer in tailoring the definition of equality for their clients will be an important factor in how broadly applicable their solutions will be.
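The battery example can be expressed directly in code: equality becomes a function of the use case, ignoring pack quantity when pooling review content but not when comparing prices. The record fields here (brand, model, pack_quantity) are illustrative assumptions.

```python
# Sketch of use-case-dependent equality: pack size is ignored when pooling
# review content, but not when comparing prices.
def review_equivalent(a, b):
    return (a["brand"], a["model"]) == (b["brand"], b["model"])

def price_equivalent(a, b):
    return review_equivalent(a, b) and a["pack_quantity"] == b["pack_quantity"]

two_pack = {"brand": "AcmeCell", "model": "9V", "pack_quantity": 2}
eight_pack = {"brand": "AcmeCell", "model": "9V", "pack_quantity": 8}

print(review_equivalent(two_pack, eight_pack))  # True: pool their reviews
print(price_equivalent(two_pack, eight_pack))   # False: don't compare prices
```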

“It is more important to know where you are going than to get there quickly. Do not mistake activity for achievement.” — Isocrates

Imperfect Data Sources

If all the products you need to match have been assigned a globally unique identifier, such as a UPC, EAN or ISBN, and you have access to that data, and the data can be trusted, then product matching could be trivial. However, not all products get assigned such a number and for those that do, you do not always have access to those values. As discussed, it is also true that a “match” cannot always be defined simply by the equality of unique identifiers.

Those that crawl the web for product data tend to think that a structured data feed is the answer to getting better data. However, the companies that create product feeds vary greatly in their competency. Even when competent, they may build their feed from one system’s database, while more useful information may be stored in another system. Further, the competitive business landscape can result in companies wanting to deliberately suppress or obfuscate identifying information. You also have the ubiquitous issues of software bugs and data entry errors to contend with. All these realities add up to the fact that data feeds are not a panacea for product matching.

So while we have the web crawling folks wishing for feed data, we simultaneously have the feed processing folks wishing for crawled data to fill in their gaps. The first piece of advice for building a product matching system is to assume you will need to accept data from a variety of data sources. The ability to fill in data gaps with alternative sources will allow you to get the best of both worlds. This also means you may not only be trying to match products between different sites, but you may need to match products within the same site and merge the data from different sources to form a single view of a product at a site. I know of one very large shopping comparison site that did not design for this case and found themselves without the ability to support particular types of new business opportunities.
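A minimal sketch of that same-site merge: combine a feed record and a crawl record for one item into a single view, preferring the feed wherever it actually has data. The field-level precedence here is an assumption for illustration, not a recommendation.

```python
# Sketch of merging feed and crawl records for the same product at one site,
# preferring the feed but filling gaps from the crawl.
def merge_product_views(feed_record, crawl_record):
    merged = dict(crawl_record)          # start with whatever the crawl saw
    for key, value in feed_record.items():
        if value not in (None, "", []):  # feed wins wherever it has real data
            merged[key] = value
    merged["sources"] = ["feed", "crawl"]
    return merged
```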

“If you think the problem is bad now, just wait until we’ve solved it.” — Arthur Kasspe

Look Before You Leap

The specific algorithms and technologies one chooses for an automated product matching system should not be the primary focus. It is very tempting for us information scientists and engineers to dive right into the algorithmic and technical solutions. After all, this is predominantly what universities have trained us to focus on and, in some sense, is the more interesting part of the problem. You can choose almost any one of a host of algorithms and get some form of product matching fairly quickly. Depending on your specific quality requirements, a simple system may be enough, but if there are higher expectations for a matching system, you will need a lot more than just a fancy algorithm.

When more than simple matching is needed, it will not be the algorithm you use, but how you use the algorithm that will matter. This means really understanding the characteristics of the problem in the context of your domain. It is also important not to define the problem too narrowly. There are a bunch of seemingly tangential issues in product matching that are very easy to put into the bucket of “we can deal with that later”, but which turn out to be very hard to deal with after the fact. It is how well you handle all of these practical details that will most influence the overall success of the project.

Choosing a simplistic data model is an example of something that may seem like a good starting approach. However, it will wind up so deeply ingrained in the software that it will become nearly impossible to change. You wind up with either serious capability limitations or a series of kludges that both complicate your software and lead to unintended side effects. I learned this from experience.

“A doctor can bury his mistakes but an architect can only advise his clients to plant vines.” — Frank Lloyd Wright

Up Next

This posting covers some of the important characteristics of the product matching problem. In the sequel, there will be some more specific guidelines for building matching systems.