Telling Your Data To “Back Off!” (or How To Effectively Use Streams)

Foreword

Our Curations engineering team makes heavy use of serverless architecture. While this typically gives us the benefits of reduced costs, flexibility, and rapid development, it also requires us to ensure that our processes will run within the tight memory and lifecycle constraints of serverless instances.

In this article, I will describe an actual case where a scheduled job had started to fail, the discovery of the root cause, and the refactor that resolved the issue. I will assume you have at least a rudimentary knowledge of Node JS and Amazon Web Services.

Content Export

Months previously, we built a scheduled job using AWS Lambdas that would kick off an export of our social content every 24 hours. In a nutshell, its purpose was to:

  1. Query for social content documents stored in a MongoDB server.
  2. Transform each document to a row in CSV format.
  3. Write out the CSV rows to a file in S3.

The resulting CSV files were meant for ingestion into an analytics tool to measure web site conversion impact.

It all worked fine for months. Then recently, this job began to fail.

Drowning in Content

For your viewing pleasure, here is an excerpted, condensed section of the main code:

function runExport(query) {
  MongoClient.connect(MONGODB_URL)
    .then((db) => {
      let exportCount = 0;
      db
        .collection('content')
        .find(query)
        .forEach(
          // iterator function to perform on each document
          (content) => {
            // transform content doc to a csv row, then push to S3 writer
            s3.write(csvTransform(content));
            exportCount++;
          },
          // done function, no more documents
          () => {
            db.close();
            s3.close();
            logger.info(`documents exported: ${exportCount}`);
          }
        );
    });
}

Each document from the database query is transformed into a CSV row, then pushed to our S3 writer, which buffers all content until s3.close() is called. At that point, the CSV data is flushed out to the S3 destination bucket.

It was obvious from this implementation that heap usage would grow unbounded, as documents from the database would be loaded into resident memory as CSV data. As we aggregated content over the previous months, the data pushed into our export process began to generate frequent “out of heap memory” errors in our logs.

Memory Profile

To gain visibility into the memory usage, I wrote a quick and dirty MemProfiler class whose purpose was to sample memory usage using process.memoryUsage(), collecting the maximum heap used over the process lifetime.

class MemProfiler {
  constructor(collectInterval) {
    this.heapUsed = {
      min : null,
      max : null,
    };
    if (collectInterval) {
      this.start(collectInterval);
    }
  }

  start(collectInterval = 500) {
    this.interval = setInterval(() => this.sample(), collectInterval);
  }

  stop() {
    if (this.interval) {
      clearInterval(this.interval);
    }
  }

  reset() {
    this.stop();
    this.heapUsed = {
      min : null,
      max : null,
    };
  }

  sample() {
    const mem = process.memoryUsage();
    if (this.heapUsed.min === null || mem.heapUsed < this.heapUsed.min) {
      this.heapUsed.min = mem.heapUsed;
    }
    if (this.heapUsed.max === null || mem.heapUsed > this.heapUsed.max) {
      this.heapUsed.max = mem.heapUsed;
    }
  }
}

I then added the MemProfiler to the runExport code:

function runExport(query) {
  memProf.start();
  MongoClient.connect(MONGODB_URL)
    .then((db) => {
      let exportCount = 0;
      db
        .collection('content')
        .find(query)
        .forEach(
          (content) => {
            // transform content doc to a csv row, then push to S3 writer
            s3.write(csvTransform(content));
            exportCount++;
            memProf.sample();
          },
          () => {
            db.close();
            s3.close();
            logger.info(`documents exported: ${exportCount}`);
            console.log('heapUsed:');
            console.log(` max: ${memProf.heapUsed.max/1024/1024} MB`);
            console.log(` min: ${memProf.heapUsed.min/1024/1024} MB`);
          }
        );
    });
}

Running the export against a small set of 16,900 documents yielded this result:

documents exported: 16900
heapUsed:
max: 30.32764434814453 MB
min: 8.863059997558594 MB

Then running against a larger set of 34,000 documents:

documents exported: 34000
heapUsed:
max: 36.204505920410156 MB
min: 8.962921142578125 MB

Testing with additional documents increased the heap memory usage linearly as we expected.

Enter the Stream

The necessary code rewrite had one goal in mind: ensure the memory profile was constant, whether a total of one document was processed or a million. Since this was a batch process, we could trade off memory usage for processing time.

Let’s reiterate the primary operations:

  1. Fetch a document
  2. Transform the document to a CSV row
  3. Write the CSV row to S3

If I may present an analogy, we have widgets moving down the assembly line from one station to the next. The conveyor belt on which the widgets are transported is moving at a constant rate, which means the throughput at each station is constant. We could increase the speed of that conveyor belt, but only as fast as can be handled by the slowest station.

In Node JS, streams are the assembly line stations, and pipes are the conveyor belt. Let’s examine the three primary types of streams:

  1. Readable streams: This is typically an upstream data source that feeds data to a writable or transform stream. In our case, this is the MongoDB find query.
  2. Transform streams: This typically takes upstream data, performs a calculation or transformation operation on it and feeds it out downstream. In our case, this would take a MongoDB document and transform it to a CSV row.
  3. Writable streams: This is typically the terminating point of a data flow. In our case this is the writing of a CSV row to S3.

Readable Stream

The MongoDB Node JS driver’s find() function returns a Cursor object that extends the Node JS Readable stream. How convenient!

const streamContentReader = db.collection('content').find(query);
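
As an aside, the 2.x driver’s Cursor also exposes a stream() helper that can apply a transform to each document as it is read, which in simple cases can stand in for a separate Transform stream. A minimal sketch, assuming the 2.x driver and reusing csvTransform from the original code:

// Ask the cursor for a readable stream that emits CSV rows directly.
const streamCsvRowReader = db
  .collection('content')
  .find(query)
  .stream({ transform: csvTransform });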

Transform Stream

All we need to do is extend the Transform stream class, and override the _transform() method to transform the content to a CSV row, push it out the other end, and signal upstream that it is ready for the next content.

class CsvTransformer extends Transform {
  constructor(options) {
    super(options);
  }

  _transform(content, encoding, callback) {
    // write row
    const { _id, authoredAt, client, channel, mediaType, permalink, text, textLanguage } = content;
    const csvRow = `${_id},${authoredAt},${client},${channel},${mediaType},${permalink},${text},${textLanguage}\n`;
    this.push(csvRow);
    // signal upstream that we are ready to take more data
    callback();
  }
}

Writable Stream

Let’s purposely mock the S3 Writer for now. It does nothing here, but realistically, we would buffer up incoming data, then flush out the buffer in chunks over the network for best throughput.

class S3Writer extends Writable {
  constructor(options) {
    super(options);
  }

  _write(data, encoding, callback) {
    // TODO: probably use a fixed size buffer
    // that we write to and then flush out to S3 using
    // multipart data upload
    callback();
  }
}

Pipe ‘Em Up

We now create the three streams:

const streamContentReader = db.collection('content').find(query);
const streamCsvTransformer = new CsvTransformer({ objectMode: true });
const streamS3Writer = new S3Writer();

Streams alone do nothing. What makes them useful is connecting the output from one stream to the input of another. In stream terminology, this is called piping, and is accomplished using the stream’s pipe method:

streamContentReader.pipe(streamCsvTransformer).pipe(streamS3Writer)

Although the native pipe method is perfectly functional, I highly recommend using the pump library. In addition to being more readable by passing in an array of streams to pipe together, pump accepts a final callback that is invoked once the pipe has finished, or as soon as any stream in the pipeline closes or emits an error:

pump([ streamContentReader, streamCsvTransformer, streamS3Writer ], (err) => {
  if (err) {
    return logger.error(err);
  }
  logger.info('export complete');
});
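
If you prefer promises over callbacks, pump’s error-first callback signature also plays nicely with util.promisify (available in Node 8+); a minimal sketch:

const { promisify } = require('util');
const pumpAsync = promisify(require('pump'));

pumpAsync(streamContentReader, streamCsvTransformer, streamS3Writer)
  .then(() => logger.info('export complete'))
  .catch((err) => logger.error(err));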

CSV and S3 Write Streams

I purposely did not implement the S3Writer above, because npm is a treasure trove of solutions! The s3-write-stream library will take care of buffering, using multipart upload and handling retries, so we don’t have to get our hands too dirty. And wouldn’t you know it, there is also the csv-write-stream library that will generate properly escaped CSV rows.

As an exercise, try swapping s3-write-stream and csv-write-stream into the pipeline yourself. Trust me, once you get the hang of streaming, you will enjoy it!
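
To get you started, here is a rough sketch of how the pieces might fit together. The option shapes below are assumptions on my part (and my-export-bucket is a made-up bucket name), so check each package’s README before relying on them:

const pump = require('pump');
const csvWriter = require('csv-write-stream');

// Assumed configuration shape; consult the s3-write-stream README.
const s3WriteStream = require('s3-write-stream')({
  accessKeyId: process.env.AWS_ACCESS_KEY_ID,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
  Bucket: 'my-export-bucket' // hypothetical bucket name
});

pump([
  db.collection('content').find(query),          // readable: MongoDB cursor
  csvWriter({ headers: ['_id', 'authoredAt'] }), // transform: documents to escaped CSV rows
  s3WriteStream('exports/content.csv')           // writable: buffered multipart upload to S3
], (err) => {
  if (err) {
    return logger.error(err);
  }
  logger.info('export complete');
});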

Back Off!

Let’s touch back on the issue of memory usage.

What if a write stream becomes blocked for a period of time? Maybe we are writing data to a third-party service over HTTP and network congestion or service throttling is slowing the write rate. The readable stream will happily pump data in as fast as it can. Won’t that cause increased heap usage as all that data gets backed up and buffered?

The quick answer is “no”. In the implementation of the _write() method of your writable stream, when done with the data chunk (after writing to a file for example), call callback(). This signals to the incoming stream that it is ready to receive the next chunk. You can also provide a highWaterMark to the writable stream constructor to give a specific number of objects to buffer before pausing the incoming stream. This is all handled internally by the Node JS streams implementation. No more unbounded buffering!
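
As a concrete illustration, here is a minimal sketch of a deliberately slow writable stream. Once more than highWaterMark objects are queued, the piping machinery automatically pauses the upstream readable until the writer catches up (the setTimeout is just a stand-in for a slow network call):

const { Writable } = require('stream');

const slowWriter = new Writable({
  objectMode: true,
  highWaterMark: 16, // buffer at most 16 objects before applying backpressure
  write(chunk, encoding, callback) {
    // Simulate a slow downstream service; invoking callback() later
    // signals that we are ready for the next chunk.
    setTimeout(callback, 100);
  }
});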

For a deep dive into the concept of backpressure control, read this great article on Backpressuring in Streams.

Stream ‘Em If You Got ’em!

Our team is now in the process of utilizing Node JS streams whenever we read in volumes of data, process that data, then write out the results. We are now seeing great gains in both stability and maintainability!

References

Node JS Streams API: https://nodejs.org/api/stream.html

Backpressuring in Streams: https://nodejs.org/en/docs/guides/backpressuring-in-streams/

MongoDB Cursor Stream: http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html

Pump Module: https://www.npmjs.com/package/pump

CSV Write Stream Module: https://www.npmjs.com/package/csv-write-stream

S3 Write Stream Module: https://www.npmjs.com/package/s3-write-stream

Context and Higher Order Components: Two Immediately Applicable Topics from the Advanced React Workshop

Thanks to Bazaarvoice I recently attended an “Advanced React Workshop” put on by React Training and taught by Ryan Florence, one of the creators of React Router.

Ryan was an engaging teacher and the workshop was filled with memorable quotes. Here are some highlights:

  • The great conundrum of accessibility is that learning it is not accessible.
  • Like forceUpdate, never use context alone. Always bring a friend with you.
  • Good abstractions can always recreate other abstractions.
  • Is that a Justin Bieber song?

The training took place over two days in East Austin and I learned a ton of stuff that I was able to immediately put to use in my day-to-day projects. Let’s dive right in.

Context

In addition to the familiar props and state, React components also have access to a magical thing called context.

Context is for when your application has a hierarchical structure and needs to pass data from a parent component down to another level of child component (child, grandchild, great-grandchild, etc.). As the React documentation puts it,

“In some cases, you want to pass data through the component tree without having to pass the props down manually at every level. You can do this directly in React with the powerful “context” API.”

Before we move on to the context API itself, let’s make it a little more clear why you would want to use it.

Consider the previous example of a parent node with a child, grandchild, and great grandchild. In order for the parent to pass a property (via props) down to the great-grandchild, it would first need to pass the property through child and grandchild. Note that there is no other reason for child and grandchild to care about that property.

As you can imagine, the desire to eliminate this kind of cruft increases as your project grows in size and complexity.

Context to the Rescue

context helps by passing properties through the tree automatically, so any component in the subtree can access them.

When Your Component is Providing Context

Components (like parent in the example above) declare a static childContextTypes object that defines the properties they will put on context. childContextTypes is very similar to the propTypes object that we’re already familiar with.

Additionally, components that are putting properties on context must actually provide their values, and do so via the getChildContext component member function. This function should return an object with the same structure as defined in childContextTypes.

For example:

// Parent.js

static childContextTypes = {
  rating: React.PropTypes.string
}

// Just like a React lifecycle method.
// Don't worry about when this actually gets called.
getChildContext () {
  return {
    rating: '4.5'
  }
}

When Your Component is Consuming Context

To continue the example, your great-grandchild component now must consume this property. It can do so by defining a static contextTypes object that describes the context properties it expects, and can then pull the properties off of this.context directly.

// GreatGrandchild.js

static contextTypes = {
  rating: React.PropTypes.string
}

render () {
  const { rating } = this.context;
  // Do stuff with rating.
}

Common Use Cases

Great use cases for context include localization, theming, etc. Basically any time you have things you don’t want to have to pass down through a bunch of components that don’t need to know about them. In fact, React Redux uses context to make its store available to all of your project’s components – imagine how cumbersome it would be to pass the store to every single component!
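
To illustrate, here is a simplified sketch of the Provider pattern react-redux uses under the hood (the real implementation does more, so treat this as illustrative only):

class Provider extends React.Component {
  static childContextTypes = {
    store: React.PropTypes.object
  }

  getChildContext () {
    return { store: this.props.store };
  }

  render () {
    // Render the single child; every component in this subtree
    // can now declare `store` in its contextTypes.
    return React.Children.only(this.props.children);
  }
}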

Common Pitfalls

Name collisions are a common pitfall. Consider a component whose parent and grandparent components both provide a rating property on context. The suggested workaround in this case is to name your context after your component:

class ParentComponent extends React.Component {
  static childContextTypes = {
    parentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    })
  }
}

class GrandParentComponent extends React.Component {
  static childContextTypes = {
    grandParentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    })
  }
}

Then your child component could differentiate the two: this.context.parentComponent.rating versus this.context.grandParentComponent.rating.
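
In code, the consuming component declares both namespaces, following the same pattern as above:

class ChildComponent extends React.Component {
  static contextTypes = {
    parentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    }),
    grandParentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    })
  }

  render () {
    // The namespaces keep the two ratings from colliding.
    const parentRating = this.context.parentComponent.rating;
    const grandParentRating = this.context.grandParentComponent.rating;
    // Do stuff with both ratings.
  }
}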

User Beware!

A post about context would not be complete without pointing out that the very first section in React’s documentation about context is titled “Why Not To Use Context”, and it includes dire warnings like:
“If you want your application to be stable, don’t use context.”, and “It [Context] is an experimental API and it is likely to break in future releases of React.”

Ryan did point out to us however, that even though the documentation warns that the Context API is likely to change, it is one of the only React APIs that has stayed constant 😛

Higher Order Components

In software, a higher order function is a function that takes another function as an argument, returns one, or both. Consider:

function add (x) {
  return function (y) {
    return x + y;
  }
}

const addBy2 = add(2);

addBy2(10); // 12

Similarly, a Higher Order Component is a function that returns a (React) Component.

/**
 * A higher order component that facilitates the creation of
 * container components that simply wrap divs around their children.
 *
 * @param  {String} displayName - The displayName of the returned component.
 * @param  {String} className   - The class name to use for the wrapper div.
 * @return {Component}          - A React component.
 */
export function build(displayName, className) {
  const ContainerComponent = (props) => (
    <div className={className}>{ props.children }</div>
  )

  ContainerComponent.displayName = displayName;
  ContainerComponent.propTypes = {
    children: React.PropTypes.node
  };

  return ContainerComponent; 
}

// Usage:
export const WidgetHeader = build('WidgetHeader', 'bv_header');
export const WidgetBody = build('WidgetBody', 'bv_body');
export const WidgetFooter = build('WidgetFooter', 'bv_footer');

Higher Order Components are great for reducing redundancy across components, and can also be used to “wrap up” a lot of complexity and encapsulate it, which allows the rest of your components to stay simple and straightforward. If you were familiar with the old React mixins, Higher Order Components replace a lot of what mixins did.
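
Tying the two topics together, here is a sketch of an HOC that encapsulates the context plumbing so consumers receive the value as an ordinary prop. Note that withRating is a hypothetical helper of my own, not something from the workshop:

function withRating (WrappedComponent) {
  // Stateless components receive context as their second argument
  // once contextTypes is declared on them.
  const WithRating = (props, context) => (
    <WrappedComponent {...props} rating={context.rating} />
  );

  WithRating.contextTypes = {
    rating: React.PropTypes.string
  };

  return WithRating;
}

// Usage: RatingBadge never touches context directly.
const RatingBadge = withRating(({ rating }) => <span>{rating} stars</span>);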

That’s All, Folks!

Context and Higher Order Components are just two topics I was able to immediately integrate into my React projects here at Bazaarvoice. Hopefully you will find them useful too!

Three Takeaways from CSSConf 2016

This year Bazaarvoice sponsored CSSConf 2016 in beautiful Boston, MA, USA and I was able to attend!
Here are my three top takeaways from CSSConf 2016:

Flexy Flexy Flexbox

A little over a year ago, our application team wasn’t sure how “stable” Flexbox or its spec were: there was already an old syntax, a new syntax, and a weird IE10 “tweener” syntax.

The layout advantages Flexbox brought were strong enough (*cough* vertical centering) that we decided to move forward with it and prefix all the things. Now browser support is so good that if you can drop IE8 and work around some known IE11 bugs, there is no reason not to use Flexbox in your designs right now.

A great reference I keep going back to for Flexbox is this css-tricks guide. Here are some other tips and tricks from the conference:

  • Flexbox is now available in Bootstrap 4
  • Use CSS Grid (when it becomes available) for major page layout, and Flexbox for UI elements
  • For mobile / small screens: add a media query and set the flex-direction to column to stack your cells instead
  • Do as much as you can on the container to keep your code DRY (Don’t Repeat Yourself)
  • We can finally get rid of that pesky col-2, col-8, col-crapIAddedWrong grid system!

For more information, I recommend CSS4 Grid: True Layout Finally Arrives by Jen Kramer and It’s Time To Ditch The Grid System by Emily Hayman.

Stop Thinking In Pixels

The talk Stop Thinking In Pixels by Keith Grant was particularly enlightening.
The basic premise was not to micromanage your CSS:

Without fully understanding what CSS is doing for us, we try to push through it to control exactly what is going on in the browser.

Driving this point home, Keith recommended to stop thinking in pixels because

The pixels don’t matter. Let the browser do it.

You should instead be thinking in terms of the em and rem. Tools that simply convert px to em aren’t the answer either — you’re still thinking in terms of pixels and therefore missing out on some important benefits. Instead, learn from something like type scale and approach measurements with a fresh perspective.

I recommend watching the talk in full, but a quick cheatsheet follows:

Property                               Recommended Unit
font-size                              rem
padding, margin, border-radius, etc.   em
border-width                           px

When in doubt, use em.

To summarize,

Ems are the most powerful when you fully embrace them.

Apps vs Documents

In this day and age we are all used to thinking in terms of “apps”. But the trinity of HTML, CSS, and JS was not conceived in this day and age. Two great quotes I wrote down from Component-Based Style Reuse by Pete Hunt are

CSS is great for documents, maybe not 2016 Apps

and

If you sat down and created styling in 2016, you would not come up with CSS

Our newest applications are written in React, which encourages developers to think of things in terms of components — pieces of UI that are reusable in different contexts. The Cascading part of CSS interferes with that, however: depending on the context your component is dropped into, it may look drastically different across usages. When that is not what you want, Pete’s ideas center around reusing components, not CSS classes.
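
As a rough sketch of that idea (my own illustration, not Pete’s code), the styling lives with the component, and reuse happens by rendering the component rather than sharing a class name:

// Styles are plain JavaScript, scoped to the component that owns them.
const buttonStyle = {
  padding: '0.5em 1em',
  borderRadius: '0.25em'
};

const Button = ({ children, ...rest }) => (
  <button {...rest} style={buttonStyle}>{children}</button>
);

// Anywhere we might have applied a `.button` class, we render <Button>
// instead, so the look is reused without any cascade to fight.
const SubmitReview = () => <Button type="submit">Submit review</Button>;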

As you can imagine, this idea is largely controversial in a conference with a name like CSSConf, but I will continue to keep my eye on it. Pete’s thought leadership on this topic inspires me to challenge norms and dare to envision things differently. After all, if we’re constantly fighting with our tool (CSS), that tool may not be right for the job.

Thanks for reading! For a full list of talks and slides from the conference, check out https://2016.cssconf.com/#videos.

BVIO 2015 Summary and Presentations

Every year Bazaarvoice R&D throws BVIO, an internal technical conference followed by a two-day hackathon. These conferences are an opportunity for us to focus on unlocking the power of our network, data, APIs, and platforms, as well as to have some fun in the process. We invite keynote speakers from within BV, from companies who use our data in inspiring ways, and from companies who are successfully using big data to solve cool problems. After a full day of learning we engage in an intense, two-day hackathon to create new applications, visualizations, and insights into our extensive data.

Continue reading for pictures of the event and videos of the presentations.

This year we held the conference at the palatial Omni Barton Creek Resort in one of their well-appointed ballrooms.

Participants arrived around 9am (some of us a little later). After breakfast, provided by Bazaarvoice, we got started with the speakers followed by lunch, also provided by Bazaarvoice, followed by more speakers.

After the speakers came a “pitchfest” during which our Product team presented hackathon ideas and participants started forming teams and brainstorming.

Finally it was time for 48 hours of hacking, eating, and gaming (not necessarily in that order) culminating in project presentations and prizes.

Presentations

Sephora: Consumer Targeted Content

Venkat Gopalan
Director of Architecture & Devops @ Sephora.com

Venkat presented on the work Sephora is doing around serving relevant, targeted content to their consumers in both the mobile and in-store space. It was a fascinating talk, and we love to see how our clients are innovating with us. Unfortunately, due to technical difficulties we don’t have a recording 🙁

Philosophy & Design of The BV System of Record

John Roesler & Fahd Siddiqui
Bazaarvoice Engineers

This talk was about the overarching design of Bazaarvoice’s innovative data architecture. According to the speakers, there are aspects of it that may seem unexpected at first glance (especially to those not coming from a big data background) but are actually surprisingly powerful. The first innovation is the separation of storage and query, and the second is choosing a knowledge-base-inspired data model. By making these two choices, we guarantee that our data infrastructure will be robust and durable.

Realtime Bidding: Predicting the future, 10,000 times per second

Ian Clarke
Co-Founder and CTO at OneSpot

Ian has built and manages a team of world-class software engineers and data scientists at OneSpot. In his presentation he discusses how he applied machine learning and game theory to architect a sophisticated realtime bidding engine for OneSpot capable of predicting the behavior of tens of thousands of people per second.

New Amazon Machine Learning and Lambda architectures

Jeff Nun
Amazon Solutions Architect

In his presentation Jeff discusses the history of Amazon Machine Learning and the Lambda architecture, how Amazon uses them, and how you can use them too. This isn’t just slides; Jeff walks us through the AWS UI for building and training a model.

Thanks to Sharon Hasting, Dan Heberden, and the presenters for contributing to this post.

Output from bv.io

Looks like everyone had a blast at bv.io this year! Thank yous go out to the conference speakers and hackathon participants for making this year outstanding.

BV I/O: Peter Wang – Architecting for Data

Every year Bazaarvoice holds an internal technical conference for our engineers. Each conference has a theme and as a part of these conferences we invite noted experts in fields related to the theme to give presentations. The latest conference was themed “unlocking the power of our data.” You can read more about it here.

In this presentation Peter Wang, co-founder and president of Continuum Analytics, discusses data analysis, the challenges presented by big data, and opportunities technology provides to overcome those challenges. He also discusses the importance of performance and visualization as well as advances the concept of “engineering on principle” which he demonstrates by discussing the design of the A-10 Thunderbolt and SAGE computerized command and control center for United States air defense. Peter ends his talk by discussing the Python programming language and its suitability for data analysis tasks. The full talk is below.