
Telling Your Data To “Back Off!” (or How To Effectively Use Streams)

Foreword

Our Curations engineering team makes heavy use of serverless architecture. While this typically gives us the benefit of reduced costs, flexibility, and rapid development, it also requires us to ensure that our processes will run within the tight memory and lifecycle constraints of serverless instances.

In this article, I will describe an actual case in which a scheduled job started to fail, how we discovered the root cause, and the refactor that resolved the issue. I will assume you have at least a rudimentary knowledge of Node JS and Amazon Web Services.

Content Export

Months previously, we built a scheduled job using AWS Lambdas that would kick off an export of our social content every 24 hours. In a nutshell, its purpose was to:

  1. Query for social content documents stored in a MongoDB server.
  2. Transform each document to a row in CSV format.
  3. Write out the CSV rows to a file in S3.

The resulting CSV files were meant for ingestion into an analytics tool to measure web site conversion impact.

It all worked fine for months. Then recently, this job began to fail.

Drowning in Content

For your viewing pleasure, here is an excerpted, condensed section of the main code:

function runExport(query) {
  MongoClient.connect(MONGODB_URL)
    .then((db) => {
      let exportCount = 0;
      db
        .collection('content')
        .find(query)
        .forEach(
          // iterator function to perform on each document
          (content) => {
            // transform content doc to a csv row, then push to S3 writer
            s3.write(csvTransform(content));
            exportCount++;
          },
          // done function, no more documents
          () => {
            db.close();
            s3.close();
            logger.info(`documents exported: ${exportCount}`);
          }
        );
    });
}

Each document returned by the database query is transformed into a CSV row, then pushed to our S3 writer, which buffers all content in memory until s3.close() is called. At that point, the buffered CSV data is flushed out to the destination S3 bucket.

It was obvious from this implementation that heap usage would grow unbounded: every document returned by the query is held in resident memory as CSV data until the final flush. As the content we aggregated grew over the preceding months, the export process eventually exceeded its memory limit, and frequent “out of heap memory” errors began to appear in our logs.

Memory Profile

To gain visibility into the memory usage, I wrote a quick and dirty MemProfiler class whose purpose was to sample memory usage using process.memoryUsage(), collecting the maximum heap used over the process lifetime.

class MemProfiler {
  constructor(collectInterval) {
    this.heapUsed = {
      min : null,
      max : null,
    };
    if (collectInterval) {
      this.start(collectInterval);
    }
  }

  start(collectInterval = 500) {
    this.interval = setInterval(() => this.sample(), collectInterval);
  }

  stop() {
    if (this.interval) {
      clearInterval(this.interval);
    }
  }

  reset() {
    this.stop();
    this.heapUsed = {
      min : null,
      max : null,
    };
  }

  sample() {
    const mem = process.memoryUsage();
    if (this.heapUsed.min === null || mem.heapUsed < this.heapUsed.min) {
      this.heapUsed.min = mem.heapUsed;
    }
    if (this.heapUsed.max === null || mem.heapUsed > this.heapUsed.max) {
      this.heapUsed.max = mem.heapUsed;
    }
  }
}

I then added the MemProfiler to the runExport code:

function runExport(query) {
  memProf.start();
  MongoClient.connect(MONGODB_URL)
    .then((db) => {
      let exportCount = 0;
      db
        .collection('content')
        .find(query)
        .forEach(
          (content) => {
            // transform content doc to a csv row, then push to S3 writer
            s3.write(csvTransform(content));
            exportCount++;
            memProf.sample();
          },
          () => {
            db.close();
            s3.close();
            logger.info(`documents exported: ${exportCount}`);
            console.log('heapUsed:');
            console.log(` max: ${memProf.heapUsed.max/1024/1024} MB`);
            console.log(` min: ${memProf.heapUsed.min/1024/1024} MB`);
          }
        );
    });
}

Running the export against a small set of 16,900 documents yielded this result:

documents exported: 16900
heapUsed:
max: 30.32764434814453 MB
min: 8.863059997558594 MB

Then running against a larger set of 34,000 documents:

documents exported: 34000
heapUsed:
max: 36.204505920410156 MB
min: 8.962921142578125 MB

Testing with additional documents increased the heap memory usage linearly as we expected.

Enter the Stream

The necessary code rewrite had one goal in mind: ensure the memory profile was constant, whether a total of one document was processed or a million. Since this was a batch process, we could trade off memory usage for processing time.

Let’s reiterate the primary operations:

  1. Fetch a document
  2. Transform the document to a CSV row
  3. Write the CSV row to S3

If I may present an analogy, we have widgets moving down the assembly line from one station to the next. The conveyor belt on which the widgets are transported is moving at a constant rate, which means the throughput at each station is constant. We could increase the speed of that conveyor belt, but only as fast as can be handled by the slowest station.

In Node JS, streams are the assembly line stations, and pipes are the conveyor belt. Let’s examine the three primary types of streams:

  1. Readable streams: This is typically an upstream data source that feeds data to a writable or transform stream. In our case, this is the MongoDB find query.
  2. Transform streams: This typically takes upstream data, performs a calculation or transformation operation on it and feeds it out downstream. In our case, this would take a MongoDB document and transform it to a CSV row.
  3. Writable streams: This is typically the terminating point of a data flow. In our case this is the writing of a CSV row to S3.

Readable Stream

The MongoDB Node JS driver’s find() function returns a Cursor object that extends the Node JS Readable stream, so our data source is already a stream. How convenient!

const streamContentReader = db.collection('content').find(query);

Transform Stream

All we need to do is extend the Transform stream class, and override the _transform() method to transform the content to a CSV row, push it out the other end, and signal upstream that it is ready for the next content.

const { Transform } = require('stream');

class CsvTransformer extends Transform {
  constructor(options) {
    super(options);
  }

  _transform(content, encoding, callback) {
    // write row
    const { _id, authoredAt, client, channel, mediaType, permalink, text, textLanguage } = content;
    const csvRow = `${_id},${authoredAt},${client},${channel},${mediaType},${permalink},${text},${textLanguage}\n`;
    this.push(csvRow);
    // signal upstream that we are ready to take more data
    callback();
  }
}

Writable Stream

Let’s purposely mock the S3 Writer for now. It does nothing here, but realistically, we would buffer up incoming data, then flush out the buffer in chunks over the network for best throughput.

const { Writable } = require('stream');

class S3Writer extends Writable {
  constructor(options) {
    super(options);
  }

  _write(data, encoding, callback) {
    // TODO: probably use a fixed size buffer
    // that we write to and then flush out to S3 using
    // multipart data upload
    callback();
  }
}

Pipe ‘Em Up

We now create the three streams:

const streamContentReader = db.collection('content').find(query);
const streamCsvTransformer = new CsvTransformer({ objectMode: true });
const streamS3Writer = new S3Writer();

Streams alone do nothing. What makes them useful is connecting the output from one stream to the input of another. In stream terminology, this is called piping, and is accomplished using the stream’s pipe method:

streamContentReader.pipe(streamCsvTransformer).pipe(streamS3Writer)

Although the native pipe method is perfectly functional, I highly recommend using the pump library. Besides reading more clearly, since every stream in the pipeline is passed to a single call, pump invokes a callback once the pipeline has finished or any stream in it emits a close or error event, and it takes care of destroying the remaining streams on failure:

pump(streamContentReader, streamCsvTransformer, streamS3Writer, (err) => {
  if (err) {
    return console.error(err);
  }
  console.log(`documents exported: ${exportCount}`);
});

CSV and S3 Write Streams

I purposely did not implement the S3Writer above, because npm is a treasure trove of solutions! The s3-write-stream library will take care of buffering, using multipart upload and handling retries, so we don’t have to get our hands too dirty. And wouldn’t you know it, there is also the csv-write-stream library that will generate properly escaped CSV rows.

As an exercise, you may want to try swapping s3-write-stream and csv-write-stream into the pipeline yourself. Trust me, once you get the hang of streaming, you will enjoy it!
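If you want a head start, here is a rough sketch of how csv-write-stream could replace our hand-rolled CsvTransformer while keeping the mock S3Writer from above. Check each library’s README for the exact options; the headers listed are just the fields from our earlier transform:

const pump = require('pump');
const csvWriter = require('csv-write-stream');

// csv-write-stream gives us an object-mode stream that turns incoming
// objects into properly escaped CSV rows.
// Depending on how the library treats extra fields, you may want to map
// each document down to just these fields first.
const streamCsvTransformer = csvWriter({
  headers: ['_id', 'authoredAt', 'client', 'channel', 'mediaType', 'permalink', 'text', 'textLanguage']
});

const streamS3Writer = new S3Writer();

pump(streamContentReader, streamCsvTransformer, streamS3Writer, (err) => {
  if (err) {
    return console.error(err);
  }
  console.log('export complete');
});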

Back Off!

Let’s touch back on the issue of memory usage.

What if a write stream becomes blocked for a period of time? Maybe we are writing data to a third-party service over HTTP and network congestion or service throttling is slowing the write rate. The readable stream will happily pump data in as fast as it can. Won’t that cause increased heap usage as all that data gets backed up and buffered?

The quick answer is “no”. In the implementation of the _write() method of your writable stream, call callback() only once you are done with the data chunk (after writing it to a file, for example). This signals to the incoming stream that you are ready to receive the next chunk. You can also provide a highWaterMark to the writable stream constructor to specify how much data to buffer before the incoming stream is paused (a number of objects in object mode, bytes otherwise). This is all handled internally by the Node JS streams implementation. No more unbounded buffering!
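To make that concrete, here is a minimal sketch (not our production writer) of a deliberately slow object-mode writable. With a highWaterMark of 10, at most 10 unacknowledged objects sit in the buffer before the piping machinery pauses the upstream readable:

const { Writable } = require('stream');

class SlowWriter extends Writable {
  constructor() {
    // Buffer at most 10 objects before backpressure kicks in.
    super({ objectMode: true, highWaterMark: 10 });
  }

  _write(chunk, encoding, callback) {
    // Simulate a slow downstream service; the next chunk is only
    // delivered after callback() fires.
    setTimeout(() => callback(), 100);
  }
}

// Piping into SlowWriter now advances at the writer's pace instead of
// buffering every document in memory:
// streamContentReader.pipe(streamCsvTransformer).pipe(new SlowWriter());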

For a deep dive into the concept of backpressure control, read this great article on Backpressuring in Streams.

Stream ‘Em If You Got ’em!

Our team is now in the process of utilizing Node JS streams whenever we read in volumes of data, process that data, then write out the results. We are now seeing great gains in both stability and maintainability!

References

Node JS Streams API: https://nodejs.org/api/stream.html

Backpressuring in Streams: https://nodejs.org/en/docs/guides/backpressuring-in-streams/

MongoDB Cursor Stream: http://mongodb.github.io/node-mongodb-native/2.2/api/Cursor.html

Pump Module: https://www.npmjs.com/package/pump

CSV Write Stream Module: https://www.npmjs.com/package/csv-write-stream

S3 Write Stream Module: https://www.npmjs.com/package/s3-write-stream

 

Context and Higher Order Components: Two Immediately Applicable Topics from the Advanced React Workshop

Thanks to Bazaarvoice I recently attended an “Advanced React Workshop” put on by React Training and taught by Ryan Florence, one of the creators of React Router.

Ryan was an engaging teacher and the workshop was filled with memorable quotes. Here are some highlights:

  • The great conundrum of accessibility is that learning it is not accessible.
  • Like forceUpdate, never use context alone. Always bring a friend with you.
  • Good abstractions can always recreate other abstractions.
  • Is that a Justin Bieber song?

The training took place over two days in East Austin and I learned a ton of stuff that I was able to immediately put to use in my day-to-day projects. Let’s dive right in.

Context

In addition to the familiar props and state, React components also have access to a magical thing called context.

Context is for when your application has a hierarchical structure and needs to pass data from a parent component down to another level of child component (child, grandchild, great grandchild, etc.). As the React documentation puts it,

“In some cases, you want to pass data through the component tree without having to pass the props down manually at every level. You can do this directly in React with the powerful “context” API.”

Before we move on to the context API itself, let’s make it a little more clear why you would want to use it.

Consider the previous example of a parent node with a child, grandchild, and great grandchild. In order for the parent to pass a property (via props) down to the great-grandchild, it would first need to pass the property through child and grandchild. Note that there is no other reason for child and grandchild to care about that property.

As you can imagine, the desire to eliminate this kind of cruft increases as your project grows in size and complexity.

Context to the Rescue

context helps by passing properties down through the tree automatically, so any component in the subtree can access them.

When Your Component is Providing Context

Components (like the parent in the example above) declare a static childContextTypes object that defines the properties they will put on context. childContextTypes is very similar to the propTypes object that we’re already familiar with.

Additionally, components that are putting properties on context must actually provide their values, and do so via the getChildContext component member function. This function should return an object with the same structure as defined in childContextTypes.

For example:

// Parent.js

class Parent extends React.Component {
  static childContextTypes = {
    rating: React.PropTypes.string
  }

  // Just like a React lifecycle method.
  // Don't worry about when this actually gets called.
  getChildContext () {
    return {
      rating: '4.5'
    }
  }

  render () {
    return <div>{this.props.children}</div>;
  }
}

When Your Component is Consuming Context

To continue the example, your great-grandchild component now must consume this property. It can do so by defining a static contextTypes object that describes the context properties it expects, and can then pull the properties off of this.context directly.

// GreatGrandchild.js

class GreatGrandchild extends React.Component {
  static contextTypes = {
    rating: React.PropTypes.string
  }

  render () {
    const { rating } = this.context;
    // Do stuff with rating.
    return <span>{rating}</span>;
  }
}

Common Use Cases

Great use cases for context include localization, theming, etc. Basically any time you have things you don’t want to have to pass down through a bunch of components that don’t need to know about them. In fact, React Redux uses context to make its store available to all of your project’s components – imagine how cumbersome it would be to pass the store to every single component!

Common Pitfalls

Name collisions are a common pitfall. Consider the case of a component whose parent and grandparent components both provide a rating property on context. The suggested workaround is to namespace your context properties after the component that provides them:

class ParentComponent extends React.Component {
  static childContextTypes = {
    parentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    })
  }
}

class GrandParentComponent extends React.Component {
  static childContextTypes = {
    grandParentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    })
  }
}

Then your child component could differentiate the two: this.context.parentComponent.rating versus this.context.grandParentComponent.rating.
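To round out the example, the consuming component declares both namespaced shapes in its contextTypes (a minimal sketch using the same hypothetical component names):

class ChildComponent extends React.Component {
  static contextTypes = {
    parentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    }),
    grandParentComponent: React.PropTypes.shape({
      rating: React.PropTypes.string
    })
  }

  render () {
    // Both ratings are now unambiguously available.
    const parentRating = this.context.parentComponent.rating;
    const grandParentRating = this.context.grandParentComponent.rating;
    return <div>{parentRating} / {grandParentRating}</div>;
  }
}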

User Beware!

A post about context would not be complete without pointing out that the very first section in React’s documentation about context is titled “Why Not To Use Context”, and it includes dire warnings like:
“If you want your application to be stable, don’t use context.”, and “It [Context] is an experimental API and it is likely to break in future releases of React.”

Ryan did point out to us, however, that even though the documentation warns that the Context API is likely to change, it is one of the few React APIs that has stayed constant 😛

Higher Order Components

In software, a higher order function is a function that takes a function as an argument, returns a function, or both. Consider this example, which returns a function:

function add (x) {
  return function (y) {
    return x + y;
  }
}

const addBy2 = add(2);

addBy2(10); // 12

Similarly, a Higher Order Component is a function that returns a (React) Component.

/**
 * A higher order component that facilitates the creation of
 * container components that simply wrap divs around their children.
 *
 * @param  {String} displayName - The displayName of the returned component.
 * @param  {String} className   - The class name to use for the wrapper div.
 * @return {Component}          - A React component.
 */
export function build(displayName, className) {
  const ContainerComponent = (props) => (
    <div className={className}>{ props.children }</div>
  )

  ContainerComponent.displayName = displayName;
  ContainerComponent.propTypes = {
    children: React.PropTypes.node
  };

  return ContainerComponent; 
}

// Usage:
export const WidgetHeader = build('WidgetHeader', 'bv_header');
export const WidgetBody = build('WidgetBody', 'bv_body');
export const WidgetFooter = build('WidgetFooter', 'bv_footer');

Higher Order Components are great for reducing redundancy across components, and can also be used to “wrap up” a lot of complexity and encapsulate it, which allows the rest of your components to stay simple and straightforward. If you were familiar with the old React mixins, Higher Order Components replace a lot of what mixins did.
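As a quick, hypothetical illustration of that encapsulation, the sketch below wraps any component so that it receives the rating value from context as an ordinary prop; the wrapped component never needs to know that context is involved:

function withRating(WrappedComponent) {
  // Stateless functional components receive context as their second
  // argument once contextTypes is declared on them.
  const WithRating = (props, context) => (
    <WrappedComponent {...props} rating={context.rating} />
  );

  WithRating.displayName = `WithRating(${WrappedComponent.displayName || WrappedComponent.name})`;
  WithRating.contextTypes = {
    rating: React.PropTypes.string
  };

  return WithRating;
}

// Usage: RatingBadge only ever deals in props.
const RatingBadge = ({ rating }) => <span>{rating} stars</span>;
export const RatingBadgeWithContext = withRating(RatingBadge);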

That’s All, Folks!

Context and Higher Order Components are just two topics I was able to immediately integrate into my React projects here at Bazaarvoice. Hopefully you will find them useful too!

My Hacktoberfest

This past October I participated in an awesome Open Source event called “Hacktoberfest”, sponsored by Digital Ocean and GitHub.

Hacktoberfest is a month-long celebration of Open Source where developers are encouraged to contribute to the community. Participation is easy:

  1. Pull requests can be made in any GitHub-hosted repositories/projects.
  2. A contribution can be anything—fixing bugs, creating new features, or updating and writing documentation.

Further, if you opened four pull requests in Open Source repositories between October 1st and October 31st you would win a cool Hacktoberfest t-shirt and other swag.

Maintainers of Open Source projects (including some here at BV) were asked to tag open issues with “Hacktoberfest” if they wanted help with those issues during the event. GitHub provides the ability to search issues by tag, so it was really easy to find cool projects and issues to work on.

I personally started off small, helping one team track down a bug with their JSON files, and another finish a database for movies used by their front-end application (similar to IMDB).

Next I found a Hacktoberfest issue in the New York Times’s kyt repository. Kyt is a build, test and development tool for advanced JavaScript apps. I ended up helping them fix a bug in one of their setup scripts.

Then came my Hacktoberfest pièce de résistance.

In my 20% time here at Bazaarvoice I had been playing around with browser extensions / add-ons, specifically in an effort to make implementing our products easier for our clients. So when I saw that Mozilla and the Mozilla Developer Network (MDN) were asking for someone to create a browser extension for them, I was immediately interested.

They noticed that a popular type of extension being authored was what they were calling a “replacement” add-on, something that would replace words or phrases in a web page with alternate words, or images, etc.

In their Web Extensions Examples repository, they were looking for an example of such an add-on that they could turn into a “How to Write your First Add-On” tutorial. Thus their two main requirements were:

  1. The code must be simple and easy to follow for beginners.
  2. The code must be performant because it would likely be copied a lot.
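For the curious, a “replacement” add-on boils down to a content script that walks the page’s text nodes and swaps matching words. The sketch below only shows the general idea; it is not the actual Emoji Substitution source, and the substitution map is made up:

// content-script.js -- a minimal word-replacement sketch
var substitutions = {
  'happy': '😀',
  'taco': '🌮'
};

var walker = document.createTreeWalker(document.body, NodeFilter.SHOW_TEXT);
var node;

while ((node = walker.nextNode())) {
  var text = node.nodeValue;
  Object.keys(substitutions).forEach(function (word) {
    text = text.split(word).join(substitutions[word]);
  });
  node.nodeValue = text;
}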

Seeing as how readability and performance are two of the main things that we check for in every code review here at Bazaarvoice, this was right up my alley!

I was so excited that I stayed up all weekend to finish the project.

I submitted my pull request, worked with the developers at Mozilla, and was so proud when my Emoji Substitution contribution was merged into their repository. What a rush!

As we traded Hacktoberfest-themed emoji, fixed bugs, and fleshed out their projects, it was really cool to lend my expertise and experience the gratitude of all the teams I worked with – this is what Open Source is all about!

I had a great time participating in Hacktoberfest this year and will definitely do it again next year. You should join me!

Injecting Applications onto Third-Party Webpages Made Easy

Bazaarvoice’s Small Web App Technologies (SWAT) team is pleased to announce that we are open sourcing swat-proxy – a tool to inject applications onto third-party webpages.

In third-party web application development it is difficult to be certain how our applications will look and behave on a client’s webpage until they are implemented. Any number of things could interfere – including other third-party applications! Delivering applications that don’t work can obviously have a severe negative impact on both our clients and us.

One solution is to inject – or proxy – our applications onto the client’s web page. This way we can ensure they work correctly – before they go into production. I wrote swat-proxy to do exactly that, acting as a man-in-the-middle between browser and web server. As the browser requests web pages from the server, swat-proxy intercepts the response and proxies our application into it. The browser renders the web page as if it contained our application all along – exactly simulating the client having implemented it. Now we can be certain how our applications will look and behave.
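To give a feel for what that interception involves, here is a bare-bones man-in-the-middle sketch using only Node core modules. This is not swat-proxy’s implementation or API; the host name and script URL are placeholders:

var http = require('http');

// Fetch the client's page, inject our application's script tag before
// </body>, and hand the modified markup back to the browser.
http.createServer(function (req, res) {
  http.get({ host: 'client-site.example.com', path: req.url }, function (upstream) {
    var body = '';
    upstream.on('data', function (chunk) { body += chunk; });
    upstream.on('end', function () {
      var injected = body.replace(
        '</body>',
        '<script src="https://our-apps.example.com/our-app.js"></script></body>'
      );
      res.writeHead(upstream.statusCode, { 'content-type': 'text/html' });
      res.end(injected);
    });
  });
}).listen(8000);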

Other tools exist to accomplish this task, but none are as front-end developer-friendly as swat-proxy: it is written entirely in JavaScript – plugging in nicely to our existing workflows – and uses familiar CSS selectors to target DOM elements when injecting content. It is run locally using NodeJS and is very easy to use.

We have found swat-proxy to be incredibly useful when rapidly iterating on prototypes and ensuring the behavior of our applications before they are released to production – we hope you do too! We are releasing it to the larger world as open source, under the Apache 2.0 license. Please download it, try it out, and let us know what you think (in comments below, or as issues or pull requests on GitHub).

Scoutfile: A module for generating a client-side JS app loader

A couple of years ago, my former colleague Alex Sexton wrote about the techniques that we use at Bazaarvoice to deploy client-side JavaScript applications and then load those applications in a browser. Alex went into great detail, and it’s a good, if long, read. The core idea, though, is pretty simple: an application is bootstrapped by a “scout” file that lives at a URL that never changes, and that has a very short TTL. Its job is to load other static resources with long TTLs that live at versioned URLs — that is, URLs that change with each new deployment of the application. This strategy balances two concerns: the bulk of application resources become highly cacheable, while still being easy to update.

In order for a scout file to perform its duty, it needs to load JavaScript, load CSS, and host the config that says which JS and CSS to load. Depending on the application, other functionality might be useful: the ability to detect old IE; the ability to detect DOM ready; the ability to queue calls to the application’s methods, so they can be invoked for real when the core application resources arrive.
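That queueing piece is the same trick many analytics snippets use: the scout installs a tiny stub on window that records invocations until the real application arrives and replays them. Here is a rough sketch of the idea (not scoutfile’s actual internals; the names are illustrative):

// Stub installed by the scout file before the main bundle loads.
window.MyApp = window.MyApp || function () {
  (window.MyApp.q = window.MyApp.q || []).push(arguments);
};

// Callers can invoke the application immediately...
window.MyApp('render', { productId: '12345' });

// ...and when the real application finishes loading, it drains the queue.
function onAppLoaded(realApp) {
  var queued = window.MyApp.q || [];
  window.MyApp = realApp;
  queued.forEach(function (args) {
    realApp.apply(null, args);
  });
}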

At Bazaarvoice, we’ve been building a lot of new client-side applications lately — internal and external — and we’ve realized two things: one, it’s very silly for each application to reinvent this particular wheel; two, there’s nothing especially top secret about this wheel that would prevent us from sharing it with others.

To that end, I’m happy to release scoutfile as an NPM module that you can use in your projects to generate a scout file. It’s a project that Lon Ingram and I worked on, and it provides both a Grunt task and a Node interface for creating a scout file for your application. With scoutfile, your JavaScript application can specify the common functionality required in your scout file — for example, the ability to load JS, load CSS, and detect old IE. Then, you provide any code that is unique to your application that should be included in your scout file. The scoutfile module uses Webpack under the hood, which means you can use loaders like json! and css! for common tasks.

The most basic usage is to npm install scoutfile, then create a scout file in your application. In your scout file, you specify the functionality you need from scoutfile:

var App = require('scoutfile/lib/browser/application');
var loader = require('scoutfile/lib/browser/loader');

var config = require('json!./config.json');
var MyApp = App('MyApp');

MyApp.config = config;

loader.loadScript(config.appJS);
loader.loadStyleSheet(config.appCSS);

Next, you can generate your scout file using a simple Node script:

var scout = require('scoutfile');
scout.generate({
  appModules: [
    {
      name: 'MyApp',
      path: './app/scout.js'
    }
  ],

  // Specify `pretty` to get un-uglified output.
  pretty: true
}).then(function (scout) {
  console.log(scout);
});

The README contains a lot more details, including how to use flags to differentiate production vs. development builds; how to configure the Grunt task; how to configure the “namespace” that is occupied on window (a necessary evil if you want to queue calls before your main application renders); and more.

There are also several open issues to improve or add functionality. You can check out the developer README if you’re interested in contributing.