Category Archives: Uncategorized

Semantically Compress Text to Save On LLM Costs

Introduction

Large language models are fantastic tools for unstructured text, but what if your text doesn’t fit in the context window? Bazaarvoice faced exactly this challenge when building our AI Review Summaries feature: millions of user reviews simply won’t fit into the context window of even newer LLMs and, even if they did, it would be prohibitively expensive.

In this post, I share how Bazaarvoice tackled this problem by compressing the input text without loss of semantics. Specifically, we use a multi-pass hierarchical clustering approach that lets us explicitly adjust the level of detail we want to lose in exchange for compression, regardless of the embedding model chosen. The final technique made our Review Summaries feature financially feasible and set us up to continue to scale our business in the future.

The Problem

Bazaarvoice has been collecting user-generated product reviews for nearly 20 years so we have a lot of data. These product reviews are completely unstructured, varying in length and content. Large language models are excellent tools for unstructured text: they can handle unstructured data and identify relevant pieces of information amongst distractors.

LLMs have their limitations, however, and one such limitation is the context window: how many tokens (roughly the number of words) can be put into the network at once. State-of-the-art large language models, such as Athropic’s Claude version 3, have extremely large context windows of up to 200,000 tokens. This means you can fit small novels into them, but the internet is still a vast, every-growing collection of data, and our user-generated product reviews are no different.

We hit the context window limit while building our Review Summaries feature that summarizes all of the reviews of a specific product on our clients website. Over the past 20 years, however, many products have garnered thousands of reviews that quickly overloaded the LLM context window. In fact, we even have products with millions of reviews that would require immense re-engineering of LLMs to be able to process in one prompt.

Even if it was technically feasible, the costs would be quite prohibitive. All LLM providers charge based on the number of input and output tokens. As you approach the context window limits for each product, of which we have millions, we can quickly run up cloud hosting bills in excess of six figures.

Our Approach

To ship Review Summaries despite these technical, and financial, limitations, we focused on a rather simple insight into our data: Many reviews say the same thing. In fact, the whole idea of a summary relies on this: review summaries capture the recurring insights, themes, and sentiments of the reviewers. We realized that we can capitalize on this data duplication to reduce the amount of text we need to send to the LLM, saving us from hitting the context window limit and reducing the operating cost of our system.

To achieve this, we needed to identify segments of text that say the same thing. Such a task is easier said than done: often people use different words or phrases to express the same thing.

Fortunately, the task of identifying if text is semantically similar has been an active area of research in the natural language processing field. The work by Agirre et. al. 2013 (SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics) even published a human-labeled data of semantically similar sentences known as the STS Benchmark. In it, they ask humans to indicate if textual sentences are semantically similar or dissimilar on a scale of 1-5, as illustrated in the table below (table from Cer et. al., SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation):

The STSBenchmark dataset is often used to evaluate how well a text embedding model can associate semantically similar sentences in its high-dimensional space. Specifically, Pearson’s correlation is used to measure how well the embedding model represents the human judgements.

Thus, we can use such an embedding model to identify semantically similar phrases from product reviews, and then remove repeated phrases before sending them to the LLM.

Our approach is as follows:

First, product reviews are segmented the into sentences.
An embedding vector is computed for each sentence using a network that performs well on the STS benchmark
Agglomerative clustering is used on all embedding vectors for each product.
An example sentence – the one closest to the cluster centroid – is retained from each cluster to send to the LLM, and other sentences within each cluster are dropped.
Any small clusters are considered outliers, and those are randomly sampled for inclusion in the LLM.
The number of sentences each cluster represents is included in the LLM prompt to ensure the weight of each sentiment is considered.

This may seem straightforward when written in a bulleted list, but there were some devils in the details we had to sort out before we could trust this approach.

Embedding Model Evaluation

First, we had to ensure the model we used effectively embedded text in a space where semantically similar sentences are close, and semantically dissimilar ones are far away. To do this, we simply used the STS benchmark dataset and computed the Pearson correlation for the models we desired to consider. We use AWS as a cloud provider, so naturally we wanted to evaluate their Titan Text Embedding models.

Below is a table showing the Pearson’s correlation on the STS Benchmark for different Titan Embedding models:

Model	Dimensionality	Correlation on STS Benchmark
amazon.titan-embed-text-v1	1536	0.801031
amazon.titan-embed-text-v2:0	256	0.818282
amazon.titan-embed-text-v2:0	512	0.854073
amazon.titan-embed-text-v2:0	1024	0.868574
State-of-the-art		0.929

So AWS’s embedding models are quite good at embedding semantically similar sentences. This was great news for us – we can use these models off the shelf and their cost is extremely low.

Semantically Similar Clustering

The next challenge we faced was: how can we enforce semantic similarity during clustering? Ideally, no cluster would have two sentences whose semantic similarity is less than humans can accept – a score of 4 in the table above. Those scores, however, do not directly translate to the embedding distances, which is what is needed for agglomerative clustering thresholds.

To deal with this issue, we again turned to the STS benchmark dataset. We computed the distances for all pairs in the training dataset, and fit a polynomial from the scores to the distance thresholds.

This polynomial lets us compute the distance threshold needed to meet any semantic similarity target. For Review Summaries, we selected a score of 3.5, so nearly all clusters contain sentences that are “roughly” to “mostly” equivalent or more.

It’s worth noting that this can be done on any embedding network. This lets us experiment with different embedding networks as they become available, and quickly swap them out should we desire without worrying that the clusters will have semantically dissimilar sentences.

Multi-Pass Clustering

Up to this point, we knew we could trust our semantic compression, but it wasn’t clear how much compression we could get from our data. As expected, the amount of compression varied across different products, clients, and industries.

Without loss of semantic information, i.e., a hard threshold of 4, we only achieved a compression ratio of 1.18 (i.e., a space savings of 15%).

Clearly lossless compression wasn’t going to be enough to make this feature financially viable.

Our distance selection approach discussed above, however, provided an interesting possibility here: we can slowly increase the amount of information loss by repeatedly running the clustering at lower thresholds for remaining data.

The approach is as follows:

Run the clustering with a threshold selected from score = 4. This is considered lossless.
Select any outlying clusters, i.e., those with only a few vectors. These are considered “not compressed” and used for the next phase. We chose to re-run clustering on any clusters with size less than 10.
Run clustering again with a threshold selected from score = 3. This is not lossless, but not so bad.
Select any clusters with size less than 10.
Repeat as desired, continuously decreasing the score threshold.

So, at each pass of the clustering, we’re sacrificing more information loss, but getting more compression and not muddying the lossless representative phrases we selected during the first pass.

In addition, such an approach is extremely useful not only for Review Summaries, where we want a high level of semantic similarity at the cost of less compression, but for other use cases where we may care less about semantic information loss but desire to spend less on prompt inputs.

In practice, there are still a significantly large number of clusters with only a single vector in them even after dropping the score threshold a number of times. These are considered outliers, and are randomly sampled for inclusion in the final prompt. We select the sample size to ensure the final prompt has 25,000 tokens, but no more.

Ensuring Authenticity

The multi-pass clustering and random outlier sampling permits semantic information loss in exchange for a smaller context window to send to the LLM. This raises the question: how good are our summaries?

At Bazaarvoice, we know authenticity is a requirement for consumer trust, and our Review Summaries must stay authentic to truly represent all voices captured in the reviews. Any lossy compression approach runs the risk of mis-representing or excluding the consumers who took time to author a review.

To ensure our compression technique was valid, we measured this directly. Specifically, for each product, we sampled a number of reviews, and then used LLM Evals to identify if the summary was representative of and relevant to each review. This gives us a hard metric to evaluate and balance our compression against.

Results

Over the past 20 years, we have collected nearly a billion user-generated reviews and needed to generate summaries for tens of millions of products. Many of these products have thousands of reviews, and some up to millions, that would exhaust the context windows of LLMs and run the price up considerably.

Using our approach above, however, we reduced the input text size by 97.7% (a compression ratio of 42), letting us scale this solution for all products and any amount of review volume in the future.
In addition, the cost of generating summaries for all of our billion-scale dataset reduced 82.4%. This includes the cost of embedding the sentence data and storing them in a database.

What was Old is New: Finding Joy in Modernising Legacy Systems

(cover image from ThisisEngineering RAEng)

Let’s face it: software is easier to write than maintain. This is why we, as software engineers, prefer to just “rip it out and start over” instead of trying to understand what another developer (or our past self) was thinking. We seem to have collectively forgotten that “programs must be written for people to read, and only incidentally for machines to execute”. You know it’s true — we’ve all had to painstakingly trace through a casserole of spaghetti code and thin, old-world-style abstractions digging for the meat of the program only to find nothing but a mess at the bottom of our plates.

It’s easy to yell “WTF” and blame the previous dev, but the truth is often more complicated. We can’t see the future, so it’s impossible to understand how requirements, technology, or business goals will grow when we design a net-new system. As a result, systems can become unreadable as their scope increases along with the business’s dependency on them. This is a bit of a paradox: older, harder to maintain systems often provide the most value. They are hard to work on because they’ve grown with the company, and scary to work on because breaking it could be a catastrophe.

Here’s where I’m calling you out: if you like hard, rewarding problems… try it. Take the oldest system you have and make it maintainable. You know the one I’m talking about — the one no one will “own”. That one the other departments depend on but engineers hate. The one you had to patch Log4Shell on first. Do it. I dare you.I recently had such an opportunity to update a decade old machine learning system at Bazaarvoice. On the surface, it didn’t sound exciting: this thing didn’t even have neural networks! Who cares! We ll… it mattered. This system processes nearly every user-generated product review received by Bazaarvoice — nearly 9 million per month — and does so with 90 million inference calls to machine learning models. Yup — 90 million inferences! It’s a huge scale, and I couldn’t wait to dive in.

In this post, I’ll share how modernizing this system through a re-architecture, instead of a re-write, allowed us to make it scalable and cost-effective without having to rip out all of the code and start over. The resulting system is serverless, containerized, and maintainable while reducing our hosting costs by nearly 80%. This post is a companion piece to a talk I recently presented at AWS Data Summit for Software Companies on generating value from data by leveraging our best practices to ensure success in machine learning projects. This post discusses the technical aspects in more detail, but you can watch the high-level talk linked at the end.

Something Old

First, let’s take a look at what we’re dealing with here. The legacy system my team was updating moderates user-generated content for all of Bazaarvoice. Specifically, it determines if each piece of content is appropriate for our client’s websites.

This sounds straightforward — eliminate obvious infractions such as hate speech, foul language, or solicitations — but in practice, it’s much more nuanced. Each client has unique requirements for what they consider appropriate. Beer brands, for example, would expect discussions of alcohol, but a children’s brand may not. We capture these client-specific options when we onboard new clients, and our Client Services team encodes them into a management database.

For some added complexity, we also sample a subset of our content to be moderated by human moderators. This allows us to continuously measure the performance of our models and discover opportunities for building more models.

The full architecture of our legacy system is shown below:

*Our legacy moderation system hosted machine learning models on a single EC2 instance. This made deployments slow and limited scalability to the host’s memory size.*

This system has some serious drawbacks. Specifically — all of the models are hosted on a single EC2 instance. This wasn’t due to bad engineering — just the inability of the original programmers to foresee the scale desired by the company. No one thought that it would grow as much as it did. In addition, the system suffered from developer rejection: it was written in Scala, which few engineers understood. Thus, it was often overlooked for improvement since no one wanted to touch it.

As a result, the system continued to grow in a keep-the-lights-on manner. Once we got around to re-architecting it, it was running on a single x1e.8xlarge instance. This thing had nearly a terabyte of ram and costs about $5,000/month (unreserved) to operate. Don’t worry, though, we just launched a second one for redundancy and a third for QA 🙃.

This system was costly to run and was at a high risk of failure (a single bad model can take down the whole service). Furthermore, the code base had not been actively developed and was thus significantly out of date with modern data science packages and did not follow our standard practices for services written in Scala.

Something New

When redesigning this system we had a clear goal: make it scalable. Reducing operating costs was a secondary goal, as was easing model and code management.

The new design we came up with is illustrated below:

*Our new architecture deploys each model to a SageMaker Serverless endpoint. This lets us scale the number of models without limit while maintaining a small cost footprint.*

Our approach to solving all of this was to put each machine learning model on an isolated SageMaker Serverless endpoint. Like AWS Lambda functions, serverless endpoints turn off when not in use — saving us runtime costs for infrequently used models. They also can scale out quickly in response to increases in traffic.

In addition, we exposed the client options to a single microservice that routes content to the appropriate models. This was the bulk of the new code we had to write: a small API that was easy to maintain and let our data scientists more easily update and deploy new models.

This approach has the following benefits:

Decreased the time to value by over 6x. Specifically, routing traffic to existing models is instantaneous, and deploying new models can be done in under 5 minutes instead of 30.
Scale without limit – we currently have 400 models but plan to scale to thousands to continue to increase the amount of content we can automatically moderate.
Saw a cost reduction of 82% moving off EC2 as the functions turn off when not in use, and we’re not paying for top-tier machines that are underutilized.

Simply designing an ideal architecture, however, isn’t the really ~~interesting~~ hard part of rebuilding a legacy system — you have to migrate to it.

Something Borrowed

Our first challenge in migration was figuring out how the heck to migrate a Java WEKA model to run on SageMaker, let alone SageMaker Serverless.

Fortunately, SageMaker deploys models in Docker containers, so at least we could freeze the Java and dependency versions to match our old code. This would help ensure the models hosted in the new system returned the same results as the legacy one.

To make the container compatible with SageMaker, all you need to do is implement a few specific HTTP endpoints:

POST /invocation — accept input, perform inference, and return results.
GET /ping — returns 200 if the JVM server is healthy

(We chose to ignore all of the cruft around BYO multimodel containers and the SageMaker inference toolkit.)

A few quick abstractions around com.sun.net.httpserver.HttpServer and we were ready to go.

And you know what? This was actually pretty fun. Toying with Docker containers and forcing something 10 years old into SageMaker Serverless had a bit of a tinkering vibe to it. It was pretty exciting when we got it working — especially when we got the legacy code to build it in our new sbt stack instead of maven. The new sbt stack made it easy to work on, and containerization ensured we could get proper behavior while running in the SageMaker environment.

Something Blue

So we have the models in containers and can deploy them to SageMaker — almost done, right? Not quite.

The hard lesson about migrating to a new architecture is that you must build three times your actual system just to support migration.

In addition to the new system, we had to build:

A data capture pipeline in the old system to record inputs and outputs from the model. We used these to confirm that the new system would return the same results.
A data processing pipeline in the new system to compute results and compare them to the data from the old system. This involved a large amount of measurement with Datadog and needed to offer the ability to replay data when we found discrepancies.
A full model deployment system to avoid impacting the users of the old system (which would simply upload models to S3). We knew we wanted to move them to an API eventually, but for the initial release, we needed to do so seamlessly.

All of this was throw-away code we knew we could toss once we finished migrating all of the users, but we still had to build it and ensure that the outputs of the new system matched the old.

Expect this upfront.

While building the migration tools and systems certainly took more than 60% of our engineering time on this project, it too was a fun experience. Unit testing became more like data science experiments: we wrote whole suites to ensure that our output matched exactly. It was a different way of thinking that made the work just that much more fun. A step outside our normal boxes, if you will.

So… Just Try It

Next time you’re tempted to rebuild a system from code up, I’d like to encourage you to try migrating the architecture instead of the code. You’ll find interesting and rewarding technical challenges and will likely enjoy it much more than debugging unexpected edge cases of your new code.

The talk I gave is a bit more high-level and goes into the MLOps side of things. Check it out here:

Response API Demo App

Are you looking to develop your own application on top of the Bazaarvoice Response API? Well, we got something for you. The Response API Demo App is a simple Node-React application which demonstrates how to use Response API in conjunction with our 3-legged OAuth2 API. It is recommended to go through the Developer Portal and read about OAuth2 API and the Response API before diving into the application architecture below.

This application was bootstrapped with Create React App and consists of two separate components – the front-end client side in React and a back-end server side in NodeJS.

Let’s talk more about the back-end architecture first. Almost all of the server-side logic is contained in server.js. Using the Express framework for NodeJS, this file defines the following endpoints:

/api/redirect – This endpoint is supposed to handle any incoming redirections from OAuth2, obtain an access token using either an authorization code (if it is a first time login) or a refresh token (if the existing access token has expired or is expiring soon). If you are building an app integrated with OAuth2, you must define an endpoint to handle redirects in a similar way.
/api/check-login – This is an application specific endpoint which was defined so that the client side can easily verify using the server if a user is logged in or not so that they can then be taken to the login page.
Abstraction of GET, POST, PATCH, DELETE calls to the Response API – These endpoints are basically a clone of the endpoints provided by the Response API which are documented here on the Developer Portal. This abstraction is done so that the client side can call these secured endpoints in-directly without having to expose any of the OAuth2 credentials through the user’s browser. All our logic that relates to obtaining and refreshing OAuth tokens stays on the server side securely.

This application uses express-session for storing and maintaining user sessions using browser cookies which is important for being able to maintain multiple users without storing state on your backend. However, this implementation of user sessions is not suitable for production applications and you should probably be maintaining user sessions using a combination of cookies and session storage.

The server side credentials and configurations are supposed to be stored in server/server-config.json which is then picked up by the Node server. Note that these credentials are confidential and should not be exposed to the client side in any way.

Coming to the front-end architecture, we are using the React library and React Semantic UI for the UI Components. The client side does a bunch of stuff ranging from querying a review from the Conversations API and then using that Review ID to obtain/add/modify/delete the corresponding responses to that review using the Response API. Following components make up the core of the front-end:

Login Page – If a user is not logged in (we check this using the /api/check-login endpoint as discussed above), we redirect them to this page which has a little widget that allows them to go to a Bazaarvoice login page, enter their Bazaarvoice Portal login credentials and once they are properly authenticated, the OAuth2 API redirects them to the /api/redirect endpoint which then handles all the token exchange logic. You can configure this redirect URI to be whatever you want during your app provisioning process but for the demo application to work as expected, it should be http://localhost:5000/api/redirect as you can see in the config file. All of the OAuth2 login and token exchange process happens during this step.

Search Page – This page essentially serves as the landing page for this application where a user can enter a Review ID (this can be obtained using Workbench or the Conversations API) and then they are taken to the review page.

Review Page – Once the user is authenticated and they search for a valid Review ID, they are taken to this page which fetches the specified review by calling the Conversations API and then fetches all responses for that review by calling the application’s backend which in turn calls the Response API. So, on this page, a user can see contents of a review along with all responses. This also allows them to add a new response, edit an existing one or delete a response.

You might have noticed a department field on a response which appears as a drop-down menu. These menu options have been hard-coded in the demo application and can be modified in the departmentFormOptions.js file. Further, client.js is an utility file that presents all the commonly used API calls as simple functions accessible to all components on the client side.

The client side configurations are stored in response-demo/client/src/utils/config.js file. This is built into the the project when the front-end client is packaged before running. These configurations are accessible to the browser so you shouldn’t store anything confidential here.

Apart from that, the application follows a pretty standard architecture and you can read more about it here. On the deployment side of things, there is a Dockerfile which builds the client artifacts, sets up the server and gets the application up and running on your local environment. Follow these instructions to get it up and running locally. The app can also be deployed to Flynn by making changes to the redirect URI in configurations and also making sure that your Flynn Redirect URI is added to the list of allowed Redirect URIs for your app credentials.

Publishing Load Test Results with Jmeter, Jenkins and S3

A while ago, I published a blog post that presented a tutorial overview of how to use Jmeter for load testing a typical RESTful API.

This post builds upon that original post with handy information on some updated reporting features of Jmeter as well a quick dive into how you can better propagate your load results in a continuous build environment (ala Jenkins).

if you’re completely new to Jmeter, please read my previous blog post, linked above before proceeding.

For those of you experienced with Jmeter (hopefully you already have some tests you’re wanting to deploy somewhere rather than let them live contentedly on your local machine), get ready because you’re gonna love this

It’s truly amazing the gifs you can find on the web.

Meet the New Jmeter, Same as the Old Jmeter (not really)

The previous load test tutorial covered setting up a load test from scratch using an early version of Jmeter. One of the benefits of using Jmeter, as outlined in the original post was that it has been around for quite a long time, is well known, stable and free. However, one of the major cons in using Jmeter is that its’ well, old – and has gone for a long while without major updates – especially when it comes to report generation.

Well, welcome to Jmeter 3! It still comes with that same, great, vintage 90s UI but now with a hoard a bug fixes, stability upgrades and – wait for it – native HTML report generation, wooo! Check out version 3’s documentation here:

Installation and Setup

So, let’s get down to it! You can find the latest build of Jmeter 3 here: https://jmeter.apache.org/download_jmeter.cgi

Steps to setting up Jmeter 3 locally are similar to the installation steps covered in my original blog post. Pre-installation requirements are the same:

Java JDK (version 7 or later)
Be sure your Java HOME local environment or path variables are set before execution.

Functionally, Jmeter 3 appears identical to older versions of the application but there are a lot of changes under the hood. I recommend unarchiving and installing Jmeter 3 in its own directory, separate from any previous version of Jmeter you may have previously installed.

Migrating your tests

If you’re following along with this tutorial and wish to create a new load test from scratch, please follow the test setup steps in the original post’s tutorial however, be sure to SKIP the 3^rd party report and listener portion of the test setup (guess what, you won’t be needing those plugins anymore).

If you have an existing test you wish to simply fire up in Jmeter 3, your mileage will vary in this case. Some of the 3^rd party listeners and other tools you may have used in your setup may be incompatible with the newer version of Jmeter. If you have any custom listeners embedded in your test I recommend doing the following:

Open your original test in your older version of Jmeter
Remove 3^rd party listeners, samplers or elements from your test (right-click and select remove)
Save the updated test as a new test
Open the newly created test in Jmeter 3

From this point, you won’t really need any of your original report listeners configured in your test, however, for debugging purposes, I do recommend keeping a simple response listener in your test – one for successful API responses, another separate for failed responses – just to help with debugging any issues your test might run into.

In a few cases, I’ve found despite removing any 3^rd party plugins that some tests have compatibility issues with Jmeter 3. Sadly, the easiest course of action is to rebuild the test from scratch. Such is life (sigh).

Such is life…

Executing Tests with Report Generation

Given you have your initial test configured for Jmeter 3, let’s start up the app and view it in the new UI (though it looks like the old UI). Once you start Jmeter froim the command line, you’ll see a message like this in your terminal:

================================================================================
Don't use GUI mode for load testing, only for Test creation and Test debugging !
For load testing, use NON GUI Mode & adapt Java Heap to your test requirements
================================================================================

You don’t say!? If you’re a long-time user of earlier version of Jmeter, you’re probably thinking, “Well, no duh!” right?

Open your load test in the UI and give it a look-over to make sure its configured the way you intend. It should be minimal simple thread group, your HTTP samplers, your results tree report and any CSV data resources you might need.

Your test may look something like this

Next, save your test and close Jmeter (we’re now done with the Jmeter UI).

Now, we are going to use Jmeter’s command line interface to execute the load test and generate the report for the test.

Do the following:

Open your CLI terminal and navigate to your Jmeter 3 bin directory. Execute the following command:

jmeter -n -t <test JMX file> -l <test log file> -e -o <Path to output folder>

Be sure to set the ‘test JMX file’ argument to the path where your load test’s .JMX file resides, relative to the Jmeter bin directory.

The test log file should be an arbitrary log file path/name that currently does not exist.

The output folder path is where your report will be written to. This must be an empty directory.

A couple of notes:

If you configured your test to run for a set period, just sit back and wait. The report will be generated in the output directory you provided once the test execution finishes.

If you configured it to run in perpetuity – you will need to formally stop the test before report generation will execute (simply killing your terminal session won’t help you here).

To stop your Jmeter test from the command line, simply open a new terminal window and enter one of the following commands:

Control + . – This stops Jmeter, halting any active listeners.

Control + , – This sends a shutdown command to Jmeter – it will wait until all active threads have finished their current tasks before closing shop (this is the recommended method to use unless you’re impatient

or there is an emergency – like, oh you forgot you pointed your test at production).

Are you really going to argue with Chuck?

Once your test has gracefully exited, navigate to the output folder you specified and open your newly minted report in your web browser of choice.

Jmeter 3 HTML report

Pretty sweet huh?

If you haven’t seen Jmeter’s dashboard report before, you’re probably thinking, “golly, there’s a lot of stuff here” – and there is. The dashboard report contains a pretty wide variety of data on your test execution. Enumerating through all of that here would be exhausting however – if you’re interested in all the gory details of Jmeter’s dashboard metrics, check out this in-depth explanation here:

https://jmeter.apache.org/usermanual/generating-dashboard.html

Show and Tell (mostly show)

So now you have this fancy load test report but hoarding all this metric goodness on your local development workstation is no fun for the rest of your team. But that’s OK because if you have access to a continuous integration build environment (ala – Jenkins), the rest of this blog post will walk you through taking your Jmeter test and throwing it into a Jenkins build that allows you to not only run your test remotely but also save and propagate your results to the rest of your agile team.

Make your test available

First things first, you’re going to need to take your Jmeter test, get it off your local machine and into some form of web-accessible delivery. If you have AWS access, you could place your .JMX files and resources into a bucket and be done with it. In our case, I recommend creating a private Github repository for your tests and committing your Jmeter working directory, tests and resource files to that repository. There’s a couple of reasons for this:

It’s easy to access a given Github repository with Jenkins
You can manage updates and versions of your tests just like source code
You can manage variants of your load tests using Github’s branch management features

That said, before tossing your tests into Github’s gaping maw, here’s an important note regarding information security:

Depending on your type of load test, you may have some API or other service access keys stored in your test. You don’t want to check in API or other access keys into your repository (even if it is marked private). But don’t just take my word for it:

https://softwareengineering.stackexchange.com/questions/205606/strategy-for-keeping-secret-info-such-as-api-keys-out-of-source-control

If you’ve already committed your tests to your repository and you still have some software keys in there and are thinking, “Oh boy… What now!?”, here’s a link to some handy notes on how to extract some sensitive values from your test or code using some handy features of Git:

https://gist.github.com/derzorngottes/3b57edc1f996dddcab25

Getting Up and Running with Jenkins

Jenkins is one of the most widely available CI build tools in use today. As such, we’re going to cover how to set up a Jenkins task to run your load tests. If you’re using some other build system (e.g. Team City) this walk-through won’t be as pertinent but essentially, the tasks will be the same.

Step 1: Get access to Jenkins:

Assuming you already have a Jenkins build server of some sort available to your organization – let’s log into it. If you don’t have access to Jenkins, talk to your friendly neighborhood dev ops engineer. Note, that for this walkthrough, you’re going to need some level of administrator access to set up everything we need to execute our tests.

Step 2: Create and Setup your Job with Parameters:

Do the following:

From the Jenkins home page, choose a directory you wish to run your load test jobs from (or create one).
Click the “New Item” link and then choose “Freestyle Project”
Give your project a name and description
Choose the “This project is parameterized” option – we’ll use parameters to allow you to configure your load test If needed.
Select the “New string parameter” option and create a new parameter named, “thread”
Set the default value to whatever you want your test’s default number of threads to be (e.g. 40).
Create another string parameter and name it “ramp” – set its default value to the number of seconds you wish your test to take to ramp up to its maximum load.

If your test is going to loop, create another string parameter named, “loops” and set its default, numeric value.

If you have an API key (or keys) you need for your test to execute, as mentioned above, and have already removed said keys from your test (substituting them for a placeholder value) you can create additional string parameters here (naming them key1, key2, etc.). You can then set your API key(s) to the default value field. Granted, you are propagating your API key here but this is a much better practice in terms of security than leaving them in a potentially publicly accessible repository as your Jenkins instance should be private and protected.

Step 3: Git Jenkins to Talk to Git:

While still in the Jenkins job configuration screen, under the Source Code Management section, select “Git” and enter your repository’s URL alias into the Repository URL field.

If this is your first time doing this, Jenkins expects a specific format for your Github URL strings. It should look something like this:

https://github.com:<your main repository name>/<your project name>

If you’re using a publicly accessible Github repository, skip the next step, enter your preferred branch name (or */master if you don’t need a specific branch).

Next, select the appropriate credentials option from the relative drop down menu to allow Jenkins to properly introduce itself to your Github repo (if you’re having trouble with this, ask your dev ops engineer or Jenkins server admin for help – and maybe do something nice for them while you’re at it like get them a thank you card or some cookies or something, because at this point, you’d be out of luck without them, right?).

Step 5: Build Settings:

You can kick off your tests manually however, it’s probably better to set up timer or other method to handle this for you. Try the following:

Under the Build Triggers heading, check the Build periodically option

Enter your time range you wish the build to trigger at (this will be in Cron notation – which is tricky if you’re seeing this for the first time).

This article does a decent job explain Cron’s formatting: http://www.nncron.ru/help/EN/working/cron-format.htm

Now, click on the Add build step option and from the dropdown menu, select ‘Execute shell’.

Here, we will define the command we will use to execute the Jmeter test once Jenkins pulls the test resources from Github. This will be like the shell command we used to run Jmeter locally as described above.

Syncing to Amazon Web Services

Now we need to make sure we’re saving our generated report somewhere so the rest of our team can review it. Jenkins has a Jmeter performance plugin that can archive our results. Also, we could just have Jenkins archive the output directory Jmeter generates the report in (though handing all the recursive directories it will build might be a pain).

Given you have an AWS account, you have the options to sync your reports to an AWS bucket and host them like a static web page. This method would be beneficial if you need to expose your load test results to team members or other parties that do not or should not have access to your build systems (e.g. product manager, support team members, etc.).

If you don’t have access to AWS, aren’t sure you have access to AWS or may have issues with your build server communicating to AWS, once again, now is a fine time to make friends with (*cough* bribe *cough*) your friendly neighborhood dev ops engineer.

Let’s say you’ve taken care of that and you and your dev ops guru are best buddies now. Next, you’ll need to further configure your Jenkins job by adding another Execute shell build step. Make sure this one is ordered sequentially after your Jmeter test execution step.

In the Execution shell, simply add a command to change to the output directory your report will be in (defined in your test execution command) and use the AWS sync command to send your HTML report to AWS:

cd bin/target
 aws s3 sync . s3://mybucketroot/mybucketname

Save your Jenkins job. You’re done (with Jenkins at least).

Now, in the above step, we are assuming you have a bucket already created in AWS where you can sync your data to. You’re also going to need access to modify how the bucket works in terms of its permissions.

Making Your Test Results Readable from the Web:

By default, your AWS bucket is out there on the web, you can read from it, write from it, but your other team members will have trouble reading the HTML report from that location. We need to configure the bucket so that it will display its contents like a static web page. Here’s how:

Log into your AWS account and access your specified bucket.

Click on the bucket and select the Properties option

Click the Static website hosting panel

Enable the static hosting option

Specify your index.html file location (this should be the relative path to the main HTML file for your uploaded report).

Save your changes and boom – now you’re done.

Save and run your job. If your account permissions are all set up correctly, your test should execute and the test results will be uploaded to the specified AWS bucket.

From here, simply access your AWS bucket using the following styled web link:

https://s3.amazonaws.com/myBucketRoot/myBucketName

Your test results should now be viewable just as if they were a static web page by anyone on your team.

Wrapping it Up:

Now that you have your load tests residing in a Github repository, feel free to leverage Github’s branching features to allow you to manage multiple iterations of your load tests:

Commit different variations of the same branch to new branches

Use Jenkin’s copy feature to copy your existing job, and just point it to your different branches to allow you to execute multiple test types.

Use Jmeter’s UI locally to copy/edit your existing tests to make additional iterations. Save these to your repo and commit them just like you would any other coding project.

Again, use Jenkin’s copy feature to create variations of your test run jobs – this time, just edit your shell command to execute your desired .JMX test file per a given execution.

Thanks for reading and I hope this give you further ideas and insight into how you can leverage Jmeter to further improve your software projects through load testing.

I want to be a UX Designer. Where do I start?

So many folks are wonder what they need to do to make a career of User Experience Design. As someone who interviewed many designers before, I’d say the only gate between you and a career in UX that really matters is your portfolio. Tech moves too fast and is too competitive to worry about tenure and experience and degrees. If you can bring it, you’re in!

That doesn’t mean school is a waste of time, though. Some of the best UX Design candidates I’ve interviewed came from Carnegie Mellon. We have a UX Research intern from the University of Texas on staff right now, and I’m blown away by her knowledge and talent. A good academic program can help you skip a lot of trial-by-fire and learning things the painful way. But most of all, a good academic program can feed you projects to use as samples in your portfolio. But goodness, choose your school carefully! I’ve also felt so bad for another candidate whose professors obviously had no idea what they were talking about.

Okay, so that portfolio… what should it demonstrate? What sorts of samples should it include? Well, that depends on what sort of UX Designer you want to be.

Below is a list of to-dos, but before you jump into your project, I strongly suggest forming a little product team. Your product team can be your your knitting circle, your best friend and next-best-friend, a fellow UX-hopeful. It doesn’t really matter so long as your team is comprised of humans.

I make this suggestion because I’ve observed that many UX students actually have projects under their belt, but they are mostly homework assignments they did solo. So they are going through the motions of producing journey maps, etc., but without really knowing why. So then they imagine to themselves that these deliverables are instructions. This is how UX Designers instruct engineers on what to do. Nope.

The truth is, deliverables like journey maps and persona charts and wireframes help other people instruct us. In real life, you’ll work with a team of engineers, and those folks must have opportunites to influence the design; otherwise, they won’t believe in it. And they won’t put their heart and soul into building it. And your mockups will look great, and the final product will be a mess of excuses.

So, if you can demonstrate to a hiring manager that you know how to collaborate, dang. You are ahead of the pack. So round up your jackass friends, come up with a fun team name, and…

If you want to be a UX Researcher,

Demonstrate product discovery.

Identify a market you want to affect, for example, people who walk their dogs.
Interview potential customers. Learn what they do, how they go about doing it, and how they feel at each step. (Look up “user journey” and “user experience map”)
Organize customers into categories based on their behaviors. (Look up “personas”)
Determine which persona(s) you can help the most.
Identify major pain points in their journey.
Brainstorm how you can solve these pain points with technology.

Demonstrate collaboration.

Allow the customers you interview to influence your understanding of the problem.
Invite others to help you identify pain points.
Invite others to help you brainstorms solutions.

If you want to be a UI designer,

Demonstrate ideation.

Brainstorm multiple ways to solve a problem
Choose the most compelling/feasible solution.
Sketch various ways that solution could be executed.
Pick the best concept and wireframe the most basic workflow. (Look up “hero flow”
Be aware of the assumptions your concept is based upon. Know that if you cannot validate them, you might need to go back to the drawing board. (Look up “product pivoting”)

Demonstrate collaboration.

Invite other people to help you brainstorm.
Let others vote on which concept to pursue.
Use a whiteboard to come up with the execution plan together.
Share your wireframes with potential customers and to see if the concept actually resonates with them.

If you want to be an IX Designer and Information Architect,

Demonstrate prototyping skill.

Build a prototype. The type of prototype depends on what you want to test. If you are trying to figure out how to organize the screens in your app, just labeled cards would work. (Look up “card sorting). If you want to test interactions, a coded version of the app with dummy content is nice, but clickable wireframes might be sufficient.
Plan your test. List the fundamental tasks people must be able to perform for your app to even make sense.
Correct the aspects of your design that throw people off or confuse people.

Demonstrate collaboration.

Allow customers to test-drive your prototype. (Look up “usability testing”)
Ask others to help you think of the best ways to revise your design based on the usability test results.

If you want to be a visual designer,

Demonstrate that you are paying attention.

Collect inspiration and media that you think your customers would like. Hit up dribbble and muzli and medium and behance and google image search and, and, and.
Organize all this media by mood: the pale ones, the punchy ones, the fun ones, whatever.
Pick the mood that matches the way you want people to feel when they use your app.
Style the wireframes with colors and graphics to match that mood.
Bonus: create a marketing page, a logo, business cards, and other graphic design assets that show big thinking.

Demonstrate collaboration.

Ask customers what media and inspiration they like. Let them help you collect materials.
Ask customers how your mood boards make them feel, in their own words.

Whew! That’s a lot of work! I know. At the very least, school buys you time to do all this stuff. And it’s totally okay to focus on just UX Research or just Visual Design and bill yourself as a specialist. Anyway, if you honestly enjoy UX Design, it will feel like playing. And remember to give your brain breaks once in a while. Go outside and ride your bike; it’ll help you keep your creative energy high.

Hope that helps, and good luck!

This article was originally published on Medium, “How to I break into UX Design?“

WHERE WE’RE GOING, WE DON’T NEED ROADS… BUT WE WILL NEED FLAGS

In the previous blog post I introduced our stripe-ctf-2-vm, a self-contained capture the flag puzzle ladder in one vm. In this post, I’d like to talk about how we used the vm to introduce the security mindset to our developers here at Bazaarvoice.

One of the tenets in R&D is to responsibly “fail fast, win fast” with data-driven decisions. It was time for a grand experiment. We put on our lab coats and called for a dozen volunteers for a test training session. The format of this session ran like this:

Each participant was a lone competitor.
Discussion was encouraged to promote collaboration.
Each puzzle was timeboxed based on the feel of the room.
Each puzzle/timebox would conclude with a discussion of the solution.

What We Learned, Part 1

Gamifying learning is hard. There’s a whole field dedicated to this for a reason. We felt like we hit the broad strokes by providing: an increasingly difficult ladder of challenges, competition, social interaction (almost, see below), and timely feedback. The experiment led us to some very important tweaks…

Lone competitors don’t collaborate effectively, for obvious reasons. Attendees should form small teams of three or four for a better social element. Consider mixing technical skill sets and/or functional teams. That really helps.
There’s a balance to be had between freedom to play with a puzzle and frustration. It’s a good idea to track the time closely and cut to the solution before tables are flipped.
You also need to track the time so you can make an educated guess on how long future training sessions should be.
Proctoring and encouragement is needed for those new to CTF or security puzzles. This is easier to do when small teams are formed.
Prizes rewarded per level don’t motivate very much.

Taking It to the Big Leagues

Armed with what we learned and bolstered by testimonials from some of the attendees, we proposed to management a way to take this further. Now, BV R&D management is very pro-learning and our VP takes security very seriously. We didn’t have to sell the Why very much, just the How. Since this education was seen as vital to the growth of our engineering effort, we made the case that all one hundred and twenty plus of our engineers needed to train. We provided a plan for about nine sessions with about twenty engineers each. Once we had management buy-in, we did a two minute elevator-pitch to all of engineering so they had context for the meeting invites that followed.

If you don’t have rapport with your management about the value of secure code, you have your work cut out for you. We suggest you leverage your own talent to locate a few security flaws either through code review, pen testing, or fuzzing tools if you’ve got the skills. The likely low-hanging fruit for a web app exploit are unsanitized inputs, cross-site scripting, and SQL injection. Put together a presentation that educates on the rising number of web app breaches, the cost of fixing flaws in production, and the fact that you just found flaws in production. Paying for external pen testing or security training, or security tools can be expensive. Use what leverage you get from the presentation to instigate a bottom-up approach like a book club or training like this CTF session to cultivate security mavens who can advise the engineers around them.

The Pattern for a Session

Once you have roomful of engineers ready get their security on, what do you do? We divided them up into small teams and tried to get a mix between front-end, back-end, and QA skill sets when possible. The next thirty or so minutes were spent on helping the participants setup their VMs and log into them via SSH. Most of the issues you will run into will likely stem from mix-ups with localhost VBox adapters or trying to run and log into the VM directly from the VBox UI rather than via their terminal and shell script. When your engineers are logged into the VM, you’re ready to run through the introductory slides.

As soon as fingers are poised to take on the puzzles, give everyone the password and present the staging slide for Puzzle 0 (0, because we are programmers). Set up the puzzle and let the teams go to town on it. It’s a good idea to remind everyone to be vocal about looting the puzzle. And make sure they know they can look at the source code in the VM or in the github repo (no, the code you can see in the levels/0 directory is not the running instance of it so don’t modify it and expect to see the changes). A few minutes later, show the hint slide. Feel out your audience though. Easier puzzles may not need a hint, harder puzzles may need more breathing room before or after the hint is shown.

After you have a winner, or better yet, wait until you have two victorious teams (you don’t want to move too fast), show the solution and remediation slides and talk through ways the vulnerability could have been avoided. Any examples specific to your stack or technology you can use to bolster the remediation discussion will strengthen the discussion enormously. Once all the teams have unlocked the next level, repeat until you run out of time or until you solve all the puzzles. The latter means you are a bunch of badasses. Srsly!

Go as far as you can, and reward the team that conquered the most puzzles with something highly coveted: Pokemon figures. Or money. Whichever.

Your mileage will very of course, but we found the following timing worked for BV R&D across our three hour training sessions:

What We Learned, Part 2

Here’s what we learned from conducting nine sessions:

Scheduling 120+ people for a three hour training session is hard. Set aside admin time to handle rescheduling requests and plan for a make-up session.
Generate a VM per session so the loot is different each time.
For our engineers, a three hour session allowed the class to work through the first 5 or 6 puzzles as a group which made for a good introduction to security vulnerabilities.
Send out instructions and links to your generated VMs in the meeting invite.
Even so, plan on spending the first thirty to forty minutes setting up VMs.
Don’t send out the ctf user password in the invite; some clever, motivated individual will work through all the problems before the session. 🙂 If they do that, they should proctor the session with you.
If your company uses HipChat, Slack, or similar, set up an invite-only room per session for that session’s attendees only. It helps to have a means of cut/pasting the solutions to the group.
Don’t forget to work the room to give encouragement and redirect rabbit holing. Each team is different and some prefer to be left alone. Learn to feel that and respect it.
Some engineers will get hung up on the implementation details of the puzzles: “I don’t need to learn bloody PHP.” Try to impress upon them the larger vulnerability exercised by the puzzle. Try leading them through the thought exercise of how the security flaw (and not the language-specific flaw) may be at work in your company’s code today.
Some teams will race ahead after unlocking the next level. Most of the time that will even out as the CTF continues. Occasionally you have an ace security-minded team. If they continually blaze ahead, you may need to get their buy-in to take them out of direct competition with the other teams for morale’s sake. Perhaps they can help proctor the others?
Use something simple like a GoogleDoc form to take a quick poll of participants after each session. You will get invaluable feedback on how to better customize this to your organization.

We hope you find this unique way of introducing the secure coding mindset to your engineering organization useful and fun. It is just the tip of the iceberg. Once your engineers are thinking this way, you will need to develop further plans to encourage more education on the common secure coding patterns themselves. Consider periodic secure coding code reviews or pair programming with the experienced security-minded folks on your team. Forming teams to compete in other CTF competitions and internal security bug bounties might also help. Please don’t hesitate to contact us if you want tips on how to implement this for yourself or if you just want to let us know how it worked for you.

Intern Demo Day

As the summer comes to an end, so do the internships for numerous university students here at Bazaarvoice. This past week, the interns were given an opportunity to present a summary of their accomplishments. This afternoon of presentations, known as the Bazaarvoice “Intern Demo Day”, highlighted the various achievements throughout the company, not just in the R&D department.

The following is a short summary of the great work our interns complete this summer as well as some images from the “Intern Demo Day”.

CHASE PORTER: My project, which I have named “The Great (red)Shift”, is intended to improve data accessibility for computed aggregated counts of various canonical events written to HBase. To do this I designed a data warehouse in Amazon Redshift that I loaded with transformed aggregated counts extracted from the tables in HBase. This makes the counts readily SQL query-able in an incredibly fast system whereas before they had to be computed with performance heavy queries from Raw Logs generated by Cookie Monster. The biggest block for this project was in processing the data from HBase which was stored as serialized bytes and needed to be handled uniquely for different types of canonical events (i.e. pageviews, impressions, features) to translate into a readable form for Redshift.

BEN DEVORE: My product is web crawler written in node.js that scrapes clients’ webpages for product data in order to build their product feeds for them. For many of Bazaarvoice’s smaller clients, building and maintaining their product feed is a significant obstacle in the onboarding process. This tool aims to clear that obstacle by taking this task out of there hands.

STONEY MCCRARY: So I have been fortunate enough to get to work on several different pieces in curations but I am going to talk on what I have been hammering on for the last couple of weeks. More and more of our high volume clients are receiving millions of hits a day and this has caused performance to become a higher priority problem for them. In response to this, we are focusing our efforts on building a new display with performance in mind. Performance for the display centers around only providing the minimal amount of data needed and supply the rest as necessary. The piece I will be showing is the display carousel and how it dynamically loads and dumps the data to allow for faster loading and to keep browser memory low.

ZESHAN ANWAR: Eagle is a dashboard built for our Incubator team. With so many moving parts, it was important we had a summarized ‘birds-eye’ view of the team in one place. Eagle was initially meant to be an aggregation of all our Jenkin builds; a single view of all our jobs across our different Jenkins environments. However, it grew to also include JIRA and GitHub statistics. My other project was optimizing our UI tests by having them run concurrently. Our old UI tests were extremely slow, and by running them in parallel we drastically reduced test times.

BRENDON KELLEY: Testing Framework: This summer my project was to help build out a new testing framework for Curations. The current automation tests used for Curations is Saladhands. Before my internship, there wasn’t much if any automation tests for the submission/direct upload capability of Curations. I worked on creating tests and a CI environment for submission in a new testing framework called Intern. One of tests includes a language translation test using mongoDB as an endpoint to store the various languages’ text. Intern is a javascript based testing framework which will allow developers to contribute to writing tests since Curations is mostly javascript. I’ve also worked on updating and creating new console tests in this framework. The foundation built this summer in Intern will enable the ability to further contribute to the framework.

KRYSTINA DIAO: My main project for the summer was to analyze and report the effectiveness of the implementation of the new Connections Knowledgebase. Through Salesforce, I collected and analyzed the number of cases, time spent on each case, etc. After drawing my conclusions, I decided to present my findings via data visualization methods (JavaScript’s C3 and D3 libraries) and provide actionable insights on how this information can be leveraged. This information is valuable in that it can be used for future product KB decisions, as well as understanding how much time, manpower, and money is saved by having a KB.

MARKO SAVIC: Over the summer, I was a part of the SEO Team. I managed to create a tool on pagesManaged and keywordsManaged feed for every Spotlights client. Generated feeds will be consumed by SeoClarity tools on a daily basis. This helps in identifying search rank gains on the specified keywords and pages where Spotlights are present. The SeoClarity reporting will help in proving out Spotlights value and eventually lead to Spotlights renewal/upticks.Also, I created algorithm tweaks on the PINS (Post Interaction Notification System) Generator that take into account product activeness, product relevancy and review count, and use them to ask the user to write reviews on the most relevant products.

TREVOR NELLIGAN: Here is a description of my project: I worked on the Aperture Component library and many of the projects it supports. Aperture is build in React, and its purpose is to be used as an internal Bazaarvoice tool for constructing web pages. Using Aperture, anyone at Bazaarvoice can easily create a functional, intuitive, Bazaarvoice themed webpage, all with the building blocks Aperture provides.

Using the Aperture library, I helped the construction of numerous pages for the curations beta console. I personally built the interface for a new client-facing template builder, which will allow clients to create curations templates quickly and easily without having to go through an implementation engineer and a long process, as was the case previously. I also supplied custom Aperture components for several projects, like the content curation beta page.

RAMIE RAUFDEEN: The mixer is a component of our product recommendations engine which differentiates shoppers, and optimizes recommendations for them. This is primarily derived from their shopping behavior – in real time. Prior to the mixer, product recommendations were aggregated from multiple sources, using the same algorithm for every shopper. Shoppers are now categorized based off of a set of rules (using the shopper’s profile data), each of the rules map to a plan (which you can think of as an ‘algorithm’). A plan defines how recommendations should be mixed from each of the sources. For example, if source B has proven to have a higher conversion rate for ‘heavy-shoppers’, the plan for ‘heavy-shopper’ would give a higher weighting to source B. We can now target specific types of shoppers when it comes to product recommendations. This also sets the groundwork for a more granular machine learning implementation in the future.

We want to thank all the interns who spent time with us this summer and wish you the best back at school. We look forward to hearing about all the great things you all develop in the future.

If you are interested in an internship at Bazaarvoice, please contact kindall.heye@bazaarvoice.com.

HackerX event hosted at Bazaarvoice

The Bazaarvoice headquarters hosted the July 20th HackerX event in Austin, Texas. The event featured not only Bazaarvoice, but also included Facebook, Amazon, and Indeed. 70+ engineers participated in onsite interviews and networking. HackerX commented that “this was one of the most successful events” they have ever seen.

Gary Allison, Executive Vice President of Engineering, kicked off the event with a compelling message about Bazaarvoice and why this is an awesome place to work.

HackerX started in 2012 with invite-only, face-to-face recruiting events that connect tech talent with some of the world’s most innovative companies. Currently, they operate over 100+ events in 40+ cities, 15+ countries annually.

See www.hackerx.org for additional information.

Conversations API Deprecation for Versions 5.2 and 5.3

Today we are announcing an important change to our Conversations API service:

On April 30, 2017 service will end for Conversations API versions 5.2 and 5.3

By deprecating older versions of our API service, we can refocus our energies on the current and future API services, which we feel offer the most benefits to our customers. Please visit our Upgrade Guide to learn more about the Conversations API, our API versioning, and the steps necessary to support the upgrade.

We understand that this change will require effort on your part. Bazaarvoice is committed to making this transition easy for you. We are prepared to assist you in a number of ways:

Pre-notification: You have more than 12 months to plan for and implement the change.
Documentation: We have specific documentation to help you.
Support: Our support team is ready to address any questions you may have.
Services: Our services teams are available to provide additional assistance.

In summary, on April 30, 2017, Conversations API versions released before 5.4 will no longer be available. Applications and websites using versions before 5.4 will no longer function properly after April 30, 2017. If your custom application or website is making API calls to Conversations API versions 5.2 or 5.3 you will need to upgrade to the current Conversations API (5.4). Applications using Conversations API versions 5.4 and later will continue to receive uninterrupted API service.

If you have any questions about this notice, please submit a case in Spark. We will periodically update this blog and our developer Twitter feed (@BazaarvoiceDev) as we move closer to the change of service date.

Visit the following page Coversations API 2017 Deprecation to learn more.

Thank you for your partnership,
Chris Kauffman
Manager, Product Management

What does a data scientist do?

More and more companies and industries are grappling with the challenges of extracting value from large amounts of data. Data scientists, the people whose job it is to overcome these challenges, are becoming more prominent, yet what it is they do, and how they’re different than software engineers, is still a mystery to a lot of people. The goal of this article is to explain one of the most important tools that data scientists use: machine learning (ML). The bottom-line is: using ML is slow, costly, and error-prone, but it allows companies to achieve business objectives that are unattainable any other way.

Just like a software engineer, the goal of a data scientist is to develop programs that perform business functions. In software engineering, the engineer writes a program that encodes all of the “rules” for what the program is supposed to do, based on the requirements. For example, take the task of returning all of the product reviews for a given product ID. The rules here include things like how to make sure the product ID is valid (and what to do when it’s not), how to query a database of reviews with the product ID, and how to format the reviews to be returned. A reasonably skilled software engineer can easily write a program that encodes all of these rules.

However, there are many problems where it is not feasible for anyone to write down all of the rules required for a software program to perform some task. Sometimes this is because the rules are simply not known, and other times it’s because there are way too many rules. Several good example of the latter type come from natural language processing (NLP), like the problem of predicting the sentiment of movie reviews (i.e. did the reviewer like the movie or not?). Like nearly all NLP problems, this is something that a human could do reasonably well, but it doesn’t mean that a person could easily write down a set of rules for how they made their decisions. If you had to, you’d probably start by listing key words or phrases that would indicate the reviewer’s sentiment, like “great”, “liked”, “smash hit”, “boring” and “terrible”, and use their appearance in a review to judge the sentiment. This will only work so far, because language is much more complex than that. You’d need additional rules to get around the fact that many of the key words can mean different things in different contexts, more rules to cover all of they ways that you can express sentiment without using a key word, and even more rules to detect when sentiment gets flipped by a negating phrase. Add to this the fact that people don’t actually write in perfect English, and you’d end up with a seemingly endless list of rules and an impossible software engineering problem.

ML has completely different approach: instead of writing out the rules in a top-down fashion, ML attempts to infer the rules in a bottom-up way, from data. For the problem of predicting the sentiment of movie reviews, you’d take a set of movie reviews with the actual sentiment of each, and feed them into a ML program. This ML program then literally outputs another program that takes in a movie review and outputs a prediction of its sentiment (with some expected accuracy). One reason that ML can work (not perfectly) on NLP problems is because where a human would have a very hard time creating millions and millions of rules, a computer has no such limitation.

Of course, there are a few catches to using ML. First, the data is not always cheap. Sentiment analysis of movie reviews has been popular topic only because many movie reviews online come with ratings (usually on a scale of 1 to 5), which tell you the answer — the term for this in ML is “ground truth”. For many problems, the ground truth is not readily available and can be very costly to figure it out. A recent Kaggle competition on this topic used a dataset of 50,000 movie reviews — imagine needing to read every single one of these to determine the ground truth sentiment.

Another catch, which I’ve already mentioned, is that ML will rarely produce a program with perfect accuracy. This is because for most real-world problems, it’s impossible to provide the ML program with all of the relevant information. For NLP problems, humans have a wealth of knowledge about what words mean, while computers do not. Many other real-world problems involve predicting human behavior, but it’s impossible to observe everything that’s going on in people’s heads. While ML algorithms are designed to make the most out of limited information, they’re only viable as a solution when the business objectives can tolerate some amount of error.

ML solutions are also slow to develop, even more so than standard software engineering solutions. As mentioned earlier, ML solutions need to work with limited information, which means that it’s impossible to know whether ML will meet the business’s requirements beforehand. Effective data scientists will make educated guesses about the data, ML algorithms, and algorithm parameters that are most likely to succeed based on the problem, but experimentation is always required. This can mean multiple iterations refining the data, algorithms, and parameters before a definitive answer can be reached about whether an ML solution will work.

Last, ML solutions in production are costly to maintain. Their performance needs to be continuously monitored, because their performance can change over time as the characteristics of the data they’re analyzing changes. In the case of predicting the sentiment of movie reviews, just changes in writing style could drop the accuracy significantly. Processes and instrumentation are required to evaluate a statistically significant sample of the solution’s predictions, and take actions to improve performance whenever it drops such as creating a new program using newer data.

I hope this de-mystifies some of what data scientists do, and explains one of the important tools they use to extract value from large amounts of data.