Can You Keep a Secret?

We all have secrets. Sometimes, these are guilty pleasures that we try to keep hidden, like watching cheesy reality TV or indulging in strange comfort food. We often worry:

“How do we keep the secret safe?”

“What could happen if someone finds out the secret?”

“Who is keeping a secret?”

“What happens if we lose a secret?”

At Bazaarvoice, our security team starts with these same questions when it comes to secret management. They are less interested in trivial secrets like who left their dishes in the office sink or who finished the milk. Instead, they focus on the secrets used in the systems that power a wide range of products for the organization. They include:

  • API keys and tokens
  • Database connection strings that contain passwords
  • Encryption keys
  • Passwords

With hundreds of systems and components depending on secrets at Bazaarvoice, enumerating them and answering the above questions quickly, consistently and reliably can become a challenge without guiding standards. 

This post will discuss secret management and explain how Bazaarvoice implemented a Secret Catalog using open-source software. The post will provide reusable examples and helpful tips to reduce risk and improve compliance for your organization. Let’s dive in!

Secrets Catalog

Bazaarvoice is ISO 27001 compliant and ensures its systems leverage industry standard tools and practices to store secrets securely. However, secrets aren’t always stored in the same place or with the same tool. Sometimes AWS Secrets Manager makes sense; other times, AWS KMS is a better choice for a given solution. You may even have a multi-cloud strategy, further scattering secrets. This is where a Secrets Catalog is extremely useful, providing a unified view of secrets across tools and vendors.

It may sound a little boring, and the information captured isn’t needed most of the time. However, in the event of an incident, having a secret catalog becomes an invaluable resource in reducing the time you need to resolve the problem.

Bazaarvoice recognized the value of a secrets catalog and decided to implement it. The Security Team agreed that each entry in the catalog must satisfy the following criteria:

  • A unique name
  • A good description of its purpose
  • Clear ownership
  • Where it is stored
  • A list of dependent systems
  • References to documentation to remediate any related incident, for example, how to rotate an API key

Crucially, the value of the secret must remain in its original and secure location outside of the Catalog, but it is essential to know where the secret is stored. Doing so avoids a single point of failure and does not hinder any success criteria.

Understanding where secrets are stored helps identify bad practices. For example, keeping an SSH key only on team members’ laptops would be a significant risk. A person can leave the company, win the lottery, or even spill a drink on their laptop (we all know someone!). The stores already defined in the catalog guide people in deciding how to store a new secret, directing them to secure and approved tools resistant to human error.

Admittedly, the initial attempt to implement the catalog at Bazaarvoice didn’t quite go to plan. Teams began meeting the criteria, but it quickly became apparent that each team produced a different interpretation, stored in multiple places and formats. Security would not have a unified view if multiple secret catalogs ultimately existed. Teams needed additional guidance and a concrete standard to succeed.

We already have a perfect fit for this!

Bazaarvoice loves catalogs. After our clients, they might be our favourite thing. There is a product catalog for each of our over ten thousand clients, a data catalog, and, most relevantly, a service catalog powered by Backstage.

“Backstage unifies all your infrastructure tooling, services, and documentation to create a streamlined development environment from end to end.”

https://backstage.io/docs/overview/what-is-backstage

Out of the box, it comes with core entities enabling powerful ecosystem modeling:

https://backstage.io/docs/features/software-catalog/system-model#ecosystem-modeling

As shown in the diagram above, at the core of the Backstage model are Systems, Components, and Resources, the most relevant entities for secret management. You can find detailed descriptions of each entity in the Backstage modeling documentation. Briefly, they can be thought about as follows:

System – A collection of lower-level entities, including Components and Resources, acting as a layer of abstraction.

Component – A piece of software. It can optionally produce or consume APIs.

Resource – The infrastructure required by Components to operate.

New Resource Types

Resources are one of the Backstage entities used to represent infrastructure. They are a solid fit for representing secrets, allowing us to avoid writing custom in-house software to do the same job. Therefore, we defined two new Resource Types: secret and secret-store.

Tip: Agree on the allowed Types upfront to avoid a proliferation of variations such as ‘database’ and ‘db’ that can degrade searching and filtering.
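One way to hold teams to such an agreement is a small check that runs in CI. The sketch below is only an illustration, not Bazaarvoice's tooling: apart from 'secret' and 'secret-store', the allowed set, the alias table, and the function name are assumptions.

```python
from typing import Optional

# Hypothetical allow-list; only 'secret' and 'secret-store' come from this post.
ALLOWED_RESOURCE_TYPES = {"secret", "secret-store", "database"}

# Hypothetical variations mapped to the agreed spelling.
KNOWN_ALIASES = {"db": "database", "secrets": "secret", "secretstore": "secret-store"}


def check_resource_type(entity: dict) -> Optional[str]:
    """Return an error message for a disallowed Resource type, else None."""
    if entity.get("kind") != "Resource":
        return None  # only Resource entities are checked here
    resource_type = entity.get("spec", {}).get("type", "")
    if resource_type in ALLOWED_RESOURCE_TYPES:
        return None
    suggestion = KNOWN_ALIASES.get(resource_type.lower())
    hint = f"; did you mean '{suggestion}'?" if suggestion else ""
    return f"type '{resource_type}' is not an agreed Resource type{hint}"
```

The function takes an entity that has already been parsed into a dictionary, so it can sit behind whichever YAML loader a team already uses.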

Having already invested the effort in modeling the Bazaarvoice ecosystem, adding the necessary YAML to define secrets and secret stores was trivial for teams. 

Example minimal secret store:

apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: aws-secrets-manager
  description: Resources of type 'secret' can depend on this Resource to indicate that it is stored in AWS Secrets Manager
spec:
  type: secret-store
  owner: team-x

Example minimal secret:

apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: system-a-top-secret-key
  description: An example key stored in AWS Secrets Manager
  links:
    - url: https://internal-dev-handbook/docs/how-to-rotate-secrets
      title: Rotation guide
spec:
  type: secret
  owner: team-1
  system: system-a
  dependsOn:
    - resource:aws-secrets-manager

Finally, to connect the dots to existing Systems, Components, and Resources, simply add a dependsOn section to their definitions. For example:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: system-a-component
  ...
spec:
  ...
  dependsOn:
    - resource:system-a-top-secret-key
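
Once these relationships are recorded, questions such as "what is affected if this secret leaks?" become a simple lookup. The following is a minimal sketch over entities already parsed into dictionaries; the function name and the in-memory list are assumptions, and in practice Backstage answers this through its UI and API.

```python
from typing import List


def dependents_of(secret_name: str, entities: List[dict]) -> List[str]:
    """Names of entities whose spec.dependsOn references the given secret."""
    ref = f"resource:{secret_name}"
    return sorted(
        entity["metadata"]["name"]
        for entity in entities
        if ref in entity.get("spec", {}).get("dependsOn", [])
    )


# Dictionary versions of the example entities above.
catalog = [
    {"kind": "Resource", "metadata": {"name": "aws-secrets-manager"},
     "spec": {"type": "secret-store", "owner": "team-x"}},
    {"kind": "Resource", "metadata": {"name": "system-a-top-secret-key"},
     "spec": {"type": "secret", "dependsOn": ["resource:aws-secrets-manager"]}},
    {"kind": "Component", "metadata": {"name": "system-a-component"},
     "spec": {"dependsOn": ["resource:system-a-top-secret-key"]}},
]

print(dependents_of("system-a-top-secret-key", catalog))  # ['system-a-component']
```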

How does it look?

It’s fantastic in our eyes! Let’s break down the image above.

The “About” section explains the secret, including which system it’s associated with, its owner, and the fact that it’s currently in production. It also provides a link to the source code and a way to report issues or mistakes, such as typos.

The “Relations” section, powered by an additional plugin, provides a visual and interactive graph that displays the relationships associated with the secret. This graph helps users quickly build a mental picture of the resources in the context of their systems, owners, and components. Navigating through this graph has proven to be an effective and efficient mechanism for understanding the relationships associated with the secret.

The “Links” section offers a consistent place to reference documentation related to secret management. 

Lastly, the “PagerDuty” plugin integrates with the on-call system, eliminating the need for manual contact identification during emergency incidents.

The value of Backstage shines through the power of the available plugins. Searching and filtering empower discoverability, and the API opens the potential for further integrations to internal systems.

Keeping it fresh

Maintaining accurate and up-to-date documentation is always a challenge. Co-locating the service catalog with related codebases helps avoid the risk of it becoming stale and has become a consideration for reviewing and approving changes. 

We are early in our journey with this approach to a secrets catalog and aim to share our incremental improvements in further blog posts as we progress.

How We Scale to 16+ Billion Calls

The holiday season brings a huge spike in traffic for many companies. While increased traffic is great for retail business, it also puts infrastructure reliability to the test. At times when every second of uptime is of elevated importance, how can engineering teams ensure zero downtime and performant applications? Here are some key strategies and considerations we employ at Bazaarvoice as we prepare our platform to handle over 16 Billion API calls during Cyber Week.

Key to approaching readiness for peak load events is defining the scope of testing. Identify which services need to be tested and be clear about success requirements. A common trade-off is choosing between reliability and cost. When making this choice, reliability is always the top priority. ‘Customer is Key’ is a core value at Bazaarvoice and drives our decisions and behavior. Service Level Objectives (SLOs) drive clarity of reliability requirements for each of our services.

Reliability is always the top priority

When customer traffic is at its peak, reliability and uptime must take precedence over all other concerns. While cost efficiency is important, the customer experience is key during these critical traffic surges. Engineers should have the infrastructure resources they need to maintain stability and performance, even if it means higher costs in the short-term.

Thorough testing and validation well in advance is essential to surfacing any issues before the holidays. All critical customer-facing services undergo load and failover simulations to identify performance bottlenecks and points of failure. In a Serverless-first architecture, it is valuable to validate that configuration such as reserved concurrency and quota limits is sufficient for autoscaling requirements. Often these simulations will uncover problems you have not previously encountered. For example, in this year’s preparations our load simulations uncovered scale limitations in our Redis cache, which required fixes prior to Black Friday.
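
As a rough illustration of a concurrency check, the number of concurrent executions a function needs can be estimated as request rate multiplied by average duration (Little's law) and compared with the configured limit. The numbers, the headroom factor, and the function names below are assumptions for the sketch, not Bazaarvoice figures.

```python
import math


def required_concurrency(peak_rps: float, avg_duration_s: float, headroom: float = 1.5) -> int:
    """Estimate concurrent executions needed: rate x duration, plus headroom."""
    return math.ceil(peak_rps * avg_duration_s * headroom)


def concurrency_is_sufficient(limit: int, peak_rps: float, avg_duration_s: float) -> bool:
    """True if the configured limit covers the estimated requirement."""
    return limit >= required_concurrency(peak_rps, avg_duration_s)


# Illustrative numbers only: 2,000 requests/s at 125 ms each.
print(required_concurrency(2000, 0.125))            # 375
print(concurrency_is_sufficient(300, 2000, 0.125))  # False
```

An estimate like this is only a starting point; the load simulations are what confirm whether the limit holds in practice.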

“It’s not only about testing the ability to handle peak load”

It’s important to note that readiness is not only about testing the ability to handle peak load. Disaster recovery plans are validated through simulated scenarios. Runbooks are verified as up-to-date to ensure efficient incident response in the event something goes wrong. The instrumentation and infrastructure that support operability are tested too, ensuring our tooling works when we need it most.

Similarly, ensuring the appropriate tooling and processes are in place to address security concerns is another key concern. This includes preventing DDoS attacks, which could easily overwhelm the system and impact service availability if not identified and mitigated.

Predicting the future

Observability through actionable monitoring, logging, and metrics provides the essential visibility to detect and isolate emerging problems early. It also provides the historical context and growth of traffic data over time, which can help forecast capacity needs and establish performance baselines that align with real production usage. In addition to quantitative measures, proactively reaching out to clients keeps us in step with their expected traffic, helping align testing to actual usage patterns. This data is important for simulating real-world traffic patterns based on what has gone before, and has enabled us to accurately predict Black Friday traffic trends. However, it’s important that our systems are architected to scale with demand and handle unpredicted load if need be. Key to this is observing and understanding how our systems behave in production.

Traffic Trends

What did it look like this year? Consumer shopping patterns remained quite consistent on an elevated scale. Black Friday continues to be the largest shopping day of the year, and consumers continue to shop online in increasing numbers. During Cyber Week alone, Bazaarvoice handled over 16 Billion API calls.

Solving common problems once

While individual engineering teams own service readiness, having a coordinated effort ensures all critical dependencies are covered. Sharing forecasts, requirements, and learnings across teams enables better preparation. Testing surprises on dependent teams should be avoided through clear communication.

Automating performance testing, failover drills, and monitoring checks as part of regular release cycles or scheduled pipelines reduces the overhead of peak traffic preparation. Following site reliability principles and instilling always-ready operational practices makes services far more resilient year-round. 

For example, we recently put in place a shared dev pattern for continuous performance testing. This involves a quick setup of a k6 performance script, an example GitHub Actions pipeline, and observability configured to monitor performance over time. We also use an in-house Tech Radar to converge on common tooling so a greater number of teams can learn and stand on the shoulders of teams who have already tried and tested tooling in their context.

Other examples include adding automation to performance tests to replay production requests for a given load profile, which makes tests easier to maintain and reflects production behavior more accurately. Additionally, make use of automated fault injection tooling, chaos engineering, and automated runbooks.

Adding automation and ensuring these practices are part of your everyday way of working are key to reducing the overhead of preparing for the holidays.

Consistent, continuous training conditions us to always be ready

Moving to an always-ready posture ensures our infrastructure is scalable, reliable and robust all year round. Implementing continuous performance testing using frequent baseline tests provides frequent feedback on performance from release to release.  Automated operational readiness service checks ensure principles and expectations are in place for production services and are continuously checked.  For example, automated checking of expected monitors, alerts, runbooks and incident escalation policy requirements.
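
A readiness check of this kind can be as small as a function that reports which expected items a service is missing. This is a sketch under stated assumptions: the field names and the service record shape are hypothetical, and a real check would read them from monitoring and on-call tooling.

```python
from typing import List

# Hypothetical field names for the expectations listed above.
REQUIRED_FIELDS = ("monitors", "alerts", "runbook_url", "escalation_policy")


def readiness_gaps(service: dict) -> List[str]:
    """Return the required readiness items that are missing or empty."""
    return [field for field in REQUIRED_FIELDS if not service.get(field)]


service = {
    "name": "example-api",
    "monitors": ["latency-p99"],
    "alerts": [],
    "runbook_url": "https://example.invalid/runbook",
}

print(readiness_gaps(service))  # ['alerts', 'escalation_policy']
```

Running a check like this on a schedule, rather than once before the holidays, is what turns it into a continuous signal.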

At Bazaarvoice, our engineering teams align on shared System Standards, which give technical direction and guidance to engineers on commonly solved problems, continuously evolving our systems and increasing our innovation velocity. To use a trail running analogy, System Standards define the preferred paths and, combined with Tech Radar, provide recommendations to help you succeed. For example: what trail running shoes should I choose, what energy refuelling strategy should I use, how should I monitor performance? The same is true for building resilient, reliable software: as teams solve these common problems, they share the learnings with the teams that come after.

Looking Ahead

With a relentless focus on reliability, scalability, continuous testing, enhanced observability, and cross-team collaboration, engineering organizations can optimize performance and minimize downtime during critical traffic surges. 

Don’t forget: after the peak has passed and we have descended from the summit, analyze the data. What went well, what didn’t go well, and what opportunities are there to improve for the next peak?

Flyte 1 Year In

On the racetrack of building ML applications, traditional software development steps are often overtaken. Welcome to the world of MLOps, where unique challenges meet innovative solutions and consistency is king. 

At Bazaarvoice, training pipelines serve as the backbone of our MLOps strategy. They underpin the reproducibility of our model builds. A glaring gap existed, however, in graduating experimental code to production.

Rather than continuing down piecemeal approaches, which often centered heavily on Jupyter notebooks, we determined a standard was needed to empower practitioners to experiment and ship machine learning models extremely fast while following our best practices.

(cover image generated with Midjourney)

Build vs. Buy

Fresh off the heels of our wins from unifying our machine learning (ML) model deployment strategy, we first needed to decide whether to build a custom in-house ML workflow orchestration platform or seek external solutions.

Once we decided to “buy” (it is open source, after all), Flyte emerged as the clear choice for our workflow management platform. It saved invaluable development time and nudged our team closer to delivering a robust self-service infrastructure. Such an infrastructure allows our engineers to build, evaluate, register, and deploy models seamlessly. Rather than reinventing the wheel, Flyte equipped us with an efficient wheel to race ahead.

Before leaping with Flyte, we embarked on an extensive evaluation journey. Choosing the right workflow orchestration system wasn’t just about selecting a tool but also finding a platform to complement our vision and align with our strategic objectives. We knew the significance of this decision and wanted to ensure we had all the bases covered. Ultimately the final tooling options for consideration were Flyte, Metaflow, Kubeflow Pipelines, and Prefect.

To make an informed choice, we laid down a set of criteria:

Criteria for Evaluation

Must-Haves:

  • Ease of Development: The tool should intuitively aid developers without steep learning curves.
  • Deployment: Quick and hassle-free deployment mechanisms.
  • Pipeline Customization: Flexibility to adjust pipelines as distinct project requirements arise.
  • Visibility: Clear insights into processes for better debugging and understanding.

Good-to-Haves:

  • AWS Integration: Seamless integration capabilities with AWS services.
  • Metadata Retention: Efficient storage and retrieval of metadata.
  • Startup Time: Speedy initialization to reduce latency.
  • Caching: Optimal data caching for faster results.

Neutral, Yet Noteworthy:

  • Security: Robust security measures ensuring data protection.
  • User Administration: Features facilitating user management and access control.
  • Cost: Affordability – offering a good balance between features and price.

Why Flyte Stood Out: Addressing Key Criteria

Diving deeper into our selection process, Flyte consistently addressed our top criteria, often surpassing the capabilities of other tools under consideration:

  1. Ease of Development: Pure Python | Task Decorators
    • Python-native development experience
  2. Deployment: Kubernetes Cluster
    • Flyte’s native Kubernetes integration simplified the deployment process
  3. Pipeline Customization
    • Easily customize any workflow and task by modifying the task decorator
  4. Visibility
    • Easily accessible container logs
    • Flyte decks enable reporting visualizations 
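
To give a flavor of the Python-native experience, here is a minimal sketch of a Flyte workflow built from decorated functions. The task names and logic are invented for illustration, and the fallback decorators are an assumption added only so the snippet also runs where flytekit is not installed.

```python
from typing import List

try:
    from flytekit import task, workflow
except ImportError:
    # Fallback so the sketch runs without flytekit: pass-through decorators.
    def task(fn=None, **_options):
        return fn if fn is not None else (lambda f: f)

    workflow = task


@task(retries=2)  # behavior is adjusted through decorator arguments
def drop_negatives(values: List[float]) -> List[float]:
    return [v for v in values if v >= 0]


@task
def mean(values: List[float]) -> float:
    return sum(values) / len(values)


@workflow
def example_pipeline(values: List[float]) -> float:
    # Flyte workflows call tasks with keyword arguments.
    return mean(values=drop_negatives(values=values))


print(example_pipeline(values=[1.0, -1.0, 3.0]))  # 2.0
```

The same functions run locally as plain Python and remotely as containerized tasks, which is a large part of the appeal.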

The Bazaarvoice customization

As with any platform, while Flyte brought many advantages, we needed a different plug-and-play solution for our unique needs. We anticipated the platform’s novelty within our organization. We wanted to reduce the learning curve as much as possible and allow our developers to transition smoothly without being overwhelmed.

To smooth the transition and expedite the development process, we’ve developed a cookiecutter template to serve as a launchpad for developers, providing a structured starting point that’s standardized and aligned with best practices for Flyte projects. This structure empowers developers to swiftly construct training pipelines.

The most relevant files provided by the template are:

  • Pipfile    - Details project dependencies 
  • Dockerfile - Builds docker container
  • Makefile   - Helper file to build, register, and execute projects
  • README.md  - Details the project 
  • src/
    • tasks/
    • workflows.py (Follows the Kedro Standard for Data Layers)
      • process_raw_data - workflow to extract, clean, and transform raw data 
      • generate_model_input - workflow to create train, test, and validation data sets 
      • train_model - workflow to generate a serialized, trained machine learning model
      • generate_model_output - workflow to prevent train-serving skew by performing inference on the validation data set using the trained machine learning model
      • evaluate - workflow to evaluate the model on a desired set of performance metrics
      • reporting - workflow to summarize and visualize model performance
      • full - complete Flyte pipeline to generate trained model
  • tests/ - Unit tests for your workflows and tasks
  • run - Simplifies running of workflows

In addition, a common challenge in developing pipelines is needing resources beyond what our local machines offer. Or, there might be tasks that require extended runtimes. Flyte does grant the capability to develop locally and run remotely. However, this involves a series of steps:

  • Rebuild your custom docker image after each code modification.
  • Assign a version tag to this new docker image and push it to ECR.
  • Register this fresh workflow version with Flyte, updating the docker image.
  • Instruct Flyte to execute that specific version of the workflow, parameterizing via the CLI.

To circumvent these challenges and expedite the development process, we designed the template’s Makefile and run script to abstract the series of steps above into a single command!

./run --remote src/workflows.py full
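
To illustrate the idea (this is not the template's actual script), a wrapper can assemble the build, push, package, and register steps as an ordered plan and stop at the first failure. The sketch assumes Makefile targets with those names and a hypothetical VERSION variable for tagging; it defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import List


def remote_run_plan(version: str) -> List[List[str]]:
    """Commands a wrapper script might run, in order, before a remote execution."""
    targets = ["build", "push", "package", "register"]
    # VERSION is a hypothetical Makefile variable used to tag the image and workflow.
    return [["make", target, f"VERSION={version}"] for target in targets]


def execute(plan: List[List[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, run it."""
    for command in plan:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # raises on the first failing step


execute(remote_run_plan("abc123"))
```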

The Makefile uses a couple of helper targets, but overall provides the following commands:

  • info       - Prints info about this project
  • init       - Sets up project in flyte and creates an ECR repo 
  • build      - Builds the docker image 
  • push       - Pushes the docker image to ECR 
  • package    - Creates the flyte package 
  • register   - Registers version with flyte
  • runcmd     - Generates run command for both local and remote
  • test       - Runs any tests for the code
  • code_style - Applies black formatting & flake8

Key Triumphs

With Flyte as an integral component of our machine learning platform, we’ve achieved unmatched momentum in ML development. It enables swift experimentation and deployment of our models, ensuring we always adhere to best practices. Beyond aligning with fundamental MLOps principles, our customizations ensure Flyte perfectly meets our specific needs, guaranteeing the consistency and reliability of our ML models.

Closing Thoughts

Just as racers feel exhilaration crossing the finish line, our team feels an immense sense of achievement seeing our machine learning endeavors zoom with Flyte. As we gaze ahead, we’re optimistic, ready to embrace the new challenges and milestones that await. 🏎️

If you are drawn to this type of work, check out our job openings.

What was Old is New: Finding Joy in Modernising Legacy Systems

(cover image from ThisisEngineering RAEng)

Let’s face it: software is easier to write than maintain. This is why we, as software engineers, prefer to just “rip it out and start over” instead of trying to understand what another developer (or our past self) was thinking. We seem to have collectively forgotten that “programs must be written for people to read, and only incidentally for machines to execute”. You know it’s true — we’ve all had to painstakingly trace through a casserole of spaghetti code and thin, old-world-style abstractions digging for the meat of the program only to find nothing but a mess at the bottom of our plates.

It’s easy to yell “WTF” and blame the previous dev, but the truth is often more complicated. We can’t see the future, so it’s impossible to understand how requirements, technology, or business goals will grow when we design a net-new system. As a result, systems can become unreadable as their scope increases along with the business’s dependency on them. This is a bit of a paradox: older, harder to maintain systems often provide the most value. They are hard to work on because they’ve grown with the company, and scary to work on because breaking it could be a catastrophe.

Here’s where I’m calling you out: if you like hard, rewarding problems… try it. Take the oldest system you have and make it maintainable. You know the one I’m talking about: the one no one will “own”. That one the other departments depend on but engineers hate. The one you had to patch Log4Shell on first. Do it. I dare you.

I recently had such an opportunity to update a decade-old machine learning system at Bazaarvoice. On the surface, it didn’t sound exciting: this thing didn’t even have neural networks! Who cares! Well… it mattered. This system processes nearly every user-generated product review received by Bazaarvoice (nearly 9 million per month) and does so with 90 million inference calls to machine learning models. Yup, 90 million inferences! It’s a huge scale, and I couldn’t wait to dive in.

In this post, I’ll share how modernizing this system through a re-architecture, instead of a re-write, allowed us to make it scalable and cost-effective without having to rip out all of the code and start over. The resulting system is serverless, containerized, and maintainable while reducing our hosting costs by nearly 80%. This post is a companion piece to a talk I recently presented at AWS Data Summit for Software Companies on generating value from data by leveraging our best practices to ensure success in machine learning projects. This post discusses the technical aspects in more detail, but you can watch the high-level talk linked at the end.

Something Old

First, let’s take a look at what we’re dealing with here. The legacy system my team was updating moderates user-generated content for all of Bazaarvoice. Specifically, it determines if each piece of content is appropriate for our client’s websites.

Photo by Diane Picchiottino

This sounds straightforward — eliminate obvious infractions such as hate speech, foul language, or solicitations — but in practice, it’s much more nuanced. Each client has unique requirements for what they consider appropriate. Beer brands, for example, would expect discussions of alcohol, but a children’s brand may not. We capture these client-specific options when we onboard new clients, and our Client Services team encodes them into a management database.

For some added complexity, we also sample a subset of our content to be moderated by human moderators. This allows us to continuously measure the performance of our models and discover opportunities for building more models.

The full architecture of our legacy system is shown below:

Our legacy moderation system hosted machine learning models on a single EC2 instance. This made deployments slow and limited scalability to the host’s memory size.

This system has some serious drawbacks. Specifically — all of the models are hosted on a single EC2 instance. This wasn’t due to bad engineering — just the inability of the original programmers to foresee the scale desired by the company. No one thought that it would grow as much as it did. In addition, the system suffered from developer rejection: it was written in Scala, which few engineers understood. Thus, it was often overlooked for improvement since no one wanted to touch it.

As a result, the system continued to grow in a keep-the-lights-on manner. Once we got around to re-architecting it, it was running on a single x1e.8xlarge instance. This thing had nearly a terabyte of RAM and cost about $5,000/month (unreserved) to operate. Don’t worry, though, we just launched a second one for redundancy and a third for QA 🙃.

This system was costly to run and was at a high risk of failure (a single bad model can take down the whole service). Furthermore, the code base had not been actively developed and was thus significantly out of date with modern data science packages and did not follow our standard practices for services written in Scala.

Something New

When redesigning this system we had a clear goal: make it scalable. Reducing operating costs was a secondary goal, as was easing model and code management.

The new design we came up with is illustrated below:

Our new architecture deploys each model to a SageMaker Serverless endpoint. This lets us scale the number of models without limit while maintaining a small cost footprint.

Our approach to solving all of this was to put each machine learning model on an isolated SageMaker Serverless endpoint. Like AWS Lambda functions, serverless endpoints turn off when not in use — saving us runtime costs for infrequently used models. They also can scale out quickly in response to increases in traffic.
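
For a sense of what one endpoint per model involves, the sketch below builds the request for SageMaker's CreateEndpointConfig API with a serverless variant. The model name, the naming scheme, and the memory and concurrency values are assumptions, not our production settings.

```python
def serverless_endpoint_config(model_name: str, memory_mb: int = 2048, max_concurrency: int = 5) -> dict:
    """Request body for CreateEndpointConfig with a single serverless variant."""
    return {
        "EndpointConfigName": f"{model_name}-serverless",  # hypothetical naming scheme
        "ProductionVariants": [
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "ServerlessConfig": {
                    "MemorySizeInMB": memory_mb,
                    "MaxConcurrency": max_concurrency,
                },
            }
        ],
    }


config = serverless_endpoint_config("example-moderation-model")
print(config["ProductionVariants"][0]["ServerlessConfig"])
```

With boto3 installed and AWS credentials configured, a dictionary like this could be passed as `boto3.client("sagemaker").create_endpoint_config(**config)`; the sketch stops short of that call so it runs anywhere.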

In addition, we exposed the client options to a single microservice that routes content to the appropriate models. This was the bulk of the new code we had to write: a small API that was easy to maintain and let our data scientists more easily update and deploy new models.
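
Conceptually, the routing step is a lookup from client options to the endpoints that should see a piece of content. The following is a simplified sketch: the client IDs, option names, and endpoint naming are hypothetical, and the real service reads options from the management database rather than a dictionary.

```python
from typing import List

# Hypothetical client options: the moderation models each client has enabled.
CLIENT_OPTIONS = {
    "beer-brand": {"profanity", "solicitation"},
    "kids-brand": {"profanity", "solicitation", "alcohol"},
}


def endpoints_for(client_id: str) -> List[str]:
    """Endpoint names to call for a client's content, one per enabled model."""
    enabled = CLIENT_OPTIONS.get(client_id, set())
    return sorted(f"moderation-{model}" for model in enabled)


print(endpoints_for("beer-brand"))  # ['moderation-profanity', 'moderation-solicitation']
```

Because routing is just data, pointing a client at an existing model requires no deployment at all.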

This approach has the following benefits:

  • Decreased the time to value by over 6x. Specifically, routing traffic to existing models is instantaneous, and deploying new models can be done in under 5 minutes instead of 30.
  • Scale without limit – we currently have 400 models but plan to scale to thousands to continue to increase the amount of content we can automatically moderate.
  • Saw a cost reduction of 82% moving off EC2 as the functions turn off when not in use, and we’re not paying for top-tier machines that are underutilized.

Simply designing an ideal architecture, however, isn’t the hard (or really interesting) part of rebuilding a legacy system: you have to migrate to it.

Something Borrowed

Our first challenge in migration was figuring out how the heck to migrate a Java WEKA model to run on SageMaker, let alone SageMaker Serverless.

Fortunately, SageMaker deploys models in Docker containers, so at least we could freeze the Java and dependency versions to match our old code. This would help ensure the models hosted in the new system returned the same results as the legacy one.

Photo from JJ Ying

To make the container compatible with SageMaker, all you need to do is implement a few specific HTTP endpoints:

  • POST /invocations - accepts input, performs inference, and returns results
  • GET /ping - returns 200 if the JVM server is healthy

(We chose to ignore all of the cruft around BYO multimodel containers and the SageMaker inference toolkit.)

A few quick abstractions around com.sun.net.httpserver.HttpServer and we were ready to go.
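
Our implementation lives on the JVM, but the container contract is small enough to show in a few lines. Below is a Python analogue using only the standard library, offered as a sketch rather than our actual code: the `predict` placeholder and its output shape are assumptions, while `/ping`, `/invocations`, and port 8080 are what SageMaker expects from a container.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload: dict) -> dict:
    # Placeholder for real model inference.
    return {"approved": "spam" not in payload.get("text", "").lower()}


class InferenceHandler(BaseHTTPRequestHandler):
    def _send(self, status: int, body: bytes = b"") -> None:
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_GET(self) -> None:
        # Health check: 200 tells SageMaker the container is ready.
        self._send(200 if self.path == "/ping" else 404)

    def do_POST(self) -> None:
        if self.path != "/invocations":
            self._send(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        self._send(200, json.dumps(predict(payload)).encode())

    def log_message(self, *args) -> None:
        pass  # keep the example quiet


def serve(port: int = 8080) -> None:
    """Block forever, serving the two endpoints on the given port."""
    HTTPServer(("0.0.0.0", port), InferenceHandler).serve_forever()
```

Calling `serve()` starts the server; a request such as `curl -X POST localhost:8080/invocations -d '{"text": "Great product"}'` exercises the inference path.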

And you know what? This was actually pretty fun. Toying with Docker containers and forcing something 10 years old into SageMaker Serverless had a bit of a tinkering vibe to it. It was pretty exciting when we got it working, especially when we got the legacy code to build in our new sbt stack instead of Maven. The new sbt stack made it easy to work on, and containerization ensured we could get proper behavior while running in the SageMaker environment.

Something Blue

So we have the models in containers and can deploy them to SageMaker — almost done, right? Not quite.

Photo by Tim Gouw

The hard lesson about migrating to a new architecture is that you must build three times your actual system just to support migration.

In addition to the new system, we had to build:

  • A data capture pipeline in the old system to record inputs and outputs from the model. We used these to confirm that the new system would return the same results.
  • A data processing pipeline in the new system to compute results and compare them to the data from the old system. This involved a large amount of measurement with Datadog and needed to offer the ability to replay data when we found discrepancies.
  • A full model deployment system to avoid impacting the users of the old system (which would simply upload models to S3). We knew we wanted to move them to an API eventually, but for the initial release, we needed to do so seamlessly.

All of this was throw-away code we knew we could toss once we finished migrating all of the users, but we still had to build it and ensure that the outputs of the new system matched the old.
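
At its core, the comparison is a diff of captured outputs keyed by content. Here is a minimal sketch of that idea, assuming both systems' results have been loaded into dictionaries keyed by a content ID; the IDs, labels, and function name are illustrative.

```python
def compare_outputs(legacy: dict, candidate: dict) -> dict:
    """Compare outputs keyed by content ID; report mismatches and the match rate."""
    shared = legacy.keys() & candidate.keys()
    mismatches = sorted(cid for cid in shared if legacy[cid] != candidate[cid])
    match_rate = 1.0 if not shared else 1 - len(mismatches) / len(shared)
    return {"compared": len(shared), "mismatches": mismatches, "match_rate": match_rate}


legacy = {"r1": "approve", "r2": "reject", "r3": "approve"}
candidate = {"r1": "approve", "r2": "approve", "r4": "reject"}

print(compare_outputs(legacy, candidate))
# {'compared': 2, 'mismatches': ['r2'], 'match_rate': 0.5}
```

The list of mismatched IDs is what makes replay useful: each one is a concrete input to re-run and investigate.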

Expect this upfront.

While building the migration tools and systems certainly took more than 60% of our engineering time on this project, it too was a fun experience. Unit testing became more like data science experiments: we wrote whole suites to ensure that our output matched exactly. It was a different way of thinking that made the work just that much more fun. A step outside our normal boxes, if you will.

So… Just Try It

Next time you’re tempted to rebuild a system from code up, I’d like to encourage you to try migrating the architecture instead of the code. You’ll find interesting and rewarding technical challenges and will likely enjoy it much more than debugging unexpected edge cases of your new code.

The talk I gave is a bit more high-level and goes into the MLOps side of things.

Kedro 6 Months In

We build AI software in two modes: experimentation and productization. During experimentation, we are trying to see if modern technology will solve our problem. If it does, we move on to productization and build reliable data pipelines at scale.

This presents a cyclical dependency when it comes to data engineering. We need reliable and maintainable data engineering pipelines during experimentation, but don’t know what that pipeline should do until after we’ve completed the experiments. In the past, I and many data scientists I know have used an ad-hoc combination of bash scripts and Jupyter Notebooks to wrangle experimental data. While this may have been the fastest way to get experimental results and model building, it’s really a technical debt that has to be paid down the road.

The Problem

Specifically, the ad-hoc approach to experimental data pipelines causes pain points around:

  • Reproducibility: Ad-hoc experimentation structures put you at risk of making results that others can’t reproduce, which can lead to product downtime if or when you need to update your approach. Simple mistakes like executing a notebook cell twice or forgetting to seed a random number generator can usually be caught. But other, more insidious problems can occur, such as behavior changes between dependency versions.
  • Readability: If you’ve ever come across another person’s experimental code, you know it’s hard to find where to start. Even documented projects might just say “run x script, y notebook, etc”, and it’s often unclear where the data come from and if you’re on the right track. Similarly, code reviews for data science projects are often hard to read: it’s asking a lot for a reader to differentiate between notebook code for data manipulation and code for visualization.
  • Maintainability: It’s common during data science projects to do some exploratory analysis or generate early results, and then revise how your data is processed or gathered. This becomes difficult and tedious when all of these steps are an unstructured collection of notebooks or scripts. In other words, the pipeline is hard to maintain: updating or changing it requires you to keep track of the whole thing.
  • Shareability: Ad-hoc collections of notebooks and bash scripts are also difficult for a team to work on concurrently. Each member has to ensure their notebooks are up to date (version control on notebooks is less than ideal), and that they have the correct copy of any intermediate data.

Enter Kedro

A lot of the issues above aren’t new to the software engineering discipline and have been largely solved in that space. This is where Kedro comes in. Kedro is a framework for building data engineering pipelines whose structure forces you to follow good software engineering practices. By using Kedro in the experimentation phase of projects, we build maintainable and reproducible data pipelines that produce consistent experimental results.

Specifically, Kedro has you organize your data engineering code into one or more pipelines. Each pipeline consists of a number of nodes: a functional unit that takes some data sets and parameters as inputs and produces new data sets, models, or artifacts.

This simple but strict project structure is augmented by their data catalog: a YAML file that specifies how and where the input and output data sets are to be persisted. The data sets can be stored either locally or in a cloud data storage service such as S3.
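To illustrate the idea (and only the idea), here is a toy sketch in plain Python. It is not Kedro's API: Node, run_pipeline, and the data set names are hypothetical, and the "catalog" is just an in-memory dict rather than a YAML file that persists data.

```python
from typing import Callable, NamedTuple


class Node(NamedTuple):
    func: Callable
    inputs: list   # names of the data sets or parameters the node reads
    output: str    # name of the data set the node produces


def run_pipeline(nodes, catalog):
    """Run each node once its named inputs exist, storing outputs by name."""
    data = dict(catalog)
    pending = list(nodes)
    while pending:
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("some nodes have inputs that nothing produces")
        for node in ready:
            data[node.output] = node.func(*(data[name] for name in node.inputs))
            pending.remove(node)
    return data


def clean(rows):
    """Strip whitespace, lowercase, and drop empty rows."""
    return [row.strip().lower() for row in rows if row.strip()]


def count_tokens(rows, min_len):
    """Count whitespace-separated tokens with at least min_len characters."""
    return sum(1 for row in rows for token in row.split() if len(token) >= min_len)


# Declaration order does not matter: nodes are wired together by data set name.
pipeline = [
    Node(count_tokens, ["clean_rows", "params:min_len"], "token_count"),
    Node(clean, ["raw_rows"], "clean_rows"),
]
```

Because nodes only communicate through named data sets, each one stays a small, testable function, which is the property the rest of this post leans on.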

I started using Kedro about six months ago, and since then have leveraged it for different experimental data pipelines. Some of these pipelines were for building models that eventually were deployed to production, and some were collaborations with team members. Below, I’ll discuss the good and bad things I’ve found with Kedro and how it helped us create reproducible, maintainable data pipelines.

The Good

  • Reproducibility: I can’t say enough good things here: they nailed it. Their dependency management took a bit of getting used to but it forces a specific version on all dependencies, which is awesome. Also, the ability to just type kedro install and kedro run to execute the whole pipeline is fantastic. You still have to remember to seed random number generators, but even that is easy to remember if you put it in their params.yml file.
  • Function Isolation: Kedro’s fixed project structure encourages you to think about what logical steps are necessary for your pipeline, and write a single node for each step. As a result, each node tends to be short (in terms of lines of code) and specific (in terms of logic). This makes each node easy to write, test, and read later on.
  • Developer Parallelization: The small nodes also make it easier for developers to work together concurrently. It’s easy to spot nodes that won’t depend on each other, and they can be coded concurrently by different people.
  • Intermediate Data: Perhaps my favorite thing about Kedro is the data catalog. Just add the name of an output data set to catalog.yml and BOOM, it’ll be serialized to disk or your cloud data store. This makes it super easy to build up the pipeline: you work on one node, commit it, execute it, and save the results. It also comes in handy when working on a team. I can run an expensive node on a big GPU machine and save the results to S3, and another team member can simply start from there. It’s all baked in.
  • Code Re-usability: I’ll admit I have never re-used a notebook. At best I pulled up an old one to remind myself how I achieved some complex analysis, but even then I had to remember the intricacies of the data. The isolation of nodes, however, makes it easy to re-use them. Also, Kedro’s support for modular pipelines (i.e., packaging a pipeline into a pip package) makes it simple to share common code. We’ve created modular pipelines for common tasks such as image processing.

The Bad

While Kedro has solved many of the quality challenges in experimental data pipelines, we have noticed a few gotchas that required less-than-elegant workarounds:

  • Incremental Dataset: This support exists for reading data, but it’s lacking for writing datasets. This affected us a few times when we had a node that would take 8-10 hours to run. We lost work if the node failed part of the way through. Similarly, if the result data set didn’t fit in memory, there wasn’t a good way to save incremental results since the writer in Kedro assumes all partitions are in memory. This GitHub issue may fix it if the developers address it, but for now you have to manage partial results on your own.
  • Pipeline Growth: Pipelines can quickly get hard to follow since the inputs and outputs are just named variables that may or may not exist in the data catalog. Kedro Viz helps with this, but it’s a bit annoying to switch between the navigator and code. We’ve also started enforcing name consistency between the node names and their functions, as well as the data set names in the pipeline and the argument names in the node functions. Finally, making more, smaller pipelines is also a good way to keep your sanity. While all of these techniques help you to mentally keep track, it’s still the trade-off you make for coding the pipelines by naming the inputs and outputs.
  • Visualization: This isn’t really considered much in Kedro, and is the one thing I’d say notebooks still have a leg up on. Kedro makes it easy for you to load the Kedro context in a notebook, however, so you can still fire one up to do some visualization. Ultimately, though, I’d love to see better support within Kedro for producing a graphical report that gets persisted to the 08_reporting layer. Right now we work around this by making a node that renders a notebook to disk, but it’s a hack at best. I’d love better support for generating final, highly visual reports that can be versioned in the data catalog much like the intermediate data.
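For the incremental-write gap, "managing partial results on your own" can look something like the sketch below. It is independent of Kedro; the function name, the part-file naming scheme, and the JSON format are assumptions chosen for illustration.

```python
import json
from pathlib import Path


def process_in_chunks(chunks, work, out_dir):
    """Write each chunk's result to its own file, skipping chunks already on disk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, chunk in enumerate(chunks):
        target = out_dir / f"part-{index:05d}.json"
        if not target.exists():
            # Write to a temporary file first, then rename, so a crash
            # never leaves a half-written part behind.
            tmp = target.with_suffix(".tmp")
            tmp.write_text(json.dumps(work(chunk)))
            tmp.replace(target)
        written.append(target)
    return written
```

If a long-running node dies partway through, rerunning it only recomputes the parts that are missing, and no more than one chunk's result has to be held in memory at a time.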

Conclusion

So am I a Kedro convert? Yah, you betcha. It replaces the spider-web of bash scripts and Python notebooks I used to use for my experimental data pipelines and model training, and enables better collaboration among our teams. It won’t replace a fully productionalized stream-based data pipeline for me, but it absolutely makes sure my experimental pipelines are maintainable, reproducible, and shareable.

Root Cause Analysis for Hadoop Applications

Parth Shah and Thai Bui

Overview

One of the reasons why Hadoop jobs are hard to operate is their inability to provide clear, actionable error diagnostic messages for users. This stems from the fact that Hadoop consists of many interrelated components. When a component fails or behaves poorly, the failure cascades to its dependent components, which causes the job to fail.

This blog post is an attempt to help solve that problem by creating a user-friendly, self-service, and actionable Hadoop diagnostic system.

Our Goals

Due to its complex nature, the project was split into multiple components. First, we prototyped a diagnostic tool to help debug Hadoop application failures by providing a clear root cause analysis and saving engineering time. Second, we purposely inflicted failures on a cluster (via a method called chaos testing) and collected the data to understand how certain log messages map to errors. Finally, we examined regular expressions as well as natural language processing (NLP) techniques to automatically provide root cause analysis in production.

To document this, we organized the blog in the following sections:

  • Error Message Analytics Portal
    • To provide a quick glance at the known root cause.
  • Datadog Dashboard
    • To calculate failure rates related to the unknown root cause and known root cause.
    • Separate infrastructure failure from a missing data failure (missing partition).
  • Data Access
    • All relevant log messages from services like Yarn, Oozie, HDFS, Hive server, etc. were collected and stored in an S3 bucket with an expiration policy.
  • Chaos Data Generation
    • Using chaos testing, we produced actual errors related to memory, network, etc. This was done to understand relationships between log messages and root cause errors.
    • A service was made to create an efficient and simple way to run chaos tests and collect its corresponding/related log data.
  • Diagnostic Message Classification
    • Due to the simple and repetitive nature of log messages (low in entropy), we built a natural language processing model that classifies the specific error type of an unknown failure.
    • We tested the model on the chaos data for the specific workload at Bazaarvoice.

Error Message Analytics Portal

Bazaarvoice provides an internal portal for end users to manage analytics applications. The following screenshot shows an example of a “Job failures due to missing data” message. This message is caught by applying a simple regular expression to the stack trace. A regular expression works here because there is only one way a job can fail due to missing data.



Datadog Dashboard

What is a partition?

Partitioning is a strategy of dividing a table into related parts based on date, components or other categories. Using partitions, it is easier and faster to query a portion of the data. However, the partition a job is querying is sometimes unavailable, which by design causes the job to fail. The dashboard below calculates metrics and keeps track of jobs that failed due to an unavailable partition.

The dashboard classifies failed jobs as either a partition failure (late/missing data) or an unknown failure (Hadoop application failure). Our diagnostic tool attempts to find the root cause of the unknown failures since late or missing data is an easy problem to solve.



Data Access

Since our clusters are powered by Apache Ambari, we leveraged and enhanced Ambari Logsearch Logfeeder to ship logs of relevant services directly to S3 and partitioned the data as shown in the raw_log section of the Directory Tree Diagram below. However, as the dataset got bigger, partitioning alone was not enough to query the data efficiently. To speed up read performance and to iterate faster, the data was later converted into ORC format.



Convert JSON logs to ORC logs

DROP TABLE IF EXISTS default.temp_table_orc;
CREATE EXTERNAL TABLE default.temp_table_orc (
cluster STRING,
file STRING,
thread_name STRING,
level STRING,
event_count INT,
ip STRING,
type STRING,
...
hour INT,
component STRING
) STORED AS ORC
LOCATION 's3a://<bucket_name>/orc-log/${workflowYear}/${workflowMonth}/${workflowDay}/${workflowHour}/';
INSERT INTO default.temp_table_orc SELECT * FROM bazaar_magpie_rook.rook_log
WHERE year=${workflowYear} AND month=${workflowMonth} AND day=${workflowDay} AND hour=${workflowHour};

Sample Queries for Root Cause Diagnosis

SELECT t1.log_message,
       t1.logtime,
       t1.level,
       t1.type,
       t2.Frequency
FROM
  (SELECT log_message,
          logtime,
          TYPE,
          LEVEL
   FROM
     ( SELECT log_message,
              logtime,
              TYPE,
              LEVEL,
              row_number() over (partition BY log_message
                                 ORDER BY logtime) AS r
      FROM bazaar_magpie_rook.rook_log
      WHERE cluster_name = 'cluster_name'
        AND DAY = 28
        AND MONTH=06
        AND LEVEL = 'ERROR' ) S
   WHERE S.r = 1 ) t1
LEFT JOIN
  (SELECT log_message,
          COUNT(log_message) AS Frequency
   FROM bazaar_magpie_rook.rook_log
   WHERE cluster_name = 'cluster_name'
     AND DAY = 28
     AND MONTH=06
     AND LEVEL = 'ERROR'
   GROUP BY log_message) t2 ON t1.log_message = t2.log_message
WHERE t1.logtime BETWEEN '2019-06-28 13:05:00' AND '2019-06-28 13:30:00'
ORDER BY t1.logtime
LIMIT 400;

SELECT log_message, logtime, type
FROM bazaar_magpie_rook.rook_log
WHERE level = 'ERROR'
  AND type != 'logsearch_feeder'
  AND logtime BETWEEN '2019-06-24 09:05:00' AND '2019-06-24 12:30:10'
ORDER BY logtime
LIMIT 1000;

SELECT log_message, COUNT(log_message) AS Frequency
FROM  bazaar_magpie_rook.rook_log
WHERE cluster_name = 'dev-blue-3' AND level = 'ERROR' AND logtime BETWEEN '2019-06-27 13:50:00' AND '2019-06-29 14:30:00'
GROUP BY log_message
ORDER BY COUNT(log_message) DESC
LIMIT 10;

Chaos Data Generation

The chaos data generation process makes up the bulk of this project and is its most important part. It stems from chaos testing: experimenting on software and infrastructure in order to understand the system’s ability to withstand unexpected or turbulent conditions. This concept of testing was first introduced by Netflix in 2011. The following pseudocode explains how it works.

Submit a normal job through an API on a specific cluster

Create a list of IP addresses for all the testable nodes in that cluster (such as the worker nodes).

Once we have all associated nodes, inject failures into them (stress test memory or network)

While the job is not done

    Let the job run

At this point, the job has either failed or succeeded

Stop all stress tests

Gather the job details such as start time, end time, status, and most importantly its associated stress test type (memory, packet_corruption, etc.). 

Store job details and its report in JSON format
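The pseudocode above can be sketched in Python roughly as follows. This is not the actual service: every callable passed in (submit_job, list_nodes, start_stress, stop_stress, job_status) is a hypothetical hook you would wire to your own cluster tooling, and the "RUNNING" status string is an assumption.

```python
import json
import time
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def run_chaos_test(cluster, chaos_error, *, submit_job, list_nodes,
                   start_stress, stop_stress, job_status,
                   poll_seconds=30.0, sleep=time.sleep):
    """Run one job under an injected failure and return its report as JSON."""
    start = datetime.now()
    job_id = submit_job(cluster)           # submit a normal job
    nodes = list_nodes(cluster)            # IPs of the testable (worker) nodes
    try:
        for ip in nodes:
            start_stress(ip, chaos_error)  # inject the failure on every node
        status = job_status(job_id)
        while status == "RUNNING":         # let the job run to completion
            sleep(poll_seconds)
            status = job_status(job_id)
    finally:
        for ip in nodes:
            stop_stress(ip)                # always stop the stress tests
    return json.dumps({
        "workflowId": job_id,
        "status": status,
        "chaos_error": chaos_error,
        "startTime": start.strftime(TIME_FORMAT),
        "stopTime": datetime.now().strftime(TIME_FORMAT),
    })
```

Passing the cluster operations in as arguments keeps the loop itself testable with fakes, without touching a real cluster.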

Job Report Example

{
  "duration":"45 Minutes",
  "nominalTime":"2018-10-27 07:00:00",
  "cost":0.7974499089253186,
  "downloadLinks":[],
  "errorMessage":"Error: Main class [org.apache.oozie.action.hadoop.Hive2Main], exit code [2]",
  "startTime":"2019-07-25 18:45:37",
  "stopTime":"2019-07-25 19:31:18",
  "failedAction":"hive-action",
  "chaos_error":"packet_corruption",
  "workflowId":"0001998-190628043141230-oozie-oozi-W",
  "status":"KILLED"
}

We followed the same process to generate chaos data on different types of injected failures such as:

  • High Memory Utilization on the Host
  • Packet Corruption
  • Packet Loss
  • High Memory Utilization on a Container

While there are certainly more errors the model is capable of learning and classifying, for the purpose of prototyping we kept the type of failures to 2-3 categories.

Diagnostic Message Classification

In this section, we explore two error classification methods: a simple regex and supervised learning.

Short Term Solution with Regex

There are many ways to analyze text for different patterns. One of them is regular expression matching (regex). Regex is “special strings representing a pattern to be matched in search operation”. One use of regex is finding a keyword like “partition”. When a job fails due to missing partitions, its error message usually looks something like this: “Not all partitions are available even after 16 attempts”. The regex for this specific case would look like \W*((?i)partition(?-i))\W* .
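As a minimal sketch of the same check in Python: the inline flag toggles in the pattern above are written in a Java/PCRE style, which, as far as we know, Python's re module does not accept in that form, so this version passes re.IGNORECASE instead. The word-boundary anchors and the function name are choices made for this example.

```python
import re

# Case-insensitive match on the keyword "partition" (singular or plural).
PARTITION_ERROR = re.compile(r"\bpartitions?\b", re.IGNORECASE)


def is_missing_partition(error_message):
    """Return True when the error message mentions a partition."""
    return PARTITION_ERROR.search(error_message) is not None
```

For example, is_missing_partition("Not all partitions are available even after 16 attempts") returns True, while an unrelated Oozie error message returns False.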

A regex log parser could easily identify this error and do what is necessary to fix the problem. However, regex is very limited when it comes to analyzing complex log messages and stack traces. Whenever we find a new error, we have to manually hardcode a new regular expression to match the new error type, which is error-prone and tedious in the long run.

Since the Hadoop ecosystem includes many components, there are many combinations of log messages that can be produced; a simple regex parser would struggle here because it cannot capture general similarities between different sentences. This is where natural language processing helps.

Long Term Solution with Supervised Learning

The idea behind this solution is to use natural language processing to process log messages and supervised learning to learn and classify errors. This model should work better on log data than on normal language since machine logs are more structured and lower in entropy. You can think of entropy as a measure of how unstructured or random a set of data is. The English language tends to have high entropy relative to machine logs due to its sometimes illogical sentence structure. On the other hand, machine-generated log data is very repetitive, which makes it easier to model and classify.

Our model required a pre-processing step called tokenization. Tokenization is the process of taking a piece of text and breaking it up into its individual words or tokens. Next, the set of tokens was used to model their relationships in high-dimensional space using Word2Vec. Word2Vec is a widely popular model used for learning vector representations of words, called “word embeddings”. Finally, we used the vector representations to train a simple logistic regression classifier on the chaos data generated earlier. The diagram below shows a similar training process from Experience Report: Log Mining using Natural Language Processing and Application to Anomaly Detection by Christophe Bertero, Matthieu Roy, Carla Sauvanaud and Gilles Tredan.



In contrast to the diagram above, since we only trained the classifier on error logs, it was not possible to classify a normal system, as it should produce no error logs. Instead, we trained the model on different types of stressed systems. Using the dataset generated by chaos testing, we were able to identify the root cause of each error message for each failed job. This allowed us to build our training dataset as shown below. We split our dataset into 70% for training and 30% for testing.

Subsequently, each log message was tokenized. All the tokens were fed into a pre-trained word2vec model, which mapped each word into a vector space, creating a word embedding. Each line was represented as a list of vectors, and the average of all the vectors in a log message served as its feature vector in high-dimensional space. We then fed each feature vector and its label into a classification algorithm such as logistic regression or random forest to create a model that could predict the root cause error from a log line. However, since a failure often produces multiple error log messages, it would not be logical to reach a root cause conclusion from a single line of a long error log. A simple way to tackle this problem is to window the log, feed the individual lines from the window into the model, and output the most common error predicted across all the lines in the window. This is a very naive way of breaking apart a long error log, so more research is needed to process error logs without losing the valuable insight of their dependent messages.
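The two mechanical pieces of that description, averaging word vectors into a feature vector and taking a majority vote over a window of lines, can be sketched without any ML library. The embeddings dict and the classify_line callable are hypothetical stand-ins for the trained word2vec model and the classifier.

```python
from collections import Counter


def mean_vector(tokens, embeddings):
    """Average the vectors of all known tokens; return None if none are known."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def classify_window(lines, classify_line):
    """Label a window of log lines with the most common per-line prediction."""
    if not lines:
        raise ValueError("window must contain at least one line")
    votes = Counter(classify_line(line) for line in lines)
    return votes.most_common(1)[0][0]
```

In this sketch a tie goes to whichever label was predicted first, which is one more reason to treat the windowing approach as a starting point rather than a finished design.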

Below are a few examples of error log messages and their tokenized version.

Example Raw Log

java.lang.RuntimeException: org.apache.thrift.transport.TSaslTransportException: No data or no sasl data in the stream at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:219)
    at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:269)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191] Caused by: org.apache.thrift.transport.TSaslTransportException: No data or no sasl data in the stream at org.apache.thrift.transport.TSaslTransport.open(TSaslTransport.java:328)
    at org.apache.thrift.transport.TSaslServerTransport.open(TSaslServerTransport.java:41) ~[hive-exec-3.1.0.3.1.0.0-78.jar:3.1.0-SNAPSHOT] at org.apache.thrift.transport.TSaslServerTransport$Factory.getTransport(TSaslServerTransport.java:216)

Tokenized Raw Log

[java,lang,RuntimeException,org,apache,thrift,transport,TSaslTransportException,No,data,no,sasl,data,stream,org,apache,thrift,transport,TSaslServerTransport,Factory,getTransport,TSaslServerTransport,java219]
[org,apache,thrift,server,TThreadPoolExecutor,WorkerProcess,run,TThreadPoolServer,java,269]
...
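A tokenizer that approximates the output above might look like the sketch below. The exact rules we used are not reproduced here: splitting on non-alphanumeric characters and dropping the small stop-word list shown are assumptions inferred from the example, so the output may differ in detail.

```python
import re

# Assumed stop words, inferred from the tokens missing in the example above.
STOP_WORDS = {"or", "in", "the", "at"}


def tokenize(line):
    """Split a log line on non-alphanumeric characters and drop stop words."""
    return [tok for tok in re.findall(r"[A-Za-z0-9]+", line) if tok not in STOP_WORDS]
```

Note that the stop-word check is case-sensitive, so "No" and "no" both survive, as they do in the example.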

Results

The model predicted the root cause of failed jobs with 99.3% accuracy on our test dataset. At an initial glance, the metrics calculated on the test data look very promising. However, we still need to evaluate its effectiveness in production to get a more accurate picture. The initial success of this proof of concept warrants further experimentation, testing, and investigation into using NLP to classify errors.

Training Data 70% (Top 20 Rows)

Test Data Results 30% (Top 20 Rows)

Conclusion

In order to implement this tool in production, data engineers will have to automate certain aspects of the data aggregation and model building pipeline.

Self Reporting Errors

A machine learning model is only as good as its data. For this tool to be robust and accurate, any time engineers encounter a new error or an error that the model does not know about, they should report what they believe is the root cause so the corresponding error logs get tagged with that root cause. Then, when a job fails for a similar reason, the model will be able to diagnose and classify its error type. This can be done through a simple API, form, or Hive query that takes in the job id and its assumed root_cause. The idea behind this is that we can manually label log messages and errors in case the classifier fails to correctly diagnose the true problem.

The self-reported error should take the form of the chaos error to ensure the existing pipeline will work.

Automating Chaos Data Generation

The chaos tests should run on a regular basis in order to keep the model up to date. This could be done by creating a Jenkins job that automatically runs on a regular cadence. Oftentimes, running a stress test makes certain nodes unhealthy. Our automation is then unable to SSH into the nodes again to stop the stress test, such as when the stress test interferes with connectivity to the node (see the error below). This can be solved by setting a time limit for the stress test, so the script does not have to SSH in again once the job has finished. In the long run, as the self-reported errors grow, the model should rely less on chaos-test-generated data. Chaos tests produce logs for very extreme situations, which may not be typical of a normal production environment.

java.lang.RuntimeException: Unable to setup local port forwarding to 10.8.100.144:22
    at com.bazaarvoice.rook.infra.util.remote.Gateway.connect(Gateway.java:103)
    at com.bazaarvoice.rook.infra.regress.yarn.RootCauseTest.kill_memory_stress_test(RootCauseTest.java:174)

Hackathon 2019

This year’s Bazaarvoice Hackathon coincided with our annual all-hands meeting in Austin. Our global offices took time to work on projects that focused on innovation, social integrations, and improved efficiencies. Teams across our departments participated, including R&D, Product, Customer Services, and Knowledge Base.

Hackathon teams took two days to work on their projects. The following day, teams presented their outcomes in a science fair setting while the company voted on the projects. The top 10 teams then went on to present to the entire company.

Thanks to all who participated and especially those who organized the various activities. We hope to see many of these projects become new product enhancements.

Response API Demo App

Are you looking to develop your own application on top of the Bazaarvoice Response API? Well, we got something for you. The Response API Demo App is a simple Node-React application which demonstrates how to use Response API in conjunction with our 3-legged OAuth2 API. It is recommended to go through the Developer Portal and read about OAuth2 API and the Response API before diving into the application architecture below.

This application was bootstrapped with Create React App and consists of two separate components – the front-end client side in React and a back-end server side in NodeJS.

Let’s talk more about the back-end architecture first. Almost all of the server-side logic is contained in server.js. Using the Express framework for NodeJS, this file defines the following endpoints:

  • /api/redirect – This endpoint handles any incoming redirects from OAuth2 and obtains an access token using either an authorization code (if it is a first-time login) or a refresh token (if the existing access token has expired or is expiring soon). If you are building an app integrated with OAuth2, you must define an endpoint to handle redirects in a similar way.
  • /api/check-login – This is an application specific endpoint which was defined so that the client side can easily verify using the server if a user is logged in or not so that they can then be taken to the login page.
  • Abstraction of GET, POST, PATCH, DELETE calls to the Response API –  These endpoints are basically a clone of the endpoints provided by the Response API which are documented here on the Developer Portal. This abstraction is done so that the client side can call these secured endpoints indirectly without having to expose any of the OAuth2 credentials through the user’s browser. All our logic that relates to obtaining and refreshing OAuth tokens stays on the server side securely.
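The token decision made by /api/redirect can be sketched as a small function. The demo app itself is written in NodeJS; this Python version is only meant to show the logic. The session field names, the choose_grant name, and the 60-second refresh margin are assumptions, while the grant_type values are the standard OAuth2 ones.

```python
def choose_grant(session, code, now, refresh_margin=60.0):
    """Return the token-request parameters to send, or None if the token is still good.

    session: dict that may hold access_token, refresh_token and expires_at (epoch seconds)
    code:    authorization code from the redirect, or None
    now:     current time in epoch seconds
    """
    if "access_token" not in session:
        if code is None:
            raise ValueError("no session and no authorization code: send the user to login")
        # First-time login: exchange the authorization code for tokens.
        return {"grant_type": "authorization_code", "code": code}
    if session["expires_at"] - now <= refresh_margin:
        # Token has expired or is about to: use the refresh token.
        return {"grant_type": "refresh_token", "refresh_token": session["refresh_token"]}
    return None
```

Keeping this decision on the server is what lets the browser call the proxied endpoints without ever seeing the OAuth2 credentials.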

This application uses express-session for storing and maintaining user sessions using browser cookies which is important for being able to maintain multiple users without storing state on your backend. However, this implementation of user sessions is not suitable for production applications and you should probably be maintaining user sessions using a combination of cookies and session storage.

The server side credentials and configurations are supposed to be stored in server/server-config.json which is then picked up by the Node server. Note that these credentials are confidential and should not be exposed to the client side in any way.

Coming to the front-end architecture, we are using the React library and React Semantic UI for the UI components. The client side handles everything from querying a review from the Conversations API to using that Review ID to obtain, add, modify, or delete the corresponding responses to that review using the Response API. The following components make up the core of the front-end:

  • Login Page – If a user is not logged in (we check this using the /api/check-login endpoint as discussed above), we redirect them to this page which has a little widget that allows them to go to a Bazaarvoice login page, enter their Bazaarvoice Portal login credentials and once they are properly authenticated, the OAuth2 API redirects them to the /api/redirect endpoint which then handles all the token exchange logic. You can configure this redirect URI to be whatever you want during your app provisioning process but for the demo application to work as expected, it should be http://localhost:5000/api/redirect as you can see in the config file. All of the OAuth2 login and token exchange process happens during this step.
  • Search Page – This page essentially serves as the landing page for this application where a user can enter a Review ID (this can be obtained using Workbench or the Conversations API) and then they are taken to the review page.

  • Review Page – Once the user is authenticated and they search for a valid Review ID, they are taken to this page which fetches the specified review by calling the Conversations API and then fetches all responses for that review by calling the application’s backend which in turn calls the Response API. So, on this page, a user can see contents of a review along with all responses. This also allows them to add a new response, edit an existing one or delete a response.

You might have noticed a department field on a response which appears as a drop-down menu. These menu options have been hard-coded in the demo application and can be modified in the departmentFormOptions.js file. Further, client.js is a utility file that presents all the commonly used API calls as simple functions accessible to all components on the client side.

The client side configurations are stored in the response-demo/client/src/utils/config.js file. This is built into the project when the front-end client is packaged before running. These configurations are accessible to the browser so you shouldn’t store anything confidential here.

Apart from that, the application follows a pretty standard architecture and you can read more about it here. On the deployment side of things, there is a Dockerfile which builds the client artifacts, sets up the server and gets the application up and running on your local environment. Follow these instructions to get it up and running locally. The app can also be deployed to Flynn by making changes to the redirect URI in configurations and also making sure that your Flynn Redirect URI is added to the list of allowed Redirect URIs for your app credentials.

Vger Lets You Boldly Go . . .

Are you working on an agile team? Odds are high that you are. Whether you do Scrum/Kanban/lean/extreme, you are all about getting work done with the least resistance possible. Heck, if you are still on Waterfall, you care about that.  But how well are you doing? Do you know? Is that something a developer or a lead should even worry about, or is it an SEP? That’s a trick question. If your team is being held accountable and there is a gap between their expectations and your delivery, by the transitive property, you should worry about some basic lean metrics.

Here at Bazaarvoice, we are agile and overwhelmingly leverage kanban. Kanban emphasizes the disciplines of flow and continuous improvement. In an effort to make data-driven decisions about our improvements, we needed an easy way to get the relevant data. With just JIRA and GitHub alone, access to the right data has a significant barrier to entry.

So, like any enterprising group of engineers, we built an app for that.

What we did

Some of us had recently gone through an excellent lean metric forecasting workshop with Troy Magennis from Focused Objective. In his training he presented the idea of displaying a quadrant of lean metrics in order to force a narrative for a team’s behavior, and to avoid overdriving on a single metric. This really resonated with me and seemed like a good paradigm for the app we wanted to build.

And thus, Vger was born.

We went with a simple quadrant view with very bookmarkable URL parameters. We made it simple for teams to self-service by giving them an interface to make their own “Vger team” and add whatever “Vger boards” they need.  Essentially, if you can make a JQL query and give it a board in JIRA, Vger can graph the metrics for it. In the display, we provide a great deal of flexibility by letting teams configure date ranges for the dashboard, work types to be displayed, and the JIRA board columns to be considered as working/non-working.

Now the barrier to entry for lean metrics is down to “can you open a browser.”  Not too shabby.

The Quadrant View

We show the following in the quadrant view:

1. Throughput – The number of completed tickets per week.

2. Variation – The variation (standard deviation/mean) of the Throughput.

3. Backlog Growth – The tickets opened versus closed.

4. Lead Times – The lead times for the completed tickets. This also provides a detailed view by Jira board column to see where you spend most of your time.
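As a rough illustration of the Variation and Backlog Growth numbers in the list above, here is a minimal sketch. It is not Vger’s actual implementation: the function names and the weekly counts are made up, and it assumes a population standard deviation and "opened minus closed" as the growth figure.

```javascript
// Hypothetical weekly counts; Vger derives the real ones from JIRA.
const closedPerWeek = [4, 6, 5, 5];
const openedPerWeek = [5, 5, 7, 4];

// Arithmetic mean of a list of numbers.
function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Population standard deviation (an assumption; a sample deviation would also be reasonable).
function standardDeviation(values) {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

// Variation as described above: standard deviation divided by the mean.
function variation(weeklyThroughput) {
  return standardDeviation(weeklyThroughput) / mean(weeklyThroughput);
}

// Backlog growth per week, assumed here to be tickets opened minus tickets closed.
function backlogGrowth(opened, closed) {
  return opened.map((count, week) => count - closed[week]);
}

console.log(variation(closedPerWeek).toFixed(3)); // → 0.141
console.log(backlogGrowth(openedPerWeek, closedPerWeek)); // → [ 1, -1, 2, -1 ]
```

A low variation means weekly throughput is steady, which is what makes forecasting from it trustworthy; positive backlog growth means work is arriving faster than it is being finished.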

We at Bazaarvoice are conservative gamblers, so you’ll see the throughput and lead time quadrants show the 50%, 80%, and 90% likelihood (the inverse of percentile).  We do this because relying on the average or mean alone is risky. Who wants to bet on a coin toss? Not us. We like to be right, say, eight out of ten times.
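To make the likelihood idea concrete, here is a small sketch using a nearest-rank percentile. The function names and sample data are hypothetical, and Vger may well compute its lines differently; the point is only to show why likelihood is the inverse of percentile for throughput but not for lead time.

```javascript
// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Lead time: "80% likely" means about 80% of tickets finished within this many days.
function leadTimeAtLikelihood(leadTimes, likelihood) {
  return percentile(leadTimes, likelihood);
}

// Throughput: "80% likely" means at least 80% of weeks met or beat this number,
// so we read the (100 - likelihood)th percentile instead.
function throughputAtLikelihood(weeklyThroughput, likelihood) {
  return percentile(weeklyThroughput, 100 - likelihood);
}

// Made-up history: lead times in days and completed tickets per week.
const leadTimes = [1, 2, 2, 3, 3, 4, 5, 8, 13, 21];
const throughput = [2, 3, 4, 4, 5, 5, 6, 6, 7, 9];

console.log(leadTimeAtLikelihood(leadTimes, 50)); // → 3
console.log(leadTimeAtLikelihood(leadTimes, 80)); // → 8
console.log(throughputAtLikelihood(throughput, 80)); // → 3
```

Notice how far the 80% lead time (8 days) sits from the median (3 days) in this sample. Promising the median is the coin toss; promising the 80% figure is the eight-out-of-ten bet.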

The Quarterly View

Later, we were asked to show throughput by quarter to help with quarterly goal planning. We created a side-car page for this.  It shows Throughput by quarter:

We also built a scatterplot for lead times so outliers could be investigated:

This view has zoomable regions and each point lets you click through to the corresponding JIRA ticket. So that’s nice.

But Wait! Git This….

From day one, we chose to show the same Quadrant for GitHub Pull Requests.

Note that we show rejected and merged lines in the PR Volume quadrant.  We also support overlaying your git tags on both PR and JIRA ticket data.  Pretty sweet!

I Want to Do More

Vger lets you download throughput data from the Quadrant and Quarterly views. You can also download lead time data from the Quarterly view. This lets teams and individuals perform their own visualizations and investigations on these very useful lean metrics.

But Why?

Vger was built with three use cases in mind:

Teams should be informed in retros

Teams should have easy access to these key lean metrics in their retros. We recommend that they start off viewing the quadrant and seeing if they agree with the narrative the retro facilitator presents. They should also consider the results of any improvement experiments they tried. Did the new behavior make throughput go up as they hoped it would? Did the new behavior reduce time spent in code review? Did it reduce the number of open bugs? etc.  Certainly not everything in a retro should be mercilessly data-driven, but data is a key element of a culture of continuous improvement.

Managers should know this data and speak to it

Team managers commonly speak to how their teams are progressing. These discussions should be data-driven and, most importantly, driven by the same data the team has access to (and hopefully uses in its retros). The data should also be presented in a common format that still allows for some customization. NOTE: You should avoid comparing team to team in Vger or a similar visualization. In most situations, that way leads to futility, confusion, and frustration.

We should have data to drive data-driven decisions about the future

Lean forecasting is beyond the scope of this post; however, Troy Magennis has a fine take on it.  My short two cents on the matter: a reasonably functioning team with even a little bit of run time should never be asked “how long will it take?”  Drop that low-value ritual and do the high-value task of decomposing the work, then forecast with historical data. Conveniently, you can download this historical data from Vger and use it in your spreadsheet of choice.  I happen to like Monte Carlo simulations myself.
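For readers who have not seen a Monte Carlo forecast before, here is a minimal sketch of the idea: replay randomly sampled weeks of historical throughput until the decomposed backlog is done, repeat many times, and read off the likelihoods. Everything here (function names, the sample history, the seeded random generator) is an assumption for illustration, not a description of any particular forecasting tool.

```javascript
// Small seeded pseudo-random generator so runs are repeatable; Math.random works too.
function seededRandom(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One trial: keep drawing a random historical week until the backlog is empty.
function simulateWeeksToFinish(history, backlogSize, random) {
  let remaining = backlogSize;
  let weeks = 0;
  while (remaining > 0) {
    remaining -= history[Math.floor(random() * history.length)];
    weeks += 1;
  }
  return weeks;
}

// Many trials, then report the number of weeks at each likelihood.
function forecast(history, backlogSize, trials = 10000, random = Math.random) {
  if (!history.some((n) => n > 0)) {
    throw new Error('history needs at least one week with completed tickets');
  }
  const outcomes = [];
  for (let i = 0; i < trials; i += 1) {
    outcomes.push(simulateWeeksToFinish(history, backlogSize, random));
  }
  outcomes.sort((x, y) => x - y);
  const at = (likelihood) => outcomes[Math.ceil((likelihood * trials) / 100) - 1];
  return { p50: at(50), p80: at(80), p90: at(90) };
}

// Made-up history: tickets completed per week, forecasting a 40-ticket backlog.
console.log(forecast([2, 4, 6, 8], 40, 5000, seededRandom(42)));
```

The output is a range rather than a single date, which is the point: the team commits to the 80% or 90% figure instead of guessing.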

Isn’t This for Kanban?

You’ll note I used the term “lean metrics” throughout. I wanted to avoid any knee-jerk “kanban vs scrum vs ‘how we do things’” reaction. These metrics apply no matter what methodology you consciously (or unconsciously) use for the flow of work through your team.  Vger was built with feature development teams in mind, but we had good success when our client implementation team started using it as an early adopter. It allowed them to have a clear view into their lead time details and ferret out how much time was really spent waiting on clients to perform an action on their end.

Cool. How Do I Get a Vger?

We open sourced it here, so help yourself. This is presented as “it worked for us”-ware and is not as finely polished as it could be, so it has some caveats. It is a very simple serverless app. We use JIRA and GitHub, so only those tools are currently supported. If you use similar tools, give Vger a try!

What’s Next?

If your fingers are itching to contribute, here are some ideas:

  • Vger’s ETL process could really use an update
  • The Quadrant view UI really needs an update to React to match the Quarterly view
  • Make it flexible for your chosen issue tracker or source control?
  • How about adding a nice Cumulative Flow Diagram?

 

Still Looking Good While Testing: Automated Testing With a Visual Regression Service Part II

If you’ve followed our blog regularly, you’ve probably read our post on using visual regression testing tools and services to better test your applications’ front end look and feel.  If not, take a few minutes to read through our previous post on this topic.

Now that you’re up to speed, let’s take what we did in our previous post to the next level.

In this post, we’re going to show how you can alter your test configuration and code to test mobile browsers through a testing service (Browserstack), as well as get better reporting out of your tests, both natively and through options available in Jenkins.

Adjusting the Fuzz Factor:

If you run your test multiple times across several browsers with the visual regression service  using default settings, you may notice a handful of exception images cropping up in the diff directory which look nearly identical.

[Image: Browser fight!!!]

If you are testing across several browsers (using Browserstack or a similar service), those browsers may be hosted in real or virtual environments with widely differing screen resolutions, which can affect how the visual-regression service performs its image diff comparison.

To alleviate this, either ensure your various browser hosts are able to display the browser in the same resolution or add a slight increase in your test’s mismatch tolerance.

As mentioned in our previous post, you can do this by editing the values under the ‘visualRegression’ object in your project’s wdio.conf.js file.  For example:

visualRegression: {
  compare: new VisualRegressionCompare.LocalCompare({
    referenceName: getScreenshotName(path.join(process.cwd(), 'screenshots/reference')),
    screenshotName: getScreenshotName(path.join(process.cwd(), 'screenshots/screen')),
    diffName: getScreenshotName(path.join(process.cwd(), 'screenshots/diff')),
    misMatchTolerance: 0.25,
  }),
},

Setting the mismatch tolerance value to 0.25 in the above snippet allows the regression service a 0.25% margin of error when checking screenshots against the reference images it has captured (misMatchTolerance is expressed as a percentage between 0 and 100, not as a fraction).

Better Test Results:

As also mentioned in the previous post, one drawback of the examples given so far is that the visual-regression service returns little feedback to our tests beyond the output of the image comparison.

[Image: Having fun with that lack of feedback?]

However, with a bit of extra code, we can make usable assertion statements in our test executions once a screen comparison event is generated.

The key is the checkElement() method, which is really a WebdriverIO method that is enhanced by the visual-regression service.  This method returns an object that contains some metadata about the check being requested for the provided argument.

We can assign the result of the method call to a new variable. Once we serialize the object into something readable (e.g. a JSON string), we can leverage Chai or some other framework to write assertions that make our tests more descriptive.

Here’s an example:

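const { assert } = require('chai');

it('should test some things visually', () => {
  ...
  // Convert the comparison results to a string so they can be logged and searched.
  let returnedContents = JSON.stringify(browser.checkElement('<my web element>'));
  console.log(returnedContents);
  // Chai's assert.include takes the text to search first, then the expected substring.
  assert.include(returnedContents, '<text to assert for>');
});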

In the above code snippet, near the end of the test, we are calling ‘checkElement()’ to do a visual comparison of the contents of the given web element selector, then converting the object returned by ‘checkElement()’ to a string.  Afterward, we are asserting there is some text/string content within the stringified object our comparison returned.

In the case of the text assertion, we would want to assert a successful match message is contained within the returned object.  This is because the ‘checkElement()’ method, while it may return data that indicates a test failure, on its own will not trigger an exception that would appropriately fail our test should an image comparison mismatch occur.

Adding Mobile

[Image: Oooh – someone downloaded the pancake app!]

By combining WebdriverIO with the visual-regression service and a browser test service like Browserstack, we can create tests that run against browsers on real mobile devices.  To do this, we will need to make some changes to our WebdriverIO config.  Try the following:

  1. Make a copy of your wdio.conf.js
  2. Name the copy wdio.mobile.conf.js
  3. Edit your package.json file
  4. Copy the ‘test’ key/value pair under ‘scripts’ and paste it on a new line, then rename it ‘test:mobile’
  5. Point the test:mobile script to the wdio.mobile.conf.js config file and save your changes

Next, we need to edit the contents of the wdio.mobile.conf.js file so that tests run only against mobile devices.  The reason for adding a whole new test config and script declaration is that, due to the way mobile devices behave, some of the settings needed for mobile browser testing with WebdriverIO and Browserstack are incompatible with running tests against desktop browsers.

Edit the code block at the top of the mobile config file, changing it to the following:

var path = require('path');
var VisualRegressionCompare = require('wdio-visual-regression-service/compare');
function getScreenshotName(basePath) {
  return function(context) {
    var type = context.type;
    var testName = context.test.title;
    var browserVersion = parseInt(context.browser.version, 10);
    var browserName = context.browser.name;
    var browserOrientation = context.meta.orientation;
    return path.join(basePath,
    `${testName}_${browserName}_v${browserVersion}_${browserOrientation}_mobile.png`);
  };
}

Note that we’ve removed the declarations for height and width dimensions.  As Browserstack allows us to test on actual mobile devices, defining viewport constraints is not only unnecessary, it will also cause our tests to fail to execute (the height and width dimension object can’t be passed to webdriver as part of a configuration for a real mobile device).

Next, update the visual-regression service’s orientation configuration near the bottom of the mobile config file to the following:

orientations: ['portrait', 'landscape'],

Since applications using responsive design can sometimes break when moving from one orientation to another on mobile devices, we’ll want to run our tests in both orientation modes.

Stating this behavior in our configuration above will trigger our tests to automatically switch orientations and retest.

Finally, we’ll need to update our browser capability settings in this config file:

capabilities: [
{
  device: 'iPhone 8',
  'browserstack.local': false,
  'realMobile': true,
  project: 'My project - iPhone 8',
},
{
  device: 'iPad 6th',
  'browserstack.local': false,
  'realMobile': true,
  project: 'My project - iPad',
},
{
  device: 'Samsung Galaxy S9',
  'browserstack.local': false,
  'realMobile': true,
  project: 'My project - Galaxy S9',
},
{
  device: 'Samsung Galaxy Note 8',
  'browserstack.local': false,
  'realMobile': true,
  project: 'My project - Note 8',
},
]

In the code above, the ‘realMobile’ descriptor is necessary to run tests against modern mobile devices through Browserstack.  For more information on this, see Browserstack’s documentation.
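If the device list grows, the repeated capability objects can be generated rather than typed out. The following is a minimal sketch; the helper name ‘buildMobileCapabilities’ is hypothetical, and it labels each project with the full device name, which differs slightly from the hand-written labels above.

```javascript
// Hypothetical helper: builds one Browserstack capability object per device name,
// using the same keys as the hand-written capabilities list.
function buildMobileCapabilities(projectName, devices) {
  return devices.map((device) => ({
    device,
    'browserstack.local': false,
    realMobile: true,
    project: `${projectName} - ${device}`,
  }));
}

const capabilities = buildMobileCapabilities('My project', [
  'iPhone 8',
  'iPad 6th',
  'Samsung Galaxy S9',
  'Samsung Galaxy Note 8',
]);

console.log(capabilities.length); // → 4
console.log(capabilities[0].project); // → My project - iPhone 8
```

In wdio.mobile.conf.js you would then assign the returned array to the ‘capabilities’ key, so adding a device to the test run becomes a one-line change.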

Once your change is saved, try running your tests on mobile by doing:

npm run test:mobile

Examine the image outputs of your tests and compare them to the results from your desktop test run.  Now you should be able to run tests for your app’s UI across a wide range of browsers (mobile and desktop).

Taking Things to Jenkins

You could include the project we’ve put together so far in the codebase of your app (just copy the dev dependencies from your package.json, along with your specs and configs, into the existing project).

Another option is to use this as a standalone testing tool to apply some base screen-shot based verification of your app.

In the following example, we’ll go through the steps you can take to set up a project like this as a resource in Jenkins that tests a given application and sends the results to members of your team.

Consider the value of being able to generate a series of targeted screen shots of elements of your app per build and send them to team members like a product manager or UX designer.  This isn’t necessarily ‘heavy lifting’ from a dev ops standpoint but automating any sort of feedback for your app will trend toward delivering a better app, faster.

Assuming you have your project hosted in Github, you have administrative access to Jenkins and that your app is hosted in an environment Jenkins can access (your organization’s Amazon S3 bucket, hosted from a Docker image, etc.):

I. Create a new Jenkins job

* Create a new Jenkins task within the same space as the job that builds and ships your web app (should one exist).  Assign it a name.

II. Configure the job to pull your image testing app from GitHub

* Go to the source control configuration portion of the job.  Enter the info for your repository for the test app you’ve built.

III. Set the project to build periodically

* Within the build settings of the job, choose the build periodically option and configure your job’s frequency.

Ideally, you would set this job to trigger once the task that builds and ‘ships’ your app completes successfully (thus, executing your tests every time a new version of the app is published).

Alternatively, to run at a given time each day, enter something like ‘0 23 * * *’ into Jenkins’ cron configuration field (in this case, the job runs once a day at 11 PM, relative to your server’s configuration). The first field is the minute, so leaving it as ‘*’ would trigger the job every minute of that hour.

IV. Build your shell execution script

* Create a build step for your task and choose the shell script option.  You can embed your npm test script execution here.

V. Collect your artifacts

From the list of post build options, choose the ‘archive artifacts’ option.  In the available text field, enter the path to the generated screenshots you wish to capture.  Note that this path may differ depending on how your Jenkins server is configured.  If you’re having trouble pinpointing the exact path to your artifacts, run the ‘pwd’ command in your shell script step to have the job print your working directory in its console output, then work from there.

VI. Sending email notifications

* Lastly, choose from the post build options menu the advanced email notification option.

Enter an email address or list of email addresses you wish to contact once this job completes.  Fill out your email settings (subject line, body, etc.) and be sure to enter the path to the screenshots you wish to attach to the email.

Save your changes and you are ready to go!

This job will run on a regular basis, execute a given set of UI specific tests and send screen capture information to inquiring team members who would want to know.

You can create more nuanced runs by using Jenkins’ clone feature to copy this job and altering the copy to run your mobile-specific tests, diversifying your test runs.

Further Reading

If you’re looking to dive further into visual regression testing, there are several code examples online worth reviewing.

Additionally, there are other visual regression testing services worth investigating such as Percy.IO, Screener.IO and Fluxguard.