Category Archives: DevOps

Flyte 1 Year In

On the racetrack of building ML applications, traditional software development steps are often overtaken. Welcome to the world of MLOps, where unique challenges meet innovative solutions and consistency is king. 

At Bazaarvoice, training pipelines serve as the backbone of our MLOps strategy. They underpin the reproducibility of our model builds. A glaring gap existed, however, in graduating experimental code to production.

Rather than continuing down piecemeal approaches, which often centered heavily on Jupyter notebooks, we determined a standard was needed to empower practitioners to experiment and ship machine learning models extremely fast while following our best practices.

(cover image generated with Midjourney)

Build vs. Buy

Fresh off the heels of our wins from unifying our machine learning (ML) model deployment strategy, we first needed to decide whether to build a custom in-house ML workflow orchestration platform or seek external solutions.

When deciding to “buy” (it is open source after all), selecting Flyte as our workflow management platform emerged as a clear choice. It saved invaluable development time and nudged our team closer to delivering a robust self-service infrastructure. Such an infrastructure allows our engineers to build, evaluate, register, and deploy models seamlessly. Rather than reinventing the wheel, Flyte equipped us with an efficient wheel to race ahead.

Before leaping with Flyte, we embarked on an extensive evaluation journey. Choosing the right workflow orchestration system wasn’t just about selecting a tool but also finding a platform to complement our vision and align with our strategic objectives. We knew the significance of this decision and wanted to ensure we had all the bases covered. Ultimately the final tooling options for consideration were Flyte, Metaflow, Kubeflow Pipelines, and Prefect.

To make an informed choice, we laid down a set of criteria:

Criteria for Evaluation

Must-Haves:

  • Ease of Development: The tool should intuitively aid developers without steep learning curves.
  • Deployment: Quick and hassle-free deployment mechanisms.
  • Pipeline Customization: Flexibility to adjust pipelines as distinct project requirements arise.
  • Visibility: Clear insights into processes for better debugging and understanding.

Good-to-Haves:

  • AWS Integration: Seamless integration capabilities with AWS services.
  • Metadata Retention: Efficient storage and retrieval of metadata.
  • Startup Time: Speedy initialization to reduce latency.
  • Caching: Optimal data caching for faster results.

Neutral, Yet Noteworthy:

  • Security: Robust security measures ensuring data protection.
  • User Administration: Features facilitating user management and access control.
  • Cost: Affordability – offering a good balance between features and price.

Why Flyte Stood Out: Addressing Key Criteria

Diving deeper into our selection process, Flyte consistently addressed our top criteria, often surpassing the capabilities of other tools under consideration:

  1. Ease of Development: Pure Python | Task Decorators
    • Python-native development experience
  2. Pipeline Customization
    • Easily customize any workflow and task by modifying the task decorator
  3. Deployment: Kubernetes Cluster
    • Flyte’s native Kubernetes integration simplified the deployment process.
  4. Visibility
    • Easily accessible container logs
    • Flyte Decks enable reporting visualizations

The Bazaarvoice customization

As with any platform, Flyte brought many advantages but was not a plug-and-play solution for our unique needs. We also anticipated the platform’s novelty within our organization. We wanted to reduce the learning curve as much as possible and allow our developers to transition smoothly without being overwhelmed.

To smooth the transition and expedite the development process, we’ve developed a cookiecutter template to serve as a launchpad for developers, providing a structured starting point that’s standardized and aligned with best practices for Flyte projects. This structure empowers developers to swiftly construct training pipelines.

The most relevant files provided by the template are:

  • Pipfile    - Details project dependencies 
  • Dockerfile - Builds docker container
  • Makefile   - Helper file to build, register, and execute projects
  • README.md  - Details the project 
  • src/
    • tasks/
    • workflows.py (Follows the Kedro Standard for Data Layers)
      • process_raw_data - workflow to extract, clean, and transform raw data 
      • generate_model_input - workflow to create train, test, and validation data sets 
      • train_model - workflow to generate a serialized, trained machine learning model
      • generate_model_output - workflow to prevent train-serving skew by performing inference on the validation data set using the trained machine learning model
      • evaluate - workflow to evaluate the model on a desired set of performance metrics
      • reporting - workflow to summarize and visualize model performance
      • full - complete Flyte pipeline to generate trained model
  • tests/ - Unit tests for your workflows and tasks
  • run - Simplifies running of workflows
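To make the flow between those workflows concrete, here is a minimal sketch in plain Python. It is not the template's code: the function names mirror the workflow names listed above, but the bodies are toy placeholders (a majority-label "model" on a handful of made-up records), and in the real template each stage would be a Flyte workflow rather than a bare function.

```python
from collections import Counter
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (text, label); a made-up record shape for illustration


def process_raw_data(raw: List[Record]) -> List[Record]:
    """Clean the raw records: normalize text and drop empty rows."""
    cleaned = [(text.strip().lower(), label) for text, label in raw]
    return [(text, label) for text, label in cleaned if text]


def generate_model_input(rows: List[Record]) -> Dict[str, List[Record]]:
    """Split 60/20/20 into train, test, and validation sets (in order)."""
    n = len(rows)
    train_end, test_end = n * 6 // 10, n * 8 // 10
    return {
        "train": rows[:train_end],
        "test": rows[train_end:test_end],
        "validation": rows[test_end:],
    }


def train_model(train: List[Record]) -> str:
    """Toy 'model': simply the most common label in the training set."""
    return Counter(label for _, label in train).most_common(1)[0][0]


def generate_model_output(model: str, validation: List[Record]) -> List[str]:
    """Run inference on the validation set with the trained model."""
    return [model for _ in validation]


def evaluate(predictions: List[str], validation: List[Record]) -> float:
    """Accuracy of the predictions against the validation labels."""
    correct = sum(p == label for p, (_, label) in zip(predictions, validation))
    return correct / len(validation)


def reporting(accuracy: float) -> str:
    """Summarize model performance as a one-line report."""
    return f"validation accuracy: {accuracy:.2f}"


def full(raw: List[Record]) -> Tuple[str, str]:
    """Chain every stage, returning the model and its report."""
    splits = generate_model_input(process_raw_data(raw))
    model = train_model(splits["train"])
    predictions = generate_model_output(model, splits["validation"])
    return model, reporting(evaluate(predictions, splits["validation"]))
```

The point of the layering is that each stage only consumes the previous stage's output, so any one of them can be rerun or swapped without touching the rest. The test split is produced but left unused in this toy version.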

In addition, a common challenge in developing pipelines is needing resources beyond what our local machines offer. Or, there might be tasks that require extended runtimes. Flyte does grant the capability to develop locally and run remotely. However, this involves a series of steps:

  • Rebuild your custom docker image after each code modification.
  • Assign a version tag to this new docker image and push it to ECR.
  • Register this fresh workflow version with Flyte, updating the docker image.
  • Instruct Flyte to execute that specific version of the workflow, parameterizing via the CLI.

To circumvent these challenges and expedite the development process, we designed the template’s Makefile and run script to abstract the series of steps above into a single command!

./run --remote src/workflows.py full

The Makefile uses a couple of helper targets, but overall provides the following commands:

  • info       - Prints info about this project
  • init       - Sets up project in Flyte and creates an ECR repo
  • build      - Builds the docker image
  • push       - Pushes the docker image to ECR
  • package    - Creates the Flyte package
  • register   - Registers version with Flyte
  • runcmd     - Generates run command for both local and remote
  • test       - Runs any tests for the code
  • code_style - Applies black formatting & flake8
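
The post does not show the Makefile or run script themselves, so the following is only a sketch of the kind of command sequence such a wrapper might chain together for a remote run. The registry host, project name, and function name are hypothetical, and the use of `pyflyte register` and `pyflyte run --remote` is an assumption about what the template calls under the hood. The function only builds the commands; it does not execute them.

```python
import shlex
from typing import List


def remote_run_commands(
    registry: str,
    project: str,
    version: str,
    source_dir: str,
    workflow_file: str,
    workflow_name: str,
    domain: str = "development",
) -> List[List[str]]:
    """Build (but do not run) the build, push, register, and run commands."""
    image = f"{registry}/{project}:{version}"
    return [
        # 1. Rebuild the image so it contains the latest code.
        ["docker", "build", "-t", image, "."],
        # 2. Push the versioned image to the (hypothetical) ECR registry.
        ["docker", "push", image],
        # 3. Register this version of the workflows against the new image.
        ["pyflyte", "register", "--project", project, "--domain", domain,
         "--image", image, "--version", version, source_dir],
        # 4. Execute the chosen workflow remotely with that image.
        ["pyflyte", "run", "--remote", "--project", project, "--domain", domain,
         "--image", image, workflow_file, workflow_name],
    ]


if __name__ == "__main__":
    commands = remote_run_commands(
        registry="123456789012.dkr.ecr.us-east-1.amazonaws.com",  # hypothetical
        project="my-flyte-project",  # hypothetical
        version="dev-001",
        source_dir="src",
        workflow_file="src/workflows.py",
        workflow_name="full",
    )
    for command in commands:
        print(shlex.join(command))
```

Keeping the commands as lists rather than one shell string makes it easy to hand each one to `subprocess.run` later without worrying about quoting.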

Key Triumphs

With Flyte as an integral component of our machine learning platform, we’ve achieved unmatched momentum in ML development. It enables swift experimentation and deployment of our models, ensuring we always adhere to best practices. Beyond aligning with fundamental MLOps principles, our customizations ensure Flyte perfectly meets our specific needs, guaranteeing the consistency and reliability of our ML models.

Closing Thoughts

Just as racers feel exhilaration crossing the finish line, our team feels an immense sense of achievement seeing our machine learning endeavors zoom with Flyte. As we gaze ahead, we’re optimistic, ready to embrace the new challenges and milestones that await. 🏎️

If you are drawn to this type of work, check out our job openings.

Vger Lets You Boldly Go . . .

Are you working on an agile team? Odds are high that you are. Whether you do Scrum/Kanban/lean/extreme, you are all about getting work done with the least resistance possible. Heck, even if you are still on Waterfall, you care about that. But how well are you doing? Do you know? Is that something a developer or a lead should even worry about, or is it an SEP? That’s a trick question. If your team is being held accountable and there is a gap between their expectations and your delivery, by the transitive property, you should worry about some basic lean metrics.

Here at Bazaarvoice, we are agile and overwhelmingly leverage kanban. Kanban emphasizes the disciplines of flow and continuous improvement. In an effort to make data-driven decisions about our improvements, we needed an easy way to get the relevant data. With just JIRA and GitHub alone, access to the right data has a significant barrier to entry.

So, like any enterprising group of engineers, we built an app for that.

What we did

Some of us had recently gone through an excellent lean metric forecasting workshop with Troy Magennis from Focused Objective. In his training he presented the idea of displaying a quadrant of lean metrics in order to force a narrative for a team’s behavior, and to avoid overdriving on a single metric. This really resonated with me and seemed like a good paradigm for the app we wanted to build.

And thus, Vger was born.

We went with a simple quadrant view with very bookmarkable URL parameters. We made it simple for teams to self-service by giving them an interface to make their own “Vger team” and add whatever “Vger boards” they need. Essentially, if you can make a JQL query and give it a board in JIRA, Vger can graph the metrics for it. In the display, we provide a great deal of flexibility by letting teams configure date ranges for the dashboard, work types to be displayed, and the JIRA board columns to be considered as working/non-working.

Now the barrier to entry for lean metrics is down to “can you open a browser.”  Not too shabby.

The Quadrant View

We show the following in the quadrant view:

1. Throughput – The number of completed tickets per week.

2. Variation – The variation (standard deviation/mean, i.e. the coefficient of variation) of the throughput.

3. Backlog Growth – The tickets opened versus closed.

4. Lead Times – The lead times for the completed tickets. This also provides a detailed view by Jira board column to see where you spend most of your time.

We at Bazaarvoice are conservative gamblers, so you’ll see the throughput and lead time quadrants show the 50%, 80%, and 90% likelihood (the inverse of percentile). We do this because the average, or mean, is not your friend. Who wants to bet on a coin toss? Not us. We like to be right, say, eight out of ten times.
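
For readers who want to try these numbers on their own data, here is a small sketch of one plausible way to compute the variation and the likelihood values. The sample data and function names are made up, and Vger's own calculation may differ in detail (for example, in how it interpolates percentiles); this version uses the simple nearest-rank method.

```python
import math
import statistics
from typing import Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Variation as described above: standard deviation divided by mean."""
    return statistics.stdev(values) / statistics.mean(values)


def lead_time_at_likelihood(lead_times: Sequence[float], percent: int) -> float:
    """Smallest lead time such that at least `percent`% of tickets finished within it."""
    ordered = sorted(lead_times)
    rank = max(math.ceil(percent * len(ordered) / 100), 1)
    return ordered[rank - 1]


def throughput_at_likelihood(weekly_counts: Sequence[int], percent: int) -> int:
    """Largest weekly count that at least `percent`% of weeks met or exceeded."""
    ordered = sorted(weekly_counts, reverse=True)
    rank = max(math.ceil(percent * len(ordered) / 100), 1)
    return ordered[rank - 1]


if __name__ == "__main__":
    weekly_throughput = [4, 6, 5, 7, 3, 6, 5, 8, 4, 6]   # tickets per week (made up)
    lead_times_days = [1, 2, 2, 3, 3, 4, 5, 6, 8, 13]    # days per ticket (made up)

    print("variation:", round(coefficient_of_variation(weekly_throughput), 2))
    for percent in (50, 80, 90):
        print(
            f"{percent}%: at least {throughput_at_likelihood(weekly_throughput, percent)} "
            f"tickets/week, done within {lead_time_at_likelihood(lead_times_days, percent)} days"
        )
```

Note the direction flips between the two metrics: a higher likelihood means a lower throughput you can count on, but a longer lead time you should plan for, which is one way to read "the inverse of percentile."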

The Quarterly View

Later, we were asked to show throughput by quarter to help with quarterly goal planning. We created a side-car page for this.  It shows Throughput by quarter:

We also built a scatterplot for lead times so outliers could be investigated:

This view has zoomable regions and each point lets you click through to the corresponding JIRA ticket. So that’s nice.

But Wait! Git This….

From day one, we chose to show the same Quadrant for GitHub Pull Requests.

Note that we show rejected and merged lines in the PR Volume quadrant.  We also support overlaying your git tags on both PR and JIRA ticket data.  Pretty sweet!

I Want to Do More

Vger lets you download throughput data from the Quadrant and Quarterly views. You can also download lead time data from the Quarterly view. This lets teams and individuals perform their own visualizations and investigations on these very useful lean metrics.

But Why?

Vger was built with three use cases in mind:

Teams should be informed in retros

Teams should have easy access to these key lean metrics in their retros. We recommend that they start off viewing the quadrant and seeing if they agree with the narrative the retro facilitator presents. They should also consider the results of any improvement experiments they tried. Did the new behavior make throughput go up as they hoped it would? Did the new behavior reduce time spent in code review? Did it reduce the number of open bugs? Certainly not everything in a retro should be mercilessly data-driven, but data is a key element of a culture of continuous improvement.

Managers should know this data and speak to it

Team managers commonly speak to how their teams are progressing. These discussions should be data-driven and, most importantly, driven by the same data the team has access to (and hopefully uses in its retros). The data should also be presented in a common format that still provides for some customization. NOTE: You should avoid comparing team to team in Vger or a similar visualization. In most situations, that way leads to futility, confusion, and frustration.

We should have data to drive data-driven decisions about the future

Lean forecasting is beyond the scope of this post; however, Troy Magennis has a fine take on it. My short two cents on the matter is: a reasonably functioning team with even a little bit of run time should never be asked “how long will it take?” Drop that low-value ritual and do the high-value task of decomposing the work, then forecast with historical data. Conveniently, you can download this historical data from Vger and use it in your spreadsheet of choice. I happen to like Monte Carlo simulations myself.
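
As a rough illustration of what such a simulation can look like, here is a minimal sketch that resamples historical weekly throughput to forecast how many weeks a decomposed backlog might take. The function names, sample history, and backlog size are all made up for the example, and this is only one simple flavor of Monte Carlo forecasting, not a method taken from Vger or from Troy Magennis's material.

```python
import math
import random
from typing import List, Sequence


def simulate_weeks_to_finish(
    history: Sequence[int], backlog: int, trials: int = 10_000, seed: int = 0
) -> List[int]:
    """Replay randomly sampled historical weeks until the backlog is done, many times over."""
    if backlog <= 0:
        raise ValueError("backlog must be positive")
    if max(history) <= 0:
        raise ValueError("history needs at least one week with completed tickets")
    rng = random.Random(seed)  # seeded so a given run is repeatable
    samples = list(history)
    results = []
    for _ in range(trials):
        remaining, weeks = backlog, 0
        while remaining > 0:
            remaining -= rng.choice(samples)  # pretend this week looks like a past week
            weeks += 1
        results.append(weeks)
    return results


def weeks_at_likelihood(results: Sequence[int], percent: int) -> int:
    """Number of weeks within which `percent`% of the simulated runs finished."""
    ordered = sorted(results)
    rank = max(math.ceil(percent * len(ordered) / 100), 1)
    return ordered[rank - 1]


if __name__ == "__main__":
    weekly_throughput = [4, 6, 5, 7, 3, 6, 5, 8, 4, 6]  # made-up history
    runs = simulate_weeks_to_finish(weekly_throughput, backlog=30)
    for percent in (50, 80, 90):
        print(f"{percent}% likely to finish within {weeks_at_likelihood(runs, percent)} weeks")
```

The only inputs are a ticket count from decomposing the work and the throughput history, so nobody has to estimate durations by gut feel.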

Isn’t This for Kanban?

You’ll note I used the term “lean metrics” throughout. I wanted to avoid any knee-jerk “kanban vs scrum vs ‘how we do things’” reaction. These metrics apply no matter what methodology you consciously (or unconsciously) use for the flow of work through your team. Vger was built with feature development teams in mind, but we had good success when our client implementation team started using it as an early adopter. It allowed them to have a clear view into their lead time details and ferret out how much time was really spent waiting on clients to perform an action on their end.

Cool. How Do I Get a Vger?

We open sourced it here, so help yourself. This is presented as “it worked for us”-ware and is not as finely polished as it could be, so it has some caveats. It is a very simple serverless app. We use JIRA and GitHub, so only those tools are currently supported. If you use the same tools, give Vger a try!

What’s Next?

If your fingers are itching to contribute, here are some ideas:

  • Vger’s ETL process could really use an update
  • The Quadrant view UI really needs an update to React to match the Quarterly view
  • Make it flexible for your chosen issue tracker or source control?
  • How about adding a nice Cumulative Flow Diagram?

 

As a software engineer, how do I change my career to DevOps?

At Bazaarvoice, we’re big fans of cloud. Real big. We’re also fans of DevOps. There’s been a lot of discussion over the past several years about “What is DevOps?” Usually, this term is used to describe Systems Engineers and Site Reliability Engineers (sometimes called Infrastructure Engineers, Systems Engineers, Operations Engineers or, in the most unfortunate case, Systems Administrators, which is an entirely different job!). This is not what DevOps means, but in the context of career development, it carries the connotation of a “modern” Systems or Site Reliability Engineer.

There’s a lot of great literature about what a DevOps engineer is. I encourage you to read this interview of Google’s VP of Engineering, as well as Hixson and Beyer’s excellent paper on Systems Engineering and its distinction among Software, Systems and Site Reliability engineers. Although DevOps engineering goes beyond these technical descriptions, I’ll save that exegesis for another time. (Write me if you want to hear it, though!)

Many companies claim to hire or employ DevOps engineers. Few actually do. Bazaarvoice does. Google does, too, although they’re very hipster about it (they called it Site Reliability Engineering before the term DevOps landed on the scene, so they don’t call it DevOps because they had it before it was cool, or something). I don’t know about other companies because I haven’t worked at them (well, I haven’t worked at Google either, but they are pretty vocal about their engineering philosophies, so I’ll take them at their word). But there’s a lot of industry buzzwordium with little substance. This isn’t a jab at other companies (but really, Bazaarvoice is way cooler), it’s just a side-effect of assigning job titles based on pop culture. If you’re really a DevOps engineer, then you already know all of this, and you probably filter out a lot of this nonsense on a daily basis.

But we’re here to answer a specific question: If I’m already a software engineer, how do I become a DevOps engineer?

So, you’re a developer and you want to get in on the ops action. Boy, are you in for a surprise! This isn’t about installing Arch Linux and learning to write Perl. There’s a place for that kind of thing (a very small, dark place in a very distant corner of the universe), but that isn’t inherently what DevOps means.

Let’s consider a few of the properties and responsibilities of DevOps engineering.

A DevOps engineer:

  • Writes code / software. In fact, he is a proper software engineer.
  • Builds tools.
  • Does the painful things, as often and frequently as possible.
  • Participates in the on-call rotation (yes, for 2 a.m. production outages).
  • Infrastructure design.
  • Scaling systems (any system or subsystem — networking, applications, load balancers).
  • Maintenance. Like rebooting that frail vhost with a memory leak that no one’s bothered to fix or take ownership of.
  • Monitoring.
  • Virtualization.
  • Agile/kanban/whatever development methodology. It’s not so much that agile is “right.” It’s just the most efficient way to complete a work queue (taking into account interruptions and blockers). A good DevOps engineer has strong opinions about this!
  • Software release cycles and management. In fact, you might even see “development methodology” and software release cycles as the same thing.
  • Automation. Automation. Automation.
  • Designing a branch/release strategy for the provided SCM (git, Mercurial, svn, etc). Which you do have.
  • Metrics / reporting. Goes hand-in-hand with monitoring, although they are different.
  • Optimization / tuning. Of applications, tools, services, hardware…anything.
  • Load and performance testing and benchmarking, including performance testing of highly complex systems. And you know the difference between load testing and performance testing.
  • Cloud. Okay, you don’t really have to have cloud experience, but it can fundamentally change the way you think about complex systems. No one in a colo facility devised the notion of “immutable infrastructure.”
  • Configuration management. Or not. You have an opinion about it. (You’ve surely heard of Puppet, Chef, Ansible, etc. Yes?)
  • Security. At every layer.
  • Load balancing / proxying / replicating. (Of services, systems, components and processes.)
  • Command-line fu. A DevOps engineer is familiar with tools at his disposal for debugging, diagnosing and fixing issues on one or many servers, quickly. You know how pipes work, and you can count how many records contained some phrase in a log file with ease, for example.
  • Package management.
  • CI/CIT/CD — continuous integration, continuous integration testing, and continuous deployment. This is the closest thing to the real meaning of “DevOps” that a Systems Engineer will do.
  • Databases. All of them. SQL, NoSQL, whatever. Distributed ones, too!
  • Solid systems expertise. We’re talking about the networking stack, how hard disks work, how filesystems work, how system memory works, how CPUs work, and how all these things come together. This is the traditional “operations” expertise you’ve heard about.

Phew! That’s a lot. Turns out, almost all of these skills are directly applicable to software engineering. The only difference is the breadth of domain, but a good software engineer will grow his breadth of domain expertise into operations naturally anyway! A DevOps engineer just starts his growth from a different side of the engineering career map.

Let’s stop and think for a moment about some things a DevOps engineer is not. These details are critically important!

A DevOps engineer is not:

  • Easier than being a software engineer. (Ding! It is being a software engineer.)
  • Never writing code. I write tons of code.
  • Installing Linux and never touching your favorite OS again.
  • Working the third shift. (At least, it shouldn’t be; if it is, quit your job and come work with me.)
  • Inherently more “fun” than being a software engineer, although you may prefer it, if you’re into that kind of thing.
  • Greenfield. You’ll deal with old stuff in addition to new stuff. But as a good engineer, you care about business value and pragmatism.
  • An unsuccessful software engineer. Really: if you can’t write code, don’t expect to be a good DevOps engineer until you can.

A career shift

Here are a few things you should do to begin positioning yourself as a DevOps engineer.

  • Realize that you’re already an engineer, so becoming a DevOps engineer means you are just moving yourself to a different domain to grow and learn from a different direction.
  • Interview at a company that’s hiring DevOps. If you get hired, you’ll learn the operations side of things fast. Real fast. Or get fired. (Hint: you should disclose your experience honestly!) If you don’t get hired, you’ll learn what is still missing from your resume / experience that’s preventing you from becoming a full-time DevOps engineer. Incidentally, we’re hiring. 🙂
  • Tell your boss you want to become a DevOps engineer at your company. Your boss should help you to this end. If he/she does not, quit. Then come work at Bazaarvoice with me and a bunch of really awesome, super talented engineers working on some really awesome and challenging problems.
  • Obtain practical experience by using your skills as a software engineer to build tools rather than applications. Look at any of the open source projects Netflix has written for examples / ideas.
  • Learn OpenStack or some equivalent infrastructure-level project. (OpenStack has tremendous breadth, which is why I recommend it.) You can do this on your own time and budget. It’s not important whether OpenStack sucks compared to Rackspace Cloud. What’s important is that you understand all of the various components and why they are important. Have a wad of cash lying around? Learn Amazon Web Services or Google Compute Engine instead.
  • Bonus: learn about Apache Mesos and Kubernetes, and why they’re useful / important. Using them is less important than understanding them.
  • Participate in anything your team does involving operations — deployment, scale, etc. (See list above: “What DevOps is.”) If your team doesn’t do any of that (i.e., they send artifacts over to Operations and the Operations team does deployment), go over to the Operations team and sit in on a few deployments. You may be surprised!

Do I need to have deep operations experience to become a good DevOps engineer?

I’ve asked myself the same question. I come from a development background myself and only had less than a year of experience dealing with operations (part time) before becoming a DevOps engineer. (Granted, I had a referral vouching for me, which weighed in my favor.) Despite my less-than-stellar CS/algorithm skills (based on my complete lack of formal computer science education), I’ve had enough experience writing software that I could apply these concepts to systems in a similar fashion. That is, in the same way a software engineer needs to understand at what point his application code should be abstracted in support of future changes (i.e., refactored into smaller components), a DevOps engineer needs to understand at what point a component of his infrastructure should be abstracted in support of future changes (i.e., rebuilding all of his servers and rearchitecting infrastructure that’s already in production, despite the potential risk, in order to solve a problem, like allowing the architecture to scale to business needs). At its core, a good engineer is just as good whether he’s writing software or deploying it. Understanding complex systems and their interactions is a critical skill. But all of these are important for software engineers, whether you’re writing application code or not!

I hope this post helps you in your endeavor to become a DevOps engineer, or at least understand what it means to be a DevOps engineer at Bazaarvoice (as I said before, it may mean something totally different at other companies!). You should get your feet wet with some of the things we do. If it gets you tingly and excited, then come work with me at Bazaarvoice: http://www.bazaarvoice.com/careers/research-and-development/.


This article was originally posted as an answer on Quora. Due to surprising popularity, I’ve updated the article and posted it here.