Category Archives: DevOps

How We Migrated Millions of UGC Records to Aurora MySQL


Discover how Bazaarvoice migrated millions of UGC records from RDS MySQL to Amazon Aurora – at scale and with minimal user impact. Learn about the technical challenges, strategies, and outcomes that enabled this ambitious transformation in reliability, performance, and cost efficiency.

Bazaarvoice ingests and serves millions of user-generated content (UGC) items—reviews, ratings, questions, answers, and feedback—alongside rich product catalog data for thousands of enterprise customers. This data powers purchasing decisions for consumers and delivers actionable insights to clients. As e‑commerce volume and catalog breadth grew, our core transactional datastore on Amazon RDS for MySQL approached operational limits. Performance bottlenecks and scale risks were not acceptable for downstream systems or client-facing SLAs.

This post describes why we moved, what we ran before, the constraints we hit, and how a two‑phase modernization—first standardizing on MySQL 8.0, then migrating to Amazon Aurora MySQL—unlocked scale, resiliency, and operational efficiency.

Technical drivers for migration

Our decision was driven by clear, repeatable pain points across regions and clusters:

  • Elasticity and cost efficiency: RDS tightly couples compute and storage; scaling IOPS forced larger instances even when additional CPU/RAM was unnecessary, leading to overprovisioning and costly peak-capacity sizing.
  • Read scalability and freshness: Minutes of replica lag during heavy write bursts degraded downstream experiences and analytics.
  • Operational agility: Routine maintenance (major/minor upgrades, index/build operations) required long windows and coordination, increasing risk and toil.
  • Availability and recovery: We needed a stronger RTO (Recovery Time Objective) and RPO (Recovery Point Objective) posture across Availability Zones (AZs) and regions, with simpler failover.
  • Currency and security: MySQL 5.7 end of life increased support cost and security exposure.

Current architecture (before)

Our PRR (Product Reviews & Ratings) platform ran a large fleet of Amazon RDS for MySQL clusters deployed in the US and EU for latency and data residency. Each cluster had a single primary (writes) and multiple read replicas (reads), with application tiers pinned to predefined replicas to avoid read contention.

This design served us well, but growth revealed fundamental constraints.

Figure 1: Previous RDS-based architecture

Observed constraints and impact

  • Elasticity and cost: To gain IOPS we upsized instances despite CPU/RAM headroom, leading to 30–50% idle capacity off‑peak; resizing required restarts (10–30 min impact).
  • Read scalability and freshness: Under write bursts, replica lag p95 reached 5–20 minutes; read‑after‑write success fell below 99.5% at peak; downstream analytics SLAs were missed.
  • Operational agility: Multi‑TB index/DDL work consumed 2–6 hours/change; quarterly maintenance totaled 8–12 hours/cluster; change failure rate elevated during long windows.
  • Availability and recovery: Manual failovers ran 3–10 minutes (RTO); potential data loss up to ~60s (RPO) during incidents; cross‑AZ runbooks were complex.
  • Large‑table contention: Long scans increased lock waits and p95 query latency by 20–40% during peaks.
  • Platform currency: MySQL 5.7 EOL increased support burden, security exposure, and extended-support costs.

Our modernization plan

We executed a two‑phase approach to de‑risk change and accelerate value:

  1. Upgrade all clusters to MySQL 8.0 to eliminate EOL risk, unlock optimizer and engine improvements, and standardize client compatibility.
  2. Migrate to Amazon Aurora MySQL to decouple compute from storage, minimize replica lag, simplify failover, and reduce operational overhead. 

The diagram below illustrates our target Aurora architecture.

Figure 2: Target Aurora-based architecture

Phase 1: Upgrading From MySQL 5.7 to MySQL 8.0 (blue‑green)

Objective: Achieve currency and immediate performance/operability wins with minimal downtime.

Approach

  • Stood up a parallel “green” MySQL 8.0 environment alongside the existing 5.7 “blue”.
  • Ran compatibility, integration, and performance tests against green; validated drivers and connection pools.
  • Orchestrated cutover: freeze writes on blue, switch application endpoints to green, closely monitor, then decommission.
  • Rollback strategy: There is no supported downgrade from 8.0 to 5.7, so we kept blue (5.7) intact and time‑boxed a 4‑hour rollback window during low traffic; on SLO regression, set green read‑only, switch endpoints back to blue, re‑enable writes, and reconcile by replaying delta transactions from the upstream cache.

Key considerations and fixes

  • SQL and deprecated features: Refactored incompatible queries and stored routines; reviewed optimizer hints.
  • Character set: Validated/standardized on utf8mb4; audited column/index lengths accordingly.
  • Authentication: Planned client updates for 8.0’s default caching_sha2_password vs mysql_native_password.
  • Auto‑increment lock mode: Verified innodb_autoinc_lock_mode defaults and replication mode to avoid duplicate key hazards.
  • Regular expression guards: Tuned regexp_time_limit / regexp_stack_limit where long‑running regex filters were valid.
  • Buffer pool warm state: Tuned innodb_buffer_pool_load_at_startup to preload hot pages on startup for faster warmup; paired with innodb_buffer_pool_dump_at_shutdown to persist cache at shutdown.
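
As a concrete illustration of the checks above, here is a minimal sketch of auditing the relevant server variables on the candidate 8.0 environment before cutover. The hostname and credentials are hypothetical, and pymysql is just one client option:

import os
import pymysql

# Server variables worth auditing before promoting the green (8.0) environment.
SETTINGS_TO_AUDIT = [
    "character_set_server",               # expect utf8mb4
    "default_authentication_plugin",      # caching_sha2_password by default on 8.0
    "innodb_autoinc_lock_mode",           # verify against the replication mode in use
    "regexp_time_limit",                  # guard long-running regex filters
    "regexp_stack_limit",
    "innodb_buffer_pool_dump_at_shutdown",
    "innodb_buffer_pool_load_at_startup",
]

def audit_settings(host: str) -> dict:
    """Return the current value of each server variable we care about."""
    conn = pymysql.connect(host=host,
                           user=os.environ["DB_USER"],
                           password=os.environ["DB_PASSWORD"])
    try:
        with conn.cursor() as cur:
            values = {}
            for name in SETTINGS_TO_AUDIT:
                cur.execute("SHOW GLOBAL VARIABLES LIKE %s", (name,))
                row = cur.fetchone()
                values[name] = row[1] if row else None
            return values
    finally:
        conn.close()

if __name__ == "__main__":
    for name, value in audit_settings("green-mysql80.example.internal").items():
        print(f"{name} = {value}")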

Outcome

  • Downtime kept to minutes via blue‑green cutover.
  • Immediate benefits from 8.0: improved optimizer, JSON functions, CTEs/window functions, better instrumentation.
  • Reduced future migration risk by standardizing engines and drivers ahead of the Aurora migration.

The diagram below illustrates the RDS MySQL upgrade path.

Figure 3: MySQL 5.7 to MySQL 8.0 Upgrade Path

Phase 2: Migrating to Aurora MySQL (minimal downtime)

Why Amazon Aurora MySQL

Aurora’s cloud-native architecture directly addresses the core constraints—scalability, cost and operational inefficiencies, and replication/high availability—outlined in the Technical Drivers for Migration section.

  • Decoupled compute and storage: Writer and readers share a distributed, log‑structured storage layer that auto‑scales in 10‑GB segments, with six copies across three AZs.
  • Near‑zero replica lag: Readers consume the same storage stream; typical lag is single‑digit milliseconds.
  • Fast, predictable failover: Reader promotion is automated; failovers complete in seconds with near‑zero data loss (strong RPO/RTO posture). For further reading, see Aurora DB Cluster – High Availability.
  • Right‑sized elasticity: Independently scale read capacity; add/remove readers in minutes without storage moves.
  • Operational simplicity: Continuous backups to S3, point‑in‑time restore, fast database cloning (copy‑on‑write) for testing/ops.
  • Advanced capabilities: Parallel Query for large scans, Global Database for cross‑region DR, and optional Serverless v2 for spiky workloads.

Migration Strategy

Our goal was a low-downtime move from RDS MySQL to Aurora MySQL while keeping a fast rollback path. Specifically, we aimed to:

  • Minimize write downtime and application risk.
  • Maintain an immediate rollback option.

Migration Path

  • Source RDS (STATEMENT): Maintain the primary on STATEMENT binlog_format to satisfy existing downstream ETL requirements.
  • ROW bridge replica (RDS): Create a dedicated RDS read replica configured with:
    • binlog_format=ROW
    • log_slave_updates=1
    • binlog_row_image=FULL

This emits the row-based events required to replicate into the Aurora replica cluster without changing the primary's binlog format.
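
As an illustration, settings like these can be applied to the bridge replica's dedicated parameter group with boto3; the parameter group name and region below are hypothetical:

import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Apply the ROW-bridge settings to the dedicated replica's parameter group.
rds.modify_db_parameter_group(
    DBParameterGroupName="prr-row-bridge-mysql80",   # hypothetical name
    Parameters=[
        # Dynamic parameters can take effect immediately.
        {"ParameterName": "binlog_format", "ParameterValue": "ROW",
         "ApplyMethod": "immediate"},
        {"ParameterName": "binlog_row_image", "ParameterValue": "FULL",
         "ApplyMethod": "immediate"},
        # log_slave_updates is static, so it applies on the next reboot.
        {"ParameterName": "log_slave_updates", "ParameterValue": "1",
         "ApplyMethod": "pending-reboot"},
    ],
)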

Validation and query replay (pre-cutover)

  • Parity checks: Continuous row counts/checksums on critical tables; spot data sampling.
  • Shadow reads: Route a portion of read traffic to the Aurora reader endpoint; compare results/latency.
  • Query replay: Capture representative workload (from binlogs/proxy mirror) and replay against Aurora to validate execution plans, lock behavior, and performance.
  • Real-time replication monitoring: Continuously track replication lag, error rates, and binlog position between RDS and Aurora to ensure data consistency and timely sync before cutover.
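
A minimal sketch of the parity checks above, again using pymysql as an illustrative client (hostnames, credentials, and table names are hypothetical, and checksums are only comparable once replication has caught up):

import os
import pymysql

CRITICAL_TABLES = ["reviews", "ratings", "questions", "answers"]  # illustrative names

def snapshot(host: str, database: str) -> dict:
    """Collect COUNT(*) and CHECKSUM TABLE for each critical table."""
    conn = pymysql.connect(host=host, user=os.environ["DB_USER"],
                           password=os.environ["DB_PASSWORD"], database=database)
    try:
        with conn.cursor() as cur:
            results = {}
            for table in CRITICAL_TABLES:
                cur.execute(f"SELECT COUNT(*) FROM `{table}`")
                count = cur.fetchone()[0]
                cur.execute(f"CHECKSUM TABLE `{table}`")
                checksum = cur.fetchone()[1]
                results[table] = (count, checksum)
            return results
    finally:
        conn.close()

def compare(rds_host: str, aurora_host: str, database: str) -> bool:
    """Flag any table whose row count or checksum differs between source and target."""
    source, target = snapshot(rds_host, database), snapshot(aurora_host, database)
    ok = True
    for table in CRITICAL_TABLES:
        if source[table] != target[table]:
            ok = False
            print(f"MISMATCH {table}: rds={source[table]} aurora={target[table]}")
    return ok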

Traffic shift and promotion

  • Phase 1 — Shift reads: Gradually route read traffic to the Aurora reader endpoint; monitor replica lag, p95/p99 latency, lock waits, and error rates.
  • Phase 2 — Promote: Promote the Aurora replica cluster to a standalone Aurora DB cluster (stop external replication).
  • Phase 3 — Shift writes: Redirect application writes to the Aurora writer endpoint; keep reads on Aurora readers.

Operational safeguards

  • Coordination: Quiesce non‑critical writers and drain long transactions before promotion.
  • DNS endpoint management: Before migration, set the DNS TTL to 5 seconds to enable rapid propagation. Update the existing DNS CNAME to point to the Aurora cluster endpoints during cutover, then revert as needed for rollback (see the sketch after this list).
  • Config hygiene: Update connection strings/secrets atomically; ensure pools recycle to pick up new endpoints.
  • Monitoring: Watch replica lag, query latency, deadlocks, and parity checks throughout and immediately after cutover.
  • Rollback: If SLOs degrade or parity fails, redirect writes back to RDS; the standby rollback replica can be promoted quickly for a clean fallback.
  • Runbooks: Every migration step was governed by clear-cut runbooks, practiced and tested in advance using a parallel setup to ensure reliability and repeatability.
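
To make the DNS step concrete, here is a minimal sketch of the endpoint flip, assuming Route 53 as the DNS provider; the hosted zone ID, record name, and endpoints are hypothetical:

import boto3

route53 = boto3.client("route53")

def point_cname(zone_id: str, record_name: str, target: str, ttl: int = 5) -> None:
    """Upsert the application's CNAME so it resolves to the given database endpoint."""
    route53.change_resource_record_sets(
        HostedZoneId=zone_id,
        ChangeBatch={
            "Comment": "Aurora cutover endpoint flip",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,           # keep TTL low so a rollback propagates quickly
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

# Cutover: point the application alias at the Aurora writer endpoint.
# point_cname("Z123EXAMPLE", "db-writer.internal.example.com",
#             "prr-cluster.cluster-abc123.us-east-1.rds.amazonaws.com")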

The diagram below illustrates the RDS MySQL to Aurora MySQL migration path.

Figure 4: RDS MySQL to Aurora MySQL Migration Path

Results and what improved

  • Data freshness: Replica lag reduced to milliseconds; read-after-write ≥ 99.9%; analytics staleness p95 ≤ 5 min.
  • Scalability: Read scaling efficiency ≥ 0.8; writer CPU ≤ 70%.
  • Availability:  RTO p95 ≤ 30s; RPO ≤ 1s; observed availability ≥ 99.95%.
  • Operability: PITR restore ≤ 15 min; fast clone ≤ 60s; maintenance time −50% vs. baseline.
  • Cost alignment: No storage overprovisioning; replicas reduced from 7 to 1, yielding ~40% total cost reduction.

What Made Our Aurora Migration Unique to Bazaarvoice

While Aurora’s scalability and reliability are well known, our implementation required engineering solutions tailored to Bazaarvoice’s scale, dependencies, and workloads:

  • Foundational discovery: Three months of in-depth system analysis to map every dependency — the first such effort for this platform — ensuring a migration with zero blind spots.
  • Large-scale dataset migration: Handled individual tables measured in terabytes, requiring specialized sequencing, performance tuning, and failover planning to maintain SLAs.
  • Replica optimization: Reduced the read replica footprint through query pattern analysis and caching strategy changes, delivering over 50% cost savings without performance loss.
  • Beyond defaults: Tuned Aurora parameters such as auto_increment_lock_mode and InnoDB settings to better match our workload rather than relying on out-of-the-box configurations.
  • Replication format constraint: Our source ran statement-based replication with ETL triggers on read replicas, which meant we couldn’t create an Aurora reader cluster directly. We engineered a migration path that preserved ETL continuity while moving to Aurora.
  • Custom workload replay tooling: Developed scripts to filter RDS-specific SQL from production query logs before replaying in Aurora, enabling accurate compatibility and performance testing ahead of staging (a simplified sketch follows this list).
  • Environment-specific tuning: Applied different parameter groups and optimization strategies for QA, staging, and production to align with their distinct workloads.
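
A simplified sketch of that log-filtering idea, assuming a plain-text query log on stdin; the patterns below (rdsadmin housekeeping, mysql.rds_* management procedures) are illustrative only, and our production tooling strips a broader set of RDS-managed statements:

import re
import sys

RDS_SPECIFIC = [
    re.compile(r"\bmysql\.rds_\w+", re.IGNORECASE),   # RDS management procedures
    re.compile(r"\brdsadmin\b", re.IGNORECASE),       # RDS housekeeping account/schema
]

def is_replayable(statement: str) -> bool:
    """Keep only application statements worth replaying against Aurora."""
    return not any(pattern.search(statement) for pattern in RDS_SPECIFIC)

if __name__ == "__main__":
    # Usage: python filter_replay.py < general_query.log > replay.sql
    for line in sys.stdin:
        if is_replayable(line):
            sys.stdout.write(line)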

These deliberate engineering decisions transformed a standard Aurora migration into a BV-optimized, cost-efficient, and risk-mitigated modernization effort.

Learnings and takeaways

  • Standardize early to remove surprises: Commit to utf8mb4 (with index length audits), modern auth (caching_sha2_password), and consistent client libraries/connection settings across services. This eliminated late rework and made test results portable across teams.
  • Build safety into rollout, not after: Shadow reads and row‑level checksums caught drift before cutover; plan diffs and feature flags kept risk bounded. Go/no‑go gates were tied to SLOs (p95 latency, error rate), not gut feel, and a rehearsed rollback made the decision reversible.
  • Rollback is a design requirement: With no supported 8.0→5.7 downgrade, we ran blue‑green, time‑boxed a 4‑hour rollback window during low traffic, kept low DNS/ALB TTLs, and pre‑captured binlog coordinates. On regression: set green read‑only, flip endpoints to blue, re‑enable writes, then reconcile by replaying delta transactions from the upstream cache.
  • Backward replication helps but is brittle: Short‑lived 8.0‑primary→5.7‑replica mirroring de‑risked bake windows, yet required strict session variable alignment and deferring 8.0‑only features until validation completed.
  • Partition first, automate always: Domain partitioning lets us upgrade independently and reduce blast radius. Automation handled drift detection, endpoint flips, buffer pool warm/cold checks, and post‑cutover validation so humans focused on anomalies, not checklists.
  • Operability hygiene pays compounding dividends: Keep transactions short and idempotent; enforce parameter discipline; cap runaway regex (regexp_time_limit/regexp_stack_limit); validate autoinc lock mode for your replication pattern; and keep instances warm by pairing innodb_buffer_pool_dump_at_shutdown with innodb_buffer_pool_load_at_startup. Snapshots + PITR isolated heavy work, and regular failover drills made maintenance predictable.
  • Measurable outcome: Replica lag dropped to milliseconds; cutovers held downtime to minutes; read scale‑out maintained p95 latency within +10% at target QPS with writer CPU ≤ 70% and read scaling efficiency ≥ 0.8; replicas reduced from 7 to 1, driving ~40% cost savings.

Conclusion

The strategic migration from Amazon RDS to Amazon Aurora MySQL was a significant milestone. By upgrading to MySQL 8.0 and then migrating to Aurora, we successfully eliminated replication bottlenecks and moved the responsibility for durability to Aurora’s distributed storage. This resulted in tangible benefits for our platform and our customers, including:

  • Improved Data Freshness: Replica lag was reduced to milliseconds, significantly improving downstream analytics and client experiences.
  • Enhanced Scalability and Cost Efficiency: We can now right-size compute resources independently of storage, avoiding over-provisioning for storage and IOPS. Read capacity can also be scaled independently.
  • Higher Availability: The architecture provides faster failovers and stronger multi-AZ durability with six-way storage replication.
  • Simplified Operations: Maintenance windows are shorter, upgrades are easier, and features like fast clones allow for safer experimentation.

This new, resilient, and scalable platform gives us the freedom to move faster with confidence as we continue to serve billions of consumers and our customers worldwide.

Acknowledgements

This milestone was possible thanks to many people across Bazaarvoice — our database engineers, DevOps/SRE, application teams, security and compliance partners, product, principal engineering community and leadership stakeholders. Thank you for the rigor, patience, and craftsmanship that made this transition safe and successful.


Flyte 1 Year In

On the racetrack of building ML applications, traditional software development steps are often overtaken. Welcome to the world of MLOps, where unique challenges meet innovative solutions and consistency is king. 

At Bazaarvoice, training pipelines serve as the backbone of our MLOps strategy. They underpin the reproducibility of our model builds. A glaring gap existed, however, in graduating experimental code to production.

Rather than continuing down piecemeal approaches, which often centered heavily on Jupyter notebooks, we determined a standard was needed to empower practitioners to experiment and ship machine learning models extremely fast while following our best practices.

(cover image generated with Midjourney)

Build vs. Buy

Fresh off the heels of our wins from unifying our machine learning (ML) model deployment strategy, we first needed to decide whether to build a custom in-house ML workflow orchestration platform or seek external solutions.

When we decided to “buy” (it is open source, after all), Flyte emerged as the clear choice for our workflow management platform. It saved invaluable development time and nudged our team closer to delivering a robust self-service infrastructure. Such an infrastructure allows our engineers to build, evaluate, register, and deploy models seamlessly. Rather than reinventing the wheel, Flyte equipped us with an efficient wheel to race ahead.

Before leaping with Flyte, we embarked on an extensive evaluation journey. Choosing the right workflow orchestration system wasn’t just about selecting a tool but also finding a platform to complement our vision and align with our strategic objectives. We knew the significance of this decision and wanted to ensure we had all the bases covered. Ultimately the final tooling options for consideration were Flyte, Metaflow, Kubeflow Pipelines, and Prefect.

To make an informed choice, we laid down a set of criteria:

Criteria for Evaluation

Must-Haves:

  • Ease of Development: The tool should intuitively aid developers without steep learning curves.
  • Deployment: Quick and hassle-free deployment mechanisms.
  • Pipeline Customization: Flexibility to adjust pipelines as distinct project requirements arise.
  • Visibility: Clear insights into processes for better debugging and understanding.

Good-to-Haves:

  • AWS Integration: Seamless integration capabilities with AWS services.
  • Metadata Retention: Efficient storage and retrieval of metadata.
  • Startup Time: Speedy initialization to reduce latency.
  • Caching: Optimal data caching for faster results.

Neutral, Yet Noteworthy:

  • Security: Robust security measures ensuring data protection.
  • User Administration: Features facilitating user management and access control.
  • Cost: Affordability – offering a good balance between features and price.

Why Flyte Stood Out: Addressing Key Criteria

Diving deeper into our selection process, Flyte consistently addressed our top criteria, often surpassing the capabilities of other tools under consideration:

  1. Ease of Development: Pure Python | Task Decorators
    • Python-native development experience
  2. Pipeline Customization
    • Easily customize any workflow and task by modifying the task decorator (see the sketch after this list)
  3. Deployment: Kubernetes Cluster
    • Flyte’s native Kubernetes integration simplified the deployment process
  4. Visibility
    • Easily accessible container logs
    • Flyte decks enable reporting visualizations
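
To give a flavor of that Python-native, decorator-driven experience, here is a minimal flytekit sketch; the task and workflow names are illustrative, not our production pipelines:

from typing import List

from flytekit import Resources, task, workflow

@task(requests=Resources(cpu="1", mem="2Gi"), cache=True, cache_version="1.0")
def process_raw_data(source_uri: str) -> List[float]:
    # Extract, clean, and transform raw data (stubbed for illustration).
    return [1.0, 2.0, 3.0]

@task(requests=Resources(cpu="2", mem="4Gi"))
def train_model(features: List[float]) -> float:
    # Train and return a validation metric (stand-in for a serialized model).
    return sum(features) / len(features)

@workflow
def full(source_uri: str) -> float:
    # Compose tasks into a single reproducible training pipeline.
    return train_model(features=process_raw_data(source_uri=source_uri))

Customizing a pipeline is then largely a matter of editing those decorators (resources, caching, retries) rather than touching orchestration plumbing.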

The Bazaarvoice customization

As with any platform, while Flyte brought many advantages, it was not a pure plug-and-play solution for our unique needs. We anticipated the platform’s novelty within our organization, so we wanted to reduce the learning curve as much as possible and allow our developers to transition smoothly without being overwhelmed.

To smooth the transition and expedite the development process, we’ve developed a cookiecutter template to serve as a launchpad for developers, providing a structured starting point that’s standardized and aligned with best practices for Flyte projects. This structure empowers developers to swiftly construct training pipelines.

The most relevant files provided by the template are:

  • Pipfile    - Details project dependencies 
  • Dockerfile - Builds docker container
  • Makefile   - Helper file to build, register, and execute projects
  • README.md  - Details the project 
  • src/
    • tasks/
    • workflows.py (follows the Kedro Standard for Data Layers)
      • process_raw_data - workflow to extract, clean, and transform raw data 
      • generate_model_input - workflow to create train, test, and validation data sets 
      • train_model - workflow to generate a serialized, trained machine learning model
      • generate_model_output - workflow to prevent train-serving skew by performing inference on the validation data set using the trained machine learning model
      • evaluate - workflow to evaluate the model on a desired set of performance metrics
      • reporting - workflow to summarize and visualize model performance
      • full - complete Flyte pipeline to generate trained model
  • tests/ - Unit tests for your workflows and tasks
  • run - Simplifies running of workflows

In addition, a common challenge in developing pipelines is needing resources beyond what our local machines offer, or running tasks that require extended runtimes. Flyte does grant the capability to develop locally and run remotely. However, this involves a series of steps:

  • Rebuild your custom docker image after each code modification.
  • Assign a version tag to this new docker image and push it to ECR.
  • Register this fresh workflow version with Flyte, updating the docker image.
  • Instruct Flyte to execute that specific version of the workflow, parameterizing via the CLI.

To circumvent these challenges and expedite the development process, we designed the template’s Makefile and run script to abstract the series of steps above into a single command!

./run --remote src/workflows.py full

The Makefile uses a couple of helper targets, but overall provides the following commands:

  • info       - Prints info about this project
  • init       - Sets up project in flyte and creates an ECR repo 
  • build      - Builds the docker image 
  • push       - Pushes the docker image to ECR 
  • package    - Creates the flyte package 
  • register   - Registers version with flyte
  • runcmd     - Generates run command for both local and remote
  • test       - Runs any tests for the code
  • code_style - Applies black formatting & flake8

Key Triumphs

With Flyte as an integral component of our machine learning platform, we’ve achieved unmatched momentum in ML development. It enables swift experimentation and deployment of our models, ensuring we always adhere to best practices. Beyond aligning with fundamental MLOps principles, our customizations ensure Flyte perfectly meets our specific needs, guaranteeing the consistency and reliability of our ML models.

Closing Thoughts

Just as racers feel exhilaration crossing the finish line, our team feels an immense sense of achievement seeing our machine learning endeavors zoom with Flyte. As we gaze ahead, we’re optimistic, ready to embrace the new challenges and milestones that await. 🏎️

If you are drawn to this type of work, check out our job openings.

Vger Lets You Boldly Go . . .

Are you working on an agile team? Odds are high that you probably are. Whether you do Scrum/Kanban/lean/extreme, you are all about getting work done with the least resistance possible. Heck, if you are still on Waterfall, you care about that. But how well are you doing? Do you know? Is that something a developer or a lead should even worry about, or is it somebody else’s problem (a SEP)? That’s a trick question. If your team is being held accountable and there is a gap between their expectations and your delivery, by the transitive property, you should worry about some basic lean metrics.

Here at Bazaarvoice, we are agile and overwhelmingly leverage kanban. Kanban emphasizes the disciplines of flow and continuous improvement. In an effort to make data-driven decisions about our improvements, we needed an easy way to get the relevant data. With just JIRA and GitHub alone, access to the right data has a significant barrier to entry.

So, like any enterprising group of engineers, we built an app for that.

What we did

Some of us had recently gone through an excellent lean metric forecasting workshop with Troy Magennis from Focused Objective. In his training he presented the idea of displaying a quadrant of lean metrics in order to force a narrative for a team’s behavior, and to avoid overdriving on a single metric. This really resonated with me and seemed like a good paradigm for the app we wanted to build.

And thus, Vger was born.

We went with a simple quadrant view with very bookmarkable URL parameters. We made it simple for teams to self-serve by giving them an interface to make their own “Vger team” and add whatever “Vger boards” they need. Essentially, if you can make a JQL query and give it a board in JIRA, Vger can graph the metrics for it. In the display, we provide a great deal of flexibility by letting teams configure date ranges for the dashboard, work types to be displayed, and the JIRA board columns to be considered as working/non-working.

Now the barrier to entry for lean metrics is down to “can you open a browser.”  Not too shabby.

The Quadrant View

We show the following in the quadrant view:

1. Throughput – The number of completed tickets per week.

2. Variation – the variation (standard deviation/mean) for the Throughput.

3. Backlog Growth – the tickets opened versus closed.

4. Lead Times – The lead times for the completed tickets. This also provides a detailed view by Jira board column to see where you spend most of your time.

We at Bazaarvoice are conservative gamblers, so you’ll see the throughput and lead time quadrants show the 50%, 80%, and 90% likelihood (the inverse of percentile). We do this because relying on the average or mean is not your friend. Who wants to bet on a coin toss? Not us. We like to be right, say, eight out of ten times.

The Quarterly View

Later, we were asked to show throughput by quarter to help with quarterly goal planning. We created a side-car page for this.  It shows Throughput by quarter:

We also built a scatterplot for lead times so outliers could be investigated:

This view has zoomable regions and each point lets you click through to the corresponding JIRA ticket. So that’s nice.

But Wait! Git This….

From day one, we chose to show the same Quadrant for GitHub Pull Requests.

Note that we show rejected and merged lines in the PR Volume quadrant.  We also support overlaying your git tags on both PR and JIRA ticket data.  Pretty sweet!

I Want to Do More

Vger lets you download throughput data from the Quadrant and Quarterly views. You can also download lead time from the Quarterly view too. This lets teams and individuals perform their own visualizations and investigations on these very useful lean metrics.

But Why?

Vger was built with three use cases in mind:

Teams should be informed in retros

Teams should have easy access to these key lean metrics in their retros. We recommend that they start off viewing the quadrant and seeing if they agree with the narrative the retro facilitator presents. They should also consider the results of any improvement experiments they tried. Did the new behavior make throughput go up as they hoped it would? Did it reduce time spent in code review? Did it reduce the number of open bugs? And so on. Certainly not everything in a retro should be mercilessly data-driven, but it is a key element of a culture of continuous improvement.

Managers should know this data and speak to it

Team managers commonly speak to how their teams are progressing. These discussions should be data-driven, and most importantly they should be driven by the same data the team has access to (and hopefully uses in retros). It should also be presented in a common format that still provides for some customization. NOTE: You should avoid comparing team to team in Vger or a similar visualization. In most situations, that way leads to futility, confusion, and frustration.

We should have data to drive data-driven decisions about the future

Lean forecasting is beyond the scope of this post; however, Troy Magennis has a fine take on it. My short two cents on the matter: a reasonably functioning team with even a little bit of run time should never be asked “how long will it take?” Drop that low-value ritual and do the high-value task of decomposing the work, then forecast with historical data. Conveniently, you can download this historical data from Vger to use in your spreadsheet of choice. I happen to like Monte Carlo simulations myself.
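
For example, a bare-bones Monte Carlo forecast over weekly throughput downloaded from Vger might look like this (the throughput numbers in the usage comment are made up):

import random

def forecast_weeks(weekly_throughput, backlog, simulations=10_000, percentile=0.85):
    """Bootstrap historical weekly throughput to estimate weeks until 'backlog' items are done."""
    history = [t for t in weekly_throughput if t > 0]  # ignore zero-throughput weeks
    outcomes = []
    for _ in range(simulations):
        remaining, weeks = backlog, 0
        while remaining > 0:
            remaining -= random.choice(history)  # sample a historical week at random
            weeks += 1
        outcomes.append(weeks)
    outcomes.sort()
    return outcomes[int(percentile * (len(outcomes) - 1))]

# e.g. "we're 85% likely to finish 60 decomposed items within N weeks":
# forecast_weeks([4, 7, 3, 9, 5, 6, 2, 8], backlog=60)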

Isn’t This for Kanban?

You’ll note I used the term “lean metrics” throughout. I wanted to avoid any knee-jerk “kanban vs scrum vs ‘how we do things’” reaction. These metrics apply no matter what methodology you consciously (or unconsciously) use for the flow of work through your team. It was built with feature development teams in mind, but we had good success when our client implementation team started using it as an early adopter. It allowed them to have a clear view into their lead time details and ferret out how much time was really spent waiting on clients to perform an action on their end.

Cool. How Do I Get a Vger?

We open sourced it here, so help yourself. This is presented as “it worked for us”-ware and is not as finely polished as it could be, so it has some caveats. It is a very simple serverless app. We use JIRA and GitHub, so only those tools are currently supported. If you use similar tools, give Vger a try!

What’s Next?

If your fingers are itching to contribute, here’s some ideas:

  • Vger’s ETL process could really use an update
  • The Quadrant view UI really needs an update to React to match the Quarterly view
  • Make it flexible for your chosen issue tracker or source control?
  • How about adding a nice Cumulative Flow Diagram?

 

As a software engineer, how do I change my career to DevOps?

At Bazaarvoice, we’re big fans of cloud. Real big. We’re also fans of DevOps. There’s been a lot of discussion over the past several years about “What is DevOps?” Usually, this term is used to describe Systems Engineers and Site Reliability Engineers (sometimes called Infrastructure Engineers, Systems Engineers, Operations Engineers or, in the most unfortunate case, Systems Administrators, which is an entirely different job!). This is not what DevOps means, but in the context of career development, it carries the connotation of a “modern” Systems or Site Reliability Engineer.

There’s a lot of great literature about what a DevOps engineer is. I encourage you to read this interview of Google’s VP of Engineering, as well as Hixson and Beyer’s excellent paper on Systems Engineering and its distinction among Software, Systems and Site Reliability engineers. Although DevOps engineering goes beyond these technical descriptions, I’ll save that exegesis for another time. (Write me if you want to hear it, though!)

Many companies claim to hire or employ DevOps engineers. Few actually do. Bazaarvoice does. Google does, too, although they’re very hipster about it (they called it Site Reliability Engineering before the term DevOps landed on the scene, so they don’t call it DevOps because they had it before it was cool, or something). I don’t know about other companies because I haven’t worked at them (well, I haven’t worked at Google either, but they are pretty vocal about their engineering philosophies, so I’ll take them at their word). But there’s a lot of industry buzzwordium with little substance. This isn’t a jab at other companies (but really, Bazaarvoice is way cooler), it’s just a side-effect of assigning job titles based on pop culture. If you’re really a DevOps engineer, then you already know all of this, and you probably filter out a lot of this nonsense on a daily basis.

But we’re here to answer a specific question: If I’m already a software engineer, how do I become a DevOps engineer?

So, you’re a developer and you want to get in on the ops action. Boy, are you in for a surprise! This isn’t about installing Arch Linux and learning to write Perl. There’s a place for that kind of thing (a very small, dark place in a very distant corner of the universe), but that isn’t inherently what DevOps means.

Let’s consider a few of the properties and responsibilities of DevOps engineering.

A DevOps engineer:

  • Writes code / software. In fact, he is a proper software engineer.
  • Builds tools.
  • Does the painful things, as often and frequently as possible.
  • Participates in the on-call rotation (yes, for 2 a.m. production outages).
  • Infrastructure design.
  • Scaling systems (any system or subsystem — networking, applications, load balancers).
  • Maintenance. Like rebooting that frail vhost with a memory leak that no one’s bothered to fix or take ownership of.
  • Monitoring.
  • Virtualization.
  • Agile/kanban/whatever development methodology. It’s not so much that agile is “right.” It’s just the most efficient way to complete a work queue (taking into account interruptions and blockers). A good DevOps engineer has strong opinions about this!
  • Software release cycles and management. In fact, you might even see “development methodology” and software release cycles as the same thing.
  • Automation. Automation. Automation.
  • Designing a branch/release strategy for the provided SCM (git, Mercurial, svn, etc). Which you do have.
  • Metrics / reporting. Goes hand-in-hand with monitoring, although they are different.
  • Optimization / tuning. Of applications, tools, services, hardware…anything.
  • Load and performance testing and benchmarking, including performance testing of highly complex systems. And you know the difference between load testing and performance testing.
  • Cloud. Okay, you don’t really have to have cloud experience, but it can fundamentally change the way you think about complex systems. No one in a colo facility devised the notion of “immutable infrastructure.”
  • Configuration management. Or not. You have an opinion about it. (You’ve surely heard of Puppet, Chef, Ansible, etc. Yes?)
  • Security. At every layer.
  • Load balancing / proxying / replicating. (Of services, systems, components and processes.)
  • Command-line fu. A DevOps engineer is familiar with tools at his disposal for debugging, diagnosing and fixing issues on one or many servers, quickly. You know how pipes work, and you can count how many records contained some phrase in a log file with ease, for example.
  • Package management.
  • CI/CIT/CD — continuous integration, continuous integration testing, and continuous deployment. This is the closest thing to the real meaning of “DevOps” that a Systems Engineer will do.
  • Databases. All of them. SQL, NoSQL, whatever. Distributed ones, too!
  • Solid systems expertise. We’re talking about the networking stack, how hard disks work, how filesystems work, how system memory works, how CPU’s work, and how all these things come together. This is the traditional “operations” expertise you’ve heard about.

Phew! That’s a lot. Turns out, almost all of these skills are directly applicable to software engineering. The only difference is the breadth of domain, but a good software engineer will grow his breadth of domain expertise into operations naturally anyway! A DevOps engineer just starts his growth from a different side of the engineering career map.

Let’s stop and think for a moment about some things a DevOps engineer is not. These details are critically important!

A DevOps engineer is not:

  • Easier than being a software engineer. (Ding! It is being a software engineer.)
  • Never writing code. I write tons of code.
  • Installing Linux and never touching your favorite OS again.
  • Working the third shift. (At least, it shouldn’t be; if it is, quit your job and come work with me.)
  • Inherently more “fun” than being a software engineer, although you may prefer it, if you’re into that kind of thing.
  • Greenfield. You’ll deal with old stuff in addition to new stuff. But as a good engineer, you care about business value and pragmatism.
  • An unsuccessful software engineer. Really: if you can’t write code, don’t expect to be a good DevOps engineer until you can.

A career shift

Here are a few things you should do to begin positioning yourself as a DevOps engineer.

  • Realize that you’re already an engineer, so becoming a DevOps engineer means you are just moving yourself to a different domain to grow and learn from a different direction.
  • Interview at a company that’s hiring DevOps. If you get hired, you’ll learn the operations side of things fast. Real fast. Or get fired. (Hint: you should disclose your experience honestly!) If you don’t get hired, you’ll learn what is still missing from your resume / experience that’s preventing you from becoming a full-time DevOps engineer. Incidentally, we’re hiring. 🙂
  • Tell your boss you want to become a DevOps engineer at your company. Your boss should help you to this end. If he/she does not, quit. Then come work at Bazaarvoice with me and a bunch of really awesome, super talented engineers working on some really awesome and challenging problems.
  • Obtain practical experience by using your skills as a software engineer to build tools rather than applications. Look at any of the open source projects Netflix has written for examples / ideas.
  • Learn OpenStack or some equivalent infrastructure-level project. (OpenStack has tremendous breadth, which is why I recommend it.) You can do this on your own time and budget. It’s not important whether OpenStack sucks compared to Rackspace Cloud. What’s important is that you understand all of the various components and why they are important. Have a wad of cash lying around? Learn Amazon Web Services or Google Compute Engine instead.
  • Bonus: learn about Apache Mesos and Kubernetes, and why they’re useful / important. Using them is less important than understanding them.
  • Participate in anything your team does involving operations — deployment, scale, etc. (See list above: “What DevOps is.”) If your team doesn’t do any of that (i.e., they send artifacts over to Operations and the Operations team does deployment), go over to the Operations team and sit in on a few deployments. You may be surprised!

Do I need to have deep operations experience to become a good devops engineer?

I’ve asked myself the same question. I come from a development background myself and only had less than a year of experience dealing with operations (part time) before becoming a DevOps engineer. (Granted, I had a referral vouching for me, which weighed in my favor.) Despite my less-than-stellar CS/algorithm skills (based on my complete lack of formal computer science education), I’ve had enough experience writing software that I could apply these concepts to systems in a similar fashion. That is, in the same way a software engineer needs to understand at what point his application code should be abstracted in support of future changes (i.e., refactored into smaller components), a DevOps engineer needs to understand at what point a component of his infrastructure should be abstracted in support of future changes (i.e., rebuilding all of his servers and rearchitecting infrastructure that’s already in production, despite the potential risk, in order to solve a problem, like allowing the architecture to scale to business needs). At its core, a good engineer is just as good whether he’s writing software or deploying it. Understanding complex systems and their interactions is a critical skill. But all of these are important for software engineers, whether you’re writing application code or not!

I hope this post helps you in your endeavor to become a DevOps engineer, or at least understand what it means to be a DevOps engineer at Bazaarvoice (as I said before, it may mean something totally different at other companies!). You should get your feet wet with some of the things we do. If it gets you tingly and excited, then come work with me at Bazaarvoice: http://www.bazaarvoice.com/careers/research-and-development/.


This article was originally posted as an answer on Quora. Due to surprising popularity, I’ve updated the article and posted it here.