Category Archives: Software Business

How to seamlessly move 300 Million shoppers to a highly scalable architecture, part 2

Divide and Conquer

As engineers, we like nice, clean solutions that don’t carry along what we call technical debt: the stuff we have to go back and fix or rewrite later, or that requires significant ongoing maintenance effort. In a perfect world, we fire up the new platform and move all the traffic over. If you find that perfect world, please send an Uber for me. Add to this the scale of traffic we serve at Bazaarvoice, and it’s obvious it would take time to harden the new system.

The secret to how we pulled this off lies in the architectural choice to break the challenge into two parts: frontend and backend. While we reengineered the front end into the new JavaScript solution, there were still thousands of customers using the template-based front end. So, we took the original server-side rendering code and turned it into a service talking to our new Polloi service. This enabled us to handle requests from client sites exactly like the original Classic system.

We also created a service that improved upon the original API but remained compatible from a specific version forward. We chose not to try to be compatible with every version for all time, as all APIs go through evolution and deprecation; we naturally chose the version that was compatible with the new JavaScript front end. With these choices made, we could independently decide when and how to move clients to the new backend architecture, irrespective of the front-end service they were using.

A simplified view of this architecture looks like this:

[Figure: Divide-and-conquer architecture]

With the above in place, we can switch a JavaScript client to the new version of the API simply by changing the endpoint its API key points to. For a template-based client, we can change the endpoint to the new rendering service through a configuration change in our CDN, Akamai.
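As a rough illustration of the switching mechanism, here is a minimal sketch of routing by API key. The EndpointResolver class, Platform enum, and endpoint URLs are hypothetical stand-ins for illustration, not Bazaarvoice’s actual configuration service:

```java
// Hypothetical sketch: route an incoming API request to the Classic or new
// backend based on which platform the client's API key is assigned to.
import java.util.Map;

public class EndpointResolver {

    enum Platform { CLASSIC, NEW_ARCHITECTURE }

    // In practice this assignment would live in a configuration service;
    // a static map keeps the sketch self-contained.
    private final Map<String, Platform> apiKeyAssignments;

    public EndpointResolver(Map<String, Platform> apiKeyAssignments) {
        this.apiKeyAssignments = apiKeyAssignments;
    }

    /** Returns the base URL the request should be proxied to. */
    public String resolve(String apiKey) {
        Platform platform = apiKeyAssignments.getOrDefault(apiKey, Platform.CLASSIC);
        return platform == Platform.NEW_ARCHITECTURE
                ? "https://api.new.example.com/data"
                : "https://api.classic.example.com/data";
    }

    public static void main(String[] args) {
        EndpointResolver resolver = new EndpointResolver(
                Map.of("client-123-key", Platform.NEW_ARCHITECTURE));
        System.out.println(resolver.resolve("client-123-key"));   // routed to the new backend
        System.out.println(resolver.resolve("client-456-key"));   // still served by Classic
    }
}
```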

Testing for compatibility is a lot of work, though not particularly difficult. API compatibility is pretty straightforward, while testing whether a template page renders correctly is a little more involved, especially since those pages can be highly customized. Since this was a one-time event, we found the most effective approach for the latter was manual inspection, verifying that pages rendered exactly the same on our QA clusters as they did in the production Classic system.

We found early success by moving cohorts of customers to the new system together. At first we would move a few at a time, making absolutely sure the pages rendered correctly, monitoring system performance, and looking for any anomalies. If we saw a problem, we could move them back quickly by reversing the change in Akamai. Much of this was initially manual, so in parallel we built up tooling to handle the switching of customers, which even included working with Akamai to enhance their API so we could automate changes in the CDN.
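A minimal sketch of what such cohort-migration tooling might look like is below. The CdnConfigClient and AnomalyMonitor interfaces and the origin hostnames are hypothetical, not Akamai’s actual API:

```java
// Hypothetical cohort-migration tooling: point a batch of client sites at the
// new origin via the CDN configuration, watch for anomalies, and roll the
// whole cohort back if anything looks wrong.
import java.time.Duration;
import java.util.List;

public class CohortMigrator {

    interface CdnConfigClient {
        void setOrigin(String clientSite, String originHost);
    }

    interface AnomalyMonitor {
        /** Blocks for the observation window and reports whether errors or latency spiked. */
        boolean anomaliesDetected(List<String> clientSites, Duration window);
    }

    private static final String CLASSIC_ORIGIN = "classic-origin.example.com";
    private static final String NEW_ORIGIN = "new-arch-origin.example.com";

    private final CdnConfigClient cdn;
    private final AnomalyMonitor monitor;

    public CohortMigrator(CdnConfigClient cdn, AnomalyMonitor monitor) {
        this.cdn = cdn;
        this.monitor = monitor;
    }

    /** Moves a cohort to the new origin; reverts the entire cohort on any anomaly. */
    public boolean migrate(List<String> cohort) {
        for (String site : cohort) {
            cdn.setOrigin(site, NEW_ORIGIN);
        }
        if (monitor.anomaliesDetected(cohort, Duration.ofMinutes(30))) {
            for (String site : cohort) {
                cdn.setOrigin(site, CLASSIC_ORIGIN);   // the quick rollback path
            }
            return false;
        }
        return true;
    }
}
```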

From moving a few clients at a time, we progressed to moving tens of clients at a time. Through a tremendous parallel engineering effort, we improved the scalability of our Elasticsearch clusters and other systems, which allowed us to move hundreds of clients at a time, then 500 at a time. As of this writing, we’ve moved over 5,000 sites, and 100% of our display traffic is now served from our new architecture.

More than just serving the same traffic as before, we have also moved over display traffic for new services like our Curations product, which takes in and processes millions of tweets, Instagram posts, and other social media feeds. That our new architecture could handle this additional, large-scale use case without change is a testament to the innovative engineering and determination of our team over the last 2+ years. Our largest future opportunities are enabled because we’ve successfully realized this architectural transformation.

Rearchitecting the Team

In addition to rearchitecting the service to scale, we also had to rearchitect our team. As we set out on this journey to rebuild our solution into a scalable, cloud based service oriented architecture, we had to reconsider the very way our teams are put together.  We reimagined our team structure to include all the ingredients the team needs to go fast.  This meant a big investment in devops – engineers that focus on new architectures, deployment, monitoring, scalability, and performance in the cloud.

A critical part of this was a cultural transformation in which the service is completely owned by the team, from understanding the requirements, to code, to automated tests, to deployment, to 24×7 operation. This means building out a complete monitoring and alerting infrastructure and rotating on-call duty through all members of the team. The result is a team 100% aligned around the success of the service, with no “wall” to throw anything over – commitment and ownership stay with the team.

For this team architecture to succeed, the critical element is to ensure the team has all the skills and team players needed to succeed.  This means platform services to support the team, strong product and program managers, talented QA automation engineers that can build on a common automation platform, gifted technical writers, and of course highly talented developers.  These teams are built to learn fast, build fast, and deploy fast, completely independent of other teams.

A key element supporting these service-oriented teams is the Platform Infrastructure team we created to provide a common set of cloud services for all our teams. Platform Infrastructure is responsible for the virtual private cloud (VPC) supporting the new services running in Amazon Web Services. This team handles the overall concerns of security, networking, service discovery, and other common services within the VPC. They also established a set of best practices, such as ensuring all cloud instances are tagged with the name of the team that started them.

To ensure these best practices are followed, the Platform Infrastructure team created “Beavers” (a play on the word for an engineer at Bazaarvoice, a “BVer”). Borrowing an idea from Netflix’s Chaos Monkey, these are automated processes that examine our cloud environment in real time to ensure best practices are followed. For example, the “Conformity Beaver” runs regularly and checks that all instances and buckets are tagged with team names. If it finds one that is not, it infers the owner and emails the team alias about the problem. If the issue is not corrected, Conformity Beaver can terminate the instance. This is just one example of the many Beavers we have created to help maintain consistency in a world where we have turned teams loose to move as quickly as possible.
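Here is a minimal sketch of a Conformity-Beaver-style check, assuming the AWS SDK for Java v2. The tag name and notification stub are assumptions; the real Beaver also infers owners, emails team aliases, and can escalate to terminating the instance:

```java
// Sketch: scan EC2 instances and flag any that lack a team tag.
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.Instance;
import software.amazon.awssdk.services.ec2.model.Tag;

public class ConformityBeaver {

    private static final String TEAM_TAG = "team";   // assumed tag name

    public static void main(String[] args) {
        try (Ec2Client ec2 = Ec2Client.create()) {
            // Pagination omitted for brevity; a real check would use the paginator.
            ec2.describeInstances().reservations().stream()
                    .flatMap(reservation -> reservation.instances().stream())
                    .filter(ConformityBeaver::missingTeamTag)
                    .forEach(instance -> notifyOwner(instance.instanceId()));
        }
    }

    private static boolean missingTeamTag(Instance instance) {
        return instance.tags().stream()
                .map(Tag::key)
                .noneMatch(TEAM_TAG::equalsIgnoreCase);
    }

    private static void notifyOwner(String instanceId) {
        // Placeholder: the real process infers the owner, emails the team alias,
        // and can terminate the instance if the tag is never added.
        System.out.println("Untagged instance found: " + instanceId);
    }
}
```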

Another key common capability created by the Platform Infrastructure team is our Badger monitoring service. Badger enables teams to easily plug in a common healthcheck monitoring capability and can automatically discover nodes as they are started in the cloud. Teams implement a healthcheck that is captured in a common place and escalated through a notification system in the event of a service degradation.
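For illustration, a team’s side of that contract might look something like the sketch below: a tiny HTTP endpoint exposing a healthcheck that a Badger-style monitor could discover and poll. The path, port, and JSON shape are assumptions, not Badger’s actual contract:

```java
// Minimal healthcheck endpoint using the JDK's built-in HTTP server.
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class HealthcheckServer {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/healthcheck", exchange -> {
            boolean healthy = checkDependencies();
            byte[] body = ("{\"service\":\"example\",\"healthy\":" + healthy + "}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().add("Content-Type", "application/json");
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    // Stand-in for real checks against the service's own dependencies (DB, queues, etc.).
    private static boolean checkDependencies() {
        return true;
    }
}
```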

The Proof is in the Pudding

The Black Friday and holiday shopping season of 2015 was one of the smoothest in the history of Bazaarvoice, while serving record traffic. From Black Friday to Cyber Monday, we saw over 300 million visitors. At peak on Black Friday, we were seeing over 97,000 requests per second as we served up over 2.6 billion review impressions, a 20% increase over the year before. Years of hard work and innovation preceded this success, and it is a testament to what our new architecture is capable of delivering.

Keys to success

A few ingredients we’ve found important for successfully pulling off a large-scale rearchitecture such as the one described here:

  • Brilliant people. There is no replacement for brilliant engineers who are fearless in adopting new technologies and tackling what some will say can’t be done.
  • Strong leaders – and the right leaders at the right time. Often the leaders that sell the vision and get an undertaking like this going will need to be supplemented with those that can finish strong.
  • Perseverance and Determination – building a new platform using new technologies is going to be a much bigger challenge than you can estimate, requiring new skills, new approaches, and lots of mistakes. You must be completely determined and focused on the end game.
  • Tie back to business benefit – keep the business informed of the benefits and ensure those benefits are delivered continuously rather than in a big bang. It will be a large investment, and it is important that the business sees some level of return as quickly as possible.
  • Make space for innovation – create room for engineers to learn and grow. We support this through organizing hackathons and time for growth projects that benefit the individual, team, and company.

Rearchitecture is a Journey

One piece of advice: don’t be too critical of yourself along the way; celebrate each step of the rearchitecture journey. As software engineers, we are driven to see things “complete,” wrapped up nice and neat, finished with a pretty bow. When replacing an existing system of significant complexity, this ideal is a trap, because in reality you will never be complete. It has taken us over 3 years of hard work to reach this point, and there are more things we are in the process of moving to newer architectures. Once we complete the work in front of us now, there will be more steps to take, since we live in an ever-evolving landscape. It is important to remember that we can never truly be complete: there will always be new technologies and new architectures that deliver more capabilities to customers, faster and at lower cost. It’s a journey.

Perhaps that is the reason many companies can never seem to get started. They understandably want to know “When will it be done?” and “What is it going to cost?”, and the unpopular answers are, of course, never and more than you could imagine. The solution to this puzzle is to identify and articulate the business value to be delivered as a step in the larger design of a software platform transformation. The trouble, of course, is that you may only realistically be able to design the first few steps of your platform rearchitecture, leaving a lot of technical uncertainty ahead. Get comfortable with that and embrace it as a journey. Engineer solid solutions in a service-oriented way with clear interfaces, and your customers will be happy, never knowing they were switched to the next generation of your service.

authored by Gary Allison

How to seamlessly move 300 Million shoppers to a highly scalable architecture, part 1

At Bazaarvoice, we’ve pulled off an incredible feat, one that is such an enormous task that I’ve seen other companies hesitate to take on. We’ve learned a lot along the way and I wanted to share some of these experiences and lessons in hopes they may benefit others facing similar decisions.

The Beginning

Our original Product Ratings and Reviews service served us well for many years, though it eventually encountered severe scalability challenges. There were several aspects we wanted to change: a monolithic Java code base, fragile custom deployment, and server-side rendering. Creative use of tenant partitioning, data sharding, and horizontal read scaling of our MySQL/Solr-based architecture allowed us to scale well beyond our initial expectations; we’ve documented how we accomplished this scaling in several past posts on our developer blog if you’d like to understand more. Still, time marches on, and our clients have grown significantly in number and content over the years. New use cases have come along since the original design: an emphasis on the mobile user and responsive design, accessibility, a growing network of consumer-generated content flowing between brands and retailers, and new social content that can come in floods from Twitter, Instagram, Facebook, etc.

As you can imagine, since the product ratings and reviews in our system are displayed on thousands of retailer and brand websites around the world, the read traffic from review display far outweighs the write traffic from new reviews being created. So, the addition of clusters of Solr servers that are highly optimized for fast queries was a great scalability addition to our solution.

A highly simplified diagram of our classic architecture:

[Figure: Highly simplified view of our Classic architecture]

However, in addition to fast review display when a consumer visits a product page, another challenge started emerging from our growing network of clients. This network comprises brands like Adidas and Samsung that collect reviews on their websites from consumers who purchased a product and then want to “syndicate” those reviews to retailer ecommerce sites where shoppers can benefit from them. Aside from the challenges of product matching, which are very interesting, under the MySQL architecture this meant reviews could be copied over and over throughout the network. This approach worked for several years, but it was clear we needed a plan for the future.

As we grew, so did the challenge of an expanding volume of data in the master databases to serve across an expanding network of clients. This, together with the need to deliver more front-end web capability to our customers, drove us to what I hope you will find is a fascinating story of rearchitecture.

The Journey Begins

One of the first things we decided to tackle was to start moving analytics and reporting off the existing platform so that we could deliver new insights to our clients showing how reviews are used by shoppers in their purchase decisions. This choice also enabled us to decouple the architecture and spin up parallel teams to speed delivery. To deliver these capabilities, we adopted big data architectures based on Hadoop and HBase to assimilate hundreds of millions of web visits into analytics that paint the full shopper-journey picture for our clients. By running MapReduce over the large set of review traffic and purchase data, we are able to give our clients insight into these shopper behaviors and help them better understand the return on investment they receive from consumer-generated content. As we built out this big data architecture, we also saw the opportunity to offload reporting from the review display engine. Now, all our new reporting and insight efforts are built on this data, and we are actively working to move existing reporting functionality to this big data architecture.
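To make the MapReduce step concrete, here is a minimal Hadoop sketch in that spirit: counting review impressions per product from raw visit logs. The input layout, event names, and class names are assumptions for illustration, not our actual analytics jobs:

```java
// Sketch of a Hadoop MapReduce job that counts review impressions per product.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImpressionCounts {

    // Each input line is assumed to be a tab-separated visit record:
    // timestamp <TAB> clientId <TAB> productId <TAB> eventType
    public static class ImpressionMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text productId = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length >= 4 && "REVIEW_IMPRESSION".equals(fields[3])) {
                productId.set(fields[2]);
                context.write(productId, ONE);
            }
        }
    }

    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable count : counts) {
                total += count.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "review-impression-counts");
        job.setJarByClass(ImpressionCounts.class);
        job.setMapperClass(ImpressionMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```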

On the front end, flexibility and mobile were huge drivers of our rearchitecture. Our original template-driven, server-side rendering can provide great flexibility, but that ultimate flexibility is only required in a small number of use cases. For the vast majority, client-side rendering via JavaScript, with behavior configured through a simple UI, would yield a better mobile-enabled shopping experience that’s easier for clients to control. We made the call early on not to force clients to migrate from one front-end technology to another. For one thing, it’s not practical for the first version of a product to be 100% feature-compatible with its predecessor. For another, there was simply no reason to make clients choose. Instead, as existing clients redesigned their sites and new clients were onboarded, they opted in to the new front-end technology.

We attracted some of the top JavaScript talent in the country to this ambitious undertaking. Some very interesting details of the architecture we built have been described on our developer blog and are available as open source projects in our Bazaarvoice GitHub organization; look for the post describing our Scoutfile architecture from March of 2015. The BV team is committed to giving back to the open source community, and we hope this innovation helps you in your rearchitecture journey.

On the backend, we took inspiration from both Google and Netflix. It was clear that we needed to build an elastic, scalable, reliable, cloud-based data store and query layer. We needed to reorganize our engineering team into autonomous, service-oriented teams that could move faster. We needed to hire and build new skills in new technologies. We needed to be able to roll this out as transparently as possible to our clients while serving live shopping traffic, so that no one would know it was happening at all. Needless to say, we had our work cut out for us.

For the foundation of our new architecture, we chose Cassandra, an open source NoSQL data store influenced by ideas from Google’s BigTable architecture. Cassandra had been battle-hardened at Netflix and was a great solution for a cloud-resilient, reliable storage engine. On this foundation we built a service we call Emo, originally intended for sentiment analysis. As we made progress towards delivery, we began to understand the full potential of Cassandra and its NoSQL-based architecture as our primary display storage.

With Emo, we have solved the potential data consistency issues of Cassandra and can guarantee ACID database operations. We can also seamlessly replicate and coordinate a consistent view of all the ratings and review data across AWS availability zones worldwide, providing a scalable and resilient way to serve billions of shoppers. We can also be selective about which data replicates, for example from the European Union (EU), so that we can provide assurances of privacy for EU-based clients. In addition to this consistency capability, Emo provides a databus that allows any Bazaarvoice service to listen for just the kinds of changes that service needs, perfect for a new service-oriented architecture. For example, a service can listen for the event of a review passing moderation, which means it should now be visible to shoppers.
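As a sketch of that consumption pattern, a downstream service might subscribe to moderation events and react as they arrive. The DatabusClient and Event types below are hypothetical stand-ins for illustration, not Emo’s actual client API:

```java
// Illustrative databus consumer: react to reviews that pass moderation.
import java.util.List;

public class ModerationListener {

    interface DatabusClient {
        void subscribe(String subscription, String condition);
        List<Event> poll(String subscription, int limit);
        void acknowledge(String subscription, List<Event> events);
    }

    record Event(String reviewId, String status) { }

    public void run(DatabusClient databus) {
        // Only receive events for reviews whose moderation status changed to APPROVED.
        databus.subscribe("display-indexer", "status == 'APPROVED'");
        while (true) {
            List<Event> events = databus.poll("display-indexer", 100);
            for (Event event : events) {
                makeVisibleToShoppers(event.reviewId());
            }
            databus.acknowledge("display-indexer", events);
        }
    }

    private void makeVisibleToShoppers(String reviewId) {
        System.out.println("Review now displayable: " + reviewId);
    }
}
```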

While Emo/Cassandra gave us many advantages, its query capability is limited to what Cassandra’s key-value paradigm allows. We learned from our experience with Solr that a flexible, scalable query layer on top of the master datastore yields significant performance advantages when calculating on demand what to display during a shopper visit. This query layer naturally had to match the distributed advantages of Emo/Cassandra. We chose Elasticsearch for our architecture and implemented a flexible rules engine we call Polloi to abstract the indexing and aggregation complexities away from engineers on the teams that would use the service. Polloi hooks up to the Emo databus and provides near-real-time visibility into changes flowing into Cassandra.
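To make the indexing side concrete, here is a minimal sketch of pushing a denormalized review document into an Elasticsearch index over its REST API, the kind of step Polloi automates behind its rules engine. The index name, document shape, and local endpoint are assumptions for illustration:

```java
// Sketch: index one review document into Elasticsearch via its REST API.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReviewIndexer {

    public static void main(String[] args) throws Exception {
        String reviewJson = """
                {
                  "reviewId": "r-42",
                  "productId": "p-1001",
                  "rating": 5,
                  "status": "APPROVED",
                  "text": "Great fit and fast shipping."
                }
                """;

        // PUT the document so repeated runs simply overwrite the same review id.
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:9200/reviews/_doc/r-42"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(reviewJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```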

The rest of the monolithic code base was reimplemented as services in our service-oriented architecture. Since your code is a direct reflection of your team, as we took on this challenge we formed autonomous teams that owned everything full cycle, from initial conception to operation in production. We built the teams with all the skills needed for success: product owners, developers, QA engineers, UX designers (for the front end), DevOps engineers, and tech writers. We built services that manage the product catalog, UI configuration, syndication edges, content moderation, review feeds, and many more. Many of these rearchitected services are now in production and serving live traffic. Examples include services that calculate in real time which brands are syndicating consumer-generated content to which retailers, services that process client product catalog feeds for hundreds of millions of products, new API services, and much more.

To make all of the above more interesting, we also created this service-oriented architecture to leverage the full power of Amazon’s AWS cloud. It was clear we had the uncommon opportunity to build the platform from the ground up to run in the cloud, with monitoring, elastic resiliency, and security capabilities that were unavailable in previous data center environments. With AWS, we can take advantage of new hardware platforms at the push of a button, create multi-datacenter failover capabilities, and use new capabilities like Elastic MapReduce to deliver big data analytics to our clients. We built auto-scaling groups that allow our services to automatically add compute capacity as client traffic demands grow. We can do all of this with a highly skilled team that focuses on delivering customer value instead of hardware procurement, configuration, deployment, and maintenance.

So now, after two-plus years of hard work, we have a modern, scalable, service-oriented solution that exactly mirrors the original monolithic service. But more importantly, we have a production-hardened new platform that we will scale horizontally for the next 10 years of growth. We can now deliver new services much more quickly by leveraging the platform investment we have made, and deliver customer value at scale faster than ever before.

So how did we actually move 300 million shoppers without them even knowing?  We’ll take a look at this in an upcoming post!

authored by Gary Allison

 

Partner Integrations: Do’s and Don’ts

In this blog post, a Senior Product Manager on our Product team discusses the challenges of building and maintaining technical partnerships between organizations and provides advice on how to overcome those challenges.

Every company comes to a point, early or late, where it realizes that it must partner with other companies to drive value in the market. Partnerships always start with conversations, handshakes, and NDAs. At some point, unlocking the value of a partnership may hinge upon establishing a formal integration between the two companies. These integrations constitute a technical “bridge” between companies. They can unlock otherwise inaccessible value, allow one company to OEM the other, and/or accelerate work that would otherwise be “re-invented” each time the companies engage each other.

Integrations can be amazing vehicles for creating value that only comes from combining capabilities from separate entities, while simultaneously allowing each entity to focus on what it does best. They can be the perfect manifestation of the all-too-often-promised “complementary” value. Integrations can offer consistency, repeatability, and reduced friction in the activities involved in unlocking that value.

Unfortunately, integrations are often approached in a manner in which the parties involved are not set up for success. Why?

Integrations aren’t just some “code.” They are product. They require an organized effort to build, including technical and non-technical staff (engineers, architects, project managers, product managers, partnership managers). They require support, assigned owners, subject matter experts, marketing, documentation, and a proper roadmap vision. Integrations demand the same attention and focus that any first-class product requires.

Integrations require both more and different types of communication. Because the value of the integration is typically not front-and-center to the core value of each org, there must be additional effort to communicate the existence of the integration and the value it brings within each org. Sales, onboarding support, and post-live support organizations all need ways to communicate with the other integrated party (who calls whom when something stops working?). The two product organizations must communicate ahead of any changes to dependent technologies such as APIs. A classic communication gap happens when one entity changes its APIs and doesn’t let the other party know soon enough, or at all; problems are only discovered when something breaks.

Integrations are usually birthed by the wrong part of the org. The challenge with integrations is that the impetus to create them usually originates from one or both companies’ business development/partnerships teams – a group that typically has little appreciation for the discipline of product management. Their priority is on “relationships” that historically focus on non-technical efforts. Additionally, the ADD-like attention span of most partnerships teams results in a great desire to create an “integration” with a partner for marketing and sales-driven reasons, but very little attention, effort, and commitment to the long-term requirements of a properly supported product. It is quite easy to just stop communicating with a partner who is no longer deemed valuable, but such an about-face cannot be made when an integration is in place with paying customers. Most often, partnerships orgs do not have technical resources within their structure, but rather borrow technical resources from wherever they can be found (“Hey, I have a partner company who is just trying to do this thing, and they have a quick technical question…”). This is a great approach for proof-of-concept initiatives, but certainly not for something that companies/customers must trust to deliver value. The product organizations at each company must be responsible for bringing an integration to life. Regardless of whether the product org has enough resources to service an integration like a first-class product citizen, at least the owner will have an understanding of what is and isn’t getting handled properly and can mitigate the potentially negative outcomes that arise from under-served products.

Correctly structured incentives are crucial to the short- and long-term success of integrations. There must be something in it for all concerned parties. Direct compensation and revenue share are two good options. Be cautious of benefits such as “stickiness” (the assumption that giving an integration free of charge to an existing customer makes that customer less likely to debook your core service) or the halo effect associated with integrating with a company (e.g., “Did you know we’re integrated with Facebook?”). Many integrations have been built on the promise of return. Once that promise begins to fade (for any one or more of the parties), so does the motivation of the affected party to keep up their end of the technical bargain: the technology world’s version of “he’s just not that into you (anymore).” Once an integration is no longer properly attended to by one party, it becomes a liability. It’s not enough for the bridge to be secured to just one side of the river.

People love to build stuff, but they hate to support it. There must be something in it for the platform to properly prioritize integration maintenance. Be wary of agreements that lack commitments, SLAs, etc. (often phrased as doing any needed work on a “best efforts” basis), as they allow the company responsible for the integration code to opt out of investing in support and roadmap development should its interest wane. If the agreement lacks commitments, then the partnership likely will as well: the maintenance effort will be acknowledged, but it will always get pushed to the next dev cycle. Which leads us to…

The Challenge of Opportunity Cost

The assumption here is that the companies contemplating an integration are predominantly product organizations. Their company mandate is to bring products to market at scale. This is dramatically different from a service organization, which essentially trades dollars for hours. It means that the cost of technical/engineering effort at a product organization differs from that at a service organization – not because engineers get paid more at product organizations, but because the added opportunity cost of engineering effort at a product organization often introduces an impossibly high hurdle rate for putting those engineers on non-core “integration work.” Even the mere existence of opportunity cost, albeit uncalculated, is all that a dissenting product or engineering leader needs to de-prioritize seemingly less important “integration work” that doesn’t deliver core value.

One innovative approach to solving this dilemma is to use outsourced engineering resources from a service organization to avoid the challenges that come with opportunity cost. It makes good business sense: let your in-house engineering staff concentrate on the things that drive core value at scale. The downside of this approach is that there is a very clear and visible cost (hours * hourly rate) attached to all effort associated with the integration. A similar cost analysis is rarely done when utilizing internal resources, so the integration product manager should be prepared: getting things done is always more expensive than you thought.

Of course, another solution is simply to make integration work the same perceived class of value as the product org’s core solution. However, as described above, this can be a big challenge.

The technical approach must sit at the convergence of correctly structured incentives and technical viability. How open or closed a platform is can dictate how an integration can be executed; the associated partnership incentive structure can dictate how an integration should be executed. The final integration emerges from the intersection of these two perspectives.

Closed platforms force the work on that platform. Open platforms allow for more options – either or both entities, possibly even a third-party, can contribute to the integration development.

Let’s look at a few scenarios.

Scenario 1: B is a “closed” platform

[Figure: Scenario 1, B is a closed platform]

“Closed” here means that the platform does not allow for integration (read: code) to be hosted by that platform and that the platform does not have externally accessible APIs to utilize from outside the platform. The closed platform may have internally accessible APIs, but those do an external party little good.

Closed platforms force that platform to do the integration work. Thus, there must be incentives for the closed platform to both build and support the integration long-term. The effort to build the integration is often simply the result of the opportunistic convergence of both parties being sold on (at least) the promise of value and some available engineering capacity within the closed platform. Without the proper incentives for platform B, this becomes a classic example of the challenge of opportunity cost discussed above: the engineer who had some free time to build the integration is suddenly no longer available to fix a bug or build a new feature. There must be motivation in some form to continue to maintain the integrity of the integration.

Scenario 2: B is open

[Figure: Scenario 2, B is an open platform]

Open platforms present more options. In Scenario 2, B is no longer the only entity that can develop the integration: A, B, or a third-party org can build it. There are more alternative incentive structures as well. Since the engineering effort can be executed by a non-B entity, there doesn’t need to be much in it for B (there can be, but it is not nearly as necessary). The developing entity will certainly need knowledge of the B platform (documentation, sandboxes, API keys, deployment directions, etc.), but this effort on the part of B has a much lower hurdle rate than getting something into B’s engineering roadmap. Typically, B will have some form of partner “program” whereby such assets and knowledge are available for a predetermined fee. Even in the absence of such a program, the needs are significantly less than if the development effort required engineers from platform B to do the build work.

Scenario 3: Middle-ware Solution

[Figure: Scenario 3, middleware solution]

Scenario 3 is a derivative of Scenario 2. Options are abundant: A, B, or a third party can build the integration, and in most cases any of those entities can bring it to market. A major decision will be how and where to host the middleware solution and how to provide production-ready support, specifically beyond the initial build phase (which can simply leverage cloud hosting services like Amazon to get up and running quickly). In exchange for taking on that hosting and support burden, a middleware solution removes the challenges that come with hosting the integration within the B platform, which can range from simple plug-and-play effort to per-instance customizations required for each integration incarnation.

Incentive options are very similar to Scenario 2. One exception is that there is a clear opportunity for a third-party to bring the integration to market with an associated price tag.

Summary

Integrations are powerful and often hugely valuable, but their success is directly tied to the ability to structure them for the long term. Integrations are a special kind of “product,” requiring different types of communication, and they can benefit from the use of outsourced resources to execute and maintain them.

A successful integration is the result of a technical and non-technical relationship structured so that the benefit to each party adequately compensates for the often underestimated level of involvement required across both organizations.