Tag Archives: Scalability

How Bazaarvoice UGC APIs serve information to its brand & retailers

Bazaarvoice has thousands of clients including brands and retailers. Bazaarvoice has billions of records of product catalog and User Generated Content(UGC)from Bazaarvoice clients. When a shopper visits a brand or retailer site/app powered by Bazaarvoice, our APIs are triggered.

In 2023,Bazaarvoice UGC APIs recorded peak traffic of over 3+ billion calls per day with zero incidents. This blog post will discuss the high level design strategies that are implemented to handle this huge traffic even when serving hundreds of millions of pieces of User Generated Content to shoppers/clients around the globe.

The following actions can take place when shoppers interact with our User-Generated Content (UGC) APIs.

  • Writing Content
    • When a shopper writes any content such as reviews or comments etc. on any of the product on retailer or brand site, it invokes a call to Bazaarvoice’s write UGC APIs, followed by Authenticity/content moderation.
  • Reading Content
    • When a shopper visits the brand or retailer site/app for a product, Bazaarvoice’s read UGC APIs are invoked.

Traffic: 3+ Billion calls per day(peek)Data: ~5 Billions of records,Terabyte scale

High-level API Flow:

  1. Whenever a request is made to Bazaarvoice UGC API endpoints, the Bazaarvoice gateway service receives the request, authenticates the request, and then transmits the request information to the application load balancer.
  2. Upon receiving the request from the load balancer, the application server engages with authentication service to authenticate the request. If the request is deemed legitimate, the application proceeds to make a call to its database servers to retrieve the necessary information and the application formulates response accordingly.

Let’s get into a bit deeper into the design

Actions taken at the gateway upon receiving a request

  • API’s authentication:

We have an authentication service integrated to the gateway to validate the request. If it’s a valid request then we proceed further. Validation includes ensuring that the request is from a legitimate source to serve one of Bazaarvoice’s clients

  • API’s security:

If our API’s are experiencing any security attacks like Malicious or DDOS requests, WAF intercepts and subsequently blocks the security attacks as per the configured settings.

  • Response Caching:

We implemented response caching to improve response times and client page load performance, with a duration determined by the Time-to-Live (TTL) configuration for requests. This allows our gateway to resend the cached response, if the same request is received again, rather than forwarding the request to the server.

Understanding User-Generated Content (UGC) Data Types and API Services

Before delving into specifics of how the UGC is originally collected, it’s important to understand the type of data being served.

e.g.

  • Ratings & Reviews
  • Questions & Answers
  • Statistics (Product-based Review Statistics and Questions & Answers Statistics)
  • Products & Categories

For more details, you can refer to ConversationsAPI documentation via Bazaarvoice’s recently upgraded Developer Center.

Now, let’s explore the internals of these APIs in detail, and examine their interconnectedness.

  • Write UGC API service
  • Read UGC API service

Write UGC API service:

Our submission form customized for each client, the form will render based on the client configuration which can include numerous custom data attributes to serve their needs. When a shopper submits content such as a review or a question through the form, our system writes this content to a submission queue. A downstream internal system then retrieves this content from the queue and writes it into the master database.

Why do we have to use a queue rather than directly writing into a database?

  • Load Leveling
  • Asynchronous Processing
  • Scalability
  • Resilience to Database Failures

Read UGC API service:

The UGC read API’s database operates independently from the primary, internal database. While the primary database contains normalized data, the read API database is designed to serve denormalized and enriched data specifically tailored for API usage in order to meet the response time expectations of Bazaarvoice’s clients and their shoppers.

Why do we need denormalized data?

To handle large-scale traffic efficiently and avoid complex join operations in real-time, we denormalize our data according to specific use cases.

We transform the normalized data into denormalized enriched data through the following steps:

  1. Primary-Replica setup: This will help us to separate write and read calls.
  1. Data denormalization:  In Replica DB, we have triggers to do data processing (joining multiple tables) and write that data into staging tables. We have an application that reads data from staging tables and writes the denormalized data  into Nosql DB. Here data is segregated according to the content type. Subsequently, this data is forwarded to message queues for enrichment.
  1. Enriching the denormalized data: Our internal applications consume this data from message queues, with the help of internal state stores, we enrich the documents before forwarding them to a destination message queue.

e.g. : Average rating of a product, Total number of ugc information to a product.

  1. Data Transfer to UGC application Database: We have a connector application to consume data from the destination message queue and write it into the UGC application database.

Now that you’ve heard about how Bazaarvoice’s API’s handles the large client and request scale, let’s add another layer of complexity to the mix!

Connecting Brands and Retailers

Up to this point, we’ve discussed the journey of content within a given client’s dataset. Now, let’s delve into the broader problem that Bazaarvoice addresses.

Bazaarvoice helps its brands and retailers share reviews within the bazaarvoice network. For more details refer to syndicated-content.

Let’s talk about the scale and size of the problem before getting into details, 

From 12,000+ Bazaarvoice clients, We have billions of catalog and UGC content. Bazaarvoice provides a platform to share the content within its network. Here data is logically separated for all the clients.

Client’s can access their data directly, They can access other Bazaarvoice clients data, based on the Bazaarvoice Network’s configured connections. 

E.g. : 

From the above diagram, Retailer (R3) wanted to increase their sales of a product by showing a good amount of UGC content.

Retailer (R1)1 billion catalog & ugc records
Retailer (R2)2 billion catalog & ugc records
Retailer (R3)0.5 billion catalog & ugc records
Retailer (R4)1.2 billion catalog & ugc records
Brand (B1)0.2 billion catalog & ugc records
Brand (B2)1 billion catalog & ugc records

Now think, 

If Retailer (R3) is accessing only its data, then it’s operating on 0.5 billion records, but here Retailer (R3) is configured to get the ugc data from Brand (B1) , Brand (B2) , Retailer (R1) also.

If you look at the scale now it’s 0.5 + 0.2 + 1 + 1 = 2.7 billions.

To get the data for one request, it has to query on 2.7 billion records. On top of it we have filters and sorting, which make it even more complex.

In Summary

Here I’ve over simplified, to make you understand the solution that Bazaarvoice is providing, in reality it’s much more complex to serve the UGC Write and Read APIs at a global scale with fast response times and remain globally resilient to maintain high uptime.

Now you might correlate why we have this kind of architecture designed to solve this problem.  Hopefully after reading this post you have a better understanding of what it takes behind the scenes to serve User Generated Content across Brands and Retailers at billion-record-scale to shoppers across the globe.

Cloud-Native Marvel: Driving 6 Million Daily Notifications!

Bazaarvoice notification system stands as a testament to cutting-edge technology, designed to seamlessly dispatch transactional email messages (post-interaction email or PIE) on behalf of our clients. The heartbeat of our system lies in the constant influx of new content, driven by active content solicitations. Equipped with an array of tools, including email message styling, default templates, configurable scheduling, data privacy APIs, email security/encryption, reputation/identity management, as well as auditing and reporting functionalities, our Notification system is the backbone of client-facing communications.

PIE or post-interaction email messages

Let’s delve into the system’s architecture, strategically divided into two pivotal components. Firstly, the data ingestion process seamlessly incorporates transactional data from clients through manual uploads or automated transactions uploads. Secondly, the Notification system’s decision engine controls the delivery process, strategically timing email dispatches according to client configurations. Letterpress facilitates scalable email delivery to end consumers, enhancing the efficiency of the process.

Previous Obstacles: What Hindered Progress Before

Examining the architecture mentioned above, we were already leveraging the AWS cloud and utilizing AWS managed services such as EC2, S3, and SES to meet our requirements. However, we were still actively managing several elements, such as scaling EC2 instances according to load patterns, managing security updates for EC2s, and setting up a distinct log stream to gather all instance-level logs into a separate S3 bucket for temporary storage, among other responsibilities. It’s important to note that our deployment process used Jenkins and CloudFormation templates. Considering these factors, one could characterize our earlier architecture as semi-cloud-native.

Upon careful observation of the Bazaarvoice-managed File Processing EC2 instances, it becomes apparent that these instances are handling complex, prolonged batch jobs. The ongoing maintenance and orchestration of these EC2 instances add significant complexities to overall operations. Unfortunately, due to the lack of active development on this framework, consumers, such as Notifications, find themselves dedicating 30% of an engineer’s on-call week to address various issues related to feed processing, stuck jobs, and failures of specific feed files. Enduring such challenges over an extended period is challenging. The framework poses a risk of regional outages if a client’s job becomes stuck, and our aim is to achieve controlled degradation for that specific client during such instances. These outages occur approximately once a month, requiring a week of engineering effort to restore everything to a green state.

Embrace cloud-native excellence

The diagram above illustrates the recently operationalized cloud-native, serverless data ingestion pipeline for the Notification Systems. Our transition to a cloud-native architecture has been a game-changer. Through meticulous design and rigorous testing, we created a modern, real-time data ingestion pipeline capable of handling millions of transactional data records with unprecedented efficiency. Witness the evolution in action through our cloud-native, serverless data ingestion pipeline, operationalized with precision and running flawlessly for over seven months, serving thousands of clients seamlessly.

We’ve decomposed the complex services that were previously engaged with numerous responsibilities into smaller, specialized services with a primary focus on specific responsibilities. One such service is the engagement-service, tasked with managing all client inbox folders (Email/Text/WhatsApp). This service periodically checks for new files, employs a file splitting strategy to ingest them into our S3 buckets, and subsequently moves them from the inbox to a backup folder, appending a timestamp to the filename for identification.

Achieve excellence by breaking barriers

Microservice

The journey to our current state wasn’t without challenges. Previously, managing AWS cloud services like EC2, S3, and SES demanded significant manual effort. However, by adopting a microservices architecture and leveraging ECS fargate task, AWS Lambda, step functions, and SQS, we’ve streamlined file processing and message conversion, slashing complexities and enhancing scalability.

Serverless Computing

Serverless computing has emerged as a beacon of efficiency and cost-effectiveness. With AWS Lambda handling file-to-message conversion seamlessly, our focus shifts to business logic, driving unparalleled agility and resource optimization.

Transforming the ordinary into the remarkable, our daily transaction feed ingestion handles a staggering 5-6 million entries. This monumental data flow includes both file ingestion and analytics, fueling our innovative processes.

Consider a scenario where this load is seamlessly distributed throughout the day, resulting in a monthly cost of approximately $1.1k. In contrast, our previous method incurred a cost of around $1k per month.

Despite a nominal increase in cost, the advantages are game-changing:

  1. Enhanced Control: Our revamped framework puts us in the driver’s seat with customizable notifications, significantly boosting system maintainability.
  2. Streamlined Operations: Tasks like system downtime, debugging the master node issues, node replacements, and cluster restarts are simplified to a single button click.
  3. Improved Monitoring: Expect refined dashboards and alerting mechanisms that keep you informed and in control.
  4. Customized Delivery: By segregating email messages, SMS, and WhatsApp channels, we maintain client-set send times for text messages to their consumer base.

The pay-as-you-go model ensures cost efficiency of upto 17%, making serverless architecture a strategic choice for applications with dynamic workloads.

The logic for converting files to messages is implemented within AWS Lambda. This function is tasked with the responsibility of breaking down large file-based transactions into smaller messages, directing all client data to respective SQS queues based on the Notification channel (email or SMS). In this process, the focus is primarily on business logic, with less emphasis on infrastructure maintenance and scaling. Therefore, a serverless architecture, specifically AWS Lambda, step functions and SQS were used for this purpose.

Empower multitenancy, scalability, and adaptability

Our cloud-native notifications infrastructure thrives on managed resources, fostering collaboration and efficiency across teams. Scalability and adaptability are no longer buzzwords but integral elements driving continuous improvement and customer-centric innovations. These apps can handle problems well because they’re built from small, independent services. It’s easier to find and fix issues without stopping the whole server.

Notifications cloud-native infrastructure is spread out in different availability zones of a region, so if one zone has a problem, traffic can be redirected quickly to keep things running smoothly. Also, developers can add security features to their apps from the start.

Transitioning to cloud-native technologies demands strategic planning and cohesive teamwork across development, operations, and security domains. We started by experimenting with smaller applications to gain familiarity with the process. This allowed us to pinpoint applications that are well-suited for cloud-native transformation and retire those that are not suitable. Our journey has been marked by meticulous experimentation, focusing on applications ripe for transformation and retiring those not aligned with our cloud-native vision.

Conclusion: Shape tomorrow’s software engineering landscape

As we progress deeper into the era of cloud computing, the significance of cloud-native applications goes beyond a fleeting trend; they are positioned to establish a new standard in software engineering.

Through continuous innovation and extensive adoption, we are revolutionizing the landscape of Notifications system of Bazaarvoice with cloud-native applications, bringing about transformative changes through each microservice. The future has arrived, and it’s soaring on the cloud.

How We Scale to 16+ Billion Calls

The holiday season brings a huge spike in traffic for many companies. While increased traffic is great for retail business, it also puts infrastructure reliability to the test. At times when every second of uptime is of elevated importance, how can engineering teams ensure zero downtime and performant applications? Here are some key strategies and considerations we employ at Bazaarvoice as we prepare our platform to handle over 16 Billion API calls during Cyber Week.

Key to approaching readiness for peak load events is defining the scope of testing. Identify which services need to be tested and be clear about success requirements.  A common trade off will be choosing between reliability and cost. When making this choice, reliability is always the top priority. ‘Customer is Key’ is a key value at Bazaarvoice, and drives our decisions and behavior.  Service Level Objectives (SLOs) drive clarity of reliability requirements through each of our services.

Reliability is always the top priority

When customer traffic is at its peak, reliability and uptime must take precedence over all other concerns. While cost efficiency is important, the customer experience is key during these critical traffic surges. Engineers should have the infrastructure resources they need to maintain stability and performance, even if it means higher costs in the short-term.

Thorough testing and validation well in advance is essential to surfacing any issues before the holidays. All critical customer-facing services undergo load and failover simulations to identify performance bottlenecks and points of failure. In a Serverless-first architecture, ensuring configuration like reserved concurrency and quota limits are sufficient for autoscaling requirements are valuable to validate.  Often these simulations will uncover problems you have not previously encountered. For example, in this year’s preparations our load simulations uncovered scale limitations in our redis cache which required fixes prior to Black Friday.

“It’s not only about testing the ability to handle peak load”

It’s important to note readiness is not only about testing the ability to handle peak load. Disaster recovery plans are validated through simulated scenarios. Runbooks are verified as up-to-date, to ensure efficient incident response in the event something goes wrong. Verifying instrumentation and infrastructure that supports operability are tested, ensuring our tooling works when we need it most.

Similarly ensuring the appropriate tooling and processes are in place to address security concerns is another key concern. Preventing DDoS attacks which could easily overwhelm the system if not identified and mitigated, preventing impact of service availability.

Predicting the future

Observability through actionable monitoring, logging, and metrics provides the essential visibility to detect and isolate emerging problems early. It also provides the historical context and growth of traffic data over time, which can help forecast capacity needs and establish performance baselines that align with real production usage. In addition to quantitative measures, proactively reaching out to clients means we are in step with client needs about expected traffic helping align testing to actual usage patterns.  This data is important to simulate real world traffic patterns based on what has gone before, and has enabled us to accurately predict Black Friday traffic trends. However it’s important our systems are architected to scale with demand, to handle unpredicted load if need be, key to this is observing and understanding how our systems behave in production.

Traffic Trends

What did it look like this year? Consumer shopping patterns remained quite consistent on an elevated scale. Black Friday continues to be the largest shopping day of the year, and consumers continue to shop online in increasing numbers. During Cyber Week alone, Bazaarvoice handled over 16 Billion API calls.

Solving common problems once

While individual engineering teams own service readiness, having a coordinated effort ensures all critical dependencies are covered. Sharing forecasts, requirements, and learnings across teams enables better preparation. Testing surprises on dependent teams should be avoided through clear communication.

Automating performance testing, failover drills, and monitoring checks as part of regular release cycles or scheduled pipelines reduces the overhead of peak traffic preparation. Following site reliability principles and instilling always-ready operational practices makes services far more resilient year-round. 

For example, we recently put in place a shared dev pattern for continuous performance testing.  This involves a quick setup of k6 performance script, an example github action pipeline and observability configured to monitor performance over time. We also use an in-house Tech Radar to converge on common tooling so a greater number of teams can learn and stand on the shoulders of teams who have already tried and tested tooling in their context.

Other examples include, adding automation to performance tests to replay production requests for a given load profile makes tests easier to maintain, and reflect more accurately production behavior. Additionally, make use of automated fault injection tooling, chaos engineering and automated runbooks.

Adding automation and ensuring these practices are part of your everyday way of working are key to reducing the overhead of preparing for the holidays.

Consistent, continuous training conditions us to always be ready

Moving to an always-ready posture ensures our infrastructure is scalable, reliable and robust all year round. Implementing continuous performance testing using frequent baseline tests provides frequent feedback on performance from release to release.  Automated operational readiness service checks ensure principles and expectations are in place for production services and are continuously checked.  For example, automated checking of expected monitors, alerts, runbooks and incident escalation policy requirements.

At Bazaarvoice our engineering teams align on shared System Standards which gives technical direction and guidance to engineers on commonly solved problems, continuously evolving our systems and increasing our innovation velocity.  To use a trail running analogy, System Standards define the preferred paths and combined with Tech Radar, provide recommendations to help you succeed.  For example, what trail running shoes should I choose, what energy refuelling strategy should I use, how should I monitor performance.  The same is true for building resilient reliable software, as teams solve these common problems, share the learnings for those teams which come after.

Looking Ahead

With a relentless focus on reliability, scalability, continuous testing, enhanced observability, and cross-team collaboration, engineering organizations can optimize performance and minimize downtime during critical traffic surges. 

Don’t forget after the peak has passed and we have descended from the summit, analyze the data.  What went well, what didn’t go well, and what opportunities are there to improve for the next peak.