
How Bazaarvoice UGC APIs serve information to its brands & retailers

Bazaarvoice has thousands of clients, including brands and retailers, and holds billions of product catalog and User Generated Content (UGC) records from those clients. When a shopper visits a brand or retailer site/app powered by Bazaarvoice, our APIs are triggered.

In 2023, Bazaarvoice UGC APIs recorded peak traffic of more than 3 billion calls per day with zero incidents. This blog post discusses the high-level design strategies that handle this traffic while serving hundreds of millions of pieces of User Generated Content to shoppers/clients around the globe.

The following actions can take place when shoppers interact with our User-Generated Content (UGC) APIs.

  • Writing Content
    • When a shopper writes content such as a review or comment on a product on a retailer or brand site, it invokes a call to Bazaarvoice’s write UGC APIs, followed by authenticity/content moderation.
  • Reading Content
    • When a shopper visits the brand or retailer site/app for a product, Bazaarvoice’s read UGC APIs are invoked.

Traffic: 3+ billion calls per day (peak)
Data: ~5 billion records, terabyte scale

High-level API Flow:

  1. Whenever a request is made to Bazaarvoice UGC API endpoints, the Bazaarvoice gateway service receives the request, authenticates the request, and then transmits the request information to the application load balancer.
  2. Upon receiving the request from the load balancer, the application server calls the authentication service to authenticate the request. If the request is deemed legitimate, the application queries its database servers for the necessary information and formulates the response accordingly.

Let’s go a bit deeper into the design

Actions taken at the gateway upon receiving a request

  • API authentication:

We have an authentication service integrated with the gateway to validate each request. If the request is valid, we proceed further. Validation includes ensuring that the request comes from a legitimate source serving one of Bazaarvoice’s clients.

  • API security:

If our APIs come under attack, for example from malicious or DDoS requests, the web application firewall (WAF) intercepts and blocks the offending traffic according to its configured settings.

  • Response Caching:

We implemented response caching to improve response times and client page load performance, with a duration determined by the Time-to-Live (TTL) configuration for requests. This allows our gateway to resend the cached response, if the same request is received again, rather than forwarding the request to the server.
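
As a rough illustration of TTL-based response caching, here is a minimal in-memory sketch. It is not Bazaarvoice's gateway implementation: the class name, the `handle_request` helper, and the use of the request string as the cache key are all assumptions made for the example.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TTLResponseCache:
    """Tiny in-memory response cache; each entry expires after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, request_key: str) -> Optional[str]:
        entry = self._entries.get(request_key)
        if entry is None:
            return None
        expires_at, response = entry
        if self.clock() >= expires_at:
            # Entry is stale: drop it so the next call goes to the server.
            del self._entries[request_key]
            return None
        return response

    def put(self, request_key: str, response: str) -> None:
        self._entries[request_key] = (self.clock() + self.ttl_seconds, response)


def handle_request(cache: TTLResponseCache, request_key: str,
                   forward_to_server: Callable[[str], str]) -> str:
    """Serve from cache when possible; otherwise forward and cache the result."""
    cached = cache.get(request_key)
    if cached is not None:
        return cached
    response = forward_to_server(request_key)
    cache.put(request_key, response)
    return response
```

With a 60-second TTL, two identical requests inside that window reach the server once; the first request after the TTL elapses is forwarded again.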

Understanding User-Generated Content (UGC) Data Types and API Services

Before delving into specifics of how the UGC is originally collected, it’s important to understand the type of data being served.

e.g.

  • Ratings & Reviews
  • Questions & Answers
  • Statistics (Product-based Review Statistics and Questions & Answers Statistics)
  • Products & Categories

For more details, you can refer to the Conversations API documentation via Bazaarvoice’s recently upgraded Developer Center.

Now, let’s explore the internals of these APIs in detail, and examine their interconnectedness.

  • Write UGC API service
  • Read UGC API service

Write UGC API service:

Our submission form is customized for each client: it renders based on the client configuration, which can include numerous custom data attributes to serve their needs. When a shopper submits content such as a review or a question through the form, our system writes this content to a submission queue. A downstream internal system then retrieves this content from the queue and writes it into the master database.

Why do we have to use a queue rather than directly writing into a database?

  • Load Leveling
  • Asynchronous Processing
  • Scalability
  • Resilience to Database Failures
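
The benefits listed above can be sketched with a producer and a consumer decoupled by a queue. This is only an illustration: Python's in-process `queue.Queue` stands in for whatever durable queue is actually used, a plain list stands in for the master database, and every name below is hypothetical.

```python
import queue
import threading
from typing import List

STOP = None  # sentinel telling the worker to exit


def submit_content(submissions: queue.Queue, content: dict) -> str:
    """API side: enqueue and return immediately, with no database call on the request path."""
    submissions.put(content)
    return "accepted"


def drain_to_database(submissions: queue.Queue, database: List[dict]) -> None:
    """Downstream side: consume at its own pace and persist each item."""
    while True:
        item = submissions.get()
        if item is STOP:
            submissions.task_done()
            return
        database.append(item)  # stand-in for a write to the master database
        submissions.task_done()


def run_demo() -> List[dict]:
    submissions: queue.Queue = queue.Queue()
    database: List[dict] = []
    worker = threading.Thread(target=drain_to_database,
                              args=(submissions, database))
    worker.start()
    for i in range(3):
        submit_content(submissions, {"type": "review", "id": i})
    submissions.put(STOP)
    worker.join()
    return database
```

Because the request path only enqueues, a burst of submissions or a slow database affects how quickly the queue drains rather than how quickly the shopper gets a response.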

Read UGC API service:

The UGC read API’s database operates independently from the primary, internal database. While the primary database contains normalized data, the read API database is designed to serve denormalized and enriched data specifically tailored for API usage in order to meet the response time expectations of Bazaarvoice’s clients and their shoppers.

Why do we need denormalized data?

To handle large-scale traffic efficiently and avoid complex join operations in real-time, we denormalize our data according to specific use cases.

We transform the normalized data into denormalized enriched data through the following steps:

  1. Primary-Replica setup: This helps us separate write and read calls.
  2. Data denormalization: In the replica DB, triggers process the data (joining multiple tables) and write the result into staging tables. An application reads from the staging tables and writes the denormalized data into a NoSQL DB, where data is segregated by content type. This data is then forwarded to message queues for enrichment.
  3. Enriching the denormalized data: Our internal applications consume this data from the message queues and, with the help of internal state stores, enrich the documents before forwarding them to a destination message queue. Examples of enrichment include the average rating of a product and the total number of UGC items for a product.
  4. Data transfer to the UGC application database: A connector application consumes data from the destination message queue and writes it into the UGC application database.
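
To make the denormalize-then-enrich idea concrete, here is a small sketch that joins product and review rows into one document per product and then attaches the statistics mentioned above. The field names and document shape are assumptions for illustration and do not reflect Bazaarvoice's actual schema or pipeline code.

```python
from typing import Dict, List


def denormalize_reviews(products: List[dict],
                        reviews: List[dict]) -> Dict[str, dict]:
    """Join product and review rows once, offline, into one document per product."""
    docs = {
        p["product_id"]: {"product_id": p["product_id"],
                          "name": p["name"],
                          "reviews": []}
        for p in products
    }
    for review in reviews:
        doc = docs.get(review["product_id"])
        if doc is not None:  # ignore reviews for unknown products
            doc["reviews"].append({"rating": review["rating"],
                                   "text": review["text"]})
    return docs


def enrich(doc: dict) -> dict:
    """Attach precomputed statistics so the read path never aggregates."""
    ratings = [r["rating"] for r in doc["reviews"]]
    enriched = dict(doc)
    enriched["total_reviews"] = len(ratings)
    enriched["average_rating"] = (
        round(sum(ratings) / len(ratings), 2) if ratings else None
    )
    return enriched
```

A read request for a product can then be answered from a single enriched document, with no joins or aggregation at request time.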

Now that you’ve heard how Bazaarvoice’s APIs handle the large client and request scale, let’s add another layer of complexity to the mix!

Connecting Brands and Retailers

Up to this point, we’ve discussed the journey of content within a given client’s dataset. Now, let’s delve into the broader problem that Bazaarvoice addresses.

Bazaarvoice helps its brands and retailers share reviews within the Bazaarvoice network. For more details, refer to syndicated-content.

Let’s talk about the scale and size of the problem before getting into details.

Across 12,000+ Bazaarvoice clients, we have billions of catalog and UGC records. Bazaarvoice provides a platform to share this content within its network. Data is logically separated per client.

Clients can access their own data directly. They can also access other Bazaarvoice clients’ data, based on the connections configured in the Bazaarvoice Network.

For example:

In the diagram above, Retailer (R3) wants to increase sales of a product by showing a good amount of UGC.

Retailer (R1): 1 billion catalog & UGC records
Retailer (R2): 2 billion catalog & UGC records
Retailer (R3): 0.5 billion catalog & UGC records
Retailer (R4): 1.2 billion catalog & UGC records
Brand (B1): 0.2 billion catalog & UGC records
Brand (B2): 1 billion catalog & UGC records

Now consider:

If Retailer (R3) accessed only its own data, it would be operating on 0.5 billion records. Here, however, Retailer (R3) is also configured to get UGC data from Brand (B1), Brand (B2), and Retailer (R1).

The scale is now 0.5 + 0.2 + 1 + 1 = 2.7 billion records.

To serve a single request, the system has to query across 2.7 billion records. On top of that, filters and sorting make it even more complex.
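
The arithmetic above can be written as a small lookup over the example figures. The record counts come from the example in this post; the dictionary layout and function name are illustrative assumptions only.

```python
from typing import Dict, List

# Record counts from the example, in billions.
RECORDS_IN_BILLIONS: Dict[str, float] = {
    "R1": 1.0, "R2": 2.0, "R3": 0.5, "R4": 1.2, "B1": 0.2, "B2": 1.0,
}

# Network connections from the example: R3 also reads from B1, B2 and R1.
NETWORK_CONNECTIONS: Dict[str, List[str]] = {
    "R3": ["B1", "B2", "R1"],
}


def effective_record_count(client: str) -> float:
    """Records (in billions) a request for `client` may have to search."""
    sources = [client] + NETWORK_CONNECTIONS.get(client, [])
    return sum(RECORDS_IN_BILLIONS[source] for source in sources)


if __name__ == "__main__":
    print(round(effective_record_count("R3"), 1))
```

A client with no configured connections, such as R2 in this example, only searches its own records.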

In Summary

I’ve oversimplified here to explain the solution that Bazaarvoice provides. In reality, it is much more complex to serve the UGC Write and Read APIs at global scale with fast response times while remaining resilient enough to maintain high uptime.

You can now see why the architecture is designed the way it is. Hopefully, after reading this post, you have a better understanding of what it takes behind the scenes to serve User Generated Content across Brands and Retailers at billion-record scale to shoppers across the globe.

Can You Keep a Secret?

We all have secrets. Sometimes, these are guilty pleasures that we try to keep hidden, like watching cheesy reality TV or indulging in strange comfort food. We often worry:

“How do we keep the secret safe?”

“What could happen if someone finds out the secret?”

“Who is keeping a secret?”

“What happens if we lose a secret?”

At Bazaarvoice, our security team starts with these same questions when it comes to secret management. They are less interested in trivial secrets like who left their dishes in the office sink or who finished the milk. Instead, they focus on the secrets used in the systems that power a wide range of products for the organization. They include:

  • API keys and tokens
  • Database connection strings that contain passwords
  • Encryption keys
  • Passwords

With hundreds of systems and components depending on secrets at Bazaarvoice, enumerating them and answering the above questions quickly, consistently and reliably can become a challenge without guiding standards. 

This post will discuss secret management and explain how Bazaarvoice implemented a Secret Catalog using open-source software. The post will provide reusable examples and helpful tips to reduce risk and improve compliance for your organization. Let’s dive in!

Secrets Catalog

Bazaarvoice is ISO 27001 compliant and ensures its systems leverage industry-standard tools and practices to store secrets securely. However, secrets aren’t always kept in the same place, and they can be stored using different tools. Sometimes AWS Secrets Manager makes sense; other times, AWS KMS is a better choice for a given solution. You may even have a multi-cloud strategy, further scattering secrets. This is where a Secrets Catalog is extremely useful, providing a unified view of secrets across tools and vendors.

It may sound a little boring, and the information captured isn’t needed most of the time. However, in the event of an incident, having a secret catalog becomes an invaluable resource in reducing the time you need to resolve the problem.

Bazaarvoice recognized the value of a secrets catalog and decided to implement it. The Security Team agreed that each entry in the catalog must satisfy the following criteria:

  • A unique name
  • A good description of its purpose
  • Clear ownership
  • Where it is stored
  • A list of dependent systems
  • References to documentation to remediate any related incident, for example, how to rotate an API key

Crucially, the value of the secret must remain in its original and secure location outside of the Catalog, but it is essential to know where the secret is stored. Doing so avoids a single point of failure and does not hinder any success criteria.

Understanding where secrets are stored helps identify bad practices. For example, keeping an SSH key only on team members’ laptops would be a significant risk. A person can leave the company, win the lottery, or even spill a drink on their laptop (we all know someone!). The stores already defined in the catalog guide people in deciding how to store a new secret, directing them to secure and approved tools resistant to human error.

Admittedly, the initial attempt to implement the catalog at Bazaarvoice didn’t quite go to plan. Teams began meeting the criteria, but it quickly became apparent that each team interpreted them differently and stored the results in multiple places and formats. With multiple secret catalogs, Security would not have a unified view. Teams needed additional guidance and a concrete standard to succeed.

We already have a perfect fit for this!

Bazaarvoice loves catalogs. After our clients, they might be our favourite thing. There is a product catalog for each of our over ten thousand clients, a data catalog, and, most relevantly, a service catalog powered by Backstage.

“Backstage unifies all your infrastructure tooling, services, and documentation to create a streamlined development environment from end to end.”

https://backstage.io/docs/overview/what-is-backstage

Out of the box, it comes with core entities enabling powerful ecosystem modeling:

https://backstage.io/docs/features/software-catalog/system-model#ecosystem-modeling

As shown in the diagram above, at the core of the Backstage model are Systems, Components, and Resources, the most relevant entities for secret management. You can find detailed descriptions of each entity in the Backstage modeling documentation. Briefly, they can be thought about as follows:

System – A collection of lower-level entities, including Components and Resources, acting as a layer of abstraction.

Component – A piece of software. It can optionally produce or consume APIs.

Resource – The infrastructure required by Components to operate.

New Resource Types

Resources are one of the Backstage entities used to represent infrastructure. They are a solid fit for representing secrets, allowing us to avoid writing custom in-house software to do the same job. Therefore, we defined two new Resource Types: secret and secret-store.

Tip: Agree on the allowed Types upfront to avoid a proliferation of variations such as ‘database’ and ‘db’ that can degrade searching and filtering.

Having already invested the effort in modeling the Bazaarvoice ecosystem, adding the necessary YAML to define secrets and secret stores was trivial for teams. 

Example minimal secret store:

apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: aws-secrets-manager
  description: Resources of type 'secret' can depend on this Resource to indicate that it is stored in AWS Secrets Manager
spec:
  type: secret-store
  owner: team-x

Example minimal secret:

apiVersion: backstage.io/v1alpha1
kind: Resource
metadata:
  name: system-a-top-secret-key
  description: An example key stored in AWS Secrets Manager
  links:
    - url: https://internal-dev-handbook/docs/how-to-rotate-secrets
      title: Rotation guide
spec:
  type: secret
  owner: team-1
  system: system-a
  dependsOn:
    - resource:aws-secrets-manager

Finally, to connect the dots to existing Systems, Components, and Resources, simply add a dependsOn section to their definitions. For example:

apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: system-a-component
  ...
spec:
  ...
  dependsOn:
    - resource:system-a-top-secret-key
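
As a hedged sketch of how the catalog criteria could be checked automatically, the snippet below lints entities that have already been parsed into dictionaries shaped like the YAML above. It is not a Backstage feature or an existing Bazaarvoice tool; the function names and the specific checks are assumptions made for illustration.

```python
from typing import Dict, List


def entity_ref(entity: dict) -> str:
    """Build a short reference such as 'resource:aws-secrets-manager'."""
    return f"{entity['kind'].lower()}:{entity['metadata']['name']}"


def lint_secret(entity: dict, entities_by_ref: Dict[str, dict]) -> List[str]:
    """Return the catalog criteria that a 'secret' Resource fails to meet."""
    problems: List[str] = []
    metadata = entity.get("metadata", {})
    spec = entity.get("spec", {})
    if not metadata.get("description"):
        problems.append("missing description")
    if not spec.get("owner"):
        problems.append("missing owner")
    if not metadata.get("links"):
        problems.append("missing documentation link")
    # A secret should depend on at least one Resource of type 'secret-store'.
    stores = [
        ref for ref in spec.get("dependsOn", [])
        if entities_by_ref.get(ref, {}).get("spec", {}).get("type") == "secret-store"
    ]
    if not stores:
        problems.append("no secret-store dependency")
    return problems
```

A check like this could run in code review alongside the catalog files, flagging secrets that lack an owner, a rotation guide, or a known storage location.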

How does it look?

It’s fantastic in our eyes! Let’s break down the image above.

The “About” section explains the secret, including which system it’s associated with, its owner, and the fact that it’s currently in production. It also provides a link to the source code and a way to report issues or mistakes, such as typos.

The “Relations” section, powered by an additional plugin, provides a visual and interactive graph that displays the relationships associated with the secret. This graph helps users quickly build a mental picture of the resources in the context of their systems, owners, and components. Navigating through this graph has proven to be an effective and efficient mechanism for understanding the relationships associated with the secret.

The “Links” section offers a consistent place to reference documentation related to secret management. 

Lastly, the “PagerDuty” plugin integrates with the on-call system, eliminating the need for manual contact identification during emergency incidents.

The value of Backstage shines through the power of the available plugins. Searching and filtering empower discoverability, and the API opens the potential for further integrations to internal systems.
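
To illustrate the kind of question the catalog helps answer during an incident, the sketch below walks `dependsOn` relationships in reverse to find everything affected by a given secret or secret store. It works on plain dictionaries shaped like the YAML examples in this post rather than calling Backstage; the function names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def entity_ref(entity: dict) -> str:
    """Build a short reference such as 'resource:system-a-top-secret-key'."""
    return f"{entity['kind'].lower()}:{entity['metadata']['name']}"


def blast_radius(target_ref: str, entities: List[dict]) -> List[str]:
    """Return every entity that depends on target_ref, directly or transitively."""
    # Invert the dependsOn edges: dependency -> list of dependents.
    dependents: Dict[str, List[str]] = {}
    for entity in entities:
        for dependency in entity.get("spec", {}).get("dependsOn", []):
            dependents.setdefault(dependency, []).append(entity_ref(entity))

    seen = set()
    pending = deque([target_ref])
    while pending:  # breadth-first walk over the inverted edges
        current = pending.popleft()
        for ref in dependents.get(current, []):
            if ref not in seen:
                seen.add(ref)
                pending.append(ref)
    return sorted(seen)
```

Given the example entities from this post, asking for the blast radius of `resource:aws-secrets-manager` would surface both the secret stored there and the component that depends on that secret.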

Keeping it fresh

Maintaining accurate and up-to-date documentation is always a challenge. Co-locating the service catalog with related codebases helps avoid the risk of it becoming stale and has become a consideration for reviewing and approving changes. 

We are early in our journey with this approach to a secrets catalog and aim to share our incremental improvements in further blog posts as we progress.