Author Archives: Lou Kratz

About Lou Kratz

I’m a research engineer specializing in computer vision and machine learning. I spent most of my time on tech-transfer: figuring out how to build cutting edge research from the lab into real-world applications. As a research engineer I rapidly switch between various roles including researcher, software engineer, and data scientist. Prior to diving into industry, I received my PhD from Drexel University under the mentorship of Ko Nishino. Periodically I teach Computer Vision and Computational Photography at Drexel. When I’m not working, I enjoy photography, cooking, and fatherhood.

Semantically Compress Text to Save On LLM Costs

Introduction

Large language models are fantastic tools for unstructured text, but what if your text doesn’t fit in the context window? Bazaarvoice faced exactly this challenge when building our AI Review Summaries feature: millions of user reviews simply won’t fit into the context window of even newer LLMs and, even if they did, it would be prohibitively expensive.

In this post, I share how Bazaarvoice tackled this problem by compressing the input text without loss of semantics. Specifically, we use a multi-pass hierarchical clustering approach that lets us explicitly adjust the level of detail we want to lose in exchange for compression, regardless of the embedding model chosen. The final technique made our Review Summaries feature financially feasible and set us up to continue to scale our business in the future.

The Problem

Bazaarvoice has been collecting user-generated product reviews for nearly 20 years so we have a lot of data. These product reviews are completely unstructured, varying in length and content. Large language models are excellent tools for unstructured text: they can handle unstructured data and identify relevant pieces of information amongst distractors.

LLMs have their limitations, however, and one such limitation is the context window: how many tokens (roughly the number of words) can be put into the network at once. State-of-the-art large language models, such as Athropic’s Claude version 3, have extremely large context windows of up to 200,000 tokens. This means you can fit small novels into them, but the internet is still a vast, every-growing collection of data, and our user-generated product reviews are no different.

We hit the context window limit while building our Review Summaries feature that summarizes all of the reviews of a specific product on our clients website. Over the past 20 years, however, many products have garnered thousands of reviews that quickly overloaded the LLM context window. In fact, we even have products with millions of reviews that would require immense re-engineering of LLMs to be able to process in one prompt.

Even if it was technically feasible, the costs would be quite prohibitive. All LLM providers charge based on the number of input and output tokens. As you approach the context window limits for each product, of which we have millions, we can quickly run up cloud hosting bills in excess of six figures.

Our Approach

To ship Review Summaries despite these technical, and financial, limitations, we focused on a rather simple insight into our data: Many reviews say the same thing. In fact, the whole idea of a summary relies on this: review summaries capture the recurring insights, themes, and sentiments of the reviewers. We realized that we can capitalize on this data duplication to reduce the amount of text we need to send to the LLM, saving us from hitting the context window limit and reducing the operating cost of our system.

To achieve this, we needed to identify segments of text that say the same thing. Such a task is easier said than done: often people use different words or phrases to express the same thing.

Fortunately, the task of identifying if text is semantically similar has been an active area of research in the natural language processing field. The work by Agirre et. al. 2013 (SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics) even published a human-labeled data of semantically similar sentences known as the STS Benchmark. In it, they ask humans to indicate if textual sentences are semantically similar or dissimilar on a scale of 1-5, as illustrated in the table below (table from Cer et. al., SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation):

The STSBenchmark dataset is often used to evaluate how well a text embedding model can associate semantically similar sentences in its high-dimensional space. Specifically, Pearson’s correlation is used to measure how well the embedding model represents the human judgements.

Thus, we can use such an embedding model to identify semantically similar phrases from product reviews, and then remove repeated phrases before sending them to the LLM.

Our approach is as follows:

First, product reviews are segmented the into sentences.
An embedding vector is computed for each sentence using a network that performs well on the STS benchmark
Agglomerative clustering is used on all embedding vectors for each product.
An example sentence – the one closest to the cluster centroid – is retained from each cluster to send to the LLM, and other sentences within each cluster are dropped.
Any small clusters are considered outliers, and those are randomly sampled for inclusion in the LLM.
The number of sentences each cluster represents is included in the LLM prompt to ensure the weight of each sentiment is considered.

This may seem straightforward when written in a bulleted list, but there were some devils in the details we had to sort out before we could trust this approach.

Embedding Model Evaluation

First, we had to ensure the model we used effectively embedded text in a space where semantically similar sentences are close, and semantically dissimilar ones are far away. To do this, we simply used the STS benchmark dataset and computed the Pearson correlation for the models we desired to consider. We use AWS as a cloud provider, so naturally we wanted to evaluate their Titan Text Embedding models.

Below is a table showing the Pearson’s correlation on the STS Benchmark for different Titan Embedding models:

Model	Dimensionality	Correlation on STS Benchmark
amazon.titan-embed-text-v1	1536	0.801031
amazon.titan-embed-text-v2:0	256	0.818282
amazon.titan-embed-text-v2:0	512	0.854073
amazon.titan-embed-text-v2:0	1024	0.868574
State-of-the-art		0.929

So AWS’s embedding models are quite good at embedding semantically similar sentences. This was great news for us – we can use these models off the shelf and their cost is extremely low.

Semantically Similar Clustering

The next challenge we faced was: how can we enforce semantic similarity during clustering? Ideally, no cluster would have two sentences whose semantic similarity is less than humans can accept – a score of 4 in the table above. Those scores, however, do not directly translate to the embedding distances, which is what is needed for agglomerative clustering thresholds.

To deal with this issue, we again turned to the STS benchmark dataset. We computed the distances for all pairs in the training dataset, and fit a polynomial from the scores to the distance thresholds.

This polynomial lets us compute the distance threshold needed to meet any semantic similarity target. For Review Summaries, we selected a score of 3.5, so nearly all clusters contain sentences that are “roughly” to “mostly” equivalent or more.

It’s worth noting that this can be done on any embedding network. This lets us experiment with different embedding networks as they become available, and quickly swap them out should we desire without worrying that the clusters will have semantically dissimilar sentences.

Multi-Pass Clustering

Up to this point, we knew we could trust our semantic compression, but it wasn’t clear how much compression we could get from our data. As expected, the amount of compression varied across different products, clients, and industries.

Without loss of semantic information, i.e., a hard threshold of 4, we only achieved a compression ratio of 1.18 (i.e., a space savings of 15%).

Clearly lossless compression wasn’t going to be enough to make this feature financially viable.

Our distance selection approach discussed above, however, provided an interesting possibility here: we can slowly increase the amount of information loss by repeatedly running the clustering at lower thresholds for remaining data.

The approach is as follows:

Run the clustering with a threshold selected from score = 4. This is considered lossless.
Select any outlying clusters, i.e., those with only a few vectors. These are considered “not compressed” and used for the next phase. We chose to re-run clustering on any clusters with size less than 10.
Run clustering again with a threshold selected from score = 3. This is not lossless, but not so bad.
Select any clusters with size less than 10.
Repeat as desired, continuously decreasing the score threshold.

So, at each pass of the clustering, we’re sacrificing more information loss, but getting more compression and not muddying the lossless representative phrases we selected during the first pass.

In addition, such an approach is extremely useful not only for Review Summaries, where we want a high level of semantic similarity at the cost of less compression, but for other use cases where we may care less about semantic information loss but desire to spend less on prompt inputs.

In practice, there are still a significantly large number of clusters with only a single vector in them even after dropping the score threshold a number of times. These are considered outliers, and are randomly sampled for inclusion in the final prompt. We select the sample size to ensure the final prompt has 25,000 tokens, but no more.

Ensuring Authenticity

The multi-pass clustering and random outlier sampling permits semantic information loss in exchange for a smaller context window to send to the LLM. This raises the question: how good are our summaries?

At Bazaarvoice, we know authenticity is a requirement for consumer trust, and our Review Summaries must stay authentic to truly represent all voices captured in the reviews. Any lossy compression approach runs the risk of mis-representing or excluding the consumers who took time to author a review.

To ensure our compression technique was valid, we measured this directly. Specifically, for each product, we sampled a number of reviews, and then used LLM Evals to identify if the summary was representative of and relevant to each review. This gives us a hard metric to evaluate and balance our compression against.

Results

Over the past 20 years, we have collected nearly a billion user-generated reviews and needed to generate summaries for tens of millions of products. Many of these products have thousands of reviews, and some up to millions, that would exhaust the context windows of LLMs and run the price up considerably.

Using our approach above, however, we reduced the input text size by 97.7% (a compression ratio of 42), letting us scale this solution for all products and any amount of review volume in the future.
In addition, the cost of generating summaries for all of our billion-scale dataset reduced 82.4%. This includes the cost of embedding the sentence data and storing them in a database.

What was Old is New: Finding Joy in Modernising Legacy Systems

(cover image from ThisisEngineering RAEng)

Let’s face it: software is easier to write than maintain. This is why we, as software engineers, prefer to just “rip it out and start over” instead of trying to understand what another developer (or our past self) was thinking. We seem to have collectively forgotten that “programs must be written for people to read, and only incidentally for machines to execute”. You know it’s true — we’ve all had to painstakingly trace through a casserole of spaghetti code and thin, old-world-style abstractions digging for the meat of the program only to find nothing but a mess at the bottom of our plates.

It’s easy to yell “WTF” and blame the previous dev, but the truth is often more complicated. We can’t see the future, so it’s impossible to understand how requirements, technology, or business goals will grow when we design a net-new system. As a result, systems can become unreadable as their scope increases along with the business’s dependency on them. This is a bit of a paradox: older, harder to maintain systems often provide the most value. They are hard to work on because they’ve grown with the company, and scary to work on because breaking it could be a catastrophe.

Here’s where I’m calling you out: if you like hard, rewarding problems… try it. Take the oldest system you have and make it maintainable. You know the one I’m talking about — the one no one will “own”. That one the other departments depend on but engineers hate. The one you had to patch Log4Shell on first. Do it. I dare you.I recently had such an opportunity to update a decade old machine learning system at Bazaarvoice. On the surface, it didn’t sound exciting: this thing didn’t even have neural networks! Who cares! We ll… it mattered. This system processes nearly every user-generated product review received by Bazaarvoice — nearly 9 million per month — and does so with 90 million inference calls to machine learning models. Yup — 90 million inferences! It’s a huge scale, and I couldn’t wait to dive in.

In this post, I’ll share how modernizing this system through a re-architecture, instead of a re-write, allowed us to make it scalable and cost-effective without having to rip out all of the code and start over. The resulting system is serverless, containerized, and maintainable while reducing our hosting costs by nearly 80%. This post is a companion piece to a talk I recently presented at AWS Data Summit for Software Companies on generating value from data by leveraging our best practices to ensure success in machine learning projects. This post discusses the technical aspects in more detail, but you can watch the high-level talk linked at the end.

Something Old

First, let’s take a look at what we’re dealing with here. The legacy system my team was updating moderates user-generated content for all of Bazaarvoice. Specifically, it determines if each piece of content is appropriate for our client’s websites.

This sounds straightforward — eliminate obvious infractions such as hate speech, foul language, or solicitations — but in practice, it’s much more nuanced. Each client has unique requirements for what they consider appropriate. Beer brands, for example, would expect discussions of alcohol, but a children’s brand may not. We capture these client-specific options when we onboard new clients, and our Client Services team encodes them into a management database.

For some added complexity, we also sample a subset of our content to be moderated by human moderators. This allows us to continuously measure the performance of our models and discover opportunities for building more models.

The full architecture of our legacy system is shown below:

*Our legacy moderation system hosted machine learning models on a single EC2 instance. This made deployments slow and limited scalability to the host’s memory size.*

This system has some serious drawbacks. Specifically — all of the models are hosted on a single EC2 instance. This wasn’t due to bad engineering — just the inability of the original programmers to foresee the scale desired by the company. No one thought that it would grow as much as it did. In addition, the system suffered from developer rejection: it was written in Scala, which few engineers understood. Thus, it was often overlooked for improvement since no one wanted to touch it.

As a result, the system continued to grow in a keep-the-lights-on manner. Once we got around to re-architecting it, it was running on a single x1e.8xlarge instance. This thing had nearly a terabyte of ram and costs about $5,000/month (unreserved) to operate. Don’t worry, though, we just launched a second one for redundancy and a third for QA 🙃.

This system was costly to run and was at a high risk of failure (a single bad model can take down the whole service). Furthermore, the code base had not been actively developed and was thus significantly out of date with modern data science packages and did not follow our standard practices for services written in Scala.

Something New

When redesigning this system we had a clear goal: make it scalable. Reducing operating costs was a secondary goal, as was easing model and code management.

The new design we came up with is illustrated below:

*Our new architecture deploys each model to a SageMaker Serverless endpoint. This lets us scale the number of models without limit while maintaining a small cost footprint.*

Our approach to solving all of this was to put each machine learning model on an isolated SageMaker Serverless endpoint. Like AWS Lambda functions, serverless endpoints turn off when not in use — saving us runtime costs for infrequently used models. They also can scale out quickly in response to increases in traffic.

In addition, we exposed the client options to a single microservice that routes content to the appropriate models. This was the bulk of the new code we had to write: a small API that was easy to maintain and let our data scientists more easily update and deploy new models.

This approach has the following benefits:

Decreased the time to value by over 6x. Specifically, routing traffic to existing models is instantaneous, and deploying new models can be done in under 5 minutes instead of 30.
Scale without limit – we currently have 400 models but plan to scale to thousands to continue to increase the amount of content we can automatically moderate.
Saw a cost reduction of 82% moving off EC2 as the functions turn off when not in use, and we’re not paying for top-tier machines that are underutilized.

Simply designing an ideal architecture, however, isn’t the really ~~interesting~~ hard part of rebuilding a legacy system — you have to migrate to it.

Something Borrowed

Our first challenge in migration was figuring out how the heck to migrate a Java WEKA model to run on SageMaker, let alone SageMaker Serverless.

Fortunately, SageMaker deploys models in Docker containers, so at least we could freeze the Java and dependency versions to match our old code. This would help ensure the models hosted in the new system returned the same results as the legacy one.

To make the container compatible with SageMaker, all you need to do is implement a few specific HTTP endpoints:

POST /invocation — accept input, perform inference, and return results.
GET /ping — returns 200 if the JVM server is healthy

(We chose to ignore all of the cruft around BYO multimodel containers and the SageMaker inference toolkit.)

A few quick abstractions around com.sun.net.httpserver.HttpServer and we were ready to go.

And you know what? This was actually pretty fun. Toying with Docker containers and forcing something 10 years old into SageMaker Serverless had a bit of a tinkering vibe to it. It was pretty exciting when we got it working — especially when we got the legacy code to build it in our new sbt stack instead of maven. The new sbt stack made it easy to work on, and containerization ensured we could get proper behavior while running in the SageMaker environment.

Something Blue

So we have the models in containers and can deploy them to SageMaker — almost done, right? Not quite.

The hard lesson about migrating to a new architecture is that you must build three times your actual system just to support migration.

In addition to the new system, we had to build:

A data capture pipeline in the old system to record inputs and outputs from the model. We used these to confirm that the new system would return the same results.
A data processing pipeline in the new system to compute results and compare them to the data from the old system. This involved a large amount of measurement with Datadog and needed to offer the ability to replay data when we found discrepancies.
A full model deployment system to avoid impacting the users of the old system (which would simply upload models to S3). We knew we wanted to move them to an API eventually, but for the initial release, we needed to do so seamlessly.

All of this was throw-away code we knew we could toss once we finished migrating all of the users, but we still had to build it and ensure that the outputs of the new system matched the old.

Expect this upfront.

While building the migration tools and systems certainly took more than 60% of our engineering time on this project, it too was a fun experience. Unit testing became more like data science experiments: we wrote whole suites to ensure that our output matched exactly. It was a different way of thinking that made the work just that much more fun. A step outside our normal boxes, if you will.

So… Just Try It

Next time you’re tempted to rebuild a system from code up, I’d like to encourage you to try migrating the architecture instead of the code. You’ll find interesting and rewarding technical challenges and will likely enjoy it much more than debugging unexpected edge cases of your new code.

The talk I gave is a bit more high-level and goes into the MLOps side of things. Check it out here:

Kedro 6 Months In

We build AI software in two modes: experimentation and productization. During experimentation, we are trying to see if modern technology will solve our problem. If it does, we move on to productization and build reliable data pipelines at scale.

This presents a cyclical dependency when it comes to data engineering. We need reliable and maintainable data engineering pipelines during experimentation, but don’t know what that pipeline should do until after we’ve completed the experiments. In the past, I and many data scientists I know have used an ad-hoc combination of bash scripts and Jupyter Notebooks to wrangle experimental data. While this may have been the fastest way to get experimental results and model building, it’s really a technical debt that has to be paid down the road.

The Problem

Specifically, the ad-hoc approach to experimental data pipelines causes pain points around:

Reproducibility: Ad-hoc experimentation structures puts you at risk of making results that others can’t reproduce, which can lead to product downtime if or when you need to update your approach. Simple mistakes like executing a notebook cell twice or forgetting to seed a random number generator can usually be caught. But other, more insidious problems can occur, such as behavior changes between dependency versions.
Readability: If you’ve ever come across another person’s experimental code, you know it’s hard to find where to start. Even documented projects might just say “run x script, y notebook, etc”, and it’s often unclear where the data come from and if you’re on the right track. Similarly, code reviews for data science projects are often hard to read: it’s asking a lot for a reader to differentiate between notebook code for data manipulation and code for visualization.
Maintainability: It’s common during data science projects to do some exploratory analysis or generate early results, and then revise how your data is processed or gathered. This becomes difficult and tedious when all of these steps are an unstructured collection of notebooks or scripts. In other words, the pipeline is hard to maintain: updating or changing it requires you to keep track of the whole thing.
Shareability: Ad-hoc collections of notebooks and bash scripts are also difficult for a team to work on concurrently. Each member has to ensure their notebooks are up to date (version control on notebooks is less than ideal), and that they have the correct copy of any intermediate data.

Enter Kedro

A lot of the issues above aren’t new to the software engineering discipline and have been largely solved in that space. This is where Kedro comes in. Kedro is a framework for building data engineering pipelines whose structure forces you to follow good software engineering practices. By using Kedro in the experimentation phase of projects, we build maintainable and reproducible data pipelines that produce consistent experimental results.

Specifically, Kedro has you organize your data engineering code into one or more pipelines. Each pipeline consists of a number of nodes: a functional unit that takes some data sets and parameters as inputs and produces new data sets, models, or artifacts.

This simple but strict project structure is augmented by their data catalog: a YAML file that specifies how and where the input and output data sets are to be persisted. The data sets can be stored either locally or in a cloud data storage service such as S3.

I started using Kedro about six months ago, and since then have leveraged it for different experimental data pipelines. Some of these pipelines were for building models that eventually were deployed to production, and some were collaborations with team members. Below, I’ll discuss the good and bad things I’ve found with Kedro and how it helped us create reproducible, maintainable data pipelines.

The Good

Reproducibility: I can’t say enough good things here: they nailed it. Their dependency management took a bit of getting used to but it forces a specific version on all dependencies, which is awesome. Also, the ability to just type kedro install and kedro run to execute the whole pipeline is fantastic. You still have to remember to seed random number generators, but even that is easy to remember if you put it in their params.yml file.
Function Isolation: Kedro’s fixed project structure encourages you to think about what logical steps are necessary for your pipeline, and write a single node for each step. As a result, each node tends to be short (in terms of lines of code) and specific (in terms of logic). This makes each node easy to write, test, and read later on.
Developer Parallelization: The small nodes also make it easier for developers to work together concurrently. It’s easy to spot nodes that won’t depend on each other, and they can be coded concurrently by different people.
Intermediate Data: Perhaps my favorite thing about Kedro is the data catalog. Just add the name of an output data set to catalog.yml and BOOM, it’ll be serialized to disk or your cloud data store. This makes it super easy to build up the pipeline: you work on one node, commit it, execute it, and save the results. It also comes in handy when working on a team. I can run an expensive node on a big GPU machine and save the results to S3, and another team member can simply start from there. It’s all baked in.
Code Re-usability: I’ll admit I have never re-used a notebook. At best I pulled up an old one to remind myself how I achieved some complex analysis, but even then I had to remember the intricacies of the data. The isolation of nodes, however, makes it easy to re-use them. Also, Kedro’s support for modular pipelines (i.e., packaging a pipeline into a pip package) makes it simple to share common code. We’ve created modular pipelines for common tasks such as image processing.

The Bad

While Kedro has solved many of the quality challenges in experimental data pipelines, we have noticed a few gotchas that required less than elegant work arounds:

Incremental Dataset: This support exists for reading data, but it’s lacking for writing datasets. This affected us a few times when we had a node that would take 8-10 hours to run. We lost work if the node failed part of the way through. Similarly, if the result data set didn’t fit in memory, there wasn’t a good way to save incremental results since the writer in Kedro assumes all partitions are in memory. This GitHub issue may fix it if the developers address it, but for now you have to manage partial results on your own.
Pipeline Growth: Pipelines can quickly get hard to follow since the input and outputs are just named variables that may or may not exist in the data catalog. Kedro Viz helps with this, but it’s a bit annoying to switch between the navigator and code. We’ve also started enforcing name consistency between the node names and their functions, as well as the data set names in the pipeline and the argument names in the node functions. Finally, making more, smaller pipelines is also a good way to keep your sanity. While all of these techniques help you to mentally keep track, it’s still the trade off you make for coding the pipelines by naming the inputs and outputs.
Visualization: This isn’t really considered much in Kedro, and is the one thing I’d say notebooks still have a leg up on. Kedro makes it easy for you to load the Kedro context in a notebook, however, so you can still fire one up to do some visualization. Ultimately, though, I’d love to see better support within Kedro for producing a graphical report that gets persisted to the 08_reporting layer. Right now we worked around this by making a node that renders a notebook to disk, but it’s a hack at best. I’d love better support for generating final, highly visual reports that can be versioned in the data catalog much like the intermediate data.

Conclusion

So am I a Kedro convert? Yah, you betcha. It replaces the spider-web of bash scripts and Python notebooks I used to use for my experimental data pipelines and model training, and enables better collaboration among our teams. It won’t replace a fully productionalized stream-based data pipeline for me, but it absolutely makes sure my experimental pipelines are maintainable, reproducible, and shareable.

bazaarvoice: engineering

The official blog of Bazaarvoice R&D

Author Archives: Lou Kratz

About Lou Kratz

Semantically Compress Text to Save On LLM Costs

Introduction

The Problem

Our Approach

Embedding Model Evaluation

Semantically Similar Clustering

Multi-Pass Clustering

Ensuring Authenticity

Results

What was Old is New: Finding Joy in Modernising Legacy Systems

Something Old

Something New

Something Borrowed

Something Blue

So… Just Try It

Kedro 6 Months In

The Problem

Enter Kedro

The Good

The Bad

Conclusion