At Bazaarvoice, our business is powered by semi-structured data, and our mission is to unlock the potential of that data by delivering the products, reviews, and answers to product-related questions that are relevant to our clients and their consumers. Whether a consumer is trying to recommend a product that will answer another consumer’s question, or a merchandiser is trying to analyze reviews of the latest product launch, the vast majority of our platform functionality is powered by a search across a Solr/Lucene index. That’s why we were excited to attend the Lucene Revolution conference back in May 2011, and I wanted to share some key observations that can help you improve the search experience for your own users.
When integrating search into any application, you should first recognize that search across free-text or semi-structured data is unlike other features in your application: the “correct” output for a text search is often not well-defined. Typically, your requirements will state that searches should return “relevant” results, but for a non-trivial text corpus and an unbounded user query space, it is effectively impossible to define “relevant” for an arbitrary user query. In the face of this uncertainty, we developers tend to implement search the same way we implement other features: we configure indexing and querying in a way that makes sense for the handful of cases we can imagine, and then we check the overall outcome for a dozen or so sample queries we expect our users might enter. Once we see reasonable results for our sample queries, and no obvious ways to improve results across the board, we stamp the functionality as complete and move on to the next task, right?
The truth is that the lack of well-defined “correct” output for text search is actually the starting point for another process: listening to what your end users expect from search. Unless you are the sole user, you as the developer likely have only the vaguest understanding of how your users actually search. This is not a critique of you; it’s because every individual has developed their own process for formulating the terms they enter into that free-text search box you’ve provided. Fortunately, there are a number of common techniques for understanding user search behavior, which Otis Gospodnetić of Sematext outlined in his talk, Search Analytics: What? Why? How? Much as in web analytics, Otis described a number of key reports for measuring how well search is meeting the needs of end users. Among these are the top queries with zero results, the top queries with the highest exit rate, words per query, and the top queries overall. Each of these reports can generally be created from query logs alone, and they are important barometers for evaluating and tuning the effectiveness of your search function. Using query logs, you can gauge the potential benefit of adding spell-checking (to address zero hits due to misspellings), query auto-completion (to assist with long queries), and hit highlighting (to show why results were considered relevant). After deeper query log analysis, you may even decide to preprocess user queries in a tailored way; Floyd Morgan of Intuit described how they distill variations of the same query (e.g. “How do I input my 1099” and “Where do I enter a 1099”) into simpler, customized search terms (“wheretoent 1099”) that provide better precision. As you can see, you can gain a significant understanding of your users’ expectations from query logs alone, but they do not provide the whole picture…
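To give these reports a concrete shape, here is a minimal sketch of how the top-query, zero-result, and words-per-query reports could be produced from raw query logs. It assumes a hypothetical tab-separated log format (query text, then result count, one search per line); your own log format and tooling will differ.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of basic search-analytics reports computed from a query log.
 * Assumed (hypothetical) log format: one line per search, "queryText<TAB>resultCount".
 */
public class QueryLogReport {

    public static void main(String[] args) throws IOException {
        Map<String, Integer> queryCounts = new HashMap<String, Integer>();
        Map<String, Integer> zeroHitCounts = new HashMap<String, Integer>();
        long totalWords = 0;
        long totalQueries = 0;

        BufferedReader in = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = in.readLine()) != null) {
            String[] fields = line.split("\t");
            if (fields.length < 2) {
                continue; // skip malformed lines
            }
            String query = fields[0].trim().toLowerCase();
            int resultCount;
            try {
                resultCount = Integer.parseInt(fields[1].trim());
            } catch (NumberFormatException e) {
                continue; // skip lines without a numeric result count
            }

            totalQueries++;
            totalWords += query.split("\\s+").length;
            increment(queryCounts, query);
            if (resultCount == 0) {
                // candidates for spell-checking, synonyms, or content gaps
                increment(zeroHitCounts, query);
            }
        }
        in.close();

        System.out.println("Average words per query: " + ((double) totalWords / totalQueries));
        System.out.println("Top queries overall:     " + topN(queryCounts, 10));
        System.out.println("Top zero-result queries: " + topN(zeroHitCounts, 10));
    }

    private static void increment(Map<String, Integer> counts, String key) {
        Integer current = counts.get(key);
        counts.put(key, current == null ? 1 : current + 1);
    }

    private static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue(); // most frequent first
            }
        });
        List<String> top = new ArrayList<String>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey() + " (" + entries.get(i).getValue() + ")");
        }
        return top;
    }
}
```

Even a throwaway report like this tends to surface misspellings and vocabulary mismatches that you would never guess at from your own handful of sample queries.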
For even better analysis and tuning of your search functionality, you need to pair query log data with some other form of feedback on result quality, usually click-through rates (CTR) on search results. Again, Otis described a number of metrics to compute when query logs are paired with CTR data. In a separate session, Andrzej Białecki of Lucid Imagination described how to take this data a step further, treating CTR as direct feedback on result relevance and incorporating it into the actual document score. At first blush, this seems like an ideal and straightforward search improvement, but Andrzej also identified a number of undesired effects that require conscious effort to avoid. He also highlighted that Solr/Lucene does not currently provide a preferred storage model for this type of data: Solr 3.x provides an ExternalFileField, which is currently the best mechanism for incorporating document popularity based on CTR, while Lucene 4.0 is slated to deliver an efficient mechanism for storing popularity and other per-document values, which Simon Willnauer, a core Lucene committer, described in his session on Column Stride Fields, a.k.a. DocValues. Finally, Timothy Potter of NREL described complementary techniques for improving result relevance by Boosting Documents by Recency, Popularity, and User Preferences. Obviously, not every technique is directly applicable to every scenario, but the approaches are common to many search applications, so it is worth considering how they could apply to your own.
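To make the popularity and recency boosting ideas a bit more concrete, here is a rough SolrJ sketch; it is not the exact approach any of the speakers presented. It assumes an edismax request handler and a schema with a popularity field (e.g. backed by ExternalFileField and refreshed from CTR data) and a submissionDate field; those field names and the boost constants are illustrative assumptions.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.response.QueryResponse;

/**
 * Sketch of folding CTR-derived popularity and recency into document scores
 * via Solr function queries. Field names ("popularity", "submissionDate",
 * "title", "text") are assumptions for illustration only.
 */
public class PopularityBoostExample {

    public static QueryResponse search(SolrServer solr, String userQuery)
            throws SolrServerException {
        SolrQuery query = new SolrQuery(userQuery);
        query.set("defType", "edismax");
        query.set("qf", "title text");

        // Multiplicative boost by popularity; with ExternalFileField, these
        // values can be refreshed from click logs without reindexing documents.
        query.set("boost", "log(sum(popularity,1))");

        // Additive recency boost: newer documents score higher, decaying over time.
        // 3.16e-11 is roughly 1 / (milliseconds in a year), so a year-old document
        // contributes about half the boost of a brand-new one.
        query.set("bf", "recip(ms(NOW,submissionDate),3.16e-11,1,1)");

        return solr.query(query);
    }
}
```

The appeal of ExternalFileField in this sketch is that the popularity values live in a side file alongside the index, so they can be regenerated from click logs on a regular schedule without touching the indexed documents themselves.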
As you can see, integrating search into an application is not a write-once, run-forever task. There is a wealth of opportunity for improving search so that it meets the actual needs of your users, and most of the information necessary for learning how to improve can be obtained simply by listening to your users with common techniques like those above. So, I highly encourage you to review the sessions I have highlighted, and you can check out slides and videos for all the rest in the Lucene Revolution 2011 recap. Enjoy!