Near real time analytics on a large data set = hard!
About two years ago we found ourselves adding a lot of analytical features to the Bazaarvoice Workbench, our client facing analytics product. We were implementing various new dashboards, reports, and alerts. Dashboards needed to be near real time, so we either had to find a way to run the required SQL queries in real time, or pre-compute the required data. Since the main data store was MySQL, aggregating large amounts of data on the fly in real time proved impossible, so we ended up pre-computing a lot of aggregates. Pre-computation is error prone, so we eventually went looking for other solutions that would not require as much of it.
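To give a flavor of the pre-computation involved, here is a minimal sketch of the kind of aggregate table we had to keep refreshed; the schema and refresh job are illustrative, not our actual code:

-- A pre-computed daily aggregate table; names are hypothetical.
CREATE TABLE daily_review_stats (
  client_id    INT NOT NULL,
  review_date  DATE NOT NULL,
  review_count INT NOT NULL,
  avg_rating   DECIMAL(3,2),
  PRIMARY KEY (client_id, review_date)
);

-- Periodic refresh job re-computing yesterday's slice.
REPLACE INTO daily_review_stats
SELECT client_id, DATE(submission_date), COUNT(*), AVG(rating)
FROM reviews
WHERE submission_date >= CURDATE() - INTERVAL 1 DAY
GROUP BY client_id, DATE(submission_date);

Every table like this needs its own refresh job, and every job is one more place where the numbers can silently drift out of sync with the source data.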
As we looked at the queries we were running to pre-compute aggregate data, we spotted a common theme: most of the queries filtered and aggregated data. The select clause would normally contain a small number of columns, with most values being aggregates such as counts, sums, and averages. So we started brainstorming about what type of database might be a better fit for such queries than a traditional row oriented relational DBMS.
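Here is a hypothetical example of that query shape (table and column names are illustrative, not our actual schema):

-- Typical analytical query shape: filter on a few columns, group, aggregate.
SELECT client_id,
       COUNT(*) AS review_count,
       AVG(rating) AS avg_rating,
       SUM(helpful_votes) AS total_votes
FROM reviews
WHERE submission_date >= '2011-01-01'
  AND status = 'APPROVED'
GROUP BY client_id;

Only a handful of columns out of a potentially very wide row are touched, and almost everything in the select list is an aggregate.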
We looked at a number of different data stores which could work for the problem we had. One of the alternatives considered was Apache SOLR, a great NoSQL search server we had already been using for a long time with great success. SOLR supports filtering and faceting, which would let us implement most of what we were doing with MySQL using SQL. However, quite a bit of development work has to go into indexing data in SOLR and querying it.
Enter the world of column oriented databases
Imagine a database which stores the data by column rather than by row. For example, we might want to store the following employee data in a table:
ID, Name, Department, Salary
1, Chuck Norris, DEV, 10000
2, Gate Bills, SALES, 20000
3, Bart Simpson, HR, 15000
Traditional row oriented databases store the column values of each row sequentially, so the data is laid out like this:
1, Chuck Norris, DEV, 10000
2, Gate Bills, SALES, 20000
3, Bart Simpson, HR, 15000
A column oriented database serializes values of each column together:
1,2,3
Chuck Norris, Gate Bills, Bart Simpson
DEV, SALES, HR
10000, 20000, 15000
This approach has some advantages over row oriented databases:
- A columnar database does not need to read the whole row when a query only touches a small subset of columns, which is the typical pattern in queries performing aggregations. Not having to read the whole row from disk minimizes disk I/O, and that really matters since databases are normally disk I/O bound these days.
- Another important factor is data compression. When data is stored by column, it is normally possible to achieve a high compression rate for each column. This in turn shrinks the disk footprint of the database, so disk I/O is minimized as well compared to an uncompressed database. It often makes sense to take the CPU hit of decompressing data on the fly rather than read more data from disk.
Getting to know Infobright
One of the column oriented databases we evaluated is Infobright, an open source database built on MySQL by swapping in a column oriented storage engine. Of course, such a significant change to the storage layout also meant that Infobright had to replace MySQL's query execution engine with a custom engine capable of taking advantage of the columnar storage model. So, let's jump straight to the results we got when evaluating Infobright on a data set with 100MM records in the main fact table.
In our tests, analytical queries ran on average 20x faster than on MySQL.
Additionally, the disk footprint was over 10x smaller than MySQL's thanks to data compression.
Also, we found that Infobright supported most of the SQL syntax we were already using in our queries, including outer joins, sub-queries, and unions, though a little tweaking was still required to make some queries perform well.
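For example, a query along these lines (the schema is again hypothetical) ran on Infobright as-is:

-- Outer join onto a sub-query, unchanged from the MySQL version.
SELECT c.name,
       COALESCE(r.review_count, 0) AS review_count
FROM clients c
LEFT OUTER JOIN (
    SELECT client_id, COUNT(*) AS review_count
    FROM reviews
    GROUP BY client_id
) r ON r.client_id = c.id;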
The benchmark numbers were impressive enough that we ended up using Infobright to implement near real time analytics in one of our products. It does a great job of calculating aggregates on the fly, so we no longer need to maintain as many aggregate tables (for instance, daily and monthly aggregates).
Infobright – the secret sauce
What is the secret sauce in Infobright? First, its column oriented storage model, which leads to less disk I/O. Second, its “knowledge grid”, which is aggregate data Infobright calculates during data loading. Data is stored in Data Packs of 65K values each, and for every Data Pack a Data Pack node in the knowledge grid contains a set of statistics about the data stored in that pack. For instance, Infobright can pre-calculate the min, max, and avg value of each column in the pack during the load, as well as keep track of distinct values for columns with low cardinality. Such metadata can really help when executing a query, since data packs which have no data matching the filter criteria can be ignored. If a data pack can be ignored, there is no penalty associated with decompressing it.
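To make this concrete, consider a hypothetical filter query against a fact table (names are illustrative; the comments sketch the knowledge grid's role as we understand it):

SELECT COUNT(*)
FROM review_facts
WHERE rating >= 4;

-- Conceptually, for each Data Pack the engine checks the pack's
-- min/max statistics for the rating column:
--   max(rating) < 4   -> pack is irrelevant: skipped, never decompressed
--   min(rating) >= 4  -> pack is fully relevant: its rows can be counted
--                        from metadata alone, still no decompression
--   otherwise         -> pack is "suspect": decompressed and scanned

Only the suspect packs pay the decompression cost; for a selective filter, that can be a small fraction of the table.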
Compared to our MySQL implementation, Infobright eliminated the need to create and manage indexes, as well as to partition tables.
Infobright – Limitations
Infobright Community Edition is the open source version of Infobright. Unfortunately, it does not support DML (inserts, updates, and deletes), so the only way to get data in is a bulk load using the “LOAD DATA INFILE …” command. It is still possible to append data to a table, but there is no way to update or delete existing data without re-loading the whole table. For some types of data (such as log files or call detail records) this is not a significant issue, since such data is rarely edited or deleted. For other projects, however, this limitation may be a show stopper unless the data set is small enough that it can be periodically re-loaded into Infobright.
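A typical load looks like this (the file path and table name are illustrative):

-- Bulk append; this is standard MySQL LOAD DATA syntax.
LOAD DATA INFILE '/data/reviews_2011_06_01.csv'
INTO TABLE review_facts
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';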
Infobright chose the SMP (Symmetric MultiProcessing) approach, so there is no built-in support for MPP (Massively Parallel Processing). This certainly limits the size of the databases for which query performance remains acceptable. However, it is possible to shard the data across multiple Infobright instances and then aggregate the results. Shard-Query is an open source project which makes it possible to query a data set partitioned across many database instances, and our test results confirm that this approach works really well.
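The scatter/gather idea behind this is simple. Shard-Query automates the query rewriting, but a hand-rolled sketch of the two steps (names are hypothetical) would look like:

-- Step 1 (scatter): run the same partial aggregate on every shard.
SELECT client_id,
       COUNT(*)    AS cnt,
       SUM(rating) AS rating_sum
FROM review_facts
GROUP BY client_id;

-- Step 2 (gather): merge the per-shard partials, collected into a
-- scratch table on the aggregating node. Note the average has to be
-- rebuilt from sums and counts; averaging the shard averages would be wrong.
SELECT client_id,
       SUM(cnt)                   AS review_count,
       SUM(rating_sum) / SUM(cnt) AS avg_rating
FROM shard_partials
GROUP BY client_id;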
Summary
Column oriented databases proved to be a great fit for analytical workloads. We managed to eliminate the need to maintain a large number of aggregate tables by migrating our data warehouse from MySQL to Infobright. This in turn let us support near real time visualizations in our analytics products.