When Should Approximate Query Processing Be Used?
This is a guest repost by Barzan Mozafari, who is part of a new startup, www.snappydata.io, that recently launched an open source OLTP + OLAP Database built on Spark.
The growing market for Big Data has created a lot of interest around approximate query processing (AQP) as a means of achieving interactive response times (e.g., sub-second latencies) when faced with terabytes and petabytes of data. At the same time, there is a lot of misinformation about this technology and what it can or cannot do.
Having been involved in building a few academic prototypes and industrial engines for approximate query processing, I have heard many interesting statements about AQP and/or sampling techniques (from both DB vendors and end-users):
Myth #1. Sampling is only useful when you know your queries in advance
Myth #2. Sampling misses out on rare events or outliers in the data
Myth #3. AQP systems cannot handle join queries
Myth #4. It is hard for end-users to use approximate answers
Myth #5. Sampling is just like indexing
Myth #6. Sampling will break the BI tools
Myth #7. There is no point approximating if your data fits in memory
Although there is a Continue reading