In part one, we described our Analytics data ingestion pipeline, with BigQuery as our data warehouse. However, having our analytics events in BigQuery is not enough: the data still needs to be served to our end users.
In this article, we will detail:
- Why we chose Redshift to store our data marts,
- How it fits into our serving layer,
- Key learnings and optimization tips to make the most out of it,
- Orchestration workflows,
- How our data visualization apps (Chartio, web apps) benefit from this data.
Data is in BigQuery, now what?
The previous article described the caching algorithms used by Caffeine, in particular the eviction and concurrency models. Since then we’ve made improvements to the eviction algorithm and explored a new approach towards expiration.
Window TinyLFU (W-TinyLFU) splits the policy into three parts: an admission window, a frequency filter, and the main region. By using a compact popularity sketch, the historic frequencies are cheap to retain and look up. This allows new arrivals that are unlikely to be used again to be discarded quickly, guarding the main region from cache pollution. The admission window provides a small region for recency bursts, avoiding consecutive misses while an item is building up its popularity.
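To make the three parts concrete, here is a minimal Python sketch of the idea. It is not Caffeine's implementation: the class names `FrequencySketch` and `WindowTinyLfuCache` are hypothetical, the main region is a plain LRU rather than whatever structure the library actually uses, and the sketch width, aging threshold, and 1% window fraction are illustrative assumptions.

```python
from collections import OrderedDict


class FrequencySketch:
    """Small count-min sketch with periodic halving (hypothetical helper)."""

    def __init__(self, width=256, depth=4, sample_size=2560):
        self.width = width
        self.rows = [[0] * width for _ in range(depth)]
        self.sample_size = sample_size  # assumed aging threshold
        self.additions = 0

    def _slots(self, key):
        # One slot per row; the row index acts as the hash seed.
        return [hash((row, key)) % self.width for row in range(len(self.rows))]

    def increment(self, key):
        for row, slot in zip(self.rows, self._slots(key)):
            row[slot] = min(row[slot] + 1, 15)  # saturate like a 4-bit counter
        self.additions += 1
        if self.additions >= self.sample_size:
            self._age()

    def estimate(self, key):
        # The minimum across rows limits over-counting from collisions.
        return min(row[slot] for row, slot in zip(self.rows, self._slots(key)))

    def _age(self):
        # Halve every counter so stale popularity fades over time.
        for row in self.rows:
            for i, count in enumerate(row):
                row[i] = count // 2
        self.additions //= 2


class WindowTinyLfuCache:
    """Admission window + frequency filter + main region (simplified)."""

    def __init__(self, capacity, window_fraction=0.01):
        self.window_capacity = max(1, int(capacity * window_fraction))
        self.main_capacity = max(1, capacity - self.window_capacity)
        self.window = OrderedDict()  # LRU order: oldest entry first
        self.main = OrderedDict()    # LRU here; an assumption of this sketch
        self.sketch = FrequencySketch()

    def get(self, key):
        self.sketch.increment(key)  # record popularity on hits and misses
        for region in (self.window, self.main):
            if key in region:
                region.move_to_end(key)
                return region[key]
        return None

    def put(self, key, value):
        self.sketch.increment(key)
        if key in self.main:
            self.main[key] = value
            self.main.move_to_end(key)
            return
        # New (or window-resident) entries always land in the window first.
        self.window[key] = value
        self.window.move_to_end(key)
        if len(self.window) > self.window_capacity:
            candidate, candidate_value = self.window.popitem(last=False)
            self._admit(candidate, candidate_value)

    def _admit(self, candidate, value):
        if len(self.main) < self.main_capacity:
            self.main[candidate] = value
            return
        victim = next(iter(self.main))  # least recently used main entry
        # Frequency filter: the candidate must be more popular than the victim.
        if self.sketch.estimate(candidate) > self.sketch.estimate(victim):
            del self.main[victim]
            self.main[candidate] = value
        # Otherwise the candidate is dropped, shielding the main region.
```

With this sketch, a handful of frequently read keys should survive a long scan of one-off keys, because each scanned key leaves the window with a lower estimated frequency than the main region's eviction victim and is therefore rejected.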
This structure works surprisingly well for many important workloads like database, search, and analytics. These cases are frequency-biased, so a small admission window is desirable to filter aggressively...
Wake up! It's HighScalability time:
Do you like this sort of Stuff? I'd greatly appreciate your support on Patreon. Know anyone who needs cloud? I wrote Explain the Cloud Like I'm 10 just for them. It has 39 mostly 5-star reviews. They'll learn a lot and love you forever.
- 2%: of sales spent by consumer packaged goods companies on R&D (14% for tech); 272 million: metric tons of plastic are produced each year around the globe; 100+ fps: Google's Edge TPU; 6,000: bugs per million lines of code; 2.2 GB/sec: SIMD JSON parser; 20-30%: fall in DRAM prices; 8x: Russian hackers faster than North Korean hackers; 50%: EV car sales in China by 2025;
- Quotable Quotes:
- @davygreenberg: If I do a job in 30 minutes it’s because I spent 10 years learning how to do that in 30 minutes. You owe me for the years, not the minutes.
- @PaulDJohnston: Lambda done badly is still better than Kubernetes done well
- Ross Mcilroy: we now believe that speculative vulnerabilities on today's hardware defeat all language-enforced confidentiality with no known...