Popular is cheaper: curtailing memory costs in interactive analytics engines Ghosh et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
We’re sticking with the optimisation of data analytics today, but at the other end of the spectrum to the work on smart arrays that we looked at yesterday. Getafix (extra points for the Asterix-inspired name, especially as it works with Yahoo!’s Druid cluster) is aimed at reducing the memory costs for large-scale in-memory data analytics, without degrading performance of course. It does this through an intelligent placement strategy that decides on replication level and data placement for data segments based on the changing popularity of those segments over time. Experiments with workloads from Yahoo!’s production Druid cluster show that Getafix can reduce memory footprint by 1.45-2.15x while maintaining comparable average and tail latencies. If you translate that into a public cloud setting, and assuming a 100TB hot dataset size — a conservative estimate in the Yahoo! case — we’re looking at savings on the order of $10M per year.
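As a rough illustration of the popularity-to-replication idea, here is a minimal sketch (my own, not Getafix’s actual algorithm, which also decides where each replica goes and adapts as popularity shifts): segments claim replicas in proportion to their recent query counts, with a floor of one copy each.

```python
def replica_counts(query_counts, total_replicas):
    """Assign each segment replicas in proportion to its recent query
    popularity, with a floor of one replica per segment. Illustrative
    only: the real system also balances segments across nodes."""
    total_queries = sum(query_counts.values())
    replicas = {}
    for segment, count in query_counts.items():
        share = count / total_queries if total_queries else 0.0
        replicas[segment] = max(1, round(share * total_replicas))
    return replicas

# Hot segments get many copies; cold segments keep a single copy.
counts = {"seg-a": 9000, "seg-b": 800, "seg-c": 200}
print(replica_counts(counts, total_replicas=12))
# {'seg-a': 11, 'seg-b': 1, 'seg-c': 1}
```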
Real-time analytics is projected to Continue reading
Analytics with smart arrays: adaptive and efficient language-independent data Psaroudakis et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
We’re going lower-level today, with a look at some work on adaptive data structures by Oracle. It’s motivated by a desire to speed up big data analytic workloads that are “increasingly limited by simple bottlenecks within the machine.” The initial focus is on array processing, but the ambition is to extend the work to more data types in time.
Modern servers have multiple interconnected sockets of multi-core processors. Each socket has local memory, accessible via a cache-coherent non-uniform memory access (ccNUMA) architecture. In the NUMA world the following hold true:
If we want to crunch through an array as fast as possible in a NUMA world, the optimum way of doing it depends on the details of the machine, and on the application Continue reading
Medea: scheduling of long running applications in shared production clusters Garefalakis et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
We’re sticking with schedulers today, and a really interesting system called Medea, which is designed to support the common real-world use case of mixed long running applications (LRAs) and shorter duration tasks within the same cluster. The work is grounded in production cluster workloads at Microsoft and is now part of the Apache Hadoop 3.1 release. In the evaluation, when compared to Kubernetes’ scheduling algorithm, Medea reduces median runtimes by up to 32%, and by 2.1x compared to the previous-generation YARN scheduler.
…a substantial portion of production clusters today is dedicated to LRAs…. placing LRAs, along with batch jobs, in shared clusters is appealing to reduce cluster operational costs, avoid unnecessary data movement, and enable pipelines involving both classes of applications. Despite these observations, support for LRAs in existing schedulers is rudimentary.
Example uses of long running application containers include streaming systems, interactive data-intensive applications (maintaining Continue reading
Optimus: an efficient dynamic resource scheduler for deep learning clusters Peng et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
It’s another paper promising to reduce your deep learning training times today. But instead of improving the programming model and/or dataflow engine, Optimus improves the scheduling of jobs within a cluster. You can run it on top of Kubernetes, and the authors claim about a 1.6x reduction in makespan compared to the most widely used schedulers today.
We’re using ever larger models, with ever increasing amounts of data (at least, whenever we can get our hands on it). In general this improves the learning accuracy, but it also increases the training time. The most common approach is parallel training using a machine learning cluster. Typically a model is partitioned among multiple parameter servers, and training data is spread across multiple workers. Workers compute parameter updates and push them to the respective parameter server.
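As a sketch of that pattern, here’s a toy single-process parameter server with an illustrative least-squares objective (my own simplification; real deployments shard parameters across many server processes and run workers concurrently):

```python
import numpy as np

class ParameterServer:
    """Holds a shard of the model and applies updates pushed by workers."""
    def __init__(self, dim, lr=0.1):
        self.weights = np.zeros(dim)
        self.lr = lr

    def push(self, gradient):          # a worker pushes its update
        self.weights -= self.lr * gradient

    def pull(self):                    # a worker fetches current weights
        return self.weights.copy()

def worker_step(server, x_batch, y_batch):
    """One worker iteration: pull weights, compute a gradient on the
    local mini-batch (least squares here), push it back."""
    w = server.pull()
    residual = x_batch @ w - y_batch
    gradient = x_batch.T @ residual / len(y_batch)
    server.push(gradient)

# Toy run: workers take turns training against a single server shard.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w
ps = ParameterServer(dim=3)
for step in range(200):
    batch = rng.choice(64, size=16, replace=False)
    worker_step(ps, X[batch], y[batch])
print(ps.weights)   # converges towards true_w
```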
Training is an iterative process with a dataset divided into chunks, and each chunk further divided into mini-batches. A Continue reading
Improving the expressiveness of deep learning frameworks with recursion Jeong, Jeong et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
Last week we looked at the embedded dynamic control flow operators in TensorFlow. In today’s paper choice, Jeong et al. make the case for support of an additional control flow construct: recursion. A little recursion, it turns out, can go a long way. Implemented on top of TensorFlow (and with a design that should also work for other embedded control flow machine learning frameworks, e.g. Theano, Caffe, MXNet), support for recursion enables cleaner expression of a class of model architectures, and improved performance. The performance gains come from the increased opportunities to exploit parallelism within the recursive definitions.
In this paper, we introduce recursive definitions into the programming model of existing embedded control flow frameworks, adding first-class support for recursion. By allowing users to directly express recursive definitions in application code with enhanced programmability, models with recursive data structures such as trees or graphs can be written without requiring users to use a separate complex API Continue reading
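To see why recursion is such a natural fit, consider a model over parse trees: a node’s embedding is naturally defined in terms of its children’s embeddings. A plain-Python sketch of that recursive shape (illustrative only; the paper’s contribution is expressing this inside the framework’s dataflow graph so that independent subtrees can execute in parallel):

```python
import numpy as np

class Leaf:                    # a word vector at the leaves
    def __init__(self, vec):
        self.vec = vec

class Node:                    # an internal node with two children
    def __init__(self, left, right):
        self.left, self.right = left, right

def embed(tree, W):
    """A node's embedding is a function of its children's embeddings.
    The two recursive calls are independent, which is exactly the
    parallelism the paper exploits inside the dataflow graph."""
    if isinstance(tree, Leaf):
        return tree.vec
    left = embed(tree.left, W)
    right = embed(tree.right, W)
    return np.tanh(W @ np.concatenate([left, right]))

dim = 4
rng = np.random.default_rng(1)
W = rng.normal(size=(dim, 2 * dim))
tree = Node(Node(Leaf(rng.normal(size=dim)), Leaf(rng.normal(size=dim))),
            Leaf(rng.normal(size=dim)))
print(embed(tree, W))
```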
BDS: A centralized near-optimal overlay network for inter-datacenter data replication Zhang et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
This is the story of how inter-datacenter multicast transfers at Baidu were sped-up by a factor of 3-5x. That’s a big deal!
For large-scale online service providers, such as Google, Facebook, and Baidu, an important data communication pattern is inter-DC multicast of bulk data — replicating massive amounts of data (e.g., user logs, web search indexes, photo sharing, blog posts) from one DC to multiple DCs in geo-distributed locations.
To set the scene, the authors study inter-DC traffic at Baidu over a period of seven days. Nearly all inter-DC traffic is multicast (91.1%), highlighting the importance of optimising the multicast use case.
When looking at the individual transfers, there is great diversity in the source and destination DCs. Thus it’s not going to suffice to pre-configure a few select routes: “we need a system to automatically route and schedule any given inter-DC multicast transfers.”
60% of the transferred files are over 1TB Continue reading
Dynamic control flow in large-scale machine learning Yu et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
In 2016 the Google Brain team published a paper giving an overview of TensorFlow, “TensorFlow: a system for large-scale machine learning.” This paper is a follow-up, taking a much deeper look at how TensorFlow supports dynamic control flow, including extending automatic differentiation to control flow constructs.
With a wide range of machine learning models in use, and rapid exploration of new techniques, a machine learning system needs to be expressive and flexible to support both research and production use cases. Given the ever larger models and training sets, a machine learning system also needs to be scalable. This means both using individual devices efficiently (anything from phones to custom ASICs in datacenters), and also supporting parallel execution over multiple devices.
Both the building blocks of machine learning and the architectures built up using these blocks have been changing rapidly. This pace appears likely to continue. Therefore, rather than defining RNNs, MoEs Continue reading
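For readers who haven’t met them, the in-graph dynamic control flow operators the paper analyses look like this. A minimal sketch using the TF 1.x graph-mode API (tf.while_loop embeds the loop in the dataflow graph itself, so the runtime drives iteration and gradients can flow through it):

```python
import tensorflow as tf   # TF 1.x graph-mode API, as in the paper

i = tf.constant(0)
acc = tf.constant(1.0)

def cond(i, acc):
    return tf.less(i, 5)            # loop while i < 5

def body(i, acc):
    return tf.add(i, 1), tf.multiply(acc, 2.0)

# The loop is part of the graph, so iterations can be distributed,
# executed in parallel where safe, and automatically differentiated.
final_i, final_acc = tf.while_loop(cond, body, [i, acc])

with tf.Session() as sess:
    print(sess.run([final_i, final_acc]))   # [5, 32.0]
```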
Reducing DRAM footprint with NVM in Facebook Eisenman et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
…to the best of our knowledge, this is the first study on the usage of NVM devices in a commercial data center environment.
We’ve been watching NVM coming for some time now, so it’s exciting to see a paper describing its adoption within Facebook. MyRocks is Facebook’s primary MySQL database, and is used to store petabytes of data and to serve real-time user activities. MyRocks uses RocksDB as the storage engine, and a typical server consumes 128GB of DRAM and 3 TB of flash. It all seems to work well, so what’s the problem? Spiralling costs!
As DRAM is facing major scaling challenges, its bit supply growth rate has experienced a historic low. Together with the growing demand for DRAM, these trends have led to problems in global supply, increasing total cost of ownership (TCO) for data center providers. Over the last year, for example, the average DRAM DDR4 price has increased by 2.3x.
Just using less DRAM per server Continue reading
ServiceFabric: a distributed platform for building microservices in the cloud Kakivaya et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
Microsoft’s Service Fabric powers many of Azure’s critical services. It’s been in development for around 15 years, in production for 10, and was made available for external use in 2015.
ServiceFabric (SF) enables application lifecycle management of scalable and reliable applications composed of microservices running at very high density on a shared pool of machines, from development to deployment to management.
Some interesting systems running on top of SF include:
SF runs in multiple clusters each with 100s to many 1000s of machines, totalling over 160K machines with over 2.5M cores.
Service Fabric defies easy categorisation, but the authors describe it as “Microsoft’s platform to support microservice applications in cloud settings.” What particularly makes it stand out from the crowd Continue reading
Hyperledger fabric: a distributed operating system for permissioned blockchains Androulaki et al., EuroSys’18
(If you don’t have ACM Digital Library access, the paper can be accessed by following the link above directly from The Morning Paper blog site).
This very well written paper outlines the design of Hyperledger Fabric and the rationales for many of the key design decisions. It’s a great introduction and overview. Fabric is a permissioned blockchain system with the following key features:
…in popular deployment configurations, Fabric achieves throughput of more than 3500 tps, achieving finality with latency of a few hundred ms and scaling well to over 100 peers.
Examples of use cases powered by Fabric include foreign exchange netting in which a blockchain is used to resolve trades that aren’t settling; enterprise asset management tracking hardware assets as they move from manufacturing to Continue reading
ForkBase: an efficient storage engine for blockchain and forkable applications Wang et al., arXiv’18
ForkBase is a data storage system designed to support applications that need a combination of data versioning, forking, and tamper proofing. The prime example is blockchain systems, but this could also include collaborative applications such as Google Docs. Today, for example, Ethereum and Hyperledger build their data structures directly on top of a key-value store. ForkBase seeks to push these properties down into the storage layer instead:
One direct benefit is that it reduces development efforts for applications requiring any combination of these features. Another benefit is that it helps applications generalize better by providing additional features, such as efficient historical queries, at no extra cost. Finally, the storage engine can exploit performance optimization that is hard to achieve at the application layer.
Essentially what we end up with is a key-value store with native support for versioning, forking, and tamper evidence, built on top of an underlying object storage system. At the core of ForkBase is a novel index structure called a POS-Tree (pattern-oriented-split tree).
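This excerpt doesn’t detail POS-Tree internals, but the ‘pattern-oriented split’ is kin to content-defined chunking: boundaries are placed wherever a fingerprint of the local content matches a fixed pattern, so identical regions produce identical chunks across versions and forks, and deduplicate naturally. A toy sketch of that splitting idea (my own, not ForkBase’s actual implementation):

```python
import hashlib

def chunk(data, win=8, mask_bits=6):
    """Cut wherever a fingerprint of the last `win` bytes matches a
    fixed bit pattern. Boundaries depend only on local content, so
    identical regions yield identical chunks across versions/forks."""
    pattern = (1 << mask_bits) - 1
    chunks, start = [], 0
    for i in range(len(data)):
        window = data[max(0, i - win + 1):i + 1]
        fp = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if (fp & pattern) == pattern and i + 1 - start >= win:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    # Content-addressed store: identical chunks share one slot.
    return {hashlib.sha256(c).hexdigest(): c for c in chunks}

store = chunk(bytes(range(256)) * 64)   # highly repetitive 16KB input
print(len(store), "unique chunks stored")
```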
From the bottom-up, ForkBase comprises a chunk storage layer that performs chunking and deduplication, a Continue reading
zkLedger: privacy-preserving auditing for distributed ledgers Narula et al., NSDI’18
Somewhat similarly to Solidus, which we looked at late last year, zkLedger (presumably this stands for zero-knowledge Ledger) provides transaction privacy for participants in a permissioned blockchain setting. zkLedger also has an extra trick up its sleeve: it provides rich and fully privacy-preserving auditing capabilities. Thus a number of financial institutions can collectively use a blockchain-based settlement ledger, and an auditor can measure properties such as financial leverage, asset illiquidity, counter-party risk exposures, and market concentration, either for the system as a whole, or for individual participants. It provides a cryptographically verified level of transparency that’s a step beyond anything we have today.
The goals of zkLedger are to hide the amounts, participants, and links between transactions while maintaining a verifiable transaction ledger, and for the Auditor to receive reliable answers to its queries. Specifically, zkLedger lets banks issue hidden transfer transactions which are still publicly verifiable by all other participants; every participant can confirm a transaction conserves assets and assets are only transferred with the spending bank’s authority.
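Under the hood, zkLedger hides amounts inside Pedersen commitments, and it is their homomorphic structure that makes publicly verifiable conservation of assets possible. A toy sketch of just that one property (deliberately tiny, insecure parameters; the real system works over an elliptic-curve group and adds range and authority proofs on top):

```python
import random

p = 2**127 - 1      # a Mersenne prime; we work in the group Z_p^*
g, h = 3, 5         # generators assumed independent (toy choice)

def commit(value, blinding):
    """Pedersen commitment: commit(v, r) = g^v * h^r mod p."""
    return (pow(g, value % (p - 1), p) * pow(h, blinding % (p - 1), p)) % p

# A transfer of 40 from bank A to bank B across three banks is the
# row of hidden amounts (-40, +40, 0), which must sum to zero.
amounts = [-40, 40, 0]
blindings = [random.randrange(p - 1) for _ in amounts]
row = [commit(v, r) for v, r in zip(amounts, blindings)]

# Homomorphism: the product of the commitments is a commitment to the
# sums, so anyone can check conservation without seeing any amount.
product = 1
for c in row:
    product = (product * c) % p
assert product == commit(sum(amounts), sum(blindings))
print("row verified: hidden amounts sum to zero")
```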
A zkLedger system comprises n banks and an auditor that verifies certain operational aspects of transactions Continue reading
Towards a design philosophy for interoperable blockchain systems Hardjono et al., arXiv 2018
Once upon a time there were networks and inter-networking, which let carefully managed groups of computers talk to each other. Then with a capital “I” came the Internet, with design principles that ultimately enabled devices all over the world to interoperate. Like many other people, I have often thought about the parallels between networks and blockchains, between the Internet and something we might call ‘the Blockchain’ (capital ‘B’). In today’s paper choice, Hardjono et al. explore this relationship, seeing what we can learn from the design principles of the Internet, and what it might take to create an interoperable blockchain infrastructure. Some of these lessons are embodied in the MIT Tradecoin project.
We argue that if blockchain technology seeks to be a fundamental component of the future global distributed network of commerce and value, then its architecture must also satisfy the same fundamental goals of the Internet architecture.
This section of the paper is a précis of ‘The design philosophy of the DARPA Internet protocols’ from SIGCOMM 1988. The top three fundamental goals for the Internet as conceived Continue reading
Measuring the tendency of CNNs to learn surface statistical regularities Jo et al., arXiv’17
With thanks to Cris Conde for bringing this paper to my attention.
We’ve looked at quite a few adversarial attacks on deep learning systems in previous editions of The Morning Paper. I find them fascinating for what they reveal about the current limits of our understanding.
…humans are able to correctly classify the adversarial image with relative ease, whereas the CNNs predict the wrong label, usually with very high confidence. The sensitivity of high performance CNNs to adversarial examples casts serious doubt that these networks are actually learning high level abstract concepts. This begs the following question: How can a network that is not learning high level abstract concepts manage to generalize so well?
In this paper, Jo and Bengio conduct a series of careful experiments to try and discover what’s going on. The initial hypothesis runs like this:
Large-scale analysis of style injection by relative path overwrite Arshad et al., WWW’18
(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site, or from the WWW 2018 proceedings page).
We’ve all been fairly well trained to have good awareness of cross-site scripting (XSS) attacks. Less obvious, and also less well known, is that a similar attack is possible using style sheet injection. A good name for these attacks might be SSS: same-site style attacks.
Even though style injection may appear less serious a threat than script injection, it has been shown that it enables a range of attacks, including secret exfiltration… Our work shows that around 9% of the sites in the Alexa top 10,000 contain at least one vulnerable page, out of which more than one third can be exploited.
I’m going to break today’s write-up down into four parts:
Style sheet injection Continue reading
Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications Xu et al., WWW’18
(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site, or from the WWW 2018 proceedings page).
Today’s paper examines the problem of anomaly detection for web application KPIs (e.g. page views, number of orders), studied in the context of a ‘top global Internet company’ which we can reasonably assume to be Alibaba.
Among all KPIs, the most (important?) ones are business-related KPIs, which are heavily influenced by user behaviour and schedule, thus roughly have seasonal patterns occurring at regular intervals (e.g., daily and/or weekly). However, anomaly detection for these seasonal KPIs with various patterns and data quality has been a great challenge, especially without labels.
Donut is an unsupervised anomaly detection algorithm based on Variational Auto-Encoding (VAE). It uses three techniques (modified ELBO, missing data injection, and MCMC imputation), which together add up to state-of-the-art anomaly detection performance. One of the interesting findings in the research is that it is important to train on both normal data and abnormal data Continue reading
Algorithmic glass ceiling in social networks: the effects of social recommendations on network diversity Stoica et al., WWW’18
(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site, or from the WWW 2018 proceedings page).
Social networks were meant to connect us and bring us together. This paper shows that while they might be quite successful at doing this in the small, on a macro scale they’re actually doing the opposite. Not only do they reinforce and sustain disparities among groups, but they actually increase the rate at which disparity grows. I.e., they’re driving us apart. This happens due to the rich-get-richer phenomenon resulting from friend/follow recommendation algorithms.
… we find that prominent social recommendation algorithms can exacerbate the under-representation of certain demographic groups at the top of the social hierarchy… Our mathematical analysis demonstrates the existence of an algorithmic glass ceiling that exhibits all the properties of the metaphorical social barrier that hinders groups like women or people of colour from attaining equal representation.
“In the social networks now governing the knowledge, Continue reading
Pixie: a system for recommending 3+ billion items to 200+ million users in real-time Eksombatchai et al., WWW’18
(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site, or from the WWW 2018 proceedings page).
Pinterest is a visual catalog with several billion pins, which are visual bookmarks containing a description, a link, and an image or a video. A major problem faced at Pinterest is to provide personalized, engaging, and timely recommendations from a pool of 3+ billion items to 200+ million monthly active users.
Stating the obvious, 3 billion or so items is a lot to draw recommendations from. This paper describes how Pinterest do it. One of the requirements is that recommendations need to be calculated in real-time on-demand. I’m used to thinking about the shift from batch to real-time in terms of improved business responsiveness, more up-to-date information, continuous processing, and so on. Pinterest give another really good reason which is obvious with hindsight, but hadn’t struck me before: when you compute recommendations using a batch process, you have to calculate the recommendations for every user Continue reading
SafeKeeper: protecting web passwords using trusted execution environments Krawiecka et al., WWW’18
(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site, or from the WWW 2018 proceedings page).
Today’s paper is all about password management for password-protected web sites and applications. Even if we assume that passwords are salted and hashed in accordance with best practice (NIST’s June 2017 digital identity guidelines now mandate the use of keyed one-way functions such as CMAC), an adversary that can obtain a copy of the back-end database containing the per-user salts and the hash values can still mount brute force guessing attacks against individual passwords.
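For concreteness, here’s the keyed one-way function idea in standard-library Python. HMAC stands in for CMAC below, since Python’s stdlib has no CMAC; the essential point, as we’ll see, is that SafeKeeper keeps the key inside a trusted execution environment, so a stolen database of salts and digests alone gives an attacker nothing to brute-force against:

```python
import hashlib
import hmac
import os

SERVER_KEY = os.urandom(32)    # in SafeKeeper, never leaves the enclave

def hash_password(password: str, salt: bytes) -> bytes:
    # Keyed one-way function over salt + password.
    return hmac.new(SERVER_KEY, salt + password.encode(), hashlib.sha256).digest()

def register(password: str):
    salt = os.urandom(16)
    return salt, hash_password(password, salt)   # store salt + digest

def verify(password: str, salt: bytes, digest: bytes) -> bool:
    return hmac.compare_digest(hash_password(password, salt), digest)

salt, digest = register("correct horse battery staple")
print(verify("correct horse battery staple", salt, digest))   # True
print(verify("guess123", salt, digest))                       # False
```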
SafeKeeper goes a lot further in its protection of passwords. What really stands out is the threat model. SafeKeeper keeps end user passwords safe even when we assume that an adversary has unrestricted access to the password database. Not only that, the adversary is able to modify the content sent to the user from the web site (including active content such as client-side scripts). And not only that! The adversary is also able to read all Continue reading
Semantics and complexity of GraphQL Hartig & Pérez, WWW’18
(If you don’t have ACM Digital Library access, the paper can be accessed either by following the link above directly from The Morning Paper blog site, or from the WWW 2018 proceedings page).
GraphQL has been gathering good momentum since Facebook open sourced it in 2015, so I was very interested to see this paper from Hartig and Pérez exploring its properties.
One of the main advantages (of GraphQL) is its ability to define precisely the data you want, replacing multiple REST requests with a single call…
One of the most interesting questions here is what if you make a public-facing GraphQL-based API (as e.g. GitHub have done), and then the data that people ask for happens to be very expensive to compute in space and time?
Here’s a simple GraphQL query to GitHub asking for the login names of the owners of the first two repositories where ‘danbri’ is an owner.
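(The query appears as an image in the original post; below is a hedged reconstruction, with field names following GitHub’s GraphQL v4 schema, sent from Python.)

```python
import requests

# Hypothetical reconstruction of the query described above.
query = """
{
  user(login: "danbri") {
    repositories(first: 2, affiliations: OWNER) {
      nodes {
        owner { login }
      }
    }
  }
}
"""

resp = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers={"Authorization": "bearer <YOUR_TOKEN>"},   # auth is required
)
print(resp.json())
```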
From here there are two directions we can go in to expand the set of results returned: we can increase the breadth by asking for more repositories to be considered (i.e., changing first:2
Continue reading