Author Archives: adriancolyer
Image-to-image translation with conditional adversarial networks Isola et al., CVPR’17
It’s time we looked at some machine learning papers again! Over the next few days I’ll be looking at a few papers I’ve selected that demonstrate the exciting capabilities being developed around images. I find it simultaneously amazing to see what can be done, and troubling to think about a ‘post-reality’ society in which audio, images, and videos can all be cheaply synthesised to tell any story, with increasing realism. Will our brains really be able to hold the required degree of skepticism? It’s true that we have a saying, “Don’t believe everything you hear,” but we also say “It must be true, I’ve seen it with my own eyes…”.
Anyway, back to the research! The common name for the system described in today’s paper is pix2pix. You can find the code and more details online at https://github.com/phillipi/pix2pix. The name ‘pix2pix’ comes from the fact that the network is trained to map from input pictures (images) to output pictures (images), where the output is some translation of the input. Lots of image problems can be formulated this way, and the figure below shows six examples:

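At the heart of the approach is a conditional GAN objective combined with an L1 reconstruction term. Below is a minimal sketch of that combined loss in PyTorch-style Python; the function and variable names are mine (not from the released pix2pix code), and the details are simplified for illustration.

```python
import torch
import torch.nn.functional as F

def pix2pix_losses(G, D, x, y, lam=100.0):
    """x: batch of input images, y: batch of target images.
    G and D are assumed to be a generator and a conditional (patch)
    discriminator defined elsewhere."""
    fake = G(x)

    # The conditional discriminator scores (input, output) pairs.
    d_real = D(torch.cat([x, y], dim=1))
    d_fake = D(torch.cat([x, fake.detach()], dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
              F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

    # The generator tries to fool D, plus an L1 term pulling its output
    # towards the ground truth (the paper weights this with lambda = 100).
    d_fake_for_g = D(torch.cat([x, fake], dim=1))
    g_loss = (F.binary_cross_entropy_with_logits(d_fake_for_g, torch.ones_like(d_fake_for_g)) +
              lam * F.l1_loss(fake, y))
    return d_loss, g_loss
```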
The really fascinating part about pix2pix Continue reading
Equality of opportunity in supervised learning Hardt et al., NIPS’16
With thanks to Rob Harrop for highlighting this paper to me.
There is a lot of concern about discrimination and bias entering our machine learning models. Today’s paper choice introduces two notions of fairness: equalised odds and equalised opportunity, and shows how to construct predictors that are fair under these criteria. One very appealing feature of the model is that in the case of uncertainty caused by under-representation in the training data, the cost of less accurate decision making in that demographic is moved from the protected class (who might otherwise, for example, not be offered loans) to the decision maker. I’m going to approach the paper backwards, and start with the case study, as I find a motivating example really helps with the intuition.
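Informally, equalised odds asks that the predictor’s true positive rate and false positive rate match across groups, while equalised opportunity only requires the true positive rates to match. Here’s a minimal sketch of checking both on binary predictions (the function and names are mine, purely for illustration):

```python
import numpy as np

def group_rates(y_true, y_pred, group):
    """Return {group_value: (TPR, FPR)} for binary labels and predictions."""
    rates = {}
    for a in np.unique(group):
        yt, yp = y_true[group == a], y_pred[group == a]
        tpr = yp[yt == 1].mean() if (yt == 1).any() else np.nan
        fpr = yp[yt == 0].mean() if (yt == 0).any() else np.nan
        rates[a] = (tpr, fpr)
    return rates

# Equalised opportunity: the TPRs (first component) agree across groups.
# Equalised odds: both the TPRs and the FPRs agree across groups.
```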
We examine various fairness measures in the context of FICO scores with the protected attribute of race. FICO scores are a proprietary classifier widely used in the United States to predict credit worthiness. Our FICO data is based on a sample of 301,536 TransUnion TransRisk scores from 2003.
We’re interested in comparing scores, the Continue reading
Performance analysis of cloud applications Ardelean et al., NSDI’18
Today’s choice gives us an insight into how Google measure and analyse the performance of large user-facing services such as Gmail (from which most of the data in the paper is taken). It’s a paper in two halves. The first part of the paper demonstrates, through an analysis of traffic and load patterns, why the only real way to analyse production performance is to use live production systems. The second part of the paper shares two techniques that Google use for doing so: coordinated bursty tracing and vertical context injection.
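The essence of coordinated bursty tracing, as I understand it, is that every machine and layer derives its trace on/off decision from the (synchronised) wall clock, so bursts of tracing line up across the whole system without any explicit coordination messages. A minimal sketch of that idea, with illustrative (not Google’s) period and burst values:

```python
import time

BURST_PERIOD_S = 60   # length of one on/off cycle (illustrative)
BURST_LENGTH_S = 2    # how long tracing stays on within each cycle (illustrative)

def tracing_enabled(now=None):
    """Every machine evaluates this locally; synchronised clocks mean the
    bursts coincide across machines and software layers."""
    now = time.time() if now is None else now
    return (now % BURST_PERIOD_S) < BURST_LENGTH_S
```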
Let’s start out just by considering Gmail requests explicitly generated by users (called ‘user visible requests,’ or UVRs, in the paper). These are requests generated by mail clients due to clicking on messages, sending messages, and background syncing (e.g., IMAP).
You can see a clear diurnal cycle here, with the highest QPS when both North America and Europe are active in the early morning, and lower QPS at weekends. (All charts are rescaled by an undisclosed factor to protect Google confidential information.)

Request response sizes vary by about 20% over time. Two contributing factors are bulk mail senders, Continue reading
Stateless datacenter load-balancing with Beamer Olteanu et al., NSDI’18
We’ve spent the last couple of days looking at datacenter network infrastructure, but we didn’t touch on the topic of load balancing. For a single TCP connection, you want all of the packets to end up at the same destination. Logically, a load balancer (a.k.a. ‘mux’) needs to keep some state somewhere to remember the mapping.
Existing load balancer solutions can load balance TCP and UDP traffic at datacenter scale at different price points. However, they all keep per-flow state; after a load balancer decides which server should handle a connection, that decision is “remembered” locally and used to handle future packets of the same connection. Keeping per-flow state should ensure that ongoing connections do not break when servers and muxes come or go…
There are two issues with keeping this state though. Firstly, it can sometimes end up incomplete or out of date (especially under periods of rapid network change, such as during scale out and scale in). Secondly, there’s only a finite amount of resource to back that state, which opens the door to denial-of-service attacks such as SYN floods.
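To give a flavour of the stateless alternative, here is a toy sketch in the Beamer style (my own simplification, not the paper’s design): the mux keeps no per-flow state at all, only a bucket table pushed by the control plane, and each bucket entry carries both the current and the previous server so the current server can hand off mid-connection packets it doesn’t recognise.

```python
import hashlib

NUM_BUCKETS = 1 << 16

def bucket_of(five_tuple: tuple) -> int:
    """Hash a connection's five-tuple to a stable bucket id."""
    digest = hashlib.sha1(repr(five_tuple).encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_BUCKETS

# Bucket table disseminated by the control plane: bucket -> (current, previous).
bucket_table = {b: ("server-A", "server-A") for b in range(NUM_BUCKETS)}

def forward(five_tuple):
    """Pick the destination for a packet without consulting any per-flow state."""
    current, previous = bucket_table[bucket_of(five_tuple)]
    # Encapsulate towards `current`, carrying `previous` so it can daisy-chain.
    return current, previous
```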
Beamer is Continue reading
Andromeda: performance, isolation, and velocity at scale in cloud network virtualization Dalton et al., NSDI’18
Yesterday we took a look at the Microsoft Azure networking stack; today it’s the turn of the Google Cloud Platform. (It’s a very handy coincidence to have two such experience-and-system-design reports appearing side by side, so that we can compare them.) Andromeda has similar design goals to AccelNet: performance close to hardware, serviceability, and the flexibility and velocity of a software-based architecture. The Google team solve those challenges in a very different way though, being prepared to make use of host cores (which you’ll recall the Azure team wanted to avoid).
We opted for a high-performance software-based architecture instead of a hardware-only solution like SR-IOV because software enables flexible, high-velocity feature deployment… Andromeda consumes a few percent of the CPU and memory on-host. One physical CPU core is reserved for the Andromeda dataplane… In the future, we plan to increase the dataplane CPU reservation to two physical cores on newer hosts with faster physical NICs and more CPU cores in order to improve VM network throughput.
Both the control plane and data plane use a hierarchical structure. The control Continue reading
Azure accelerated networking: SmartNICs in the public cloud Firestone et al., NSDI’18
We’re still on the ‘beyond CPUs’ theme today, with a great paper from Microsoft detailing their use of FPGAs to accelerate networking in Azure. Microsoft have been doing this since 2015, and hence this paper also serves as a wonderful experience report documenting the thought processes that led to an FPGA-based design, and lessons learned transitioning an all-software team to include hardware components.
There’s another reminder here too of the scale at which cloud vendors operate, which makes doing a project like this viable. The bulk purchase of FPGAs keeps their cost low, and the scale of the project makes the development investment worthwhile.
One question we are often asked is if FPGAs are ready to serve as SmartNICs more broadly outside Microsoft… We’ve observed that necessary tooling, basic IP blocks, and general support have dramatically improved over the last few years. But this would still be a daunting task for a new team… The scale of Azure is large enough to justify the massive development efforts — we achieved a level of performance and efficiency simply not possible with CPUs, and programmability far beyond an ASIC, Continue reading
NetChain: Scale-free sub-RTT coordination Jin et al., NSDI’18
NetChain won a best paper award at NSDI 2018 earlier this month. By thinking outside of the box (in this case, the box is the chassis containing the server), Jin et al. have demonstrated how to build a coordination service (think Apache ZooKeeper) with incredibly low latency and high throughput. We’re talking 9.7 microseconds for both reads and writes, with scalability on the order of tens of billions of operations per second. Similarly to KV-Direct, which we looked at last year, NetChain achieves this stunning performance by moving the system implementation into the network. Whereas KV-Direct used programmable NICs, though, NetChain takes advantage of programmable switches, and can be incrementally deployed in existing datacenters.
We expect a lightning fast coordination system like NetChain can open the door for designing a new generation of distributed systems beyond distributed transactions.
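NetChain builds on chain replication, so as a refresher here is a minimal host-side sketch of that protocol (my own illustration, not the P4 switch implementation): writes propagate from the head of the chain to the tail before being acknowledged, and reads are answered by the tail alone.

```python
class ChainNode:
    def __init__(self):
        self.store = {}
        self.next = None  # next node in the chain; None at the tail

    def write(self, key, value):
        self.store[key] = value
        if self.next is not None:
            return self.next.write(key, value)   # propagate towards the tail
        return "ack"                             # the tail acknowledges

    def read(self, key):
        return self.store.get(key)               # reads are served by the tail

# head -> middle -> tail
head, middle, tail = ChainNode(), ChainNode(), ChainNode()
head.next, middle.next = middle, tail

head.write("x", 1)
assert tail.read("x") == 1
```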
It’s really exciting to watch all of the performance leaps being made by moving compute and storage around (accelerators, exploiting pockets of storage close to compute such as processing-in-memory, non-volatile memory, in-network processing, and so on). The sheer processing power we’ll have at our disposal as all of these become mainstream is staggering to Continue reading
SmoothOperator: reducing power fragmentation and improving power utilization in large-scale datacenters Hsu et al., ASPLOS’18
What do you do when your theory of constraints analysis reveals that power has become your major limiting factor? That is, you can’t add more servers to your existing datacenter(s) without blowing your power budget, and you don’t want to build a new datacenter just for that? In this paper, Hsu et al. analyse power utilisation in Facebook datacenters and find that overall power budget utilisation can be comparatively low, even while peak requirements are at capacity. We can’t easily smooth the workload (that’s driven by business and end-user requirements), but maybe we can do something to smooth the power usage.
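The intuition is easiest to see with a toy example (mine, not the paper’s algorithm): services whose power profiles peak at different times can safely share a power node, because the peak of their combined draw is well below the sum of their individual peaks.

```python
import numpy as np

# Hourly power draw in illustrative units: one service peaks mid-day,
# the other peaks off-hours.
frontend = np.array([40, 60, 90, 60, 40, 30])
batch    = np.array([80, 50, 20, 50, 80, 90])

print(frontend.max() + batch.max())   # budgeting for peaks separately: 180
print((frontend + batch).max())       # budget actually needed together: 120
```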
Our experiments based on real production workload and power traces show that we are able to host up to 13% more machines in production, without changing the underlying power infrastructure. Utilizing the unleashed power headroom with dynamic reshaping, we achieve up to an estimated total of 15% and 11% throughput improvement for latency-critical service and batch service respectively at the same time, with up to 44% of energy slack reduction.
There’s a maximum safe amount of power Continue reading
Skyway: connecting managed heaps in distributed big data systems Nguyen et al., ASPLOS’18
Yesterday we saw how to make Java objects persistent using NVM-backed heaps with Espresso. One of the drawbacks of using that as a persistence mechanism is that objects are only stored in the memory of a single node. If only there was some way to create a cluster of JVMs, and efficiently copy objects across remote heaps in the cluster… Meet Skyway!
Skyway is aimed at JVM-based big data systems (think Spark, Flink) that end up spending a lot of their time serializing and deserializing objects to move them around the cluster (e.g., to and from workers – see ‘Making sense of performance in data analytics frameworks’). Java comes with a default serialization mechanism, and there are also many third-party libraries. Kryo is the recommended library for use with Spark.
Consider a small Spark cluster (3 worker nodes each with a 20 GB heap) running a triangle counting algorithm over the LiveJournal graph (about 1.2GB). With both the standard Java serializer and Kryo, serialization and deserialization combined account for a significant portion of the overall execution time (more than 30%).

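For context, this is roughly how the Kryo baseline discussed above is enabled in (Py)Spark; Skyway’s point is that even this faster serializer still costs a large fraction of execution time. (The application name here is illustrative.)

```python
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("triangle-counting")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))
sc = SparkContext(conf=conf)
```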
Where Continue reading
Espresso: brewing Java for more non-volatility with non-volatile memory Wu et al., ASPLOS’18
What happens when you introduce non-volatile memory (NVM) to the world of Java? In theory, with a heap backed by NVM, shouldn’t we get persistence for free? It’s not quite that straightforward of course, but Espresso gets you pretty close. There are a few things to consider along the way.
Espresso adds a new type of heap, a persistent Java heap (PJH) backed by NVM, and a persistent Java object (PJO) programming abstraction which is backwards compatible with JPA. PJO gives a 3.24x speedup even over JPA backed by H2.
JPA is the standard Java Persistence API. Java classes are decorated with persistence annotations describing their mapping to an underlying relational database. It’s an Continue reading
Watching for software inefficiencies with Witch Wen et al., ASPLOS’18
(The link above is to the ACM Digital Library; if you don’t have membership you should still be able to access the paper PDF by following the link from The Morning Paper blog post directly.)
Inefficiencies abound in complex, layered software.
These inefficiencies can arise during design (poor choice of algorithm), implementation, or translation (e.g., compiler optimisations or lack thereof). At the level of the hardware, inefficiencies involving the memory subsystem are some of the most costly…
Repeated initialization, register spill and restore on hot paths, lack of inlining hot functions, missed optimization opportunities due to aliasing, computing and storing already computed or sparingly changing values, and contention and false sharing (in multi-threaded codes), are some of the common prodigal uses of the memory subsystem.
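To make one of these concrete, here is a toy illustration (mine, not Witch’s mechanism) of a dead write – a store that is overwritten before it is ever read. Witch finds such patterns by sampling stores with the PMU and arming hardware debug registers (watchpoints) on the sampled addresses, rather than tracing every access as an exhaustive tool would.

```python
def find_dead_writes(trace):
    """trace: list of ('R'|'W', address) events; returns indices of dead writes."""
    last_write = {}    # address -> index of the most recent not-yet-read write
    dead = []
    for i, (op, addr) in enumerate(trace):
        if op == 'W':
            if addr in last_write:
                dead.append(last_write[addr])   # the previous write was never read
            last_write[addr] = i
        else:                                   # a read "uses" the last write
            last_write.pop(addr, None)
    return dead

print(find_dead_writes([('W', 0x10), ('W', 0x10), ('R', 0x10)]))  # -> [0]
```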
Coarse-grained profilers (e.g., gprof) have comparatively little overhead and can detect hotspots, but fail to distinguish between efficient and inefficient resource usage. Fine-grained profilers (e.g., DeadSpy) can detect inefficiencies, but typically introduce high overheads (10-80x slowdown and 6-100x extra memory). These high overheads prevent such tools from being widely used. Witch is a fine-grained inefficiency detection Continue reading
WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers Lee et al., ASPLOS’18
(The link above is to the ACM Digital Library; if you don’t have membership you should still be able to access the paper PDF by following the link from The Morning Paper blog post directly.)
How do you know how well your large Kubernetes cluster / data centre / warehouse-scale computer (WSC) is performing? Is a particular change worth deploying? Can you quantify the ROI? To do that, you’re going to need some WSC-wide metric of performance. Not so easy! The WSC may be running thousands of distinct jobs all sharing the same underlying resources. Developing a load-testing benchmark workload to accurately model this is ‘practically impossible.’ Therefore, we need a method that lets us evaluate performance in a live production environment. Google’s answer is the Warehouse Scale performance Meter (WSMeter), “a methodology to efficiently and accurately evaluate a WSC’s performance using a live production environment.” At WSC scale, even small improvements can translate into considerable cost reductions. WSMeter’s low-risk, low-cost approach encourages more aggressive evaluation of potential new features.
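To give a flavour of the kind of metric involved (this is my reading of the idea, not WSMeter’s exact definition), imagine aggregating per-job performance into a single WSC-wide number by weighting each job by how much of the machine resources it consumes, so that heavyweight jobs dominate the average:

```python
def wsc_performance(jobs):
    """jobs: list of (per_job_performance, resource_share) pairs."""
    total_share = sum(share for _, share in jobs)
    return sum(perf * share for perf, share in jobs) / total_share

# Three jobs: the biggest resource consumer is also the best performing here.
print(wsc_performance([(1.00, 0.6), (0.80, 0.3), (0.50, 0.1)]))  # -> 0.89
```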
Consider a change Continue reading
The architectural implications of autonomous driving: constraints and acceleration Lin et al., ASPLOS’18
Today’s paper is another example of complementing CPUs with GPUs, FPGAs, and ASICs in order to build a system with the desired performance. In this instance, the challenge is to build a self-driving car!
Architecting autonomous driving systems is particularly challenging for a number of reasons…
There are several defined levels of automation, with level 2 being ‘partial automation’ in which the automated system controls steering and acceleration/deceleration under limited driving conditions. At level 3 the automated system handles all driving tasks under limited conditions (with a human driver taking over outside of that). By level 5 Continue reading
Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly Turakhia et al., ASPLOS’18
With the slow demise of Moore’s law, hardware accelerators are needed to meet the rapidly growing computational requirements of X.
For this paper, X = genomics, and genomic data is certainly growing fast: doubling every 7 months, and on track to surpass YouTube and Twitter by 2025. Rack-sized machines can sequence 50 genomes a day, while portable sequencers require several days per genome. Third-generation sequencing technologies are now available which produce much longer reads of contiguous DNA – on the order of tens of kilobases, compared to only a few hundred bases with the previous generations of technology.
For personalized medicine, long reads are superior in identifying structural variants, i.e. large insertions, deletions and re-arrangements in the genome spanning kilobases or more, which are sometimes associated with diseases; for haplotype phasing, to distinguish mutations on maternal vs paternal chromosomes; and for resolving highly repetitive regions in the genome.
The long read technology comes with a drawback though – high sequencing error rates of between 15% and 40%. The errors are corrected using computational methods ‘that can be orders of magnitude slower than Continue reading
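Much of the heavy lifting in those methods comes down to dynamic-programming sequence alignment, which is what Darwin’s GACT accelerator speeds up. As a reminder of the kind of computation involved, here is a plain Smith-Waterman local alignment scoring sketch (the scoring values are illustrative, and this is not GACT’s tiled scheme):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Score the best local alignment between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACACACTA", "AGCACACA"))  # small toy example
```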
Google workloads for consumer devices: mitigating data movement bottlenecks Boroumand et al., ASPLOS’18
What if your mobile device could be twice as fast on common tasks, greatly improving the user experience, while at the same time significantly extending your battery life? This is the feat that the authors of today’s paper pull off, using a technique known as processing-in-memory (PIM). PIM moves some processing into the memory itself, avoiding the need to transfer data from memory to the CPU for those operations. It turns out that such data movement is a major contributor to the total system energy usage, so eliminating it can lead to big gains.
Our evaluation shows that offloading simple functions from these consumer workloads to PIM logic, consisting of either simple cores or specialized accelerators, reduces system energy consumption by 55.4% and execution time by 54.2%, on average across all of our workloads.
While the performance requirements of consumer devices increase year on year, and devices pack in power-hungry CPUs, GPUs, special-purpose accelerators, sensors, and high-resolution screens to keep pace, lithium-ion battery capacity has only doubled in the last 20 years. Moreover, the thermal power dissipation in consumer Continue reading