Archive

Category Archives for "The Morning Paper"

Performance analysis of cloud applications

Performance analysis of cloud applications Ardelean et al., NSDI’18

Today’s choice gives us an insight into how Google measure and analyse the performance of large user-facing services such as Gmail (from which most of the data in the paper is taken). It’s a paper in two halves. The first part of the paper demonstrates through an analysis of traffic and load patterns why the only real way to analyse production performance is using live production systems. The second part of the paper shares two techniques that Google use for doing so: coordinated bursty tracing and vertical context injection.

(Un)predictable load

Let’s start out just by consider Gmail requests explicitly generated by users (called ‘user visible requests,’ or UVRs, in the paper). These are requests generated by mail clients due to clicking on messages, sending messages, and background syncing (e.g., IMAP).

You can see a clear diurnal cycle here, with the highest QPS when both North America and Europe are active in the early morning, and lower QPS at weekends. (All charts are rescaled using some unknown factor, to protect Google information).

Request response sizes vary by about 20% over time. Two contributing factors are bulk mail senders, Continue reading

Stateless datacenter load-balancing with Beamer

Stateless datacenter load-balancing with Beamer Olteanu et al., NSDI’18

We’ve spent the last couple of days looking at datacenter network infrastructure, but we didn’t touch on the topic of load balancing. For a single TCP connection, you want all of the packets to end up at the same destination. Logically, a load balancer (a.k.a. ‘mux’) needs to keep some state somewhere to remember the mapping.

Existing load balancer solutions can load balance TCP and UDP traffic at datacenter scale at different price points. However, they all keep per-flow state; after a load balancer decides which server should handle a connection, that decision is “remembered” locally and used to handle future packets of the same connection. Keeping per-flow state should ensure that ongoing connections do not break when servers and muxes come or go…

There are two issues with keeping this state though. Firstly , it can sometimes end up incomplete or out of date (especially under periods of rapid network change, such as during scale out and scale in). Secondly, there’s only a finite amount of resource to back that state, which opens the door to denial of service attacks such as SYN flood attacks.

Beamer is Continue reading

Andromeda: performance, isolation, and velocity at scale in cloud network virtualization

Andromeda: performance, isolation, and velocity at scale in cloud network virtualization Dalton et al., NSDI’18

Yesterday we took a look at the Microsoft Azure networking stack, today it’s the turn of the Google Cloud Platform. (It’s a very handy coincidence to have two such experience and system design report papers appearing side by side so that we can compare). Andromeda has similar design goals to AccelNet: performance close to hardware, serviceability, and the flexibility and velocity of a software-based architecture. The Google team solve those challenges in a very different way though, being prepared to make use of host cores (which you’ll recall the Azure team wanted to avoid).

We opted for a high-performance software-based architecture instead of a hardware-only solution like SR-IOV because software enables flexible, high-velocity feature deployment… Andromeda consumes a few percent of the CPU and memory on-host. One physical CPU core is reserved for the Andromeda dataplane… In the future, we plan to increase the dataplane CPU reservation to two physical cores on newer hosts with faster physical NICs and more CPU cores in order to improve VM network throughput.

High-level design

Both the control plane and data plane use a hierarchical structure. The control Continue reading

Azure accelerated networking: SmartNICs in the public cloud

Azure accelerated networking: SmartNICs in the public cloud Firestone et al., NSDI’18

We’re still on the ‘beyond CPUs’ theme today, with a great paper from Microsoft detailing their use of FPGAs to accelerate networking in Azure. Microsoft have been doing this since 2015, and hence this paper also serves as a wonderful experience report documenting the thought processes that led to an FPGA-based design, and lessons learned transitioning an all-software team to include hardware components.

There’s another reminder here too of the scale at which cloud vendors operate, which makes doing a project like this viable. The bulk purchase of FPGAs keeps their cost low, and the scale of the project makes the development investment worthwhile.

One question we are often asked is if FPGAs are ready to serve as SmartNICs more broadly outside Microsoft… We’ve observed that necessary tooling, basic IP blocks, and general support have dramatically improved over the last few years. But this would still be a daunting task for a new team… The scale of Azure is large enough to justify the massive development efforts — we achieved a level of performance and efficiency simply not possible with CPUs, and programmability far beyond an ASIC, Continue reading

NetChain: Scale-free sub-RTT coordination

NetChain: Scale-free sub-RTT coordination Jin et al., NSDI’18

NetChain won a best paper award at NSDI 2018 earlier this month. By thinking outside of the box (in this case, the box is the chassis containing the server), Jin et al. have demonstrated how to build a coordination service (think Apache ZooKeeper) with incredibly low latency and high throughput. We’re talking 9.7 microseconds for both reads and writes, with scalability on the order of tens of billions of operations per second. Similarly to KV-Direct that we looked at last year, NetChain achieves this stunning performance by moving the system implementation into the network. Whereas KV-Direct used programmable NICs though, NetChain takes advantage of programmable switches, and can be incrementally deployed in existing datacenters.

We expect a lightning fast coordination system like NetChain can open the door for designing a new generation of distributed systems beyond distributed transactions.

It’s really exciting to watch all of the performance leaps being made by moving compute and storage around (accelerators, taking advantage of storage pockets e.g. processing-in-memory, non-volatile memory, in-network processing, and so on). The sheer processing power we’ll have at our disposal as all of these become mainstream is staggering to Continue reading

SmoothOperator: reducing power fragmentation and improving power utilization in large-scale datacenters

SmoothOperator: reducing power fragmentation and improving power utilization in large-scale datacenters Hsu et al., ASPLOS’18

What do you do when your theory of constraints analysis reveals that power has become your major limiting factor? That is, you can’t add more servers to your existing datacenter(s) without blowing your power budget, and you don’t want to build a new datacenter just for that? In this paper, Hsu et al. analyse power utilisation in Facebook datacenters and find that overall power budget utilisation can be comparatively low, even while peak requirements are at capacity. We can’t easily smooth the workload (that’s driven by business and end-user requirements), but maybe we can do something to smooth the power usage.

Our experiments based on real production workload and power traces show that we are able to host up to 13% more machines in production, without changing the underlying power infrastructure. Utilizing the unleashed power headroom with dynamic reshaping, we achieve up to an estimated total of 15% and 11% throughput improvement for latency-critical service and batch service respectively at the same time, with up to 44% of energy slack reduction.

No more headroom and low utilisation…

There’s a maximum safe amount of power Continue reading

Skyway: connecting managed heaps in distributed big data systems

Skyway: connecting managed heaps in distributed big data systems Nguyen et al., ASPLOS’18

Yesterday we saw how to make Java objects persistent using NVM-backed heaps with Espresso. One of the drawbacks of using that as a persistence mechanism is that they’re only stored in the memory of a single node. If only there was some way to create a cluster of JVMs, and efficiently copy objects across remote heaps in the cluster… Meet Skyway!

Skyway is aimed at JVM-based big data systems (think Spark, Flink) that end up spending a lot of their time serializing and deserializing objects to move them around the cluster (e.g., to and from workers – see ‘Making sense of performance in data analytics frameworks’). Java comes with a default serialization mechanism, and there are also many third party libraries. Kyro is the recommended library for use with Spark.

Consider a small Spark cluster (3 worker nodes each with a 20 GB heap) running a triangle counting algorithm over the LiveJournal graph (about 1.2GB). With both the standard Java serializers and Kyro, serialization and deserialization combined account for a significant portion of the overall execution time (more than 30%).

Where Continue reading

Espresso: brewing Java for more non-volatility with non-volatile memory

Espresso: brewing Java for more non-volatility with non-volatile memory Wu et al., ASPLOS’18

What happens when you introduce non-volatile memory (NVM) to the world of Java? In theory, with a heap backed by NVM, we should get persistence for free? It’s not quite that straightforward of course, but Espresso gets you pretty close. There are a few things to consider, for example:

  • we probably don’t want all of our objects backed by persistent memory, as it still has higher latency than DRAM
  • we don’t want to make intrusive changes to existing code, and ideally would be able to continue using JPA (but why go through an expensive ORM mapping if we’re not targeting a relational store?)
  • we need to ensure any persistent data structures remain consistent after a crash

Espresso adds a new type of heap, a persistent Java heap (PJH) backed by NVM, and a persistent Java object (PJO) programming abstraction which is backwards compatible with JPA. PJO gives a 3.24x speedup even over JPA backed by H2.

JPA, PCJ, and NVM

JPA is the standard Java Persistence API. Java classes are decorated with persistence annotations describing their mapping to an underlying relational database. It’s an Continue reading

Watching for software inefficiencies with Witch

Watching for software inefficiencies with Witch Wen et al., ASPLOS’18

(The link above is to the ACM Digital Library, if you don’t have membership you should still be able to access the paper pdf by following the link from The Morning Paper blog post directly.)

Inefficiencies abound in complex, layered software.

These inefficiencies can arise during design (poor choice of algorithm), implementation, or translation (e.g., compiler optimisations or lack thereof). At the level of the hardware, inefficiencies involving the memory subsystem are some of the most costly…

Repeated initialization, register spill and restore on hot paths, lack of inlining hot functions, missed optimization opportunities due to aliasing, computing and storing already computed or sparingly changing values, and contention and false sharing (in multi-threaded codes), are some of the common prodigal uses of the memory subsystem.

Coarse grained profilers (e.g., gprof) have comparatively little overhead and can detect hotspots, but fail to distinguish between efficient and inefficient resource usage. Fine-grained profilers (e.g. DeadSpy) can detect inefficiencies, but typically introduce high overheads (10-80x slowdown and 6-100x extra memory). These high overheads prevent such tools from being widely used. Witch is a fine-grained inefficiency detection Continue reading

WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers

WSMeter: A performance evaluation methodology for Google’s production warehouse-scale computers Lee et al., ASPLOS’18

(The link above is to the ACM Digital Library, if you don’t have membership you should still be able to access the paper pdf by following the link from The Morning Paper blog post directly.)

How do you know how well your large kubernetes cluster / data centre / warehouse-scale computer (WSC) is performing? Is a particular change worth deploying? Can you quantify the ROI? To do that, you’re going to need some WSC-wide metric of performance. Not so easy! The WSC may be running thousands of distinct jobs all sharing the same underlying resources. Developing a load-testing benchmark workload to accurately model this is ‘practically impossible.’ Therefore, we need a method that lets us evaluate performance in a live production environment. Google’s answer is the Warehouse Scale performance Meter (WSMeter), “a methodology to efficiently and accurately evaluate a WSC’s performance using a live production environment.” At WSC scale, even small improvements can translate into considerable cost reductions. WSMeter’s low-risk, low-cost approach encourages more aggressive evaluation of potential new features.

Challenges in defining a WSC performance metric

Consider a change Continue reading

The architectural implications of autonomous driving: constraints and acceleration

The architectural implications of autonomous driving: constraints and acceleration Lin et al., ASPLOS’18

Today’s paper is another example of complementing CPUs with GPUs, FPGAs, and ASICs in order to build a system with the desired performance. In this instance, the challenge is to build an autonomous self-driving car!

Architecting autonomous driving systems is particularly challenging for a number of reasons…

  1. The system has to make “correct” operational decisions at all times to avoid accidents, and advanced machine learning, computer vision, and robotic processing algorithms are used to deliver the required high precision. These algorithms are compute intensive.
  2. The system must be able to react to traffic conditions in real-time, which means processing must always finish under strict deadlines (about 100ms in this work).
  3. The system must operate within a power budget to avoid negatively impacting driving range and fuel efficiency.

So how do you build a self-driving car?

There are several defined levels of automation, with level 2 being ‘partial automation’ in which the automated system controls steering and acceleration/deceleration under limited driving conditions. At level 3 the automated system handles all driving tasks under limited conditions (with a human driver taking over outside of that). By level 5 Continue reading

Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly

Darwin: a genomics co-processor provides up to 15,000x acceleration on long read assembly Turakhia et al., ASPLOS’18

With the slow demise of Moore’s law, hardware accelerators are needed to meet the rapidly growing computational requirements of X.

For this paper, X = genomics, and genomic data is certainly growing fast: doubling every 7 months and on track to surpass YouTube and Twitter by 2025. Rack-size machines can sequence 50 genomes a day, portable sequencers require several days per genome. Third-generation sequencing technologies are now available which produce much longer reads of contiguous DNA – on the order of tens of kilobases compared to only a few hundred bases with the previous generations of technology.

For personalized medicine, long reads are superior in identifying structural variants i.e. large insertions, deletions and re-arrangements in the genome spanning kilobases or more which are sometimes associated with diseases; for haplotype phasing, to distinguish mutations on maternal vs paternal chromosomes; and for resolving highly repetitive regions in the genome.

The long read technology comes with a drawback though – high error rates in sequencing of between 15%-40%. The errors are corrected using computational methods ‘that can be orders of magnitude slower than Continue reading

Google workloads for consumer devices: mitigating data movement bottlenecks

Google workloads for consumer devices: mitigating data movement bottlenecks Boroumand et al., ASPLOS’18

What if your mobile device could be twice as fast on common tasks, greatly improving the user experience, while at the same time significantly extending your battery life? This is the feat that the authors of today’s paper pull-off, using a technique known as processing-in-memory (PIM). PIM moves some processing into the memory itself, avoiding the need to transfer data from memory to the CPU for those operations. It turns out that such data movement is a major contributor to the total system energy usage, so eliminating it can lead to big gains.

Our evaluation shows that offloading simple functions from these consumer workloads to PIM logic, consisting of either simple cores or specialized accelerators, reduces system energy consumption by 55.4% and execution time by 54.2%, on average across all of our workloads.

Energy as a limiting factor

While the performance requirements of consumer devices increase year on year, and devices pack in power-hungry CPUs, GPUs, special-purpose accelerators, sensors and high-resolution screens to keep pace, lithium ion battery capacity has only doubled in the last 20 years. Moreover, the thermal power dissipation in consumer Continue reading

1 14 15 16