
This is an adapted transcript of a talk we gave at Monitorama 2022. You can find the slides with presenter’s notes here and the video here.

When a request at Cloudflare throws an error, information gets logged in our requests_error pipeline. The error logs are used to help troubleshoot customer-specific or network-wide issues.
We, the Site Reliability Engineers (SREs) who manage the logging platform, have been running Elasticsearch clusters for many years, and during that time the log volume has increased drastically. With this growth, we started facing issues such as slow query performance and high resource consumption. We aimed to improve the log consumers' experience by speeding up queries and providing cost-effective ways to store logs. This blog post discusses the challenges of running logging pipelines and how we designed the new architecture to make it faster and more cost-efficient.
Before we dive into the challenges of maintaining the logging pipelines, let us look at the characteristics of logs.

Unpredictable - In today's world, where there are tons of microservices, the amount of logs a centralized logging system will receive is very unpredictable. There are various reasons why capacity estimation of log volume is so difficult.
After completing the discussion of basic Kubernetes networking with a typical inter-pod traffic scenario, Stuart Charlton tackled another confusing topic: an overview of what Kubernetes services are.
I have always found TCP congestion control algorithms fascinating, and at the same time, I know very little about them. So once in a while, I spend some time with them in the hope of gaining new insights. This blog post shares some of my experiments with various TCP congestion control algorithms. We will start with TCP Reno, then look at Cubic, and end with BBR. I am using Linux network namespaces to emulate the topology for experimentation, which makes it easier to run than setting up a physical test bed.
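One way to compare algorithms in experiments like these is to pick the congestion control algorithm per socket. Below is a minimal sketch using Linux's TCP_CONGESTION socket option via Python; it assumes a Linux host with the relevant modules (reno, cubic, bbr) available, and the endpoint address is a made-up placeholder for whatever sits inside the emulated topology.

```python
import socket

def connect_with_cc(host: str, port: int, algorithm: str) -> socket.socket:
    """Open a TCP connection that uses the given congestion control algorithm."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_CONGESTION takes the algorithm name as bytes, e.g. b"reno", b"cubic", b"bbr".
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algorithm.encode())
    s.connect((host, port))
    return s

if __name__ == "__main__":
    # Hypothetical iperf-style endpoint inside the emulated topology; adjust to your setup.
    conn = connect_with_cc("10.0.0.2", 5001, "bbr")
    # Read back the algorithm actually in use for this socket.
    print(conn.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16))
    conn.close()
```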
For many years, the main congestion control algorithm was TCP Reno. The goal of congestion control is to determine how much capacity is available in the network, so that the source knows how many packets it can safely have in transit (in flight). Once a source has these packets in transit, it uses the arrival of ACKs as a signal that packets are leaving the network, and therefore that it is safe to send more packets into the network. By using ACKs to pace the transmission of packets, TCP is self-clocking. The number of packets that TCP can inject into the network is controlled by the congestion window (cwnd).
Congestion Window: the limit on the number of unacknowledged packets a sender may have in flight at any given time.
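Reno adjusts this window with additive increase, multiplicative decrease (AIMD): roughly one extra segment per RTT while things go well, and a halving of the window on packet loss. The toy simulation below is only an illustrative sketch of that behaviour, not the experiment from the post; the RTT count and loss points are made up.

```python
def simulate_reno(rtts: int, loss_rtts: set, initial_cwnd: float = 1.0) -> list:
    """Toy Reno-style AIMD: +1 segment per RTT, halve cwnd on loss.

    Real TCP Reno also has slow start, fast retransmit/recovery, and timeouts,
    all of which are ignored here for clarity.
    """
    cwnd = initial_cwnd
    history = []
    for rtt in range(rtts):
        if rtt in loss_rtts:
            cwnd = max(cwnd / 2.0, 1.0)  # multiplicative decrease on packet loss
        else:
            cwnd += 1.0                   # additive increase in congestion avoidance
        history.append(cwnd)
    return history

if __name__ == "__main__":
    # Assume losses at RTTs 20 and 35 to show the characteristic sawtooth pattern.
    print(simulate_reno(50, {20, 35}))
```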


The software world is known for overdue projects, cost overruns, lots of defects, and plenty of failure all the way around. Many other engineering fields have stricter requirements for taking on projects, along with liability insurance that drives correct practice and care. The networking world, however, along with the larger IT world, has neither of these things. Does this make IT folks less likely to “do the right thing,” or is the self-regulation we have today enough? Join Tom Ammon, Eyvonne Sharp, and Russ White as they discuss the possibilities of professional liability in information technology.