Adopting OpenTelemetry for our logging pipeline

Cloudflare’s logging pipeline is one of the largest data pipelines that Cloudflare has, serving millions of log events per second globally, from every server we run. Recently, we undertook a project to migrate the underlying systems of our logging pipeline from syslog-ng to OpenTelemetry Collector. In this post we want to share how we managed to swap out such a significant piece of our infrastructure, why we did it, what went well, what went wrong, and how we plan to improve the pipeline even more going forward.

Background

A full breakdown of our existing infrastructure can be found in our previous post An overview of Cloudflare's logging pipeline, but to quickly summarize here:

  • We run a syslog-ng daemon on every server, reading from the local systemd-journald journal, and a set of named pipes.
  • We forward those logs to a set of centralized “log-x receivers”, in one of our core data centers.
  • We have a dead letter queue destination in another core data center. It receives messages that could not be sent to the primary receivers, and those messages get mirrored across to the primary receivers when possible.
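The forward-then-fall-back flow described in the list above can be pictured with a small Python sketch. This is a toy model under stated assumptions, not Cloudflare's implementation: the `Forwarder` and `FakeReceiver` classes are hypothetical, and the in-memory queue merely stands in for the dead letter queue destination in the second data center.

```python
from collections import deque
from typing import Callable, Deque, List


class FakeReceiver:
    """Hypothetical stand-in for a primary log receiver that can be down."""

    def __init__(self) -> None:
        self.up = True
        self.received: List[str] = []

    def send(self, message: str) -> bool:
        # Returns True on success, False when the receiver is unreachable.
        if self.up:
            self.received.append(message)
        return self.up


class Forwarder:
    """Toy model: send to the primary, park failures in a dead letter queue."""

    def __init__(self, send_primary: Callable[[str], bool]) -> None:
        self.send_primary = send_primary
        # In-memory stand-in for the remote dead letter queue (an assumption).
        self.dead_letters: Deque[str] = deque()

    def forward(self, message: str) -> None:
        # Try the primary first; keep the message for later if that fails.
        if not self.send_primary(message):
            self.dead_letters.append(message)

    def replay(self) -> int:
        """Mirror queued messages to the primary; return how many were delivered."""
        delivered = 0
        while self.dead_letters:
            if not self.send_primary(self.dead_letters[0]):
                break  # primary still unreachable; leave the rest queued
            self.dead_letters.popleft()
            delivered += 1
        return delivered
```

Replaying from the front of the queue keeps messages in their original order, and stopping at the first failure avoids dropping anything while the primary is still unreachable.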

The goal of this project was to replace those syslog-ng instances as Continue reading

Configuring AAA on Arista EOS Devices Using TACACS+ and ISE

In this blog post, let's look at how to configure TACACS+ AAA authentication on Arista EOS devices using Cisco ISE. When someone tries to log in to the device, we want the Arista device to authenticate and authorize the user against Cisco ISE. We'll go through the necessary configurations and steps to set up this integration between Arista EOS and Cisco ISE.

Local Authorization vs ISE Authorization

You could configure this in two ways with a slight difference. With the first method, ISE authenticates the user and tells Arista which role to apply. Arista devices come with two predefined roles, network-admin and network-operator. For example, suppose we have two different groups of users: network engineers who need full access and NOC engineers who only need read-only access. When the users log in, depending on the policy, ISE will send TACACS+ attributes that tell the switch which role to apply. With this method, the authorization happens locally at the switch.
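To make the first method concrete, here is a minimal Python sketch of the decision flow: the policy server picks a role, and the switch then authorizes commands locally based on that role. It is only a model, not EOS or ISE configuration. The group names are hypothetical, and treating "read-only" as "commands that start with show" is a simplifying assumption.

```python
from typing import Dict, Optional

# The two predefined EOS roles named in the text.
PREDEFINED_ROLES = {"network-admin", "network-operator"}

# Hypothetical ISE policy: user group -> role returned to the switch.
GROUP_TO_ROLE: Dict[str, str] = {
    "network-engineers": "network-admin",
    "noc-engineers": "network-operator",
}


def role_for_user(group: str) -> Optional[str]:
    """Policy-server side: choose the role to send back for this group."""
    role = GROUP_TO_ROLE.get(group)
    return role if role in PREDEFINED_ROLES else None


def switch_allows(role: Optional[str], command: str) -> bool:
    """Switch side: authorize a command locally from the assigned role."""
    if role == "network-admin":
        return True  # full access
    if role == "network-operator":
        # Simplification: model read-only access as show commands only.
        return command.strip().startswith("show ")
    return False  # no role assigned, so nothing is authorized
```

The point of the split is that the policy server is consulted once at login, and every later command check happens on the switch itself.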

For the second method, we will not use these two predefined roles. Each command the user enters on the CLI will be authorized by Cisco ISE. For example, we can allow all commands for network engineers and prevent NOC engineers from Continue reading

NetBox in the Cloud, for Free

Yes, you read that right. NetBox Labs is now offering a generous free plan for their SaaS version of NetBox. This change is a big win for many of us who no longer need to worry about managing our own NetBox instances. With this free plan, you can take advantage of all the powerful features of NetBox without the hassle of maintenance and updates.

Why This Matters to Me

As a blogger, I create a lot of labs and practice a lot of automation. I rely on NetBox for IP Address Management (IPAM) and other network-related tasks. Before this, I had my NetBox running as a Docker container on one of my VMs. However, there were times when I wanted to access NetBox and found out the VM was powered off. This free plan is music to my ears. There is a 100-device limit, but that's more than enough for my needs.

How to Get it?

Getting started is as simple as going to their website and signing up for the free plan. I was up and running within a few minutes. The free plan includes up to 100 devices, 500 IP addresses, and 10k API Continue reading

Network observability in Kubernetes clusters for better security and faster troubleshooting

For DevOps and platform teams working with containers and Kubernetes, reducing downtime and improving security posture are crucial. Cloud-native applications require a clear understanding of network topology, service interactions, and workload dependencies. That understanding is essential for securing and optimizing the Kubernetes deployment and minimizing response time in the event of failure.

Network observability can highlight gaps in network policies for applications that require network policy controls to reduce the risk of attack from unsecured egress access or lateral movement of threats within the Kubernetes cluster. However, visualizing workload communication, service dependencies, and active and inactive network security policies presents significant challenges due to the distributed and dynamic nature of Kubernetes workloads.

Why is network observability difficult with Kubernetes workloads?

Kubernetes scales up and scales out pods and creates and destroys services depending on real-time business requirements, resulting in dynamic network connections for each workload instance. Network access policies defined for each workload further impact these connections.
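One way to picture the problem: pods come and go, so a stable view of dependencies has to be derived by rolling observed pod-to-pod flows up to the service level. The Python sketch below illustrates that aggregation step under stated assumptions. The `Flow` record and the `pod_to_service` mapping are hypothetical, and a real observability tool would collect both continuously from the cluster.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Flow:
    """Hypothetical record of one observed connection between two pods."""
    src_pod: str
    dst_pod: str
    dst_port: int


def service_edges(flows: Iterable[Flow], pod_to_service: Dict[str, str]) -> Counter:
    """Aggregate pod-level flows into (src_service, dst_service, port) counts."""
    edges: Counter = Counter()
    for flow in flows:
        # Pods missing from the mapping (for example, already deleted) are
        # grouped under "unknown" rather than silently dropped.
        src = pod_to_service.get(flow.src_pod, "unknown")
        dst = pod_to_service.get(flow.dst_pod, "unknown")
        edges[(src, dst, flow.dst_port)] += 1
    return edges
```

Because the mapping changes as pods are created and destroyed, it has to be refreshed alongside the flow data, which is part of what makes an up-to-date picture hard to maintain.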

In such a scenario, capturing an accurate and up-to-date representation of network traffic, service dependencies, and network policies is difficult. The default Kubernetes implementation provides limited network traffic visibility and policy information, making it challenging for teams to troubleshoot connectivity issues, improve Continue reading

Extending local traffic management load balancing to Layer 4 with Spectrum

In 2023, Cloudflare introduced a new load balancing solution, supporting Local Traffic Management (LTM). This gives organizations a way to balance HTTP(S) traffic between private or internal servers within a region-specific data center. Today, we are thrilled to be able to extend those same LTM capabilities to non-HTTP(S) traffic. This new feature is enabled by the integration of Cloudflare Spectrum, Cloudflare Tunnels, and Cloudflare load balancers and is available to enterprise customers. Our customers can now use Cloudflare load balancers for all TCP and UDP traffic destined for private IP addresses, eliminating the need for expensive on-premise load balancers.
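As a rough illustration of what balancing at layer 4 means, the Python sketch below picks a backend using only connection metadata (protocol, addresses, and ports) and never inspects the payload. It shows a generic hashing technique, not Cloudflare's algorithm. The `FiveTuple` type, the `pick_backend` function, and the private addresses are all hypothetical.

```python
import hashlib
from typing import NamedTuple, Sequence


class FiveTuple(NamedTuple):
    """Connection metadata visible at layer 4 (no HTTP details)."""
    protocol: str  # "tcp" or "udp"
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def pick_backend(conn: FiveTuple, backends: Sequence[str]) -> str:
    """Choose a backend by hashing the 5-tuple, so a given flow stays put."""
    if not backends:
        raise ValueError("no backends available")
    key = "|".join(str(field) for field in conn).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


# Hypothetical pool of private origin addresses.
POOL = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
```

Hashing the 5-tuple sends every packet of a flow to the same backend without keeping per-connection state, although the mapping shifts whenever the pool size changes.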

A quick primer

In this blog post, we will be referring to load balancers at either layer 4 or layer 7. These are, of course, layers of the OSI model, but more specifically they describe the ingress path that is used to reach the load balancer. Layer 7, also known as the Application Layer, is where the HTTP(S) protocol exists. Cloudflare is well known for our layer 7 capabilities, which are built around speeding up and protecting websites that run over HTTP(S). When we refer to layer 7 load balancers, we are referring to HTTP(S)-based services. Our layer Continue reading

Technology Short Take 178

Welcome to Technology Short Take #178! This one is notably shorter than many of the Technology Short Takes I publish; I’m still trying to fine-tune my collection of RSS feeds (such a useful technology that seems to have fallen out of favor), removing inactive feeds and looking for new feeds to replace them. Regardless, I have managed to collect a few links for your reading pleasure this weekend. Enjoy!

Networking

Security

  • Matt Moore, CTO of Chainguard, goes into some detail on how Chainguard intends to honor the principles behind the CISA’s Secure by Design pledge.
  • Ars Technica examines TunnelVision, a vulnerability that has existed since 2002 and has the potential to render VPN apps useless. From my reading of the article, the greatest concern lies with untrusted networks where an attacker could manipulate things in their favor. Join that Wi-Fi network at the coffee shop at your own risk!
  • Here’s a slightly older post (March 2023) on using AppArmor to restrict app permissions, with a particular focus on containers (including Kubernetes). It’s a bit basic, but it does (in my opinion) provide some useful information.
  • Nick Frichette shares some Continue reading

Worth Reading: ChatGPT Does Not Summarize

I mostly gave up on LLMs being any help (apart from generating copious amounts of bullshit), but I still thought that generating summaries might be an interesting use case. I was wrong.

As Gerben Wierda explains in his recent “When ChatGPT summarises, it actually does nothing of the kind” blog post, you have to understand a text if you want to generate a useful summary, and that’s not what LLMs do. They can generate a shorter version of the text, which might not focus on the significant bits.

Endpoint Selectors and Kubernetes Namespaces in CiliumNetworkPolicies

While performing some testing with CiliumNetworkPolicies, I came across a behavior that was unintuitive and unexpected to me. The behavior centers around how an endpoint selector behaves in a CiliumNetworkPolicy when Kubernetes namespaces are involved. (If you didn’t understand a bit of what I just said, I’ll provide some additional explanation shortly; stay with me!) After chatting through the behavior with a few folks, I realized the behavior is essentially “correct” and expected. However, if I was confused by the behavior, then there’s a good chance others might be confused as well, so I thought a quick blog post might be a good idea. Keep reading to get more details on the interaction between endpoint selectors and Kubernetes namespaces in CiliumNetworkPolicies.

Before digging into the behavior, let me first provide some definitions or explanations of the various things involved here:

  • Kubernetes namespaces are a way to logically isolate groups of resources in a cluster. For example, you might install the software that drives your point-of-sale (PoS) devices in the “retail-pos” namespace while the application that handles inventory is in the “inventory” namespace. You can read more about namespaces in the Kubernetes documentation.
  • CiliumNetworkPolicies are Cilium-specific network policies Continue reading

Disrupting FlyingYeti’s campaign targeting Ukraine

Cloudforce One is publishing the results of our investigation and real-time effort to detect, deny, degrade, disrupt, and delay threat activity by the Russia-aligned threat actor FlyingYeti during their latest phishing campaign targeting Ukraine. At the onset of Russia’s invasion of Ukraine on February 24, 2022, Ukraine introduced a moratorium on evictions and termination of utility services for unpaid debt. The moratorium ended in January 2024, resulting in significant debt liability and increased financial stress for Ukrainian citizens. The FlyingYeti campaign capitalized on anxiety over the potential loss of access to housing and utilities by enticing targets to open malicious files via debt-themed lures. If opened, the files would result in infection with the PowerShell malware known as COOKBOX, allowing FlyingYeti to support follow-on objectives, such as installation of additional payloads and control over the victim’s system.

Since April 26, 2024, Cloudforce One has taken measures to prevent FlyingYeti from launching their phishing campaign – a campaign involving the use of Cloudflare Workers and GitHub, as well as exploitation of the WinRAR vulnerability CVE-2023-38831. Our countermeasures included internal actions, such as detections and code takedowns, as well as external collaboration with third parties to remove the actor’s cloud-hosted malware. Continue reading