TNO023: Networking’s Third Phase – The Network Operator Experience

Guest Chris Grundemann believes that NetOps is in the third phase of networking: improving the network operator experience, not just making the network functional or improving the end user experience. In this episode, Chris tells his origin story at a wireless service provider and his growth into a founder of multiple companies. He also shares his community-focused work... Read more »

Cloudflare’s commitment to CISA Secure-By-Design pledge: delivering new kernels, faster

As cyber threats continue to exploit systemic vulnerabilities in widely used technologies, the United States Cybersecurity and Infrastructure Security Agency (CISA) produced best practices for the technology industry with their Secure-by-Design pledge. Cloudflare proudly signed this pledge on May 8, 2024, reinforcing our commitment to creating resilient systems where security is not just a feature, but a foundational principle.

We’re excited to share and provide transparency into how our security patching process meets one of CISA’s goals in the pledge: Demonstrating actions taken to increase installation of security patches for our customers.

Balancing security patching and customer experience 

Managing and deploying Linux kernel updates is one of Cloudflare’s most challenging security processes. In 2024, over 1000 CVEs were logged against the Linux kernel and patched. To keep our systems secure, it is vital to perform critical patch deployment across systems while maintaining the user experience. 

A common technical support phrase is “Have you tried turning it off and then on again?” One may be surprised how often this tactic is used; it is also an essential part of how Cloudflare operates at scale when it comes to applying our most critical patches. Frequently restarting systems exercises the Continue reading

Transitioning into Networking, 2025 Edition

Elmer sent me the following question:

I’ve been working in systems engineering (Linux, virtualization, infrastructure ops) and am considering shifting toward network engineering or architecture. I got my CCNA years ago and started CCNP but didn’t continue.

I’d really appreciate any thoughts you might have on how someone with my background could best make that transition today, especially with how things are evolving around automation and the cloud.

I keep answering a variant of this question every other year or so (2019, 2021, 2023, 2024). I guess it’s time for another answer, so here we go.

KubeCon Europe: Kgateway Aims To Be the Kubernetes Onramp

Kubernetes network administrators at KubeCon + CloudNativeCon EU this week in London should drop by the ease the management of moving traffic to and from clusters. Built on top of Kubernetes Gateway API, the open source Solo.io, and went under the name Gloo Gateway. At last year’s KubeCon + CloudNativeCon North America 2024, the company announced that it would be donating the software to the Cloud Native Computing Foundation (CNCF), changing the software’s name to kgateway in the process. In March, CNCFGloo open source repository will be deprecated over time. The Importance of the Kubernetes Gateway API In 2023, the

N4N020: To Cert Or Not To Cert?

To cert or not to cert? That is the question Holly & Ethan discuss on today’s episode. Will a certification really land you a networking job? Are certs the guaranteed path to tech career success? We consider this, talking through the benefits, challenges and even risks of networking industry certification. And there’s some bonus material,... Read more »

Improve your media pipelines with the Images binding for Cloudflare Workers

When building a full-stack application, many developers spend a surprising amount of time trying to make sure that the various services they use can communicate and interact with each other. Media-rich applications require image and video pipelines that can integrate seamlessly with the rest of your technology stack.

With this in mind, we’re excited to introduce the Images binding, a way to connect the Images API directly to your Worker and enable new, programmatic workflows. The binding removes unnecessary friction from application development by allowing you to transform, overlay, and encode images within the Cloudflare Developer Platform ecosystem.

In this post, we’ll explain how the Images binding works, as well as the decisions behind local development support. We’ll also walk through an example app that watermarks and encodes a user-uploaded image, then uploads the output directly to an R2 bucket.

The challenges of fetch()

Cloudflare Images was designed to help developers build scalable, cost-effective, and reliable image pipelines. You can deliver multiple copies of an image — each resized, manipulated, and encoded based on your needs. Only the original image needs to be stored; different versions are generated dynamically, or as requested by a user’s browser, then subsequently served Continue reading

ARP Challenges in EVPN/VXLAN Symmetric IRB

Whenever I claimed that EVPN is The SIP of Networking, vendor engineers quickly told me that “EVPN interoperability is a solved problem” and that they run regular multi-vendor interoperability labs to iron out the quirks. As it turns out, things aren’t as rosy in real life; it’s still helpful to have an EVPN equivalent of the DTMF tone generators handy.

I encountered a particularly nasty quirk when running the netlab EVPN integration test using symmetric IRB with an anycast gateway between Nokia SR Linux (or Juniper vSwitch) and an FRR container.

Lab topology

Calico Whisker, Your New Ally in Network Observability

With the upcoming release of Calico v3.30 on the horizon, we are excited to introduce Calico Whisker, a simple yet powerful User Interface (UI) designed to enhance network observability and policy debugging. If you’ve ever struggled to make sense of network flow logs or troubleshoot policies in a complex Kubernetes cluster, Whisker is your friend!

Whisker is a three-part deployment that holds a UI, a backend, and a gRPC channel to communicate with the Felix brain of Calico to gather live flow information and present it in a human-readable, easy-to-understand way. But before we get started, let’s dive into why Whisker is a must-have for your Kubernetes environment, what problems it solves, and how it can streamline your policy management.

Navigating Network Flows is Difficult

In Kubernetes environments, network flows are the backbone of communication between workloads. As clusters scale, so does the complexity of managing these flows and their security. Without clear visibility and effective observability tools, teams often struggle with:

  • Diagnosing unexplained workload behavior and determining why certain applications aren’t working as expected.
  • Identifying the real reason why certain workload communications are permitted or denied, which stems from understanding which policies are affecting specific Continue reading

A steam locomotive from 1993 broke my yarn test

So the story begins with a pair programming session I had with my colleague, which I desperately needed because my node skill tree is still at level 1, and I needed to get started with React because I'll be working on our internal backstage instance.

We worked together on a small feature, tested it locally, and it worked. Great. Now it's time to make My Very First React Commit. So I ran the usual git add and git commit, which hooked into yarn test, to automatically run unit tests for backstage, and that's when everything got derailed. For all the React tutorials I have followed, I have never actually run a yarn test on my machine. And the first time I tried yarn test, it hung, and after a long time, the command eventually failed:

Determining test suites to run...

  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
🌈  backstage  ⚡

I could tell it was obviously unhappy about something, and then it threw some [Error]. I have very little actual JavaScript experience, but this looks suspiciously like someone had neglected to Continue reading

Comparing IP and CLNP: Local (Node) Multihoming

Another area where CLNP is a clear winner when compared to the TCP/IP stack is multi-homed nodes (nodes with multiple interfaces, not site multi-homing, where whole networks are connected to two upstream providers).

Multi-homed TCP/IP nodes must have multiple IP addresses because IP addresses are assigned to interfaces, not to nodes. There is no well-defined procedure in TCP/IP for how a multi-homed node should behave. In the early days of TCP/IP, they tried to address that in RFC 1122 (Host Requirements RFC), but even then, there were two ideas about dealing with multiple interfaces: the strong and weak end system models (more details).
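To make the difference between the two models concrete, here is a minimal Python sketch of the receive-side check only. The Interface class, the accepts() function, and the addresses (taken from documentation ranges) are hypothetical names invented for illustration; real stacks implement this inside the kernel and also apply the models to outgoing traffic.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Interface:
    """Hypothetical host interface: a name plus the IP address assigned to it."""
    name: str
    address: IPv4Address


def accepts(interfaces, arrival_ifname, destination, model):
    """Return True if the host accepts a datagram for `destination`
    that arrived on the interface named `arrival_ifname`."""
    dst = IPv4Address(destination)
    if model == "strong":
        # Strong model: the destination must be the address of the
        # interface the datagram arrived on.
        return any(
            i.name == arrival_ifname and i.address == dst for i in interfaces
        )
    if model == "weak":
        # Weak model: the destination may be any address owned by the
        # host, regardless of the arrival interface.
        return any(i.address == dst for i in interfaces)
    raise ValueError(f"unknown model: {model!r}")


# A multi-homed host with two interfaces (illustrative addresses).
host = [
    Interface("eth0", IPv4Address("192.0.2.1")),
    Interface("eth1", IPv4Address("198.51.100.1")),
]

# A datagram for eth1's address arrives on eth0.
weak_result = accepts(host, "eth0", "198.51.100.1", "weak")
strong_result = accepts(host, "eth0", "198.51.100.1", "strong")
```

In this sketch the same datagram is accepted under the weak model and dropped under the strong one, which is why two hosts following different models can behave differently on the same network.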

The Team That Will Implement Intel’s New Vision

Intel is hosting its Vision 2025 annual event in Las Vegas this week, what we old hands used to call Intel Developer Forum back in the days when the chip maker was taking over more and more of the datacenter and had given the world a relatively inexpensive and uniform substrate on which to build hyperscale infrastructure.

The Team That Will Implement Intel’s New Vision was written by Timothy Prickett Morgan at The Next Platform.

HS099: From CLI to CFO: Translating Complex Network Data into Clear Strategic and Financial Insights (Sponsored)

IT and network leaders need more than uptime—they need to know what their networks cost, what they deliver, and how future changes will impact the business. That’s where Netos comes in. CEO and founder Richard Foster joins Johna and John in a lively discussion to explore how Netos turns complex operational data into clear financial... Read more »

Comparing AI / ML activity from two production networks

AI Metrics describes how to deploy the open source ai-metrics application. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The screen capture from the article (above) shows results from a simulated 48,000 GPU cluster.
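For readers unfamiliar with the collective operations named above, the following toy model shows what each one computes, using plain Python lists to stand in for per-GPU buffers and addition as the reduction. It is not NCCL's API; the function names are hypothetical and the sketch ignores how the data actually moves across the network.

```python
def _elementwise_sum(buffers):
    """Sum the per-rank buffers position by position."""
    return [sum(column) for column in zip(*buffers)]


def all_reduce(buffers):
    """Every rank ends up with the element-wise sum of all buffers."""
    total = _elementwise_sum(buffers)
    return [list(total) for _ in buffers]


def broadcast(buffers, root=0):
    """Every rank ends up with a copy of the root rank's buffer."""
    return [list(buffers[root]) for _ in buffers]


def reduce_to_root(buffers, root=0):
    """Only the root rank receives the sum; other ranks keep their data."""
    total = _elementwise_sum(buffers)
    return [total if rank == root else list(buf)
            for rank, buf in enumerate(buffers)]


def all_gather(buffers):
    """Every rank ends up with the concatenation of all buffers."""
    gathered = [value for buf in buffers for value in buf]
    return [list(gathered) for _ in buffers]


def reduce_scatter(buffers):
    """The sum is split into equal chunks, one chunk per rank.
    Assumes the buffer length is divisible by the number of ranks."""
    total = _elementwise_sum(buffers)
    ranks = len(buffers)
    chunk = len(total) // ranks
    return [total[r * chunk:(r + 1) * chunk] for r in range(ranks)]


# Two ranks, each holding a two-element buffer.
ranks = [[1, 2], [3, 4]]
```

Every operation except Broadcast and Reduce leaves each rank holding data derived from all other ranks, which is why these workloads generate heavy, synchronized traffic between GPUs.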

This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.

Cluster 1

This cluster consists of 250 GPUs connected via 100G ports to a single large switch. The results are pretty consistent with the simulation from the original article. In this case there is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source/destination address, operation, ingress / egress port, QP pair, etc. can be extracted from the sFlow Dropped Packet Notifications that are populating Continue reading