Archive

Category Archives for "Networking"

Transitioning into Networking, 2025 Edition

Elmer sent me the following question:

I’ve been working in systems engineering (Linux, virtualization, infrastructure ops) and am considering shifting toward network engineering or architecture. I got my CCNA years ago and started CCNP but didn’t continue.

I’d really appreciate any thoughts you might have on how someone with my background could best make that transition today, especially with how things are evolving around automation and the cloud.

I keep answering a variant of this question every other year or so (2019, 2021, 2023, 2024). I guess it’s time for another answer, so here we go.

KubeCon Europe: Kgateway Aims To Be the Kubernetes Onramp

Kubernetes network administrators at KubeCon + CloudNativeCon EU this week in London should drop by the ease the management of moving traffic to and from clusters. Built on top of Kubernetes Gateway API, the open source Solo.io, and went under the name Gloo Gateway. At last year’s KubeCon + CloudNativeCon North America 2024, the company announced that it would be donating the software to the Cloud Native Computing Foundation (CNCF), changing the software’s name to kgateway in the process. In March, CNCFGloo open source repository will be deprecated over time.

The Importance of the Kubernetes Gateway API

In 2023, the

N4N020: To Cert Or Not To Cert?

To cert or not to cert? That is the question Holly & Ethan discuss on today’s episode. Will a certification really land you a networking job? Are certs the guaranteed path to tech career success? We consider this, talking through the benefits, challenges and even risks of networking industry certification. And there’s some bonus material,... Read more »

Improve your media pipelines with the Images binding for Cloudflare Workers

When building a full-stack application, many developers spend a surprising amount of time trying to make sure that the various services they use can communicate and interact with each other. Media-rich applications require image and video pipelines that can integrate seamlessly with the rest of your technology stack.

With this in mind, we’re excited to introduce the Images binding, a way to connect the Images API directly to your Worker and enable new, programmatic workflows. The binding removes unnecessary friction from application development by allowing you to transform, overlay, and encode images within the Cloudflare Developer Platform ecosystem.

In this post, we’ll explain how the Images binding works, as well as the decisions behind local development support. We’ll also walk through an example app that watermarks and encodes a user-uploaded image, then uploads the output directly to an R2 bucket.

The challenges of fetch()

Cloudflare Images was designed to help developers build scalable, cost-effective, and reliable image pipelines. You can deliver multiple copies of an image — each resized, manipulated, and encoded based on your needs. Only the original image needs to be stored; different versions are generated dynamically, or as requested by a user’s browser, then subsequently served Continue reading

ARP Challenges in EVPN/VXLAN Symmetric IRB

Whenever I claimed that EVPN is The SIP of Networking, vendor engineers quickly told me that “EVPN interoperability is a solved problem” and that they run regular multi-vendor interoperability labs to iron out the quirks. As it turns out, things aren’t as rosy in real life; it’s still helpful to have an EVPN equivalent of the DTMF tone generators handy.

I encountered a particularly nasty quirk when running the netlab EVPN integration test using symmetric IRB with an anycast gateway between Nokia SR Linux (or Juniper vSwitch) and FRR container.

Lab topology


Calico Whisker, Your New Ally in Network Observability

With the upcoming release of Calico v3.30 on the horizon, we are excited to introduce Calico Whisker, a simple yet powerful User Interface (UI) designed to enhance network observability and policy debugging. If you’ve ever struggled to make sense of network flow logs or troubleshoot policies in a complex Kubernetes cluster, Whisker is your friend!

Whisker is a three-part deployment consisting of a UI, a backend, and a gRPC channel that communicates with Felix, the brain of Calico, to gather live flow information and present it in a human-readable, easy-to-understand way. But before we get started, let’s dive into why Whisker is a must-have for your Kubernetes environment, what problems it solves, and how it can streamline your policy management.

Navigating Network Flows is Difficult

In Kubernetes environments, network flows are the backbone of communication between workloads. As clusters scale, so does the complexity of managing these flows and their security. Without clear visibility and effective observability tools, teams often struggle with:

  • Diagnosing unexplained workload behavior and determining why certain applications aren’t working as expected.
  • Identifying the real reason why certain workload communications are permitted or denied, which stems from understanding which policies are affecting specific Continue reading

A steam locomotive from 1993 broke my yarn test

So the story begins with a pair programming session I had with my colleague, which I desperately needed because my node skill tree is still at level 1, and I needed to get started with React because I'll be working on our internal backstage instance.

We worked together on a small feature, tested it locally, and it worked. Great. Now it's time to make My Very First React Commit. So I ran the usual git add and git commit, which hooked into yarn test, to automatically run unit tests for backstage, and that's when everything got derailed. For all the React tutorials I have followed, I have never actually run a yarn test on my machine. And the first time I tried yarn test, it hung, and after a long time, the command eventually failed:

Determining test suites to run...

  ● Test suite failed to run

thrown: [Error]

error Command failed with exit code 1.
info Visit https://yarnpkg.com/en/docs/cli/run for documentation about this command.
🌈  backstage  ⚡

I could tell it was obviously unhappy about something, and then it threw some [Error]. I have very little actual JavaScript experience, but this looks suspiciously like someone had neglected to Continue reading

Comparing IP and CLNP: Local (Node) Multihoming

Another area where CLNP is a clear winner when compared to the TCP/IP stack is multi-homed nodes (nodes with multiple interfaces, not site multi-homing, where whole networks are connected to two upstream providers).

Multi-homed TCP/IP nodes must have multiple IP addresses because IP addresses identify interfaces, not nodes. There is no well-defined procedure in TCP/IP for how a multi-homed node should behave. In the early days of TCP/IP, they tried to address that in RFC 1122 (Host Requirements RFC), but even then, there were two ideas about dealing with multiple interfaces: the strong and weak end system models (more details).
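
The difference between the two models is easiest to see on the receive side. The following is a minimal Python sketch, not an implementation of any real IP stack: the `Interface` class and the `accepts` function are hypothetical names made up for illustration, and the sketch ignores the send-side rules, broadcast, and multicast. It assumes the RFC 1122 reading in which a strong end system accepts a packet only if the destination address belongs to the interface the packet arrived on, while a weak end system accepts it if the address belongs to any of its interfaces.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interface:
    """Hypothetical model of one interface on a multi-homed node."""
    name: str
    address: str


def accepts(interfaces, arrival_ifname, dst_address, model):
    """Return True if the node would accept a unicast packet.

    interfaces     -- all interfaces of the node
    arrival_ifname -- name of the interface the packet arrived on
    dst_address    -- destination IP address in the packet
    model          -- "strong" or "weak" (end system model)
    """
    if model not in ("strong", "weak"):
        raise ValueError("model must be 'strong' or 'weak'")

    if model == "weak":
        # Weak model: any local address is fine, regardless of
        # which interface the packet came in on.
        return any(i.address == dst_address for i in interfaces)

    # Strong model: the destination address must be configured on
    # the very interface that received the packet.
    return any(
        i.name == arrival_ifname and i.address == dst_address
        for i in interfaces
    )
```

With a node that has 192.0.2.1 on one interface and 198.51.100.1 on another, a packet for 198.51.100.1 arriving on the first interface is accepted under the weak model and dropped under the strong one. That is precisely the ambiguity a multi-homed TCP/IP node has to live with.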

HS099: From CLI to CFO: Translating Complex Network Data into Clear Strategic and Financial Insights (Sponsored)

IT and network leaders need more than uptime—they need to know what their networks cost, what they deliver, and how future changes will impact the business. That’s where Netos comes in. CEO and founder Richard Foster joins Johna and John in a lively discussion to explore how Netos turns complex operational data into clear financial... Read more »

Comparing AI / ML activity from two production networks

AI Metrics describes how to deploy the open source ai-metrics application. The application provides performance metrics for AI/ML RoCEv2 network traffic, for example, large scale CUDA compute tasks using NVIDIA Collective Communication Library (NCCL) operations for inter-GPU communications: AllReduce, Broadcast, Reduce, AllGather, and ReduceScatter. The screen capture from the article (above) shows results from a simulated 48,000 GPU cluster.

This article goes beyond simulation to demonstrate the AI Metrics dashboard by comparing live traffic seen in two production AI clusters.

Cluster 1

This cluster consists of 250 GPUs connected via 100G ports to a single large switch. The results are pretty consistent with the simulation from the original article. In this case there is no Core Link Traffic because the cluster consists of a single switch. The Discards chart shows a burst of Out (egress) discards and the Drop Reasons chart gives the reason as ingress_vlan_filter. The Total Traffic, Operations, Edge Link Traffic, and RDMA Operations charts all show a transient drop in throughput coincident with the discard spike. Further details of the dropped packets, such as source/destination address, operation, ingress / egress port, QP pair, etc. can be extracted from the sFlow Dropped Packet Notifications that are populating Continue reading

“You get Instant Purge, and you get Instant Purge!” — all purge methods now available to all customers

There's a tradition at Cloudflare of launching real products on April 1, instead of the usual joke product announcements circulating online today. In previous years, we've introduced impactful products like 1.1.1.1 and 1.1.1.1 for Families. Today, we're excited to continue this tradition by making every purge method available to all customers, regardless of plan type.

During Birthday Week 2024, we announced our intention to bring the full suite of purge methods — including purge by URL, purge by hostname, purge by tag, purge by prefix, and purge everything — to all Cloudflare plans. Historically, methods other than "purge by URL" and "purge everything" were exclusive to Enterprise customers. However, we've been openly rebuilding our purge pipeline over the past few years (hopefully you’ve read some of our blog series), and we're thrilled to share the results more broadly. We've spent recent months ensuring the new Instant Purge pipeline performs consistently under 150 ms, even during increased load scenarios, making it ready for every customer.  

But that's not all — we're also significantly raising the default purge rate limits for Enterprise customers, allowing even greater purge throughput thanks to the efficiency of our Continue reading

NB520: When Good LLMs Do Bad Things, Dell’s Workforce Downsizes and Quantum Key Distribution From Space

Grab a virtual doughnut to blaze through this week’s IT news with Johna Johnson and John Burke as Drew Conry-Murray is enjoying his glazed, filled and sprinkled vacation donuts.  Today, we’re going to talk about getting good LLMs to do bad things, Dell’s workforce downsizing, Cloudflare’s recent outage, some developments in space networking, and more.... Read more »

TNO022: Secure Automation at Enterprise Scale for the Public Sector with Red Hat Ansible (Sponsored)

There are both benefits and challenges when adopting automation in the public sector, but Red Hat Ansible enhances efficiency, security and service delivery. With the right tooling, network operators can integrate automation into existing environments and improve network security.  Providing insights into adopting automation in the public sector are Tony Dubiel, Principal Specialist Solution Architect... Read more »

HN774: Who Put These OT Risks In My IT Ops? Fortinet Has Answers (Sponsored)

IT and infosec professionals are used to operating and protecting mission-critical infrastructure: servers, databases, load balancers, and so on. But what about valves that control the flow of gas or oil in a refinery? Temperature and vibration sensors that monitor industrial manufacturing processes? If you’re thinking “That’s not my problem,” think again. There’s a whole... Read more »