We've been relying on ML and AI for our core services like Web Application Firewall (WAF) since the early days of Cloudflare. Through this journey, we've learned many lessons about running AI deployments at scale, and all the tooling and processes necessary. We recently launched Workers AI to help abstract a lot of that away for inference, giving developers an easy way to leverage powerful models with just a few lines of code. In this post, we’re going to explore some of the lessons we’ve learned on the other side of the ML equation: training.
Cloudflare has extensive experience training models and using them to improve our products. A constantly-evolving ML model drives the WAF attack score that helps protect our customers from malicious payloads. Another evolving model power bot management product to catch and prevent bot attacks on our customers. Our customer support is augmented by data science. We build machine learning to identify threats with our global network. To top it all off, Cloudflare is delivering machine learning at unprecedented scale across our network.
Each of these products, along with many others, has elevated ML models — including experimentation, training, and deployment — to a crucial position within Continue reading
Azure Boost is a hardware offload of Azure virtual machines designed to improve VM performance. On today's Day Two Cloud we dig into how it works. We also talk about how to implement security in Virtual Network Manager, as well as how to optimize your Azure observability--meaning, how not to blow up your budget with unnecessary logging.
The post D2C223: Accelerating VM Performance With Azure Boost appeared first on Packet Pushers.
Cloudflare recently announced Workers AI, giving developers the ability to run serverless GPU-powered AI inference on Cloudflare’s global network. One key area of focus in enabling this across our network was updating our Baseboard Management Controllers (BMCs). The BMC is an embedded microprocessor that sits on most servers and is responsible for remote power management, sensors, serial console, and other features such as virtual media.
To efficiently manage our BMCs, Cloudflare leverages OpenBMC, an open-source firmware stack from the Open Compute Project (OCP). For Cloudflare, OpenBMC provides transparent, auditable firmware. Below describes some of what Cloudflare has been able to do so far with OpenBMC with respect to our GPU-equipped servers.
For this project, we needed a way to adjust our BMC firmware to accommodate new GPUs, while maintaining the operational efficiency with respect to thermals and power consumption. OpenBMC was a powerful tool in meeting this objective.
OpenBMC allows us to change the hardware of our existing servers without the dependency of our Original Design Manufacturers (ODMs), consequently allowing our product teams to get started on products quickly. To physically support this effort, our servers need to be able to supply enough power and keep Continue reading
Most organizations are terribly bad at interviewing people. They overcomplicate things by holding too many interviews (more than 2-3) and often focus their interview on trivia and memorization rather than walking through a scenario. Every interview should have some form of a scenario and a whiteboard if you are hiring a Network Engineer. Rather than overcomplicating things, here’s how you can interview someone using a single scenario that you can expand on and go to different depths at different stages depending on the focus of the role.
You are an employee working in a large campus network. Your computer has just started up and has not previously communicated with anything before you open your browser and type in microsoft.com.
Before any communication can take place, you need an IP address. What IP protocols are there? What are the main differences between the two?
Things to look for: IPv4 vs IPv6. ARP vs ND. DHCP vs RA. Broadcast vs multicast.
What methods are there of configuring an IP address?
Things to look for: Static IP vs DHCP vs RA.
When I need to communicate to something external, traffic goes through a gateway. What type of device would Continue reading
Fortinet turns its on-prem and cloud security devices into a sensor network that collects threat intelligence across the globe. That intelligence then feeds those devices and services with new updates and the latest protections. In today's sponsored Heavy Networking, we talk with Fortinet about its Fortiguard Security Services, how they work, and how customers can take advantage of them.
The post HN712: FortiGuard Security Services: Invisible Operations, Tangible Results (Sponsored) appeared first on Packet Pushers.
Powering data centres is big deal in current decade. Massive increases in consumption and scaling of off-prem clouds has exceeded the capacity of civilian power grids while cloud operators are reluctant to sign thirty year supply agreements so that more power plants can be built. Enter power micro-generation where large DCs needs too small power supply.
This post is also available in Deutsch.
A recent decision from the Higher Regional Court of Cologne in Germany marked important progress for Cloudflare and the Internet in pushing back against misguided attempts to address online copyright infringement through the DNS system. In early November, the Court in Universal v. Cloudflare issued its decision rejecting a request to require public DNS resolvers like Cloudflare’s 18.104.22.168. to block websites based on allegations of online copyright infringement. That’s a position we’ve long advocated, because blocking through public resolvers is ineffective and disproportionate, and it does not allow for much-needed transparency as to what is blocked and why.
To see why the Universal decision matters, it’s important to understand what a public DNS resolver is, and why it’s not a good place to try to moderate content on the Internet.
The DNS system translates website names to IP addresses, so that Internet requests can be routed to the correct location. At a high-level, the DNS system consists of two parts. On one side sit a series of nameservers (Root, TLD, and Authoritative) that together store information mapping domain names to IP addresses; on the other Continue reading
It’s been a while since the last netlab release. Most of that time was spent refactoring stuff that you don’t care about, but you might like these features:
As always, we also improved the platform support:
Today on the Tech Bytes podcast we talk with sponsor Pliant about its automation platform. Pliant helps you orchestrate across devices and domains with a low-code approach that uses APIs to automate and orchestrate across your infrastructure.
The post Tech Bytes: Pliant Combines APIs, Low Code Approach For Network Automation (Sponsored) appeared first on Packet Pushers.
Today's Network Break discusses a new Trident ASIC with an on-chip neural net inference engine, Broadcom staff cuts at VMware, more bad news from an Okta breach, financial results, and more.
The post NB458: Broadcom Debuts On-Chip Neural Net, Lays Off VMware Staff; Okta Breach Gets Worse appeared first on Packet Pushers.
Some friends shared a Reddit post the other day that made me both shake my head and ponder the state of the networking industry. Here is the locked post for your viewing pleasure. It was locked because the comments were going to devolve into a mess eventually. The person making the comment seems to be honest and sincere in their approach to “layer 3 going away”. The post generated a lot of amusement from the networking side of IT about how this person doesn’t understand the basics but I think there’s a deeper issue going on.
Our visibility of the state of the network below the application interface is very general in today’s world. That’s because things “just work” to borrow an overused phrase. Aside from the occasional troubleshooting exercise to find out why packets destined for Azure or AWS are failing along the way when is the last time you had to get really creative in finding a routing issue in someone else’s equipment? We spend more time now trying to figure out how to make our own networks operate efficiently and less time worrying about what happens to the packets when they leave our organization. Continue reading
On the 26th of January, I’ll be teaching a webinar over at Safari Books Online (subscription service) called Modern Network Troubleshooting. From the blurb:
The first section of this class considers the nature of resilience, and how design tradeoffs result in different levels of resilience. The class then moves into a theoretical understanding of failures, how network resilience is measured, and how the Mean Time to Repair (MTTR) relates to human and machine-driven factors. One of these factors is the unintended consequences arising from abstractions, covered in the next section of the class.
The class then moves into troubleshooting proper, examining the half-split formal troubleshooting method and how it can be combined with more intuitive methods. This section also examines how network models can be used to guide the troubleshooting process. The class then covers two examples of troubleshooting reachability problems in a small network, and considers using ChaptGPT and other LLMs in the troubleshooting process. A third, more complex example is then covered in a data center fabric.
I’ve never personally done this on the net but….wouldn’t the BGP origin code also work with moving one’s ingress traffic similarly to AS PATH?
TL&DR: Sort of, but not exactly. Also, just because you can climb up ropes using shoelaces instead of jumars doesn’t mean you should.
Let’s deal with the moving traffic bit first.