Back in the day, when the AI revolution was young, Nvidia co-founder and chief executive officer used to make the rounds with each new generation of datacenter GPUs and provide the first DGX systems based on any particular GPU to marquee customers. …
Taiwan’s Fastest AI Supercomputer Goes To Foxconn was written by Timothy Prickett Morgan at The Next Platform.
Failure is an expected state in production systems, and no predictable failure of either software or hardware components should result in a negative experience for users. The exact failure mode may vary, but certain remediation steps must be taken after detection. A common example is when an error occurs on a server, rendering it unfit for production workloads, and requiring action to recover.
When operating at Cloudflare’s scale, it is important to ensure that our platform is able to recover from faults seamlessly. It can be tempting to rely on the expertise of world-class engineers to remediate these faults, but this would be manual, repetitive, unlikely to produce enduring value, and not scaling. In one word: toil; not a viable solution at our scale and rate of growth.
In this post we discuss how we built the foundations to enable a more scalable future, and what problems it has immediately allowed us to solve.
The Cloudflare Site Reliability Engineering (SRE) team builds and manages the platform that helps product teams deliver our extensive suite of offerings to customers. One important component of this platform is the collection of servers that power critical products such as Durable Objects, Workers, Continue reading
I just completed the Programming for Network Engineers (PRNE) learning path training on Cisco U and would like to share this with you: For a limited time, you have FREE…
The post Cisco U – Rev Up to Recert: Programming for Network Engineer appeared first on AboutNetworks.net.
One of the best ways of learning something is building a lab for it. Especially when it comes to complex topics like network authentication. When I started learning about network authentication and Cisco Identity Services Engine (ISE), I found that there wasn’t a lot of clear information on how you build a lab. Not in Cisco documentation and also not on blogs, etc. In this post I’ll explain how I built my lab using CML and ESX.
Having a lab with ISE only is not going to get you very far. At a minimum, I think the following devices are needed in a network authentication lab:
For my lab, I’m using only virtual devices. The focus is on learning network authentication and ISE which is why I’ve setup a very simple PKI, ignoring best practices such as offline root, intermediate CA, and so on. I might lab that at a later stage, but that’s not the current focus.
The topology of my lab is shown below:
Note that some VMs such as the virtual Catalyst Continue reading
This is a quick post to describe how to reclaim disk space being used by an old Windows install. Recently, I upgraded to Windows 11 from Windows 10. I noticed that I was starting to run a bit low on disk space on my SSD. I have a 512 GB SSD and had less than 100 GB available:
I noticed that there is a folder named Windows.old that is 40 GB in size:
The instructions to reclaim the space seemed clear. Go to Settings -> System -> Storage and reclaim the space labeled as Previous Windows installation. However, ther was no such category when I tried:
After some searching and a little bit of thinking, I realized that this is probably a privileges problem. I became local admin by using the PAM tool. Then I ran the disk cleanup util as administrator:
I can now see that there are previous Windows installations:
I select to delete Previous Windows installations:
You have to confirm that it’s OK to delete:
The deletion process starts:
This will take some time…
There is now more space available:
If you’re running low on disk, check if you have previous Windows installations that you can Continue reading
To a certain extent, all of the major HPC centers in the world live in the future. …
The Stepping Stones In America To Exascale And Beyond was written by Timothy Prickett Morgan at The Next Platform.
COMMISSIONED: As the world becomes more data-driven, the landscape of safety and security is evolving in remarkable ways. …
The Future Of Security: Harnessing AI And Storage For Smarter Protection was written by Martin Courtney at The Next Platform.
We’re excited to announce that Kivera, a cloud security, data protection, and compliance company, has joined Cloudflare. This acquisition extends our SASE portfolio to incorporate inline cloud app controls, empowering Cloudflare One customers with preventative security controls for all their cloud services.
In today’s digital landscape, cloud services and SaaS (software as a service) apps have become indispensable for the daily operation of organizations. At the same time, the amount of data flowing between organizations and their cloud providers has ballooned, increasing the chances of data leakage, compliance issues, and worse, opportunities for attackers. Additionally, many companies — especially at enterprise scale — are working directly with multiple cloud providers for flexibility based on the strengths, resiliency against outages or errors, and cost efficiencies of different clouds.
Security teams that rely on Cloud Security Posture Management (CSPM) or similar tools for monitoring cloud configurations and permissions and Infrastructure as code (IaC) scanning are falling short due to detecting issues only after misconfigurations occur with an overwhelming volume of alerts. The combination of Kivera and Cloudflare One puts preventive controls directly into the deployment process, or ‘inline’, blocking errors before they happen. This offers a proactive approach essential to Continue reading
Cloudflare runs several multi-tenant Kubernetes clusters across our core data centers. These general-purpose clusters run on bare metal and power our control plane, analytics, and various engineering tools such as build infrastructure and continuous integration.
Kubernetes is a container orchestration platform. It enables software engineers to deploy containerized applications to a cluster of machines. This enables teams to build highly-available software on a scalable and resilient platform.
In this blog post we discuss our Kubernetes architecture, why we needed virtualization, and how we’re using it today.
Multi-tenancy is a concept where one system can share its resources among a wide range of customers. This model allows us to build and manage a small number of general purpose Kubernetes clusters for our internal application teams. Keeping the number of clusters small reduces our operational toil. This model shrinks costs and increases computational efficiency by sharing hardware. Multi-tenancy also allows us to scale more efficiently. Scaling is done at either a cluster or application level. Cluster operators scale the platform by adding more hardware. Teams scale their applications by updating their Kubernetes manifests. They can scale vertically by increasing their resource requests or horizontally by increasing the number of Continue reading
In the previous blog posts, we explored the simplest possible IBGP-based EVPN design and made it scalable with BGP route reflectors.
Now, imagine someone persuaded you that EBGP is better than any IGP (OSPF or IS-IS) when building a data center fabric. You’re running EBGP sessions between the leaf- and the spine switches and exchanging IPv4 and IPv6 prefixes over those EBGP sessions. Can you use the same EBGP sessions for EVPN?
TL&DR: It depends™.
The AI boom is going sonic, and it looks like we had all better cover our ears if we want to be able to hear by the end of the holiday season if the prognostications of the box counters at IDC are correct. …
Datacenter Infrastructure Spending Is Up, And Forecasts Are Even Higher was written by Timothy Prickett Morgan at The Next Platform.
The main conclusions based on testing described in the two VPP blog posts are:
The VPP sFlow plugin provides a lightweight method of exporting real-time sFlow telemetry from a VPP based router. Including the plugin with VPP distributions has no impact on performance. Enabling the plugin provides real-time visibility that opens up additional use cases for VPPs programmable dataplane. For example, VPP is well suited to packet filtering use cases where the number of Continue reading
In the dynamic evolution of AI and cloud computing, the deployment of efficient and reliable hardware is critical. As we roll out our Gen 12 hardware across hundreds of cities worldwide, the challenge of maintaining optimal thermal performance becomes essential. This blog post provides a deep dive into the robust thermal design that supports our newest Gen 12 server hardware, ensuring it remains reliable, efficient, and cool (pun very much intended).
Generally speaking, a server has five core resources: CPU (computing power), RAM (short term memory), SSD (long term storage), NIC (Network Interface Controller, connectivity beyond the server), and GPU (for AI/ML computations). Each of these components can withstand different temperature limits based on their design, materials, location within the server, and most importantly, the power they are designed to work at. This final criteria is known as thermal design power (TDP).
The reason why TDP is so important is closely related to the first law of thermodynamics, which states that energy cannot be created or destroyed, only transformed. In semiconductors, electrical energy is converted into heat, and TDP measures the maximum heat output that needs to be managed to ensure Continue reading
Since exploring the EPE basics, now it’s time to understand the building blocks of the solution.
Overall, there are 3 elements of the EPE solution: