The story of one latency spike
A customer reported an unusual problem with our CloudFlare CDN: our servers were responding to some HTTP requests slowly. Extremely slowly. 30 seconds slowly. This happened very rarely and wasn't easily reproducible. To make things worse, our usual monitoring hadn't caught the problem. At the application layer everything looked fine: our NGINX servers were not reporting any long-running requests.
Time to send in The Wolf.
He solves problems.
Following the evidence
First, we attempted to reproduce what the customer reported: long HTTP responses. Here is a chart of test HTTP request times measured against our CDN:

We ran thousands of HTTP queries against one server over a couple of hours. Almost all the requests finished in milliseconds but, as you can clearly see, 5 requests out of thousands took as long as 1000ms to finish. When debugging network problems, delays of 1s or 30s are very characteristic. They may indicate packet loss, since SYN packets are usually retransmitted at times 1s, 3s, 7s, 15s, 31s.
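The retransmission times quoted above are what you get from a timer that starts at one second and doubles after every lost SYN. Here is a minimal sketch of that arithmetic. The function name, the 1-second initial timeout and the retry count of 5 are assumptions chosen to match the figures in the text, not values read from any particular kernel:

```python
def syn_retransmit_times(initial_rto=1.0, retries=5):
    """Return the cumulative time (in seconds) of each SYN retransmission.

    Assumes a simple exponential backoff: the first retransmission fires
    after `initial_rto` seconds and the timeout doubles after each attempt.
    Both parameters are illustrative assumptions.
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto        # wait out the current timeout
        times.append(elapsed)  # a retransmission is sent at this moment
        rto *= 2              # back off: double the timeout for next time
    return times


print(syn_retransmit_times())  # [1.0, 3.0, 7.0, 15.0, 31.0]
```

Under these assumptions, a request that stalls for about 1s is consistent with one lost SYN, while a stall of around 30s is consistent with five lost in a row.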
Blame the network
At first we thought the spikes in HTTP load times might indicate some sort of network problem. To be sure, we ran ICMP pings against two IPs over many ...