sFlow Archives - Page 9 of 15 - NetworkingNexus.net

Troubleshooting connectivity problems in leaf and spine fabrics

Introducing data center fabric, the next-generation Facebook data center network describes the benefits of moving to a leaf and spine network architecture. The diagram shows how the leaf and spine architecture creates many paths between each pair of hosts. Multiple paths increase available bandwidth and resilience against the loss of a link or a switch. While most networks don't have the scale requirements of Facebook, smaller scale leaf and spine designs deliver high bandwidth, low latency, networking to support cloud workloads (e.g. vSphere, OpenStack, Docker, Hadoop, etc.).

Unlike traditional hierarchical network designs, where a small number of links can be monitored to provide visibility, a leaf and spine network has no special links or switches where running CLI commands or attaching a probe would provide visibility. Even if it were possible to attach probes, the effective bandwidth of a leaf and spine network can be as high as a Petabit/second, well beyond the capabilities of current generation monitoring tools.

Fortunately, industry standard sFlow monitoring technology is built into the commodity switch hardware used to build leaf and spine networks. Enabling sFlow telemetry on all the switches in the network provides centralized, real-time, visibility into network traffic.

Fabric View Continue reading

Cumulus Linux 3.4 REST API

The latest Cumulus Linux 3.4 release include a REST API. This article will demonstrate how the REST API can be used to automatically deploy traffic controls based on real-time sFlow telemetry. DDoS mitigation with Cumulus Linux describes how sFlow-RT can detect Distributed Denial of Service (DDoS) attacks in real-time and deploy automated controls.

The following ddos.js script is modified to use the REST API to send Network Command Line Utility - NCLU commands to add and remove ACLs, see Installing and Managing ACL Rules with NCLU:

var user = "cumulus";
var password = "CumulusLinux!";
var thresh = 10000;
var block_minutes = 1;

setFlow('udp_target',{keys:'ipdestination,udpsourceport',value:'frames'});

setThreshold('attack',{metric:'udp_target', value:thresh, byFlow:true, timeout:10});

function restCmds(agent,cmds) {
  for(var i = 0; i < cmds.length; i++) {
    let msg = {cmd:cmds[i]};
    http("https://"+agent+":8080/nclu/v1/rpc",
         "post","application/json",JSON.stringify(msg),user,password);
  }
}

var controls = {};
var id = 0;
setEventHandler(function(evt) {
  var key = evt.agent + ',' + evt.flowKey;
  if(controls[key]) return;

  var ifname = metric(evt.agent,evt.dataSource+".ifname")[0].metricValue;
  if(!ifname) return;

  var now = (new Date()).getTime();
  var name = 'ddos'+id++;
  var [ip,port] = evt.flowKey.split(',');
  var cmds = [
    'add acl ipv4 '+name+' drop udp source-ip any source-port '+port+' dest-ip '+ip+' dest-port any',
     Continue reading

Linux 4.11 kernel extends packet sampling support

Linux 4.11 on Linux Kernel Newbies describes the features added in the April 30, 2017 release. Of particular interest is the new netlink sampling channel:

Introduce psample, a general way for kernel modules to sample packets, without being tied to any specific subsystem. This netlink channel can be used by tc, iptables, etc. and allow to standardize packet sampling in the kernel commit

The psample netlink channel delivers sampled packet headers along with associated metadata from the Linux kernel to user space. The psample fields map directly into sFlow Version 5 sampled_header export structures:

netlink psample	sFlow	Description
PSAMPLE_ATTR_IIFINDEX	input	Interface packet was received on.
PSAMPLE_ATTR_OIFINDEX	output	Interface packet was sent on.
PSAMPLE_ATTR_SAMPLE_GROUP	data source	The location within network device that generated packet sample.
PSAMPLE_ATTR_GROUP_SEQ	drops	Number of times that the sFlow agent detected that a packet marked to be sampled was dropped due to lack of resources. Agent calculates drops by tracking discontinuities in PSAMPLE_ATTR_GROUP_SEQ
PSAMPLE_ATTR_SAMPLE_RATE	sampling_rate	The Sampling Rate specifies the ratio of packets observed at the Data Source to the samples generated. For example a sampling rate of 100 specifies that, on Continue reading

Arista eAPI

The sFlow and eAPI features of EOS (Extensible Operating System) are standard across the full range of Arista Networks switches. This article demonstrates how the real-time visibility provided by sFlow telemetry can be combined with the programmatic control of eAPI to automatically adapt the network to changing traffic conditions.

In the diagram, the sFlow-RT analytics engine receives streaming sFlow telemetry, provides real-time network-wide visibility, and automatically applies controls using eAPI to optimize forwarding, block denial of service attacks, or capture suspicious traffic.

Arista eAPI 101 describes the JSON RPC interface for programmatic control of Arista switches. The following eapi.js script shows how eAPI requests can be made using sFlow-RT's JavaScript API:

function runCmds(proto, agent, usr, pwd, cmds) {
  var req = {
    jsonrpc:'2.0',id:'sflowrt',method:'runCmds',
    params:{version:1,cmds:cmds,format:'json'}
  };
  var url = (proto || 'http')+'://'+agent+'/command-api';
  var resp = http(url,'post','application/json',JSON.stringify(req),usr,pwd);
  if(!resp) throw "no response";
  resp = JSON.parse(resp);
  if(resp.error) throw resp.error.message;
  return resp.result; 
}

The following test.js script demonstrates the eAPI functionality with a basic show request:

include('eapi.js');
var result = runCmds('http','10.0.0.90','admin','arista',['show hostname']);
logInfo(JSON.stringify(result));

Starting sFlow-RT:

env "RTPROP=-Dscript.file=test.js" ./start.sh

Running the script generates the following output:

2017-07-10T14:00:06-0700  Continue reading

Real-time DDoS mitigation using sFlow and BGP FlowSpec

Remotely Triggered Black Hole (RTBH) Routing describes how native BGP support in the sFlow-RT real-time sFlow analytics engine can be used to blackhole traffic in order to mitigate a distributed denial of service (DDoS) attack. Black hole routing is effective, but there is significant potential for collateral damage since ALL traffic to the IP address targeted by the attack is dropped.

The BGP FlowSpec extension (RFC 5575: Dissemination of Flow Specification Rules) provides a method of transmitting traffic filters that selectively block the attack traffic while allowing normal traffic to pass. BGP FlowSpec support has recently been added to sFlow-RT and this article demonstrates the new capability.

This demonstration uses the test network described in Remotely Triggered Black Hole (RTBH) Routing. The network was constructed using free components: VirtualBox, Cumulus VX, and Ubuntu Linux. BGP FlowSpec on white box switch describes how to implement basic FlowSpec support on Cumulus Linux.

The following flowspec.js sFlow-RT script detects and blocks UDP-Based Amplification attacks:

var router = '10.0.0.141';
var id = '10.0.0.70';
var as = 65141;
var thresh = 1000;
var block_minutes = 1;

setFlow('udp_target',{keys:'ipdestination,udpsourceport',value:'frames'});

setThreshold('attack',{metric:'udp_target', value:thresh, byFlow:true});

bgpAddNeighbor(router,as,id,{flowspec:true});

var  Continue reading

BGP FlowSpec on white box switch

BGP FlowSpec is a method of distributing access control lists (ACLs) using the BGP protocol. Distributed denial of service (DDoS) mitigation is an important use case for the technology, allowing a targeted network to push filters to their upstream provider to selectively remove the attack traffic.

Unfortunately, FlowSpec is currently only available on high end routing devices and so experimenting with the technology is expensive. Looking for an alternative, Cumulus Linux is an open Linux platform that allows users to install Linux packages and develop their own software.

This article describes a proof of concept implementation of basic FlowSpec functionality using ExaBGP installed on a free Cumulus VX virtual machine. The same solution can be run on inexpensive commodity white box hardware to deliver terabit traffic filtering in a production network.

First, install latest version of ExaBGP on the Cumulus Linux switch:

curl -L https://github.com/Exa-Networks/exabgp/archive/4.0.0.tar.gz | tar zx

Now define the handler, acl.py, that will convert BGP FlowSpec updates into standard Linux netfilter/iptables entries used by Cumulus Linux to specify hardware ACLs (see Netfilter - ACLs):

#!/usr/bin/python
 
import json
import re
from os import listdir,remove
from os.path import isfile
from  Continue reading

Remotely Triggered Black Hole (RTBH) Routing

The screen shot demonstrates real-time distributed denial of service (DDoS) mitigation. Automatic mitigation was disabled for the first simulated attack (shown on the left of the chart). The attack reaches a sustained packet rate of 1000 packets per second for a period of 60 seconds. Next, automatic mitigation was enabled and a second attack launched. This time, as soon as the traffic crosses the threshold (the horizontal red line), a BGP remote trigger message is sent to router, which immediately drops the traffic.

The diagram shows the test setup. The network was built out of freely available components: CumulusVX switches and Ubuntu 16.04 servers running under VirtualBox.

The following configuration is installed on the ce-router:

router bgp 65140
 bgp router-id 0.0.0.140
 neighbor 10.0.0.70 remote-as 65140
 neighbor 10.0.0.70 port 1179
 neighbor 172.16.141.2 remote-as 65141
 !
 address-family ipv4 unicast
  neighbor 10.0.0.70 allowas-in
  neighbor 10.0.0.70 route-map blackhole-in in
 exit-address-family
!
ip community-list standard blackhole permit 65535:666
!
route-map blackhole-in permit 20
 match community blackhole
 match ip address prefix-len 32
 set ip next-hop 192.0.2.1

The ce-router peers with the upstream service provider router ( Continue reading

Arista EOS telemetry

Arista EOS switches support industry standard sFlow telemetry, enabling hardware instrumentation supported by merchant silicon to export hardware interface counters and flow data. The latest release of the open source Host sFlow agent has been ported to EOS, augmenting the telemetry with standard host CPU, memory, and disk IO metrics.

Linux as a Switch Operating System: Five Lessons Learned identifies benefits of using Linux as the basis for EOS. In this context, the Linux operating system made it easy to port the Host sFlow agent, use standard Linux package management (RPM Package Manager), and gather metrics using standard Linux APIs. A new eAPI module automatically synchronizes the Host sFlow daemon with the EOS sFlow configuration.

The following sflowtool output shows the additional metrics contributed by a Host sFlow agent installed on an Arista switch:

startDatagram =================================
datagramSourceIP 172.17.0.1
datagramSize 704
unixSecondsUTC 1490843418
datagramVersion 5
agentSubId 100000
agent 10.0.0.90
packetSequenceNo 714
sysUpTime 0
samplesInPacket 1
startSample ----------------------
sampleType_tag 0:2
sampleType COUNTERSSAMPLE
sampleSequenceNo 714
sourceId 2:1
counterBlock_tag 0:2001
counterBlock_tag 0:2010
udpInDatagrams 1459
udpNoPorts 16
udpInErrors 0
udpOutDatagrams 4765
udpRcvbufErrors 0
udpSndbufErrors 0
udpInCsumErrors 0
counterBlock_tag 0:2009
tcpRtoAlgorithm 1
tcpRtoMin 200
tcpRtoMax 120000
tcpMaxConn 4294967295
tcpActiveOpens 102
 Continue reading

Nutanix

Maximum Performance from Acropolis Hypervisor and Open vSwitch describes the network architecture within a Nutanix converged infrastructure appliance - see diagram above. This article will explore how the Host sFlow agent can be deployed to enable sFlow instrumentation in the Open vSwitch (OVS) and deliver streaming network and system telemetry from nodes in a Nutanix cluster.

This article is based on a single hardware node running Nutanix Community Edition (CE), built following the instruction in Part I: How to setup a three-node NUC Nutanix CE cluster. If you don't have hardware readily available, the article, 6 Nested Virtualization Resources To Get You Started With Community Edition, describes how to run Nutanix CE as a virtual machine.

The sFlow standard is widely supported by network equipment vendors, which combined with sFlow from each Nutanix appliance, delivers end to end visibility in the Nutanix cluster. The following screen captures from the free sFlowTrend tool are representative examples of the data available from the Nutanix appliance.

The Network > Top N chart displays the top flows traversing OVS. In this case an HTTP connection is responsible for most of the traffic. Inter-VM and external traffic flows traverse OVS and are efficiently Continue reading

QUIC

A QUIC update on Google’s experimental transport describes some of the benefits of the QUIC (Quick UDP Internet Connections) protocol that is now the default transport when Google's Chrome browser connects to Google services (gmail, search, etc.). Given the over 50% market share of the Chrome browser (NetMarketShare) and the popularity of Google services, it is important to be aware of the QUIC protocol and to start tracking its use of network resources.

An easy way to see if you have any QUIC traffic on your network is to use the standard sFlow instrumentation built into network switches. Configure the switches to send sFlow telemetry to an sFlow collector for visibility into network traffic.

For example, use Docker to run the sFlow-RT active-flows application to analyze the sFlow data stream:

docker run -p 6343:6343/udp -p 8008:8008 -d sflow/top-flows

Access the web interface at http://localhost:8008/ and enter the following Flow Specification to monitor QUICK flows:

dns:ipsource,dns:ipdestination,quicpackettype

Note: Real-time domain name lookups describes how sFlow-RT incorporates DNS (Domain Name Service) requests in its real-time analytics pipeline so that traffic flows can be identified by domain name.

The resulting top flows table is shown in the screen capture above. Continue reading

Telegraf, InfluxDB, Chronograf, and Kapacitor

The InfluxData TICK (Telegraf, InfluxDB, Chronograf, Kapacitor) provides a full set of integrated metrics tools, including an agent to export metrics (Telegraf), a time series database to collect and store the metrics (InfluxDB), a dashboard to display metrics (Chronograf), and a data processing engine (Kapacitor). Each of the tools is open sourced and can be used together or separately.

This article will show how industry standard sFlow agents embedded within the data center infrastructure can provide Telegraf metrics to InfluxDB. The solution uses sFlow-RT as a proxy to convert sFlow metrics into their Telegraf equivalent form so that they are immediately visible through the default Chronograf dashboards (Using a proxy to feed metrics into Ganglia described a similar approach for sending metrics to Ganglia).

The following telegraf.js script instructs sFlow-RT to periodically export host metrics to InfluxDB:

var influxdb = "http://10.0.0.56:8086/write?db=telegraf";

function sendToInfluxDB(msg) {
  if(!msg || !msg.length) return;
  
  var req = {
    url:influxdb,
    operation:'POST',
    headers:{"Content-Type":"text/plain"},
    body:msg.join('\n')
  };
  req.error = function(e) {
    logWarning('InfluxDB POST failed, error=' + e);
  }
  try { httpAsync(req); }
  catch(e) {
    logWarning('bad request ' + req.url + ' ' + e);
  }
}

var metric_names = [
   Continue reading

Using Ganglia to monitor Linux services

The screen capture from the Ganglia monitoring tool shows metrics for services running on a Linux host. Monitoring Linux services describes how the open source Host sFlow agent has been extended to export standard Virtual Node metrics from services running under systemd. Ganglia already supports these standard metrics and the article Using Ganglia to monitor virtual machine pools describes the configuration steps needed to enable this feature.

Monitoring Linux services

Mainstream Linux distributions have moved to systemd to manage daemons (e.g. httpd, sshd, etc.). The diagram illustrates how systemd runs each daemon within its own container so that it can maintain tight control of the daemon's resources.

This article describes how to use the open source Host sFlow agent to gather telemetry from daemons running under systemd.

Host sFlow systemd monitoring exports a standard set of metrics for each systemd service - the sFlow Host Structures extension defines metrics for Virtual Nodes (virtual machines, containers, etc.) that are used to export Xen, KVM, Docker, and Java resource usage. Exporting the standard metrics for systemd services provides interoperability with sFlow analyzers, allowing them to report on Linux services using existing virtual node monitoring capabilities.

While running daemons within containers helps systemd maintain control of the resources, it also provides a very useful abstraction for monitoring. For example, a single service (like the Apache web server) may consist of dozens of processes. Reporting on container level metrics abstracts away the per-process details and gives a view of the total resources consumed by the service. In addition, service metadata (like the service name) provides a useful way of identifying and grouping Continue reading

IPv6 Internet router using merchant silicon

Internet router using merchant silicon describes how a commodity white box switch can be used as a replacement for an expensive Internet router. The solution combines standard sFlow instrumentation implemented in merchant silicon with BGP routing information to selectively install only active routes into the hardware.

The article describes a simple self contained solution that uses standard APIs and should be able to run on a variety of Linux based network operating systems, including: Cumulus Linux, Dell OS10, Arista EOS, and Cisco NX-OS.

The diagram shows the elements of the solution. Standard sFlow instrumentation embedded in the merchant silicon ASIC data plane in the white box switch provides real-time information on traffic flowing through the switch. The sFlow agent is configured to send the sFlow to an instance of sFlow-RT running on the switch. The Bird routing daemon is used to handle the BGP peering sessions and to install routes in the Linux kernel using the standard netlink interface. The network operating system in turn programs the switch ASIC with the kernel routes so that packets are forwarded by the switch hardware and not by the kernel software.

The key to this solution is Bird's multi-table capabilities. The full Internet Continue reading

Monitoring at Terabit speeds

The chart was generated from industry standard sFlow telemetry from the switches and routers comprising The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16) network. The chart shows a number of conference participants pushing the network to see how much data they can transfer, peaking at a combined bandwidth of 3 Terabits/second over a minute just before noon and sustaining over 2.5 Terabits/second for over an hour. The traffic is broken out by MAC vendors code: routed traffic can be identified by router vendor (Juniper, Brocade, etc.) and layer 2 transfers (RDMA over Converged Ethernet) are identified by host adapter vendor codes (Mellanox, Hewlett-Packard Enterprise, etc.).

From the SCinet web page, "The Fastest Network Connecting the Fastest Computers: SC16 will host the most powerful and advanced networks in the world – SCinet. Created each year for the conference, SCinet brings to life a very high-capacity network that supports the revolutionary applications and experiments that are a hallmark of the SC conference."

SC16 live real-time weathermaps provides additional demonstrations of high performance network monitoring.

SC16 live real-time weathermaps

Connect to https://inmon.sc16.org/sflow-rt/app/sc16-weather/html/ between now and November 17th to see a real-time heat map of the The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16) network.

From the SCinet web page, "The Fastest Network Connecting the Fastest Computers: SC16 will host the most powerful and advanced networks in the world – SCinet. Created each year for the conference, SCinet brings to life a very high-capacity network that supports the revolutionary applications and experiments that are a hallmark of the SC conference."

The real-time weathermap leverages industry standard sFlow instrumentation built into network switch and router hardware to provide scaleable monitoring of the SCinet network. Link colors are updated every second to reflect operational status and utilization of each link.

Clicking on a link in the map pops up a 1 second resolution strip chart showing the protocol mix carried by the link.

OSiRIS (Open Storage Research Infrastructure) is a "distributed, multi-institutional storage infrastructure that lets researchers write, manage, and share data from their own computing facility locations."

Connect to http://inmon.sc16.org/sflow-rt/app/OSiRIS-weather/html/ to see an animated diagram of the SC16 OSiRIS demonstration connecting SCinet with University of Michigan, Michigan State, Wayne Continue reading

Network performance monitoring

Today, network performance monitoring typically relies on probe devices to perform active tests and/or observe network traffic in order to try and infer performance. This article demonstrates that hosts already track network performance and that exporting host-based network performance information provides an attractive alternative to complex and expensive in-network approaches.

# tcpdump -ni eth0 tcp
11:29:28.949783 IP 10.0.0.162.ssh > 10.0.0.70.56174: Flags [P.], seq 1424968:1425312, ack 1081, win 218, options [nop,nop,TS val 2823262261 ecr 2337599335], length 344
11:29:28.950393 IP 10.0.0.70.56174 > 10.0.0.162.ssh: Flags [.], ack 1425312, win 4085, options [nop,nop,TS val 2337599335 ecr 2823262261], length 0

The host TCP/IP stack continuously measured round trip time and estimates available bandwidth for each active connection as part of its normal operation. The tcpdump output shown above highlights timestamp information that is exchanged in TCP packets to provide the accurate round trip time measurements needed for reliable high speed data transfer.

The open source Host sFlow agent already makes use of Berkeley Packet Filter (BPF) capability on Linux to efficiently sample packets and provide visibility into traffic flows. Adding support Continue reading

Real-time domain name lookups

Reverse DNS requests request the domain name associated with an IP address, for example providing the name google-public-dns-a.google.com for IP address 8.8.8.8. This article demonstrates how the sFlow-RT engine incorporates domain name lookups in real-time flow analytics.

First, use the dns.servers System Property is used to specify one or more DNS servers to handle the reverse lookup requests. For example, the following command uses Docker to run sFlow-RT with DNS lookups directed to server 10.0.0.1:

docker run -e "RTPROP=-Ddns.servers=10.0.0.1" \
-p 8008:8008 -p 6343:6343/udp -d sflow/sflow-rt

The following Python script dnspair.py uses the sFlow-RT REST API to define a flow and log the resulting flow records:

#!/usr/bin/env python
import requests
import json

flow = {'keys':'dns:ipsource,dns:ipdestination',
 'value':'bytes','activeTimeout':10,'log':True}
requests.put('http://localhost:8008/flow/dnspair/json',data=json.dumps(flow))
flowurl = 'http://localhost:8008/flows/json?name=dnspair&maxFlows=10&timeout=60'
flowID = -1
while 1 == 1:
  r = requests.get(flowurl + "&flowID=" + str(flowID))
  if r.status_code != 200: break
  flows = r.json()
  if len(flows) == 0: continue

  flowID = flows[0]["flowID"]
  flows.reverse()
  for f in flows:
    print json.dumps(f,indent=1)

Running the script generates the following output:

$ ./dnspair.py
{
 "value": 233370.92322668363, 
 "end": 1476234478177, 
 "name": "dnspair", 
 "flowID":  Continue reading

Collecting Docker Swarm service metrics

This article demonstrates how to address the challenge of monitoring dynamic Docker Swarm deployments and track service performance metrics using existing on-premises and cloud monitoring tools like Ganglia, Graphite, InfluxDB, Grafana, SignalFX, Librato, etc.

In this example, Docker Swarm is used to deploy a simple web service on a four node cluster:

docker service create --replicas 2 -p 80:80 --name apache httpd:2.4

Next, the following script tests the agility of monitoring systems by constantly changing the number of replicas in the service:

#!/bin/bash
while true
do
  docker service scale apache=$(( ( RANDOM % 20 )  + 1 ))
  sleep 30
done

The above test is easy to set up and is a quick way to stress test monitoring systems and reveal accuracy and performance problems when they are confronted with container workloads.

Many approaches to gathering and recording metrics were developed for static environments and have a great deal of difficulty tracking rapidly changing container-based service pools without missing information, leaking resources, and slowing down. For example, each new container in Docker Swarm has unique name, e.g. apache.16.17w67u9157wlri7trd854x6q0. Monitoring solutions that record container names, or even worse, index data by container name, will suffer from bloated Continue reading

Docker 1.12 swarm mode elastic load balancing

Docker Built-In Orchestration Ready For Production: Docker 1.12 Goes GA describes the native swarm mode feature that integrates cluster management, virtual networking, and policy based deployment of services.

This article will demonstrate how real-time streaming telemetry can be used to construct an elastic load balancing solution that dynamically adjusts service capacity to match changing demand.

Getting started with swarm mode describes the steps to configure a swarm cluster. For example, following command issued on any of the Manager nodes deploys a web service on the cluster:

docker service create --replicas 2 -p 80:80 --name apache httpd:2.4

And the following command raises the number of containers in the service pool from 2 to 4:

docker service scale apache=4

Asynchronous Docker metrics describes how sFlow telemetry provides the real-time visibility required for elastic load balancing. The diagram shows how streaming telemetry allows the sFlow-RT controller to determine the load on the service pool so that it can use the Docker service API to automatically increase or decrease the size of the pool as demand changes. Elastic load balancing of the service pools ensures consistent service levels by adding additional resources if demand increases. In addition, efficiency is improved by releasing resources Continue reading

« Previous 1 … 7 8 9 10 11 … 15 Next »