Delivering Enhanced Visibility for Modern Overlay Networks with ArcOS + ArcIQ

, Customer Engineering

June 3, 2020

Introduction

Network monitoring and visibility solutions have always been a cornerstone of a good network architecture. In modern networks, these requirements are amplified by the need for simple, scalable, real-time solutions that meet operational and business SLAs. Legacy approaches such as SNMP, CLI screen-scraping, and query-based APIs lack the scale, analytics, and automation needed to meet today’s requirements.

Here is a recent talk from Google at NANOG’78 that reviews how networking has evolved over the past several decades and how Google has evolved its monitoring and visibility frameworks along with it. Amazon, Facebook, Microsoft, and several others who operate networks at scale have evolved their own solutions in the same vein. Networks big and small can benefit from a modern monitoring, visibility, and analytics framework. Here are some of the key attributes I would like to highlight:

  • Streaming telemetry – Event-driven, push-based streaming telemetry that is real-time instead of query-based or polling-based.
  • Machine learning and proactive analytics – The human-staffed NOC is now complemented, or even replaced, by software tools and portals that continuously collect, process, and proactively analyze data from tens of thousands of network devices 24/7/365. This plays a key role in the following:
    • Event correlation – Dealing with a single cognitive event instead of a handful of events scattered across nodes.
    • Anomaly detection – Real-time alerts before the operator realizes that network conditions have deviated from the norm.
  • Standards-based protocols – Avoiding lock-in from proprietary technologies by leveraging standards-based ones for easier insertion.
  • Scale – With networks spanning hundreds of nodes and thousands of interconnections, and with workloads numbering in the thousands, a unified view across the entire span of your network is a lifesaver.
  • Flexible deployment – The ability to deploy the visibility engine on premises or in the cloud, and to scale out instances to meet scale or geographic needs.

Modern Networks, Workloads, and Protocols

Have you ever wondered how most applications today are just available all the time? And how services are seamless?

Figure 1 Seamless workload migration

Today, compute infrastructure is built as a fungible pool of resources, and workloads are regularly migrated from one physical server or cluster to another for various reasons: planned maintenance, load distribution, power or thermal management. This migration is automated and completes in under a second. All of it is abstracted from the end user, who in most cases notices no disruption. The network plays an equal role in making this happen. EVPN has become the de facto standard for multi-tenant overlay networks that need distributed Layer 2 and Layer 3 domains. This, however, adds an extra element of complexity to achieving unified visibility across the network.

In this blog, I will highlight how ArcOS and ArcIQ can help provide a simple and elegant solution for enhanced visibility in EVPN VxLAN networks.

ArcOS® + ArcIQ®

ArcOS has been designed with a foundational capability for streaming telemetry data from various system components, such as BGP, IS-IS, RIB, FIB, ACLs, and EVPN. This data, modeled as JSON schemas, can be streamed to any external data collection engine over a Kafka bus or gNMI. With ArcIQ, our deep visibility and analytics platform, you can go further: it processes the data from the network, monitors and reports on the health of your network, and correlates events that occur at different times, giving you better real-time, cognitive insights into the network.

Figure 2 Enhanced visibility with ArcOS + ArcIQ

ArcOS+ArcIQ Visibility solution for EVPN

Let’s see how a typical datacenter network running EVPN VxLAN on ArcOS can gain deep visibility into workload mobility by using streaming telemetry and the cognitive capabilities of ArcIQ.

Streaming Telemetry

When a workload moves locally within the same switch, or from remote to local across an EVPN instance, ArcOS devices instantly generate telemetry data. Its metadata tells the operator several parameters: type of move, time of move, workload MAC address, source VTEP address, source interface, destination VTEP address, destination interface, VNI, and sequence number.
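To make the list of parameters concrete, here is a minimal sketch of how a collector might decode one such record. The JSON key names (`move_type`, `moved_at`, `src_vtep`, and so on), the `MoveEvent` class, and the sample values are all hypothetical; they mirror the parameters listed above but are not the actual ArcOS schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MoveEvent:
    """One workload-move record; field names are illustrative only."""
    move_type: str                 # e.g. "local" or "remote-to-local"
    moved_at: str                  # time of move, kept as the raw string
    mac: str                       # workload MAC address
    src_vtep: Optional[str]        # assumed optional in this sketch
    src_interface: Optional[str]   # assumed optional in this sketch
    dst_vtep: str
    dst_interface: str
    vni: int
    seq: int                       # MAC mobility sequence number


def parse_move_event(payload: str) -> MoveEvent:
    """Decode a JSON document; a missing required key raises KeyError."""
    doc = json.loads(payload)
    return MoveEvent(
        move_type=doc["move_type"],
        moved_at=doc["moved_at"],
        mac=doc["mac"].lower(),    # normalize case so MACs compare equal
        src_vtep=doc.get("src_vtep"),
        src_interface=doc.get("src_interface"),
        dst_vtep=doc["dst_vtep"],
        dst_interface=doc["dst_interface"],
        vni=int(doc["vni"]),
        seq=int(doc["seq"]),
    )


# Made-up sample using documentation address ranges.
sample = json.dumps({
    "move_type": "remote-to-local",
    "moved_at": "2020-06-03T10:15:00Z",
    "mac": "00:00:5E:00:53:01",
    "src_vtep": "192.0.2.1",
    "src_interface": "swp1",
    "dst_vtep": "192.0.2.2",
    "dst_interface": "swp7",
    "vni": 10100,
    "seq": 3,
})
event = parse_move_event(sample)
print(f"{event.mac} moved {event.src_vtep} -> {event.dst_vtep} on VNI {event.vni}")
```

A frozen dataclass is used here so that a decoded event cannot be modified by later processing stages; a real collector would validate against whatever schema the device actually publishes.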

Event Correlation and Cognitive insights

ArcIQ collects the data from several such micro-events, produces a single meaningful event, and clearly indicates whether it is a mac-move (mobility in an L2-only domain) or a mac-ip move (mobility in an L3 domain). Data from multiple nodes is presented in one place to help you correlate events or take further action.
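The idea of folding several micro-events into one event can be sketched as follows. This is not how ArcIQ is implemented; it is an illustration under two assumptions: that micro-events sharing the same MAC, VNI, and sequence number describe the same move, and that the presence of an IP address distinguishes a mac-ip move from a mac-move. The dictionary keys and the `correlate` function are hypothetical.

```python
from collections import defaultdict


def correlate(micro_events):
    """Fold per-node micro-events into one summary per (mac, vni, seq)."""
    groups = defaultdict(list)
    for ev in micro_events:
        groups[(ev["mac"], ev["vni"], ev["seq"])].append(ev)

    summaries = []
    for (mac, vni, seq), evs in groups.items():
        # Assumption: any micro-event carrying an IP marks an L3 move.
        ip = next((e["ip"] for e in evs if e.get("ip")), None)
        summaries.append({
            "kind": "mac-ip move" if ip else "mac-move",
            "mac": mac,
            "vni": vni,
            "seq": seq,
            "ip": ip,
            # Every node that reported this move, deduplicated.
            "reporters": sorted({e["reporter"] for e in evs}),
        })
    return summaries


# Made-up micro-events from three nodes.
micro_events = [
    {"reporter": "leaf1", "mac": "00:00:5e:00:53:01", "vni": 10100,
     "seq": 4, "ip": None},
    {"reporter": "leaf2", "mac": "00:00:5e:00:53:01", "vni": 10100,
     "seq": 4, "ip": "198.51.100.10"},
    {"reporter": "leaf3", "mac": "00:00:5e:00:53:02", "vni": 10200,
     "seq": 1},
]
summaries = correlate(micro_events)
for s in summaries:
    print(s["kind"], s["mac"], "reported by", ", ".join(s["reporters"]))
```

Here the first two micro-events collapse into a single mac-ip move reported by two leaves, while the third remains a separate mac-move.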

Figure 3 ArcIQ to visualize your network

Anomaly detection

In the context of datacenter networks running EVPN, MAC address duplication is an anomaly. Without proper visibility, such anomalies go undetected, and EVPN would either blacklist the workload or pin it to its last known location. MAC address duplication can indicate a potential security breach or a Layer 2 loop, which the operator needs to know about immediately. When ArcOS detects a MAC duplication, either locally or in an extended L2 domain, it generates a real-time update and notifies the visibility engine. ArcIQ renders this data and provides meaningful information about the VTEP that generated the alert, along with metadata such as VNI, VTEP address, MAC mobility sequence number, and interface.
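One common way to decide that a MAC is duplicated rather than legitimately moving is the heuristic described in RFC 7432: flag the MAC once it has moved N times within M seconds, with defaults of N = 5 and M = 180. The sketch below implements that heuristic with a sliding window. Whether ArcOS uses exactly this rule is not stated here, so treat it as an assumption; the `DuplicateMacDetector` class and its method names are hypothetical.

```python
from collections import defaultdict, deque


class DuplicateMacDetector:
    """Flag a MAC as duplicated after max_moves moves within window_s seconds."""

    def __init__(self, max_moves=5, window_s=180.0):
        self.max_moves = max_moves
        self.window_s = window_s
        self._moves = defaultdict(deque)   # (mac, vni) -> recent move times

    def record_move(self, mac, vni, ts):
        """Record one move at time ts (seconds); return True if duplicated."""
        recent = self._moves[(mac, vni)]
        recent.append(ts)
        # Drop moves that have fallen out of the sliding window.
        while recent and ts - recent[0] > self.window_s:
            recent.popleft()
        return len(recent) >= self.max_moves


detector = DuplicateMacDetector()

# A MAC flapping every 10 seconds trips the detector on the fifth move.
flapping = [detector.record_move("00:00:5e:00:53:01", 10100, t)
            for t in (0, 10, 20, 30, 40)]

# A MAC moving every 100 seconds never has five moves inside one window.
slow = [detector.record_move("00:00:5e:00:53:02", 10100, t)
        for t in (0, 100, 200, 300, 400)]

print(flapping)
print(slow)
```

Keying the window on the (MAC, VNI) pair keeps the same MAC address in two different tenants from being counted together.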

Figure 4 Anomaly detection

Leveraging protocol standards

ArcOS leverages EVPN machinery for workload mobility across stretched L2 domains and generates the telemetry update by gleaning the EVPN state before and after the workload migration. Take a remote-to-local migration: ArcOS already knows the earlier location of the workload from the EVPN Type 2 MAC-IP route. After the migration to local, it takes the current location (the interface the workload is now behind) and the pre-migration information such as VTEP address, VNI, and VLAN, and packs them into a telemetry update to the visibility engine.
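The remote-to-local case can be sketched as a small function that merges the two pieces of state. This is an illustration only, not ArcOS code: `build_move_update`, the `prev_state` and `local_learn` dictionaries, and all key names are hypothetical. The sequence number handling follows the MAC mobility procedure in RFC 7432, where the new advertisement carries a sequence number one higher than the one it supersedes.

```python
def build_move_update(prev_state, local_learn, local_vtep):
    """Combine pre-move EVPN state with the new local attachment point."""
    if prev_state["mac"] != local_learn["mac"]:
        raise ValueError("pre-move state and local learn describe different MACs")
    return {
        "move_type": "remote-to-local",
        "mac": local_learn["mac"],
        # Pre-migration details, known from the earlier Type 2 route.
        "src_vtep": prev_state["vtep"],
        "vni": prev_state["vni"],
        "vlan": prev_state["vlan"],
        # Current location: this switch and the interface just learned.
        "dst_vtep": local_vtep,
        "dst_interface": local_learn["interface"],
        # RFC 7432 MAC mobility: bump the sequence number by one.
        "seq": prev_state["seq"] + 1,
    }


# Made-up state using documentation address ranges.
prev_state = {"mac": "00:00:5e:00:53:01", "vtep": "192.0.2.1",
              "vni": 10100, "vlan": 100, "seq": 3}
local_learn = {"mac": "00:00:5e:00:53:01", "interface": "swp7"}

update = build_move_update(prev_state, local_learn, local_vtep="192.0.2.2")
print(update)
```

The point of the sketch is that no extra protocol exchange is needed: everything in the update is already held in EVPN state on the switch that sees the workload arrive.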

Figure 5 MAC-IP move with source and destination details

Check out the demo video below that showcases the detection of MAC moves and MAC address duplication in EVPN networks using ArcOS+ArcIQ.

What other insights do you care about?

Above, we saw a few examples of how ArcOS+ArcIQ can deliver increased visibility specific to EVPN networks. However, as we noted before, streaming telemetry is an integral part of ArcOS. Similarly, the ability to store, process and analyze data at scale, to produce meaningful information is integral to ArcIQ. So, the possibilities are endless for the ArcOS+ArcIQ solution to deliver analytics and insights important to your deployment, your environment and your ecosystem. While I leave you with that thought, here are some other use cases where we have leveraged ArcOS + ArcIQ to provide security insights.

  • BGP Flowspec-based DDoS mitigation
  • RPKI-based Route Origin Validation (ROV)

Learn More

Visit us at https://www.arrcus.com to learn more about our EVPN products and solutions and our Visibility and Analytics solutions to realize your modern monitoring and visibility framework.