In this post, I’d like to offer my take on one element that I believe will be important for any successful IBN implementation: efficient and effective open telemetry.
Telemetry is key to data collection and monitoring, which in turn are crucial for managing not only networks but everything the network supports, including applications and storage systems.
With effective telemetry, we can automate responses to network issues to ensure reliability, uptime and performance.
Open telemetry with gNMI
At its core, telemetry involves two things:
- The continuous collection of data from networking devices, such as switches and routers
- Ensuring all data collected is time-stamped
Continuous data collection makes it possible to diagnose issues on the network, including those in the past, and to establish a baseline of “normal” network performance. The timestamp establishes when a given incident occurred, and the time differential between different, related incidents. Together, the two elements enable you to identify trends and, before long, to diagnose issues in real time using artificial intelligence tools.
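To make the two elements concrete, here is a minimal sketch of what continuous, time-stamped collection looks like on the receiving side. The metric name, sample values, and storage layout are illustrative assumptions, not part of any gNMI API; the point is simply that every sample carries a timestamp, and that the accumulated history yields a baseline of normal behavior.

```python
import time
from statistics import mean, stdev

def record_sample(store, metric, value, ts=None):
    """Append a time-stamped sample so later analysis can order and
    correlate events, even across devices."""
    store.setdefault(metric, []).append((ts if ts is not None else time.time(), value))

def baseline(store, metric):
    """Summarize 'normal' behavior for a metric from its history."""
    values = [v for _, v in store[metric]]
    return mean(values), stdev(values)

# Hypothetical samples standing in for data streamed from a switch.
history = {}
for v in [10, 12, 11, 9, 10, 13]:
    record_sample(history, "cpu_percent", v)

avg, spread = baseline(history, "cpu_percent")  # the learned "normal"
```

Anything that later deviates sharply from `avg` relative to `spread` is a candidate incident, and the stored timestamps let you line it up against other events on the network.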
The real challenge with telemetry is finding an efficient way to collect data while handling data traffic in parallel, which is exactly the problem gNMI (gRPC Network Management Interface) is designed to solve. gNMI is an open-source unified management protocol, developed under the OpenConfig project, for streaming telemetry and configuration management that leverages the open-source gRPC framework. This means a single gRPC service definition can cover both configuration and telemetry: the gNMI service defines operations for configuration management, operational state retrieval, and bulk data collection via streaming telemetry.
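As a rough sketch of what streaming telemetry looks like in practice, the snippet below builds a SAMPLE-mode subscription in the dictionary form accepted by the open-source pygnmi client. The target address, credentials, and the OpenConfig path are illustrative assumptions; the paths actually available depend on the YANG models your device supports.

```python
# A streaming subscription request: ask the device to push interface
# counters every 10 seconds rather than being polled for them.
subscribe_request = {
    "subscription": [
        {
            # OpenConfig-style path for per-interface counters (assumed
            # to be supported by the target device).
            "path": "openconfig-interfaces:interfaces/interface/state/counters",
            "mode": "sample",
            "sample_interval": 10_000_000_000,  # nanoseconds, i.e. 10 s
        }
    ],
    "mode": "stream",
    "encoding": "json",
}

# Against a live device, this request would be handed to a gNMI client,
# for example (hypothetical target and credentials):
#
# from pygnmi.client import gNMIclient
# with gNMIclient(target=("10.0.0.1", 6030), username="admin",
#                 password="admin", insecure=True) as gc:
#     for update in gc.subscribe2(subscribe=subscribe_request):
#         handle(update)  # each update arrives time-stamped
```

Because the push model rides on a single long-lived gRPC stream, the device does the work once per interval instead of servicing a poll per collector, which is what makes the collection efficient.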
Spawning new management capabilities
This efficient data-collection capability is important because it will give users management capabilities they simply have not had up to now. It will enable us to use sFlow to monitor network devices in real time, while using gNMI to efficiently send the data to a central data collection server for storage and analysis. To date, sFlow has been used mainly for non-real-time historical analysis and troubleshooting, because there simply wasn’t an efficient way to get sFlow data to a server.
This capability will enable users to do things like set thresholds and get alarms for any criteria they like. In the past, most alarm criteria were defined by the switch vendor, for conditions such as the CPU running too hot or memory running low. These conditions rarely occur in switches, but when they do, you certainly want to know.
It would be preferable, however, to monitor the CPU utilization by dynamically adjusting the threshold based on the telemetry data itself. For example, if the CPU is normally running at 10% during off-shift hours, but somehow spikes to 60% on a particular night, the telemetry data analyzer could automatically generate an alarm. However, if a CPU is usually running 70% busy around 9 a.m. because all staff are being authenticated at that hour, the telemetry analyzer would not trigger an alarm, even if the CPU was at 80%. This is just one example of how telemetry data can help the analyzer to observe and learn. The same technique can just as easily be applied to network security or performance monitoring.
Event logs are another consideration. Right now, you can set up a system log server to collect data from a network device event log. But you can’t analyze it in real time because, again, there’s no efficient way to get the data to the server. gNMI provides just that mechanism.
In short, gNMI provides a way to implement telemetry in an open, efficient, standards-based way. We are working on integrating it with our Linux-based network operating system, PICOS, so you’ll soon have an efficient way to stream data out of any network device.
Automated troubleshooting and response
While using gNMI for network telemetry is new, streaming telemetry itself has long been used with servers and other IT infrastructure. So, once we have widespread availability of streaming telemetry from network elements, network and IT managers will have much better visibility into the state of not just the network, but the entire IT infrastructure.
This is where we can expect AI to play a role, by analyzing the available data in order to identify root causes of problems. For decades, we’ve all struggled to troubleshoot application performance problems. Was it actually a problem with the network, or was it the server or the application code? By constantly collecting data from all the pieces involved in the IT infrastructure and applying AI, we can finally start answering those important remediation questions – in real time.
The next step is to automate the response, which is what IBN is all about. You collect and analyze all the event data and, if it is a network problem, IBN automatically adjusts data paths to correct it. But none of this is feasible if you can’t efficiently collect the data in the first place.
IBN also doesn’t work effectively if you’re not combining network data with event data from other components, including servers and storage systems. So, it won’t be just a single IBN server managing the network, but lots of servers monitoring different components.
This is the IBN vision Pica8 is working towards. I fully expect we’ll have the gRPC-gNMI piece done around the end of the third quarter this year. From there, you can use our AmpCon open network services platform, or your own controller, to implement it. That’s what open networking is all about – choice.