Many of you have probably heard through the grapevine, at TechEx, or directly from us about our new tool, the GlobalNOC Network Troubleshooter. This tool is a direct result of our internal GlobalNOC teams (Service Desk, network engineering, systems engineering) coming together to figure out how to better leverage automation to improve our network operations. Let's take a quick look at the tool's origin, the problem we are trying to solve, and the results.
In early 2022, GlobalNOC formed a network automation working group to discuss the next project to focus on. Many ideas were thrown around, but the most popular was to attempt to resolve a small set of network outages automatically, without requiring network engineers. We recognized that this would be extremely difficult to implement, not to mention that we needed to build some trust before just setting robots loose on the network!
The working group decided on a simpler initial step: gathering all of the information that an engineer would need to diagnose the problem without having to log in to the devices or other GlobalNOC tools. The NAP (Network Automation and Performance) syseng team worked with a few select networks interested in this process to determine a set of commands to run for a single outage type (BGP alarms).
Great! Task completed, we’re done! Well, not quite …
There were a few problems. Incidents with lots of alarms ended up with a LOT of data all mashed together in the ticket, which made the output less useful. There was also little flexibility in which commands were run, no way to run the commands for an alarm again, and no way to integrate other data sources (SNAPP, for example).
The next step in this process was to create a troubleshooting portal that was flexible enough to process many different alarm types, with different commands, for different networks with different vendors, and that could integrate other data sources like SNAPP. The working group got together again to write a PRD (Product Requirements Document) for the Network Troubleshooter Portal.
Once the working group approved the PRD, the NAP (Network Automation and Performance Team), SMS (Service Management Team), and GNUI (GlobalNOC UI) development teams got to work.
The syseng teams were able to use several new frameworks. For years, GlobalNOC software has been written in Perl, but recently we decided to start writing new code in Python, with FastAPI as our new web-service infrastructure.
Many of you have also probably noticed the new style UI. For about 10 years, our web UIs have been implemented in a framework we called GLUE (GlobalNOC Library of UI Elements), built around YUI and jQuery. YUI has been deprecated for many years, and it was time for us to evaluate a new framework. The syseng teams formed a working group in late 2021 to choose and build a new UI framework; ultimately, we chose React and a set of other frameworks, collectively called GNUI (GlobalNOC UI). Both FastAPI and GNUI allow us to develop and deploy tools and updates quickly. The UI Development team is also new, formed in the hope that a focused set of people working on UI development will give our users better and more consistent UIs.
The results of this project are still relatively modest but with lots of potential. Incidents in ServiceNow have a link to “Open Network Troubleshooter.”
The GlobalNOC Network Troubleshooter portal gathers data from devices, the database, measurement, and other tools, and displays it on a single page for each alarm acknowledged to the incident in ServiceNow.
When an alarm is acknowledged in an incident, it launches a workflow in our AWX instances, configurable for each network and alarm type (BGP, Interface, CPU, memory, etc.). These workflows push their results into the Network Troubleshooter backend, so the results are available to the engineers troubleshooting the incident as soon as the page loads.
When multiple alarms are acknowledged to a ticket, the UI lets the engineer select each alarm and displays its results. It is also possible to relaunch the workflow for a given alarm to see more recent information, and a direct link to the ticket is available.
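The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual implementation: the `WORKFLOWS` table, the `ResultStore` class, the workflow names, and the `run_workflow` callable are all hypothetical stand-ins for the per-network AWX configuration and the Network Troubleshooter backend.

```python
from dataclasses import dataclass, field

# Hypothetical mapping of (network, alarm type) to a workflow name.
# The real per-network configuration lives in AWX and is not shown here.
WORKFLOWS = {
    ("example-net", "BGP"): "bgp-troubleshooting",
    ("example-net", "Interface"): "interface-troubleshooting",
}


@dataclass
class ResultStore:
    """In-memory stand-in for the Network Troubleshooter backend."""
    results: dict = field(default_factory=dict)

    def push(self, alarm_id, result):
        # Keep every run, so a relaunch adds a newer entry for the alarm.
        self.results.setdefault(alarm_id, []).append(result)

    def latest(self, alarm_id):
        runs = self.results.get(alarm_id, [])
        return runs[-1] if runs else None


def select_workflow(network, alarm_type):
    """Return the configured workflow name, or None if none is configured."""
    return WORKFLOWS.get((network, alarm_type))


def on_alarm_acknowledged(store, alarm_id, network, alarm_type, run_workflow):
    """Launch the matching workflow and push its result into the store."""
    workflow = select_workflow(network, alarm_type)
    if workflow is None:
        return None
    store.push(alarm_id, run_workflow(workflow, alarm_id))
    return workflow
```

In this sketch, relaunching a workflow is simply a second call to `on_alarm_acknowledged` for the same alarm, and `latest` returns the most recent run for the selected alarm.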
The results displayed to the user can be raw text from the device (logs, commit history) or processed by the AWX workflow (ping panel).
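To make the difference between raw and processed results concrete, here is a small Python sketch. The panel dictionaries, their field names, and the `raw_panel` and `ping_panel` helpers are assumptions made for illustration; the actual result format used by the workflows is not described here.

```python
from statistics import mean


def raw_panel(title, text):
    """Wrap device output (logs, commit history) without interpreting it."""
    return {"title": title, "kind": "raw", "body": text}


def ping_panel(target, rtts_ms):
    """Summarize ping probes into a processed panel.

    rtts_ms holds one entry per probe: the round-trip time in
    milliseconds, or None when the probe received no reply.
    """
    replies = [rtt for rtt in rtts_ms if rtt is not None]
    sent = len(rtts_ms)
    loss = 100.0 * (sent - len(replies)) / sent if sent else 0.0
    return {
        "title": f"Ping {target}",
        "kind": "ping",
        "sent": sent,
        "received": len(replies),
        "loss_percent": loss,
        "avg_rtt_ms": mean(replies) if replies else None,
    }
```

For example, `ping_panel("192.0.2.1", [10.0, None, 20.0, 30.0])` reports 4 probes sent, 3 received, 25.0 percent loss, and an average round-trip time of 20.0 ms.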
Additionally, the tool can include other data sources, like GRNOC DB or SNAPP.
Looking forward, this tool needs several enhancements to meet our final goal of resolving network outages automatically, without requiring network engineers. Automatically gathering start and end times for outages is an immediate feature request we will be working on. Longer term, we want to let engineers remediate problems directly from the tool.
Imagine a “flap the interface,” “increase the prefix limit,” or “contact the customer” button, which could automatically fix a problem or email the customer. This all leads us to eventually having the Network Troubleshooter fix a problem without human interaction.
Many thanks to the Service Desk operators, network engineers, and systems engineers who made this tool possible.