Network automation tools
As we have seen throughout the previous chapters, we have multiple choices regarding automating a network. From a basic configuration for any device using Netmiko to deploying and creating configurations across various devices in a network using Ansible, there are many options for engineers to automate networks based upon various needs.
Python is extensively used in creating automation scenarios, owing to its open community support for various vendors and protocols. Nearly every major player in the industry supports Python programming, either in their own tools or in the supporting technology that they provide. Another major aspect of network automation is the custom solutions that can be built for an organization's requirements. The self-service API model is a good start to ensuring that some of the tasks that are done manually can be converted to APIs, which can then be called from any language based upon the automation needs.
Let's see an example that can be used as a basic guide to understand the advantage of self-built or custom automation tools. The show ip bgp summary command in Cisco serves the same purpose as show bgp summary in Juniper, but the commands and their output formats differ. As an engineer who needs to validate BGP on both vendors, I need to understand both commands and interpret both outputs.
Now extend this situation to more vendors, each with its own way of fetching BGP output. This becomes complex, and a network engineer needs to be trained on a multi-vendor environment to be able to fetch the same type of output from each vendor.
Now, let's say we create an API (for example, getbgpstatus), which takes a hostname as input. The API at the backend is intelligent enough to identify the vendor and model using SNMP, send the vendor-specific command (such as show ip bgp summary for Cisco or show bgp summary for Juniper), and parse the output into a human-readable format, such as only the IP address and status of each BGP neighbor.
For example, instead of printing the raw output of show ip bgp summary or show bgp summary, it parses the output like this:
IPaddress1 : Status is UP
IPaddress2 : Status is Down (Active)
This output can be returned as a JSON value to the caller of the API.
Hence, let's say we call the API as http://localhost/networkdevices/getbgpstatus?device=devicex. The API backend will identify whether devicex is Juniper, Cisco, or any other vendor, and based upon the vendor it will fetch and parse the relevant output. The return value of that API call will be JSON text, as we saw in the preceding example, that we can parse in our automation language.
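A minimal sketch of the parsing step behind such an API follows. It assumes that the Cisco summary table lists one neighbor per row, with the last column holding either a prefix count (session established) or a state name such as Active. The names VENDOR_COMMANDS, is_ip, and parse_cisco_bgp_summary, as well as the sample text, are hypothetical; a real backend would also need the vendor detection step (for example through SNMP) and a separate parser for each vendor.

```python
import ipaddress
import json

# Vendor-specific commands named in the text; the dictionary keys are
# hypothetical labels that a vendor-detection step would return.
VENDOR_COMMANDS = {
    "cisco": "show ip bgp summary",
    "juniper": "show bgp summary",
}


def is_ip(token):
    """Return True if token is a valid IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def parse_cisco_bgp_summary(raw_output):
    """Map each neighbor IP to 'UP' or 'Down (<state>)'.

    Assumption: the last column of a neighbor row is a prefix count when
    the session is established, and a state name otherwise.
    """
    status = {}
    for line in raw_output.splitlines():
        fields = line.split()
        if len(fields) < 2 or not is_ip(fields[0]):
            continue  # skip header rows and blank lines
        last_column = fields[-1]
        if last_column.isdigit():
            status[fields[0]] = "UP"
        else:
            status[fields[0]] = "Down ({})".format(last_column)
    return status


# Illustrative sample text, not captured from a real device.
SAMPLE = """Neighbor        V    AS MsgRcvd MsgSent   TblVer  InQ OutQ Up/Down  State/PfxRcd
10.0.0.1        4 65001     120     118        5    0    0 01:02:03        3
10.0.0.2        4 65002       0       0        1    0    0 never    Active
"""

bgp_status = parse_cisco_bgp_summary(SAMPLE)
print(json.dumps(bgp_status, indent=2))
```

Keeping the command table and the parsers separate means that supporting another vendor only requires a new table entry and a new parser function, while the caller keeps receiving the same JSON structure.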
Let us see a basic example of another popular tool, SolarWinds. There are many aspects of SolarWinds; it can auto-discover a device (based upon MIBs and SNMP), identify the vendor, and fetch relevant information from the device.
Let's see some of the following screenshots for basic SolarWinds device management. SolarWinds is freely available as a trial download.
The prerequisite for SolarWinds device management is as follows:
- We need to add a device in SolarWinds, shown as below:
As we can see, SolarWinds has the ability to discover devices (using network discovery), or we can add a specific IP address/hostname with the correct SNMP string for SolarWinds to detect the device.
- Once the device is detected it will show as the monitored node, as in the below screenshot:
Notice the green dot next to the IP address (or hostname). This signifies that the node is alive (reachable) and SolarWinds can interact with the node correctly.
The additional tasks that can be performed after device discovery are as follows:
Once the node is available or detected in SolarWinds, several additional tasks can be performed on it (as shown in the screenshot below):
We have selected the CONFIGS menu, under which we can perform config management for the devices. Additionally, as we can see in the following screenshot, we can create small scripts (as we did here to show the running config) and execute them against a certain set of devices from SolarWinds itself:
The result is retrieved and can be stored as a text file, or can even be sent as a report back to any email client if configured. Similarly, there are certain tasks (called jobs in SolarWinds), that can be done on a scheduled basis, as we can see in the following screenshot:
As we can see in the preceding screenshot, we can Download Configs from Devices, and then select all or certain devices in the next step and schedule the job. This is very useful in terms of fetching a config from a previous date or in case a rollback is needed to a last known good config scenario. Also, there are times when auditing needs to be performed regarding who changed what and what was changed in configurations, and SolarWinds can extend this ability by sending reports and alerts. Programmatically, we have the additional ability to call the SolarWinds API to fetch the results from Python.
Consider the following example:
from orionsdk import SwisClient
import requests

npm_server = 'myserver'
username = "username"
password = "password"

# suppress the certificate warnings when certificate verification is off
verify = False
if not verify:
    from requests.packages.urllib3.exceptions import InsecureRequestWarning
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

swis = SwisClient(npm_server, username, password)

# fetch the ID and display name of every node whose vendor is Cisco
results = swis.query("SELECT NodeID, DisplayName FROM Orion.Nodes Where Vendor= 'Cisco'")

for row in results['results']:
    print("{NodeID:<5}: {DisplayName}".format(**row))
Since the SolarWinds API accepts SQL-like queries (the SolarWinds Query Language, SWQL), we use the query:
SELECT NodeID, DisplayName FROM Orion.Nodes Where Vendor= 'Cisco'
We are trying to fetch the NodeID and DisplayName (or the device name) for all the devices which have the vendor Cisco. Once we have the result, we print the result in a formatted way. In our case, the output will be (let's assume our Cisco devices in SolarWinds are added as mytestrouter1 and mytestrouter2):
>>>
===================== RESTART: C:\a1\checksolarwinds.py =====================
101  : mytestrouter1
102  : mytestrouter2
>>>
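The formatting step can be tried without a SolarWinds server. The following sketch assumes only the response shape that the script iterates over (a dictionary with a results list of rows); the sample_results value is made up for illustration and reuses the device names from the preceding output.

```python
# Hypothetical stand-in for the value returned by swis.query(); only the
# shape (a "results" list of row dictionaries) is taken from the script.
sample_results = {
    "results": [
        {"NodeID": 101, "DisplayName": "mytestrouter1"},
        {"NodeID": 102, "DisplayName": "mytestrouter2"},
    ]
}

lines = []
for row in sample_results["results"]:
    # {NodeID:<5} left-aligns the ID in a field five characters wide.
    lines.append("{NodeID:<5}: {DisplayName}".format(**row))

print("\n".join(lines))
```

Because format(**row) looks up fields by name, any extra columns added to the SELECT list are ignored until they are referenced in the format string.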
Using some of these automation tools and APIs, we can ensure that our tasks are focused on actual work with some of the basic or core tasks (like fetching values from devices and so on) being offloaded to the tools or APIs to take care of.
Let's now create a basic automation tool from scratch that monitors the reachability of any node that is part of that monitoring tool, using a ping test. We can call it PingMesh or PingMatrix, as the tool will generate a web-based matrix to show the reachability of the routers.
The topology that we would be using is as follows:
Here, we would be using four routers (R1 to R4), and the Cloud1 as our monitoring source. Each of the routers will try to reach each other through ping, and will report back to the script running on Cloud1 which will interpret the results and display the web-based matrix through a web-based URL.
The explanation of the preceding topology is as follows:
- What we are trying to do is log in to each router (preferably in parallel), ping each destination from each source, and report back the reachability status of each destination.
- As an example, if we want to do the task manually, we would log in to R1 and try to ping R2, R3, and R4 from the source to check the reachability of each router from R1. The main script on Cloud1 (acting as the controller) will interpret the result and update the web matrix accordingly.
- In our case all the routers (and the controller) are residing in 192.168.255.x subnet, hence they are reachable to each other using a simple ping.
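The set of tests described in the preceding points is simply every source paired with every destination. Here is a minimal sketch of that enumeration, using the router addresses from this example. The helper name build_ping_pairs is hypothetical, and the command string assumes the one-line Cisco IOS form ping <destination> source <source> repeat <count>.

```python
from itertools import product

# Router names and addresses used in this example topology.
routers = {
    "R1": "192.168.255.240",
    "R2": "192.168.255.245",
    "R3": "192.168.255.248",
    "R4": "192.168.255.249",
}


def build_ping_pairs(addresses, repeat_count=10):
    """Return (source, destination, command) for every ordered pair.

    A router is also paired with itself, so n routers give n * n tests.
    """
    pairs = []
    for source, destination in product(addresses, repeat=2):
        command = "ping {} source {} repeat {}".format(
            destination, source, repeat_count)
        pairs.append((source, destination, command))
    return pairs


pairs = build_ping_pairs(list(routers.values()))
print(len(pairs), "tests")
print(pairs[1][2])
```

The number of tests grows with the square of the number of routers, which is why running them in parallel matters as the network grows.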
We are going to create two separate Python programs. The first acts as a library: it invokes the commands on the various nodes, fetches and interprets the results, and sends the parsed data to the main program. The main program is responsible for calling the library, and uses the results we get back to create the HTML web matrix.
Let's create the library or the program to be called in the main program first (we called it getmeshvalues.py):
#!/usr/bin/env python
import re
import sys
import os
import time
from netmiko import ConnectHandler
from threading import Thread
from random import randrange

username="cisco"
password="cisco"

# split a list into chunks of at most sz items
splitlist = lambda lst, sz: [lst[i:i+sz] for i in range(0, len(lst), sz)]

returns = {}
resultoutput={}  # shared results, keyed as "sourceip:destip"
devlist=[]
cmdlist=""

def fetchallvalues(sourceip,sourcelist,delay,cmddelay):
    print ("checking for....."+sourceip)
    cmdend=" repeat 10" # this is to ensure that we ping for 10 packets
    splitsublist=splitlist(sourcelist,6) # each session on the router pings at most 6 destinations
    threads_imagex= []
    for item in splitsublist:
        t = Thread(target=fetchpingvalues, args=(sourceip,item,cmdend,delay,cmddelay,))
        t.start()
        time.sleep(randrange(1,2,1)/20)
        threads_imagex.append(t)
    for t in threads_imagex:
        t.join()

def fetchpingvalues(devip,destips,cmdend,delay,cmddelay):
    global resultoutput
    ttl="0"
    destip="none"
    command=""
    try:
        output=""
        device = ConnectHandler(device_type='cisco_ios', ip=devip, username=username, password=password, global_delay_factor=cmddelay)
        time.sleep(delay)
        device.clear_buffer()
        for destip in destips:
            command="ping "+destip+" source "+devip+cmdend
            output = device.send_command_timing(command,delay_factor=cmddelay)
            if ("round-trip" in output):
                resultoutput[devip+":"+destip]="True"
            else:
                # "Success rate is 0 percent" or any unexpected output counts as unreachable
                resultoutput[devip+":"+destip]="False"
        device.disconnect()
    except Exception:
        # login or command failure: mark every destination of this session as unreachable
        print ("Error connecting to ..."+devip)
        for destip in destips:
            resultoutput[devip+":"+destip]="False"

def getallvalues(allips):
    global resultoutput
    threads_imagex= []
    # one thread per source device
    for item in allips:
        #print ("calling "+item)
        t = Thread(target=fetchallvalues, args=(item,allips,2,1,))
        t.start()
        time.sleep(randrange(1,2,1)/30)
        threads_imagex.append(t)
    for t in threads_imagex:
        t.join()
    dnew=sorted(resultoutput.items())
    return dnew

#print (getallvalues(["192.168.255.240","192.168.255.245","192.168.255.248","192.168.255.249","4.2.2.2"]))
In the preceding code, we have created three main functions, two of which run in threads for parallel execution. getallvalues() receives the list of IP addresses that we want to get the data from, and starts one fetchallvalues() thread per source device. fetchallvalues() in turn splits the destination list into chunks and starts one fetchpingvalues() thread per chunk, again in parallel. fetchpingvalues() logs in to the router, executes the ping commands, and records the results.
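The two-level fan-out can be tried without any routers. The following sketch keeps the same chunking lambda and threading pattern, but replaces the Netmiko session with a hypothetical fake_ping_worker that marks every pair as reachable, so only the structure is demonstrated. The fan_out helper and the 10.0.0.x addresses are also made up for illustration.

```python
from threading import Thread

# Same chunking helper as in the library: lists of at most sz items.
splitlist = lambda lst, sz: [lst[i:i + sz] for i in range(0, len(lst), sz)]

results = {}  # shared between threads, keyed as "source:destination"


def fake_ping_worker(source, destinations):
    """Hypothetical stand-in for fetchpingvalues(); no device is contacted."""
    for destination in destinations:
        results[source + ":" + destination] = "True"


def fan_out(source, all_destinations, chunk_size):
    """Start one thread per chunk of destinations and wait for all of them."""
    threads = []
    for chunk in splitlist(all_destinations, chunk_size):
        t = Thread(target=fake_ping_worker, args=(source, chunk))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()


destinations = ["10.0.0.{}".format(i) for i in range(1, 15)]  # 14 made-up IPs
chunks = splitlist(destinations, 6)
print([len(chunk) for chunk in chunks])

fan_out("10.0.0.1", destinations, 6)
print(len(results))
```

Each thread writes only its own keys into the shared dictionary, so no lock is used in this sketch; a design in which several threads update the same key would need one.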
Let's see the result of this code (by uncommenting the last line, which calls the function). We need to pass the device IPs that we want to validate as a list. In our case, all the valid routers are in the 192.168.255.x range, and 4.2.2.2 is taken as an example of a non-reachable router:
print(getallvalues(["192.168.255.240","192.168.255.245","192.168.255.248","192.168.255.249","4.2.2.2"]))
The preceding code gives the following output:
As we can see in the result, we get the reachability in terms of True or False from each node to the other node.
For example, the first item in the list ('192.168.255.240:192.168.255.240','True') means that the destination 192.168.255.240 is reachable from the source 192.168.255.240 (which is the router's own IP). Similarly, the next item in the same list ('192.168.255.240:192.168.255.245','True') confirms that the destination 192.168.255.245 is reachable by ping from the source IP 192.168.255.240. This information is required to create a matrix based upon the results. Next we see the main code where we fetch these results and create a web-based matrix page.
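To make the structure of these results concrete, here is a minimal sketch that turns the sorted list of tuples into a nested source-to-destination lookup. The helper name to_matrix is hypothetical, the first two sample tuples are the ones quoted in the text, and the third is an illustrative entry for the unreachable address.

```python
def to_matrix(pairs):
    """Turn [('src:dst', 'True'), ...] into {src: {dst: bool}}.

    Assumption: addresses are IPv4, so the key contains exactly one colon.
    """
    matrix = {}
    for key, value in pairs:
        source, destination = key.split(":")
        matrix.setdefault(source, {})[destination] = (value == "True")
    return matrix


sample = [
    ("192.168.255.240:192.168.255.240", "True"),
    ("192.168.255.240:192.168.255.245", "True"),
    ("192.168.255.240:4.2.2.2", "False"),  # illustrative entry
]

matrix = to_matrix(sample)
print(matrix["192.168.255.240"]["192.168.255.245"])
print(matrix["192.168.255.240"]["4.2.2.2"])
```

The main program that follows uses a flat dictionary keyed by the combined source:destination string instead, which is sufficient for rendering one cell at a time.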
Next, we need to create the main file (we're calling it pingmesh.py):
from getmeshvalues import getallvalues

getdevinformation={}
devicenamemapping={}
arraydeviceglobal=[]
pingmeshvalues={}

arraydeviceglobal=["192.168.255.240","192.168.255.245","192.168.255.248","192.168.255.249","4.2.2.2"]

devicenamemapping['192.168.255.240']="R1"
devicenamemapping['192.168.255.245']="R2"
devicenamemapping['192.168.255.248']="R3"
devicenamemapping['192.168.255.249']="R4"
devicenamemapping['4.2.2.2']="Random"

def getmeshvalues():
    global arraydeviceglobal
    global pingmeshvalues
    arraydeviceglobal=sorted(set(arraydeviceglobal))
    tval=getallvalues(arraydeviceglobal)
    pingmeshvalues = dict(tval)

getmeshvalues()

def createhtml():
    global arraydeviceglobal
    fopen=open(r"C:\pingmesh\pingmesh.html","w") ### this needs to be changed as web path of the html location
    # the page reloads itself every 60 seconds
    head="""<html><head><meta http-equiv="refresh" content="60" ></head>"""
    head=head+"""<script type="text/javascript">
function updatetime() {
    var x = new Date(document.lastModified);
    document.getElementById("modified").innerHTML = "Last Modified: "+x+" ";
}
</script>"""+"<body onLoad='updatetime();'>"
    head=head+"<div style='display: inline-block;float: right;font-size: 80%'><h4><p id='modified'></p></h4></div>"
    head=head+"<div style='display: inline-block;float: left;font-size: 90%'><center><h2>Network Health Dashboard</h2></center></div>"
    head=head+"<br><div><table border='1' align='center'><caption><b>Ping Matrix</b></caption>"
    head=head+"<center><br><br><br><br><br><br><br><br>"
    fopen.write(head)
    dval=""
    # header row: one column per destination device
    fopen.write("<tr><td>Devices</td>")
    for fromdevice in arraydeviceglobal:
        fopen.write("<td><b>"+devicenamemapping[fromdevice]+"</b></td>")
    fopen.write("</tr>")
    # one row per source device
    for fromdevice in arraydeviceglobal:
        fopen.write("<tr>")
        fopen.write("<td><b>"+devicenamemapping[fromdevice]+"</b></td>")
        for todevice in arraydeviceglobal:
            askvalue=fromdevice+":"+todevice
            # a pair with no recorded result is treated as unreachable
            cellvalue=pingmeshvalues.get(askvalue,"False")
            bgcolor='lime'
            if (cellvalue == "False"):
                bgcolor='salmon'
            fopen.write("<td align='center' font size='2' height='2' width='2' bgcolor='"+bgcolor+"' title='"+askvalue+"'>"+"<font color='white'><b>"+cellvalue+"</b></font></td>")
        fopen.write("</tr>\n")
    fopen.write("</table></div></body></html>")
    fopen.close()

createhtml()
print("All done!!!!")
In this case, we have the following mappings in place:
devicenamemapping['192.168.255.240']="R1"
devicenamemapping['192.168.255.245']="R2"
devicenamemapping['192.168.255.248']="R3"
devicenamemapping['192.168.255.249']="R4"
devicenamemapping['4.2.2.2']="Random"
The last device, named Random, is a test device that is not in our network and is deliberately unreachable. Once executed, the script creates a file named pingmesh.html with standard HTML formatting and a last-modified timestamp (set from JavaScript) to confirm when the last refresh occurred. This is required if we want the script to be executed from the task scheduler (let's say every five minutes), so that anybody opening the HTML page will know when the probe occurred. The HTML file needs to be placed or saved in a folder which is mapped to a web folder so that it can be accessed using the URL http://<server>/pingmesh.html.
When executed, here is the output from the Python script:
The HTML file, when placed in the web-mapped URL and called, looks like this:
As we can see, in the PingMatrix there is an entire red row and column, which means that there is no connectivity from any router to the Random router, or from the Random router to any router. Green means that connectivity between all the other routers is fine.
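The colour of each cell follows directly from its True or False value. A minimal sketch of that mapping, isolated into a hypothetical render_cell helper so that it can be checked without writing a file, is shown here; it reuses the lime and salmon colours and the title attribute from pingmesh.py, and additionally escapes the values with the standard library's html.escape.

```python
from html import escape


def render_cell(source_ip, destination_ip, value):
    """Return one <td> for the matrix; value is the string "True" or "False"."""
    bgcolor = "lime" if value == "True" else "salmon"
    # The title attribute carries the source:destination pair for this cell.
    title = escape(source_ip + ":" + destination_ip, quote=True)
    return ("<td align='center' bgcolor='{}' title='{}'>"
            "<font color='white'><b>{}</b></font></td>").format(
                bgcolor, title, escape(value))


good_cell = render_cell("192.168.255.240", "192.168.255.245", "True")
bad_cell = render_cell("192.168.255.240", "4.2.2.2", "False")
print(good_cell)
print(bad_cell)
```

Keeping the rendering in one function makes it straightforward to add further colour codes later, for example an amber cell for partial packet loss.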
Additionally, we have also configured a tooltip on each cell, and hovering the mouse over that specific cell would also show the source and destination IP address mapping for that particular cell, as shown in the following screenshot:
Let's see another screenshot, in which we shut down R2 to make it unreachable:
Now, as we can see, the entire row and column of R2 is red, and hence the PingMatrix shows that R2 is now unreachable from everywhere else, and R2 also cannot reach anyone else in the network.
Let's see a final example, in which for test purposes we intentionally block the ping traffic from R2 to R4 (and vice versa) using an extended Cisco ACL, which in turn reports that R4 and R2 have reachability issues in the PingMatrix:
As we can see in the preceding screenshot, the Random router is still red (False), since it is not in our network, but the matrix now also shows red (False) between R2 and R4 and between R4 and R2. This gives us a quick view that, even with multiple paths available between the nodes, we have a connectivity issue between these two nodes.
Going by the preceding examples, we can enhance the tool to easily monitor and understand any routing/reachability issues, or even link down connectivity problems using a holistic view of all of the connections in our network. PingMesh/Matrix can be extended to check latency, and even packet drops in each connection between various nodes. Additionally, using syslog or email functionality (specific Python libraries are available for sending syslog messages from Python or even emails from Python code), alerts or tickets can also be generated in case of failures detected or high latency observed from the Python script itself.
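As a rough sketch of the latency and packet-drop extension, the summary line of the ping output can be parsed with a regular expression. This assumes the Cisco IOS summary format (Success rate is N percent (received/sent), round-trip min/avg/max = a/b/c ms); the function name parse_ping_summary and the sample strings are hypothetical.

```python
import re

# The round-trip part is optional because it is absent when every packet is lost.
PING_SUMMARY = re.compile(
    r"Success rate is (\d+) percent \((\d+)/(\d+)\)"
    r"(?:, round-trip min/avg/max = (\d+)/(\d+)/(\d+) ms)?"
)


def parse_ping_summary(output):
    """Return success percentage, lost packet count, and average latency.

    Returns None when no summary line is found in the output.
    """
    match = PING_SUMMARY.search(output)
    if match is None:
        return None
    percent, received, sent = (int(match.group(i)) for i in (1, 2, 3))
    avg_ms = int(match.group(5)) if match.group(5) is not None else None
    return {"success_percent": percent, "lost": sent - received, "avg_ms": avg_ms}


# Illustrative sample strings, not captured from a real device.
partial = parse_ping_summary(
    "Success rate is 80 percent (8/10), round-trip min/avg/max = 1/2/4 ms")
failed = parse_ping_summary("Success rate is 0 percent (0/10)")
print(partial)
print(failed)
```

The returned dictionary could replace the plain True or False string stored for each source and destination pair, so that the matrix can colour a cell by latency or loss threshold as well as by reachability.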
This tool can easily become a central monitoring tool in any organization, and based upon patterns (such as green or red, and other color codes if needed), engineers can make decisions on the actual issues and take proactive actions instead of reactive actions to ensure the high reliability and uptime of the network.