Pandas for Network Engineers (Who doesn’t love Pandas? )

The module not the mammal!

My original title for this article was going to be *Decomposing Pandas* as a follow on to *Decomposing Data Structures* but I was advised against that name. Go figure.

One of the things I love most about Python is that its always waiting for me to get just a little bit better so it can show me a slightly smarter way to do something. Pandas is the latest such example.

Pandas is a powerful data science Python library that excels at manipulating multidimensional data.

Why is this even remotely interesting to me as a network engineer?

Well, thats what Excel does, right?

I spend more time than I care to admit processing data in Excel. I find that Excel is always the lowest common denominator. I understand why and often I’m a culprit myself but eventually one grows weary of all the data being in a spreadsheet and having to manipulate it. I’m working on the former and Pandas is helping on the latter.

Google around enough for help on processing spreadsheets and you will come across references to the Pandas Python module.

If you are anything like me, you go through some or all of these stages:

  • You dismiss it as irrelevant to what you are trying to do
  • You dismiss it because its seems to be about big data, analytics, and scientific analysis of data (not your thing right?)
  • As you continue to struggle with what got you here in the first place (there has got to be a better way to deal with this spreadsheet data) you reconsider. So you try to do some processing in Pandas and pull a mental muscle…and what the heck is this NaN thing that keeps making my program crash? Basically, you find yourself way way out of your comfort zone (well..I did)!
  • You determine that your limited Python skills are not up to something quite this complex…after all, you know just enough Python to do the automation stuff you need to do and you are not a data scientist.

Finally, in a fit of desperation as you see all the Excel files you have to process, you decide that a python module is not going to get the better of you and you give it another go!

So here I am, on the other side of that brain sprain, and better for it, as is usually the case.

What is possible with Pandas…

Once you get the hang of it, manipulating spreadsheet-like data sets becomes so much simpler with Pandas. In fact, thats true for any data set, not just ones from spreadsheets. In fact, in the examples below, the data set comes from parsing show commands with TextFSM.

Knowing how to work with Pandas, even in a limited fashion as is the case with me, is going to be a handy skill to have for any Network Engineer who is (or is trying to become) conversant in programmability & automation.

My goal here is not to teach you Pandas as there is quite alot of excellent material out there to do that. I’ve highlighted the content which helped me the most in the “Study Guide” section at the end.

My goal is to share what I’ve been able to do with it as a Network Engineer, what I found most useful as I tried to wrap my head around it, and my own REPL work.

Lets look at something simple. I need to get the ARP table from a device and “interrogate” the data.

In this example, I have a text file with the output of the “show ip arp” command which I’ve parsed with TextFSM.

Here is the raw data returned from the TextFSM parsing script:

 # Executing textfsm strainer function only to get data
  strained, strainer = basic_textfsm.textfsm_strainer(template_file, output_file, debug=False)

In [1]: strained                                                                                                                                                                                                            
Out[1]:
[['Internet', '10.1.10.1', '5', '28c6.8ee1.659b', 'ARPA', 'Vlan1'],
['Internet', '10.1.10.11', '4', '6400.6a64.f5ca', 'ARPA', 'Vlan1'],
['Internet', '10.1.10.10', '172', '0018.7149.5160', 'ARPA', 'Vlan1'],
['Internet', '10.1.10.21', '0', 'a860.b603.421c', 'ARPA', 'Vlan1'],
['Internet', '10.1.10.37', '18', 'a4c3.f047.4528', 'ARPA', 'Vlan1'],
['Internet', '10.10.101.1', '-', '0018.b9b5.93c2', 'ARPA', 'Vlan101'],
['Internet', '10.10.100.1', '-', '0018.b9b5.93c1', 'ARPA', 'Vlan100'],
['Internet', '10.1.10.102', '-', '0018.b9b5.93c0', 'ARPA', 'Vlan1'],
['Internet', '71.103.129.220', '4', '28c6.8ee1.6599', 'ARPA', 'Vlan1'],
['Internet', '10.1.10.170', '0', '000c.294f.a20b', 'ARPA', 'Vlan1'],
['Internet', '10.1.10.181', '0', '000c.298c.d663', 'ARPA', 'Vlan1']]

Note: don’t read anything into the variable name strained. The function I use to parse the data is called textfsm_strainer because I “strain” the data through TextFSM to get structured data out of it so I put the resulting parsed data from that function into a variable called “strained”.

Here is that data in a Pandas Data Frame:

# strained is the parsed data from my TextFSM function and the first command below
# loads that parsed data into a Pandas Data Frame called "df"
​
In [1]: df = pd.DataFrame(strained, columns=strainer.header)                                                                                                                                                                                                           
In [2]: df                                                                                                                                                                                                                                                      
Out[2]: 
​
    PROTOCOL         ADDRESS  AGE             MAC  TYPE INTERFACE
0   Internet       10.1.10.1    5  28c6.8ee1.659b  ARPA     Vlan1
1   Internet      10.1.10.11    4  6400.6a64.f5ca  ARPA     Vlan1
2   Internet      10.1.10.10  172  0018.7149.5160  ARPA     Vlan1
3   Internet      10.1.10.21    0  a860.b603.421c  ARPA     Vlan1
4   Internet      10.1.10.37   18  a4c3.f047.4528  ARPA     Vlan1
5   Internet     10.10.101.1    -  0018.b9b5.93c2  ARPA   Vlan101
6   Internet     10.10.100.1    -  0018.b9b5.93c1  ARPA   Vlan100
7   Internet     10.1.10.102    -  0018.b9b5.93c0  ARPA     Vlan1
8   Internet  71.103.129.220    4  28c6.8ee1.6599  ARPA     Vlan1
9   Internet     10.1.10.170    0  000c.294f.a20b  ARPA     Vlan1
10  Internet     10.1.10.181    0  000c.298c.d663  ARPA     Vlan1

I now have a spreadsheet like data structure with columns and rows that I can query and manipulate.


My first question:

What are all the IPs in Vlan1?

Just Python

Before Pandas, I would initialize an empty list to hold the one or more IPs and then I would iterate through the data structure (strained in this example) and where the interface “column” value (which in this list of lists in the strained variable is at index 5) was equal to ‘Vlan1’ I appended that IP to the list. The IP is in index 1 in each item the strained list.

# Using Python Only
print("\n\tUsing Python only..")
vlan1ips = []
for line in strained:
    if line[5] == 'Vlan1':
        vlan1ips.append(line[1])
print(f"{vlan1ips}")

The resulting output would look something like this:

['10.1.10.1', '10.1.10.11', '10.1.10.10', '10.1.10.21', '10.1.10.37', '10.1.10.102', '71.103.129.220', '10.1.10.170', '10.1.10.181']

Python and Pandas

Using a Pandas data frame df to hold the parsed data:

pandas_vlan1ips = df['ADDRESS'].loc[df['INTERFACE'] == 'Vlan1'].values

The resulting output from the one liner above would look something like this:

 ['10.1.10.1' '10.1.10.11' '10.1.10.10' '10.1.10.21' '10.1.10.37'
'10.1.10.102' '71.103.129.220' '10.1.10.170' '10.1.10.181']

Same output with a single command!

Python List Comprehension

For those more conversant with Python, you could say that list comprehension is just as efficient.

# Using list comprehension
print("Using Python List Comprehension...")
lc_vlan1ips = [line[1] for line in strained if line[5] == 'Vlan1' ]

Results in:

Using List Comprehension: 
['10.1.10.1', '10.1.10.11', '10.1.10.10', '10.1.10.21', '10.1.10.37', '10.1.10.102', '71.103.129.220', '10.1.10.170', '10.1.10.181']

So yes..list comprehension gets us down to one line but I find it a bit obscure to read and a week later I will have no idea what is in line[5] or line[1].

I could turn the data into a list of dictionaries so that rather than using the positional indexes in a list I could turn line[1] into line[‘IP_ADDRESS’] and line[5] into line[‘INTERFACE’] which would make reading the list comprehension and the basic python easier but now we’ve added lines to the script.

Finally, Yes its one line but I’m still iterating over the data.

Pandas is set up to do all the iteration for me and lets me refer to data by name or by position “out of the box” and without any extra steps.

Lets decompose the one line of code:

If you think of this expression as a filter sandwich, the df[‘ADDRESS’] and .values are the bread and the middle .loc[df[‘INTERFACE’]] == ‘Vlan1’] part that filters is the main ingredient.

Without the middle part you would have a Pandas Series or list of all the IPs in the ARP table. Basically you get the entire contents of the ‘ADDRESS” column in the data frame without any filtering.

When you “qualify” df[‘ADDRESS’] with .loc[df[‘INTERFACE’]] == ‘Vlan1’] you filter the ADDRESS column in the data frame for just those records where INTERFACE is ‘Vlan1’ and you only return the IP values by using the .values method.

Now, this will return a numpy.ndarray which might be great for some subsequent statistical analysis but as network engineers our needs are simple.

I’m using iPython in the examples below as you can see from the “In” and “Out” line prefixes.

In [1]: pandas_vlan1ips = df['ADDRESS'].loc[df['INTERFACE'] == 'Vlan1'].values

In [2]: type(pandas_vlan1ips) Out[2]: numpy.ndarray

I would like my list back as an actual python list and thats no problem for Pandas.

pandas-vlan1ips-list-2019-12-30_07-46-43

In [3]: pandas_vlan1ips = df['ADDRESS'].loc[df['INTERFACE'] == 'Vlan1'].to_list()

In [4]: type(pandas_vlan1ips) Out[4]: list

In [5]: pandas_vlan1ips Out[5]:` `['10.1.10.1',` `'10.1.10.11',` `'10.1.10.10',` `'10.1.10.21',` `'10.1.10.37',` `'10.1.10.102',` `'71.103.129.220',` `'10.1.10.170',` `'10.1.10.181']

You know what would be really handy? A list of dictionaries where I can reference both the IP ADDRESS and the MAC as keys.

In [5]: vlan1ipmac_ldict = df[['ADDRESS', 'MAC']].to_dict(orient='records')

In [6]: type(vlan1ipmac_ldict) Out[6]: list

In [7]: vlan1ipmac_ldict Out[7]:` `[{'ADDRESS': '10.1.10.1', 'MAC': '28c6.8ee1.659b'},` `{'ADDRESS': '10.1.10.11', 'MAC': '6400.6a64.f5ca'},` `{'ADDRESS': '10.1.10.10', 'MAC': '0018.7149.5160'},` `{'ADDRESS': '10.1.10.21', 'MAC': 'a860.b603.421c'},` `{'ADDRESS': '10.1.10.37', 'MAC': 'a4c3.f047.4528'},` `{'ADDRESS': '10.10.101.1', 'MAC': '0018.b9b5.93c2'},` `{'ADDRESS': '10.10.100.1', 'MAC': '0018.b9b5.93c1'},` `{'ADDRESS': '10.1.10.102', 'MAC': '0018.b9b5.93c0'},` `{'ADDRESS': '71.103.129.220', 'MAC': '28c6.8ee1.6599'},` `{'ADDRESS': '10.1.10.170', 'MAC': '000c.294f.a20b'},` `{'ADDRESS': '10.1.10.181', 'MAC': '000c.298c.d663'}]

In [8]: len(vlan1ipmac_ldict) Out[8]: 11

MAC address Lookup

Not impressed yet. Let see what else we can do with this Data Frame.

I have a small function that performs MAC address lookups to get the Vendor OUI.

This function is called get_oui_macvendors() and you pass it a MAC address and it returns the vendor name.

It uses the MacVendors.co API.

I’d like to add a column of data to our Data Frame with the Vendor OUI for each MAC address.

In the one line below, I’ve added a column to the data frame titled ‘OUI’ and populated its value by performing a lookup on each MAC and using the result from the get_oui_macvendors function.

df['OUI'] = df['MAC'].map(get_oui_macvendors)

The left side of the equation references a column in the data Fram which does not exist so it will be added.

The right side takes the existing MAC column in the data frame and takes each MAC address and runs it through the get_oui_macvendors function to get the Vendor OUI and “maps” that result into the new OUI “cell” for that MACs row in the data frame.

pandas-newcolumn diagram to show what is happening under the hood in the one line command to ad a coloumn

Now we have an updated Data Frame with a new OUI column giving the vendor code for each Mac.

In [1]: df                                                                                                                                                                                                                                                      
 Out[1]: 
     PROTOCOL         ADDRESS  AGE             MAC  TYPE INTERFACE                 OUI
 0   Internet       10.1.10.1    5  28c6.8ee1.659b  ARPA     Vlan1             NETGEAR
 1   Internet      10.1.10.11    4  6400.6a64.f5ca  ARPA     Vlan1           Dell Inc.
 2   Internet      10.1.10.10  172  0018.7149.5160  ARPA     Vlan1     Hewlett Packard
 3   Internet      10.1.10.21    0  a860.b603.421c  ARPA     Vlan1         Apple, Inc.
 4   Internet      10.1.10.37   18  a4c3.f047.4528  ARPA     Vlan1     Intel Corporate
 5   Internet     10.10.101.1    -  0018.b9b5.93c2  ARPA   Vlan101  Cisco Systems, Inc
 6   Internet     10.10.100.1    -  0018.b9b5.93c1  ARPA   Vlan100  Cisco Systems, Inc
 7   Internet     10.1.10.102    -  0018.b9b5.93c0  ARPA     Vlan1  Cisco Systems, Inc
 8   Internet  71.103.129.220    4  28c6.8ee1.6599  ARPA     Vlan1             NETGEAR
 9   Internet     10.1.10.170    0  000c.294f.a20b  ARPA     Vlan1        VMware, Inc.
 10  Internet     10.1.10.181    0  000c.298c.d663  ARPA     Vlan1        VMware, Inc.

More questions

Lets interrogate our data set further.

I want a unique list of all the INTERFACE values.

In [3]: df['INTERFACE'].unique()                                                                                                                                                                                                                                
 Out[3]: array(['Vlan1', 'Vlan101', 'Vlan100'], dtype=object)

How about “Give me a total count of each of the unique INTERFACE values?”

In [4]: df.groupby('INTERFACE').size()                                                                                                                                                                                                                          
 Out[4]: 
 INTERFACE
 Vlan1      9
 Vlan100    1
 Vlan101    1
 dtype: int64

Lets take it down a level and get unique totals based on INTERFACE and vendor OUI.

In [2]: df.groupby(['INTERFACE','OUI']).size()                                                                                                                                                                                                                  
 Out[2]: 
 INTERFACE  OUI               
 Vlan1      Apple, Inc.           1
            Cisco Systems, Inc    1
            Dell Inc.             1
            Hewlett Packard       1
            Intel Corporate       1
            NETGEAR               2
            VMware, Inc.          2
 Vlan100    Cisco Systems, Inc    1
 Vlan101    Cisco Systems, Inc    1
 dtype: int64

I could do this all day long!

Conclusion

I’ve just scratched the surface of what Pandas can do and I hope some of the examples I’ve shown above illustrate why investing in learning how to use data frames could be very beneficial. Filtering, getting unique values with counts, even Pivot Tables are possible with Pandas.

Don’t be discouraged by its seeming complexity like I was.

Don’t discount it because it does not seem to be applicable to what you are trying to do as a Network Engineer, like I did. I hope I’ve shown how very wrong I was and that it is very applicable.

In fact, this small example and some of the other content in this repository comes from an actual use case.

I’m involved in several large refresh projects and our workflow is what you would expect.

  1. Snapshot the environment before you change out the equipment
  2. Perform some basic reachability tests
  3. Replace the equipment (switches in this case)
  4. Perform basic reachability tests again
  5. Compare PRE and POST state and confirm that all the devices you had just before you started are back on the network.
  6. Troubleshoot as needed

As you can see if you delve into this repository, its heavy on APR and MAC data manipulation so that we can automate most of the workflow I’ve described above. Could I have done it without Pandas? Yes. Could I have done it as quickly and efficiently with code that I will have some shot of understanding in a month without Pandas? No.

I hope I’ve either put Pandas on your radar as a possible tool to use in the future or actually gotten you curious enough to take the next steps.

I really hope that the latter is the case and I encourage you to just dive in.

The companion repository on GitHub is intended to help and give you examples.


Next Steps

The “Study Guide” links below have some very good and clear content to get you started. Of all the content out there, these resources were the most helpful for me.

Let me also say that it took a focused effort to get the point where I was doing useful work with Pandas and I’ve only just scratched the surface. I was worth every minute! What I have described here and in this repository are the things that were useful for me as a Network Engineer.

Once you’ve gone through the Study Guide links and any others that you have found, you can return to this repository to see examples. In particular, this repository contains a Python script called arp_interrogate.py.

It goes through loading the ARP data from the “show ip arp” command, parsing it, and creating a Pandas Data Frame.

It then goes through a variety of questions (some of which you have seen above) to show how the Data Frame can be “interrogated” to get to information that might prove useful.

There are comments throughout which are reminders for me and which may be useful to you.

The script is designed to run with data in the repository by default but you can pass it your own “show ip arp” output with the -o option.

Using the -i option will drop you into iPython with all of the data still in memory for you to use. This will allow you to interrogate the data in the Data Frame yourself..

If you would like to use it make sure you clone or download the repository and set up the expected environment.

Options for the arp_interrogate.py script:

(pandas) Claudias-iMac:pandas_neteng claudia$ python arp_interrogate.py -h
usage: arp_interrogate.py [-h] [-t TEMPLATE_FILE] [-o OUTPUT_FILE] [-v]
                        [-f FILENAME] [-s] [-i] [-c]

Script Description

optional arguments:
-h, --help           show this help message and exit
-t TEMPLATE_FILE, --template_file TEMPLATE_FILE
                      TextFSM Template File
-o OUTPUT_FILE, --output_file OUTPUT_FILE
                      Full path to file with show command show ip arp output
-v, --verbose         Enable all of the extra print statements used to
                      investigate the results
-f FILENAME, --filename FILENAME
                      Resulting device data parsed output file name suffix
-s, --save           Save Parsed output in TXT, JSON, YAML, and CSV Formats
-i, --interactive     Drop into iPython
-c, --comparison     Show Comparison

Usage: ' python arp_interrogate.py Will run with default data in the
repository'
(pandas) Claudias-iMac:pandas_neteng claudia$

Study Guide

A Quick Introduction to the “Pandas” Python Library

https://towardsdatascience.com/a-quick-introduction-to-the-pandas-python-library-f1b678f34673

For me this is the class that made all the other classes start to make sense.

Note that this class is not Free.

Pandas Fundamentals by Paweł Kordek on PluralSight is exceptionally good.

There is quite alot to Pandas and it can be overwhelming (at least it was for me) but this course in particular got me working very quickly and explained things in a very clear way.

Python Pandas Tutorial 2: Dataframe Basics by codebasics <- good for Pandas operations and set_index

Python Pandas Tutorial 5: Handle Missing Data: fillna, dropna, interpolate by codebasics

Python Pandas Tutorial 6. Handle Missing Data: replace function by codebasics

Real Python <- this is terrific resource for learning Python

There is a lot of content here. Explore at will. The two below I found particularly helpful.

https://realpython.com/search?q=pandas

Intro to DataFrames by Joe James <–great ‘cheatsheet’



What others have shared…

Analyzing Wireshark Data with Pandas


Disclaimer

THE SOFTWARE in the mentioned repository IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NON INFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Decomposing Data Structures

Whether you are trying to find all the tenants in an ACI fabric, or all the interface IPs and descriptions on a network device, or trying to determine if the Earth is in imminent danger from an asteroid hurtling towards it, understanding complex data structures is a critical skill for anyone working with modern IT infrastructure technology.

As APIs become more and more prevalent, acting on an object (device, controller, cloud-based service, management system, etc.) will return structured data in a format you may have never seen before. For beginners, this may be a little daunting but once you get a few basics down you will be decomposing data structures at dinner parties!

Most modern APIs return data in XML or JSON format. Some give you the option to choose. We won’t spend any time on how you get this structured data. That will be a topic for another day. The focus today is how to interpret and manipulate data returned to you in JSON. I’m not a fan of XML so if I ever run into an API that only returns XML (odds are you will too) I do my very best to convert it to JSON first.

Lets get some basics out of the way. Curly braces {}, square brackets [], and spacing (whitespace) provide a syntax for this returned data. If you know a little Python these will be familiar to you.

Symbol


Meaning
[ ]LIST
Square brackets denote a list of values will be found between the opening [ and closing ] brackets separated by commas List elements are referenced by the number of its position in the list so that in a list like my_list = [1,2,3,4], if you want the information in the 3rd position (the third element) which is the number 3 you say my_list[2] because lists are zero indexed and so my_list[0] is the number 1, my_list[1] is the number 2, etc.
{ }DICTIONARY
Curly braces denote a list of key value pairs will be found between the opening { and closing } braces separated by commas with the key and value pair separated by a colon : key: value Dictionary key:value pairs are reference by the key so that in a dictionary like my_dict = {‘la’: ‘Dodgers’, ‘sf’: ‘Giants’} if you want to know the baseball team in LA you reference my_dict[‘la’].

Lists look like this:

[
item1,
item2,
item3
]

or (equally valid)

[item1,item2,item3]

Dictionaries look like this:

{
key:value,
otherkey:othervalue
}

or (equally valid)

{key:value, otherkey:othervalue}

The other thing to note is that when these structures are formatted, spacing is important and can tell you a lot about the hierarchy.

The examples above are not quoted and they should be. I didn’t because I wanted to highlight that single and double quotes can be used in Python but the JSON standard requires double quotes.

Its also important to understand that these two basic structures can be combined so that you can have a list of dictionaries or a dictionary of key:value pairs where the value is a list or a dictionary. You can have many levels and combinations and often that is what makes the data structure seem complex.

Lets start with a simple example.

Below is the output of an Ansible playbook that queries an ACI fabric for the configured tenants.

Cisco ACI Tenatn Query Playbook output
ACI Tenant Query Playbook – output

Lets look at the data that was returned, which is highlighted in a big yellow outer box in the image below. Also highlighted are the syntax symbols that will let us decompose this structure. As you can see from a, the entire structure is encased in curly braces {}. That tells us that the outermost structure is a dictionary and we know that dictionaries provide us with a key (think of this as an index value you use to get to the data in the value part) followed by a colon : followed by a value. In this case, we have a dictionary with one element or key:value pair and the value is a list. We know it is a list from b, which shows left square bracket immediately following the colon and denoting the start of a list structure.

{ "key" : "value" } ( where value is a list structure [])
{"tenantlist": ["common", "mgmt", "infra", etc..]}

Building a Custom TextFSM Template
ACI Tenant Query Playbook – output with annotations


This is a simple structure. It is short, so you can see the entire data structure in one view, but it has both types of data structures that you will encounter, a list and a dictionary.


Lets look at something a bit more complex.

REST COUNTRIES provides a public REST API for obtaining information about countries.

Check them out!

Using Postman, I submitted a query about Kenya. Below are the first 25 or so lines of the data returned.

Noting the colored letters and line numbers below in the image:

a. Line 1 shows us the left square bracket (outlined in a green box) which tells us this is a list. That is the outermost structure so you know that to get to the first element you will need to use a list reference like mylist[0] for the first element.

b. Line 2 shows us the left curly brace (outlined in a yellow box) which indicates that what follows as the first element of the list is a dictionary. Line 3 is a key:value pair where the key “name” has a value of “Kenya”.

c. Line 4 is a key:value pair but the value of key “topLevelDomain” is a list (of one element “.ke”).
Line7 returns us to a simple key:value pair.

Structured data returned from REST Countries API query
Structured data returned from REST Countries API query

Here is where it can start getting confusing. Remembering our reference rules..that is:

  • you reference a list element by its zero indexed positional number and
  • you reference the value of a dictionary element by its key

Don’t be distracted by the data within the syntax symbols just yet. If you see something like [ {},{},{} ], (ignoring the contents inside the braces and brackets) you should see that this is a list of three elements and those elements are dictionaries. Assuming alist = [ {},{},{} ] you access the first element (dictionary) with alist[0].

Going a step further, if you see this structure [ {"key1":[]},{"a_key": "blue"},{"listkey": [1,2,3]} ], you already know its a list of dictionaries. Now you can also see that the first element in the list is a dictionary with a single key:value pair and the key is “key1” and the value is an empty list. The second element in the list is also a dictionary with a single key:value pair with a key of “a_key” and a value of a string “blue”. I’ll leave you to describe the third element in the list.

Assuming my_list = [ {"key1":[]},{"a_key": "blue"},{"listkey": [1,2,3]} ], if I wanted to pull out the string “blue” I would reference my_color = my_list[1]["a_key"] and the variable my_color would be equal to “blue”. The string “blue” is the value of the second dictionary in the list. Remembering that list elements start at “0” (zero indexed), you need to access the element in the second position with [1]. To get to “blue” you have to use the key of “a_key” and so you have mylist[1][“a_key”] which will give you “blue”.

Lets try to extract the two letter country code for Kenya.

So I’m going to cheat a little here to introduce you to the concept of digging in to the data structure to “pluck” out the data you want. You really want to understand the entire data structure before doing this so I’m going to assume that this is a list with one element and that element is a dictionary of key value pairs with different information about Kenya.

That being the case, first I have to reference the element in the list.

  • The list has one element so I know I have to reference it with the zero index [0] (the first and only element in the list)
  • Next I have to pluck out the 2 letter country code for Kenya and that is in a key:value pair with the key of 'alpha2code'

So assuming we have a variable country_info that has our data structure, then to reference the 2 letter country code I would need to use

country_info[0]["alpha2Code"]

That reference structure above would return “KE”.

[0] takes us one level deep into the first element of the list. This is where the dictionary (note the curly brace in line 2 below) can be accessed. At this level we can access the key we need “alpha2Code” to get the 2 letter country code.

country_info =

Extracting the Country Code from the JSON output
Extracting the Country Code from the JSON output

Lets build on this. What if I need the country calling code so I can make a phone call to someone in Kenya from another country? For this, we need to go a level deeper as the country code is in a key named "callingCodes" at the same level of "alpha2Code" but the value is a list rather than a variable. See lines 9 – 11 in the image above. We know how to reference a list, so in this case, if I wanted the first country code in the list my reference structure would look like:

country_info[0]["callingCodes"][0]

That would return “254” (a string).

In many cases, you might want the entire list and so to get that:

country_info[0]["callingCodes"]

That would return [“254”] as a list (yes a list with only one element but a list because its enclosed in square brackets). There are cases where you may want to do some specific manipulation and you need the entire list.

Extra: In the companion GitHub repository to this post there is a quick & dirty Python3 script country_info_rest.py that will let you get some country data and save it to a file. There is also an optional “decompose function” that you can tailor to your needs to get a feel for decomposing the data structure via a script.

(generic_py3_env) Claudias-iMac:claudia$ python country_info_rest.py -h
usage: country_info_rest.py [-h] [-n CNAME] [-d]
​
Call REST Countries REST API with a country name.
​
optional arguments:
  -h, --help            show this help message and exit
  -n CNAME, --cname CNAME
                        Country Name to override default (Mexico)
  -d, --decompose       Execute a function to help decompose the response
​
Usage: 'python country_info_rest.py' without the --cname argument the script
will use the default country name of Mexico. Usage with optional name
parameter: 'python country_info_rest.py -n Singapore'. Note: this is a python3
script.
​​

Lets look at something really complex.

Now a word about “complex”. At this point I’m hoping you can start to see the pattern. It is a pattern of understanding the “breadcrumbs” that you need to follow to get to the data you want. You now know all the “breadcrumb” formats:

  • [positional number] for lists and
  • [“key”] for dictionaries

From here on out its more of the same. Lets see this in action with the data returned from the NASA Asetroids – Near Earth Object Web Service.

But first, here is why I called this “really complex”. A better term might be “long with scary data at first glance”. At least it was for me because when I first ran my first successful Ansible playbook and finally got data back it was like opening a shoe box with my favorite pair of shoes in it and finding a spider in the box…yes, shrieking “what IS that!” and jumping away.

I hope that by this point there is no shrieking but more of a quizzical..”Hmmm OK, I see the outer dictionary with a key of “asteroid_output” and a value of another dictionary with quite alot of keys and some odd looking stuff…”. Lets get into it!

…and if there is shrieking then I hope this helps get you on your way to where its quieter.

Raw output from an Ansible Playbook querying the NASA Near Earth Web Service
Raw output from an Ansible Playbook

I want to pluck out the diameter of the asteroid as well as something that tells me if there is any danger to Earth.

Somewhere, in the middle of all of this output, is this section which has the data we want but where is it in reference to the start of the entire data structure? Where are the breadcrumbs? Hard to tell… or is it?

Information we need to get to within the data structure returned by the Ansible playbook

You can visually walk the data but for data structures of this size and with many levels of hierarchy it can be time consuming and a bit daunting until you get the hang of it. There are a number of approaches I’ve tried including:

  1. visually inspecting it (good for 25 lines or less….if you cannot fit it on a single page to “eyeball” it try one of the other methods below…I promise you it will save you time)
  2. saving the output to a text file and opening it up in a modern text editor or IDE so that you can inspect and collapse sections to get a better understanding of the structure
  3. using a Python script or Ansible playbook to decompose by trial and error
  4. using a JSON editor to convert to a more readable structure and to interpret the data structure for you

I don’t recommend the first approach at all unless your data is like our first example or you combine it with the Python (or Ansible) trial and error approach but this can be time consuming. I do have to recommend doing it this way once because it really helps you understand what is going on.

Using a good advanced editor (*not* Notepad.exe) or IDE (Integrated Development Environment) is a good approach but for something that makes my eyes cross like the output above I use a JSON editor.

In the two sections below I’ll show you a bit more detail on approach #2 and #4. Play around with the companion GitHub repository for an example of approach #3.

Asteroid Data collapsed down in Sublime Text Editor

Note that this has already been collapsed down to the value of the key asteroid_output so the outer dictionary is already stripped off. In this view it looks a bit more manageable and the values we want can be found at the level shown below:

asteroid_output["json"]["near_earth_objects"][<date_key>]

where <date_key> can be any of the 8 date keys found in line 23, line 858, line 1154 etc. The gap in the line numbers give you a sense of how much data we’ve collapsed down but I hope you can begin to see how that makes it easier to start understanding you how you need to walk the path to the data you want.

asteroid_output =

Expanding one of the date keys as shown in the next image shows us how we might start to get to the data we want.

asteroid_output["json"]["near_earth_objects"]["2019-07-07"]

The date key we expanded, "2019-07-07", has a value that is a list. If we take the first element of that list we can get the estimated diameter in feet and the boolean value of “are we to go the way of the dinosaurs” or technically they value of key "is_potentially_hazardous_asteroid".

Estimated maximum diameter in feet:

asteroid_output["json"]["near_earth_objects"]["2019-07-07"][0]["estimated_diameter"]["feet"]["estimated_diameter_max"]

Is this going to be an extinction level event?:

asteroid_output["json"]["near_earth_objects"]["2019-07-07"][0]["is_potentially_hazardous_asteroid"]

Which will give us false (for that one date anyway :D).

Using a good text editor or IDE to investigate a data structure by expanding
Using a good text editor or IDE to investigate a data structure by expanding

Using JSON tools to decompose your data structure

It is here I must confess that these days if I can’t visually figure out or “eyeball” the “breadcrumbs” I need to use to get to the data I want, I immediately go to this approach. Invariably I think I can “eyeball” it and miss a level.

If I’m working with non-sensitive data JSON Editor Online is my personal favorite.

  1. I copy the output and paste it into the left window,
  2. click to analyze and format into the right window, and
  3. then I collapse and expand to explore the data structure and figure out the breadcrumbs that I need.

The Editor gives you additional information and element counts and has many other useful features. One of them is allowing you to save an analysis on line so you can share it.

Decomposing_Data_Structures_asteroid_output in JSON Editor Online

Using the JSON Editor Online to navigate through the returned data from your API call
Using the JSON Editor Online to navigate through the returned data from your API call

There are occasions where I’m not working with public data and in those cases I’m more comfortable using a local application. My “go to” local utility is JSON Editor from Vlad Badea available from the Apple Store. I don’t have a recommendation for Windows but I know such tools exist and some look interesting.

For this data set, the local JSON Editor application does a nicer job of representing the asteroid_output because it really collapses that hairy content value.

Using the JSON Editor App on your system to navigate through the returned data from your API call
JSON Editor APP on local system

Using a Python script to decompose by trial and error

In this repository there is a rudimentary Python3 script country_info_rest.py which, when executed with the “-d” option, will attempt to walk the response from the REST Country API a couple of levels.

The first part of the script executes a REST GET and saves the response. With the “-d” option it also executes a “decompose” function to help understand the returned data structure. Some sample output from the script follows.

Outer structure (0) levels deep:
        The data structure 0 levels deep is a <class 'list'>
        The length of the data structure 0 levels deep is 1
​
One level deep:
        The data structure 1 level deep is a <class 'dict'>
        The length of the data structure 1 level deep is 24
​
        Dictionary keys are dict_keys(['name', 'topLevelDomain', 'alpha2Code', 'alpha3Code', 'callingCodes', 'capital', 'altSpellings', 'region', 'subregion', 'population', 'latlng', 'demonym', 'area', 'gini', 'timezones', 'borders', 'nativeName', 'numericCode', 'currencies', 'languages', 'translations', 'flag', 'regionalBlocs', 'cioc'])
​
                Key: name       Value: Singapore
​
                Key: topLevelDomain     Value: ['.sg']
​
                Key: alpha2Code         Value: SG
​
                Key: alpha3Code         Value: SGP
​
                Key: callingCodes       Value: ['65']
​
                Key: capital    Value: Singapore
​
                Key: altSpellings       Value: ['SG', 'Singapura', 'Republik Singapura', '新加坡共和国']
​
                Key: region     Value: Asia
​
                Key: subregion  Value: South-Eastern Asia
​
                Key: population         Value: 5535000
​
                Key: latlng     Value: [1.36666666, 103.8]
​
                Key: demonym    Value: Singaporean
​
                Key: area       Value: 710.0
​
                Key: gini       Value: 48.1
​
                Key: timezones  Value: ['UTC+08:00']
​
                Key: borders    Value: []
​
                Key: nativeName         Value: Singapore
​
                Key: numericCode        Value: 702
​
                Key: currencies         Value: [{'code': 'BND', 'name': 'Brunei dollar', 'symbol': '$'}, {'code': 'SGD', 'name': 'Singapore dollar', 'symbol': '$'}]
​
                Key: languages  Value: [{'iso639_1': 'en', 'iso639_2': 'eng', 'name': 'English', 'nativeName': 'English'}, {'iso639_1': 'ms', 'iso639_2': 'msa', 'name': 'Malay', 'nativeName': 'bahasa Melayu'}, {'iso639_1': 'ta', 'iso639_2': 'tam', 'name': 'Tamil', 'nativeName': 'தமிழ்'}, {'iso639_1': 'zh', 'iso639_2': 'zho', 'name': 'Chinese', 'nativeName': '中文 (Zhōngwén)'}]
​
                Key: translations       Value: {'de': 'Singapur', 'es': 'Singapur', 'fr': 'Singapour', 'ja': 'シンガポール', 'it': 'Singapore', 'br': 'Singapura', 'pt': 'Singapura', 'nl': 'Singapore', 'hr': 'Singapur', 'faگاپور'}
​
                Key: flag       Value: https://restcountries.eu/data/sgp.svg
​
                Key: regionalBlocs      Value: [{'acronym': 'ASEAN', 'name': 'Association of Southeast Asian Nations', 'otherAcronyms': [], 'otherNames': []}]
​
                Key: cioc       Value: SIN

===== Plucking out specific data:     
2 Letter Country Code:                          SG     
First (0 index) International Calling Code:     65     
List of International Calling Code:             ['65']     
==========================================

Feel free to take this script and add to it and modify it for your own data structure!


Apart from the first example, I have deliberately not used data from network devices. I wanted to show that the data source really does not matter. Once you understand how to decompose the data, that is, get to the data you want within a returned data structure, you can pluck out data all day long from any data set. A secondary objective was to play around with some of these APIs. While checking to see if the Earth is about to get broadsided by an asteroid is clearly important there are quite a few public and fee-based APIs out there with perhaps more practical use.

Of course within our own networks, we will be querying device and controller APIs for status and pushing configuration payloads. We will be pulling inventory data from a CMDB system API, executing some actions, perhaps some updates, and recording any changes via API to the Ticketing System.


Some final tips, links, and notes:

Some sites have excellent API documentation that tell you exactly what will be returned but some don’t, so in many instances you have to do this decomposition exercise anyway. It’s best to get familiar with it. It’s like knowing how to read a map and and how to navigate in case you forget your GPS.

JSON Tools

JSON Editor Online

REST APIs

REST COUNTRIES

NASA Asteroids – Near Earth Object Web Service

https://api.nasa.gov/api.html#NeoWS

Examples Repository cldeluna/Decomposing_DataStructures


The Gratuitous. Arp

2019-07-07