Using Python and Pandas to look at Pandemic Data

The script and supporting files in this repository are intended to show how the Python Pandas module can be used to analyze data, specifically COVID-19 data.

I am going to recommend 3 data sets to “investigate”:


WHO Data

The repository comes with the WHO data file from 06 April 2020 (WHO-COVID-19-global-data.csv). The simplest run of the script will use this WHO data file.

To download the latest file go to the Who Overview Map and download the Map Data from the link on the lower right hand side.

This CSV file will need clean up. Remove spaces from column titles. Some rows have spaces in the country names and so spaces have shifted columns (Belize and Palestine). You will need to combine the name and shift the data back to the correct columns. Welcome to the world of data.


John Hopkins University (JHU) Center for Systems Science and Engineering (CSSE) Data

The John Hopikns Unversity CSSE data is widely used in the media and either drives or is incorporated into many other data sets.
More importantly for our purposes, this wonderful institution of higher learning makes the raw data available on a public repository (GitHub).

CSSEGISandData on GitHub

I’ve cloned the repository so that it sits as a subdirectory in my pandas_for_pandemic_data folder and I refresh it every day.

# Clones the pands_for_pandemic_data Repository
git clone

# Change into the pands_for_pandemic_data Repository
cd pandas_for_pandemic_data

# Clones the John Hopkins University CSSE Data
git clone

# Refresh the JHU Data
cd COVID-19
git pull
# Example of refreshing the JHU repository
Claudias-iMac:COVID-19 claudia$ git pull
remote: Enumerating objects: 148, done.
remote: Counting objects: 100% (148/148), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 252 (delta 135), reused 140 (delta 135), pack-reused 104
Receiving objects: 100% (252/252), 1.25 MiB | 6.51 MiB/s, done.
Resolving deltas: 100% (157/157), completed with 14 local objects.
865c933c..f3dea791 master -> origin/master
513b21a4..493821d3 web-data -> origin/web-data
Updating 865c933c..f3dea791
csse_covid_19_data/UID_ISO_FIPS_LookUp_Table.csv | 7141 ++++++++++----------
…/csse_covid_19_daily_reports/04-06-2020.csv | 2810 ++++++++
…/time_series_covid19_confirmed_US.csv | 6508 +++++++++---------
…/time_series_covid19_confirmed_global.csv | 527 +-
…/time_series_covid19_deaths_US.csv | 6508 +++++++++---------
…/time_series_covid19_deaths_global.csv | 527 +-
…/time_series_covid19_recovered_global.csv | 499 +-
7 files changed, 13668 insertions(+), 10852 deletions(-)
create mode 100644 csse_covid_19_data/csse_covid_19_daily_reports/04-06-2020.csv
Claudias-iMac:COVID-19 claudia$

Feel free to put it elsewhere in your directory structure. The script sets the default path in the arguments section at the bottom. You can either update the default path directly or use the -d option when you execute the script to redirect script to look there for the daily files.


New York Times Data

The New York Times has also shared their data. This repository only contains data for the US. They share two flavors:

  • US State Level data
  • US County Level data

They do a good job of keeping the data set very clean. Its all numeric and so far I’ve not seen any missing data which is rare for any data set.

== Number of MISSING values in each column:
date 0
state 0
fips 0
cases 0
deaths 0
dtype: int64

New York Times US Data GitHub Repository

I took the same approach with this repository as I did for the JHU data. I’ve cloned the repository so that it sits as a subdirectory in my pandas_for_pandemic_data folder and I refresh it every day.


In general, the script will take in a CSV data file, turn it into a Pandas Data Frame, and execute a set of commands against the data. The main section manages which options to execute and sends the relevant data frame to a function that prints out the various analysis statements for the data frame. In general, this is what will be shown for each data frame.

  • Describe the data (pandas method showing interesting statistical facts about the data)
  • Show the shape of the data frame (number of rows and columns)
  • Show the first and last 5 lines of data
  • List the column headings
  • Show the data type of each column
  • Look for the total number of missing values in each column
  • Sum the columns (only makes sense for columns holding numeric data)

The various options let you control which data set you want to investigate and filter. The output is sent to your screen. These are just some of the actions available to you with Pandas. Once you have the data in a Pandas data frame you can query the data frame for the data that is meaningful to you.

Cheatsheet for script

python -hDisplay all the options available (Help)
python todays_totals.pyWHO Data
Without any options, the script will load the local WHO data file from 6 April into a Pandas Data Frame and run some commands to investigate the data.
Reminder: If you download a fresh WHO CSV file please note the updates I list above so that you can cleanly import the CSV into a Data Frame
python -c “MX”WHO Data Filtered for a Specific Country
Note: use the 2 letter country code as an argument with the -c option
python -tJohn Hopkins University CSSE Data
The -t option will look for todays daily log file in the JHU CSSE (remember to clone the repository)
python -t -c “Mexico”John Hopkins University CSSE Data
The -t -c “country or region” option will let you filter for a country
python -t -s “California”John Hopkins University CSSE Data
The -t -s “state” option filters the JHU data set for a state or province
python -t -f 06037John Hopkins University CSSE Data
The -t -f  FIPS option filters the JHU data set for a FIPS county code.   Note: FIPS code 06037 is for Los Angeles County
python -nNew York Times Data
US Totals Only for the full NY Times data set with the -n option
(remember to clone the repository)
python -n -f 6
python -n -p “California”
New York Times Data
This data set has both “state” and “fips” but fips represents FIPS State Code so in this example 6 is the FIPS state code. 
This should get you exactly the same data as ** python -n -p “California”**
Script CLI Cheat Sheet

The script will give you an idea of how to load pandemic data into a Pandas data frame and interrogate the data. Executing it with the -h option will give you help on the options.

(pandas) Claudias-iMac:pandas_for_pandemic_data claudia$ python -h
[-t] [-n]
Script Description
optional arguments:
-h, --help show this help message and exit
Set path to CSSE Dailty Report folder
csse_covid_19_daily_reports. Default is ./COVID-19/css
Filer on 2 letter Country Region. Example: "US"
Filer on Province State. Example: "California"
-s SPECIFIC_DAY, --specific_day SPECIFIC_DAY
File for specific day. Example: 04-01-2020
-f FIPS, --fips FIPS FIPS County Code Example: 06037 (Los Angeles County)
-w, --who_data_file Analyze the WHO data file provided
-t, --today_csse Analyze todays file in the CSSE repo
-n, --new_york_times Analyze the New York Times Data
Usage: 'python todays_totals'

Running the script without any parameters yields some information on the WHO data set from 6 April which is part of the repository. This information includes:

A description of the data frame including some statistical data on the numeric values.

The rows and

Example output for WHO Data:

(pandas) Claudias-iMac:pandas_for_pandemic_data claudia$ python
==================== DATA FRAME CHECK ====================
==================== WHO Data Frame from WHO-COVID-19-global-data.csv ====================

== Describe the Data Frame:
Deaths CumulativeDeaths Confirmed CumulativeConfirmed
count 6786.000000 6786.000000 6786.000000 6786.000000
mean 9.971412 104.946360 178.487179 2313.689213
std 73.268455 761.569677 1183.935476 12918.006118
min 0.000000 0.000000 0.000000 1.000000
25% 0.000000 0.000000 0.000000 4.000000
50% 0.000000 0.000000 2.000000 26.000000
75% 0.000000 3.000000 25.000000 235.000000
max 2003.000000 15889.000000 33510.000000 307318.000000

== Shape of the Data Frame:
(6786, 8)

== SAMPLE (first and last 5 rows):
day Country CountryName Region Deaths CumulativeDeaths Confirmed CumulativeConfirmed
0 2/25/20 AF Afghanistan EMRO 0 0 1 1
1 2/26/20 AF Afghanistan EMRO 0 0 0 1
2 2/27/20 AF Afghanistan EMRO 0 0 0 1
3 2/28/20 AF Afghanistan EMRO 0 0 0 1
4 2/29/20 AF Afghanistan EMRO 0 0 0 1
day Country CountryName Region Deaths CumulativeDeaths Confirmed CumulativeConfirmed
6781 4/2/20 ZW Zimbabwe AFRO 0 1 0 8
6782 4/3/20 ZW Zimbabwe AFRO 0 1 0 8
6783 4/4/20 ZW Zimbabwe AFRO 0 1 1 9
6784 4/5/20 ZW Zimbabwe AFRO 0 1 0 9
6785 4/6/20 ZW Zimbabwe AFRO 0 1 0 9

== Column Headings of the data set:
['day' 'Country' 'CountryName' 'Region' 'Deaths' 'CumulativeDeaths'
'Confirmed' 'CumulativeConfirmed']

== Column default data type:
day object
Country object
CountryName object
Region object
Deaths int64
CumulativeDeaths int64
Confirmed int64
CumulativeConfirmed int64
dtype: object

== Number of MISSING values in each column:
day 0
Country 85
CountryName 0
Region 62
Deaths 0
CumulativeDeaths 0
Confirmed 0
CumulativeConfirmed 0
dtype: int64

== Sum just the numeric columns in the Data Frame:
Deaths 67666
CumulativeDeaths 712166
Confirmed 1211214
CumulativeConfirmed 15700695
dtype: int64

(pandas) Claudias-iMac:pandas_for_pandemic_data claudia$

FuzzyWuzzy was a Python Module

An example of using the fuzzywuzzy Python module to match data sets with similar but not exact data – fuzzy matches!

I was recently given a list of locations that I had to analyze.

For the analysis, I needed data that was not in the original list (lets call that the source list). Luckily I had a larger data set (lets call that the detail list) which did have all the additional details I needed.

The source list with locations was a subset of the detail list. Even better, both lists had an “Address” column so I figured it would be a simple matter of looking up each location in my source list in the larger detail list and picking out the additional data that I needed.

…or so I thought…

The Same Column but NOT the Same Data

As I drilled down into the actual data, what I saw immediately derailed my original plan to use the “Address” column in both data sets as the key to match. For that to work, the keys would need to be the same or nearly so but I saw drastically different data. I might have been able to account for spaces and capitalization differences but never something like this:

Value From Source List AddressValue From Detailed List Address
Fort Irwin Military Base93 Goldstone Rd (Ft. Irwin)

As ever, Google to the rescue and enter the fuzzywuzzy python module.

Let’s take a closer look at the data.

I’ve used a list of NASA’s Deep Space Network Complexes to simulate the data and issues I had. My actual source list had hundreds of locations and the details list had thousands.

Numbers correspond to the image below.

  1. This is the original source list containing the list of locations to be analyzed.
  2. I happened to have the additional detail file which had all the additional information I needed for the locations listed in the original source list.
  3. At first look it seemed a perfect match as both files had an “Address” column I could use as a “key” to get the additional information for the locations in my original source list. However, on closer inspection you can see that the addresses for the same location were quite different. Even the “City” column had differences.
  4. You can also see that the source list was missing some State data.
  5. Finally, for each location in my source list I needed ZIP, complete state data, 2 Letter Country code, Directions, and the URL for each location.
  6. What I needed was all of the data in #5 extracted from the much larger detail file and combined with the locations from my original list.
Data Files
  • 6 & 7 For those that like to flip to the end, the script we will discuss here provided the last file combining the source list of locations with the detail data for each location from the larger detail file.

The Main section of the script

I’ve put the sample data sets and the script here on GitHub.

The main body of the script is very simple thanks to Pandas (another wonderful python module). It’s really only 8 lines of code and 4 steps.

  • Load the data sets from the Excel files into Pandas Data Frames
# Create a Data Frame from the Source Excel File
df_src = df_from_excel(arguments.source_file)
# Create a Data Frame from the Additional Details Excel File
df_det = df_from_excel(arguments.detail_file)
  • Add a new column to the source data set for “Full_Address” which contains the address information from the detailed file. This allows us to use the Pandas merge function to merge the two data sets in one line in exactly the way we want. This one statement calls a function that performs the fuzzy lookups
# Add a new column "Full_Address" to the Source data frame which has the "Address" information form the additional details data frame
# This new column will be used in the Pandas merge 
df_src['Full_Address'] = df_src['Address'].apply(get_series_match, args=[df_det['Address']])
  • Merge the two Data Frames into one data set with all the information needed
df_merged = pd.merge(df_src, df_det, left_on="Full_Address", right_on="Address", how="left", suffixes=('_src', '_det'))
Annotated merge command
  • Save the merged Data Frame as an Excel file and as a JSON file.
all_data_fn = "DSN_Complex_Lists_COMBINED"
df_merged.to_json(f"{all_data_fn}.json", orient="records")

That is an overview of the entire process.

The get_series_match Function

The real magic is in the get_series_match function that is used to build the new “Full_Address” column in the original data set, df_src.

Let’s take a closer look at what that function is doing.

First, lets remember how its called:

df_src['Full_Address'] = df_src['Address'].apply(get_series_match, args=[df_det['Address']])

I will be the first to admit this is in no way elegant or “pythonic” but I’m not a developer. I’m a network engineer. It works!

In this one statement, thanks to the power of Pandas, we can iterate over the values in the Address column of the source data frame df_src[‘Address’], sending each to the get_series_match function to search for a match in the detail data frame Address column.

To the left of the equal sign, we tell the df_src data frame that we want to add a new column, “Full_Address”. We build the contents of that new column by passing the “Address” value from the source data frame, df_src[‘Address’], and using Pandas “apply” method to build the new value of each row using the get_series_match function (and we pass to the function the column “Address” data from the details data frame, df_det[‘Address’]).

get_series_match function

In line 38 we see the function defined with the two values it expected to be passed to it when called. The variable row_val has the “cell” value from the source data frame “Address” column.

Line 44 sets an empty variable to hold the match value (the address from the detail data frame).

The args variable is defined as a list and it has the column values from the detail data frame “Address” column so that we can iterate through that list looking for the specific row_val. Specifically the row_val passed to the function will be compared against this list of addresses from the detail data frame, df_det[‘Address’] :

0                        93 Goldstone Rd (Ft. Irwin)  
1       421 Discovery Drive, Paddy's River District    
2  Ctra. M-531 Robledo de Chavela a Colmenar del ...  
3                               4800 Oak Grove Drive  

Line 48 starts the for loop which will iterate over each value in the df_det[‘Address’] data which we have stored as a python list in the variable args looking for the closest match to row_val.

Line 51 attempts an exact match. That is, if the src “Address” value matches the detail “Address” value (case insensitive), then we set match_value to the address from the detail data frame and we break out of the loop. We have found what we were looking for.

Lines 58 – 60 are executed if we don’t find an exact match. In reality, to make this shorter I would remove the exact match test (lines 51 – 56) but I left it in as an example of what I tried to do initially.

These 3 variables hold a ratio or “score” of how closely the two address values matched given different algorithms available in the fuzzywuzzy module. Each algorithm has a sweet spot. I try all three and see what gets me closest to what I need. In this case, with this data, token_sort_ration worked best (gave me the consistently higher score).

Value From Source List AddressValue From Detailed List AddressRatioPartial RatioToken Sort Ratio
Fort Irwin Military Base93 Goldstone Rd (Ft. Irwin)353850
Fort Irwin Military Base421 Discovery Drive, Paddy’s River District323833
Fort Irwin Military BaseCtra. M-531 Robledo de Chavela a Colmenar del Arroyo, Km 7.1243324
Fort Irwin Military Base4800 Oak Grove Drive272527

The get_series_match function returns match_value = “93 Goldstone Rd (Ft. Irwin)” in this example.

Script output with print enabled (stdout_details = True):

>>>>>>>>>>>>> Comparing row value Fort Irwin Military Base with additional detail series value 93 Goldstone Rd (Ft. Irwin)
                Fuzzy Ratio:    35
                Fuzzy Partial Ratio:    38
                Fuzzy Token Sort Ratio:         50 of type <class 'int'>
        Fuzzy Match Found for row: 
                Fort Irwin Military Base 
        with series value:
                93 Goldstone Rd (Ft. Irwin) 
        with Fuzzy Token Sort Ratio of 50!
>>>>>>>>>>>>> Comparing row value Fort Irwin Military Base with additional detail series value 421 Discovery Drive, Paddy's River District 
                Fuzzy Ratio:    32
                Fuzzy Partial Ratio:    38
                Fuzzy Token Sort Ratio:         33 of type <class 'int'>
>>>>>>>>>>>>> Comparing row value Fort Irwin Military Base with additional detail series value Ctra. M-531 Robledo de Chavela a Colmenar del Arroyo, Km 7.1
                Fuzzy Ratio:    24
                Fuzzy Partial Ratio:    33
                Fuzzy Token Sort Ratio:         24 of type <class 'int'>
>>>>>>>>>>>>> Comparing row value Fort Irwin Military Base with additional detail series value 4800 Oak Grove Drive
                Fuzzy Ratio:    27
                Fuzzy Partial Ratio:    25
                Fuzzy Token Sort Ratio:         27 of type <class 'int'>
match_value is 93 Goldstone Rd (Ft. Irwin)

Line 66 performs the critical test. I knew from looking at all the ration values that token_sort_ration was what I wanted to use in my logic test and by looking at the ratios or scores I knew that a score of 50 or better got me a good match. The rest is just like the exact match test. I set the match_value and break out of the loop.

Line 78 returns the match_value back to the calling statement and inserted that value into the new column in the appropriate row. Note that in line 74 if match_value never gets set in the for loop (the search), that means a match could not be made and so it gets set to “No match found”.

Video Overview ~12 minutes

Helpful Links

Pandas for Network Engineers (Who doesn’t love Pandas? )

The module not the mammal!

My original title for this article was going to be *Decomposing Pandas* as a follow on to *Decomposing Data Structures* but I was advised against that name. Go figure.

One of the things I love most about Python is that its always waiting for me to get just a little bit better so it can show me a slightly smarter way to do something. Pandas is the latest such example.

Pandas is a powerful data science Python library that excels at manipulating multidimensional data.

Why is this even remotely interesting to me as a network engineer?

Well, thats what Excel does, right?

I spend more time than I care to admit processing data in Excel. I find that Excel is always the lowest common denominator. I understand why and often I’m a culprit myself but eventually one grows weary of all the data being in a spreadsheet and having to manipulate it. I’m working on the former and Pandas is helping on the latter.

Google around enough for help on processing spreadsheets and you will come across references to the Pandas Python module.

If you are anything like me, you go through some or all of these stages:

  • You dismiss it as irrelevant to what you are trying to do
  • You dismiss it because its seems to be about big data, analytics, and scientific analysis of data (not your thing right?)
  • As you continue to struggle with what got you here in the first place (there has got to be a better way to deal with this spreadsheet data) you reconsider. So you try to do some processing in Pandas and pull a mental muscle…and what the heck is this NaN thing that keeps making my program crash? Basically, you find yourself way way out of your comfort zone (well..I did)!
  • You determine that your limited Python skills are not up to something quite this complex…after all, you know just enough Python to do the automation stuff you need to do and you are not a data scientist.

Finally, in a fit of desperation as you see all the Excel files you have to process, you decide that a python module is not going to get the better of you and you give it another go!

So here I am, on the other side of that brain sprain, and better for it, as is usually the case.

What is possible with Pandas…

Once you get the hang of it, manipulating spreadsheet-like data sets becomes so much simpler with Pandas. In fact, thats true for any data set, not just ones from spreadsheets. In fact, in the examples below, the data set comes from parsing show commands with TextFSM.

Knowing how to work with Pandas, even in a limited fashion as is the case with me, is going to be a handy skill to have for any Network Engineer who is (or is trying to become) conversant in programmability & automation.

My goal here is not to teach you Pandas as there is quite alot of excellent material out there to do that. I’ve highlighted the content which helped me the most in the “Study Guide” section at the end.

My goal is to share what I’ve been able to do with it as a Network Engineer, what I found most useful as I tried to wrap my head around it, and my own REPL work.

Lets look at something simple. I need to get the ARP table from a device and “interrogate” the data.

In this example, I have a text file with the output of the “show ip arp” command which I’ve parsed with TextFSM.

Here is the raw data returned from the TextFSM parsing script:

 # Executing textfsm strainer function only to get data
  strained, strainer = basic_textfsm.textfsm_strainer(template_file, output_file, debug=False)

In [1]: strained                                                                                                                                                                                                            
[['Internet', '', '5', '28c6.8ee1.659b', 'ARPA', 'Vlan1'],
['Internet', '', '4', '6400.6a64.f5ca', 'ARPA', 'Vlan1'],
['Internet', '', '172', '0018.7149.5160', 'ARPA', 'Vlan1'],
['Internet', '', '0', 'a860.b603.421c', 'ARPA', 'Vlan1'],
['Internet', '', '18', 'a4c3.f047.4528', 'ARPA', 'Vlan1'],
['Internet', '', '-', '0018.b9b5.93c2', 'ARPA', 'Vlan101'],
['Internet', '', '-', '0018.b9b5.93c1', 'ARPA', 'Vlan100'],
['Internet', '', '-', '0018.b9b5.93c0', 'ARPA', 'Vlan1'],
['Internet', '', '4', '28c6.8ee1.6599', 'ARPA', 'Vlan1'],
['Internet', '', '0', '000c.294f.a20b', 'ARPA', 'Vlan1'],
['Internet', '', '0', '000c.298c.d663', 'ARPA', 'Vlan1']]

Note: don’t read anything into the variable name strained. The function I use to parse the data is called textfsm_strainer because I “strain” the data through TextFSM to get structured data out of it so I put the resulting parsed data from that function into a variable called “strained”.

Here is that data in a Pandas Data Frame:

# strained is the parsed data from my TextFSM function and the first command below
# loads that parsed data into a Pandas Data Frame called "df"
In [1]: df = pd.DataFrame(strained, columns=strainer.header)                                                                                                                                                                                                           
In [2]: df                                                                                                                                                                                                                                                      
0   Internet    5  28c6.8ee1.659b  ARPA     Vlan1
1   Internet    4  6400.6a64.f5ca  ARPA     Vlan1
2   Internet  172  0018.7149.5160  ARPA     Vlan1
3   Internet    0  a860.b603.421c  ARPA     Vlan1
4   Internet   18  a4c3.f047.4528  ARPA     Vlan1
5   Internet    -  0018.b9b5.93c2  ARPA   Vlan101
6   Internet    -  0018.b9b5.93c1  ARPA   Vlan100
7   Internet    -  0018.b9b5.93c0  ARPA     Vlan1
8   Internet    4  28c6.8ee1.6599  ARPA     Vlan1
9   Internet    0  000c.294f.a20b  ARPA     Vlan1
10  Internet    0  000c.298c.d663  ARPA     Vlan1

I now have a spreadsheet like data structure with columns and rows that I can query and manipulate.

My first question:

What are all the IPs in Vlan1?

Just Python

Before Pandas, I would initialize an empty list to hold the one or more IPs and then I would iterate through the data structure (strained in this example) and where the interface “column” value (which in this list of lists in the strained variable is at index 5) was equal to ‘Vlan1’ I appended that IP to the list. The IP is in index 1 in each item the strained list.

# Using Python Only
print("\n\tUsing Python only..")
vlan1ips = []
for line in strained:
    if line[5] == 'Vlan1':

The resulting output would look something like this:

['', '', '', '', '', '', '', '', '']

Python and Pandas

Using a Pandas data frame df to hold the parsed data:

pandas_vlan1ips = df['ADDRESS'].loc[df['INTERFACE'] == 'Vlan1'].values

The resulting output from the one liner above would look something like this:

 ['' '' '' '' ''
'' '' '' '']

Same output with a single command!

Python List Comprehension

For those more conversant with Python, you could say that list comprehension is just as efficient.

# Using list comprehension
print("Using Python List Comprehension...")
lc_vlan1ips = [line[1] for line in strained if line[5] == 'Vlan1' ]

Results in:

Using List Comprehension: 
['', '', '', '', '', '', '', '', '']

So yes..list comprehension gets us down to one line but I find it a bit obscure to read and a week later I will have no idea what is in line[5] or line[1].

I could turn the data into a list of dictionaries so that rather than using the positional indexes in a list I could turn line[1] into line[‘IP_ADDRESS’] and line[5] into line[‘INTERFACE’] which would make reading the list comprehension and the basic python easier but now we’ve added lines to the script.

Finally, Yes its one line but I’m still iterating over the data.

Pandas is set up to do all the iteration for me and lets me refer to data by name or by position “out of the box” and without any extra steps.

Lets decompose the one line of code:

If you think of this expression as a filter sandwich, the df[‘ADDRESS’] and .values are the bread and the middle .loc[df[‘INTERFACE’]] == ‘Vlan1’] part that filters is the main ingredient.

Without the middle part you would have a Pandas Series or list of all the IPs in the ARP table. Basically you get the entire contents of the ‘ADDRESS” column in the data frame without any filtering.

When you “qualify” df[‘ADDRESS’] with .loc[df[‘INTERFACE’]] == ‘Vlan1’] you filter the ADDRESS column in the data frame for just those records where INTERFACE is ‘Vlan1’ and you only return the IP values by using the .values method.

Now, this will return a numpy.ndarray which might be great for some subsequent statistical analysis but as network engineers our needs are simple.

I’m using iPython in the examples below as you can see from the “In” and “Out” line prefixes.

In [1]: pandas_vlan1ips = df['ADDRESS'].loc[df['INTERFACE'] == 'Vlan1'].values

In [2]: type(pandas_vlan1ips) Out[2]: numpy.ndarray

I would like my list back as an actual python list and thats no problem for Pandas.


In [3]: pandas_vlan1ips = df['ADDRESS'].loc[df['INTERFACE'] == 'Vlan1'].to_list()

In [4]: type(pandas_vlan1ips) Out[4]: list

In [5]: pandas_vlan1ips Out[5]:` `['',` `'',` `'',` `'',` `'',` `'',` `'',` `'',` `'']

You know what would be really handy? A list of dictionaries where I can reference both the IP ADDRESS and the MAC as keys.

In [5]: vlan1ipmac_ldict = df[['ADDRESS', 'MAC']].to_dict(orient='records')

In [6]: type(vlan1ipmac_ldict) Out[6]: list

In [7]: vlan1ipmac_ldict Out[7]:` `[{'ADDRESS': '', 'MAC': '28c6.8ee1.659b'},` `{'ADDRESS': '', 'MAC': '6400.6a64.f5ca'},` `{'ADDRESS': '', 'MAC': '0018.7149.5160'},` `{'ADDRESS': '', 'MAC': 'a860.b603.421c'},` `{'ADDRESS': '', 'MAC': 'a4c3.f047.4528'},` `{'ADDRESS': '', 'MAC': '0018.b9b5.93c2'},` `{'ADDRESS': '', 'MAC': '0018.b9b5.93c1'},` `{'ADDRESS': '', 'MAC': '0018.b9b5.93c0'},` `{'ADDRESS': '', 'MAC': '28c6.8ee1.6599'},` `{'ADDRESS': '', 'MAC': '000c.294f.a20b'},` `{'ADDRESS': '', 'MAC': '000c.298c.d663'}]

In [8]: len(vlan1ipmac_ldict) Out[8]: 11

MAC address Lookup

Not impressed yet. Let see what else we can do with this Data Frame.

I have a small function that performs MAC address lookups to get the Vendor OUI.

This function is called get_oui_macvendors() and you pass it a MAC address and it returns the vendor name.

It uses the API.

I’d like to add a column of data to our Data Frame with the Vendor OUI for each MAC address.

In the one line below, I’ve added a column to the data frame titled ‘OUI’ and populated its value by performing a lookup on each MAC and using the result from the get_oui_macvendors function.

df['OUI'] = df['MAC'].map(get_oui_macvendors)

The left side of the equation references a column in the data Fram which does not exist so it will be added.

The right side takes the existing MAC column in the data frame and takes each MAC address and runs it through the get_oui_macvendors function to get the Vendor OUI and “maps” that result into the new OUI “cell” for that MACs row in the data frame.

pandas-newcolumn diagram to show what is happening under the hood in the one line command to ad a coloumn

Now we have an updated Data Frame with a new OUI column giving the vendor code for each Mac.

In [1]: df                                                                                                                                                                                                                                                      
     PROTOCOL         ADDRESS  AGE             MAC  TYPE INTERFACE                 OUI
 0   Internet    5  28c6.8ee1.659b  ARPA     Vlan1             NETGEAR
 1   Internet    4  6400.6a64.f5ca  ARPA     Vlan1           Dell Inc.
 2   Internet  172  0018.7149.5160  ARPA     Vlan1     Hewlett Packard
 3   Internet    0  a860.b603.421c  ARPA     Vlan1         Apple, Inc.
 4   Internet   18  a4c3.f047.4528  ARPA     Vlan1     Intel Corporate
 5   Internet    -  0018.b9b5.93c2  ARPA   Vlan101  Cisco Systems, Inc
 6   Internet    -  0018.b9b5.93c1  ARPA   Vlan100  Cisco Systems, Inc
 7   Internet    -  0018.b9b5.93c0  ARPA     Vlan1  Cisco Systems, Inc
 8   Internet    4  28c6.8ee1.6599  ARPA     Vlan1             NETGEAR
 9   Internet    0  000c.294f.a20b  ARPA     Vlan1        VMware, Inc.
 10  Internet    0  000c.298c.d663  ARPA     Vlan1        VMware, Inc.

More questions

Lets interrogate our data set further.

I want a unique list of all the INTERFACE values.

In [3]: df['INTERFACE'].unique()                                                                                                                                                                                                                                
 Out[3]: array(['Vlan1', 'Vlan101', 'Vlan100'], dtype=object)

How about “Give me a total count of each of the unique INTERFACE values?”

In [4]: df.groupby('INTERFACE').size()                                                                                                                                                                                                                          
 Vlan1      9
 Vlan100    1
 Vlan101    1
 dtype: int64

Lets take it down a level and get unique totals based on INTERFACE and vendor OUI.

In [2]: df.groupby(['INTERFACE','OUI']).size()                                                                                                                                                                                                                  
 INTERFACE  OUI               
 Vlan1      Apple, Inc.           1
            Cisco Systems, Inc    1
            Dell Inc.             1
            Hewlett Packard       1
            Intel Corporate       1
            NETGEAR               2
            VMware, Inc.          2
 Vlan100    Cisco Systems, Inc    1
 Vlan101    Cisco Systems, Inc    1
 dtype: int64

I could do this all day long!


I’ve just scratched the surface of what Pandas can do and I hope some of the examples I’ve shown above illustrate why investing in learning how to use data frames could be very beneficial. Filtering, getting unique values with counts, even Pivot Tables are possible with Pandas.

Don’t be discouraged by its seeming complexity like I was.

Don’t discount it because it does not seem to be applicable to what you are trying to do as a Network Engineer, like I did. I hope I’ve shown how very wrong I was and that it is very applicable.

In fact, this small example and some of the other content in this repository comes from an actual use case.

I’m involved in several large refresh projects and our workflow is what you would expect.

  1. Snapshot the environment before you change out the equipment
  2. Perform some basic reachability tests
  3. Replace the equipment (switches in this case)
  4. Perform basic reachability tests again
  5. Compare PRE and POST state and confirm that all the devices you had just before you started are back on the network.
  6. Troubleshoot as needed

As you can see if you delve into this repository, its heavy on APR and MAC data manipulation so that we can automate most of the workflow I’ve described above. Could I have done it without Pandas? Yes. Could I have done it as quickly and efficiently with code that I will have some shot of understanding in a month without Pandas? No.

I hope I’ve either put Pandas on your radar as a possible tool to use in the future or actually gotten you curious enough to take the next steps.

I really hope that the latter is the case and I encourage you to just dive in.

The companion repository on GitHub is intended to help and give you examples.

Next Steps

The “Study Guide” links below have some very good and clear content to get you started. Of all the content out there, these resources were the most helpful for me.

Let me also say that it took a focused effort to get the point where I was doing useful work with Pandas and I’ve only just scratched the surface. I was worth every minute! What I have described here and in this repository are the things that were useful for me as a Network Engineer.

Once you’ve gone through the Study Guide links and any others that you have found, you can return to this repository to see examples. In particular, this repository contains a Python script called

It goes through loading the ARP data from the “show ip arp” command, parsing it, and creating a Pandas Data Frame.

It then goes through a variety of questions (some of which you have seen above) to show how the Data Frame can be “interrogated” to get to information that might prove useful.

There are comments throughout which are reminders for me and which may be useful to you.

The script is designed to run with data in the repository by default but you can pass it your own “show ip arp” output with the -o option.

Using the -i option will drop you into iPython with all of the data still in memory for you to use. This will allow you to interrogate the data in the Data Frame yourself..

If you would like to use it make sure you clone or download the repository and set up the expected environment.

Options for the script:

(pandas) Claudias-iMac:pandas_neteng claudia$ python -h
usage: [-h] [-t TEMPLATE_FILE] [-o OUTPUT_FILE] [-v]
                        [-f FILENAME] [-s] [-i] [-c]

Script Description

optional arguments:
-h, --help           show this help message and exit
                      TextFSM Template File
-o OUTPUT_FILE, --output_file OUTPUT_FILE
                      Full path to file with show command show ip arp output
-v, --verbose         Enable all of the extra print statements used to
                      investigate the results
-f FILENAME, --filename FILENAME
                      Resulting device data parsed output file name suffix
-s, --save           Save Parsed output in TXT, JSON, YAML, and CSV Formats
-i, --interactive     Drop into iPython
-c, --comparison     Show Comparison

Usage: ' python Will run with default data in the
(pandas) Claudias-iMac:pandas_neteng claudia$

Study Guide

A Quick Introduction to the “Pandas” Python Library

For me this is the class that made all the other classes start to make sense.

Note that this class is not Free.

Pandas Fundamentals by Paweł Kordek on PluralSight is exceptionally good.

There is quite alot to Pandas and it can be overwhelming (at least it was for me) but this course in particular got me working very quickly and explained things in a very clear way.

Python Pandas Tutorial 2: Dataframe Basics by codebasics <- good for Pandas operations and set_index

Python Pandas Tutorial 5: Handle Missing Data: fillna, dropna, interpolate by codebasics

Python Pandas Tutorial 6. Handle Missing Data: replace function by codebasics

Real Python <- this is terrific resource for learning Python

There is a lot of content here. Explore at will. The two below I found particularly helpful.

Intro to DataFrames by Joe James <–great ‘cheatsheet’

What others have shared…

Analyzing Wireshark Data with Pandas



What language for Network Automation?

I’m often asked variants of “What language would you recommend when getting started with Network Automation and how would you get started?

As with most things in IT, the answer requires context.

In general, right now I would still pick Python.  Go is getting popular but still can’t compete with the huge body for work and modules and examples available for Python.


Before getting to some suggestions on how I would start learning Python if I had it to do over, let me mention that there are other ways to get started with automation that also involve learning “new syntax” but may seem simpler (at first). Ansible and other automation frameworks use Domain Specific Language (DSL) which may be simpler to learn if you are doing very basic things or things that have examples in the “Global Data Store” more commonly known as Google & the Internet!

So if you are new to programming you may want to start with Ansible…lets you do lots of basic automation with more of an abstracted language.  All of the resources I mention below also have Ansible training.  Try to get one that is recent and geared towards networking as many Ansible courses are geared towards servers and while useful, you will wind up spending more time ramping up.

All roads lead to Python today

Having said that, once you start trying to do more complex things you may run into limitations or issues where some Python knowledge would be useful. Ansible is written in Python. Also, if you are new to linux, then while you may have saved some time with Ansible, you will need to invest some time figuring out how to get around in Linux as the Ansible control host runs on Linux and its variants but there are many other reasons why being conversant with Linux as a network engineer is important.

If you are serious about “upgrading” your skill set as a network engineer make sure you are somewhat comfortable with Linux basics and, while Ansible can get you going with some simple immediate tasks you may need to do for work, get started with Python as soon as you can.

If you are not comfortable moving around in Linux, David Bombal has an excellent intro on YouTube which explains why its an important skill to have and his full course is available via various training options.
Linux for Network Engineers: You need to learn Linux! (Prt 1) Demo of Cisco 9k, Arista EOS & Cumulus

One more point. In the interest of getting you productive as quickly as possible, I have built a series of Docker images that provide an Ansible control server as well as a python programing environment with many of the modules network engineers tend to use.

Ansible Server in Docker for Network Engineers

So back to the question at hand:

Step 1 (Investment: Time)

I would start with Kurt Byers free Learning Python course.

Note: I have no affiliation with him or his company but he speaks to Network Engineers better than anyone I’ve seen and it is an on-line course of recorded lessons (with a lab), one lesson a week, so from a time perspective is very easy to “consume” his course while still having an opportunity to interact and ask questions. The course with the community forum is well worth the additional cost. You will make new acquaintances who are interested in what you are doing and you sometimes run into old friends.

I would also start going through the coursework available for free on DevNet (requires an account).

Intermission (Investment: Time)

Pick a small task and get it working. This step is essential and will help focus your next step.

Spend time working with complex data structures, YAML, and JSON. See my post on “Decomposing Data Structures“.

Step 2 (Investment: Time & Money)

Your next step can take you along various paths. All of these are paid courses and can be done on demand or in a class. There are many options here but I’m listing the options with which I have first hand experience.

Kurt Byer

I really recommend Kurt Byers  Python for Network Engineers course. I’ve taken it several times!! 😀 . You have a lesson a week and this is taught by a Network Engineer for Network Engineers!


I’m also a BIG fan of Udemy and they have some very good courses as well at very reasonable price points. Always listen to the free examples and always check the dates before you purchase. Make sure you find out if the Python course is using Python3 as there are many courses there which use Python 2 and if you are just staring, please start with Python3.!

These courses are at a nominal cost, self-paced but, again, make sure you get a current one that is Python 3.


Sign up for an INE course. These guys are also very good and more self-paced but a bit more costly.

Network to Code

The Network to Code guys have great content in a more formalized setting where you can go for a week.

Note: I have taught for these guys so I do have a loose affiliation which does not lessen the point that they have terrific content.

A quick example of using TextFSM to parse data from Cisco show commands – Python3 Version

As part of my ongoing effort to migrate everything over to Python 3, it’s time to show this “quick example” in Python 3.

TextFSM is a powerful parsing tool (python module) developed by Google.    There are some great examples out there to get you started. Here are two I urge you to read if this topic is of interest to you:

I can never get enough of examples so here is a very simple one to get you started or keep you practicing.  I find that a quick example where I can see results of my own making really energizes my learning process.

For this example, you need python 3 and the textfsm module installed.

TextFSM Getting Started is an excellent resource which includes the installation process for textfsm (textfsm is a pip installable module).

In addition to the environment, you will need 2 things to get started.

  • A TextFMS template
  • Content to parse with the TextFMS template

The TextFSM Pattern Matching or “Parsing” Template

The template may be the tricky part.  This template defines the structure of the data you are trying to gather or parse out of your content.  We are very fortunate that the Network to Code (NTC) team has given us a large library of templates from which to choose and so very often we don’t have to worry too much about the template itself. We just need to know what Cisco IOS show command has the information we want.

Content (network device output) to Parse

Once you have your template, you need content to parse.  You can get this in a variety of ways.  You can query your devices real time via Ansible or via a python script or you can act on file based (saved) data that you already have.

In this example we will keep it simple and assume we have a text file of show commands that we need to parse to get the device hardware information and software version.

To get hardware and software information, the “show version” output should have what we want.  So looking at the existing templates in the NTC library, it looks like the cisco_ios_show_version.template has what we need, If we look at the template we can see that it has two variables, VERSION and HARDWARE (which will return a list).  That looks just about right for the information we want to extract and luckily the file of show commands includes the output of the “show version” command.

So here are the two files we will work with in this example:

– the textFSM template file (downloadable from GitHub) and included my textfsm3 repository on GitHub.

– the content file with show commands including the output of the show version command (downloadable here)

For simplicity I’ve put them both in a temp directory and I will launch the python interpreter from there so we can work real time. I also list the modules that I have in the virtual environment. The only one you need for this example is textfsm.

Working directory and files

(txtfsm3) Eugenias-PB:textfsm3 eugenia$ tree
├── cisco_ios_show_version.template
└── lab-swtich-show-cmds.txt

The environment

(textfsm3) Eugenias-PB:~ eugenia$ python --version 
Python 3.6.55
(textfsm3) Eugenias-PB:textfsm-example eugenia$ pip freeze
PB:textfsm-example eugenia$

Lets get started…

Launch the interpreter and import the textfsm module. The “>>>” tells you that you are in the python command interpreter.

(textfsm) Eugenias-PB:textfsm3 eugenia$ python 
Python 3.6.5 (default, Apr 25 2018, 14:26:36)
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

Open the template file into a file handle I’ve called template and pass that as the argument to the textfsm.TextFSM method creating a “parsing object” based on that template (the show version template in our case).

>>> template = open('cisco_ios_show_version.template') 
>>> results_template = textfsm.TextFSM(template)

Look at some of the methods (functions) available in the results_template object by using dir() on the results_template object you just created. Make note of the ‘ParseText‘ one as that is the one we will use shortly to parse our content.

>>> import textfsm
>>> template = open('cisco_ios_show_version.template')
>>> results_template = textfsm.TextFSM(template)
>>> dir(results_template)
['GetValuesByAttrib', 'MAX_NAME_LEN', 'ParseText', 'Reset', '_AppendRecord', '_AssignVar', '_CheckLine', '_CheckRule', '_ClearAllRecord', '_ClearRecord', '_DEFAULT_OPTIONS', '_GetHeader', '_GetValue', '_Operations', '_Parse', '_ParseFSMState', '_ParseFSMVariables', '_ValidateFSM', '_ValidateOptions', 'class', 'delattr', 'dict', 'dir', 'doc', 'eq', 'format', 'ge', 'getattribute', 'gt', 'hash', 'init', 'init_subclass', 'le', 'lt', 'module', 'ne', 'new', 'reduce', 'reduce_ex', 'repr', 'setattr', 'sizeof', 'str', 'subclasshook', 'weakref', '_cur_state', '_cur_state_name', '_line_num', '_options_cls', '_result', 'comment_regex', 'header', 'state_list', 'state_name_re', 'states', 'value_map', 'values']

>>> results_template.value_map
{'VERSION': '(?P.+?)', 'ROMMON': '(?P\S+)', 'HOSTNAME': '(?P\S+)', 'UPTIME': '(?P.+)', 'RELOAD_REASON': '(?P.+?)', 'RUNNING_IMAGE': '(?P\S+)', 'HARDWARE': '(?P\S+\d\S+)', 'SERIAL': '(?P\S+)', 'CONFIG_REGISTER': '(?P\S+)', 'MAC': '(?P[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5})'}

You don’t need these commands. I just wanted to show you how to investigate the object a little which is quite easy when working in the interpreter. Start with dir() which will give you the various methods or actions you can perform on the object you created. Focus on the ones without the underscore as those are generally internal methods although in many cases they can be useful. In the case of the textfms module I use the _results and the header method often.

This is the heart of what we are trying to do now that we’ve selected our parsing template and created an object that will parse our content and pull out the information we want. So far all we have done is built a specific “strainer” to catch the information we need. How we need to pour our data through our strainer.

First we open our text file of show commands. Below I create a file handle (basically a variable) to the show command file called content2parse and read the contents of our text file of show commands into the variable. You have to open the file first before you can read its contents.

>>> content2parse = open('lab-swtich-show-cmds.txt') 
>>> content =

Now we parse out the data we want using our results_template object and its ParseText method against our content and store the results in the parsed_results variable. As you can see, this is a list of lists. If you ran this against show command files from 5 different devices you can begin to see the possibilities. You would have a list of 5 lists with the show version information for each device.

>>> parsed_results = results_template.ParseText(content) 
>>> parsed_results
[['15.2(2)E3', 'Bootstrap', 'artic-sw01', '4 days, 14 hours, 2 minutes', 'c2960s-universalk9-mz.152-2.E3.bin', ['WS-C2960S-24TS-S'], ['FOC1709W1DT'], '0xF']]

On a side note, notice that the value of results_template._result is equal to our parsed_results. Its a good practice to put your results in

>>> results_template._result
[['15.2(2)E3', 'Bootstrap', 'artic-sw01', '4 days, 14 hours, 2 minutes', 'Reload command', 'c2960s-universalk9-mz.152-2.E3.bin', ['WS-C2960S-24TS-S'], ['FOC1709W1DT'], '0xF', ['70:10:5C:53:D4:80']]]

Your first inclination might be to iterate over the results but remember that its a list of lists (notice the double opening square brackets, [[, the first denoting that the entire result is a list and the second denoting that the first element in the list is also a list) so iterating over the results would iterate over only the item in the “top level” list [0]. What you want is to iterate over each element of the results list which is also a list. I know that my top level list has 1 element, the [0] element, and that list has 8 elements. We only parsed one file which is why our list only has one element. If I iterate over the [0] element of the list I get each individual bit of information. Get the length of the top level list (should only have one element since we only parsed 1 file)

Get the length of list (the “outer” or top level list):

>>> len(parsed_results) 

>>> for item in parsed_results:
>>> … print(item)
>>> …
['15.2(2)E3', 'Bootstrap', 'artic-sw01', '4 days, 14 hours, 2 minutes', 'Reload command', 'c2960s-universalk9-mz.152-2.E3.bin', ['WS-C2960S-24TS-S'], ['FOC1709W1DT'], '0xF', ['70:10:5C:53:D4:80']]

Next, get the length of that first element (the 0 element as lists are zero indexed). Here is our data for our one device. If we iterate over our outer list we get a single line of output with the one element (also a list).

Lets find out how many elements our inner list has:

>>> len(parsed_results[0])

>>> index = 0
>>> for item in parsed_results[0]:
... print(f"Element # {index}: \t{item}")
... index += 1
Element # 0: 15.2(2)E3
Element # 1: Bootstrap
Element # 2: artic-sw01
Element # 3: 4 days, 14 hours, 2 minutes
Element # 4: Reload command
Element # 5: c2960s-universalk9-mz.152-2.E3.bin
Element # 6: ['WS-C2960S-24TS-S']
Element # 7: ['FOC1709W1DT']
Element # 8: 0xF
Element # 9: ['70:10:5C:53:D4:80']

I know I have 10 elements (0-9) because i checked the length and if you do a length on the value_map you can see that the template is designed to get 10 pieces of information so that makes sense. Get the length of the first list

>>> len(parsed_results[0]) 
>>> len(results_template.value_map)

Finally, after all of this I wanted to get the version and hardware information out of this file and here it is:

>>> parsed_results[0][0] 
>>> parsed_results[0][6]

I hope these basics help you get started or keep practicing.

Here are all the main commands in sequence so you can see the flow without all of the interruptions. Also easier to copy and paste!

>>> import textfsm 
>>> template = open('cisco_ios_show_version.template')
>>> results_template = textfsm.TextFSM(template)
>>> content2parse = open('lab-swtich-show-cmds.txt')
>>> content =
>>> parsed_results = results_template.ParseText(content)
>>> parsed_results
[['15.2(2)E3', 'Bootstrap', 'artic-sw01', '4 days, 14 hours, 2 minutes', 'Reload command', 'c2960s-universalk9-mz.152-2.E3.bin', ['WS-C2960S-24TS-S'], ['FOC1709W1DT'], '0xF', ['70:10:5C:53:D4:80']]]

Python 2 or Python 3? Asked and Answered.

I’ve spent the last 6 months converting my Python 2.7 scripts over to Python 3. In the nascent days of SDN and Automation that was always one of the first questions asked… To code in Python 2 or Python 3?

That question has been answered.

If you look at the Python Developers Guide you can see the expiration date quite clearly and looking at PEP 373 gives you a real sense of the history.

There will be some niche projects that keep 2.7 going I suspect, however as this question relates to network automation the answer is clear.

It’s time to embrace Python 3.

If you are just getting started, download Python 3 and move along. No need to read this!

For those who started with Python 2, the key impediments are gone. For me Ansible was the last hold out. Starting with Ansible 2.2 (11-01-2016) we saw the effort to move the code base to Python 3. Officially, as of Ansible 2.5.0 (Released 2018-03-22), Python 3 (specifically 3.5 or later) is supported.

All the modules commonly used by network engineers are mature on Python 3.

  • Netmiko
  • Napalm
  • Ciscoconfparse
  • TextFSM

If the imminent “end of life” of Python 2 is not enough there are some real benefits to moving to Python 3. There are features in Python 3 that will help you if you are beginner:

and features that will help you if you are more advanced:

just to highlight two of my favorites.

Python 3 also comes with a built-in ipaddress module. I’m a big fan of netaddr which is absolutely available in Python 3 but if you want to share your code and keep dependencies down this is a big plus.