
webscraping for your dream car with Python

For those of you who enjoy a pretty car – here is something that might sound familiar. You’ve seen the car on the road, wondered how much it would cost, checked Done Deal, scrolled through the listings, read through the descriptions, decided against it because of the price and then closed the site. Only to repeat again in 6 to 8 weeks, having forgotten what the prices were the last time you browsed. Rinse. Repeat. Well, rinse and repeat no more with Python and a package called BeautifulSoup.

Python & Python packages

For those unfamiliar with the world of programming, Python is a versatile programming language that is configurable to do almost anything you can think of. Want to automate an Excel report? Python does that. Want to build a machine learning model? Python does that. Want to do something as basic as adding 2 numbers together? Python does that too, you just need to know how!

This is where Python ‘packages’ come in. Python packages, or libraries, are sets of preconfigured code that allow you to skip steps when writing the code for your task. For example, rather than writing lines of script to create a decision tree model (a type of machine learning model), the scikit-learn package lets you do it in one line: clf = tree.DecisionTreeClassifier(). Essentially, someone else has done the grunt work for you – the joys of open source!
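If you’re curious what that one-liner looks like in context, here’s a minimal sketch using scikit-learn’s built-in iris dataset (purely for illustration):

# a tiny decision tree example using scikit-learn's bundled iris dataset
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier() # the one line that creates the model
clf = clf.fit(iris.data, iris.target) # train it on the flower measurements
print(clf.predict(iris.data[:1])) # predict the species of the first flower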

webscraping with Beautifulsoup

Introducing beautifulsoup, the Python package that allows users to web-scrape with ease. Webscraping is a method of extracting information from websites, and it’s what we’ll use here to extract all the ads for our dream car from Done Deal. An important point to note: when using any webscraping tool, do so responsibly. If you make multiple requests to scrape information from a site in a short space of time, your IP address (your network’s identifier) can be blocked from the site. Websites do this to avoid heavy loads of unintended traffic – they’ve built the site for users and, naturally enough, want to prioritise their experience over webscrapers.

As straightforward as beautifulsoup is to use, each website is built differently. This means that a webscraper script built for Done Deal won’t work for Carzone. You could even find that a script you create today won’t work in a month’s time because the website’s content layout has changed. Another challenge is that you may be limited to a maximum number of search results, depending on the limits put in place by the website. For example, Done Deal won’t display more than 30 results at a time, limiting what we get from a single request. Sounds frustrating, right? All part of the fun!

getting started

Now onto the technical stuff – you can skip to the end if you just want the Python script to do your own scraping of Done Deal! Here is the link to my GitHub file. The car we’re going to build a web scraper for is the Toyota GT86. To me, this is an amazingly stylish sports car, suitable as a daily driver and reasonably priced. Heading over to Done Deal we find the listings for the GT86. The URL is what we’re interested in here: we take it and assign it to a variable called url. We then open the URL with urlopen (from the urllib library) and use beautifulsoup to parse the HTML response. What you’re left with is all the code used to create the webpage, and somewhere in that text are all the details of the cars. Now, you could manually sift through this text, but I wouldn’t recommend it because we can use the parsing tools in the beautifulsoup package to find the data we’re looking for.

# first we need to import all the libraries we will use 

import requests # this library allows us to read from the website we're interested in  
from urllib.request import urlopen # this library is used for connecting to the websites
import time # we use the time library to get the current date
from bs4 import BeautifulSoup # this is the magical library - this makes it easy to access the data on websites
import pandas as pd # pandas gives us dataframes - the building blocks for working with data in Python


# before kicking into this, we need to set some parameters
# parameters are like reference points that we can access at any point in the script

# assigning the website url - I've gone for the page with the Toyota GT
url = 'https://www.donedeal.ie/cars/Toyota/GT86'
# this is the parameter we use to open the website
page = urlopen(url)
# this is how we read from the website
html = page.read().decode("utf-8")
# this is how we make sense of the data we access from the website
soup = BeautifulSoup(html, 'html.parser')

# making sure we have access to the webpage in question 
print(soup.title.string)

Knowing which part of the results page holds the nuggets of data we’re interested in takes some trial and error. For a full, in-depth overview of what this entails, here are two great articles that helped me along the way:

https://realpython.com/python-web-scraping-practical-introduction/

https://medium.com/analytics-vidhya/scraping-car-prices-using-python-97086c30cd65

DoneDeal html response

After exploring the site, the main bit of HTML we’ll use to extract the data is the div tag with the ‘class’ attribute ‘Card__Body-sc-1v41pi0-8 cAPCyy’, because each ad is contained within this class. What we need to do is extract each element of the ad iteratively and create a dataframe to save them into. The elements of the ads we’re interested in are the ad title, the year of the car, the location of the car and the car price. Because we know that each ad has the same elements, instead of extracting each ad individually we can create functions that iterate over the ads, extracting each element and bringing them together at the end. In our case, we create four functions and build a dataframe from their output. The snippet below shows how the ad cards can be pulled out of the soup before the functions are applied.
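The functions that follow expect a card_body variable holding the ad cards. A minimal sketch of how it can be created, assuming the class name above is still what Done Deal uses (it will change whenever the site’s front end is updated) and that card_body is a list with one entry per results page, which is why the functions loop twice:

# each entry in card_body holds the set of ad cards from one results page
card_body = [soup.find_all("div", attrs = {"class" : "Card__Body-sc-1v41pi0-8 cAPCyy"})]
print(len(card_body[0])) # should roughly match the number of ads on the page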

# we need to access each element of the ad, so we need a function for each e.g. ad title, car year etc
# below is the function we can use to get the title of the ad

def car_names(card_body):
    names = []
    for item in card_body:
        for i in item:
            names.append(i.find_all("p", attrs = {"class" : "Card__Title-sc-1v41pi0-4 duHUaw"}))
    return names
    
# the next 3 functions are self-explanatory - car year, car location and car price

def car_year(card_body):
    years = []
    for item in card_body:
        for i in item:
            years.append(i.find_all("li", attrs = {"class" : "Card__KeyInfoItem-sc-1v41pi0-5 hGnmwg ki-1"}))
    return years

def car_location(card_body):
    locations = []
    for item in card_body:
        for i in item:
            locations.append(i.find_all("li", attrs = {"class" : "Card__KeyInfoItem-sc-1v41pi0-5 hGnmwg ki-5"}))
    return locations

def car_price(card_body):
    prices = []
    for item in card_body:
        for i in item:
            prices.append(i.find_all("p", attrs = {"class" : "Card__InfoText-sc-1v41pi0-13 jDzR"}))
    return prices
    
# now we can use the functions we created to extract the information we're interested in and create a dataframe
# a dataframe is like a table in an Excel file - it allows us to 'play' with the data
car_listings = pd.DataFrame({"Name" : car_names(card_body), "Year": car_year(card_body), "Price": car_price(card_body), "Location": car_location(card_body)})

the boring part

Et voila – once you run this, you have a nice dataframe ready to go. If only. What you’re left with is a bit of a mess! While there are only 20-odd ads on the page, the dataframe has over 100 rows. And because you’ve extracted HTML text, the data in your columns is wrapped in lists – and within some of those lists you’ll find more lists (look at the format of the price column). This is where you need some data wrangling experience to get the data ready for consumption. As always, I recommend you go through each step yourself to observe the changes and the purpose of each bit of code. This bit gets easier with time.
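To give a flavour of the clean-up involved, here’s a rough sketch under the assumption that each cell holds the list of tags returned by find_all (your exact steps will depend on what the site returns on the day):

# drop rows where nothing was found, then pull the text out of the first tag in each cell
car_listings = car_listings[car_listings["Name"].map(len) > 0]
for col in ["Name", "Year", "Price", "Location"]:
    car_listings[col] = car_listings[col].map(lambda tags: tags[0].get_text() if len(tags) > 0 else None)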

fin

You can save the results to a file and do the same each time you run the script, or you can append to an existing file. As for what you do with the data itself, that’s up to you. One point of note is that the car you’re interested in may have multiple pages of results. To handle this, you can create a for loop and iterate over the result pages – a rough sketch is below. If you want a hand with this, reach out!
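Here’s what that loop might look like, assuming (and this is an assumption – check your browser’s address bar on page 2 to confirm) that Done Deal accepts a ‘start’ query parameter to offset the listings:

# page through the results, collecting one set of ad cards per page
card_body = []
for start in range(0, 90, 30): # first three pages, 30 ads per page
    page_url = url + '?start=' + str(start) # hypothetical paging parameter
    page_html = urlopen(page_url).read().decode("utf-8")
    page_soup = BeautifulSoup(page_html, 'html.parser')
    card_body.append(page_soup.find_all("div", attrs = {"class" : "Card__Body-sc-1v41pi0-8 cAPCyy"}))
    time.sleep(2) # be polite - space out the requests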

Here is the link to the GitHub page. Until next time, nerd out, fellow car enthusiasts!

streamline the boring stuff: the CSO edition

I’ve recently been working a lot with APIs (otherwise known as Application Programming Interfaces) in my role with GRID Finance as we aim to apply “API-first” principles to our analytics offering. It reminded me of the post I created about Ireland and her Exports because the data that inspired that post came from the Central Statistics Office of Ireland (CSO).

What I didn’t realise at the time was that the CSO has a free-to-use API, so instead of manually downloading a CSV file and then importing it into a tool like Python, developers can connect to the CSO directly and… streamline the boring stuff! This post will show how the API works using Python so you can spend more time working with the data. If you want to get straight to the good stuff, feel free to jump to the end of the page where you’ll find a text file with the code you need to get started.

connect to the CSO with python in 6 easy steps

  1. Install the package json-rpc. This allows you to explore the response you get from the API which is returned as a JSON object. 
  2. Find the dataset you want to work with on the CSO site. In this example we’re going to use the monthly unemployment dataset, specifically the “MUM01 – Seasonally Adjusted Monthly Unemployment” from the following site: https://data.cso.ie
  3. Select the data you want to include in your dataset. We’re going to select the unemployment rate as a percentage, the last 12 months, all age groups and all sexes.
  4. Scroll down the page and expand the API Data Query option, making sure you’ve selected ‘JSON-RPC’ within the drop down – this will show you the commands you need to query the API, including the options you selected in step 3. What we want is the GET URL, and the POST Body.
  5. Create one variable for the URL and another for the API request body – I’ve named them url & payload.
  6. Call the API using the requests.post(url, json=payload).json() command.

code to access the CSO API for monthly unemployment figures

# monthly unemployment figures - using the API request from the CSO site 

import requests # needed to call the API

url = "https://ws.cso.ie/public/api.jsonrpc?data=%7B%22jsonrpc%22:%222.0%22,%22method%22:%22PxStat.Data.Cube_API.ReadDataset%22,%22params%22:%7B%22class%22:%22query%22,%22id%22:%5B%22STATISTIC%22,%22TLIST(M1)%22,%22C02076V02508%22,%22C02199V02655%22%5D,%22dimension%22:%7B%22STATISTIC%22:%7B%22category%22:%7B%22index%22:%5B%22MUM01C02%22%5D%7D%7D,%22TLIST(M1)%22:%7B%22category%22:%7B%22index%22:%5B%22202207%22,%22202206%22,%22202205%22,%22202204%22,%22202203%22,%22202202%22,%22202201%22%5D%7D%7D,%22C02076V02508%22:%7B%22category%22:%7B%22index%22:%5B%22316%22%5D%7D%7D,%22C02199V02655%22:%7B%22category%22:%7B%22index%22:%5B%22-%22%5D%7D%7D%7D,%22extension%22:%7B%22pivot%22:null,%22codes%22:false,%22language%22:%7B%22code%22:%22en%22%7D,%22format%22:%7B%22type%22:%22JSON-stat%22,%22version%22:%222.0%22%7D,%22matrix%22:%22MUM01%22%7D,%22version%22:%222.0%22%7D%7D"

payload = {
"jsonrpc": "2.0",
"method": "PxStat.Data.Cube_API.ReadDataset",
"params": {
    "class": "query",
    "id": [
        "STATISTIC",
        "TLIST(M1)",
        "C02076V02508",
        "C02199V02655"
    ],
    "dimension": {
        "STATISTIC": {
            "category": {
                "index": [
                    "MUM01C02"
                ]
            }
        },
        "TLIST(M1)": {
            "category": {
                "index": [
                   "202208",
						"202207",
						"202206",
						"202205",
						"202204",
						"202203",
						"202202",
						"202201",
						"202112",
						"202111",
						"202110",
						"202109",
						"202108"
                ]
            }
        },
        "C02076V02508": {
            "category": {
                "index": [
                    "316"
                ]
            }
        },
        "C02199V02655": {
            "category": {
                "index": [
                    "-"
                ]
            }
        }
    },
    "extension": {
        "pivot": 'null',
        "codes": 'false',
        "language": {
            "code": "en"
        },
        "format": {
            "type": "JSON-stat",
            "version": "2.0"
        },
        "matrix": "MUM01"
    },
    "version": "2.0"
}
}

# calling the API with the requests package, and json to explore the data
response = requests.post(url, json=payload).json()

exploring the output of the JSON object

When we take a look at the response it can be daunting – it’s not in a nice dataframe that we can immediately use for data exploration, so a bit of data wrangling is needed. To access the data within the response we can use commands like response['result'] – the API response is essentially a set of nested dictionaries and lists that are easy to access once you know the keys. Each API provider is different, so a bit of time up front is needed to understand each one. But it also means that once you figure out how the CSO works, you can use it for any of their published datasets!

What we need from this response is the ‘result’, which we can access using response['result'].items() and save as a pandas dataframe. From there we extract the date range and the values, and then combine the two to create our final dataset. I’ve skipped a few steps on how I extracted the data, but I suggest you have a play around with the queries, breaking them into their own queries, to get a better understanding of how it’s done. A rough sketch of the idea is below.
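Here’s a minimal sketch of pulling the figures out. The key names follow the JSON-stat 2.0 format the CSO returns, but it’s worth printing response['result'].keys() yourself to confirm the structure before relying on it:

# extract the month labels and the unemployment values from the JSON-stat response
import pandas as pd

result = response['result']
time_cat = result['dimension']['TLIST(M1)']['category']
months = [time_cat['label'][code] for code in time_cat['index']] # e.g. '2022 July'
values = result['value'] # the unemployment rates, in the same order
unemployment = pd.DataFrame({"Month": months, "Unemployment_Rate": values})
print(unemployment)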

bonus step

If you know you’re going to query this API on a regular basis and don’t want to manually select or deselect months, you can use variables to assign dates that you pass into the API query instead. For example, in the attached Python script, I’ve future-proofed the API call by creating variables with future dates and including them in the query.
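The attached script has the full version, but the idea looks roughly like this – the month codes the API expects are in YYYYMM form (e.g. '202208'), so we can build them at run time instead of hard-coding them:

# build the last 12 month codes dynamically and drop them into the payload
import pandas as pd

last_12_months = pd.period_range(end=pd.Timestamp.today(), periods=12, freq="M").strftime("%Y%m").tolist()
payload["params"]["dimension"]["TLIST(M1)"]["category"]["index"] = last_12_months
response = requests.post(url, json=payload).json()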

I hope this was useful. Hit me up if you’ve any questions.

The full script can be found below.

network analysis and the Irish stock market (feat. Python)

The stock market. Buy low, sell high. Unfortunately, it’s not that easy and if it was we’d all be silly rich and probably not writing blog articles on personal websites about it! Fortunately, there are a variety of tools that traders can use to gain an advantage in the stock market. One of those tools is social network analysis.

what is social network analysis?

Social network analysis (SNA) is the study of relationships between people, entities or objects, which can be measured through graph properties and visualised as a network graph. The nodes (the people, entities or objects) are connected by edges, which can carry weights measuring the strength of the relationship between them. The larger the weight, the stronger the relationship.

A simple example of an undirected network graph is a friendship circle. Person A knows person B and C but person B and C don’t know each other. Person C knows person D, who knows person E. Person E also knows person G and A. Finding it difficult to keep track? Wondering what the best way to visualise these relationships is? You’re probably thinking of drawing it out. You’re thinking of a network graph.

By drawing the connections between people we can understand the relationships between them. For example, who do you think are the most connected people in this graph? Persons A and E – each is connected to three others (E to A, D & G; A to B, C & E). The least connected are persons B & G (they are only connected to 1 other person).

This is known as an undirected graph because the relationship goes both ways. Person A cannot be a friend of Person B if Person B isn’t their friend back. In a directed graph, the relationship is 1 way. An example of this is a direct flight from Dublin to Knock.

By applying the same logic to any dataset we can view complex relationships in compact network graphs.
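This toy example is also a nice way to get familiar with the networkx package, which we’ll use again below. A minimal sketch of the friendship circle described above:

# build the friendship circle as an undirected graph and count each person's connections
import networkx as nx

friendships = [("A", "B"), ("A", "C"), ("C", "D"), ("D", "E"), ("E", "G"), ("E", "A")]
G = nx.Graph() # undirected - friendship goes both ways
G.add_edges_from(friendships)
print(dict(G.degree())) # {'A': 3, 'B': 1, 'C': 2, 'D': 2, 'E': 3, 'G': 1}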

applying SNA to the Irish stock market

How can we create a network graph from the returns of the Irish stock market? First we need to define our nodes – in this case, the companies listed on the stock market. Second we need our edges – this is how we connect nodes together. For our example we can use the correlation of daily share price returns for each pair of companies. The correlation gives us the strength of the relationship between the nodes.

To keep things simple we can include a threshold for correlation. If a pair of companies are moderately correlated (> 0.3) then they will be included in our dataset. This is more commonly known as the ‘winner takes all’ approach.

This can all be done in Python, so the next part of this article focuses on how we can achieve it using extracts from a script I recently developed. The dataset we’re working with here is share price returns for January 2009 and is available at the bottom of this article.

our trustworthy friend, Python

First we need to calculate the correlation matrix:

Jan_2009_dropped_NAN_corr = Jan_2009_dropped_NAN_floats.corr().abs()

To manipulate the data so that we have a list of pairs of companies and their correlation coefficients we can use .stack() along with some dataframe adjustments:

df = Jan_2009_dropped_NAN_corr.stack()
df1 = pd.DataFrame(df)
df1 = df1.reset_index(level=[0])
df1 = df1.rename(columns = {'Company_Name':'Company_Name_2', 0:'Correlation_Value'})
df1 = df1.reset_index().rename(columns = {'Company_Name':'Company_Name_1'})

To finalise our new dataframe we isolate all company pairings where the correlation is greater than 0.3:

# df2 is assumed to be df1 with self-pairings (each company's correlation with itself) removed - a step omitted from this excerpt
df2_reduced = df2.loc[(df2['Correlation_Value'] > 0.3)]
df2_reduced_company_pairs = df2_reduced[['Company_Name_1', 'Company_Name_2']]

Initializing and visualizing our graph using the Python package networkx is quite simple:

import networkx as nx # the network graph package
from matplotlib.pyplot import figure # for setting the figure size

G = nx.Graph()
G = nx.from_pandas_edgelist(df2_reduced_company_pairs, 'Company_Name_1', 'Company_Name_2')

figure(figsize=(10, 7))
nx.draw_shell(G, with_labels=True, width=1)

network graph of the Irish stock market

Et voila. A network of companies on the Irish stock market that share a correlation of > 0.3. We can see that companies such as Tullow Oil and FBD Holdings are connected to many companies in the graph, indicating underlying relationships across the market.

Companies that don’t share an edge show no meaningful correlation. For instance, Ovoca Bio is connected only to PetroNeft, indicating its share price is independent of the other companies in the market.
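If you’d rather confirm this from the data than eyeball the graph, the node degrees (the number of edges attached to each company) are easy to pull out:

# list the companies with the most connections in the correlation network
most_connected = sorted(G.degree(), key=lambda pair: pair[1], reverse=True)
print(most_connected[:5]) # the five most connected companies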

As an alternative to Python, to the left is the same network graph as above but created using Gephi (software specifically developed to create network graphs). Not only does it make aesthetically pleasing graphs, it allows drag and drop customization that Python does not offer.

One example is that we can color the nodes on a gradient depending on how important they are to the overall network, with importance measured by the number of nodes each is connected to.

In this instance, dark green indicates most important while dark purple indicates less important – white indicates being in the middle.

so what?

When working with data it’s easy to get carried away with the doing and lose sight of the reason why data science exists – the so what? What value does this piece of work bring to the table? In our case, by applying SNA to stock market price returns we can infer how stock prices move together and how this changes over time.

As a result of the SNA, we know that for January 2009 Tullow Oil was the most connected company in the network. One tactic to reduce risk in trading is diversification: buying multiple stocks in different industries so that when the value of certain stocks decreases, the whole portfolio isn’t impacted. However, if we had a limited amount of capital to invest, we could use our analysis and invest only in Tullow Oil, as it’s closely related to the highest number of stocks in the market and therefore, theoretically, a safer bet (this is not financial advice!).

Thanks for reading. The dataset and code used to create the above graph has been included below and feel free to reach out if you’ve any questions.

Ireland and her exports

On July 15th 2020 Ireland won its appeal against a European Commission ruling that it gave Apple illegal state aid in the form of tax advantages, giving the company an unfair advantage over other businesses. According to the Commission, Apple owes Ireland a hefty €14.3 billion in uncollected taxes. Margrethe Vestager, the EU Commissioner for Competition Policy and ultimately the figurehead for the EU on this front, has been fighting what she deems unfair tax breaks applied by states like Ireland. Labelled a “consumer champion” by some commentators, Vestager is determined to bring order to an ever-growing digital economy.

Vestager is likely to bring the case to the EU’s highest court, the Court of Justice of the European Union, with potentially significant implications for the future of Irish tax policy and the companies that have chosen to establish their EU headquarters on the island. Naturally, there are conflicting opinions on the ruling. Some argue that Ireland is blatantly allowing international powerhouses to funnel money through their Irish headquarters, akin to money laundering. Others argue that without foreign direct investment, Ireland wouldn’t be in the strong position it finds itself in, Covid-19 crisis and all. Quoting the EU directly:

“As a small open economy Ireland’s financial fortunes are largely dependent on international trade and influenced by global markets.”

https://ec.europa.eu/ireland/news/key-eu-policy-areas/economy_en

So, with this backdrop, I continued my exploration of Irish exports over the last 5 years and took a look into the commodities we’re sending overseas.

data preprocessing

In my last post I showed how easy it is to map data in Python by aggregating an Irish exports dataset published by the CSO. For this post I used the same dataset and followed a similar method of aggregating the data – except this time I condensed the commodities into groups. This was a critical part of the process because there are over 60 commodity classifications in the dataset – far too many to graph at once.

Typically you’d have an SME (subject matter expert) on hand to confirm your groupings and naming conventions, but in this case I trusted my gut – although, looking back at my results, the commodity group ‘Transport’ should probably have been something like ‘Automobiles’. We’ll plough on anyway. By grouping the categories I was able to reduce them to 13 groups plus one catch-all group, ‘None’. The catch-all group acts as a backup for any categories we miss and for records with no category at all, so hopefully its total is negligible. You’ll find an extract of how I did this in the code snippet below but, as usual, the full code is at the bottom of this post.

# setting up a function that searches for the string in the category
# each category has a number with the description, so by searching for the number, I can assign it a new group
# for example, Live animals except fish etc. (00) is a category - so I search (00) & assign it agriculture  

def commodity(x):
    if '(00)' in x:
        return 'Agriculture'
    elif '(01)' in x:
        return 'Food & Beverage'
    elif '(02)' in x:
        return 'Food & Beverage'
    elif '(03)' in x:
        return 'Food & Beverage'
    elif '(04)' in x:
        return 'Agriculture'
 .......
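Once the function covers all 60-odd classifications, applying it is a one-liner. A sketch, using a hypothetical file name and column names ('Commodity', 'Year' and 'Value') that you’d swap for the CSO dataset’s actual headers:

# assign each record to a commodity group and aggregate the export values
import pandas as pd

exports = pd.read_csv("cso_exports.csv") # hypothetical file name for the CSO dataset
exports['Commodity_Group'] = exports['Commodity'].apply(commodity)
exports_grouped = exports.groupby(['Commodity_Group', 'Year'], as_index=False)['Value'].sum()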

home of pharma & chemical exports

The next step in the process is to visualise the groupings. Since starting in Bank of Ireland I’ve been using Tableau as my main data visualisation tool & I have to say – I’m sold! It’s such a handy tool as long as the data is in the correct format (which it is, thanks to Python). So I continued the trend and used it to create the below visuals.

Line charts are useful for showing trends over time; however, when you have too many categories to visualise at once, it can be difficult to see what the story is.

Here we can see that there are 2 categories (pharma & chemical manufacturing) that are several times higher in value than all other categories and look to be strong areas of growth in Ireland. After that it’s quite difficult to see trends in the other categories because the lines overlap. What we can say is that, apart from machinery, things look fairly steady, suggesting minimal growth. For reference, the colour legend in the data prep section represents each commodity.

A way to visualise the composition of categories is to create bar charts. By allowing a direct comparison from year to year, we get a clearer view of the differences between categories and years. By arranging the bars from highest to lowest (highest value at the bottom, lowest on top), we can see how pharma and chemical manufacturing make a significant contribution to total exports. What’s difficult about this view is that we can’t really see the composition of the other categories – similar to the line chart above, it looks steady but it’s hard to tell for sure. This is where stacked bar charts prove their worth!

stacks, stacks, stacks

A stacked bar chart is the same as a normal bar chart except that the values are converted into percentages of the total. Taking 2019 for instance, instead of showing the actual export value of each commodity group, we see its percentage value relative to all other commodities. This allows us to identify the areas of growth in the economy relative to all other areas.
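If you wanted to do that conversion in Python before loading the data into Tableau, it’s a two-liner on the (hypothetical) grouped dataframe from the data prep sketch above:

# convert each group's export value into a percentage of that year's total
yearly_totals = exports_grouped.groupby('Year')['Value'].transform('sum')
exports_grouped['Pct_of_Year'] = exports_grouped['Value'] / yearly_totals * 100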

For instance, while we see growth in the food & beverage sector in real value terms between 2015 & 2019 (€7.9B in 2015 to €10.B in 2019), in percentage worth to the economy, it remains steady because of the increases in other sectors. Taking a look into the agriculture sector we see a modest increase in real value terms (€3.5B in 2015 to €3.6B in 2019) but a decrease in percentage terms.

overall value

Heatmaps are a really useful tool if we’re concerned with a point in time proportional view of our data. In this instance, I totaled the value of each commodity group between the years 2015 & 2019 to see how much each contributes to the Irish economy. Again, we can see just how important pharma & chemical exports are to Ireland with general manufacturing and machinery being the next most important commodity groups.

the movers and shakers

The last visual I created shows the percentage change in each commodity group over the 5 years. As 2015 is the base year there is no value assigned. The value in 2016 represents the percentage change between 2015 & 2016, with 2017 being the change between 2016 & 2017 and so on. By representing the percentage change in colour format, it allows us to visualise areas of growth and decline. A darker brown/orange colour shows a higher negative change while a darker blue shows a higher positive change.

This view shows us that tobacco exports have rapidly declined, while there have been two years of decline in transport-related exports. The commodity group seeing the biggest swings is machinery, with a massive 37% increase in exports in 2019. With companies like Kingspan & Combilift expanding in recent years, I wonder how much they are contributing to that growth.

final thoughts

I was really surprised by just how much Ireland relies on high-tech exports. As recently as the 1970s, Ireland’s main exports were agriculture-based products to the U.K. So, in this context, is it surprising that the Irish government has strongly opposed the anti-competition charges it has faced? And how long can it defend itself against the heavy hitters of the EU who claim that it’s cheating the system? Only time will tell.

Thanks for stopping by. If you’re interested in trying it for yourself, the code and data have been included below.

mapping Irish exports in Python

Recently, there’s been a lot of talk about a reduction in consumer spending in economies around the world so it got me thinking about who are Ireland’s biggest trading partners. Fortunately, the wonderful people over at the Central Statistics Office publish this data and more but for now, I’m going to focus on creating a map in Python of total Irish exports since 2015 to identify who we’ve been exporting to the most.

Most programming tools have mapping functionality, some easier to use than others. I prefer using folium in Python because, unlike packages like ggmap in R, we don’t need to sign up for API keys and it’s relatively straightforward to use.

data preprocessing

Before we’re able to map the data, we need to preprocess it so that it’s in the correct format for mapping – and to be honest, data preprocessing is generally the most time-consuming part of data science! The main elements of preprocessing for this dataset were to (a rough pandas sketch follows the list):

  • create a full record per row
  • replace blank values with 0’s
  • remove any total or blank columns
  • convert all remaining variables to numeric form
  • aggregate exports per country
  • create a new column calculating each country’s percentage value of total exports
  • assign countries their longitudes & latitudes for mapping
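Here’s roughly what those steps look like in pandas. The file name and column names ('Country', 'Value') are placeholders for the CSO dataset’s actual headers, and the final mapping of country names to the ISO codes used by folium ('country_codes') is omitted:

# a sketch of the preprocessing steps
import pandas as pd

exports = pd.read_csv("irish_exports.csv") # hypothetical file name
exports = exports.fillna(0) # replace blank values with 0s
exports["Value"] = pd.to_numeric(exports["Value"], errors="coerce").fillna(0) # force the values to numeric
countries_total = exports.groupby("Country", as_index=False)["Value"].sum() # aggregate exports per country
countries_total["perc"] = countries_total["Value"] / countries_total["Value"].sum() * 100 # % of total exports
# a further (omitted) step joins the ISO country codes from the json file to give 'country_codes'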

Once preprocessing is complete we can move onto the fun stuff! Below is an excerpt of the code used to create our map of exports:

# Initialize the base map - these are starting points for the map:
m3 = folium.Map(location=[20, 0], zoom_start=2)
 
# defining the settings for the chloropleth:
m3.choropleth(
 geo_data=country_shapes, # using the country shapes we got from the json file
 name='Total Irish Exports between 2015 & 2020',  # name of our map
 data=countries_total, # what data we want to map
 columns=['country_codes', 'perc'], # what columns from our dataframe that we want to map
 key_on='feature.id', # what we're matching with - in our instance, we're joining the country codes to the IDs in the json file
 fill_color='YlGnBu', # the colours we want to use in the map
 fill_opacity=0.5, # similar to transparency - colour setting
 line_opacity=0.5,
 legend_name='Percentage Value of Exports', # the name under our legend
 smooth_factor=0, # lines around the countries as we zoom in 
 highlight=True, # does the map highlight the country when we hover over it with a mouse 
)
folium.LayerControl().add_to(m3)

The choropleth function allows us to colour the map depending on the value attributed to each country. In this instance, the percentage of total exports column was used: the higher the percentage, the darker the colour. Surprisingly, the US is the biggest importer of Irish goods (in total € value), with more than double the total value of the next largest importer, Belgium, which is closely followed by Great Britain. The countries in black are unrecorded in our dataset (the joys of working with data – some of it simply isn’t available or falls into a category like ‘other countries’).

The downside to using something like choropleth is that when values are very similar it’s difficult to differentiate between variables. In the above map, countries between 0 and 5% of total exports are indistinguishable from each other so in this instance a simple bar chart may be more suitable depending on your needs.

an alternative map in Tableau

At my day job, I’ve been working with Tableau and I’m always pleasantly surprised at its functionality so long as your data has been preprocessed using a programming language like Python or SQL. So, here’s what it looks like in Tableau.

Which is nicer? I’ll let you decide!

Thanks for stopping by. If you’re interested in trying it for yourself, the code and data have been included below.

where’s the rain?

We love talking about the weather in Ireland. Absolutely love it. And there’s an unlimited amount of ways to describe it – fierce weather; good drying weather; scorcher; belter; cracker – can all be used to say it’s rather sunny outside. And my favourite – there’s a grand stretch in the evening!

I’ve often wondered why we love to talk about the weather and the conclusion I always come to is that it’s just an easy conversation opener and instantly puts (Irish!) people at ease. Right now, we’re beginning to hit leaving cert weather. Leaving cert weather is how we refer to the first couple of weeks in June – you’re almost always guaranteed to have sunshine in Ireland. However we’ve been a bit spoilt for sunshine in the last 2 months and it’s got me thinking, where’s the rain? So, using Python, I decided to take a look into the data to see what it’s telling us.

Sourcing data from Met Éireann (who fortunately have an open data policy and publish data collected daily from their 25 weather stations around Ireland), I created a dataset comprising monthly rainfall and air temperature stats from 2017-2020 for each weather station. The dataset also includes the mean figures for each station over the last 30 years.

rain, rain, go away

First, I decided to take a look into total rainfall across Ireland. Living in Dublin, I’ve always felt that it rarely rains here – and the data backs me up. The boxplot to the left shows the 3 stations with the least amount of rain in the last 3 years are located in Dublin (Casement, Dublin Airport & Phoenix Park) with the wettest stations located in Newport in Mayo & Valentia in Kerry.

A boxplot is a useful tool that shows the range of values for each variable, displaying a condensed view of data that we would otherwise find difficult to compare. The line in the coloured box indicates the median value for each variable (the median is the middle value when the data is sorted). The larger the box, the larger the range of values.

Turning our attention to Athenry in the above graph, it has a median annual rainfall of 1200 millimetres, but there have also been years when the rainfall was 1410 mm and 1100 mm. This proves useful in two ways. Firstly, it allows us to compare the medians of all stations at once. Secondly, it allows us to compare each station’s range of values to all other stations – for example, when we compare Athenry to a station like Mace Head, we can see there is more variation in rainfall in Athenry than in Mace Head – very useful to know if you’re planning on travelling to either of those places.
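For anyone who wants to recreate the boxplot, here’s a minimal sketch. The file and column names ('station', 'year', 'rain') are placeholders – rename them to match the Met Éireann headers:

# total rainfall per station per year, shown as a boxplot
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rain = pd.read_csv("met_eireann_monthly.csv") # hypothetical file name
annual = rain.groupby(["station", "year"], as_index=False)["rain"].sum()
sns.boxplot(data=annual, x="station", y="rain")
plt.xticks(rotation=90) # station names are long
plt.show()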

too wet, too dry?

When we want to compare something over time, simple line charts are very effective. The graph to the left shows the average rainfall per month over the last 3 years in Newport (the wettest station in Ireland) and Phoenix Park (the driest station in Ireland). By comparing the last 3 years to the mean over the last 30 years, we can see that Newport has generally been wetter during the winter/autumn months and drier during the summer months, excluding August, which has seen nearly 33% more rain in recent times! The Phoenix Park, meanwhile, has generally been drier over the last 3 years than over the last 30 – excluding the spikes in March and November.

This got me thinking – can we prove with the limited amount of data we’re working with that the weather in Ireland is getting more extreme?

To try and answer this, the histogram on the right represents each station’s annual rainfall, grouped by year. Let’s take the mean as an example. Each of the yellow bars represents one or more stations and their annual rainfall. At the 800 mark, it shows that 3 stations have had 800mm of rain on average per year, while at the 1200 mark it shows that 6 stations have had around 1200mm of rain on average per year. This becomes useful when we plot the data for 2017, 2018 and 2019.

What this shows us is that over the last 3 years we’ve seen stations with more rain than before – several stations are past the 1600 mark, while only 2 stations reach the 1600 mark over the last 30 years. Of course, these could be outliers (once-off events), but when we look at where the spikes are in 2017 and 2019, there are also more stations getting less rain than the 30-year mean would suggest. Could this point to wetter winters and drier summers?

February was a wet month, right?

Come to think of it, it rained almost every day in February. Coming back to our friend the boxplot and by isolating data for February, we can see that there was nearly 3 times more rain across the nation in February 2020 when compared to February of previous years.

but May has been really dry?

By following the same method as above, we can see that May 2020 was drier than previous years, with the median almost half of what’s expected for this time of year. We can also see that there’s been a consistent drop in rainfall in May over the last 4 years, and perhaps it’s figures like this that raise concerns over the possibility of a drought in Ireland in the months to come (apologies for the doomsday rhetoric!).

average rainfall per month – an alternative view

Heatmaps are another useful tool for showing the variety in data over time through the use of size representation or colour coding (as is the case in the above heatmap). Each block represents the average rainfall for each station and month colour coded relative to all other stations and months. The darker blue represents a high volume of rain while the light yellow represents a low volume of rain as represented on the color scale on the right of the heatmap. Taking the bottom left corner block as an example- it shows that Valentia in January has a high volume of rain when compared to Athenry in January (the top left corner block). Taking a look at our wettest and driest stations again – we can see it’s pretty wet all year round in Newport apart from June while it’s fairly dry all year round in Phoenix Park.

what’s the relationship between rain and air temperature?

Exploring the relationship between air temperature and rain on an aggregated level is tricky because the air temperature figures used in this dataset are averages per month while the rain is the total rainfall measurement. This is an example of how analysis can be skewed and it’s something readers should be aware of when reading any type of analysis, but we’ll plough on here anyway.

The graph to the left is a scatterplot. Each point on the graph represents a station’s average total rainfall and average air temperature grouped by month over the last 30 years. There are 25 stations and 12 months, giving us 300 individual points in total.

‘Grouped by month’ in this case means that all points that are from the same month have the same colour. For instance, all points for January are coloured pink while those for September are blue. By grouping them like this we can easily see that the relationship between total rainfall and average temperature varies depending on what month it is. Unsurprisingly, winter and spring months bring lower temperatures and a mix of high and low volumes of rainfall while summer and early autumn months bring higher temperatures with lower rainfall.

different season, different relationship

By dividing the months into winter/spring and summer/autumn, the difference between the conditions is clearer. The top graph to the right shows the winter/spring months while the bottom graph shows the summer/autumn months. Dividing the months out like this also changes the correlation figures. The correlation between total rainfall and average air temperature for the full dataset was -0.21, but when we calculate it for the divided months it jumps to 0.35 and 0.32 respectively. This indicates a weak positive correlation, but I imagine that if we were to exclude nighttime temperatures from the averages we’d have a different picture.
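A sketch of that seasonal split, reusing the hypothetical rain dataframe from the boxplot sketch above and assuming a numeric 'month' column and a 'temp' column for average air temperature:

# average per station and month, then correlate rainfall with temperature by season
monthly_avg = rain.groupby(["station", "month"], as_index=False)[["rain", "temp"]].mean()
winter_spring = monthly_avg[monthly_avg["month"].isin([12, 1, 2, 3, 4, 5])]
summer_autumn = monthly_avg[monthly_avg["month"].isin([6, 7, 8, 9, 10, 11])]
print(monthly_avg["rain"].corr(monthly_avg["temp"])) # full dataset
print(winter_spring["rain"].corr(winter_spring["temp"])) # winter/spring
print(summer_autumn["rain"].corr(summer_autumn["temp"])) # summer/autumn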

one more for luck

The final graph I’ll leave you with is a heatmap representing actual heat! Similar to the heatmap used for the total rainfall per station per month, the graph to the left shows the range in average air temperatures for each station per month over the last 30 years. Unsurprisingly, it’s warmer in the summer and colder in the winter so if you’re hoping for some sun in Ireland, July and August is your best bet.

that’s all for now

Thanks for reading and I hope you enjoyed the post. The code used to create this analysis can be downloaded below along with the associated dataset. The data has been sourced from www.meteireann.ie.

Covid-19: making the most out of a not-so-good situation

There’s something strange about being told you must do something – it inherently makes you not want to do it. With many of us being advised to work from home where possible and the increasing likelihood of Ireland following countries like Italy and Spain in enforcing lockdowns, I’m sure there are people out there already climbing the walls. With the gym closed and public gatherings grinding to a halt, somehow the day feels a lot longer than normal. I was thinking about why, and it’s all to do with our mindset.

Think back to the last Sunday evening before work when you wish the weekend would last that bit longer. Days like that go by so fast you wonder did someone speed up the clock while you weren’t looking. But when someone tells you that you’ll be at home for the next 2 weeks, time seems to drag and drag.

An example of this comes from a childhood story of mine. Being the oldest in my family meant my parents were always pushy when it came to things like homework and study. Obviously, wanting the best for their child, they made sure I had my homework done. If I got my homework done faster than normal the question ‘have you no studying to be doing instead?’ was asked. As a kid, you can imagine how well I took this. I just wanted to be outside and playing football with my mates.

Then it comes to my sister, the middle child. Again, the same questions. ‘Why aren’t you studying?’ but perhaps a little less intense. And again, similar results. What’s interesting is that for the youngest in my family, my parents were more relaxed about these things (and everything else in general). The results were different. He would spend hours studying, so much so that myself and my sister would always hear ‘he’s always studying, I never have to tell him to do it. In fact, I sometimes have to tell him the opposite!’. Why? Mindset.

So with that in mind here’s some suggestions for the weeks ahead to avoid climbing the walls:

    1. If you’re a beginner runner, try a 6 week couch to 5k program like this one
    2. If you’re an intermediate runner, why not try a 10k program like this one
    3. Bored of eating the same food? How about trying this smart recipe suggestion tool that allows you to specify what ingredients you have and suggests a recipe for them
    4. Try some meditation
    5. Take a free python course
    6. Build a personal WordPress website for your CV
    7. Read some comics – I’ve recently gotten into Rick & Morty
    8. See if you like yoga
    9. Read a book
    10. Chill out and enjoy the time with family because we’re more than likely not going to see the likes of this again

Personally, I’m going to try and customize this site to look like a Gameboy while avoiding my college work.

Stay safe!

Are you Surrounded by Idiots?

We’re all surrounded by idiots.

This is the tongue in cheek declaration by Thomas Erikson, author of the international bestseller Surrounded by Idiots… so don’t take it personally Yellows! In a saturated market of self-help books, Surrounded by Idiots is an easy-going read for those who want to understand their fellow humans.

Similar to concepts that inspired personality tests such as Myers-Briggs and DISC, Swedish behavioral expert Thomas Erikson categorizes people into 4 distinct colors (Red, Blue, Yellow and Green) with associated unique character traits. Using a mix of real life experiences and theoretical behavioral studies, Erikson attempts to give you the tools to not only understand those around you but to influence them too.

This book is definitely one for the reading list if you’ve ever asked yourself ‘why don’t they get me?’. So, without giving away too much of the book (and hopefully not get into any copyright trouble!), below is a quick outline of each personality type.

Why is he so pushy?

Direct, goal-orientated, decisive, impatient and ambitious are all common character traits of a Red. Who turns a playful office 5-a-side game into an all-out, seriously competitive event? A Red. Who is happy when a meeting that was scheduled for 30 minutes only lasts 15? A Red. Who is confused when a colleague gets upset over negative feedback that they asked for in the first place? A Red.

Remind you of anyone? 🙂

Why is she reading the IKEA instructions AGAIN?

Perfectionist, logical, distant, critical and thoughtful are all traits of a Blue. There’s nothing more satisfying to a Blue than being methodical. Details, details, details. Blues can’t get enough! We could all use a bit more Blue in our lives.

Why does he never stop talking?

If you know a friend who captures an audience’s attention or who could sell water to a fish, then you know a Yellow. Enthusiastic, creative, talkative and outgoing are characteristics of Yellows. Great to be around if you’re low on energy, Yellows can bring you up to their natural high with their outgoing and friendly nature.

She never forgets my birthday

Patient, loyal, thoughtful, kind and considerate. We all know a Green. Greens are considered the best team players as a result of their desire to keep everyone around them happy. Never one to want to let team members down, when a Green says they’ll do something, you can rest assured that it’ll be done.

Chameleons – it’s in our nature

Generally people are a mix of colors and it’s certainly not a one size fits all concept. By this I mean that people can have varying personalities. Personally, I’m a mix of Green and Red in the office but at home the Red side of me can’t help but show its face (apologies to my better half and family for that one!). Conversely, at social gatherings I can be a Yellow and when fulfilling my role as a Data Scientist, I’m a Blue.

While some may be against playing the ‘social game’, we all do it. Whether it’s consciously or sub-consciously, we adjust our personalities in different situations depending on the scene and setting – so we may as well understand each other a little more and stop thinking we’re surrounded by idiots.

If you’re interested in finding out what color you are, here’s a link to the test.

And here’s a link to buy the book. Happy reading!

A Disease Epidemic Study

The recent outbreak of the coronavirus 2019-nCoV has been unsettling. As of 25-01-20 there have been 56 deaths attributed to the virus out of 1,350 lab-confirmed cases [1] – a 4% death rate. Coronaviruses are common among animals, with humans typically getting infected once over their lifetimes, resulting in a mild cold or cough [2]. Like all viruses, however, they can mutate into something more serious, leading to acute illness and sometimes death, particularly in those already battling other illnesses.

Source: https://www.bbc.com/news/world-asia-china-51249208

We’ve seen coronaviruses before – SARS in 2002 and MERS in 2012 were both coronaviruses. What’s different about 2019-nCoV (n for novel, CoV for coronavirus) is the radical steps taken by the Chinese government to halt the spread of the virus, restricting transport in and out of 12 cities to date and cancelling all Lunar New Year celebrations – and other governments have been following suit. Schools and attractions such as Disneyland in Hong Kong have been closed until further notice, Taiwan has suspended travel permit applications from Chinese citizens and the U.S. has begun evacuating American citizens from Wuhan City (the city believed to be ground zero for the virus).

I recently completed a module in social network analysis (SNA) and, coincidentally (and somewhat concerningly), for a group project assignment Eoghan Keegan [3], Isabel Sicardi Rosell [4] and I created a simulation study of how a virus closely modelled on the 2002 SARS virus would spread across the US airport network, to understand the extent of a potential outbreak and to assess what the most effective vaccination strategy would be.

Using Python to create the simulation model and Gephi to visualize the results, we determined that, in the absence of a complete transportation shutdown, an evolving strategy of assigning limited resources to the airports with the most connections to other airports on a day-by-day basis was the best way to combat the spread of the virus.

SNA is a useful tool for understanding complex and dynamic relationships between entities, objects or people. The name can be misleading – while its origins are in the study of sociology, it can be applied to a range of topics. In biochemistry, SNA is used to understand the relationships between proteins in the human body, while in engineering it’s used to fault-test power grids. In this instance, we used it to understand the relationships between airports in the U.S. One of the many benefits of SNA is that each network generates mathematical properties that allow us to understand things like how closely a network is connected, meaning we can compare how connected one network is to another. These properties characterize networks.

After sourcing internal U.S. flight data from 2008, we created a network of flights between airports, using each airport as a node (the object we want to understand) and the flights between airports as the edges (the relationships between those objects). The graph to the left is the resulting network, with each circle representing an airport (a node) and each line representing a flight connection (an edge), organised by geographical location.

The next step in our process was to understand the strength of the relationships between airports because this is critical in understanding how a disease will spread throughout a network. In SNA, you can measure the strength of relationships in a number of ways and this is called the weight of the relationship. The bigger the weight, the stronger the relationship. In this instance we used the number of flights between airports as the weights.

Measures of Centrality

There are a variety of methods that can be used to determine which airports were the most important in our network and these are known as measures of centrality of a node:

  • degree centrality is the number of edges directly connected to the node. An example of a European airport with high degree centrality would be Amsterdam because it acts as a hub to other airports.
  • closeness centrality measures the average length of the shortest paths between the node and all other nodes – essentially, how many steps away an airport is from every other airport. An example is flying from Dublin to Singapore – we would need to stop over in London to continue our journey, so in this instance the journey is one step removed from a direct connection. We calculate this for every node and average the result, giving our closeness centrality measurement.
  • betweenness centrality measures how many times the node sits on the shortest path between other nodes. Going back to our Amsterdam airport example, it would have a high betweenness measure because it acts as a connection point for many of the world’s airports.
  • eigenvector centrality ranks each airport’s importance by the number of important airports it’s connected to, on a scale of 0 to 1 with 1 being the most important – essentially, how important an airport is depends on the importance of the airports it’s connected to.

We decided to measure the importance of airports using eigenvector centrality. In the graph to the right, each airport’s node size is relative to its eigenvector centrality rating: the bigger the circle, the greater the importance. Our results indicate that Hartsfield-Jackson Atlanta International Airport was the most important in 2008, with Chicago O’Hare International and Memphis International following closely. A sketch of the calculation is below.
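In networkx the calculation itself is only a couple of lines. The sketch below assumes a hypothetical flights dataframe with 'origin', 'destination' and 'num_flights' columns built from the 2008 flight data:

# build the airport network and rank airports by eigenvector centrality
import pandas as pd
import networkx as nx

flights = pd.read_csv("us_flights_2008.csv") # hypothetical file of airport pairs and flight counts
G = nx.from_pandas_edgelist(flights, "origin", "destination", edge_attr="num_flights")
centrality = nx.eigenvector_centrality(G, weight="num_flights")
print(sorted(centrality.items(), key=lambda pair: pair[1], reverse=True)[:5]) # the five most important airports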

A useful tool in Gephi is the ability to cluster nodes together using modularity. Modularity allows us to assign each airport to a cluster based on the connections they share. If you take another look at the above graph you’ll notice there are 4 distinct colors – these represent the clusters the airports belong to. Taking the North-Eastern seaboard as an example (the nodes highlighted in green), while airports in this cluster have connections to airports in other clusters, our results indicate that they are most heavily connected to each other. The same can be said of the South-Eastern seaboard, while there is more of a mix between the Central and Western clusters.

A final note on the U.S. airport network is that it forms a scale-free network. Without going too far into the details, the nodes with the highest eigenvector centrality sit at the centre of the network, reflecting how important they are. An important characteristic of this type of network is that it is susceptible to targeted attacks: by removing a few of these important nodes we’re able to break the network. Taking the leftmost node in the graph as an example – if we were trying to get from there across to the rightmost node, we would need to go through the central nodes. If the central nodes don’t exist, it may be impossible to reach the right node – thus our network is susceptible to targeted attacks!

Our Simulation

The Virus

As with all simulation studies, parameters have to be set. To design the virus we researched various outbreaks and decided our hypothetical virus would be based on the SARS virus from 2002. Our virus had a rapid onset time resulting in immediate transferal upon contraction, an infection rate of 19% and a death rate of 9.6%.

The Model

Our model is based on a susceptible-infected-removed (SIR) model, with each day representing one time period. The model tracks each node’s population daily, split into S (susceptible), I (infected) and R (removed). Susceptible represents the population that can be infected by its neighbours, infected represents the population carrying the virus, and removed represents the population taken out of the model, either through recovery after contracting the virus or through death. For more detailed information on our model, such as our random seeding of the outbreak, please see the download button at the bottom of this post for the full paper. A stripped-down sketch of the daily bookkeeping is below.
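To give a feel for the S/I/R idea, here’s a stripped-down sketch of a textbook discrete SIR update for a single closed population. This is not the model from the paper (which spreads infection between connected airports); the beta and gamma values are only loosely based on the 19% infection rate and 9.6% death rate quoted above, purely for illustration:

# one day of a simple SIR update for a single population (illustrative only)
def sir_step(S, I, R, beta=0.19, gamma=0.096):
    N = S + I + R
    new_infections = beta * S * I / N # infections depend on contact between S and I
    new_removals = gamma * I # a fraction of the infected recover or die each day
    return S - new_infections, I + new_infections - new_removals, R + new_removals

S, I, R = 10000, 10, 0
for day in range(30): # run the outbreak unchecked for 30 days
    S, I, R = sir_step(S, I, R)
print(round(S), round(I), round(R))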

The Simulation

We proposed that we could engineer a vaccine within 30 days of the virus outbreak (very unrealistic in real life!) so we ran the model for 30 days without interruption. We implemented several vaccine strategies after this point that provided an immediate 25% reduction of infection with diminishing returns once introduced. Most of our strategies were resource allocation calculations, meaning we decided how to allocate resources based on the importance of nodes as calculated by measures of centrality like degree and closeness. We also decided to attempt to split our resources evenly across the network as a comparison.

The Results

In total there were over 7 million flights transporting 534 million passengers between 516 airports. When left unchecked, the hypothetical virus infected 2.2 million passengers. While 2.2 million passengers doesn’t sound like a lot, our population is closed meaning we don’t consider people who aren’t travelling and how the disease spreads among them.

By distributing resources evenly we showed that the infection count drops to 1.1 million, but when we used SNA measures to allocate resources we found that a greedy algorithm was most effective. We call it a greedy algorithm because it targets the node with the most infected population at each iteration and allocates all resources to that node for that day. Using the greedy algorithm, only 370,000 passengers were infected – a third of the even-distribution figure.

Concluding Remarks

While some parameters of our hypothetical model are unrealistic, the results are concerning. They show that the rapid spread of contagious viruses is a genuine threat. Taking another look at the infection rate graph, we can see exponential growth of the virus in the first 30 days – information that is readily available to all epidemic specialists – and it is with this lens that we can perhaps conclude that China’s radical response to 2019-nCoV is the correct one.

References

1. https://www.ecdc.europa.eu/en/novel-coronavirus-china

2. https://www.livescience.com/new-china-coronavirus-faq.html

3. https://www.linkedin.com/in/eoghan-keegan-76486310b/

4. https://www.linkedin.com/in/isabelsicardi/
