webscraping for your dream car with Python

For those of you who enjoy a pretty car – here is something that might sound familiar. You’ve seen the car on the road, wondered how much it would cost, checked Done Deal, scrolled through the listings, read through the descriptions, decided against it because of the price and then closed the site. Only to repeat it all again in 6 to 8 weeks, having forgotten what the prices were the last time you browsed. Rinse. Repeat. Well, rinse and repeat no more with Python and a package called beautifulsoup.

Python & Python packages

For those unfamiliar with the world of programming, Python is a versatile programming language that is configurable to do almost anything you can think of. Want to automate an Excel report? Python does that. Want to build a machine learning model? Python does that. Want to do something as basic as adding 2 numbers together? Python does that too, you just need to know how!

This is where Python ‘packages’ come in. Python packages, or libraries, are sets of prewritten code that allow you to skip steps when creating the code to do your task. For example, rather than writing many lines of script to build a decision tree model (a type of machine learning model), the scikit-learn package lets you create one in a single line: clf = tree.DecisionTreeClassifier(). Essentially, someone else has done the grunt work for you – the joys of open source!

webscraping with Beautifulsoup

Introducing beautifulsoup, the Python package that allows users to web-scrape with ease. Webscraping is a method of extracting information from websites and is what we will use here to extract all the ads from Done Deal for our dream car. An important point to note here is that when using any webscraping tools, you should do so responsibly. If you make multiple requests to scrape information from a site in a short space of time, your IP address (your network’s identifier) can be blocked from the site. Websites do this to avoid heavy loads of unintended traffic. They’ve built the site for users and want to prioritise their experience over webscrapers, naturally enough.
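To make "responsibly" a bit more concrete, here is a minimal sketch of spacing out requests. The function name fetch_politely and the 5 second default are my own assumptions for illustration, not something Done Deal publishes; pick a pause you are comfortable with.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between requests to keep the load light.

    `fetch` is whatever function downloads a single page. It is passed in
    so the pacing logic can be tried out without touching a real website.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        pages.append(fetch(url))
    return pages
```

With urllib you might call it as fetch_politely(my_urls, fetch=lambda u: urlopen(u).read()), where my_urls is your own list of pages.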

As straightforward as beautifulsoup is to use as a Python package, each website is built differently. This means that a webscraper script built for Done Deal won’t work for Carzone. You could even find that a script you created today doesn’t work in a month’s time because the website’s content layout has changed. Another challenge is that you may be limited to a maximum number of search results depending on the limits put in place on the website. For example, in the search results of Done Deal, we can’t display more than 30 results at a time, limiting the results we get from a single request. Sounds frustrating, right? All part of the fun!

getting started

Now onto the technical stuff – you can skip to the end if you just want the Python script to do your own scraping of Done Deal! Here is the link to my GitHub file. The car we’re going to build a web scraper for is the Toyota GT86. To me, this is an amazingly stylish sports car, suitable as a daily driver and reasonably priced. Heading over to Done Deal, we find the listings for the GT86. The URL is what we’re interested in here: we’re going to take it and assign it to a variable called url. We then use urlopen, from Python’s built-in urllib library, to read the page, and beautifulsoup to parse the HTML response. What you’re left with is all the code used to create the webpage, and somewhere in this text you’ll find all the details of the cars. Now, you could manually sift through this text, but I wouldn’t recommend it, because the parsing tools in the beautifulsoup package can find the data we’re looking for.

# first we need to import all the libraries we will use

import requests # a popular library for fetching web pages (imported here, though the code below uses urlopen)
from urllib.request import urlopen # this function is used for connecting to the website
import time # we use the time library to get the current date
from bs4 import BeautifulSoup # this is the magical library - this makes it easy to access the data on websites
import pandas as pd # pandas gives us dataframes, the tables we will store the ads in

# before kicking into this, we need to set some variables
# variables are like reference points that we can access at any point in the script

# assigning the website url - I've gone for the page with the Toyota GT86
url = 'https://www.donedeal.ie/cars/Toyota/GT86'
# this opens a connection to the website
page = urlopen(url)
# this reads the raw page and turns it into text
html = page.read().decode("utf-8")
# this is how we make sense of the data we access from the website
soup = BeautifulSoup(html, 'html.parser')

# making sure we have access to the webpage in question 
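One way to make sure the page is reachable before parsing it is to catch the errors urlopen can raise. This is only a sketch: fetch_html and its opener argument are names I have made up for illustration, and the opener is there purely so the function can be tried out without a live connection.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_html(url, opener=urlopen):
    """Return the page's HTML as text, or None if the site could not be reached."""
    try:
        with opener(url) as page:
            return page.read().decode("utf-8")
    except HTTPError as error:
        # the site answered, but with an error status such as 403 or 404
        print(f"The site answered with an error code: {error.code}")
    except URLError as error:
        # the request never got an answer (no connection, bad address, ...)
        print(f"Could not reach the site: {error.reason}")
    return None
```

If the function returns None, there is nothing to parse and the script can stop early instead of failing further down.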

Working out which part of the results page holds the nuggets of data we’re interested in involves some trial and error. For a full in-depth overview of what this entails, here are two great articles that helped me along the way:



DoneDeal html response

After exploring the site, the main bit of HTML that we will use to extract the data is the ‘div’ tag with the ‘class’ attribute ‘Card__Body-sc-1v41pi0-8 cAPCyy’. This is because each ad is contained in a div with this class. What we need to do is extract each element of the ad iteratively and create a dataframe to save them into. The elements of the ads we’re interested in are the ad title, the year of the car, the location of the car and the car price. Because we know that each ad has the same elements, instead of extracting each ad individually, we can create functions that iterate over the ads, extracting elements individually and then bringing them together at the end. In our case, we create four functions and build a dataframe from their output.
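Class names like the one above look machine-generated, so the trailing part may well change when the site is rebuilt. As a hedged sketch, you could match on the start of the class name only. The HTML below is made up to imitate the structure described here (it is not copied from Done Deal), and has_prefix is a helper name of my own.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure described above
sample_html = """
<ul>
  <li><div class="Card__Body-sc-1v41pi0-8 cAPCyy"><p class="Card__Title-sc-1v41pi0-4 duHUaw">Toyota GT86 2.0</p></div></li>
  <li><div class="Card__Body-sc-1v41pi0-8 zzNEW1"><p class="Card__Title-sc-1v41pi0-4 qqNEW2">Toyota GT86 Aero</p></div></li>
  <li><div class="Banner__Body-sc-1">Not an ad card</div></li>
</ul>
"""


def has_prefix(prefix):
    """Build a filter that accepts any class name starting with `prefix`."""
    return lambda class_name: class_name is not None and class_name.startswith(prefix)


sample_soup = BeautifulSoup(sample_html, "html.parser")

# find_all accepts a function for class_ and tests it against each class name
cards = sample_soup.find_all("div", class_=has_prefix("Card__Body"))
titles = [
    card.find("p", class_=has_prefix("Card__Title")).get_text(strip=True)
    for card in cards
]
print(titles)
```

This is not how the script in this post selects the ads; it is just an alternative that may survive small changes to the site's styling.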

# we need to access each element of the ad, so we need a function for each, e.g. ad title, car year etc.
# below is the function we can use to get the title of the ad

def car_names(card_body):
    names = []
    for item in card_body:
        for i in item:
            names.append(i.find_all("p", attrs = {"class" : "Card__Title-sc-1v41pi0-4 duHUaw"}))
    return names
# the next 3 functions are self explanatory - car year, car location and car price

def car_year(card_body):
    years = []
    for item in card_body:
        for i in item:
            years.append(i.find_all("li", attrs = {"class" : "Card__KeyInfoItem-sc-1v41pi0-5 hGnmwg ki-1"}))
    return years

def car_location(card_body):
    locations = []
    for item in card_body:
        for i in item:
            locations.append(i.find_all("li", attrs = {"class" : "Card__KeyInfoItem-sc-1v41pi0-5 hGnmwg ki-5"}))
    return locations

def car_price(card_body):
    prices = []
    for item in card_body:
        for i in item:
            prices.append(i.find_all("p", attrs = {"class" : "Card__InfoText-sc-1v41pi0-13 jDzR"}))
    return prices
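
# card_body holds every ad on the page (assumed here to be each div with the class identified earlier)
card_body = soup.find_all("div", attrs = {"class" : "Card__Body-sc-1v41pi0-8 cAPCyy"})

# now we can use the functions we created to extract the information we're interested in and create a dataframe
# a dataframe is like a table in an Excel file - it allows us to 'play' with the data
car_listings = pd.DataFrame({"Name" : car_names(card_body), "Year": car_year(card_body), "Price": car_price(card_body), "Location": car_location(card_body)})

the boring part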

Et voilà, once you run this you have a nice dataframe ready to go. If only. What you’re left with is a bit of a mess! While there are only 20-odd ads on the page, the dataframe has over 100 rows. And because you’ve extracted HTML, the data in your columns is contained in lists. And then within some of these lists, you have multiple lists – look at the format of the price column. This is where you need some data wrangling experience to get the data ready for consumption. As always, I recommend you go through each step yourself to observe the changes and purpose of each bit of code. This bit gets easier with time though.
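To give a flavour of the wrangling, here is a rough sketch of one way to tidy things up. It assumes the text has already been pulled out of the HTML, so that each cell is a plain list of strings (empty for the rows that held no ad), and that prices are whole euro amounts. The names tidy_rows, first_text and as_number are my own.

```python
import re


def first_text(cell):
    """Return the first string in a list-cell, trimmed, or None if the cell is empty."""
    return cell[0].strip() if cell else None


def as_number(text):
    """Strip everything but digits, e.g. '€18,950' becomes 18950."""
    digits = re.sub(r"\D", "", text or "")
    return int(digits) if digits else None


def tidy_rows(names, years, prices, locations):
    """Turn parallel lists of list-cells into one clean dict per ad."""
    rows = []
    for name, year, price, location in zip(names, years, prices, locations):
        if not name:  # skip the rows that came from elements with no ad title
            continue
        rows.append({
            "Name": first_text(name),
            "Year": as_number(first_text(year)),
            "Price": as_number(first_text(price)),
            "Location": first_text(location),
        })
    return rows
```

The list of dicts it returns can be handed straight to pd.DataFrame to get one row per ad.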


You can save the results to a new file each time you run the script, or you can append them to an existing file. As for what you do with the data itself, that’s up to you. One point of note is that the car you’re interested in may have multiple pages of results. To handle this, you can create a for loop and iterate over the result pages. If you want a hand with this, reach out!
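Both ideas can be sketched in a few lines. Treat this as a starting point only: the start query parameter in page_urls is a guess on my part, so check the address bar when you click through to page two of your own search. The rows are assumed to be dicts with Name, Year, Price and Location keys, and both function names are made up for illustration.

```python
import csv
import os
import time
from urllib.parse import urlencode


def page_urls(base_url, pages, page_size=30, param="start"):
    """Build one URL per results page. The offset parameter name is an assumption."""
    return [
        f"{base_url}?{urlencode({param: page * page_size})}"
        for page in range(pages)
    ]


def append_listings(path, rows, fieldnames=("Date", "Name", "Year", "Price", "Location")):
    """Append rows to a CSV file, stamping each one with today's date."""
    today = time.strftime("%Y-%m-%d")
    is_new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new_file:
            writer.writeheader()  # only write the column names once
        for row in rows:
            writer.writerow({"Date": today, **row})
```

Because every row carries the date it was scraped, running the script every few weeks builds up a price history, which is exactly what gets forgotten between browsing sessions.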

Here is the link to the GitHub page. Until next time, nerd out, fellow car enthusiasts!
