How data harvesting works for Python beginners

HOWTO: Easy Web Scraping Using Python


A special offer in the web shop

Two weeks ago, a friendly info e-mail alerted me to a promotion by a frequently used online mail-order company whose name is reminiscent of a river in South America. I was offered three music CDs from a large selection for €15.

As in the past, I still enjoy buying music on physical media and wanted to take a closer look at the offer. It turned out that around 9,000 CDs were on offer, spread over roughly 400 pages in the online shop. The shop lets me sort the offer by popularity or by customer rating. However, when I sort by popularity in descending order, I find many titles that no longer quite match my age group. When I sort by customer rating instead, it turns out that the shop ranks the ratings unweighted: a CD with a single 5-star rating is listed before another CD with 4.9 stars from 1,000 ratings.

Web scraping

At first I didn't feel like going through all 400 pages by hand to see whether anything was of interest to me. So I resorted to a trick I have used quite often in the past, namely automatically harvesting the content of the website. The procedure is anything but new, but within the data science community it now goes by a new name: web scraping.

It is not unlikely that, as a data scientist, you will have to pull data from the web yourself. That's why I want to use my simple problem to show how low the entry barrier can be with Python.

Get to work



Fortunately, the online provider uses a request method that puts the parameters of the request in plain text in the URL. Here is an example that I anonymized a little:
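The original example URL was lost from this post, so here is an invented stand-in: the host and path are hypothetical, but it shows the two page parameters in the same shape they take in the real URL.

```python
# Hypothetical stand-in for the anonymized shop URL -- host and path
# are invented, only the shape of the two page parameters matters.
url = ("https://www.example-shop.com/s/3-cds-for-15_pg_1"
       "?ie=UTF8&page=1&sort=rating")
print(url)
```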

You can see that the page number is referenced in two places, once as _pg_1 and once as &page=1&. So if I adjust these two places, I can iterate right through all of the subpages.


Get the website

To actually read the website, we use a module from Python's standard library, namely urllib. Loading the page looks like this (the URL is again mutilated by me):
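The original listing is missing here, so this is a minimal sketch of the loading step as the surrounding text describes it; the URL is a placeholder and the function name fetch_page is my own.

```python
import urllib.request

def fetch_page(url):
    """Download one shop page and return its HTML as a string."""
    request = urllib.request.urlopen(url)
    if request.getcode() == 200:      # 200 = page opened successfully
        return str(request.read())    # str() turns the bytestring into a string
    return ""

# html = fetch_page("https://www.example-shop.com/...")  # placeholder URL
```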

The request object is used to open the website. The HTML code is read out if the status code of the request is 200, the usual code for a successfully opened page; we all know 404 for a page that is not found. The str() function around the read step is necessary to actually get a string rather than a bytestring, which would otherwise cause us problems later.


Parsing the HTML code

Now that we have the HTML of the page, we can search it for CD titles and artists. First, I looked directly at the HTML code in the browser to spot distinctive patterns to search for. I noticed that all titles begin with

and end with Audio CD. In between there is a lot of code that is unimportant for us.

We cut the relevant code positions out of the page with regular expressions and save the information in a list.
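The original pattern is missing from this post (the start-of-title marker was apparently swallowed by the blog engine), so the following sketch uses an invented `<span class="title">` marker on a toy snippet; only the non-greedy `.*?` between the two markers is taken from the text.

```python
import re

# Invented markup standing in for the real shop HTML.
html = ('<span class="title">Abbey Road</span><span>by</span>'
        '<span>The Beatles</span> Audio CD ... '
        '<span class="title">Kind of Blue</span><span>by</span>'
        '<span>Miles Davis</span> Audio CD')

# Non-greedy .*? stops at the first "Audio CD", so each match is one album.
albums = re.findall('<span class="title">.*?Audio CD', html)
print(len(albums))  # 2
```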

re.findall is part of the regular-expression package re. With it I search for all places in the HTML code that match the above pattern, where .*? is the placeholder for an arbitrary string. This placeholder is "non-greedy", which means it tries to match the unknown area with as few characters as possible; otherwise you would capture several albums at once.

For each of the code snippets found, I repeat the process with the search pattern '>.*?<', which now returns, again non-greedily, the individual contents of the tags. By "looking critically at the code" I can see that the album title is stored in the first tag (index 0) and the artist in the 14th tag (index 13). To get rid of the angle brackets of the tags, we slice the results from the second to the penultimate character ([1:-1]).

The result is appended to my page_content result list as a tuple.
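Sketched on a toy snippet, that extraction step might look like the following; note that the artist index depends on the actual markup (the author's page needed index 13, the invented snippet here puts the artist at index 4).

```python
import re

page_content = []
# Invented snippet; on the real page the artist sat in tag index 13,
# in this toy markup it lands in index 4.
snippet = ('<span class="title">Abbey Road</span><span>by</span>'
           '<span>The Beatles</span> Audio CD')

tags = re.findall('>.*?<', snippet)  # non-greedy: one tag content per match
title = tags[0][1:-1]                # [1:-1] slices away the '>' and '<'
artist = tags[4][1:-1]
page_content.append((title, artist))
print(page_content)  # [('Abbey Road', 'The Beatles')]
```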


Writing in a CSV

At the end comes the easy part. We open a file handle and write each album/artist tuple, separated by a tab, into the file. I am using the print function from Python 3, which automatically appends a line break to every call and accepts the file handle as the target parameter for the output.
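A minimal sketch of that writing step; the file name offers.txt is taken from the review section below, while the sample tuples are invented.

```python
# Invented sample data; in the real script page_content is filled
# by the parsing step above.
page_content = [("Abbey Road", "The Beatles"),
                ("Kind of Blue", "Miles Davis")]

with open("offers.txt", "w") as handle:
    for title, artist in page_content:
        # sep="\t" puts a tab between the fields, print() adds the newline
        print(title, artist, sep="\t", file=handle)
```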


Overall script

The part that reads the website must of course be run in a loop, with the number of pages parameterized. Here is the overall script:
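The original overall script is missing from this post, so here is a reconstructed sketch that combines the steps described above. The URL, the start-of-title marker, and the function names are invented stand-ins for the anonymized originals, and ARTIST_INDEX depends on the real markup (the author needed 13; 4 fits the toy markup used in this post).

```python
import re
import urllib.request

# Invented stand-ins -- the real shop URL and markup differ.
BASE_URL = ("https://www.example-shop.com/s/3-cds-for-15_pg_{page}"
            "?ie=UTF8&page={page}&sort=rating")
N_PAGES = 400
ARTIST_INDEX = 4  # 13 on the author's real page


def fetch_page(url):
    """Return the HTML of one shop page, or '' on a non-200 status."""
    request = urllib.request.urlopen(url)
    if request.getcode() == 200:
        return str(request.read())
    return ""


def parse_page(html):
    """Cut all (title, artist) tuples out of one page's HTML."""
    albums = []
    for snippet in re.findall('<span class="title">.*?Audio CD', html):
        tags = re.findall('>.*?<', snippet)
        albums.append((tags[0][1:-1], tags[ARTIST_INDEX][1:-1]))
    return albums


def scrape(filename="offers.txt"):
    """Loop over all subpages and write the tab-separated result file."""
    page_content = []
    for page in range(1, N_PAGES + 1):
        page_content.extend(parse_page(fetch_page(BASE_URL.format(page=page))))
    with open(filename, "w") as handle:
        for title, artist in page_content:
            print(title, artist, sep="\t", file=handle)
```

Calling scrape() then kicks off the whole harvest.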

And done!

Review of the data

So I can now open my offers.txt file in Excel, for example, and examine it with a pivot table. That makes it much easier for me to get an overview of which artists are represented in the offer in the first place.

I think writing the script took me about the same half hour I would have needed to click through the pages, but I had more fun. And since I actually wrote the script as a class with the sub-steps as methods, I will probably be able to reuse it in similar scenarios.