HOWTO: Easy Web Scraping Using Python
A special offer in the web shop
Two weeks ago, a friendly promotional e-mail from a frequently used online retailer, whose name is reminiscent of a river in South America, alerted me to a special offer: three music CDs from a large selection for € 15.
As in the past, I still like to buy music on physical media, so I wanted to take a closer look at the offer. It turned out that around 9,000 CDs were on offer, spread over roughly 400 pages in the online shop. The shop lets me sort the offer by popularity or by customer rating. However, when I sort by popularity in descending order, I find many titles that no longer quite match my age group. When I sort by customer rating, on the other hand, it turns out that the shop does not weight the ratings. That means some CD of pop hits with just a single 5-star rating is listed ahead of another CD with 4.9 stars from 1,000 ratings.
I didn't feel like going through all 400 pages by hand to see whether something was of interest to me. So I resorted to a trick that I have used quite often in the past: harvesting the content of the website automatically. This procedure is anything but new, but within the data science community it now goes by a new name: web scraping.
As a data scientist, you may well have to pull data from the web yourself at some point. That's why I want to use my simple problem to show how low the barrier to entry can be with Python.
Get to work
Fortunately, the online retailer uses a request method that spells out the request parameters in plain text in the URL. Here is an example that I have anonymized a little:
http://www.onlineshop.de/s/ref=lp_12345_pg_1&rh=123456&page=1&ie=UTF8&qid=12345
You can see that the page is referenced in two places, once as _pg_1 and once as &page=1&. So if I adjust these two places, I can iterate through all of the subpages.
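A minimal sketch of that iteration, assuming the anonymized example URL above as a template. The names URL_TEMPLATE and build_page_url are made up for illustration, and the range of three pages is arbitrary:

```python
# Template based on the anonymized example URL; both page markers
# (_pg_1 and &page=1&) are replaced by the same placeholder.
URL_TEMPLATE = (
    "http://www.onlineshop.de/s/ref=lp_12345_pg_{page}"
    "&rh=123456&page={page}&ie=UTF8&qid=12345"
)


def build_page_url(page):
    """Return the URL of one result page (hypothetical helper)."""
    return URL_TEMPLATE.format(page=page)


# Build the URLs for the first three pages; the real offer had around 400.
urls = [build_page_url(page) for page in range(1, 4)]
for url in urls:
    print(url)
```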
Get the website
To actually read the page, we use a module that ships with Python, namely urllib. Loading the page looks like this (I have obscured the URL again):
The request object opens the page. The HTML is read only if the status code of the response is 200, the usual code for a successful request. We all know 404, for example, which is returned when a page is not found. Wrapping the read step in the str function is necessary to actually get strings and not byte strings; otherwise we would run into problems later.
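The loading step described above might be sketched as follows. This is not the original listing: the function names fetch_html and decode_body are hypothetical, the timeout is an arbitrary choice, and the sketch decodes the bytes explicitly as UTF-8 instead of wrapping the read step in str, because str(b"abc") yields the text "b'abc'" including the prefix and quotes.

```python
import urllib.request


def decode_body(raw, encoding="utf-8"):
    """Turn the raw bytes of a response into a str (assumes UTF-8 by default)."""
    return raw.decode(encoding, errors="replace")


def fetch_html(url, timeout=10):
    """Return the page's HTML as a str, or None if the request did not succeed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            # 200 is the usual status code for a successful request.
            if response.status != 200:
                return None
            return decode_body(response.read())
    except OSError:
        # URLError, HTTPError (for example a 404) and socket timeouts
        # are all subclasses of OSError.
        return None
```

Note that urlopen raises an HTTPError for a 404 rather than returning a response, which is why the sketch handles the exception in addition to checking the status code.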
Parsing the HTML code
Now that we have the HTML of the page, we can search it for CD titles and artists. Beforehand, I looked at the HTML code directly in the browser to find conspicuous patterns that one can search for. I noticed that all titles begin with
and end with Audio CD. In between there is a lot of code that is irrelevant to us.
Using regular expressions, we cut the relevant sections out of the page and store the information in a list.
re.findall is part of the regular expression package. With it I search for all places in the HTML code that match the above pattern, where .*? is the placeholder for any sequence of characters. This placeholder is non-greedy, which means it tries to match the unknown area with as few characters as possible; otherwise a single match would capture several albums at once.
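The difference between the greedy and the non-greedy placeholder can be illustrated with a small sketch. The start marker `<h2 class="title">` and the sample HTML are invented for this example, since the shop's real marker is not shown here; only the end marker Audio CD comes from the text.

```python
import re

# Invented sample with two albums; the real page looks different.
sample = (
    '<h2 class="title">Album A</h2><a>Artist A</a>Audio CD'
    '<h2 class="title">Album B</h2><a>Artist B</a>Audio CD'
)

START = '<h2 class="title">'  # hypothetical start marker
END = "Audio CD"              # end marker named in the text

# Non-greedy: .*? stops at the first possible end marker.
lazy = re.findall(re.escape(START) + ".*?" + re.escape(END), sample, flags=re.DOTALL)

# Greedy: .* runs on to the last end marker and swallows both albums.
greedy = re.findall(re.escape(START) + ".*" + re.escape(END), sample, flags=re.DOTALL)

print(len(lazy), len(greedy))
```

The re.DOTALL flag lets the placeholder run across line breaks, which matters as soon as an album's markup spans several lines.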
For each of the code snippets found, I repeat the process with the search pattern '>.*?<', which returns, again non-greedily, the individual contents of the tags. By "looking critically at the code" I can see that the album title is stored in the first match (index 0) and the artist in the 14th match (index 13). To get rid of the angle brackets of the tags, we slice each result from the second to the penultimate character ([1:-1]).
The result is appended to my page_content result list as a tuple.
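As a sketch of this second step, assuming an invented snippet: on the real page the artist sits at index 13, whereas in the short sample below it sits at index 4, so the index is passed in as a parameter. The function name extract_title_artist is hypothetical; page_content is the result list mentioned above.

```python
import re


def extract_title_artist(snippet, title_index=0, artist_index=13):
    """Return (title, artist) from one album snippet (hypothetical helper)."""
    parts = re.findall(r">.*?<", snippet)
    # Drop the leading '>' and the trailing '<' of every match.
    contents = [part[1:-1] for part in parts]
    return contents[title_index], contents[artist_index]


# Invented snippet; adjacent tags produce empty matches such as '><',
# which is why the artist does not sit at index 1.
snippet = "<h2>Album A</h2><span>by</span><a>Artist A</a>Audio CD"

page_content = []
page_content.append(extract_title_artist(snippet, artist_index=4))
print(page_content)
```

Counting the matches by position is fragile: as soon as the shop changes its markup, the indices have to be determined again.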
Writing to a CSV file
The easy part comes last. We open a file handle and write each album/artist tuple to the file, separated by a tab. I am using the print function from Python 3, which automatically appends a line break on every call and accepts the file handle as the target parameter for the output.
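A minimal sketch of the writing step. The function names write_rows and save_offers and the sample tuples are made up; the file name offers.txt is the one used later in the text. An in-memory buffer stands in for the file in the demonstration so that nothing is written to disk.

```python
import io


def write_rows(rows, handle):
    """Write each (title, artist) tuple as one tab-separated line."""
    for title, artist in rows:
        # print adds the line break itself; file= redirects the output.
        print(title, artist, sep="\t", file=handle)


def save_offers(rows, path="offers.txt"):
    """Open a file handle and write all rows to it."""
    with open(path, "w", encoding="utf-8") as handle:
        write_rows(rows, handle)


# Demonstration with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_rows([("Album A", "Artist A"), ("Album B", "Artist B")], buffer)
print(buffer.getvalue(), end="")
```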
Reading the pages must of course happen in a loop, with the number of pages as a parameter. Here is the overall script:
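The original listing is not reproduced here, so the following is only a sketch of how the steps could fit together. It assumes the anonymized URL as a template, a hypothetical start marker `<h2 class="title"`, and UTF-8 encoded pages; all function names are made up for illustration.

```python
import re
import urllib.request

URL_TEMPLATE = (
    "http://www.onlineshop.de/s/ref=lp_12345_pg_{page}"
    "&rh=123456&page={page}&ie=UTF8&qid=12345"
)
ITEM_START = '<h2 class="title"'  # hypothetical; the real marker is not known
ITEM_END = "Audio CD"
ITEM_PATTERN = re.compile(
    re.escape(ITEM_START) + ".*?" + re.escape(ITEM_END), re.DOTALL
)


def fetch_html(url):
    """Return the HTML of a page as str, or None on failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            if response.status == 200:
                return response.read().decode("utf-8", errors="replace")
    except OSError:  # includes HTTPError such as 404
        pass
    return None


def parse_page(html, title_index=0, artist_index=13):
    """Return a list of (title, artist) tuples found on one page."""
    page_content = []
    for snippet in ITEM_PATTERN.findall(html):
        contents = [part[1:-1] for part in re.findall(r">.*?<", snippet)]
        # Skip snippets that are too short for the expected positions.
        if len(contents) > max(title_index, artist_index):
            page_content.append((contents[title_index], contents[artist_index]))
    return page_content


def scrape(num_pages, fetch=fetch_html, **indices):
    """Loop over all pages and collect the tuples."""
    rows = []
    for page in range(1, num_pages + 1):
        html = fetch(URL_TEMPLATE.format(page=page))
        if html is not None:
            rows.extend(parse_page(html, **indices))
    return rows


def save(rows, path="offers.txt"):
    """Write the tuples tab-separated, one album per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for title, artist in rows:
            print(title, artist, sep="\t", file=handle)

# Usage (not executed here): save(scrape(400))
```

Passing the fetch function in as a parameter makes it possible to try out the parsing with saved HTML before sending 400 requests to the shop.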
Review of the data
I can now open my offers.txt file in Excel, for example, and examine it with a pivot table. That makes it much easier to get an overview of the artists included in the offer in the first place.
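For those who would rather stay in Python than switch to Excel, a rough equivalent of such a pivot table is a count of albums per artist. This is a swapped-in alternative, not the approach used above; the function name and the sample data are invented, and the input is assumed to be tab-separated lines of title and artist as in offers.txt.

```python
import csv
import io
from collections import Counter


def count_albums_per_artist(handle):
    """Count how many albums each artist has in a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    # Column 0 is the title, column 1 the artist; skip malformed lines.
    return Counter(row[1] for row in reader if len(row) >= 2)


# Invented sample data standing in for the contents of offers.txt.
sample_file = io.StringIO(
    "Album A\tArtist A\nAlbum B\tArtist B\nAlbum C\tArtist A\n"
)
counts = count_albums_per_artist(sample_file)
for artist, number in counts.most_common():
    print(artist, number, sep="\t")
```

With the real file, the buffer would be replaced by open("offers.txt", encoding="utf-8", newline="").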
I think writing the script took me about half an hour, roughly as long as I would have needed to click through the pages, but I had more fun. And since I actually wrote the script as a class with the sub-steps as methods, I will probably be able to reuse it in similar scenarios.