Web scraping can be a very useful way to extract data from websites, but to anyone who has only ever viewed the web through a browser it can seem like magic.
For example, here is http://www.geektechstuff.com viewed via a web browser:
And here is http://www.geektechstuff.com when the HTML is read by Python:
The majority of people prefer the browser view, and it is easy to see why. Modern-day sites contain text, images, animations, videos and interactive sections, all of which rely on a browser to turn them from code into objects on the screen that humans can understand and interact with. However, with a little bit of Python we can sometimes do things the browser cannot: extracting data from the site, accessing site resources more quickly, or saving parts of the site.
To start with, a little bit of web knowledge is needed when thinking about using a web scraper. If you are web savvy, please feel free to skim or skip this section.
The “GET” process:
- Bob has a computer with a web browser. He wants to open the http://www.geektechstuff.com website.
- His browser sends a request down through the 7-layer OSI model which, in very basic terms, encapsulates the request into a packet. The packet leaves the computer, goes via the local router and then starts its hops across the internet.
- Eventually the packet reaches the platform hosting http://www.geektechstuff.com and says that Bob’s computer wants to see http://www.geektechstuff.com.
- http://www.geektechstuff.com then sends back an HTML file.
HTML – Hyper Text Markup Language
CSS – Cascading Style Sheets
OSI – Open Systems Interconnection
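To make the GET steps above a little more concrete, here is a minimal sketch of the text a client sends when it asks for a page. The function name build_get_request is my own invention for illustration, and a real browser sends many more headers than the two shown here, so treat it as a simplified picture rather than exactly what Bob's browser produces.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Build the text of a very simple HTTP/1.1 GET request for `url`.

    Hypothetical helper for illustration only; it sends nothing.
    """
    parts = urlsplit(url)
    # An empty path means "the front page", which is requested as "/".
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",    # what we want
        f"Host: {parts.netloc}",   # which site we want it from
        "Connection: close",       # we only plan to make one request
        "",                        # a blank line ends the headers
        "",
    ]
    return "\r\n".join(lines)


request_text = build_get_request("http://www.geektechstuff.com")
print(request_text)
```

This text is what gets wrapped up into packets and sent on its way; the HTML file comes back as the body of the server's reply.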
An “HTML” file:
A Hyper Text Markup Language file is the backbone of a website. The file is made up of opening tags such as <p> and closing tags such as </p> that wrap the data. These tags can be, for example, the <title> of the page, <p> paragraph tags, <img> tags saying where images should come from, and formatting tags for <i> italics, <b> bold and <u> underlining.
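To see this tag structure from Python's point of view, here is a small sketch using the html.parser module from the standard library. The sample HTML string and the TagLister class name are made up for this example; they are not taken from any real site.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening and closing tag seen (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.opened = []
        self.closed = []

    def handle_starttag(self, tag, attrs):
        # Called for each opening tag, e.g. <p> or <img src="...">
        self.opened.append(tag)

    def handle_endtag(self, tag):
        # Called for each closing tag, e.g. </p>
        self.closed.append(tag)


# A made-up page, just big enough to show the idea.
sample = (
    "<html><head><title>My page</title></head>"
    "<body><p>Some <b>bold</b> text</p><img src='cat.png'></body></html>"
)

lister = TagLister()
lister.feed(sample)
lister.close()

print("Opened:", lister.opened)
print("Closed:", lister.closed)
```

Note that img shows up in the opened list but not in the closed list: some tags, such as <img>, stand on their own and have no closing partner.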
Accessing pages via Python:
With the web basics out of the way, it’s time to look at using Python. Python has a built-in module for accessing web pages called urllib.request, which is part of the standard library.
With just a few lines urllib.request can be used to access a webpage and read it with Python.
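As a rough sketch of what those few lines can look like, the function below opens a URL and returns the page as text. The name fetch_html, the 10 second timeout and the fallback to UTF-8 when the server does not state an encoding are all my own assumptions for the example.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Open `url` and return whatever comes back, decoded to text.

    Hypothetical helper: the timeout and the UTF-8 fallback are
    assumptions made for this sketch.
    """
    with urlopen(url, timeout=timeout) as response:
        # Use the encoding the server declares, if it declares one.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling fetch_html("http://www.geektechstuff.com") and printing the result should then show the HTML of the page, provided the site is reachable from your machine.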
Although this pulls back the HTML data, the raw output is a little hard for humans to read.
Thankfully another Python library can assist with this, the wonderful Beautiful Soup (currently Beautiful Soup version 4).
With Beautiful Soup installed, a few additional lines of Python make the HTML much easier to read.
With the basics now in place, in my next post I will look at using Python for more web fun 🙂
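Here is a minimal sketch of Beautiful Soup at work. It assumes the beautifulsoup4 package is installed (for example via pip install beautifulsoup4) and uses a made-up HTML string rather than a live page so that it runs without a network connection. In practice the string would be the HTML read from the site.

```python
from bs4 import BeautifulSoup

# A made-up page standing in for HTML fetched from a real site.
sample_html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello <b>world</b></p><a href='/about'>About</a></body></html>"
)

# "html.parser" is Python's built-in parser, so no extra install is needed.
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() re-indents the HTML, one tag per line.
pretty = soup.prettify()
print(pretty)

# Individual pieces of the page can be picked out as well.
title = soup.title.string
links = [a.get("href") for a in soup.find_all("a")]
text = soup.get_text(" ", strip=True)

print("Title:", title)
print("Links:", links)
print("Text:", text)
```

The indented output from prettify() is far easier on the eyes than one long run of tags, and find_all() is a first taste of pulling specific data out of a page.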