Web Scraping Part 1 (Python)

Web scraping can be a very useful way to extract data from websites, but for anyone who has only viewed the web through a browser it can seem like magic.

For example, here is http://www.geektechstuff.com viewed via a web browser:

[Image: geektechstuff.com in browser]

And here is http://www.geektechstuff.com when the HTML is read by Python:

[Image: geektechstuff.com HTML]

The majority of people prefer the browser view, and it is easy to see why. Modern sites contain text, images, animations, videos and interactive sections, all of which rely on a browser to turn them from code into objects on the screen that humans can understand and interact with. However, with a little bit of Python we can sometimes do a lot more than the browser can: extracting data from the site, accessing site resources more quickly, or saving parts of the site.

I have briefly used web scraping in some of my earlier Python work (Amazon price checker, pollen count) and I’m now hoping to expand that work and knowledge and go a bit more in depth.

To start with, a little bit of web knowledge is needed when thinking about using a web scraper. If you are web savvy, please feel free to skim or skip this section.

The “GET” process:

  1. Bob has a computer with a web browser. He wants to open the http://www.geektechstuff.com website.
  2. His browser sends a request down through the seven-layer OSI model which (in very basic terms) encapsulates the request into a packet, which leaves the computer, goes via the local router and then starts its hops across the internet.
  3. Eventually the packet reaches the platform hosting http://www.geektechstuff.com and says that Bob’s computer wants to see http://www.geektechstuff.com.
  4. http://www.geektechstuff.com then sends back an HTML file.
  5. Bob’s web browser then loads the HTML file, reads any linked CSS files or JavaScript files and asks for any images/animations/videos that should be displayed on the page.
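The request from steps 2 and 3 can be inspected in Python before anything is sent. This is only a minimal sketch using the Request class from urllib.request. The variable names are illustrative, and nothing here touches the network.

```python
from urllib.request import Request

# Build (but do not send) the kind of request Bob's browser would make
request = Request("http://www.geektechstuff.com")

method = request.get_method()  # "GET", because no data is attached
scheme = request.type          # the protocol part of the URL
host = request.host            # the server the request is addressed to

print(method, scheme, host)
```

Passing a Request object like this one to urlopen() is what actually sends it on its way.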

Glossary:

HTML – HyperText Markup Language

CSS – Cascading Style Sheets

OSI – Open Systems Interconnection

An “HTML” file:

A HyperText Markup Language file is the backbone of a website. The file is made up of opening tags <> and closing tags </> that contain data. These tags can be, for example, the <title> of the page, <p> paragraph tags, <img> tags saying where images should come from, and formatting tags for <i> italics, <b> bold and <u> underlining.

The HTML file also contains links to the style sheets (CSS) and JavaScript files. When writing HTML, a method of tagging (such as id and class attributes) comes into play so that the website creator can reference elements of the page in both the HTML and the CSS. It’s this tagging that I’ve used in past projects when scraping data from a website via Python.
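To make the tag structure a bit more concrete, here is a small sketch using Python's built-in html.parser module on a made-up page. The sample HTML, the TagLister class name and the class="intro" attribute are all illustrative assumptions, not taken from geektechstuff.com.

```python
from html.parser import HTMLParser

# A tiny made-up page, just for illustration
SAMPLE_HTML = """<html>
<head><title>Example page</title></head>
<body>
<p class="intro">Hello <b>world</b></p>
<img src="logo.png">
</body>
</html>"""


class TagLister(HTMLParser):
    """Collects every opening and closing tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.opened = []  # (tag name, attributes) for each opening tag
        self.closed = []  # tag name for each closing tag

    def handle_starttag(self, tag, attrs):
        self.opened.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.closed.append(tag)


lister = TagLister()
lister.feed(SAMPLE_HTML)

for tag, attrs in lister.opened:
    print(tag, attrs)
```

Note that <img> shows up as an opened tag but never as a closed one, and that the class attribute on the <p> tag is the sort of "tagging" a scraper can later search for.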

Accessing pages via Python:

With the web basics out of the way, it’s time to look at using Python. Python has a built-in module for accessing web pages, called urllib.request.

[Image: urllib.request in Python]

With just a few lines, urllib.request can be used to access a webpage and read it with Python.

from urllib.request import urlopen

def show_html(URL_input):
    # Open the URL and return the raw HTML (as bytes)
    html = urlopen(URL_input)
    return html.read()

 

Although this function pulls back the HTML data, the output is a little hard for humans to read:

[Image: urllib reading HTML in Python]
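Part of the reason the output looks cluttered is that read() returns bytes rather than text, so line breaks show up as \n escapes. Below is a minimal sketch of decoding the bytes, assuming the page is UTF-8 whenever the server does not state a charset. decode_html and show_html_text are hypothetical helper names, not part of urllib.

```python
from urllib.request import urlopen


def decode_html(raw_bytes, charset=None):
    """Turn the bytes from read() into text (assumes UTF-8 if no charset is given)."""
    return raw_bytes.decode(charset or "utf-8", errors="replace")


def show_html_text(url):
    """Fetch a page and return its HTML as text rather than bytes."""
    with urlopen(url) as response:
        # Use the charset from the Content-Type header, if the server sent one
        charset = response.headers.get_content_charset()
        return decode_html(response.read(), charset)


# A made-up response body, so the difference can be seen without a network call
sample = b"<html>\n<body>\n<p>Hello</p>\n</body>\n</html>"
print(sample)               # bytes: one long line containing \n escapes
print(decode_html(sample))  # text: real line breaks
```

Calling show_html_text("http://www.geektechstuff.com") should give the same HTML as show_html, just as a string. Decoding only tidies the line breaks, though. It does not help with picking out individual tags.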

Thankfully another Python library can assist with this, the wonderful Beautiful Soup (currently Beautiful Soup version 4).

[Image: Beautiful Soup in Python]

With Beautiful Soup installed and a few additional lines of Python:

from bs4 import BeautifulSoup

def soupy():
    # Parse the raw HTML with Python's built-in HTML parser
    bs = BeautifulSoup(show_html("https://www.geektechstuff.com"), "html.parser")
    return bs

print(soupy())

It returns a nicer view of the HTML, making it much easier for humans to read:
[Image: geektechstuff.com HTML]
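Beautiful Soup can do more than tidy the output. It can also pick out individual tags. The sketch below runs on a made-up HTML string rather than the live site, so the tag contents, the class name and the links are illustrative assumptions. It needs the bs4 package to be installed.

```python
from bs4 import BeautifulSoup

# A tiny made-up page, just for illustration
SAMPLE_HTML = """<html><head><title>Example page</title></head>
<body><p class="intro">Hello <b>world</b></p>
<a href="/about">About</a> <a href="/contact">Contact</a></body></html>"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# prettify() gives an indented, one-tag-per-line view of the document
print(soup.prettify())

# Individual pieces of data can be pulled out by tag name or class
page_title = soup.title.get_text()
intro = soup.find("p", class_="intro").get_text()
links = [a.get("href") for a in soup.find_all("a")]

print(page_title, intro, links)
```

The same calls should work on the object returned by a function like soupy(), although which tags and classes exist depends on the page being read.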

With the basics now in place, my next post will look at using Python for more web fun 🙂
