Web Scraping Part 1 (Python)

Web scraping can be a very useful way to extract data from websites, but to anyone who has only viewed the web through a browser it can seem like magic.

For example, here is http://www.geektechstuff.com viewed via a web browser:

[Image: geektechstuff.com in browser]

And here is http://www.geektechstuff.com when the HTML is read by Python:

[Image: geektechstuff.com HTML]

Most people prefer the browser view, and it is easy to see why. Modern sites contain text, images, animations, videos and interactive sections, all of which rely on a browser to turn them from code into objects on the screen that humans can understand and interact with. However, with a little bit of Python we can sometimes do things that the browser does not make easy: extracting data from the site, accessing site resources more quickly, or saving parts of the site.

I have briefly used web scraping in some of my earlier Python work (Amazon price checker, pollen count) and I’m hoping to now expand that work/knowledge and go a bit more in depth.

To start with, a little bit of web knowledge is needed when thinking about using a web scraper. If you are web savvy, please feel free to skim this section or skip it.

The “GET” process:

Bob has a computer with a web browser. He wants to open the http://www.geektechstuff.com website.
His browser sends a request down through the seven-layer OSI model, which (in very basic terms) encapsulates the request into packets that leave the computer, pass through the local router and then start their hops across the internet.
Eventually the request reaches the platform hosting http://www.geektechstuff.com and says that Bob’s computer wants to see http://www.geektechstuff.com.
http://www.geektechstuff.com then sends back an HTML file.
Bob’s web browser then loads the HTML file, reads any linked CSS files or JavaScript files and asks for any images/animations/videos that should be displayed on the page.
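To make the request in the steps above a little more concrete, here is a minimal sketch of the text a client might send for a GET. The function name build_get_request and the choice of headers are my own assumptions for illustration; a real browser sends many more headers than this.

```python
from urllib.parse import urlsplit


def build_get_request(url):
    """Return the raw text of a minimal HTTP/1.1 GET request for url."""
    parts = urlsplit(url)
    # An empty path means the site's root page
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {parts.netloc}",  # which site on the server we are asking for
        "Connection: close",      # ask the server to close the connection afterwards
        "",                       # a blank line marks the end of the headers
        "",
    ]
    # HTTP header lines are separated by carriage return + line feed
    return "\r\n".join(lines)


print(build_get_request("http://www.geektechstuff.com"))
```

This only builds the text of the request; actually sending it, and all of the packet handling underneath, is left to the operating system and libraries such as urllib.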

Glossary:

HTML – Hyper Text Markup Language

CSS – Cascading Style Sheets

OSI – Open Systems Interconnection

An “HTML” file:

A Hyper Text Markup Language file is the backbone of a website. The file is made up of opening tags (<tag>) and closing tags (</tag>) that contain data. These tags can be, for example, the <title> of the page, <p> paragraph tags, <img> tags saying where images should come from, and formatting tags for <i> italics, <b> bold and <u> underlining.
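As a small sketch of how Python sees those tags, the example below uses the standard library's html.parser module to list every opening tag in a short piece of HTML. The sample HTML and the names TagLister and list_tags are made up for this illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Collect the name of every opening tag that the parser meets."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once for each opening tag, e.g. <p> or <img src="...">
        self.tags.append(tag)


# Hypothetical sample page, just for this sketch
SAMPLE_HTML = (
    "<html><head><title>Demo</title></head>"
    "<body><p>Some <b>bold</b> text</p><img src='pic.png'></body></html>"
)


def list_tags(html_text):
    """Return the opening tag names found in html_text, in document order."""
    parser = TagLister()
    parser.feed(html_text)
    parser.close()
    return parser.tags


print(list_tags(SAMPLE_HTML))
```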

The HTML file also contains links to the style sheets (CSS) and JavaScript files. When writing HTML, elements can be labelled (for example with id and class attributes) so that the website creator can reference them from both the HTML and the CSS. It’s this tagging that I’ve used in past projects when scraping data from a website via Python.
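To show why that tagging is handy for scraping, here is a rough sketch that pulls the text out of the one element carrying a given id, using only the standard library. The sample HTML, the id "price" and the names TextById and text_by_id are assumptions for illustration and are not taken from my earlier projects.

```python
from html.parser import HTMLParser


class TextById(HTMLParser):
    """Collect the text inside the element whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # greater than 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # A nested tag inside the target element
            self.depth += 1
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


# Hypothetical page fragment, just for this sketch
SAMPLE_HTML = (
    '<div><p id="price">Price: <b>9.99</b></p>'
    '<p class="note">Free delivery</p></div>'
)


def text_by_id(html_text, target_id):
    """Return the text of the element with the given id ('' if not found)."""
    parser = TextById(target_id)
    parser.feed(html_text)
    parser.close()
    return "".join(parser.chunks)


print(text_by_id(SAMPLE_HTML, "price"))
```

The depth counter is deliberately simple: tags that never close (such as <br> or <img>) inside the target element would confuse it, so treat this as a sketch rather than a robust scraper.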

Accessing pages via Python:

With the web basics out of the way, it’s time to look at using Python. Python has a built-in module for accessing web pages, called urllib.request.

With just a few lines urllib.request can be used to access a webpage and read it with Python.

from urllib.request import urlopen

def show_html(URL_input):
    # Open the URL and return the raw bytes of the page's HTML
    html = urlopen(URL_input)
    return html.read()

Although this function pulls back the HTML data, the output (a bytes object) is a little hard for humans to read:

[Image: urllib reading HTML in Python]
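Part of the reason the raw output is hard to read is that read() returns bytes rather than text. As a sketch, the hypothetical fetch_text function below decodes the response into a normal string, using the character set declared by the server and assuming UTF-8 when none is given. To keep the example runnable offline it fetches a small data: URL instead of a real site.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url):
    """Fetch url and return its body decoded to a str."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, just like show_html returns
        # Use the charset from the Content-Type header, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# A data: URL carries its own content, so no network access is needed here
sample_url = "data:text/html;charset=utf-8," + quote("<p>Hello, Bob</p>")
print(fetch_text(sample_url))
```

The same function should work with an http:// or https:// address, although a real page may declare a different character set.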

Thankfully another Python library can assist with this, the wonderful Beautiful Soup (currently Beautiful Soup version 4).

[Image: Beautiful Soup in Python]

With Beautiful Soup installed and a few additional lines of Python:

from bs4 import BeautifulSoup

def soupy():
    # Parse the raw HTML using Python's built-in html.parser
    bs = BeautifulSoup(show_html("https://www.geektechstuff.com"), 'html.parser')
    return bs

print(soupy())

It returns a nicer view of the HTML, making it much easier for humans to read.
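As a small sketch of what else the parsed object offers, the example below runs Beautiful Soup on a short, made-up HTML string (so it works without a network connection), then uses prettify() to get an indented view and reads the page title directly. It assumes Beautiful Soup 4 is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, just for this sketch
SAMPLE_HTML = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello</p></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

pretty = soup.prettify()        # one tag per line, indented by nesting depth
page_title = soup.title.string  # the text inside the <title> tag

print(pretty)
print(page_title)
```

Passing the result of show_html() in place of SAMPLE_HTML should give the same kind of output for a live page.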

With the basics now in place, my next post will look at using Python for more web fun 🙂

2 responses to “Web Scraping Part 1 (Python)”

Web Scrapping Part 2 (Python) – Geek Tech Stuff

April 16, 2019

[…] on from Web Scrapping Part 1. In this part I’m going to look at collecting the hyperlinks on a […]

KFLMIAMI420 (@kflmiami420)

August 18, 2019

Can this program be made to scrape images from instagram accounts by looking for a line in the view page source of https://www.instagram.com/alwaysm157/ for the latest image, and save that url to a .jpg file?
