Web scraping can be a very useful way to extract data from websites, but to anyone who has only viewed the web through a browser it can seem like magic.
For example, here is http://www.geektechstuff.com viewed via a web browser:

And here is http://www.geektechstuff.com when the HTML is read by Python:

The majority of people prefer the browser view, and it is easy to see why. Modern sites contain text, images, animations, videos and interactive sections, all of which rely on a browser to turn them from code into objects on the screen that humans can understand and interact with. However, with a little bit of Python we can sometimes do more than the browser can: extract data from the site, access site resources more quickly, or save parts of the site.
I have briefly used web scraping in some of my earlier Python work (Amazon price checker, pollen count) and I’m hoping to now expand that work/knowledge and go a bit more in depth.
To start with, a little bit of web knowledge is needed when thinking about using a web scraper. If you are web savvy, please feel free to skim this section or skip it.
The “GET” process:
- Bob has a computer with a web browser. He wants to open the http://www.geektechstuff.com website.
- His browser sends a request down through the 7-layer OSI model, which (in very basic terms) encapsulates the request into a packet. The packet leaves the computer, goes via the local router and then starts its hops across the internet.
- Eventually the packet reaches the platform hosting http://www.geektechstuff.com and says that Bob’s computer wants to see http://www.geektechstuff.com
- http://www.geektechstuff.com then sends back an HTML file
- Bob’s web browser then loads the HTML file, reads any linked CSS files or JavaScript files and asks for any images/animations/videos that should be displayed on the page.
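As a rough sketch of what the request in the steps above looks like once it is written out as text, the snippet below builds a bare-bones HTTP GET request for a URL. The function name build_get_request is made up for illustration, and a real browser sends many more headers (User-Agent, Accept and so on) than the two shown here.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> bytes:
    """Return a minimal raw HTTP/1.1 GET request for `url` (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path or "/"          # an empty path means the site root
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",       # request line: method, path, protocol version
        f"Host: {parts.netloc}",      # which site on the server is wanted
        "Connection: close",          # ask the server to close the connection after replying
        "",                           # an empty line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


print(build_get_request("http://www.geektechstuff.com").decode("ascii"))
```

Nothing is sent over the network here; the snippet only shows the text that the lower layers would wrap up into packets.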
Glossary:
HTML – Hyper Text Markup Language
CSS – Cascading Style Sheets
OSI – Open Systems Interconnection
An “HTML” file:
A Hyper Text Markup Language file is the backbone of a website. The file is made up of opening tags <> and closing tags </> that contain data. For example, these tags can be the <title> of the page, <p> paragraph tags, <img> tags saying where images should come from, and formatting tags for <i> italics, <b> bold and <u> underlining.
The HTML file also contains links to the style sheets (CSS) and JavaScript files. When writing HTML, the website creator can label elements with attributes such as id and class so that the same elements can be referenced from both the HTML and the CSS. It’s this labelling that I’ve used in past projects when scraping data from a website via Python.
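To make the idea of tags and their labels a bit more concrete, here is a minimal sketch using Python's built-in html.parser module. The sample HTML string and the TagCollector class name are invented for illustration; the sketch simply lists each opening tag together with any attributes (such as id or class) attached to it.

```python
from html.parser import HTMLParser

# A small, made-up HTML document used only for this example
SAMPLE_HTML = """<html>
<head><title>Example page</title></head>
<body>
<p id="intro" class="lead">Hello <b>world</b></p>
<img src="logo.png">
</body>
</html>"""


class TagCollector(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.tags.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(SAMPLE_HTML)
for tag, attrs in collector.tags:
    print(tag, attrs)
```

The id and class values printed for the <p> tag are the kind of labels a scraper can later use to pick out one specific element on a page.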
Accessing pages via Python:
With the web basics out of the way, it’s time to look at using Python. Python has a built-in module for accessing web pages, called urllib.request.

With just a few lines urllib.request can be used to access a webpage and read it with Python.
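A minimal sketch of those few lines might look like the following. The function name fetch_html, the 10 second timeout and the fallback to UTF-8 when the server does not state an encoding are my own assumptions, not part of the original listing.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download `url` and return the response body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the body arrives as bytes
        # Use the encoding named in the Content-Type header, else assume UTF-8
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# Example call (needs a network connection):
# html = fetch_html("http://www.geektechstuff.com")
# print(html[:500])
```

The example call is left commented out so that the snippet can be loaded without touching the network; uncomment it to print the first 500 characters of the page.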
Although this pulls back the HTML data, the output is a little hard for humans to read:

Thankfully another Python library can assist with this, the wonderful Beautiful Soup (currently Beautiful Soup version 4).

With Beautiful Soup installed and a few additional lines of Python:
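As a sketch of what those additional lines could look like (this is not the original listing), the example below parses a small made-up HTML string rather than a live page. It assumes Beautiful Soup 4 has been installed, for example with pip install beautifulsoup4, and uses Python's built-in "html.parser" backend so that no further packages are needed.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document standing in for a downloaded page
SAMPLE_HTML = (
    "<html><head><title>Example page</title></head>"
    '<body><p id="intro">Hello <b>world</b></p>'
    "<p>Second paragraph</p></body></html>"
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# prettify() re-indents the HTML so it is easier for humans to read
print(soup.prettify())

# Individual elements can be picked out by tag name or by id
print(soup.title.string)
print(soup.find(id="intro").get_text())
print([p.get_text() for p in soup.find_all("p")])
```

In a real scraper the sample string would be replaced by the HTML read from the site with urllib.request.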

With the basics now in place, in my next post I will look at using Python for more web fun 🙂
Can this program be made to scrape images from Instagram accounts by looking for a line in the page source of https://www.instagram.com/alwaysm157/ that points to the latest image, and saving that URL to a .jpg file?