Web Scraping Part 2 (Python)

Continuing on from Web Scraping Part 1, in this part I’m going to look at collecting the hyperlinks on a page.

I have uploaded my work in Python to my GitHub, so that code examples are easier to follow: https://github.com/geektechdude/Python_Web_Scrapper

Python Web Scraper – finds links, pictures and links of links!
from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_links(URL_INPUT):
    # returns all the hyperlink destinations on a page
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=True):  # href=True skips <a> tags with no href
        list_of_urls.append(link.attrs['href'])
    return list_of_urls
This looks for the anchor (a) tags on a page (URL_INPUT) and then adds each href destination to a list.
(Screenshot: find_links in action)
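
As a quick usage sketch (the URL below is just a placeholder, any page will do), the returned list can be looped over straight away:

links = find_links('https://example.com')  # placeholder URL
for url in links:
    print(url)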

I used the same type of function to find pictures on a page:

def find_pictures(URL_INPUT):
    # returns the source (src) of every picture on a page
    list_of_pic_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    for img in bs.find_all('img', src=True):  # src=True skips <img> tags with no src
        list_of_pic_urls.append(img.attrs['src'])
    return list_of_pic_urls
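
It is called the same way as find_links. One thing worth noting is that src values are often relative paths rather than full URLs, so urljoin from the standard library can resolve them against the page they came from (again, the URL here is just a placeholder):

from urllib.parse import urljoin

page = 'https://example.com'  # placeholder URL
for src in find_pictures(page):
    print(urljoin(page, src))  # turns a relative src into an absolute URL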
I then got to thinking about how a web crawler / scraper would work on a larger scale, e.g. starting at one page and then spiralling outwards. As this function can escape the web page it is being run on, I urge some caution when using it.
def links_of_links(URL_INPUT):
    # prints the links of the links that find_links finds - use carefully!
    lots_of_links = find_links(URL_INPUT)
    print(lots_of_links)
    for item in lots_of_links:
        try:
            print('Visiting', item)
            print('Found links:')
            print(find_links(item))
            print('Finished on', item)
        except Exception as error:  # skip any link that cannot be opened
            print('Error visiting', item, '-', error)
(Screenshot: links_of_links in action)

I noticed an issue when I first created the links_of_links function in that it crashed, and the reason for that is links that are actually anchor points to parts of the same page (beginning with a hash, #). I’ll need to review the find_links function to see if I can strip them out, but in the meantime I have added a try / except to my links_of_links function so that if a hyperlink fails the function continues on.
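
As a sketch of what that review could look like (this version isn’t in the GitHub repo yet), one option is to check each href before it goes into the list and skip anything starting with #:

def find_links(URL_INPUT):
    # returns the hyperlink destinations on a page, skipping same-page anchors
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=True):
        href = link.attrs['href']
        if not href.startswith('#'):  # '#...' points within the page, not to another page
            list_of_urls.append(href)
    return list_of_urls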
