Web Scraping Part 2 (Python)

Python Web Scraper, finds links, pictures and links of links!

Continuing on from Web Scraping Part 1, in this part I’m going to look at collecting the hyperlinks on a page.

I have uploaded my work in Python to my GitHub, so that code examples are easier to follow: https://github.com/geektechdude/Python_Web_Scrapper

from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_links(URL_INPUT):
    # returns all the hyperlink destinations on a page
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    # href=True skips <a> tags that have no href attribute
    for link in bs.find_all('a', href=True):
        list_of_urls.append(link['href'])
    return list_of_urls
This looks for every hyperlink (a tag) on the page at URL_INPUT, adds each link's destination (its href) to a list and then returns that list.
Find_Links in action

I used the same type of function to find pictures on a page:

def find_pictures(URL_INPUT):
    # returns the source (src) of every picture on a page
    list_of_pic_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    # src=True skips <img> tags that have no src attribute
    for img in bs.find_all('img', src=True):
        list_of_pic_urls.append(img['src'])
    return list_of_pic_urls
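The src values that find_pictures returns can be relative paths (for example /images/logo.png) rather than full addresses. Below is a minimal sketch of turning them into full URLs with urljoin from Python's standard library. The helper name make_absolute and the example.com addresses are my own assumptions for illustration and are not part of the original code.

```python
from urllib.parse import urljoin

def make_absolute(page_url, sources):
    # hypothetical helper: resolve each (possibly relative) source
    # against the address of the page it was found on
    return [urljoin(page_url, src) for src in sources]

# made-up example values, standing in for the output of find_pictures
pics = make_absolute(
    "https://example.com/blog/post.html",
    ["/images/logo.png", "photo.jpg", "https://cdn.example.com/banner.gif"],
)
print(pics)
```

Sources that are already full URLs pass through unchanged, so the helper should be safe to run on a mixed list.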
I then got to thinking about how a web crawler / scraper would work on a larger scale, e.g. starting at one page and then spiralling outwards. As the next function follows links to other pages, and so can leave the site it started on, I urge some caution when using it.
def links_of_links(URL_INPUT):
    # prints the links of the links that find_links finds - use carefully!
    lots_of_links = find_links(URL_INPUT)
    print(lots_of_links)
    for item in lots_of_links:
        try:
            print("Visiting", item)
            print("Found links:")
            print(find_links(item))
            print("Finished on", item)
        except Exception as error:
            # e.g. anchor-only links such as "#top" are not URLs that urlopen can open
            print('error:', error)
Python Web Scraper – links of links in action
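To keep a crawler that spirals outwards under control, one option is to limit how deep it goes and how many pages it visits, and to remember which pages it has already seen. The sketch below is one possible way to do that and is not code from my repository: crawl, get_links, max_depth and max_pages are names I have made up for illustration. It takes the link-finding function as an argument, so find_links could be passed in, but here it is run against a small made-up dictionary so that it does not touch the network.

```python
from collections import deque

def crawl(start_url, get_links, max_depth=1, max_pages=20):
    # hypothetical helper: breadth-first crawl that visits each page once
    # get_links is any function that takes a URL and returns a list of links
    visited = set()
    queue = deque([(start_url, 0)])
    found = {}
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            links = get_links(url)
        except Exception as error:
            # a page that fails is skipped so the crawl can continue
            print("Skipping", url, "-", error)
            continue
        found[url] = links
        # only queue the links of this page if we are not yet at the depth limit
        if depth < max_depth:
            for link in links:
                if link not in visited:
                    queue.append((link, depth + 1))
    return found

# made-up "site" used in place of real pages
fake_site = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": [],
    "d": ["e"],
}
result = crawl("a", fake_site.__getitem__, max_depth=1)
print(result)
```

With max_depth=1 this visits the start page and the pages it links to, which is the same reach as links_of_links, but each page is fetched only once and the crawl stops at max_pages.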

When I first created the links_of_links function it crashed. The cause was links that are actually anchor points to parts of the same page (beginning with a hash, #), which urlopen cannot open because they are not full URLs. I’ll need to review the find_links function to see if I can strip them out, but in the meantime I have added a try / except to my links_of_links function so that if a hyperlink fails the function carries on with the next one.
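One possible way to strip the anchor links out is to filter the list that find_links returns before visiting anything. This is only a sketch under my own assumptions: strip_anchor_links is a hypothetical helper name, the example.com addresses are made up, and it uses urljoin, urldefrag and urlparse from the standard library rather than anything already in the code above.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def strip_anchor_links(page_url, links):
    # hypothetical helper: drop anchor-only links, resolve the rest
    # to full URLs and keep only http / https addresses
    cleaned = []
    for href in links:
        if not href or href.startswith("#"):
            continue  # empty value or an anchor on the same page
        # resolve relative links, then remove any trailing #fragment
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # e.g. mailto: links cannot be opened as pages
        if absolute not in cleaned:
            cleaned.append(absolute)
    return cleaned

# made-up example values, standing in for the output of find_links
links = ["#top", "/about", "https://example.com/about#team",
         None, "contact.html", "mailto:someone@example.com"]
print(strip_anchor_links("https://example.com/page", links))
```

The cleaned list could then be looped over in links_of_links in place of the raw list, with the try / except left in as a safety net for pages that fail for other reasons.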