Web Scraping Part 2 (Python)

Continuing on from Web Scraping Part 1, in this part I’m going to look at collecting the hyperlinks on a page.

I have uploaded my work in Python to my GitHub, so that code examples are easier to follow: https://github.com/geektechdude/Python_Web_Scrapper

Figure: Python Web Scraper, which finds links, pictures and links of links.

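from urllib.request import urlopen

from bs4 import BeautifulSoup


def find_links(URL_INPUT):
    # returns all the hyperlink destinations on a page
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    # href=True skips <a> tags that have no href attribute
    for link in bs.find_all('a', href=True):
        list_of_urls.append(link.attrs['href'])
    return list_of_urls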

This looks for the hyperlinks on a page (URL_INPUT) and then adds the links to a list.

Figure: find_links in action.

I used the same type of function to find pictures on a page:

def find_pictures(URL_INPUT):
    # returns the source (src) of every picture on a page
    list_of_pic_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    # src=True skips <img> tags that have no src attribute
    for img in bs.find_all('img', src=True):
        list_of_pic_urls.append(img.attrs['src'])
    return list_of_pic_urls
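Since find_links and find_pictures differ only in the tag and attribute they look for, the pattern can be folded into one helper. The sketch below is an assumption on my part rather than code from the repository: collect_attribute and AttributeCollector are hypothetical names, and it swaps BeautifulSoup for the standard library's html.parser and takes the HTML as a string, so it can be tried without a network connection or any extra packages.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects one attribute from every occurrence of one tag (hypothetical helper)."""

    def __init__(self, tag, attribute):
        super().__init__()
        self.tag = tag
        self.attribute = attribute
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names arrive lower-cased
        if tag == self.tag:
            for name, value in attrs:
                if name == self.attribute and value is not None:
                    self.values.append(value)


def collect_attribute(html_text, tag, attribute):
    # Feed the whole document through the parser and return what it collected
    parser = AttributeCollector(tag, attribute)
    parser.feed(html_text)
    parser.close()
    return parser.values


SAMPLE_HTML = (
    '<p><a href="/about">About</a> <a name="top">no href here</a> '
    '<img src="logo.png" alt="logo"> <img src="pic.jpg"/></p>'
)

print(collect_attribute(SAMPLE_HTML, "a", "href"))
print(collect_attribute(SAMPLE_HTML, "img", "src"))
```

Tags that lack the requested attribute, such as the second anchor in SAMPLE_HTML, are simply skipped.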

I then got to thinking about how a web crawler / scraper would work on a larger scale, e.g. starting at one page and then spiralling outwards. As the function below follows links away from the page it starts on, and on to other sites, I urge some caution when using it.

def links_of_links(URL_INPUT):
    # prints the links of the links that find_links finds - use carefully!
    lots_of_links = find_links(URL_INPUT)
    print(lots_of_links)
    for item in lots_of_links:
        try:
            print("Visiting", item)
            print("Found links:")
            print(find_links(item))
            print("Finished on", item)
        except Exception:
            # a link that cannot be opened should not stop the loop
            print('error')

Figure: Python Web Scraper, links_of_links in action.
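One way to keep a crawler from spiralling too far is to cap both how many hops it takes from the start page and how many pages it visits in total. The following is a minimal sketch of that idea, not the code in the repository: crawl, get_links, max_depth and max_pages are names I have made up for illustration, and get_links is any function that takes a URL and returns a list of links, so find_links could be passed in.

```python
from collections import deque


def crawl(start_url, get_links, max_depth=1, max_pages=20):
    """Breadth-first crawl from start_url; returns the pages in visit order."""
    seen = {start_url}                # everything queued so far, to avoid loops
    queue = deque([(start_url, 0)])   # (url, hops from the start page)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue                  # deep enough: do not follow this page's links
        try:
            links = get_links(url)
        except Exception as err:
            # a page that cannot be fetched is skipped, not fatal
            print("Skipping", url, "because of", err)
            continue
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited


# A made-up link graph stands in for real pages so the sketch runs offline
FAKE_SITE = {"a": ["b", "c"], "b": ["d"], "c": ["a", "d"], "d": ["e"]}


def fake_get_links(url):
    return FAKE_SITE.get(url, [])


print(crawl("a", fake_get_links, max_depth=1))
```

With max_depth=1 the crawl stops at the pages linked directly from the start page, which roughly matches what links_of_links does. The seen set also stops the same page being visited twice when pages link back to each other.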

When I first created the links_of_links function it crashed. The cause was links that are really anchor points to parts of the same page (beginning with a hash, #), which urlopen cannot open because they are not full URLs. I’ll need to review the find_links function to see if I can strip them out, but in the meantime I have added a try / except to my links_of_links function so that if a hyperlink fails the function carries on with the next one.
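As a possible starting point for stripping those anchor links out, here is a small sketch using urllib.parse from the standard library. clean_links is a hypothetical helper that I am assuming for illustration (it is not in the repository): it takes the URL of the page and the raw href values, turns relative links into full URLs, removes the #fragment part, and drops anything that just points back at the same page or is not an http(s) link.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def clean_links(page_url, hrefs):
    """Return absolute http(s) URLs from hrefs, without same-page anchors."""
    page_without_fragment = urldefrag(page_url).url
    cleaned = []
    for href in hrefs:
        # Resolve relative links such as "/about" against the page URL,
        # then drop any "#fragment" part
        absolute = urldefrag(urljoin(page_url, href)).url
        if absolute == page_without_fragment:
            continue  # "#top" and similar only point back at this page
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and so on
        if absolute not in cleaned:
            cleaned.append(absolute)  # keep first occurrence, preserve order
    return cleaned


raw_hrefs = ["#top", "/about", "post-1#comments",
             "mailto:me@example.com", "https://example.org/x", "/about"]
print(clean_links("https://example.com/blog/", raw_hrefs))
```

Because the relative links come back as full URLs, the output of a helper like this could be handed to urlopen directly, which the raw href values cannot.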
