Python Web Scraper, finds links, pictures and links of links!
from urllib.request import urlopen
from bs4 import BeautifulSoup

def find_links(URL_INPUT):
    # returns all the hyperlink destinations on a page
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a'):
        list_of_urls.append(link.attrs['href'])
    return list_of_urls
This looks for the hyperlinks on a page (URL_INPUT) and then adds the links to a list.
find_links in action
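If you want to try it out yourself, a call like the one below is all it takes (the URL here is just a placeholder for whichever page you want to scan):

page_links = find_links("https://www.example.com")  # placeholder URL
print(page_links)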
I used the same type of function to find pictures on a page:
def find_pictures(URL_INPUT):
    # returns all the pictures on a page and lists each picture's source
    list_of_pic_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    for img in bs.find_all('img'):
        list_of_pic_urls.append(img.attrs['src'])
    return list_of_pic_urls
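Calling it works the same way (again, the URL is only a placeholder):

picture_sources = find_pictures("https://www.example.com")  # placeholder URL
for src in picture_sources:
    print(src)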
I then got to thinking about how a web crawler / scraper would work on a larger scale, e.g. starting at one page and then spiralling outwards. As this function can escape the web page it is being run on, I urge some caution when using it.
def links_of_links(URL_INPUT):
    # prints the links of the links that find_links finds - use carefully!
    lots_of_links = find_links(URL_INPUT)
    print(lots_of_links)
    for item in lots_of_links:
        try:
            print("Visiting", item)
            print("Found links:")
            print(find_links(item))
            print("Finished on", item)
        except Exception:
            print("error")
Python Web Scraper – links of links in action
I noticed an issue when I first created the links_of_links function: it crashed, and the reason was links that are actually anchor points to parts of the same page (those beginning with a #). I'll need to review the find_links function to see if I can strip them out, but in the meantime I have added a try / except to my links_of_links function so that if a hyperlink fails, the function continues on.
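As a rough idea of how that clean-up could look, below is a possible revised find_links that simply skips any href starting with # (this is just a sketch of the approach rather than something I have fully tested):

def find_links(URL_INPUT):
    # returns all the hyperlink destinations on a page, skipping same-page anchors
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a'):
        href = link.attrs.get('href')
        if href is None or href.startswith('#'):
            continue  # ignore anchor points like "#top" and tags with no href
        list_of_urls.append(href)
    return list_of_urls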