Python Web Scraper: finds links, pictures and links of links!
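from urllib.request import urlopen

from bs4 import BeautifulSoup


def find_links(URL_INPUT):
    # Returns all the hyperlink destinations on a page.
    list_of_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    # href=True skips <a> tags that have no href attribute (avoids a KeyError).
    for link in bs.find_all('a', href=True):
        list_of_urls.append(link['href'])
    return list_of_urls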
This fetches the page at URL_INPUT, finds every hyperlink (<a> tag) on it, and adds each link's destination (its href) to a list, which it then returns.
find_links in action
I used the same type of function to find pictures on a page:
def find_pictures(URL_INPUT):
    # Returns the source (src) of every picture on a page.
    list_of_pic_urls = []
    html = urlopen(URL_INPUT)
    bs = BeautifulSoup(html, 'html.parser')
    # src=True skips <img> tags that have no src attribute (avoids a KeyError).
    for img in bs.find_all('img', src=True):
        list_of_pic_urls.append(img['src'])
    return list_of_pic_urls
I then got to thinking about how a web crawler / scraper would work on a larger scale, e.g. starting at one page and then spiralling outwards. The next function follows links away from the page it starts on, so it can end up requesting pages on other sites. I urge some caution when using it.
def links_of_links(URL_INPUT):
    # Prints the links of the links that find_links finds - use carefully!
    lots_of_links = find_links(URL_INPUT)
    print(lots_of_links)
    for item in lots_of_links:
        try:
            print("Visiting", item)
            print("Found links:")
            print(find_links(item))
            print("Finished on", item)
        except Exception as err:
            # Anchors (#...) and relative links are not valid URLs for urlopen.
            print('error', err)
Python Web Scraper – links of links in action
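One way to keep a crawl like this from spiralling too far is to stay on the starting site, remember what has already been visited, and stop after a fixed number of pages. The following is only a minimal sketch of that idea, not part of the functions above: crawl_same_site, get_links and max_pages are hypothetical names, and get_links is assumed to be any function that takes a URL and returns a list of hrefs, such as find_links.

```python
from collections import deque
from urllib.parse import urldefrag, urljoin, urlparse


def crawl_same_site(start_url, get_links, max_pages=10):
    """Breadth-first crawl that stays on start_url's host and stops at max_pages."""
    host = urlparse(start_url).netloc
    visited = []                # pages fetched successfully, in visit order
    queue = deque([start_url])
    seen = {start_url}          # everything queued so far, to avoid repeats
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        try:
            hrefs = get_links(page)
        except Exception as err:  # one bad page should not end the crawl
            print("Skipping", page, err)
            continue
        visited.append(page)
        for href in hrefs:
            # Resolve relative links and drop #anchors before comparing.
            url = urldefrag(urljoin(page, href)).url
            parsed = urlparse(url)
            if (parsed.scheme in ("http", "https")
                    and parsed.netloc == host
                    and url not in seen):
                seen.add(url)
                queue.append(url)
    return visited
```

Calling crawl_same_site with find_links as the get_links argument would make real requests, so a small max_pages is a sensible starting point. Passing the link-fetching function in as a parameter also means the crawl logic can be tried out against a made-up set of pages first.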
I noticed an issue when I first created the links_of_links function: it crashed. The cause was links that are really anchor points to parts of the same page (beginning with a hash, #), which urlopen cannot open as URLs. I'll need to review the find_links function to see if I can strip them out, but in the meantime I have added a try / except to my links_of_links function so that if a hyperlink fails, the function carries on with the next one.
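As a rough idea of what stripping them out could look like, here is a small sketch that cleans a list of hrefs after find_links has returned it. It is an assumption about one possible approach rather than a finished fix: clean_links and its parameters are hypothetical names, and it relies only on the standard library's urllib.parse.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def clean_links(page_url, hrefs):
    """Return absolute http(s) URLs from hrefs, without anchors or duplicates."""
    cleaned = []
    for href in hrefs:
        if not href or href.startswith("#"):
            continue  # missing href or a same-page anchor
        # Resolve relative links against the page, then drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel: and similar
        if absolute not in cleaned:
            cleaned.append(absolute)
    return cleaned
```

It could be used as clean_links(URL_INPUT, find_links(URL_INPUT)) before looping over the results. Resolving relative links with urljoin also turns hrefs such as /contact into full URLs that urlopen can fetch.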