Web Scraping: Interacting With Web Pages (Python)

Using Python to find input classes

So far I have been scraping my website for a list of the divs, links and pictures that it contains. However, I also want to interact with my site. Back in part 1 I briefly wrote about the GET command that is used when asking for data from a web page. Its counterpart is the POST command, which is used to send information to a web page.
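
A minimal sketch of the difference, assuming the Requests library is installed. It builds a GET and a POST carrying the same data but never sends them, so the only thing to look at is where the data ends up. The example.com URL and the "s" parameter name are placeholders chosen for illustration, not taken from any real site.

```python
import requests

# Hypothetical endpoint used only for illustration; nothing is sent over the network.
URL = "https://example.com/search"

# GET: the data travels in the URL as a query string.
get_request = requests.Request("GET", URL, params={"s": "python"}).prepare()

# POST: the data travels in the request body instead.
post_request = requests.Request("POST", URL, data={"s": "python"}).prepare()

print(get_request.method, get_request.url, get_request.body)
print(post_request.method, post_request.url, post_request.body)
```

With GET the text is appended to the URL and the body is empty; with POST the URL is unchanged and the text is carried in the body.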

On www.geektechstuff.com there is a search box that takes text from a visitor and searches the site for whatever was entered.

The geektechstuff.com “Search” box

There are a few ways to identify the search box as an input:

  • Manually visit the website and try entering text into the box
  • Manually visit the website, open a web browser’s developer tools and look at the search box’s values:
Looking at the geektechstuff.com Search box’s coding
  • Use Python and BeautifulSoup to search the website for input classes:
Using Python to find input classes

The Python option takes a few lines of programming to complete:

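from bs4 import BeautifulSoup

def find_input(URL_INPUT):
    # Finds the input tags in the HTML.
    # show_html() is defined elsewhere (not shown here) and is expected to return the page's HTML.
    bs = BeautifulSoup(show_html(URL_INPUT), 'html.parser')
    search = bs.find_all('input')
    for result in search:
        # Print the whole tag, then just its value attribute (None if it has none)
        print(result, '\n')
        input_value = result.get('value')
        print(input_value, '\n')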

I have asked Python/BeautifulSoup to find all mentions of ‘input’ and return both the full tag and just its value. I think doing it this way helps when looking for certain inputs, especially if the website designer has been kind enough to appropriately name the input values (e.g. Search).
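
As a small illustration of why named inputs help, here is a sketch that runs the same kind of search against a hard-coded HTML string and reports each input’s type, name and value. The HTML is a made-up example of a typical search form, not a copy of the real page, and describe_inputs() is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, loosely modelled on a typical search form; not copied from the real site.
SAMPLE_HTML = """
<form role="search" method="get" action="/">
  <input type="search" name="s" value="" placeholder="Search">
  <input type="submit" value="Search">
</form>
"""

def describe_inputs(html):
    """Return one dict per <input> tag with its type, name and value."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        {"type": tag.get("type"), "name": tag.get("name"), "value": tag.get("value")}
        for tag in soup.find_all("input")
    ]

for field in describe_inputs(SAMPLE_HTML):
    print(field)
```

Listing the type and name alongside the value makes it easier to tell the text box apart from the submit button, which here is the only input carrying the value "Search".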

At this point I took a side step and started playing with the Python Requests library and interacting with Wikipedia. The function outlined below fails but I think it is important to understand why it fails. Please do not carry out the wikipedia_logon() function.

Requests allows data to be sent to a web page (POST) as well as read from a web page (GET). To look at using POST I have hit up Wikipedia.

Wikipedia logon page

I’m using a logon page on Wikipedia (https://en.m.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page) together with the developer tools within my web browser, which let me view the page headers and see what data is passed when I attempt to log in.

A correct login passed the following Form Data to Wikipedia.

Form data being passed to Wikipedia.

Knowing what data the form is expected to POST to the page allows for that data to be put into my Python program.

import requests

def wikipedia_logon():
    # Form fields seen in the browser's developer tools during a real login
    param = {
        "wpName": "Geektechstuff",
        "wpPassword": "INSERT WIKIPEDIA PASSWORD",
        "wpRemember": "1",
        "wploginattempt": "Log in",
        "authAction": "login"
    }
    # Present the request as coming from Safari on a Mac
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/604.5.6 (KHTML, like Gecko) Version/11.0.3 Safari/604.5.6'}
    r = requests.post("https://en.m.wikipedia.org/w/index.php?title=Special:UserLogin&returnto=Main+Page", data=param, headers=headers)
    print("Cookie says: \n", r.cookies.get_dict())
    print("Response code:", r.status_code)
    print(r.text)

 

wikipedia_logon function

I added some headers to make my Python program “look” more like a regular browser session by including the user-agent details of the Safari browser running on an Apple Mac.
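
To see what the header changes, here is a short sketch that compares the User-Agent Requests identifies itself with by default against a custom one. It assumes the Requests library is installed; the example.com URL is a placeholder and the request is only prepared, never sent.

```python
import requests

# What Requests identifies itself as when no User-Agent is supplied.
default_agent = requests.utils.default_user_agent()

# Hypothetical URL; the request is prepared but never sent.
custom_headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3)"}
prepared = requests.Request("GET", "https://example.com/", headers=custom_headers).prepare()

print("Default:", default_agent)
print("Custom: ", prepared.headers["User-Agent"])
```

The default value starts with "python-requests/", which makes it obvious to a server that the visitor is a script rather than a browser.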

The wikipedia_logon() function currently prints the cookie details, a response code (200, which only means that the HTTP request itself succeeded) and the HTML content of the page.

At this point I found a problem. My program is sending the details to the Wikipedia logon page but it does not log in. The reason for this lies within a hidden field within the page’s HTML:

Hidden LoginToken

The wpLoginToken is hidden, its value changes and it is one of the parameters that the Wikipedia logon is expecting, so a login attempt that does not include a valid token is not accepted. You may ask, “Why would anyone have this on their site?” and the answer is pretty much to stop bots and the hijacking of sessions (tokens like this are commonly used as a defence against cross-site request forgery). Wikipedia does have some cool APIs for bots to interact with, so I can see why they would not want bots interacting with their logon pages.
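
For illustration only, here is a sketch of how hidden fields like this can be read out of a page with BeautifulSoup and merged into the form data. It works on a hard-coded, made-up snippet of HTML rather than the live Wikipedia page; the token value is invented and hidden_fields() is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down login form; the real markup is longer and the token value is invented.
SAMPLE_LOGIN_HTML = """
<form method="post">
  <input name="wpName" type="text">
  <input name="wpPassword" type="password">
  <input name="wpLoginToken" type="hidden" value="example-token-123">
  <input name="authAction" type="hidden" value="login">
</form>
"""

def hidden_fields(html):
    """Collect the name/value pairs of every hidden <input> that has a name."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        tag["name"]: tag.get("value", "")
        for tag in soup.find_all("input", attrs={"type": "hidden"})
        if tag.get("name")
    }

# Start with the visible fields, then add whatever hidden fields the page supplied.
payload = {"wpName": "Geektechstuff", "wpPassword": "INSERT WIKIPEDIA PASSWORD"}
payload.update(hidden_fields(SAMPLE_LOGIN_HTML))
print(payload)
```

Against a live site the page would presumably have to be fetched and the form posted within the same requests.Session, so that the cookies and the token belong together. This has not been tested against Wikipedia, and its bot APIs remain the better route.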

I could create some basic HTML forms to test my program but to me that defeats the purpose as it is building a test to fit the answer. I do think that this little “hidden” input loops very nicely back to my introduction paragraphs on this page though, which is why I have left everything in place (showing the slightly wacky path I have taken with web scraping so far).