python script to click URLs and capture destination

i have a long (25k) list of URLs. each one, when loaded, opens a page that has links on it, and when a link is clicked it redirects you to a different URL.

i would like a script that can ingest these URLs, click the first a href link it finds, and then capture the URL of the subsequent page.

some of these initial URLs when loaded will not have any href tags on them (they'll be blank) and i would like that captured too.

an example of the HTML hierarchy where the link resides:
html
body
div
div
a tag

  1. take list of URLs
  2. open first URL
  3. click first a href link
  4. wait for page to load
  5. capture URL
  6. save URL to file
  7. open next URL
  8. repeat!
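
Steps 3 and 5 hinge on finding the first link on a page and turning it into a full URL. Here is a minimal sketch of that part, assuming Python 3's standard library and a made-up sample page. `FirstHref`, `first_link`, and the example.com URLs are illustrative names and not part of the original request. It parses the HTML rather than driving a real browser, so links created by JavaScript would not be seen.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FirstHref(HTMLParser):
    """Remember the href of the first <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        # Only look at <a> tags, and stop looking once a link is found.
        if self.href is None and tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.href = value
                    break


def first_link(html, base_url):
    """Return the first link as an absolute URL, or None for a blank page."""
    parser = FirstHref()
    parser.feed(html)
    if parser.href is None:
        return None
    # Relative links such as "/go?id=1" are resolved against the page URL.
    return urljoin(base_url, parser.href)


# Hypothetical page following the html > body > div > div > a hierarchy.
sample = "<html><body><div><div><a href='/go?id=1'>x</a></div></div></body></html>"
print(first_link(sample, "http://example.com/page"))
print(first_link("<html><body></body></html>", "http://example.com/page"))
```

A page with no link yields `None`, which can be written out as a blank field.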
awarded to Enjeru


2 Solutions


Can you provide an example URL from your list?

<a href='http://google.com' target='_blank'> <div class='cat'> <div class='imageholder' > <img src='http://www.smurkcreative.com/wp-content/uploads/2012/10/google-logo-pattern.jpg' /> <span class='dogContent'> </span> </div> </div> </a> </div> </div></div><div style='position: absolute; left: 0px; top: 0px; visibility: hidden;'>
awkw 3 years ago
Winning solution

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from HTMLParser import HTMLParser
from urllib2 import urlopen, Request
from csv import DictWriter
from time import sleep

class FirstLink(HTMLParser):

    def __init__(self):
        HTMLParser.__init__(self)
        self.done = False
        self.data = None

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.done = True
                    self.data = value
                    break

data_file = 'list.txt'
output_file = 'out.csv'
sleep_time = 5  # seconds
timeout = 10  # seconds
user_agent = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'}

with open(data_file, 'r') as input_data, open(output_file, 'wb') as output_data:

    dw = DictWriter(output_data, fieldnames=['original', 'first', 'redirect', 'error'])
    dw.writeheader()

    for i, line in enumerate(input_data):
        url = line.strip()
        parser = FirstLink()
        final = err = None
        req = Request(url, headers=user_agent)

        try:
            print "Working on line {} with URL {}".format(i + 1, url)
            html = urlopen(req, timeout=timeout).read()

            print "Parsing HTML"
            parser.feed(html)

            if parser.done:
                print "Getting redirect of {}".format(parser.data)
                redirect = Request(parser.data, headers=user_agent)
                final = urlopen(redirect, timeout=timeout).url
        except Exception as e:
            print "Error occurred {}".format(e)
            err = e

        dw.writerow(dict(original=url, first=parser.data, redirect=final, error=err))
        sleep(sleep_time)
can you output as: original link, output link and leave blank if blank?
awkw 3 years ago
So the original link is the URL in the file. Then the "first" link is the link in the href tag (could be blank). Then the final link would be what the first link redirects to. Would this work (first and redirect would be blank as needed): original,first,redirect
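
To illustrate the layout, here is a rough sketch of how blank values end up as empty fields. It assumes Python 3's `csv` module and writes to an in-memory buffer instead of a file; `rows_to_csv` and the example URLs are made up for the illustration.

```python
import csv
import io

FIELDS = ["original", "first", "redirect"]


def rows_to_csv(rows):
    """Render rows as CSV text; None values come out as empty fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical results: one page with a link, one blank page.
rows = [
    {"original": "http://example.com/a",
     "first": "http://example.com/go",
     "redirect": "http://example.org/"},
    {"original": "http://example.com/b", "first": None, "redirect": None},
]
print(rows_to_csv(rows))
```

The second row prints as `http://example.com/b,,` so a blank page still gets a line in the output.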
Enjeru 3 years ago
yes, perfect. is it also possible to use a google chrome user agent AND have some kind of throttle (one request per 5 seconds) for this? thanks! i know this is beyond the remit of the task so i promise that is it!
awkw 3 years ago
Let me know if that works for you - the parameters in the middle can be changed as needed.
Enjeru 3 years ago
i just ran this and it stops on an HTTP error. is there any way it can move on to the next line?
awkw 3 years ago
Added a column to track errors, and program will continue as desired.
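
The idea, as a simplified sketch in Python 3: wrap the work for each URL in a try/except and record the error text instead of letting it stop the loop. `process_all` and `fake_resolve` are hypothetical names, and `fake_resolve` stands in for the real network call so the example runs offline.

```python
def process_all(urls, resolve):
    """Call resolve(url) for each URL; on failure, log the error and move on."""
    results = []
    for url in urls:
        final, err = None, ""
        try:
            final = resolve(url)
        except Exception as exc:
            # Broad on purpose: any failure should skip to the next URL.
            err = str(exc)
        results.append({"original": url, "redirect": final, "error": err})
    return results


def fake_resolve(url):
    # Stand-in for fetching the page and following its first link.
    if "bad" in url:
        raise IOError("HTTP Error 404: Not Found")
    return url + "/landing"


for row in process_all(["http://example.com/ok", "http://example.com/bad"],
                       fake_resolve):
    print(row)
```

Every input URL produces exactly one row, so the output stays aligned with the input list even when some requests fail.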
Enjeru 3 years ago
thanks! i'll accept now.
awkw 3 years ago
Glad it works, I made a minor change to track the error's message in case you need to debug things.
Enjeru 3 years ago
no problem if too late but sometimes it hangs and doesn't end or move on. is it possible to output what the script is doing to console?
awkw 3 years ago
I added a timeout (can't think of why else it would hang). The print statements should help you debug any further issues.
Enjeru 3 years ago