Python - Clean up a domain list to build URLs and find the sitemap

I have a list of domains for which I need to locate the sitemap.xml file. To make this list usable, we have to figure out a few things for each domain:

1- http vs https

2- www vs non-www

3- Find the sitemap. Start by guessing domain.com/sitemap.xml. If that fails, check the robots.txt file to see whether the sitemap is disclosed there in a line of the form Sitemap: [absoluteURL]. If neither works, we can fail.
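As a rough illustration of points 1 and 2, the protocol and www combinations can be generated up front and then tried in order. This is only a sketch: the function name candidate_base_urls and the choice to try https before http are assumptions, not part of the request.

```python
def candidate_base_urls(raw):
    """Return the base URLs to try for one input row (hypothetical helper)."""
    # Strip whitespace, any scheme, and any path from the input value
    host = raw.strip().lower().split("//")[-1].split("/")[0]
    # Work from the bare host so both www and non-www variants are covered
    bare = host[4:] if host.startswith("www.") else host
    hosts = [bare, "www." + bare]
    # Assumption: prefer https over http
    return [f"{scheme}://{h}" for scheme in ("https", "http") for h in hosts]
```

Each returned base URL can then have /sitemap.xml or /robots.txt appended when probing the site.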

We would like the Python script to take a CSV file with one domain per row and output the URLs of the sitemaps with the proper protocol and use of www. If the script can't figure things out, it should fail with a reason:

Error #1 protocol failed

Error #2 www vs non-www failed

Error #3 can't find sitemap
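One possible way to map probe results onto these three errors is sketched below. The post does not define exactly when each error applies, so the rules here are assumptions: Error #1 means no variant accepted a connection at all, Error #2 means something answered but no host variant returned a successful status, and Error #3 means the site is up but no sitemap turned up. The names Failure and classify_failure are hypothetical.

```python
from enum import Enum


class Failure(Enum):
    PROTOCOL = 1    # Error #1: no variant accepted a connection (assumed rule)
    WWW = 2         # Error #2: connected, but no host variant served a page
    NO_SITEMAP = 3  # Error #3: site is up, sitemap was not found


def classify_failure(probes):
    """Pick a failure reason after the sitemap search has come up empty.

    probes maps each base URL tried to its HTTP status code, or to None
    when the connection itself failed.
    """
    statuses = [s for s in probes.values() if s is not None]
    if not statuses:
        return Failure.PROTOCOL
    if not any(200 <= s < 400 for s in statuses):
        return Failure.WWW
    return Failure.NO_SITEMAP
```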

The list of domains will be in the thousands, so the script must not crash partway through a long run.
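For a run over thousands of domains, one approach is to handle errors per row, write each result as soon as it is known, and skip rows already present in the output so an interrupted run can be restarted. The sketch below assumes a caller-supplied resolve(domain) function that returns a sitemap URL or raises with a reason; run_batch, resolve, and the three-column output layout are all hypothetical.

```python
import csv
import os


def run_batch(input_path, output_path, resolve):
    """Process one domain per input row, appending (domain, url, error) rows."""
    # Domains already written by an earlier, interrupted run are skipped
    done = set()
    if os.path.exists(output_path):
        with open(output_path, newline="") as f:
            done = {row[0] for row in csv.reader(f) if row}
    with open(input_path, newline="") as src, \
            open(output_path, "a", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            domain = row[0].strip() if row else ""
            if not domain or domain in done:
                continue
            try:
                result = [domain, resolve(domain), ""]
            except Exception as err:  # one bad domain must not stop the run
                result = [domain, "", str(err)]
            writer.writerow(result)
            dst.flush()  # keep progress on disk if the process is killed
```

Running it a second time with the same output file only processes domains that have no row yet.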

awarded to meo
Tags
python3


2 Solutions


Here is my solution.

import csv
from urllib.request import urlopen


def get_sitemap(domain):
    # Try https before http, and the www host first when the domain lacks it
    url_list = ['https://' + domain, 'http://' + domain]
    if not domain.startswith('www.'):
        url_list.insert(0, 'https://www.' + domain)
        url_list.insert(1, 'http://www.' + domain)
    # First guess: /sitemap.xml at each candidate base URL
    for url in url_list:
        try:
            with urlopen(url + '/sitemap.xml', timeout=10) as sitemap:
                return sitemap.geturl()
        except Exception:
            pass
    # Fallback: look for a "Sitemap:" line in robots.txt
    for url in url_list:
        try:
            with urlopen(url + '/robots.txt', timeout=10) as response:
                r = response.read().decode('utf8', errors='replace')
            # Iterate over lines; iterating over the string yields characters
            for line in r.splitlines():
                if line.lower().startswith('sitemap:'):
                    return line.split(':', 1)[1].strip()
        except Exception:
            pass
    return None


def main(file_name):
    sitemaps = {}
    with open(file_name, newline='') as f:
        for row in csv.reader(f):
            try:
                domain = row[0]
                sitemaps[domain] = get_sitemap(domain)
            except Exception:
                pass
    return sitemaps
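
The robots.txt fallback can also be split off as a pure parsing step, which makes it easy to check without touching the network. The sketch below is one way to do that, not part of the solution above. The name sitemaps_from_robots is hypothetical, and the choice to return every declared sitemap instead of just the first is an assumption.

```python
def sitemaps_from_robots(text):
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    urls = []
    for line in text.splitlines():
        # Drop trailing comments, which start with '#'
        line = line.split("#", 1)[0]
        # Split at the first colon only, so the URL's own colons survive
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls
```

The field name is matched case-insensitively, so lines starting with sitemap: or SITEMAP: are picked up as well.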
Winning solution

Here is my solution:

Usage:

python3 main.py data.csv output.csv

Source main.py:

import sys
import csv
import requests
import re


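def get_sitemap_url_from_robots(url):
    # Only trust robots.txt when the server actually returns it (status 200)
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        return False
    # Match a "Sitemap:" line in any letter case and capture the URL only
    matches = re.search(r'^\s*Sitemap:\s*(\S+)', response.text, re.M | re.I)
    if matches:
        return matches.group(1)
    return False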


def get_sitemap_url(domain):
    print("checking", domain)
    domain = domain.split("//")[-1].split("/")[0]  # cleanup domain
    sitemap_url_versions = ["http://" + domain, "https://" + domain, "http://www." + domain, "https://www." + domain]

    connected = False
    for url in sitemap_url_versions:
        # Catch errors per URL so one unreachable variant does not abort the rest
        try:
            if requests.get(url + '/sitemap.xml', timeout=10).status_code == 200:
                return url + "/sitemap.xml"
            connected = True
            sitemap_url = get_sitemap_url_from_robots(url + "/robots.txt")
            if sitemap_url:
                return sitemap_url
        except requests.RequestException:
            continue

    if not connected:
        print("=> error: could not connect to", domain)
    return False


def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    print(f"Processing {input_file}")
    with open(input_file) as csvfile:
        list_domain = csv.reader(csvfile, delimiter=',')
        for current_domain in list_domain:
            if current_domain:
                sitemap_url = get_sitemap_url(current_domain[0])
                if (sitemap_url):
                    print("=> success", sitemap_url)
                    with open(output_file, "a") as f:
                        f.write("\"" + sitemap_url + "\",\n")
                else:
                    print("=> failed", current_domain[0])


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python3 main.py input_filename.csv output_filename.csv")
    else:
        main()

Example Source input.csv:

"sohu.com"
"taobao.com"
"babytree.com"
"medium.com"
"huanqiu.com"
"17ok.com"
"fiverr.com"
"instructure.com"
"discordapp.com"
"wordpress.com"

Example Output:

$ python3 main.py input.csv output.csv

Processing input.csv
checking sohu.com
=> success http://sohu.com/sitemap.xml
checking taobao.com
=> success http://taobao.com/sitemap.xml
checking babytree.com
=> success http://babytree.com/sitemap.xml
checking medium.com
=> success https://medium.com/sitemap/sitemap.xml
checking huanqiu.com
=> error: could not connect to huanqiu.com
=> failed huanqiu.com
checking 17ok.com
=> error: could not connect to 17ok.com
=> failed 17ok.com
checking fiverr.com
=> success http://fiverr.com/sitemap.xml
checking instructure.com
=> success http://instructure.com/sitemap.xml
checking discordapp.com
=> failed discordapp.com
checking wordpress.com
=> success http://wordpress.com/sitemap.xml

Example Source output.csv

"http://sohu.com/sitemap.xml",
"http://taobao.com/sitemap.xml",
"http://babytree.com/sitemap.xml",
"https://medium.com/sitemap/sitemap.xml",
"http://fiverr.com/sitemap.xml",
"http://instructure.com/sitemap.xml",
"http://wordpress.com/sitemap.xml",