Optimize python script for parallel processing/scale

We are using the script from this bounty

https://bountify.co/python-cleanup-domain-list-to-make-urls-and-find-the-sitemap

But we are having an issue with running it against a big CSV (10k+ domains), using a large multi-core machine on AWS. In this bounty we are looking for someone to tweak it a bit to use the available resources and process the list much more quickly. Ideally the process can be sub 2 hrs.

We can assume something like 16+ cores, a 10 Gb network card, and tons of memory.

I am not good with Python, but I took a look. As far as I can see, that script is very inefficient.
Stefano Balzarotti 11 days ago
Can we port it to C++, Rust, or NodeJS? Or must it be in Python?
alv-c 11 days ago
Good question. So we have this bundled up as an exe for a non-technical stakeholder. If we can get the final solution to an exe on Windows, then we're in the game.
Qdev 11 days ago
If you give me enough time I can build a Rust solution.
alv-c 11 days ago
The problem is not multithreading; yes, it can give some improvement, but there aren't many CPU calculations. It's all I/O time, so the best solution should be asynchronous and avoid useless GET requests. The language is not important. I am more confident with C#, but this can be done easily in Python too. C/C++ could be good for saving memory.
Stefano Balzarotti 11 days ago
awarded to Wuddrum
Tags
python3


3 Solutions

Winning solution

Hey Qdev, here's the parallelized script:

import sys
import csv
import requests
import re

from joblib import Parallel, delayed


def get_sitemap_url_from_robots(url):
    text = requests.get(url).text
    matches = re.search(r'Sitemap: (.*)', text, re.M | re.I)
    if matches and matches.group(1):
        return matches.group(1)
    else:
        return False


def get_sitemap_url(domain):
    try:
        print("checking", domain)
        domain = domain.split("//")[-1].split("/")[0]  # cleanup domain
        sitemap_url_versions = ["http://" + domain, "https://" + domain, "http://www." + domain, "https://www." + domain]

        for url in sitemap_url_versions:
            if requests.get(url + '/sitemap.xml').status_code == 200:
                return url + "/sitemap.xml"
            else:
                sitemap_url = get_sitemap_url_from_robots(url + "/robots.txt")
                if sitemap_url:
                    return sitemap_url

        return False
    except Exception as err:
        print("=> error: could not connect to ", domain)


def process_domain(domain):
    if domain:
        sitemap_url = get_sitemap_url(domain[0])
        if sitemap_url:
            print("=> success", sitemap_url)
            return sitemap_url
        else:
            print("=> failed", domain[0])


def main():
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    threads = int(sys.argv[3])
    print(f"Processing {input_file}")
    with open(input_file) as csvfile:
        with open(output_file, "a") as f:
            list_domain = csv.reader(csvfile, delimiter=',')
            sitemap_urls = Parallel(n_jobs=threads)(delayed(process_domain)(current_domain) for current_domain in list_domain)
            for sitemap_url in sitemap_urls:
                if sitemap_url:
                    f.write("\"" + sitemap_url + "\",\n")


if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python3 main.py input_filename.csv output_filename.csv 32")
    else:
        main()

You'll need to do pip install joblib.

Usage is the same as before, except there's now an additional argument at the end that specifies the number of threads to run with. With your specified hardware I'd say it's safe to run it with 32 threads, if not more.

Example: python3 main.py input_filename.csv output_filename.csv 32

Edit: Minor change in code order. The script now opens the output file before starting any domain checking. This way, if the script can't access the output file, it fails immediately instead of only after all of the processing is already done.

I also noticed that there are no timeouts set for the requests, so here's a version of the script with a 3.5 second connection timeout and 6 second data fetch timeout: https://pastebin.com/ictk6P45
This way the threads won't wait too long on dead/malfunctioning domains.
You can adjust the timeouts on lines 10 and 25.
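
For reference, requests accepts a (connect, read) tuple as its timeout argument, so the change is small. A minimal sketch with the 3.5 s / 6 s values mentioned above (not the exact pastebin code):

import requests

TIMEOUT = (3.5, 6)   # (connect timeout, read timeout) in seconds, as in the note above

def fetch(url):
    try:
        # requests accepts a (connect, read) tuple for the timeout argument
        return requests.get(url, timeout=TIMEOUT)
    except requests.exceptions.Timeout:
        return None   # dead/slow domain: give up instead of blocking the worker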
Wuddrum 11 days ago

Hi, I'm the author of the script that you mentioned above. I have updated it to support parallel requests.
Please check here: https://github.com/minhtc/find-sitemap-url

The last parameter is the number of threads you want to use.

I just found that robots.txt can have multiple sitemaps. For example, there are 4 sitemap URLs in https://www.fiverr.com/robots.txt. I have updated the Python script to extract all sitemaps from the robots file, please check here: https://github.com/minhtc/find-sitemap-url/
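
For illustration, switching from re.search to re.findall is enough to collect every Sitemap: line. A minimal sketch, not necessarily the exact code in the linked repo:

import re
import requests

def get_sitemap_urls_from_robots(robots_url):
    # robots.txt may declare several "Sitemap:" lines; findall returns all of them
    text = requests.get(robots_url, timeout=(3.5, 6)).text
    return re.findall(r'^Sitemap:\s*(\S+)', text, re.M | re.I)

# e.g. get_sitemap_urls_from_robots("https://www.fiverr.com/robots.txt")
# should return the 4 sitemap URLs mentioned above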
meo 11 days ago
I vote for your solution; I see you replaced GET with HEAD and you use a ProcessPool. I am not good enough in Python to suggest further improvements, but as far as I can see yours is the best solution.
Stefano Balzarotti 10 days ago

I am not good enough with Python to provide a solution right now. If you can wait a few days I can work on it, but I think there are people here who can do better than me, so I prefer to write some advice.

To make it very fast, you just need to change a couple of things:

1) To know if a file exists on the server, you just need to make an HTTP request with the HEAD verb and check the response. A HEAD request is much faster than a GET request, both on the client and on the server side.

The server doesn't need to send any content, and the client downloads only the headers.

2) Make the requests asynchronous. You can use the async/await pattern or callbacks; the important thing is to process URLs concurrently (concurrent doesn't mean parallel).

If you use async/await, it is important not to await each single request; you need to start many requests at the same time and await them all (see the first sketch after this list).

I don't know Python very well; in C# there is Task.WaitAll that handles this automatically, otherwise you need to use semaphores.

3) Use a connection pool. To make many requests in parallel you need to open many HTTP sessions, and if you make many requests on the same session you risk that they will be executed in sequence.
If you open too many sessions you risk using too much memory, and many requests can time out, so use a connection pool.

4) Handle timeouts. Don't keep a connection open too long, or you will slow down other requests waiting for a slot in the connection pool.

5) Pay attention to responses different from 200: 202 can be OK too, and 301 and 302 imply a redirect to another URL that may also be OK.

6) If you want to get a little better performance, you can use multithreading for real parallelism. The approach is not very different from the above, but you need to take some precautions (see the second sketch after this list).

6.1) The number of threads should be dynamic, with a maximum. Don't start useless threads: creating a thread can be slow and memory-consuming, and it is often faster to reuse a thread than to start a new one, so it is better if you can use a thread pool.

6.2) Use semaphores to synchronize threads, and wait for all threads to complete.

6.3) Make it thread-safe and use queues. A good approach is to put each result in a queue, and let another thread drain the queue into the output CSV.

7) As I see it, you have only 10k domains and tons of memory, so it's not a big deal to load everything into memory.
But if some day you need to process billions of domains, you can improve the algorithm by loading the records in chunks.
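
To make the advice above concrete, here is a minimal sketch of points 1-5 in Python, assuming the aiohttp library (not used anywhere else in this thread) is an acceptable dependency; the 100-request cap and the plain http:// URL are illustrative only:

import asyncio
import aiohttp

CONCURRENCY = 100                          # max simultaneous requests (illustrative)
TIMEOUT = aiohttp.ClientTimeout(total=10)  # point 4: never wait forever on a dead host

async def has_sitemap(session, semaphore, domain):
    url = "http://" + domain + "/sitemap.xml"
    async with semaphore:                  # cap how many requests run at once
        try:
            # point 1: HEAD returns only status/headers, no body is downloaded
            # allow_redirects=True so 301/302 to the real sitemap still count (point 5)
            async with session.head(url, allow_redirects=True) as resp:
                return url if resp.status in (200, 202) else None
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return None

async def check_all(domains):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    # point 3: one session = one reusable connection pool shared by every request
    async with aiohttp.ClientSession(timeout=TIMEOUT) as session:
        tasks = [has_sitemap(session, semaphore, d) for d in domains]
        return await asyncio.gather(*tasks)  # point 2: start everything, then await all

# sitemaps = [u for u in asyncio.run(check_all(["example.com", "example.org"])) if u]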
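
And a sketch of points 6.1-6.3 using only the standard library plus requests: a bounded thread pool instead of hand-started threads, and a single writer thread fed through a queue so the CSV output stays thread-safe (the pool's shutdown wait plays the role of the explicit semaphores in 6.2). check_domain is a simplified stand-in for the real sitemap lookup:

import csv
import queue
import threading
import requests
from concurrent.futures import ThreadPoolExecutor

results = queue.Queue()
DONE = object()                              # sentinel that tells the writer to stop

def check_domain(domain):
    # simplified stand-in for the real sitemap lookup
    try:
        url = "http://" + domain + "/sitemap.xml"
        resp = requests.head(url, timeout=(3.5, 6), allow_redirects=True)
        if resp.status_code == 200:
            results.put(url)                 # 6.3: the queue handles the locking
    except requests.exceptions.RequestException:
        pass

def writer(path):
    # 6.3: a single thread drains the queue into the output CSV
    with open(path, "w", newline="") as f:
        out = csv.writer(f)
        for item in iter(results.get, DONE):
            out.writerow([item])

def run(domains, output_path, workers=32):
    writer_thread = threading.Thread(target=writer, args=(output_path,))
    writer_thread.start()
    with ThreadPoolExecutor(max_workers=workers) as pool:  # 6.1: capped, reusable pool
        pool.map(check_domain, domains)
    results.put(DONE)                        # 6.2: all workers are done, stop the writer
    writer_thread.join()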
