Mass download list of APKs by Package Names

REVISED:
The original post was not getting traction, so I am revising it in the hope of being able to award the bounty.

Now, I will provide a list of Android Package Names, for example com.carezone.caredroid.careapp.medications or com.lifescan.reveal,
and I need a script that can download them all to my local hard drive. My list will have thousands of Package Names, so some ability to download in parallel will be helpful.

You may retrieve the APKs via whichever method works for you, but I know there are sites like apkpure.com that may help.

Thanks!

ORIGINAL:
AppAnnie has lists of top Android apps by category. For example,

Business category: https://www.appannie.com/en/apps/google-play/top/united-states/application/business/

You can sign up for a free account to get the list of the top 500 in each category.

I need a script to download all the APKs in each category. The APKs will all download to the /downloaded directory (relative path). Output will be a CSV listing each app as follows:
App Name, Package Name, App Size, Location (relative path, i.e. ./downloaded/xyz.apk)
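
For example, a row might look like this (app name and size are made up):

Example App, com.example.app, 12345678, ./downloaded/com_example_app.apk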

APKs can be downloaded from a site like https://apkpure.com or https://www.apkturbo.com; maybe you can find a better one. Or use another technique if you prefer, but please check with me first.

The script should preferably be written in Node or Python, but other languages are OK too; please check with me first.

In addition to submitting the script, please also submit a CSV file covering the Business category apps so that we can spot-check that it works. No need to submit any APKs in /downloaded; the CSV is sufficient to cross-check a few manually.

I am happy to clarify further if you have any questions.

Thanks,
John

Does the script also need to get the list of APKs from appannie?
slang800 9 months ago
Yes, for each category on appannie, it needs to get the list of the top 500 apps, list them in CSV, and also get the APK for each of them. Thank you for looking.
jcszephyr 9 months ago
It seems like you need to use a paid account to actually export the list, even through the API. https://support.appannie.com/hc/en-us/articles/115014440748-2-App-Rank-History
slang800 9 months ago
If you have a paid account, feel free to email me at slang800@gmail.com and I can walk you through creating an API key that I can use to test this out.
slang800 9 months ago
With a free account, you should be able to see the list of the top 500 in each category and pull from there without paying. No API access, just pull through the website? Thanks.
jcszephyr 9 months ago
In theory, you could, but App Annie makes all their money by selling that kind of data, so I bet they put a significant amount of effort into preventing people from screen-scraping it, and if you're doing these downloads frequently enough, I bet they'd notice. Also, I'd need to write a script that does a fake login to the App Annie website, and if any part of their site changes, that script could break.
slang800 9 months ago
If you're scraping this data infrequently, and don't mind it not being fully automated, I could make a browser extension that exports the category lists after you've already logged in... that would be pretty easy and wouldn't break if they change around their login flow.
slang800 9 months ago
Hi, I created an account but was unable to view the top app list. Where do I get the top list? Thank you
Codeword 9 months ago
Hi, do you want to download paid apps or games too? If you don't need to download paid apps, you can look for other simple solutions available online.
Hasan Bayat 9 months ago
3 solutions, but none is good enough. You must try harder.
bbb 9 months ago
Thank you to all involved for the fantastic work.
jcszephyr 8 months ago
Welcome, jcszephyr. Glad I could contribute. Thank you
Codeword 8 months ago
Why didn't my solution win?
Hasan Bayat 8 months ago
awarded to CyteBode


5 Solutions


Work so far (posting here because the comments section is cluttered):

The App Annie API doesn't allow you to download category rankings unless you have an App Annie Intelligence subscription, which has a significant cost. OP wants a solution that can screen-scrape the App Annie website using a free account, which is possible, but is a fragile solution.

App Annie makes all their money by selling that kind of data, so I bet they put a significant amount of effort into preventing people from screen-scraping it, and if OP is doing these downloads frequently enough, I bet they'd notice. Also, I'd need to write a script that does a fake login to the App Annie website, and if any part of their site changes, or requires a captcha to login, that script could break.

Once you're already logged into the website, the data is easy enough to get with the URL https://www.appannie.com/ajax/top-chart/table/?market=google-play&country_code=US&category=<category>&date=<date>&rank_sorting_type=rank&page_size=500&order_by=sort_order&order_type=desc.

That gives data in the format:

[
  {
    "sort_metric": "changeInRank",
    "name": "Run Sausage Run!",
    "company_url": "/company/1000200000003181/",
    "headerquarters": "Israel",
    "id": 20600008190011,
    "url": "/apps/google-play/app/com.crazylabs.sausage.run/details/",
    "company_name": "TabTale",
    "country_code": "il",
    "app_icon_css": "gp",
    "iap": true,
    "change": 0,
    "icon": "https://static-s.aa-cdn.net/img/gp/20600008190011/5CKz2OBj0E2IQ_-Ms_r0u13rQ7KAzlgDVAVBWQdhTAn5jbh6ru349hkvjxD72x-CkAsy=w300_w80"
  }
]
[
  {
    "sort_metric": "changeInRank",
    "name": "Minecraft",
    "company_url": "/company/1000200000016666/",
    "headerquarters": "Sweden",
    "id": 20600000000768,
    "url": "/apps/google-play/app/com.mojang.minecraftpe/details/",
    "company_name": "Mojang",
    "country_code": "se",
    "app_icon_css": "gp",
    "iap": true,
    "change": 0,
    "icon": "https://static-s.aa-cdn.net/img/gp/20600000000768/VSwHQjcAttxsLE47RuS4PqpC4LT7lCoSjE7Hx5AW_yCxtDvcnsHHvm5CTuL5BPN-uRTP=w300_w80"
  }
]
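
For instance, once you've logged in through your browser, a rough sketch like the following could pull one category page with requests. The cookie name, user agent string, category token, and date below are placeholders you'd copy or adapt from your own browser session, and the exact shape of the JSON envelope would need to be checked against the real response:

import requests

AJAX_URL = ("https://www.appannie.com/ajax/top-chart/table/"
            "?market=google-play&country_code=US&category=%s&date=%s"
            "&rank_sorting_type=rank&page_size=500"
            "&order_by=sort_order&order_type=desc")

session = requests.Session()
# Placeholders: copy the real values from your browser's developer tools
session.headers.update({"User-Agent": "<your browser's user agent>"})
session.cookies.update({"sessionid": "<session cookie copied from browser>"})

# Placeholder category token and date; adapt to the category you want
resp = session.get(AJAX_URL % ("business", "2018-01-01"))
resp.raise_for_status()
# Assumes the endpoint returns a JSON list of app entries like the sample above
for app in resp.json():
    print(app["name"], app["url"])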

If you're scraping this data infrequently, and don't mind it not being fully automated, I could make a browser extension that exports the category lists after you've already logged in... that would be pretty easy and wouldn't break if they change around their login flow.

Hi - this looks like a great start. This only needs to run to completion very infrequently, just a few times per year, and it's OK if it's fragile as long as it works the first time. The end result, though, is the downloaded APK files, not just the listings. Please check the original problem description in the post above. For login - could you log in via your browser with a free account, and then attach the session cookies, user agent, etc. to your crawler requests? There should be no need to do the actual login via your script. Thanks, John
jcszephyr 9 months ago
Please find revised bounty above.
jcszephyr 9 months ago

Multi-threaded Python 2.7 program

main.py

import threading
import uuid
from Queue import Queue
from spider import Spider
from general import *
import time
import urllib2

PROJECT_NAME = 'downloaded_directory' # the directory in which you want to download the apk

HOMEPAGE = 'https://apkpure.com'  
APP_LIST = 'app_list.txt'

NUMBER_OF_THREADS = 4  # number of threads you want
queue = Queue()
Spider(PROJECT_NAME, HOMEPAGE, APP_LIST)
MAX_REQ = 50  # maximum number of requests before pausing
x = 1
def create_spider():
    for _ in range(NUMBER_OF_THREADS):
        t = threading.Thread(target=work)
        #threads.append(t)
        t.daemon = True
        t.start()



def work():
    global x
    while True:
        if x >= MAX_REQ:
            x = 1
            time.sleep(5)
            print "sleeping 5 sec"
        apk = queue.get()
        Spider.crawl_page(threading.current_thread().name, apk)
        queue.task_done()
        x +=1


def create_jobs():
    for link in file_to_set(APP_LIST):
        queue.put(link)
    queue.join()
    crawl()


def crawl():
    queued_links = file_to_set(APP_LIST)
    if len(queued_links) > 0:
        print(str(len(queued_links)) + ' links in the queue')
        create_jobs()

create_spider()
crawl()

# function to download apk files
# this function reads the crawled_list.txt file generated by the program, which contains the download links of the apk files fetched by the spider
def download_apk():
    with open('crawled_list.txt') as f:
        for line in f:
            # each line here is the download link of an apk; you can use CyteBode's download function to download the file
            pass

download_apk()

spider.py

from bs4 import BeautifulSoup
import requests
from general import *

class Spider:

    project_name = ''
    queue_file = ''
    crawled_file = ''
    search_page = ''
    queue = set()
    crawled = set()

    def __init__(self, project_name, search_page, app_list):
        Spider.project_name = project_name
        Spider.search_page = search_page

        Spider.queue_file = app_list
        Spider.crawled_file = 'crawled_list.txt'
        self.boot()


    @staticmethod
    def boot():
        create_project_dir(Spider.project_name)
        create_crawled_list(Spider.crawled_file)
        Spider.queue = file_to_set(Spider.queue_file)
        Spider.crawled = file_to_set(Spider.crawled_file)


    @staticmethod
    def crawl_page(thread_name, apk):
        if apk not in Spider.crawled:
            print(thread_name + ' now crawling ' + apk + '\n')
            print('Queue ' + str(len(Spider.queue)) + ' | Crawled  ' + str(len(Spider.crawled)))
            s = Spider.gather_download_link(Spider.search_page+'/search?q=' + apk)
            Spider.add_link_to_queue(s)
            Spider.queue.remove(apk)
            Spider.update_files()


    @staticmethod
    def gather_download_link(search_url):

        try:
            response = requests.get(search_url, stream=True)
            soup = BeautifulSoup(response.text, "html.parser")
            results = soup.findAll('a', attrs={'class': 'more-down'})
            if results:
                link_part = results[0]['href']
                response_1 = requests.get(Spider.search_page+link_part+'/download?from=details', stream=True)
                soup_1 = BeautifulSoup(response_1.text, "html.parser")
                inner_list = soup_1.findAll('a', attrs={'id': 'download_link'})
                if inner_list:
                    return inner_list[0]['href']
        except Exception as e:
            print(str(e))
            return set()


    @staticmethod
    def add_link_to_queue(link):
        if link not in Spider.crawled:
            Spider.crawled.add(link)

    @staticmethod
    def update_files():
        set_to_file(Spider.queue, Spider.queue_file)
        set_to_file(Spider.crawled, Spider.crawled_file)

general.py

import os

def create_project_dir(directory):
    if not os.path.exists(directory):
        print('Wait Creating directory ' + directory)
        os.makedirs(directory)



def create_crawled_list(crawled_list):
    if not os.path.isfile(crawled_list):
        write_file(crawled_list, '')


def write_file(path, data):
    with open(path, 'w') as f:
        f.write(data)


def append_to_file(path, data):
    with open(path, 'a') as file:
        file.write(data + '\n')



def delete_file_contents(path):
    open(path, 'w').close()


def file_to_set(file_name):
    results = set()
    with open(file_name, 'rt') as f:
        for line in f:
            results.add(line.replace('\n', ''))
    return results


def set_to_file(links, file_name):
    with open(file_name,"w") as f:
        for l in sorted(links):
            f.write(l+"\n")

NOTE
1. Place all the files in the same folder.
2. Create a text file named app_list.txt containing the list of package names, one per line.
3. This is a multithreaded application, which means it can find multiple download links at the same time, making it suitable for a large list.
4. I have not written the download function; you can use the function provided by CyteBode (or see the rough sketch below).
5. All the download links the program finds are written to a separate text file named crawled_list.txt, which the program creates automatically.
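
For reference, a rough sketch of what such a download function could look like. It assumes the links in crawled_list.txt are direct APK URLs and derives the filename from the URL, which may not match the real APK name:

import os
import requests

DOWNLOAD_DIR = 'downloaded_directory'  # same directory the spider creates

def download_apk():
    with open('crawled_list.txt') as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            # derive a filename from the URL; the real APK name may differ
            filename = url.split('/')[-1].split('?')[0] or 'app.apk'
            local_path = os.path.join(DOWNLOAD_DIR, filename)
            r = requests.get(url, stream=True)
            with open(local_path, 'wb') as out:
                for chunk in r.iter_content(chunk_size=65536):
                    if chunk:
                        out.write(chunk)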

It's nice that you're parallelizing the search for download URLs, but the real bottleneck is the downloading of the APKs themselves, which you're leaving single-threaded. Add Python's Global Interpreter Lock to that, and you're not getting any performance gain overall, just more complex code. If any of the threads throws an exception (e.g. if you put a paid app in app_list.txt, or a non-existing one), the whole program hangs.
CyteBode 9 months ago
Also, a few bugs: Line 22 of main.py: threads does not exist. Line 53 of spider.py: You're returning an empty set, which then gets put into another set by add_link_to_queue, but sets are unhashable so an exception is thrown. Furthermore, since you're finding all the download links in the page, you end up with two links for the same file in crawled_list.txt.
CyteBode 9 months ago
Hi CyteBode, thanks for your review. The threads bug was actually a typo. I also know what you say is right: the bottleneck is downloading the files one by one, and neither of us can escape that. If we made the downloads multithreaded, you can guess what would happen, so downloading is best done single-threaded. But I thought that if the app list is huge, a multithreaded approach could resolve the download links much faster; that's why I took this approach. Thank you once again
Codeword 9 months ago
CyteBode, it doesn't matter to me who wins the bounty; the only thing that matters is that we are contributing as a whole and as a team to solve a problem. Cheers to the team :)
Codeword 9 months ago

Here is my solution using the Node.js runtime, with a few dependencies.

I made a repository for my solution; it is called APK Scrape.

It is not yet published on npm; if you want, I can publish it there. This solution is async and works in parallel, since Node.js is asynchronous.

Here is a copy/paste of the readme file:

APK Scrape

Scrape and download APK using package identifier from APKPure

Usage

apk-scrape -p ./packages.txt -d ./download

Installation

Clone the repository:

git clone https://github.com/EmpireWorld/apk-scrape.git

Use the command line:

./apk-scrape -p ./packages.txt -d ./download

Help

apk-scrape --help
Winning solution

Here is my solution, following the revised bounty:

import os
import os.path
import sys
import re
import time

from bs4 import BeautifulSoup
import requests


DOMAIN = "https://apkpure.com"
SEARCH_URL = DOMAIN + "/search?q=%s"

DOWNLOAD_DIR = "./downloaded"
PACKAGE_NAMES_FILE = "package_names.txt"
OUTPUT_CSV = "output.csv"

PROGRESS_UPDATE_DELAY = 0.25


def download_file(url, package_name):
    r = requests.get(url, stream=True)

    content_disposition = r.headers.get("content-disposition", "")
    match = re.search(r'attachment; filename="(.*)"', content_disposition)
    if match:
        filename = match.group(1)
    else:
        filename = "%s.apk" % (package_name.replace(".", "_"))

    local_path = os.path.normpath(os.path.join(DOWNLOAD_DIR, filename))
    sys.stdout.write("Downloading %s... " % filename)

    total_size = int(r.headers.get('content-length', 0))
    size = 0
    sys.stdout.write("% 6.2f%%" % 0.0)
    t = time.time()
    with open(local_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=65536):
            if chunk:
                size += len(chunk)
                f.write(chunk)

                nt = time.time()
                if nt - t >= PROGRESS_UPDATE_DELAY:
                    sys.stdout.write("\b" * 7)
                    sys.stdout.write("% 6.2f%%" % (100.0 * size / total_size))
                    sys.stdout.flush()
                    t = nt
    sys.stdout.write("\b" * 7)
    sys.stdout.write("100.00%\n")

    return (local_path, size)


if __name__ == '__main__':
    # Output CSV
    output_csv = open(OUTPUT_CSV, "w")
    output_csv.write("App name,Package name,Size,Location\n")


    # Create download directory
    if not os.path.exists(DOWNLOAD_DIR):
        os.makedirs(DOWNLOAD_DIR)
    elif not os.path.isdir(DOWNLOAD_DIR):
        print("%s is not a directory." % DOWNLOAD_DIR)
        sys.exit(-1)


    for line in open(PACKAGE_NAMES_FILE, "r").readlines():
        package_name = line.strip()

        # Search page
        url = SEARCH_URL % package_name
        r = requests.get(url)

        if r.status_code != 200:
            print("Could not get search page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        first_result = soup.find("dl", class_="search-dl")
        if first_result is None:
            print("Could not find %s" % package_name)
            continue

        search_title = first_result.find("p", class_="search-title")
        search_title_a = search_title.find("a")

        app_name = search_title.text.strip()
        app_url = search_title_a.attrs["href"]


        # App page
        url = DOMAIN + app_url
        r = requests.get(url)

        if r.status_code != 200:
            print("Could not get app page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        download_button = soup.find("a", class_=" da")

        if download_button is None:
            print("%s is a paid app. Could not download." % package_name)
            continue

        download_url = download_button.attrs["href"]


        # Download app page
        url = DOMAIN + download_url
        r = requests.get(url)

        if r.status_code != 200:
            print("Could not get app download page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        download_link = soup.find("a", id="download_link")
        download_apk_url = download_link.attrs["href"]

        path, size = download_file(download_apk_url, package_name)


        # Write row to output CSV
        output_csv.write(",".join([
            '"%s"' % app_name.replace('"', '""'),
            '"%s"' % package_name.replace('"', '""'),
            "%d" % size,
            '"%s"' % path.replace('"', '""')]))
        output_csv.write("\n")

The script requires requests and bs4 (BeautifulSoup). The file containing the list of package names (package_names.txt) is just a text file with one entry per line. I tested the script with the two example package names you gave.
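
For reference, the package_names.txt I tested with just contains:

com.carezone.caredroid.careapp.medications
com.lifescan.reveal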

I tested the script on Windows and Ubuntu with Python 3.6 and 2.7.

Edit 1: Added mkdir for the download directory. Added double quotes for the csv entries. Made the download function parse the filename from the header. Made the script run on Python 2.7.

Edit 2: Made the progress update only every 0.25 seconds. Fixed the integer division bug with the progress update on Python 2.7. Changed mkdir to makedirs. Added .replace('"', '""') to the CSV entries to escape double quotes. Minor cleanup and refactoring.

Update: I just tested my code with Python 2.7 on a different machine, and it actually works, so the https issue I mentioned was just a problem with the VM I used.
CyteBode 9 months ago

Here's an updated version of my solution, with concurrent downloads:

import math
from multiprocessing import Process, Queue
import os
import os.path
import re
import sys
import time

try:
    # Python 3
    from queue import Empty as EmptyQueueException
    from queue import Full as FullQueueException
except ImportError:
    # Python 2
    from Queue import Empty as EmptyQueueException
    from Queue import Full as FullQueueException

from bs4 import BeautifulSoup
import requests


DOMAIN = "https://apkpure.com"
SEARCH_URL = DOMAIN + "/search?q=%s"

DOWNLOAD_DIR       = "./downloaded/"
PACKAGE_NAMES_FILE = "package_names.txt"
OUTPUT_CSV         = "output.csv"


CONCURRENT_DOWNLOADS  = 4
CHUNK_SIZE            = 128*1024 # 128 KiB
PROGRESS_UPDATE_DELAY = 0.25
PROCESS_TIMEOUT       = 10.0


MSG_ERROR    = -1
MSG_PAYLOAD  =  0
MSG_START    =  1
MSG_PROGRESS =  2
MSG_END      =  3


class SplitProgBar(object):
    @staticmethod
    def center(text, base):
        if len(text) <= len(base):
            left = (len(base) - len(text)) // 2
            return "%s%s%s" % (base[:left], text, base[left+len(text):])
        else:
            return base

    def __init__(self, n, width):
        self.n = n
        self.sub_width = int(float(width-(n+1))/n)
        self.width = n * (self.sub_width + 1) + 1
        self.progress = [float("NaN")] * n

    def __getitem__(self, ix):
        return self.progress[ix]

    def __setitem__(self, ix, value):
        self.progress[ix] = value

    def render(self):
        bars = []
        for prog in self.progress:
            if math.isnan(prog) or prog < 0.0:
                bars.append(" " * self.sub_width)
                continue
            bar = "=" * int(round(prog*self.sub_width))
            bar += " " * (self.sub_width-len(bar))
            bar = SplitProgBar.center(" %.2f%% " % (prog*100), bar)
            bars.append(bar)

        new_str = "|%s|" % "|".join(bars)
        sys.stdout.write("\r%s" % new_str)

    def clear(self):
        sys.stdout.write("\r%s\r" % (" " * self.width))


class Counter(object):
    def __init__(self, value = 0):
        self.value = value

    def inc(self, n = 1):
        self.value += n

    def dec(self, n = 1):
        self.value -= n

    @property
    def empty(self):
        return self.value == 0


def download_process(id_, qi, qo):
    def send_progress(progress):
        try:
            qo.put_nowait((MSG_PROGRESS, (id_, progress)))
        except FullQueueException:
            pass

    def send_error(msg):
        qo.put((MSG_ERROR, (id_, msg)))

    def send_start(pkg_name):
        qo.put((MSG_START, (id_, pkg_name)))

    def send_finished(pkg_name, app_name, size, path, already=False):
        if already:
            qo.put((MSG_END, (id_, pkg_name, app_name, size, path)))
        else:
            qo.put((MSG_PAYLOAD, (id_, pkg_name, app_name, size, path)))

    while True:
        message = qi.get()

        if message[0] == MSG_PAYLOAD:
            package_name, app_name, download_url = message[1]
        elif message[0] == MSG_END:
            break

        try:
            r = requests.get(download_url, stream=True)
        except requests.exceptions.ConnectionError:
            send_error("Connection error")
            continue

        if r.status_code != 200:
            send_error("HTTP Error %d" % r.status_code)
            r.close()
            continue

        content_disposition = r.headers.get("content-disposition", "")
        content_length = int(r.headers.get('content-length', 0))

        filename = re.search(r'filename="(.+)"', content_disposition)
        if filename and filename.groups():
            filename = filename.groups()[0]
        else:
            filename = "%s.apk" % (package_name.replace(".", "_"))

        local_path = os.path.normpath(os.path.join(DOWNLOAD_DIR, filename))

        if os.path.exists(local_path):
            if not os.path.isfile(local_path):
                # Not a file
                send_error("%s is a directory." % local_path)
                r.close()
                continue
            if os.path.getsize(local_path) == content_length:
                # File has likely already been downloaded
                send_finished(
                    package_name, app_name, content_length, local_path, True)
                r.close()
                continue

        send_start(package_name)

        size = 0
        t = time.time()
        with open(local_path, "wb+") as f:
            for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                if chunk:
                    size += len(chunk)
                    f.write(chunk)

                    nt = time.time()
                    if nt - t >= PROGRESS_UPDATE_DELAY:
                        send_progress(float(size) / content_length)
                        t = nt

        send_finished(package_name, app_name, size, local_path)


def search_process(qi, qo):
    def send_error(msg):
        qo.put((MSG_ERROR, msg))

    def send_payload(pkg_name, app_name, dl_url):
        qo.put((MSG_PAYLOAD, (pkg_name, app_name, dl_url)))

    while True:
        message = qi.get()

        if message[0] == MSG_PAYLOAD:
            package_name = message[1]
        elif message[0] == MSG_END:
            break

        # Search page
        url = SEARCH_URL % package_name
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            send_error("Connection error.")
            continue

        if r.status_code != 200:
            send_error("Could not get search page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        first_result = soup.find("dl", class_="search-dl")
        if first_result is None:
            send_error("Could not find %s." % package_name)
            continue

        search_title = first_result.find("p", class_="search-title")
        search_title_a = search_title.find("a")

        app_name = search_title.text.strip()
        app_url = search_title_a.attrs["href"]


        # App page
        url = DOMAIN + app_url
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            send_error("Connection error.")
            continue

        if r.status_code != 200:
            send_error("Could not get app page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        download_button = soup.find("a", class_=" da")

        if download_button is None:
            send_error("%s is a paid app. Could not download." % package_name)
            continue

        download_url = download_button.attrs["href"]


        # Download app page
        url = DOMAIN + download_url
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            send_error("Connection error.")
            continue

        if r.status_code != 200:
            send_error("Could not get app download page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        download_link = soup.find("a", id="download_link")
        download_apk_url = download_link.attrs["href"]

        send_payload(package_name, app_name, download_apk_url)


def main():
    # Create the download directory
    if not os.path.exists(DOWNLOAD_DIR):
        os.makedirs(DOWNLOAD_DIR)
    elif not os.path.isdir(DOWNLOAD_DIR):
        print("%s is not a directory." % DOWNLOAD_DIR)
        return -1


    # Read the package names
    if not os.path.isfile(PACKAGE_NAMES_FILE):
        print("Could not find %s." % PACKAGE_NAMES_FILE)
        return -1

    with open(PACKAGE_NAMES_FILE, "r") as f:
        package_names = [line.strip() for line in f.readlines()]


    # CSV file header
    with open(OUTPUT_CSV, "w+") as csv:
        csv.write("App name,Package name,Size,Location\n")


    # Message-passing queues
    search_qi = Queue()
    search_qo = Queue()

    download_qi = Queue()
    download_qo = Queue()


    # Search Process
    search_proc = Process(target=search_process, args=(search_qo, search_qi))
    search_proc.start()


    # Download Processes
    download_procs = []
    for i in range(CONCURRENT_DOWNLOADS):
        download_proc = Process(target=download_process,
                                args=(i, download_qo, download_qi))
        download_procs.append(download_proc)
        download_proc.start()


    active_tasks = Counter()
    def new_search_query():
        if package_names:
            search_qo.put((MSG_PAYLOAD, package_names.pop(0)))
            active_tasks.inc()
            return True
        return False

    # Send some queries to the search process
    for _ in range(CONCURRENT_DOWNLOADS + 1):
        new_search_query()


    prog_bars = SplitProgBar(CONCURRENT_DOWNLOADS, 80)

    def log(msg, pb=True):
        prog_bars.clear()
        print(msg)
        if pb:
            prog_bars.render()
        sys.stdout.flush()

    last_message_time = time.time()
    while True:
        if active_tasks.empty:
            log("Done!", False)
            break

        no_message = True

        try:
            # Messages from the search process
            message = search_qi.get(block=False)
            last_message_time = time.time()
            no_message = False

            if message[0] == MSG_PAYLOAD:
                # Download URL found => start a download
                download_qo.put(message)
                log("  Found app for %s." % message[1][0])

            elif message[0] == MSG_ERROR:
                # Error with search query
                log("!!" + message[1])
                active_tasks.dec()

                # Search for another app
                new_search_query()
        except EmptyQueueException:
            pass

        try:
            # Messages from the download processes
            message = download_qi.get(block=False)
            last_message_time = time.time()
            no_message = False

            if message[0] == MSG_PAYLOAD or message[0] == MSG_END:
                # Download finished
                id_, package_name, app_name, size, location = message[1]
                prog_bars[id_] = float("NaN")

                if message[0] == MSG_PAYLOAD:
                    log("  Finished downloading %s." % package_name)
                elif message[0] == MSG_END:
                    log("  File already downloaded for %s." % package_name)

                # Add row to CSV file
                with open(OUTPUT_CSV, "a") as csv:
                    csv.write(",".join([
                        '"%s"' % app_name.replace('"', '""'),
                        '"%s"' % package_name.replace('"', '""'),
                        "%d" % size,
                        '"%s"' % location.replace('"', '""')]))
                    csv.write("\n")

                active_tasks.dec()

                # Search for another app
                new_search_query()

            elif message[0] == MSG_START:
                # Download started
                id_, package_name = message[1]
                prog_bars[id_] = 0.0
                log("  Started downloading %s." % package_name)

            elif message[0] == MSG_PROGRESS:
                # Download progress
                id_, progress = message[1]
                prog_bars[id_] = progress
                prog_bars.render()

            elif message[0] == MSG_ERROR:
                # Error during download
                id_, msg = message[1]
                log("!!" + msg)
                prog_bars[id_] = 0.0

                active_tasks.dec()

                # Search for another app
                new_search_query()
        except EmptyQueueException:
            pass

        if no_message:
            if time.time() - last_message_time > PROCESS_TIMEOUT:
                log("!!Timed out after %.2f seconds." % (PROCESS_TIMEOUT), False)
                break
            time.sleep(PROGRESS_UPDATE_DELAY / 2.0)

    # End processes
    search_qo.put((MSG_END, ))
    for _ in range(CONCURRENT_DOWNLOADS):
        download_qo.put((MSG_END, ))

    search_proc.join()
    for download_proc in download_procs:
        download_proc.join()

    return 0


if __name__ == '__main__':
    sys.exit(main())

One feature I added was skipping the download when the file already exists locally with the same size. That way, files don't get re-downloaded on every run.

I'm using processes instead of threads to avoid having Python's Global Interpreter Lock serialize the execution. The main process creates 5 child processes: 1 for searching the download URLs and 4 for downloading the files concurrently.

The search process only queries the website at the same rate at which the download processes are going through the downloads. That way, the website doesn't get bombarded by a ton of queries in a short amount of time. It doesn't need to be any faster anyway.

The download processes each download a file concurrently on their own. One thing to consider is that this will likely increase fragmentation on the file system, and, if using a hard disk drive, increase seek time.

I only did some limited testing, but with a list of 10 entries I got a speedup of about 25% compared to my earlier version. This isn't much, but it's expected: all the concurrent downloads allow for is a more efficient use of the available connection speed, notably when one download is slower than the others. The only way it could be 4 times as fast would be if the files were stored on different servers and each was capped at less than a fourth of the connection speed.

Tested on Windows and Debian, using Python 3.6 and 2.7.

Edit: Added progress bars. Replaced the Message enum with constants. Added some more error handling. Made the main process only sleep if there were no messages from the other processes (which speeds things up a bit). Added a timeout in case processing stops for some reason. Code cleanup.
