CyteBode's solution to "Mass download list of APKs by Package Names"

Here's an updated version of my solution, with concurrent downloads:

import enum
from multiprocessing import Process, Queue
import os
import os.path
import re
import time

try:
    # Python 3
    from queue import Empty as EmptyQueueException
except ImportError:
    # Python 2
    from Queue import Empty as EmptyQueueException

from bs4 import BeautifulSoup
import requests


DOMAIN = "https://apkpure.com"
SEARCH_URL = DOMAIN + "/search?q=%s"

DOWNLOAD_DIR = "./downloaded/"
PACKAGE_NAMES_FILE = "package_names.txt"
OUTPUT_CSV = "output.csv"


CONCURRENT_DOWNLOADS = 4
PROCESS_TIMEOUT = 10.0


class Message(enum.Enum):
    error = -1
    payload = 0
    start = 1
    end = 2


def download_process(qi, qo):
    while True:
        message = qi.get()

        if message[0] == Message.payload:
            package_name, app_name, download_url = message[1]
        elif message[0] == Message.end:
            break

        # Streamed GET: the headers arrive first, giving the filename and size
        # before any of the body is read
        r = requests.get(download_url, stream=True)

        if r.status_code != 200:
            qo.put((Message.error, "HTTP Error %d" % r.status_code))
            r.close()
            continue

        content_disposition = r.headers.get("content-disposition", "")
        content_length = int(r.headers.get("content-length", 0))

        filename = re.search(r'filename="(.*)"', content_disposition)
        if filename and filename.groups():
            filename = filename.groups()[0]
        else:
            filename = "%s.apk" % (package_name.replace(".", "_"))

        local_path = os.path.normpath(os.path.join(DOWNLOAD_DIR, filename))

        if os.path.exists(local_path):
            if not os.path.isfile(local_path):
                # Not a file
                qo.put((Message.error, "%s is a directory." % local_path))
                r.close()
                continue
            if os.path.getsize(local_path) == content_length:
                # File has likely already been downloaded
                qo.put((Message.end, (package_name, app_name, content_length,
                                      local_path)))
                r.close()
                continue

        qo.put((Message.start, package_name))

        size = 0
        with open(local_path, "wb+") as f:
            for chunk in r.iter_content(chunk_size=65536):
                if chunk:
                    size += len(chunk)
                    f.write(chunk)

        qo.put((Message.payload, (package_name, app_name, size, local_path)))


def search_process(qi, qo):
    while True:
        message = qi.get()

        if message[0] == Message.payload:
            package_name = message[1]
        elif message[0] == Message.end:
            break

        # Search page
        url = SEARCH_URL % package_name
        r = requests.get(url)

        if r.status_code != 200:
            qo.put((Message.error, "Could not get search page for %s." % package_name))
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        first_result = soup.find("dl", class_="search-dl")
        if first_result is None:
            qo.put((Message.error, "Could not find %s." % package_name))
            continue

        search_title = first_result.find("p", class_="search-title")
        search_title_a = search_title.find("a")

        app_name = search_title.text.strip()
        app_url = search_title_a.attrs["href"]


        # App page
        url = DOMAIN + app_url
        r = requests.get(url)

        if r.status_code != 200:
            qo.put((Message.error, "Could not get app page for %s." % package_name))
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        download_button = soup.find("a", class_=" da")

        if download_button is None:
            qo.put((Message.error, "%s is a paid app. Could not download." % package_name))
            continue

        download_url = download_button.attrs["href"]


        # Download app page
        url = DOMAIN + download_url
        r = requests.get(url)

        if r.status_code != 200:
            qo.put((Message.error, "Could not get app download page for %s." % package_name))
            continue

        soup = BeautifulSoup(r.text, "html.parser")

        download_link = soup.find("a", id="download_link")
        download_apk_url = download_link.attrs["href"]

        qo.put((Message.payload, (package_name, app_name, download_apk_url)))


def main():
    # Create the download directory
    if not os.path.exists(DOWNLOAD_DIR):
        os.makedirs(DOWNLOAD_DIR)
    elif not os.path.isdir(DOWNLOAD_DIR):
        print("%s is not a directory." % DOWNLOAD_DIR)
        return


    # Read the package names
    if not os.path.isfile(PACKAGE_NAMES_FILE):
        print("Could not find %s." % PACKAGE_NAMES_FILE)
        return
    with open(PACKAGE_NAMES_FILE, "r") as f:
        package_names = [line.strip() for line in f.readlines()]


    # CSV file header
    with open(OUTPUT_CSV, "w+") as csv:
        csv.write("App name,Package name,Size,Location\n")


    # Message-passing queues
    search_qi = Queue()
    search_qo = Queue()

    download_qi = Queue()
    download_qo = Queue()


    # Search Process
    search_proc = Process(target=search_process, args=(search_qo, search_qi))
    search_proc.start()


    # Download Processes
    download_procs = []
    for _ in range(CONCURRENT_DOWNLOADS):
        download_proc = Process(target=download_process,
                                args=(download_qo, download_qi))
        download_procs.append(download_proc)
        download_proc.start()


    iter_package_names = iter(package_names)
    active_tasks = 0

    # Send some queries to the search process
    for _ in range(CONCURRENT_DOWNLOADS + 1):
        try:
            package_name = next(iter_package_names)
            search_qo.put((Message.payload, package_name))
            active_tasks += 1
        except StopIteration:
            break


    while True:
        if active_tasks == 0:
            print("Done!")
            break

        try:
            # Messages from the search process
            message = search_qi.get(block=False)

            if message[0] == Message.payload:
                # Download URL found => Start a download
                download_qo.put(message)
                print("  Found app for %s." % message[1][0])
            elif message[0] == Message.error:
                # Error with search query
                print("!!" + message[1])
                active_tasks -= 1

                # Search for another app
                try:
                    package_name = next(iter_package_names)
                    search_qo.put((Message.payload, package_name))
                    active_tasks += 1
                except StopIteration:
                    pass
        except EmptyQueueException:
            pass

        try:
            # Messages from the download processes
            message = download_qi.get(block=False)

            if message[0] == Message.payload or message[0] == Message.end:
                # Download done
                package_name, app_name, size, location = message[1]

                if message[0] == Message.payload:
                    print("  Finished downloading %s." % package_name)
                elif message[0] == Message.end:
                    print("  File already downloaded for %s." % package_name)

                # Add row to CSV file
                with open(OUTPUT_CSV, "a") as csv:
                    csv.write(",".join([
                        '"%s"' % app_name.replace('"', '""'),
                        '"%s"' % package_name.replace('"', '""'),
                        "%d" % size,
                        '"%s"' % location.replace('"', '""')]))
                    csv.write("\n")

                active_tasks -= 1

                # Search for another app
                try:
                    package_name = next(iter_package_names)
                    search_qo.put((Message.payload, package_name))
                    active_tasks += 1
                except StopIteration:
                    pass

            elif message[0] == Message.start:
                # Download started
                print("  Started downloading %s." % message[1])
            elif message[0] == Message.error:
                # Error during download
                print("!!" + message[1])
                active_tasks -= 1
        except EmptyQueueException:
            pass

        time.sleep(1.0)

    # End processes
    search_qo.put((Message.end, ))
    for _ in range(CONCURRENT_DOWNLOADS):
        download_qo.put((Message.end, ))

    search_proc.join()
    for download_proc in download_procs:
        download_proc.join()


if __name__ == '__main__':
    main()

One feature I added is skipping the download when the file already exists locally with the same size. That way, files don't get re-downloaded on every run.
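The check boils down to comparing the local file's size against the server's Content-Length header. A minimal sketch of that logic, pulled out into a standalone helper (`should_skip` is a hypothetical name, not something the script defines):

```python
import os


def should_skip(local_path, content_length):
    """Return True when a file of the expected size already exists.

    A size match is taken as "already downloaded" -- a heuristic, since
    two different files can share a size, but cheap and usually right.
    """
    return (os.path.isfile(local_path)
            and os.path.getsize(local_path) == content_length)
```

With `stream=True`, requests fetches only the headers up front, so this check costs no body transfer when it hits.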

I'm using processes instead of threads to keep Python's Global Interpreter Lock from serializing the execution. The main process creates 5 processes: 1 for finding the download URLs and 4 for downloading the files concurrently.

The search process only queries the website at the same rate at which the download processes work through the downloads. That way, the website doesn't get bombarded with a burst of queries in a short amount of time. It doesn't need to be any faster anyway.
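This throttling falls out of the bookkeeping in `main()`: a new search is only issued when a previous task errors out or finishes downloading, so at most `CONCURRENT_DOWNLOADS + 1` tasks are ever in flight. A toy model of that invariant (purely illustrative; `run` and its internals are not from the script):

```python
CONCURRENT_DOWNLOADS = 4


def run(package_names):
    """Simulate the scheduling and return the peak number of in-flight tasks."""
    pending = list(package_names)
    in_flight = []
    peak = 0
    # Prime the pipeline, as the script does before its main loop
    while pending and len(in_flight) < CONCURRENT_DOWNLOADS + 1:
        in_flight.append(pending.pop(0))
    while in_flight:
        peak = max(peak, len(in_flight))
        in_flight.pop(0)   # a download finished or errored...
        if pending:        # ...so exactly one more search may start
            in_flight.append(pending.pop(0))
    return peak

print(run(["pkg%d" % i for i in range(20)]))  # → 5
```

However long the input list, the site never sees more than five outstanding requests at once.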

The download processes each download a file concurrently on their own. One thing to consider is that this will likely increase fragmentation on the file system, and on a hard disk drive it will increase seek time.

The main process orchestrates everything and prints logging events. Due to the asynchronous nature of the downloads, events may appear out of order (e.g. "Started downloading B" followed by "Finished downloading A"), and since progress isn't printed, it can look like the script has hung.

I only did some limited testing, but with a list of 10 entries I got a speedup of about 10% compared to my earlier version. That isn't much, but it's expected: all the concurrent downloads allow is a more efficient use of the available connection speed, notably when one download is slower than the others. The only way it could be 4 times as fast would be if the files were stored on different servers and each were capped at less than a fourth of the connection speed.
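A back-of-envelope model makes the limit concrete. All numbers below are made up for illustration: a 100 Mbit/s connection, four 400 Mbit files, one of them served by a server capped at 20 Mbit/s.

```python
C = 100.0                            # total connection speed, Mbit/s
caps = [100.0, 100.0, 100.0, 20.0]   # per-server caps; one slow server
sizes = [400.0] * 4                  # file sizes in Mbit

# Sequential: each file runs alone at min(server cap, connection speed)
seq = sum(s / min(cap, C) for s, cap in zip(sizes, caps))

# Concurrent (idealized): limited either by total bandwidth or by the
# single slowest transfer, whichever dominates
conc = max(sum(sizes) / C, max(s / cap for s, cap in zip(sizes, caps)))

print(seq, conc)  # 32.0 vs 20.0 seconds: better, but nowhere near 4x
```

Even with one badly capped server, the speedup here is only 1.6x; with all servers fast, the shared connection is the bottleneck and concurrency gains almost nothing.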

The code is more complex and quite a bit more fragile. Given that it only speeds things up a little, I don't really recommend using it over my earlier version.

Tested on Windows and Debian, using Python 3.6 and 2.7.

Here's an updated version of my solution, with concurrent downloads: import enum from multiprocessing import Process, Queue import os import os.path import re import time try: # Python 3 from queue import Empty as EmptyQueueException except ImportError: # Python 2 from Queue import Empty as EmptyQueueException from bs4 import BeautifulSoup import requests DOMAIN = "https://apkpure.com" SEARCH_URL = DOMAIN + "/search?q=%s" DOWNLOAD_DIR = "./downloaded/" PACKAGE_NAMES_FILE = "package_names.txt" OUTPUT_CSV = "output.csv" CONCURRENT_DOWNLOADS = 4 PROCESS_TIMEOUT = 10.0 class Message(enum.Enum): error = -1 payload = 0 start = 1 end = 2 def download_process(qi, qo): while True: message = qi.get() if message[0] == Message.payload: package_name, app_name, download_url = message[1] elif message[0] == Message.end: break # Head request for filename and size r = requests.get(download_url, stream=True) if r.status_code != 200: qo.put((Message.error, "HTTP Error %d" % r.status_code)) r.close() continue r = requests.get(download_url, stream=True) if r.status_code != 200: qo.put((Message.error, "HTTP Error %d" % r.status_code)) r.close() continue content_disposition = r.headers.get("content-disposition", "") content_length = int(r.headers.get('content-length', 0)) filename = re.search(r'filename="(.*)"', content_disposition) if filename and filename.groups(): filename = filename.groups()[0] else: filename = "%s.apk" % (package_name.replace(".", "_")) local_path = os.path.normpath(os.path.join(DOWNLOAD_DIR, filename)) if os.path.exists(local_path): if not os.path.isfile(local_path): # Not a file qo.put((Message.error, "%s is a directory." 
% local_path)) r.close() continue if os.path.getsize(local_path) == content_length: # File has likely already been downloaded qo.put((Message.end, (package_name, app_name, content_length, local_path))) r.close() continue qo.put((Message.start, package_name)) size = 0 with open(local_path, "wb+") as f: for chunk in r.iter_content(chunk_size=65536): if chunk: size += len(chunk) f.write(chunk) qo.put((Message.payload, (package_name, app_name, size, local_path))) def search_process(qi, qo): while True: message = qi.get() if message[0] == Message.payload: package_name = message[1] elif message[0] == Message.end: break # Search page url = SEARCH_URL % package_name r = requests.get(url) if r.status_code != 200: qo.put((Message.error, "Could not get search page for %s." % package_name)) continue soup = BeautifulSoup(r.text, "html.parser") first_result = soup.find("dl", class_="search-dl") if first_result is None: qo.put((Message.error, "Could not find %s." % package_name)) continue search_title = first_result.find("p", class_="search-title") search_title_a = search_title.find("a") app_name = search_title.text.strip() app_url = search_title_a.attrs["href"] # App page url = DOMAIN + app_url r = requests.get(url) if r.status_code != 200: qo.put((Message.error, "Could not get app page for %s." % package_name)) continue soup = BeautifulSoup(r.text, "html.parser") download_button = soup.find("a", class_=" da") if download_button is None: qo.put((Message.error, "%s is a paid app. Could not download." % package_name)) continue download_url = download_button.attrs["href"] # Download app page url = DOMAIN + download_url r = requests.get(url) if r.status_code != 200: qo.put((Message.error, "Could not get app download page for %s." 
% package_name)) continue soup = BeautifulSoup(r.text, "html.parser") download_link = soup.find("a", id="download_link") download_apk_url = download_link.attrs["href"] qo.put((Message.payload, (package_name, app_name, download_apk_url))) def main(): # Create the download directory if not os.path.exists(DOWNLOAD_DIR): os.makedirs(DOWNLOAD_DIR) elif not os.path.isdir(DOWNLOAD_DIR): print("%s is not a directory." % DOWNLOAD_DIR) return # Read the package names if not os.path.isfile(PACKAGE_NAMES_FILE): print("Could not find %s." % PACKAGE_NAMES_FILE) return with open(PACKAGE_NAMES_FILE, "r") as f: package_names = [line.strip() for line in f.readlines()] # CSV file header with open(OUTPUT_CSV, "w+") as csv: csv.write("App name,Package name,Size,Location\n") # Message-passing queues search_qi = Queue() search_qo = Queue() download_qi = Queue() download_qo = Queue() # Search Process search_proc = Process(target=search_process, args=(search_qo, search_qi)) search_proc.start() # Download Processes download_procs = [] for _ in range(CONCURRENT_DOWNLOADS): download_proc = Process(target=download_process, args=(download_qo, download_qi)) download_procs.append(download_proc) download_proc.start() iter_package_names = iter(package_names) active_tasks = 0 # Send some queries to the search process for _ in range(CONCURRENT_DOWNLOADS + 1): try: package_name = next(iter_package_names) search_qo.put((Message.payload, package_name)) active_tasks += 1 except StopIteration: break while True: if active_tasks == 0: print("Done!") break try: # Messages from the search process message = search_qi.get(block=False) if message[0] == Message.payload: # Donwload URL found => Start a download download_qo.put(message) print(" Found app for %s." % message[1][0]) elif message[0] == Message.error: # Error with search query print("!!" 
+ message[1]) active_tasks -= 1 # Search for another app try: package_name = next(iter_package_names) search_qo.put((Message.payload, package_name)) active_tasks += 1 except StopIteration: pass except EmptyQueueException: pass try: # Messages from the download processes message = download_qi.get(block=False) if message[0] == Message.payload or message[0] == Message.end: # Download done package_name, app_name, size, location = message[1] if message[0] == Message.payload: print(" Finished downloading %s." % package_name) elif message[0] == Message.end: print(" File already downloaded for %s." % package_name) # Add row to CSV file with open(OUTPUT_CSV, "a") as csv: csv.write(",".join([ '"%s"' % app_name.replace('"', '""'), '"%s"' % package_name.replace('"', '""'), "%d" % size, '"%s"' % location.replace('"', '""')])) csv.write("\n") active_tasks -= 1 # Search for another app try: package_name = next(iter_package_names) search_qo.put((Message.payload, package_name)) active_tasks += 1 except StopIteration: pass elif message[0] == Message.start: # Download started print(" Started downloading %s." % message[1]) elif message[0] == Message.error: # Error during download print("!!" + message[1]) active_tasks -= 1 except EmptyQueueException: pass time.sleep(1.0) # End processes search_qo.put((Message.end, )) for _ in range(CONCURRENT_DOWNLOADS): download_qo.put((Message.end, )) search_proc.join() for download_proc in download_procs: download_proc.join() if __name__ == '__main__': main() One feature I added was for the downloading to be skipped in case the file already exists locally and it has the same size. That way, every file doesn't get re-downloaded every time. I'm using processes instead of threads to avoid having Python's Global Interpreter Lock serializing the execution. The main process creates 5 processes, 1 for searching the download URLs and 4 for downloading the files concurrently. 
The search process only queries the website at the same rate at which the download processes are going through the downloads. That way, the website doesn't get bombarded with a ton of queries in a short amount of time; it doesn't need to be any faster anyway. The download processes each concurrently download a file on their own. One thing to consider is that this will likely increase fragmentation on the file system, and if using a hard disk drive, it will increase seek time.

The main process orchestrates everything and prints logging events. Due to the asynchronous nature of the downloads, events may appear out of order, and progress isn't printed, which can make it look like the script has hung (e.g. "Started downloading B" followed by "Finished downloading A").

I only did some limited testing, but with a list of 10 entries, I got a speedup of about 10% compared to my earlier version. This isn't much, but it's expected: all the concurrent downloads allow is a more efficient use of the available connection speed, notably when one download is slower than the others. The only way it could be 4 times as fast would be if the files were stored on different servers and each were capped at less than a fourth of the connection speed.

The code is more complex, and quite a bit more fragile. Given it speeds things up only a little, I don't really recommend using it over my earlier version. Tested on Windows and Debian, using Python 3.6 and 2.7.
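The skip-if-already-downloaded check can be isolated into a small helper. Here's a minimal sketch of the same size-comparison heuristic (the helper name is mine, not from the script):

```python
import os

def needs_download(local_path, content_length):
    # Skip the download only when a local copy already exists and its
    # size matches the server-reported Content-Length, mirroring the
    # check in download_process. Same size => assume same file.
    if os.path.isfile(local_path) and \
            os.path.getsize(local_path) == content_length:
        return False
    return True
```

This is only a heuristic: two different files can have the same length, so a stronger check would compare a hash, but the server would have to advertise one.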
Here's an updated version of my solution, with concurrent downloads:

```python
import math
from multiprocessing import Process, Queue
import os
import os.path
import re
import sys
import time

try:
    # Python 3
    from queue import Empty as EmptyQueueException
    from queue import Full as FullQueueException
except ImportError:
    # Python 2
    from Queue import Empty as EmptyQueueException
    from Queue import Full as FullQueueException

from bs4 import BeautifulSoup
import requests


DOMAIN = "https://apkpure.com"
SEARCH_URL = DOMAIN + "/search?q=%s"

DOWNLOAD_DIR = "./downloaded/"
PACKAGE_NAMES_FILE = "package_names.txt"
OUTPUT_CSV = "output.csv"


CONCURRENT_DOWNLOADS = 4
CHUNK_SIZE = 128*1024  # 128 KiB
PROGRESS_UPDATE_DELAY = 0.25
PROCESS_TIMEOUT = 10.0


MSG_ERROR = -1
MSG_PAYLOAD = 0
MSG_START = 1
MSG_PROGRESS = 2
MSG_END = 3


class SplitProgBar(object):
    @staticmethod
    def center(text, base):
        if len(text) <= len(base):
            left = (len(base) - len(text)) // 2
            return "%s%s%s" % (base[:left], text, base[left+len(text):])
        else:
            return base

    def __init__(self, n, width):
        self.n = n
        self.sub_width = int(float(width-(n+1))/n)
        self.width = n * (self.sub_width + 1) + 1
        self.progress = [float("NaN")] * n

    def __getitem__(self, ix):
        return self.progress[ix]

    def __setitem__(self, ix, value):
        self.progress[ix] = value

    def render(self):
        bars = []
        for prog in self.progress:
            if math.isnan(prog) or prog < 0.0:
                bars.append(" " * self.sub_width)
                continue
            bar = "=" * int(round(prog*self.sub_width))
            bar += " " * (self.sub_width-len(bar))
            bar = SplitProgBar.center(" %.2f%% " % (prog*100), bar)
            bars.append(bar)
        new_str = "|%s|" % "|".join(bars)
        sys.stdout.write("\r%s" % new_str)

    def clear(self):
        sys.stdout.write("\r%s\r" % (" " * self.width))


class Counter(object):
    def __init__(self, value=0):
        self.value = value

    def inc(self, n=1):
        self.value += n

    def dec(self, n=1):
        self.value -= n

    @property
    def empty(self):
        return self.value == 0


def download_process(id_, qi, qo):
    def send_progress(progress):
        try:
            qo.put_nowait((MSG_PROGRESS, (id_, progress)))
        except FullQueueException:
            pass

    def send_error(msg):
        qo.put((MSG_ERROR, (id_, msg)))

    def send_start(pkg_name):
        qo.put((MSG_START, (id_, pkg_name)))

    def send_finished(pkg_name, app_name, size, path, already=False):
        if already:
            qo.put((MSG_END, (id_, pkg_name, app_name, size, path)))
        else:
            qo.put((MSG_PAYLOAD, (id_, pkg_name, app_name, size, path)))

    while True:
        message = qi.get()
        if message[0] == MSG_PAYLOAD:
            package_name, app_name, download_url = message[1]
        elif message[0] == MSG_END:
            break

        try:
            r = requests.get(download_url, stream=True)
        except requests.exceptions.ConnectionError:
            send_error("Connection error")
            continue

        if r.status_code != 200:
            send_error("HTTP Error %d" % r.status_code)
            r.close()
            continue

        content_disposition = r.headers.get("content-disposition", "")
        content_length = int(r.headers.get('content-length', 0))

        filename = re.search(r'filename="(.+)"', content_disposition)
        if filename and filename.groups():
            filename = filename.groups()[0]
        else:
            filename = "%s.apk" % (package_name.replace(".", "_"))

        local_path = os.path.normpath(os.path.join(DOWNLOAD_DIR, filename))

        if os.path.exists(local_path):
            if not os.path.isfile(local_path):
                # Not a file
                send_error("%s is a directory." % local_path)
                r.close()
                continue
            if os.path.getsize(local_path) == content_length:
                # File has likely already been downloaded
                send_finished(
                    package_name, app_name, content_length, local_path, True)
                r.close()
                continue

        send_start(package_name)

        size = 0
        t = time.time()
        with open(local_path, "wb+") as f:
            for chunk in r.iter_content(chunk_size=CHUNK_SIZE):
                if chunk:
                    size += len(chunk)
                    f.write(chunk)
                    nt = time.time()
                    if nt - t >= PROGRESS_UPDATE_DELAY:
                        send_progress(float(size) / content_length)
                        t = nt

        send_finished(package_name, app_name, size, local_path)


def search_process(qi, qo):
    def send_error(msg):
        qo.put((MSG_ERROR, msg))

    def send_payload(pkg_name, app_name, dl_url):
        qo.put((MSG_PAYLOAD, (pkg_name, app_name, dl_url)))

    while True:
        message = qi.get()
        if message[0] == MSG_PAYLOAD:
            package_name = message[1]
        elif message[0] == MSG_END:
            break

        # Search page
        url = SEARCH_URL % package_name
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            send_error("Connection error.")
            continue
        if r.status_code != 200:
            send_error("Could not get search page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        first_result = soup.find("dl", class_="search-dl")
        if first_result is None:
            send_error("Could not find %s." % package_name)
            continue
        search_title = first_result.find("p", class_="search-title")
        search_title_a = search_title.find("a")
        app_name = search_title.text.strip()
        app_url = search_title_a.attrs["href"]

        # App page
        url = DOMAIN + app_url
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            send_error("Connection error.")
            continue
        if r.status_code != 200:
            send_error("Could not get app page for %s." % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        download_button = soup.find("a", class_=" da")
        if download_button is None:
            send_error("%s is a paid app. Could not download." % package_name)
            continue
        download_url = download_button.attrs["href"]

        # Download app page
        url = DOMAIN + download_url
        try:
            r = requests.get(url)
        except requests.exceptions.ConnectionError:
            send_error("Connection error.")
            continue
        if r.status_code != 200:
            send_error("Could not get app download page for %s."
                       % package_name)
            continue

        soup = BeautifulSoup(r.text, "html.parser")
        download_link = soup.find("a", id="download_link")
        download_apk_url = download_link.attrs["href"]

        send_payload(package_name, app_name, download_apk_url)


def main():
    # Create the download directory
    if not os.path.exists(DOWNLOAD_DIR):
        os.makedirs(DOWNLOAD_DIR)
    elif not os.path.isdir(DOWNLOAD_DIR):
        print("%s is not a directory." % DOWNLOAD_DIR)
        return -1

    # Read the package names
    if not os.path.isfile(PACKAGE_NAMES_FILE):
        print("Could not find %s." % PACKAGE_NAMES_FILE)
        return -1
    with open(PACKAGE_NAMES_FILE, "r") as f:
        package_names = [line.strip() for line in f.readlines()]

    # CSV file header
    with open(OUTPUT_CSV, "w+") as csv:
        csv.write("App name,Package name,Size,Location\n")

    # Message-passing queues
    search_qi = Queue()
    search_qo = Queue()
    download_qi = Queue()
    download_qo = Queue()

    # Search process
    search_proc = Process(target=search_process, args=(search_qo, search_qi))
    search_proc.start()

    # Download processes
    download_procs = []
    for i in range(CONCURRENT_DOWNLOADS):
        download_proc = Process(target=download_process,
                                args=(i, download_qo, download_qi))
        download_procs.append(download_proc)
        download_proc.start()

    active_tasks = Counter()

    def new_search_query():
        if package_names:
            search_qo.put((MSG_PAYLOAD, package_names.pop(0)))
            active_tasks.inc()
            return True
        return False

    # Send some queries to the search process
    for _ in range(CONCURRENT_DOWNLOADS + 1):
        new_search_query()

    prog_bars = SplitProgBar(CONCURRENT_DOWNLOADS, 80)

    def log(msg, pb=True):
        prog_bars.clear()
        print(msg)
        if pb:
            prog_bars.render()
        sys.stdout.flush()

    last_message_time = time.time()
    while True:
        if active_tasks.empty:
            log("Done!", False)
            break

        no_message = True

        try:
            # Messages from the search process
            message = search_qi.get(block=False)
            last_message_time = time.time()
            no_message = False
            if message[0] == MSG_PAYLOAD:
                # Download URL found => Start a download
                download_qo.put(message)
                log(" Found app for %s." % message[1][0])
            elif message[0] == MSG_ERROR:
                # Error with search query
                log("!!" + message[1])
                active_tasks.dec()
                # Search for another app
                new_search_query()
        except EmptyQueueException:
            pass

        try:
            # Messages from the download processes
            message = download_qi.get(block=False)
            last_message_time = time.time()
            no_message = False
            if message[0] == MSG_PAYLOAD or message[0] == MSG_END:
                # Download finished
                id_, package_name, app_name, size, location = message[1]
                prog_bars[id_] = float("NaN")
                if message[0] == MSG_PAYLOAD:
                    log(" Finished downloading %s." % package_name)
                elif message[0] == MSG_END:
                    log(" File already downloaded for %s." % package_name)
                # Add row to CSV file
                with open(OUTPUT_CSV, "a") as csv:
                    csv.write(",".join([
                        '"%s"' % app_name.replace('"', '""'),
                        '"%s"' % package_name.replace('"', '""'),
                        "%d" % size,
                        '"%s"' % location.replace('"', '""')]))
                    csv.write("\n")
                active_tasks.dec()
                # Search for another app
                new_search_query()
            elif message[0] == MSG_START:
                # Download started
                id_, package_name = message[1]
                prog_bars[id_] = 0.0
                log(" Started downloading %s." % package_name)
            elif message[0] == MSG_PROGRESS:
                # Download progress
                id_, progress = message[1]
                prog_bars[id_] = progress
                prog_bars.render()
            elif message[0] == MSG_ERROR:
                # Error during download
                id_, msg = message[1]
                log("!!" + msg)
                prog_bars[id_] = 0.0
                active_tasks.dec()
                # Search for another app
                new_search_query()
        except EmptyQueueException:
            pass

        if no_message:
            if time.time() - last_message_time > PROCESS_TIMEOUT:
                log("!!Timed out after %.2f seconds." % (PROCESS_TIMEOUT),
                    False)
                break
            time.sleep(PROGRESS_UPDATE_DELAY / 2.0)

    # End processes
    search_qo.put((MSG_END, ))
    for _ in range(CONCURRENT_DOWNLOADS):
        download_qo.put((MSG_END, ))
    search_proc.join()
    for download_proc in download_procs:
        download_proc.join()

    return 0


if __name__ == '__main__':
    sys.exit(main())
```

One feature I added was for the downloading to be skipped in case the file already exists locally and has the same size. That way, every file doesn't get re-downloaded on every run. I'm using processes instead of threads to avoid having Python's Global Interpreter Lock serialize the execution. The main process creates 5 processes: 1 for searching the download URLs and 4 for downloading the files concurrently.
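The coordination between the main process and the workers boils down to tagged tuples passed over a pair of `multiprocessing` queues. Here's a stripped-down sketch of that protocol, with an echo worker standing in for the real download logic (the names `worker` and `run_demo`, and the package name, are mine):

```python
from multiprocessing import Process, Queue

MSG_PAYLOAD = 0
MSG_END = 3

def worker(qi, qo):
    # Same loop shape as download_process: consume payloads until MSG_END.
    while True:
        message = qi.get()
        if message[0] == MSG_END:
            break
        qo.put((MSG_PAYLOAD, message[1].upper()))

def run_demo():
    qi, qo = Queue(), Queue()
    proc = Process(target=worker, args=(qi, qo))
    proc.start()
    qi.put((MSG_PAYLOAD, "com.example.app"))  # hypothetical package name
    qi.put((MSG_END, ))                       # sentinel ends the worker
    result = qo.get()
    proc.join()
    return result

if __name__ == "__main__":
    print(run_demo())
```

Because each message carries its tag as element 0, the receiving side can dispatch on `message[0]` without knowing the payload shape in advance, which is exactly how the main loop above multiplexes progress, errors, and results.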
The search process only queries the website at the same rate at which the download processes are going through the downloads. That way, the website doesn't get bombarded with a ton of queries in a short amount of time; it doesn't need to be any faster anyway. The download processes each concurrently download a file on their own. One thing to consider is that this will likely increase fragmentation on the file system, and if using a hard disk drive, it will increase seek time.

The main process orchestrates everything and prints logging events. Due to the asynchronous nature of the downloads, events may appear out of order, which can make it look like the script has hung (e.g. "Started downloading B" followed by "Finished downloading A").

I only did some limited testing, but with a list of 10 entries, I got a speedup of about 25% compared to my earlier version. This isn't much, but it's expected: all the concurrent downloads allow is a more efficient use of the available connection speed, notably when one download is slower than the others. The only way it could be 4 times as fast would be if the files were stored on different servers and each were capped at less than a fourth of the connection speed.

The code is more complex, and quite a bit more fragile. Given it speeds things up only a little, I don't really recommend using it over my earlier version. Tested on Windows and Debian, using Python 3.6 and 2.7.

**Edit**: Added progress bars. Replaced the `Message` Enum with constants. Added some more error handling. Made the main process only sleep if there were no messages from the other processes (which speeds things up a bit). Added a timeout in case processing stops for some reason. Code cleanup.
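The progress bars mentioned in the edit are plain text: each worker gets a fixed-width sub-bar of `=` characters with the percentage label centered on top of it. A stand-alone sketch of that rendering (function names are mine; `SplitProgBar` combines several of these per line):

```python
def center(text, base):
    # Overlay `text` in the middle of `base`; if it doesn't fit,
    # keep `base` unchanged (same behavior as SplitProgBar.center).
    if len(text) <= len(base):
        left = (len(base) - len(text)) // 2
        return "%s%s%s" % (base[:left], text, base[left + len(text):])
    return base

def render_bar(progress, width):
    # One sub-bar: '=' fill proportional to progress, with the
    # percentage label centered, delimited by pipes.
    bar = "=" * int(round(progress * width))
    bar += " " * (width - len(bar))
    return "|%s|" % center(" %.0f%% " % (progress * 100), bar)
```

For example, `render_bar(0.5, 10)` yields `|== 50%    |`. Writing the whole line with a leading `\r` and no newline, as the script does, redraws the bars in place on each update.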

User: CyteBode

Question: Mass download list of APKs by Package Names
