Python, parallelize (multi-thread) a simple loop

I am using PyCharm 2016.3.2 with Python 3.6 as the interpreter to convert PDF files to .txt. The code I have (see below) works, but it converts the files sequentially and slowly. I wonder whether I can take advantage of my 8-core CPU to parallelize the operation and make it faster. Here is the code:

from tika import parser
from os import listdir
for filename in listdir("C:\\Dropbox\\Data"):
    text = parser.from_file('C:\\Dropbox\\Data\\'+filename)
    with open('C:\\Dropbox\\Data\\textoutput\\'+filename+'.txt', 'w+') as outfile:
        outfile.write(text["content"])

I'm dealing with hundreds of thousands (>100,000) of PDF files (65 GB+), so I believe multi-threading this operation will go a long way.

awarded to iurisilvio


2 Solutions


You will need the joblib library to do this: https://pythonhosted.org/joblib/parallel.html

Your code will then be something like this (I cannot test it at the moment):

from os import listdir
import multiprocessing

from joblib import Parallel, delayed
from tika import parser


def processFile(f):
    # Convert one PDF to text and write the result into the textoutput folder.
    text = parser.from_file('C:\\Dropbox\\Data\\' + f)
    with open('C:\\Dropbox\\Data\\textoutput\\' + f + '.txt', 'w+') as outfile:
        outfile.write(text["content"])


# The guard is needed on systems without fork (such as Windows).
if __name__ == '__main__':
    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(processFile)(f) for f in listdir("C:\\Dropbox\\Data"))
I got the following error:

ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information

I tried to insert __name__ == '__main__' just before the last two lines (num_cores and results), but then I get different errors, mostly in reference to tika.
tranzsport 3 months ago
Winning solution

A solution using only the built-in multiprocessing library.

from os import listdir
from multiprocessing import Pool

from tika import parser


def process_file(filename):
    in_filename = 'C:\\Dropbox\\Data\\'+filename
    out_filename = 'C:\\Dropbox\\Data\\textoutput\\'+filename+'.txt'

    text = parser.from_file(in_filename)
    with open(out_filename, 'w+') as outfile:
        outfile.write(text["content"])

if __name__ == '__main__':
    pool = Pool()
    pool.map(process_file, listdir("C:\\Dropbox\\Data"))
Is this compatible with Python 3.6 (using PyCharm)? I ran this code and got the following error:

Traceback (most recent call last):
  File "C:/Dropbox/Data/Edgar Parser Parallel/PDFtoTEXT.py", line 33, in <module>
    pool.map(listdir("C:\\Dropbox\\Data"), process_file)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 343, in _map_async
    iterable = list(iterable)
TypeError: 'function' object is not iterable
tranzsport 3 months ago
I inverted the map arguments. :( I have updated my answer now. It should work with Python 3.6.
iurisilvio 3 months ago
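
For anyone hitting the same TypeError, here is a minimal sketch of the argument order. Pool.map takes the function first and the iterable second, like the built-in map(). The name_length function, the lengths_in_parallel helper, and the sample file names are hypothetical stand-ins for process_file and the real directory listing, used so the sketch runs without tika.

```python
from multiprocessing import Pool


def name_length(filename):
    # Hypothetical stand-in for process_file: any module-level function works.
    return len(filename)


def lengths_in_parallel(filenames, processes=2):
    # Function first, iterable second: the same order as the built-in map().
    with Pool(processes) as pool:
        return pool.map(name_length, filenames)


if __name__ == '__main__':
    # Swapping the two arguments reproduces the error from the traceback above:
    # TypeError: 'function' object is not iterable
    print(lengths_in_parallel(['a.pdf', 'report.pdf']))
```

Pool.map returns the results in the same order as the input, so each result lines up with its file name.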
I kept getting repeated errors related to calling the multiprocessing procedure. However, I looked at some examples in https://docs.python.org/2/library/multiprocessing.html and added

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(process_file, listdir("C:\\Dropbox\\Data"))

and now it works like a charm! I think the 2 inside the parentheses refers to the number of cores? Adding the __name__ == '__main__' check and passing 2 inside the parentheses did the trick, and I can already see a major improvement, so thanks!
tranzsport 3 months ago
Great! 2 is the number of parallel processes. The "if main" check is necessary to prevent child processes from launching other processes. I'll update my answer, just to be correct.
iurisilvio 3 months ago
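
To illustrate the two points above (the pool size and the "if main" check), here is a rough sketch rather than a definitive recipe. The helper pick_worker_count and the toy function work are hypothetical names introduced only for this example. As far as I know, Pool() with no argument defaults to one worker per CPU reported by the operating system, so passing a number is only needed when you want fewer workers.

```python
import os
from multiprocessing import Pool


def work(n):
    # Toy task standing in for the PDF conversion.
    return n * n


def pick_worker_count(requested=None):
    # Hypothetical helper: cap the requested pool size at the CPU count
    # reported by the OS, and never go below one worker.
    available = os.cpu_count() or 1
    if requested is None:
        return available
    return max(1, min(requested, available))


# On Windows each worker starts by importing this script, so the pool must be
# created under the guard; otherwise every worker would try to start its own pool.
if __name__ == '__main__':
    with Pool(pick_worker_count(2)) as pool:
        print(pool.map(work, range(5)))
```

With an 8-core CPU, pick_worker_count(8) or a plain Pool() would use all cores, at the cost of leaving less CPU time for anything else running on the machine.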