Python, parallelize (multi-thread) a simple loop

I am using PyCharm 2016.3.2 with Python 3.6 as the interpreter to convert PDF files to .txt. The code I have (see below) works fine, but it converts files sequentially and slowly. I wonder if I can take advantage of my 8-core CPU to parallelize the operation and make this a bit faster. Here is the code:

from tika import parser
from os import listdir
for filename in listdir("C:\\Dropbox\\Data"):
    text = parser.from_file('C:\\Dropbox\\Data\\'+filename)
    with open('C:\\Dropbox\\Data\\textoutput\\'+filename+'.txt', 'w+') as outfile:
        outfile.write(text["content"])

I'm dealing with hundreds of thousands (>100,000) of PDF files (65 GB+), so I believe multi-threading this operation will go a long way.

awarded to iurisilvio


2 Solutions


You will need the joblib library to do this: https://pythonhosted.org/joblib/parallel.html

Your code will then look something like this (I cannot test it at the moment):

from os import listdir
import multiprocessing

from joblib import Parallel, delayed
from tika import parser

# processFile is the operation to perform on each input file name.
def processFile(f):
    text = parser.from_file('C:\\Dropbox\\Data\\' + f)
    with open('C:\\Dropbox\\Data\\textoutput\\' + f + '.txt', 'w+') as outfile:
        outfile.write(text["content"])

num_cores = multiprocessing.cpu_count()
results = Parallel(n_jobs=num_cores)(delayed(processFile)(f) for f in listdir("C:\\Dropbox\\Data"))
I got the following error: ImportError: [joblib] Attempting to do parallel computing without protecting your import on a system that does not support forking. To use parallel-computing in a script, you must protect your main loop using "if __name__ == '__main__'". Please see the joblib documentation on Parallel for more information. I tried to insert __name__ == '__main__' just before the last two lines (num_cores and results), but then I get different errors, mostly referring to tika.
tranzsport 12 months ago
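The ImportError quoted above appears on Windows because there is no fork: every worker process re-imports the script, so anything at module level runs again in each worker. A minimal sketch of the guarded layout follows. The names DATA_DIR, OUT_DIR, output_path_for and main are hypothetical, the third-party imports are deferred into the functions that use them so the module can be imported on its own, and the sketch does not address the tika-related errors, whose text is not quoted in the thread.

```python
from os import listdir
from pathlib import Path

DATA_DIR = Path("C:\\Dropbox\\Data")   # input directory from the question
OUT_DIR = DATA_DIR / "textoutput"      # output directory from the question


def output_path_for(filename, out_dir=OUT_DIR):
    """Return the .txt path that corresponds to one input file name."""
    return Path(out_dir) / (filename + ".txt")


def process_file(filename):
    # Deferred import: importing this module (which every spawned
    # worker does) then has no side effects.
    from tika import parser

    text = parser.from_file(str(DATA_DIR / filename))
    with open(output_path_for(filename), "w+") as outfile:
        outfile.write(text["content"])


def main():
    from joblib import Parallel, delayed

    # Skip sub-directories such as textoutput, which lives inside DATA_DIR.
    files = [f for f in listdir(DATA_DIR) if (DATA_DIR / f).is_file()]
    # n_jobs=-1 asks joblib to use all available CPUs.
    Parallel(n_jobs=-1)(delayed(process_file)(f) for f in files)


# Only the process that was started as the script runs the parallel loop;
# workers that re-import the module skip this block.
if __name__ == "__main__":
    if DATA_DIR.is_dir():
        main()
    else:
        print(f"Input directory not found: {DATA_DIR}")
```

The key point is that the Parallel(...) call sits behind the guard while process_file stays at module level, where the workers can find it.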
Winning solution

A solution using only the built-in multiprocessing library.

from os import listdir
from multiprocessing import Pool

from tika import parser


def process_file(filename):
    in_filename = 'C:\\Dropbox\\Data\\'+filename
    out_filename = 'C:\\Dropbox\\Data\\textoutput\\'+filename+'.txt'

    text = parser.from_file(in_filename)
    with open(out_filename, 'w+') as outfile:
        outfile.write(text["content"])

if __name__ == '__main__':
    pool = Pool()
    pool.map(process_file, listdir("C:\\Dropbox\\Data"))
Is this compatible with Python 3.6 (using PyCharm)? I ran this code and got the following error:
Traceback (most recent call last):
  File "C:/Dropbox/Data/Edgar Parser Parallel/PDFtoTEXT.py", line 33, in <module>
    pool.map(listdir("C:\\Dropbox\\Data"), process_file)
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 260, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "C:\Users\user\AppData\Local\Programs\Python\Python36\lib\multiprocessing\pool.py", line 343, in _map_async
    iterable = list(iterable)
TypeError: 'function' object is not iterable
tranzsport 12 months ago
I had swapped the map arguments. :( I've updated my answer now. It should work with Python 3.6.
iurisilvio 12 months ago
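The TypeError in the traceback is what Pool.map raises when the iterable is passed first and the function second, because it tries to iterate over its second argument. A small sketch of both orders follows. It uses multiprocessing.pool.ThreadPool, which has the same map signature but runs threads, so it needs no __main__ guard; square and run_demo are hypothetical names used only for illustration.

```python
from multiprocessing.pool import ThreadPool


def square(x):
    return x * x


def run_demo():
    """Call map with the correct and the swapped argument order."""
    with ThreadPool(2) as pool:
        # Correct order: function first, iterable second.
        ok = pool.map(square, [1, 2, 3])
        try:
            # Swapped order, as in the traceback above.
            pool.map([1, 2, 3], square)
            swapped_error = None
        except TypeError as exc:
            swapped_error = exc
    return ok, swapped_error
```

Calling run_demo() returns the squared values for the first call and the caught TypeError for the second.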
I kept getting repeated errors related to calling the multiprocessing procedure. However, I looked at some examples in https://docs.python.org/2/library/multiprocessing.html and added

if __name__ == '__main__':
    pool = Pool(2)
    pool.map(process_file, listdir("C:\\Dropbox\\Data"))

and now it works like a charm! I think the 2 inside the parentheses refers to the number of cores? Making this adjustment with name and main, and passing 2 inside the parentheses, did the trick, and I can already see a major improvement, so thanks!
tranzsport 12 months ago
Great! 2 is the number of parallel processes. The "if main" guard is necessary to prevent child processes from launching other processes. I'll update my answer, just to be correct.
iurisilvio 12 months ago
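As a follow-up to the point about Pool(2): the argument is the number of worker processes, and Pool() with no argument falls back to the CPU count reported by the operating system. A hedged sketch of how the worker count and chunk size might be exposed for a batch of this size follows. The names worker_count and convert_all are hypothetical, and chunksize=16 is an arbitrary starting value, not a measured optimum.

```python
import os
from multiprocessing import Pool


def worker_count(requested=None):
    """Return the requested number of workers, or one per reported CPU."""
    if requested is not None:
        return max(1, requested)
    return os.cpu_count() or 1


def convert_all(process_file, filenames, processes=None, chunksize=16):
    """Run process_file over filenames and return how many were handled.

    process_file must be a module-level function so that worker
    processes can import it.
    """
    done = 0
    with Pool(worker_count(processes)) as pool:
        # imap_unordered yields each result as soon as a worker finishes,
        # and chunksize hands the file names out in batches.
        for _ in pool.imap_unordered(process_file, filenames, chunksize):
            done += 1
    return done
```

It would be called from inside the if __name__ == '__main__': block, for example convert_all(process_file, listdir("C:\\Dropbox\\Data"), processes=2).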