Create forks of TextBlob and NLTK that load modules quickly

TextBlob (http://textblob.readthedocs.org/en/dev/) is a natural language processing toolbox that depends on NLTK (http://www.nltk.org/). By default, when NLTK is imported, it imports everything using import * statements. This means that even a simple TextBlob script takes 300 milliseconds to run on my system. I'm looking for speed.

The goal here is to create a fork of TextBlob and NLTK that does some kind of smart loading (e.g., importing only the minimal set of modules needed, or loading modules lazily).

The following sample script should run quite quickly:

from textblob import TextBlob
blob = TextBlob("hello")

and this script should run without breaking:

from textblob import TextBlob

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)

print blob.tags

print blob.noun_phrases

for sentence in blob.sentences:
    print(sentence.sentiment.polarity)
    print(sentence.words)

print blob.sentences[0].words[2].pluralize()

print blob.correct()

print blob.parse()

print blob.upper()

print blob.ngrams(n=3)

print blob.detect_language()
awarded to Ibenor


1 Solution

Winning solution

Hello suchow,
I had a look at your problem. You can use the following module to get lazy module loading without creating a fork:

import sys
import imp
import os
import types

def makeImportedModule(name, pathname, desc, scope):
    def _loadModule():
        mod = sys.modules.get(name, None)
        if mod is None or not isinstance(mod, types.ModuleType):
            try:
                file = open(pathname, 'U')
            except Exception:  # e.g. a package directory, which cannot be opened as a file
                file = None

            try:
                mod = imp.load_module(name, file, pathname, desc)
            finally:
                if file is not None:
                    file.close()

            sys.modules[name] = mod

        scope[name] = mod

        frame = sys._getframe(2)
        global_scope = frame.f_globals
        local_scope = frame.f_locals

        moduleParts = name.split('.')
        names = [ '.'.join(moduleParts[-x:]) for x in range(len(moduleParts)) ]
        for modulePart in names:
            if modulePart in local_scope:
                if local_scope[modulePart].__class__.__name__ == 'ModuleProxy':
                    if pathname in repr(local_scope[modulePart]):
                        local_scope[modulePart] = mod
            if modulePart in global_scope:
                if global_scope[modulePart].__class__.__name__ == 'ModuleProxy':
                    if pathname in repr(global_scope[modulePart]):
                        global_scope[modulePart] = mod

        return mod

    class ModuleProxy(object):
        __slots__ = []
        def __hasattr__(self, key):
            mod = _loadModule()
            return hasattr(mod, key)

        def __getattr__(self, key):
            mod = _loadModule()
            return getattr(mod, key)

        def __setattr__(self, key, value):
            mod = _loadModule()
            return setattr(mod, key, value)

        def __repr__(self):
            return "<moduleProxy '%s' from '%s'>" % (name, pathname)

    return ModuleProxy()

class OnDemandLoader(object):
    def __init__(self, name, file, pathname, desc, scope):
        self.file = file
        self.name = name
        self.pathname = pathname
        self.desc = desc
        self.scope = scope

    def load_module(self, fullname):
        if fullname in __builtins__:
            try:
                mod = imp.load_module(self.name, self.file,
                                      self.pathname, self.desc)
            finally:
                if self.file:
                    self.file.close()
            sys.modules[fullname] = mod
        else:
            if self.file:
                self.file.close()
            mod = makeImportedModule(self.name, self.pathname, self.desc,
                                     self.scope)
            sys.modules[fullname] = mod
        return mod

class OnDemandImporter(object):

    def find_module(self, fullname, path=None):
        origName = fullname
        if not path:
            mod = sys.modules.get(fullname, False)
            if mod is None or mod and isinstance(mod, types.ModuleType):
                return mod

        frame = sys._getframe(1)
        global_scope = frame.f_globals

        if '.' in fullname:
            head, fullname = fullname.rsplit('.', 1)

            mod = sys.modules.get(head,None)
            if mod is None:
                return None

            if hasattr(mod, '__path__'):
                path = mod.__path__

        try:
            file, pathname, desc = imp.find_module(fullname, path)
            return OnDemandLoader(origName, file, pathname, desc, global_scope)
        except ImportError:
            return None

def install():
    sys.meta_path.append(OnDemandImporter())

Simply put this in a file called importer.py and add the following at the beginning of your program:

import importer
importer.install()
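The module above targets Python 2 and relies on the imp module, which newer Python 3 releases no longer ship. As a rough sketch of the same idea on Python 3, the standard library's importlib.util.LazyLoader can defer executing a module until its first attribute access. The helper name lazy_import is made up for this illustration, and unlike the meta_path hook above it only affects modules you request through it explicitly:

```python
import importlib.util
import sys


def lazy_import(name):
    """Return a module whose body only runs on first attribute access.

    Hypothetical helper for illustration; it is not part of TextBlob or NLTK.
    """
    # Reuse the module if something has already imported it.
    if name in sys.modules:
        return sys.modules[name]

    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError("cannot find a loadable module named %r" % name)

    # Wrap the real loader so that executing the module is postponed.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)  # sets up the lazy module, does not run its body yet
    return module


if __name__ == "__main__":
    json = lazy_import("json")      # cheap: nothing has been executed so far
    print(json.dumps({"a": 1}))     # first attribute access triggers the real import
```

Whether this helps with TextBlob and NLTK depends on how much of their import chain can be routed through such a helper, so treat it as a starting point rather than a drop-in replacement.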

Note that this only shortens the loading time of the script and reduces memory usage; the program itself will not run any faster. I added the following lines to your script to measure speed and memory usage:

import resource
import sys 
usage = resource.getrusage(resource.RUSAGE_SELF)
sys.exit('Memory usage: ' + str(usage.ru_maxrss) + ' kByte, Execution time: ' + str(usage.ru_utime) + ' Secs')

(I used sys.exit to write to STDERR instead of printing to STDOUT, so I can suppress the output of your longer program.)
Here is the output of both of your programs, each run once without and once with the importer:
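As an aside, if you want to see which individual imports are expensive rather than the total for the whole process, a small timing helper is one way to do it. This is a minimal sketch for Python 3; the function name time_import is made up for this example, and it measures wall-clock time, not the user CPU time that getrusage reports:

```python
import importlib
import sys
import time


def time_import(module_name):
    """Return the seconds spent importing module_name.

    Returns 0.0 if the module is already cached in sys.modules, because
    a second import of the same module costs almost nothing.
    """
    if module_name in sys.modules:
        return 0.0
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Replace these with the modules you care about, e.g. "nltk" or "textblob".
    for name in ("json", "decimal"):
        print("%s: %.1f ms" % (name, time_import(name) * 1000.0))
```

On Python 3.7 and later, running the interpreter with -X importtime prints a similar per-module breakdown without any extra code.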

ldr@cYKoSoFT:~/Projekte/textblob$ ./shorttextblob.py 
Memory usage: 55848 kByte, Execution time: 0.356 Secs
ldr@cYKoSoFT:~/Projekte/textblob$ ./shorttextblob-withimporter.py 
Memory usage: 32888 kByte, Execution time: 0.2 Secs
ldr@cYKoSoFT:~/Projekte/textblob$ ./longtextblob.py > /dev/null
Memory usage: 109296 kByte, Execution time: 4.82 Secs
ldr@cYKoSoFT:~/Projekte/textblob$ ./longtextblob-withimporter.py > /dev/null
Memory usage: 86652 kByte, Execution time: 4.668 Secs
ldr@cYKoSoFT:~/Projekte/textblob$ 

As you can see, you save roughly 150 ms of startup time and about 23,000 kByte (roughly 23 MB) of memory. In a big program, 150 ms of startup time shouldn't matter too much, but it might make a difference if you run a small program often. Hope this helps you.
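For reference, the savings can be recomputed directly from the four measurements above. This small sketch takes the kByte and Secs labels in the output at face value; the dictionary layout and the name savings are just for illustration:

```python
# (memory in kByte, time in seconds), copied from the runs shown above.
runs = {
    "short": {"plain": (55848, 0.356), "importer": (32888, 0.2)},
    "long": {"plain": (109296, 4.82), "importer": (86652, 4.668)},
}


def savings(run):
    """Return (memory saved in kByte, time saved in whole milliseconds)."""
    mem_plain, time_plain = run["plain"]
    mem_lazy, time_lazy = run["importer"]
    return mem_plain - mem_lazy, round((time_plain - time_lazy) * 1000)


for label in ("short", "long"):
    mem_saved, ms_saved = savings(runs[label])
    print("%s script: %d kByte and %d ms saved" % (label, mem_saved, ms_saved))
```

Both scripts save about the same absolute amount, which fits the point that the importer only affects startup and not the work the program does afterwards.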

Hello suchow, I just wanted to write again because you haven't responded to my solution. I think it deserves the bounty because it does exactly what you wanted. Admittedly it's not a fork, but it does what a fork would have done, without the need for someone to maintain it, and you can even use it for other projects. You wouldn't get more speed from a fork, because loading some modules and storing them in memory just doesn't take that long. It would be great if you could at least give some feedback.
Ibenor over 4 years ago