Python: problems reading docx files in Jupyter

I'm having trouble scanning a directory for .docx files.
It works if the file path is passed directly as a string, as shown below:
doc = Document("./data/Interview_Notes_3.docx")
But if the path is built while scanning with os.walk(), the same file raises:
PackageNotFoundError
The statement
print("is this a file? ",os.path.isfile(fullname)) # returns True
confirms that the file exists, yet Document() fails when it is given that path.

The indentation in my actual code is correct; I could not get Markdown formatting to work here.

import os
import os.path
from docx import Document

# This statement works fine.
doc = Document("./data/Interview_Notes_3.docx")

# But when the directory is scanned:
for root, directories, files in os.walk("./data"):
    for filename in files:
        if filename.endswith(".docx"):
            fullname = root + "/" + filename
            print("is this a file? ",os.path.isfile(fullname)) # returns True
            doc = Document(fullname) # Fails with PackageNotFoundError

awarded to CarlosOlivo
Tags
python3


1 Solution

Winning solution

There are two likely causes:

  • Concatenating root and filename by hand may not produce a well-formed path; build it with os.path.join() instead.
  • python-docx raises docx.opc.exceptions.PackageNotFoundError for invalid files. Word sometimes creates hidden temporary files whose names start with ~$; skip them by adding and not filename.startswith('~$') to the condition.
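
To see which files fall under the second cause before calling Document(), a quick check along these lines may help. It is only a sketch: it relies on the fact that a .docx file is a ZIP package, so the standard library's zipfile.is_zipfile() can flag files that are not. The helper name report_docx_candidates is made up for illustration and is not part of python-docx.

```python
import os
import zipfile


def report_docx_candidates(top):
    """Walk `top` and return a list of (path, looks_valid) for every *.docx file.

    looks_valid is False for Word lock files (names starting with ~$) and for
    files that are not ZIP archives. Being a ZIP archive is a necessary
    condition for a .docx package, not proof that python-docx can open it.
    """
    results = []
    for root, _dirs, files in os.walk(top):
        for filename in sorted(files):
            if not filename.endswith(".docx"):
                continue
            fullname = os.path.join(root, filename)
            is_lock_file = filename.startswith("~$")
            looks_valid = not is_lock_file and zipfile.is_zipfile(fullname)
            results.append((fullname, looks_valid))
    return results
```

Calling report_docx_candidates("./data") and printing the entries marked False should show which files Document() is likely to reject.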

Try this:

import os
import os.path
from docx import Document

doc = Document("./data/Interview_Notes_3.docx")

for root, directories, files in os.walk("./data"):
    for filename in files:
        if filename.endswith(".docx") and not filename.startswith('~$'):
            fullname = os.path.join(root, filename)
            print("is this a file? ",os.path.isfile(fullname))
            doc = Document(fullname)
            coreprops = doc.core_properties
            props = dir(coreprops)

            for prop in props:
                if not prop.startswith("_"):
                    print(f"Property {prop:20s}: {getattr(coreprops, prop)}")
            print("\n")
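
As an alternative to joining paths by hand, the same scan can be sketched with pathlib, which builds the paths itself. This is a variation on the code above under the same assumptions, not a required change; the generator name iter_docx_paths is hypothetical.

```python
from pathlib import Path


def iter_docx_paths(top):
    """Yield paths (as str) of *.docx files under `top`, skipping Word's ~$ lock files."""
    for path in sorted(Path(top).rglob("*.docx")):
        if path.is_file() and not path.name.startswith("~$"):
            # Convert to str so the result can be passed anywhere a plain
            # path string is expected.
            yield str(path)
```

Each yielded string could then be passed to Document() in place of fullname, for example: for fullname in iter_docx_paths("./data"): doc = Document(fullname).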
Excellent, thanks for the fix and the explanation. Using os.path.join() works.
broadreach 1 month ago