Please create a simple script to extract data from these PDFs

Please write a simple script to extract all the meaningful data from the following PDF:
https://about.vanguard.com/investment-stewardship/supporting-files/proxyvote0040.pdf

The extracted data should be in JSON form. It should be something like:

[ {"ticker": "MMM", "meetingDate": "5/14/2019", "cusip": "88579Y101",
  "proposals": [ {"title": "Proposal #1a: ELECT DIRECTOR ...", "proposedBy": "ISSUER",
       "voted": true, "voteCast": "FOR", "forAgainstMgmt": "FOR"} ....
   ...

and so forth for all the pages of the PDF.
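
One way to pin down the requested shape is a pair of small record types serialized with the standard json module. This is only a sketch: the class names Proposal and Meeting and the helper to_json are hypothetical, and the field names are copied from the example in the request.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Proposal:
    # Field names follow the example JSON in the request.
    title: str
    proposedBy: str
    voted: bool
    voteCast: str
    forAgainstMgmt: str


@dataclass
class Meeting:
    ticker: str
    meetingDate: str
    cusip: str
    proposals: List[Proposal] = field(default_factory=list)


def to_json(meetings: List[Meeting]) -> str:
    """Serialize a list of Meeting records as a JSON array."""
    # asdict() also converts the nested Proposal objects to plain dicts.
    return json.dumps([asdict(m) for m in meetings], indent=2)
```

Keeping the record types separate from the PDF parsing would also make the output format easy to unit test without a PDF on hand.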

The script you write should be licensed to me under a permissive open source license (BSD, MIT, Apache, etc).

NB: The script should be easy to set up and should not require any external network access (I will input the PDF on the command line). The script should take only one command line argument: the PDF file above.
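
The single-argument requirement could be enforced with argparse from the standard library. A minimal sketch, assuming a hypothetical helper named parse_args; the file check is an extra assumption, not something the request spells out.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> Path:
    """Accept exactly one positional argument: the path to a local PDF."""
    parser = argparse.ArgumentParser(
        description="Extract proxy vote data from a PDF as JSON."
    )
    parser.add_argument("pdf", type=Path, help="path to a local PDF file")
    args = parser.parse_args(argv)
    # Assumption: fail early with a usage error if the file does not exist.
    if not args.pdf.is_file():
        parser.error(f"not a file: {args.pdf}")
    return args.pdf
```

With no argument, or with more than one, argparse prints a usage message and exits with a non-zero status.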

If there are multiple correct submissions, I will award to the shortest, simplest script with the best unit tests :)

Hi @Vlad :) look forward to your submission!
ramonza 1 month ago
BTW, if something requires npm packages, Ruby gems, or Python packages to be installed, please provide a Dockerfile as well.
ramonza 1 month ago
If there are no working solutions by the end of the week, I will award to anyone who can manually extract all the data correctly.
ramonza 1 month ago
awarded to mashtullah
Tags
pdf


3 Solutions


Solution

Usage: py proposals.py file.pdf. This prints a Python dict (valid JSON) that can be written to a file.
Requires Tika (pip install tika).

Looking now
ramonza 1 month ago
This is not working for me. See error here: https://labs.play-with-docker.com/p/bqifoedim9m000ftc3b0#bqifoedi_bqifoftim9m000ftc3bg
ramonza 1 month ago
I also tried in the latest Ubuntu image (Docker) and it gave me a different error: https://pastebin.com/sG8veWxa
ramonza 1 month ago
According to stackoverflow.com/questions/51514246, Tika requires Java 8 to run properly.
B44ken 1 month ago

Solution

Hello Ramon.

Here is the CLI that you were looking for.

It has only one package as a dependency, so Docker would be overkill here since it's not complex.

The README contains setup instructions.

The CLI will generate a results.json file in your current working directory.

Here is the results.json file generated with this CLI: Google Drive Link

Thank you,
Vladimir :)

"Docker would be overkill here it's not that complex": https://share.getcloudapp.com/BluZ1jBv 😅
ramonza 1 month ago
Looks like it depends on an external program called "pdftotext" - this is exactly why I like Docker :)
ramonza 1 month ago
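
Hidden dependencies on external programs, like the ones reported in this thread, can be surfaced with a check at startup. A minimal sketch using only the standard library; the function name missing_programs is hypothetical, and which programs to check depends on the solution.

```python
import shutil
from typing import Iterable, List


def missing_programs(names: Iterable[str]) -> List[str]:
    """Return the programs from `names` that are not found on PATH."""
    # shutil.which returns None when no executable with that name is found.
    return [name for name in names if shutil.which(name) is None]
```

A script could call missing_programs(["pdftotext"]) or missing_programs(["java"]) first and exit with a clear message if the list is not empty.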
Hello Ramon. I apologise for ignoring your request. Not everyone has JS tools like I do :) I very much appreciate your tip.
VladimirMikulic 1 month ago
Winning solution

I used Node.js; I hope you are OK with that.

1. In the terminal, create a new folder for your project, for example: mkdir extractdata

2. Move into your new folder: cd extractdata

3. Create a new Node project with the command: npm init

4. Install this package: npm install pdf-parse

5. Download index.js from this gist and save it in your project folder.

6. Go back to the terminal and run the project with the command node index.js proxyvote0040.pdf if the PDF file is in the same directory as the script; otherwise, give the full path.

Note: I had limited the scan to one page, but you can change the value on the command line by adding a second parameter, for example: node index.js pdffile.pdf 50

Let me know how it goes....

Update

For integration purposes, saving the JSON data in a file is better. The file will have the same name as the PDF file, except for the extension (.json).
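
The naming rule can be illustrated in a few lines. This is a Python sketch of the same idea, not the code from the gist; json_path_for is a hypothetical name.

```python
from pathlib import Path


def json_path_for(pdf_path: str) -> Path:
    """Return the PDF's path with its extension replaced by .json."""
    return Path(pdf_path).with_suffix(".json")
```

For example, json_path_for("proxyvote0040.pdf") gives proxyvote0040.json, in the same directory as the input.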

You can have a cron job periodically run a script to download that PDF from https://about.vanguard.com/investment-stewardship/supporting-files/

It then runs this script, knowing that the JSON file will be saved with the same name.

Then pick up the JSON data and use it in whichever app consumes that data.

Script Output

This is the JSON data scraped from the first page of your sample PDF.

[{"issuer":" 3M Company ","ticker":"MMM","meetingDate":"5/14/2019","cusip":" 88579Y101 ",
"proposals":[{"title":"PROPOSAL #1a: ELECT DIRECTOR THOMAS \"TONY\" K. BROWN ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1b: ELECT DIRECTOR PAMELA J. CRAIG ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1c: ELECT DIRECTOR DAVID B. DILLON ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1d: ELECT DIRECTOR MICHAEL L. ESKEW ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1e: ELECT DIRECTOR HERBERT L. HENKEL ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1f: ELECT DIRECTOR AMY E. HOOD ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1g: ELECT DIRECTOR MUHTAR KENT ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1h: ELECT DIRECTOR EDWARD M. LIDDY ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1i: ELECT DIRECTOR DAMBISA F. MOYO ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1j: ELECT DIRECTOR GREGORY R. PAGE ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1k: ELECT DIRECTOR MICHAEL F. ROMAN ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #1l: ELECT DIRECTOR PATRICIA A. WOERTZ ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #2: RATIFY PRICEWATERHOUSECOOPERS LLP AS AUDITOR ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"},
{"title":"PROPOSAL #3: ADVISORY VOTE TO RATIFY NAMED EXECUTIVE OFFICERS' COMPENSATION ","proposedBy":"ISSUER","voted":"YES","voteCast":"FOR","forAgainstMgmt":"FOR"}]}]
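
The sample output pads several strings with spaces and reports "voted" as "YES", while the request asked for a boolean. A post-processing pass along the following lines could reconcile the two. It is a sketch only: normalize is a hypothetical helper, and treating every value other than "YES" as false is an assumption.

```python
from typing import Any, Dict


def normalize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Strip padding from strings and turn "voted" into a boolean."""
    cleaned: Dict[str, Any] = {}
    for key, value in record.items():
        if key == "proposals":
            # Each proposal is a nested record; clean it the same way.
            cleaned[key] = [normalize(p) for p in value]
        elif key == "voted" and isinstance(value, str):
            # Assumption: anything other than "YES" counts as not voted.
            cleaned[key] = value.strip().upper() == "YES"
        elif isinstance(value, str):
            cleaned[key] = value.strip()
        else:
            cleaned[key] = value
    return cleaned
```

Applied to each element of the array, this would turn " 88579Y101 " into "88579Y101" and "voted":"YES" into "voted": true.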

If there is any requirement I missed, let me know.

This is only the first company in the PDF. There are hundreds of others. You can see that the PDF is hundreds of pages long, yet the JSON output is very short. I'm not sure what the problem is, but I get the same issue when running locally.
ramonza 1 month ago
@ramonza, I think you didn't read the solution closely. I had already told you where to change the number of pages to read, but don't worry, I have already edited the script; it's now an optional command-line parameter.
mashtullah 1 month ago