CLI script to index all html files in directory to lunr.js compatible .json file

I need a script that can be run from the CLI and from a cronjob. It should loop recursively through every HTML file in a target directory, including subdirectories, and index them into a lunr.js-compatible .json file.
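The recursive walk described above could be sketched in Node.js roughly as follows. This is only an illustration: the function name findHtmlFiles is made up for this example, and it assumes a Node.js version recent enough to support the withFileTypes option of fs.readdirSync.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: recursively collect the absolute paths of all
// .html files below `dir`, including those in subdirectories.
function findHtmlFiles(dir) {
  let found = [];
  const entries = fs.readdirSync(dir, { withFileTypes: true });
  for (const entry of entries) {
    const full = path.resolve(dir, entry.name);
    if (entry.isDirectory()) {
      // Descend into the subdirectory and keep whatever it contains.
      found = found.concat(findHtmlFiles(full));
    } else if (entry.isFile() && entry.name.toLowerCase().endsWith('.html')) {
      found.push(full);
    }
  }
  // Sort so the result is stable between runs.
  return found.sort();
}
```

Calling findHtmlFiles('/your/target/directory') returns a sorted array of full paths, which can then be fed to whatever builds the index.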

Every single HTML file is pretty much set up the following way:
https://drive.google.com/file/d/0B6h9HPRdfghjYmxtXzJwRnBkQU0/view?usp=sharing

Then minimize/compress this .json file to make it as small as possible.

UPDATE:
It seems lunr.js requires a URL for each json object. Like this:

url: '/path-of-html-file',

So I need the script to be able to get the full local path of each html file and use that as the value for 'url'.
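As a rough sketch of that requirement, each HTML file could be turned into one record whose 'url' is the file's full local path. The function name toRecord and the 'body' field are assumptions made for this example; only 'url' comes from the question.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: build one record for an HTML file.
// 'url' is the file's absolute local path; 'body' (an assumed field name)
// holds the raw file contents.
function toRecord(file) {
  const url = path.resolve(file);
  return { url: url, body: fs.readFileSync(url, 'utf8') };
}
```

A list of such records, for example files.map(toRecord), can then be serialized with JSON.stringify.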

Bonus: I am also willing to tip anyone who can tell me how to have lunr.js load this .json file so that I can use lunr.js to search through it.

Thanks

hey, i saw the document that you want to index, but i'm not sure which fields you want to index. do you want the whole body of the document, or only specific parts?
andijcr over 4 years ago
Hey I want everything in the demo file
user0809 over 4 years ago
awarded to iurisilvio
Tags
json
ubuntu


3 Solutions


As you can see on the lunr.js site, you have to declare the fields you want to search through and set their values for each document (html page) in one json file.

To make things simpler, there is this Jekyll plugin that automatically creates the lunr.js json file.

But how do you actually display search results? All you get back is an ID and a weight.
user0809 over 4 years ago
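One common way to handle this, sketched below, is to keep the original documents in a separate lookup keyed by the same value used as the lunr ref, and to map each result back through it. The document data, the sampleResults array and the function name resolveResults are all made up for illustration; the only thing taken from lunr is that search results carry a ref and a score.

```javascript
// Hypothetical documents, keyed by the same value used as the lunr ref.
const documents = {
  '/docs/a.html': { title: 'Page A', url: '/docs/a.html' },
  '/docs/b.html': { title: 'Page B', url: '/docs/b.html' }
};

// Turn lunr-style results ({ref, score}) into entries that can be displayed.
// Results whose ref is not in the store are skipped.
function resolveResults(results, store) {
  return results
    .filter(function (r) {
      return Object.prototype.hasOwnProperty.call(store, r.ref);
    })
    .map(function (r) {
      const doc = store[r.ref];
      return { title: doc.title, url: doc.url, score: r.score };
    });
}

// Stand-in for what a search call would return; a real call needs lunr loaded.
const sampleResults = [
  { ref: '/docs/b.html', score: 0.9 },
  { ref: '/docs/a.html', score: 0.4 }
];

resolveResults(sampleResults, documents).forEach(function (hit) {
  console.log(hit.title + ' -> ' + hit.url + ' (' + hit.score + ')');
});
```

The index itself only needs to answer "which refs match"; everything shown to the user comes from the lookup.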
Also, I have several separate Jekyll instances running, but I combine their results into a single folder. Using the plugin would mean I would have to combine the lunr.js json files again.
user0809 over 4 years ago
Hey, so are you able to help me out with the script?
user0809 over 4 years ago
You should follow the steps mentioned in the github repo. I'm not sure I understand what you said about the multiple Jekyll instances. For the error, you have to do gem install nokogiri json. I could write the script you asked for, but I'm not sure you'll get the results you want with lunr.js
tomtoump over 4 years ago
The problem is that even after I managed to get it to work, the output json file doesn't contain everything I need. It only takes the title + URL + body of the markdown files. I need everything in the front matter. As long as the script you write creates a json file from everything in the demo html file and fits the problem description, I'll be happy.
user0809 over 4 years ago

I solved it using the lunr-index-build package (https://www.npmjs.com/package/lunr-index-build). First, you have to install it: npm install -g lunr-index-build

Then download my script (index.py) here: https://gist.github.com/iurisilvio/2d5cc5014a6a8c7d63b8

Run this command: python index.py --input /your/input/directory --output index.json

It'll generate your lunr index in index.json.


so, i wrote this bash script, the only other requirement is to have node.js installed.
you can change the values of the lunr= and lunrIndexOutput= variables if you don't like my defaults.

save the script in a file called lunr-index-builder.sh and invoke it with a directory to scan. for example bash lunr-index-builder.sh /home/andrea/blog will scan my blog directory for html files, and write a lunr index file with the name index.json

to install the dependencies, install nodejs and npm via sudo apt-get install nodejs npm, set the correct registry with sudo npm config set registry http://registry.npmjs.org/ and then install lunr via sudo npm install lunr.

#!/bin/bash

# if you have not installed the lunr module via npm, put the complete path of your lunr module here
lunr="lunr"

# put the name of your output file here
lunrIndexOutput="index.json"

# resolve the directory given as first argument to an absolute path
dir=$(readlink -f "$1")

# temporary file holding the list of html files; removed when the script exits
entriesFileNames=$(mktemp)
trap 'rm -f "$entriesFileNames"' EXIT

find "$dir" -type f -name "*.html" 2>/dev/null > "$entriesFileNames"

script="
var lunr = require('$lunr')
var fs = require('fs')
// to account for old versions of nodejs, where existsSync lived in the path module
fs.existsSync = fs.existsSync || require('path').existsSync

var idx = lunr(function () {
    this.ref('id')
    this.field('body')
})

var filenames = fs.readFileSync('$entriesFileNames').toString().split('\n')

// note: calling idx.add after construction works with lunr 0.x/1.x;
// lunr 2.x expects this.add(...) inside the function passed to lunr above
filenames.forEach(function (file) {
    // skip the empty entry left by the trailing newline
    if (file && fs.existsSync(file)) {
        var data = fs.readFileSync(file)
        idx.add({id: file, body: data.toString()})
    }
})

fs.writeFile('$lunrIndexOutput', JSON.stringify(idx), function (err) {
    if (err) throw err
    console.log('done')
})
"
node -e "$script"
Okay let me try this
user0809 over 4 years ago
Btw, can this index.json file be minified to the smallest file size possible?
user0809 over 4 years ago
You mentioned in the first sentence 'the only other requirement is to have node.js installed'. But in the script I see '#if you have not installed the module lnr via npm, put the complete path of your lunr module here lunr="lunr"'. Do I need to install a 'lunr' module as well as node.js?
user0809 over 4 years ago
yes, you can install it via sudo npm install lunr. or you can download lunr.js from here https://raw.githubusercontent.com/olivernn/lunr.js/master/lunr.min.js and save it (remember to change the lunr= variable)
andijcr over 4 years ago
the index.json is already minified. further compression should be done by your webserver
andijcr over 4 years ago
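To illustrate the point, the sketch below compares an indented JSON string, the compact output of JSON.stringify, and a gzipped copy of the compact output (roughly what a webserver with gzip compression enabled would send). The index object here is a made-up stand-in, not a real lunr index, so the sizes it prints are only indicative.

```javascript
const zlib = require('zlib');

// Hypothetical stand-in for a serialized index; a real one would come
// from JSON.stringify(idx).
const index = { fields: ['body'], docs: [] };
for (let i = 0; i < 200; i++) {
  index.docs.push({ id: '/site/page-' + i + '.html', body: 'lorem ipsum dolor sit amet' });
}

const pretty = JSON.stringify(index, null, 2); // indented, human-readable
const minified = JSON.stringify(index);        // no whitespace between tokens
const gzipped = zlib.gzipSync(Buffer.from(minified, 'utf8'));

console.log('pretty bytes:   ' + Buffer.byteLength(pretty));
console.log('minified bytes: ' + Buffer.byteLength(minified));
console.log('gzipped bytes:  ' + gzipped.length);
```

JSON.stringify without a spacing argument already emits no extra whitespace, so any further savings have to come from compression of the transfer or from indexing less text.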
By 'lunr' module do you mean the lunr.js plugin? Or is this another module I need to install? I can't seem to find anything on Google about a 'lunr module'
user0809 over 4 years ago
i edited the solution to help you install the dependencies
andijcr over 4 years ago
I get 'npm ERR! Error: failed to fetch from registry: lunr' and other errors when trying to install the lunr module
user0809 over 4 years ago
you have probably installed an old version of npm. if you are on a server, that's the most probable cause. It's easier to download lunr.min.js from https://raw.githubusercontent.com/olivernn/lunr.js/master/lunr.min.js , save it in the same folder as this script, and change the lunr= variable to lunr="./lunr.min.js"
andijcr over 4 years ago
I copied over manually the lunr.min.js file and ran the script but I get:
undefined:15
if(fs.existsSync(file)){
^
TypeError: Object # has no method 'existsSync'
at eval at (eval:1:82)
at Array.map (native)
at Object. (eval at (eval:1:82))
at Object. (eval:1:70)
at Module._compile (module.js:441:26)
at startup (node.js:80:27)
at node.js:555:3
Does this script fetch recursively through all folders in a directory?
user0809 over 4 years ago
i tracked down the bug by launching a fresh installation of ubuntu. ubuntu ships an old version of nodejs, so i updated the solution to work with both the old and the more recent versions. This should work! :D
andijcr over 4 years ago