Add additional MERGE functionality to existing Node script

This is a continuation of work from the following bounty:

https://bountify.co/add-additional-output-functionality-to-existing-node-script

For the sake of brevity, please refer to the above-mentioned link/thread for the script and details thereof. Thank you.

Task #1:

At the moment, the MERGE functionality requires all input files to be in the same root folder, e.g.:

folder/file1.json

folder/file2.json

folder/file3.json

Need to add the ability (a flag) to MERGE from either the same root folder, or from a folder and all of its sub-folders, e.g.:

folder/path/to/1/file1.json

folder/path/to/2/file2.json

folder/path/to/3/file3.json

Additionally, need to be able to declare a specific KEYWORD so that the MERGE script only selects those files containing the given KEYWORD. The reason for this is that each folder contains multiple .json files, each of a different type. However, the files all conform to a naming convention wherein the last portion of the file name is the KEYWORD, e.g.:

folder/path/to/1/file1-KEYWORD.json

folder/path/to/2/file2-KEYWORD.json

folder/path/to/3/file3-KEYWORD.json

A typical folder would contain .json files named like so:

folder/path/to/1/029463122-1849_W_35TH_AVE-PROPERTY_DETAILS.json

folder/path/to/1/029463122-1849_W_35TH_AVE-ASSESSMENT.json

folder/path/to/1/029463122-1849_W_35TH_AVE-OWNERSHIP.json

folder/path/to/1/029463122-1849_W_35TH_AVE-MLS.json

folder/path/to/1/029463122-1849_W_35TH_AVE-DWELLING.json

The last portion of the file name (e.g. -ASSESSMENT.json) contains its distinguishing KEYWORD. The MERGE script would need to select/use only those files containing the user-declared KEYWORD (e.g. ASSESSMENT), whilst ignoring all the other .json files that do not contain the same KEYWORD.
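
For illustration only, a minimal sketch (in Node, using only fs and path) of the kind of keyword filter being requested; the helper name findKeywordFiles and the exact semantics are hypothetical, not part of the existing script:

// Sketch: collect .json files whose name ends with "-<KEYWORD>.json",
// optionally descending into sub-folders. Names here are illustrative only.
const fs = require('fs');
const path = require('path');

function findKeywordFiles(dir, keyword, recursive) {
  const matches = [];
  for (const entry of fs.readdirSync(dir)) {
    const full = path.join(dir, entry);
    if (fs.statSync(full).isDirectory()) {
      if (recursive) matches.push(...findKeywordFiles(full, keyword, recursive));
    } else if (new RegExp('-' + keyword + '\\.json$', 'i').test(entry)) {
      matches.push(full);
    }
  }
  return matches;
}

// e.g. findKeywordFiles('folder', 'ASSESSMENT', true) would return only the
// ...-ASSESSMENT.json files from the listing above.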

So, if a .json file containing 1000 records is SPLIT into 500 files across their various folders, those same 500 files (with 2 records each) also need to be able to be MERGED back into a single .json file containing 1000 records, just like the original file that was SPLIT.

awarded to farolanf


1 Solution


Changes

  • Support input in pretty format
  • Added -y option for pretty output
  • Added -r option to traverse merge dir recursively to find json files
  • Added -f pattern option to filter filenames (pattern is case insensitive)
  • Added -g to show basic progress

Example

spmer -m CA -o sample.json -rf OSLER

Merge files in the CA folder, output to sample.json, recursively (-r), using only filenames containing OSLER (-f pattern).
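
For instance (assuming the same folder layout), dropping -r should restrict the merge to files directly inside CA, and -g / -y should add a progress display and pretty output respectively, e.g.:

spmer -m CA -o sample.json -gyf OSLER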

spmer.js

#!/usr/bin/env node
const fs = require("fs");
const path = require("path");
const clu = require('command-line-usage');
const cla = require('command-line-args');
const Transform = require('stream').Transform;
const Writable = require('stream').Writable;

const splitOptions = [
  { name: 'split', alias: 's', type: String, arg: 'file', desc: 'a json-line file to split' },
  { name: 'name-key', alias: 'n', type: String, arg: 'key', desc: 'key for the name of the file; objects with the same name are grouped into the same file' },
  { name: 'path-key', alias: 'p', type: String, arg: 'key', desc: 'key for the output path' },
  { name: 'omit-name', alias: 't', type: Boolean, desc: 'omit the name key' },
  { name: 'omit-path', alias: 'u', type: Boolean, desc: 'omit the path key' },
  { name: 'out-key', alias: 'k', type: String, arg: 'key', desc: 'output the groups as array value of this key' },
  { name: 'append', alias: 'a', type: Boolean, desc: 'append to existing files' },
];

const mergeOptions = [
  { name: 'merge', alias: 'm', type: String, arg: 'dir', desc: 'dir with json files to merge' },
  { name: 'filter', alias: 'f', type: String, arg: 'pattern', desc: 'filter files according to pattern' },
  { name: 'recursive', alias: 'r', type: Boolean, desc: 'traverse dir recursively' },
  { name: 'merge-output', alias: 'o', type: String, arg: 'file', desc: 'merge output file' },
  { name: 'index', alias: 'x', type: String, arg: 'key', desc: 'specify index key for ESJSON' },
  { name: 'mjson', type: Boolean, desc: 'output merged as minified-JSON' },
  { name: 'ndjson', type: Boolean, desc: 'output merged as NDJSON' },
  { name: 'esjson', type: Boolean, desc: 'output merged as ESJSON' },
];

const generalOptions = [
  { name: 'pretty', alias: 'y', type: Boolean, desc: 'pretty json output' },
  { name: 'progress', alias: 'g', type: Boolean, desc: 'show progress' },
  { name: 'output-dir', alias: 'd', type: String, defaultValue: '.', arg: 'dir', desc: 'root output dir, defaults to current dir' },
  { name: 'help', alias: 'h', type: Boolean, desc: 'show this help' },
];

const optionDefs = splitOptions.concat(mergeOptions).concat(generalOptions);

const help = [
  {
    header: 'spmer',
    content: [
      'Split a json-line file or merge json files.',
      '',
      'A json-line file is a file containing valid json on each line.',
    ],
  },
  {
    header: 'Usage',
    content: [
      'node spmer.js -s FILE [options]',
      'node spmer.js -m DIR [options]',
      '',
      'The first form to split FILE.',
      'The second form to merge files in DIR.',
    ],
  },
  getSectionOption('Split options', splitOptions),
  getSectionOption('Merge options', mergeOptions),
  getSectionOption('General options', generalOptions),
];

function getSectionOption(title, optionDef) {
  return {
    header: title,
    content: optionDef.map(o => ({ 
      a: o.alias ? '-' + o.alias : null, 
      b: '--' + o.name + ' ' + (o.arg || ''), 
      c: o.desc })),
  };
}

// parse options
const opts = cla(optionDefs);

// handle errors
if (!opts.split && !opts.merge) {
  exitErr('Please specify an action: [-s FILE] to split or [-m DIR] to merge');
} 

function exitErr(str) {
  const errorSection = {
    'header': 'Error',
    'content': str,
  };
  help.push(errorSection);
  console.log(clu(help));
  process.exit(-1);
}

// show help
if (opts.help) {
  console.log(clu(help));
  process.exit(0);
}

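// SplitJson: a Transform stream that scans incoming text for balanced
// top-level {...} / [...] pairs and pushes each complete JSON value
// downstream, buffering partial data until a value is closed.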
class SplitJson extends Transform {

  constructor() {
    super({ encoding: 'utf8', decodeStrings: false });
    // buffer of unprocessed chunks
    this._chunks = [];
    this._offset = 0;
  }    

  _save(chunk) {
    this._chunks.push(chunk);
  }

  _chunksStr() {
    return this._chunks.join('');
  }

  // flush processed chunks and call the callbacks
  _flushChunks(offset) {
    let flushCount = 0;
    let flushLen = 0;
    for (const chunk of this._chunks) {
      if (flushLen + chunk.length <= offset) {
        flushLen += chunk.length;
        flushCount++;
      }        
      else {
        break;
      }
    }
    this._chunks.splice(0, flushCount);
    return flushLen;
  }

  _transform(chunk, enc, cb) {
    this._save(chunk);
    const str = this._chunksStr();
    this._offset = this._split(str, this._offset);
    this._offset -= this._flushChunks(this._offset);      
    cb();
  }

  _split(str, offset) {

    const closer = {
      '{': '}',
      '[': ']',
    };

    let char;
    let startIdx = -1;
    let depth = 0;
    let i;

    for (i = offset; i < str.length; i++) {
      const ch = str[i];
      if (char ? ch === char : ch === '{' || ch === '[') {
        if (depth === 0) {
          startIdx = i;
          char = ch;
        }
        depth++;
      }
      else if (ch === closer[char]) {
        depth--;
        if (depth === 0) {
          const json = str.substr(startIdx, i - startIdx + 1);
          this.push(json);
          char = null;
        }
      }
    }

    const end = depth === 0 ? i : startIdx < 0 ? i : startIdx;
    return end;
  }
}

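// WriteJson: a Writable stream that parses each incoming JSON value, derives
// the target file name and path from the configured keys, and writes or
// appends the object to that file (optionally wrapped in an out-key array).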
class WriteJson extends Writable {

  constructor(options) {
    options.decodeStrings = false;
    super(options);
    this.options = options;
    // track processed files
    this.filenames = [];
  }

  _write(chunk, enc, cb) {

    const str = chunk.toString('utf8');
    // console.log('===', str);

    const { nameKey, pathKey, omitName, omitPath, append, outKey } = this.options;

    const nameDot = nameKey[0] === '[' ? '' : '.';
    // path-key is optional; fall back to the output root when it is not given
    const pathDot = pathKey && pathKey[0] === '[' ? '' : '.';

    const obj = JSON.parse(str);

    const filename = eval('obj'+nameDot+nameKey);
    const filepath = pathKey ? (eval('obj'+pathDot+pathKey) || '') : '';

    if (omitName) {
      eval('delete obj'+nameDot+nameKey);       
    }
    if (omitPath) {
      eval('delete obj'+pathDot+pathKey);       
    }

    const outfile = getOutputPath(filepath) + "/" + filename + ".json";

    // truncate if this is the first time writing to this file
    // and not appending 
    let truncate = false;
    if (!append && !this.filenames.includes(outfile)) {
      truncate = true;
      this.filenames.push(outfile);
    }

    if (outKey) {
      // add it to the array on out-key
      let parent;
      if (!truncate && fs.existsSync(outfile)) {
        try {
          parent = JSON.parse(fs.readFileSync(outfile));
        }
        catch(x) {
          if (x instanceof SyntaxError) {
            console.log("\nError:\n  A file exists with the same name but is not in valid JSON format.\n  Perhaps it's the result of a previous operation?\n  Please delete the file or specify another output-dir.\n");
          }
          else {
            console.log(x);
          }
          process.exit(1);
        }
      }
      else {
        parent = { [outKey]: [] };
      }
      parent[outKey].push(obj);
      fs.writeFileSync(outfile, pretty(parent));
    }
    else {
      const data = pretty(obj) + "\n";

      if (truncate) {
        fs.writeFileSync(outfile, data);
      }
      else {
        fs.appendFileSync(outfile, data);
      }
    }
    cb();
  }
}

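// Progress: minimal in-place console progress indicator; render() rewrites
// the current line with a percentage computed from value/max.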
class Progress {
  constructor(options) {
    this.max = options.hasOwnProperty('max') ? options.max : 100;
    this.value = options.value || 0;
    this.stream = options.stream || process.stdout;
    this.template = options.template || 'Progress: :progress';
  }
  add(value) {
    this.value += value;
    this.render();
  }
  // set the absolute progress value
  setValue(value) {
    this.value = value;
    this.render();
  }
  end() {
    this.stream.write('\n');
  }
  render() {
    const progress = Math.round(this.value / this.max * 100) + '%';
    const str = this.template
      .replace(new RegExp(':progress', 'g'), progress)
      .replace(new RegExp(':value', 'g'), this.value)
      .replace(new RegExp(':max', 'g'), this.max);
    this.stream.cursorTo(0);
    this.stream.write(str);
  }
}

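// Streams: pipes a sequence of readable streams into a single output stream,
// one after another, and ends the output once the last input has finished.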
class Streams {
  constructor(out) {
    this.out = out;
    this.lastIn = null;
    this.i = 0;
  }
  write(ins) {
    if (!ins) {
      if (this.lastIn) {
        // preserve the output stream's `this` binding when ending it
        this.lastIn.once('end', () => this.out.end());
      }
      else {
        this.out.end();
      }
    }
    else {
      const pipe = ins => ins.pipe(this.out, { end: false });
      if (this.lastIn) {
        this.lastIn.on('end', () => pipe(ins));
      }
      else {
        pipe(ins);
      }
      this.lastIn = ins;
    }
  }
}

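// Split mode: stream the input file, emit each complete JSON value via
// SplitJson, and route it to its output file via WriteJson.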
if (opts.split) {
  if (!opts['name-key']) {
    exitErr("Please specify the name-key.");
  }
  const writeOpts = {
    nameKey: opts['name-key'],
    pathKey: opts['path-key'],
    omitName: opts['omit-name'],
    omitPath: opts['omit-path'],
    outKey: opts['out-key'],
    append: opts.append,
  };
  const progress = new Progress({ max: fs.statSync(opts.split).size });
  const inStream = fs.createReadStream(opts.split, 'utf8');

  if (opts.progress) {
    inStream.on('data', data => progress.add(data.length));
    inStream.on('end', () => progress.end());
  }

  inStream.pipe(new SplitJson()).pipe(new WriteJson(writeOpts));
}
else if (opts.merge) {
  // get the desired output format from the user
  getFormat(function (format) {
    if (Number(format) == 3 && !opts.index) {
      console.log("You forgot to declare an index (e.g. -x pid) at the end of the command; run the script again.");
      process.exit();
    }  

    const mergePath = path.resolve(opts.merge);

    const progress = new Progress({ 
      max: 0,
      template: 'Progress: :progress (:value of :max files)' 
    });

    if (opts.progress) {
      walkDir(mergePath, filepath => { 
        if (filepath) { 
          progress.max++; 
        } 
      });
    }

    const streams = new Streams(getOutStream());

    walkDir(mergePath, filepath => {
      writeJSON(format, streams, filepath);
      if (opts.progress) {
        if (filepath) {
          progress.add(1);
        }
        else {
          progress.end();
        }
      }
    }); 
  });
}

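// Pipe one matched .json file through SplitJson plus a per-file transform that
// re-emits each value in the selected output format; a null filepath marks the
// end of the walk and closes the output stream.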
function writeJSON(format, streams, filepath) {
  if (!filepath) {
    streams.write(null);
  }
  else if (filepath.endsWith(".json")) {

    class TransformJson extends Transform {
      constructor() {
        super({ encoding: 'utf8', decodeStrings: false });      
      }
      _transform(chunk, enc, cb) {
        const json = getJSON(format, chunk, opts.index);
        this.push(json);
        cb();
      }
    }

    const inStream = fs.createReadStream(filepath, 'utf8');
    streams.write(inStream.pipe(new SplitJson()).pipe(new TransformJson()));
  }
}

function getOutStream() {
  const filename = opts['merge-output'];
  if (!filename) {
    exitErr('Please specify merge-output file.');
  }
  const filepath = path.join(getOutputPath(), filename); 
  return fs.createWriteStream(filepath, 'utf8');
}

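// Walk dir (recursively when -r is given), calling fn(filepath) for every file
// that passes the -f filter, then fn(null) once the whole walk is complete.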
function walkDir(dir, fn, depth=0) {

  const files = fs.readdirSync(dir);

  files.forEach(file => {
    const filepath = path.join(dir, file);

    const stats = fs.statSync(filepath);

    if (stats.isDirectory()) {
      if (opts.recursive) {
        walkDir(filepath, fn, depth+1);
      }
    }
    else {
      if (!opts.filter || new RegExp(opts.filter, 'i').test(file)) {
        fn(filepath);
      }
    }
  });

  if (depth === 0) {
    fn(null);
  }
}

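// Format one JSON value according to the chosen output format:
// 1 = minified JSON (each value wrapped in [...] with a trailing comma),
// 2 = NDJSON (one value per line),
// 3 = ESJSON (an Elasticsearch-style {"index":{"_id":...}} action line,
//     with _id taken from the -x key, followed by the value).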
function getJSON(format, item, index) {
  function prettyParse(str) {
    return pretty(JSON.parse(str));
  }
  switch (Number(format)) {
    case 1: // minified JSON
      return prettyParse("[" + item + "]") + ",\n";
    case 2: // NDJSON
      return prettyParse(item) + "\n";
    case 3: // ESJSON
      const obj = JSON.parse(item);
      const key = 'obj.'+index;
      return prettyParse('{"index":{"_id":"' + eval(key) + '"}}') + '\n' +
        prettyParse(item) + "\n";
    default:
      break;
  }
}

function pretty(obj) {
  if (typeof obj !== 'object') {
    throw Error('pretty expects an object');
  }
  return JSON.stringify(obj, null, opts.pretty ? 2 : null);
}

// use recursion to simulate synchronous access to stdin/stdout
function getFormat(callback) {
  if (opts.mjson) return callback(1);
  if (opts.ndjson) return callback(2);
  if (opts.esjson) return callback(3);
  process.stdout.write(
    "Select output format: 1:minified JSON, 2: NDJSON, 3:ESJSON: "
  );
  process.stdin.setEncoding('utf8');
  process.stdin.once('data', function (val) {
    // check validity of input
    if (!isNaN(val)) {
      val = +val;   
      if (1 <= val && val <= 3) {
        process.stdin.pause();
        callback(val);
        return;
      }
    }
    // if input is invalid, ask again
    getFormat(callback);
  });
}

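// Create the directory path one segment at a time (like mkdir -p), replacing
// whitespace in folder names with underscores.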
function mkDir(dir) {
  return dir.split('/').reduce((path, folder) => {
    path = path + '/' + fixName(folder);
    if (!fs.existsSync(path)) {
      fs.mkdirSync(path);
    }
    return path;
  }, '');
}

function fixName(name) {
  return name.replace(/\s+/g, '_');  
}

function getOutputPath(dir='') {
  return mkDir(path.resolve(path.join(
    opts['output-dir'], 
    dir)));
}
@farolanf - Thank you. What is the proper command to select the key value for merge option 3, ESJSON?
ericjarvies 4 months ago
It's -x key. To display more usage information please use the -h option.
farolanf 4 months ago
The script runs, but it took about 5 minutes -after it had completed- before the MERGED file appeared in the folder. I didn't think it was even writing to the folder, until it just appeared whilst I was doing something else. Right now I am waiting for another one to appear, but thus far it has not. For the file that did write, each record was populated with {"index":{"_id":"undefined"}} instead of tax_year.
I used this command: split-merge -m CA -o MERGED-ASSESSMENT.esjson -grf ASSESSMENT -x tax_year
I also used this command: split-merge -m "CA" -o "MERGED-ASSESSMENT.esjson" -grf ASSESSMENT -x tax_year
ericjarvies 4 months ago
@farolanf - The file is actually there (I did ls in Terminal), so apparently something went wrong with the Finder app... I will reboot later, after some other processes have completed, and perhaps that will solve the problem.
ericjarvies 4 months ago
I copy-pasted your command split-merge -m CA -o MERGED-ASSESSMENT.esjson -grf ASSESSMENT -x tax_year and ran it on my machine; it finished in 2 seconds, and the machine is just a 2008 laptop. The merged file is 30558 lines of NDJSON data.
farolanf 4 months ago
@farolanf - I rebooted my computer, and the write problem -explained above- disappeared, but the merged file is still not writing the property data into the index key value, so all of the records have the exact same index value, e.g.- {"index":{"_id":"undefined"}}. I have tried using different keys, but they all end with the same result.
ericjarvies 4 months ago
Ok, after several tries I found the problem. The index key is year_tax, not tax_year :). If it's undefined, then the key is not there; the key might be misspelled. The original JSON: {"year_tax":"2006","value_land":"738000.00","value_imprv":"77000.00","value_total":"815000.00","value_levy":"4519.12","pid":"029-462-100","folio":"721-148-79-0000","folder":"CA/BC/GV/VA/SH/029462100-1035_CONNAUGHT_DR/","filename":"029462100-1035_CONNAUGHT_DR-ASSESSMENT"}
farolanf 4 months ago
@farolanf - I've been using the correct key names, and have tried all of them... pid, taxyear, landvalue, and so forth, but none of them work.
ericjarvies 4 months ago
Your command option was -x tax_year, but the original JSON (the sample you provided) has the key year_tax. {"year_tax":"2006","value_land":"738000.00","value_imprv":"77000.00","value_total":"815000.00","value_levy":"4519.12","pid":"029-462-100","folio":"721-148-79-0000","folder":"CA/BC/GV/VA/SH/029462100-1035_CONNAUGHT_DR/","filename":"029462100-1035_CONNAUGHT_DR-ASSESSMENT"}
farolanf 4 months ago
And just in case, make sure you're using the latest version of the script.
farolanf 4 months ago
I copy-pasted the script from this solution and ran it on my machine; here's the result: https://s7.postimg.org/hxgzzt3vv/Selection_034.png
farolanf 4 months ago
@farolanf - Please watch the following video: http://somup.com/cbnrhK8Rh. I will check to make sure I have the latest script (as listed above) now.
ericjarvies 4 months ago
@farolanf - Perhaps you added additional dependencies? I will install the script again and see what happens.
ericjarvies 4 months ago
Ok, I have watched the video and see where the problem is. The taxyear is nested inside the object, so the index key should be `ASSESSMENT.taxyear`, but I don't know if the script supports nested keys; I'll check.
farolanf 4 months ago
Please try the newest version; it supports nested keys. The correct command for the data in the video would be spmer -m CA -o sample.json -grf OSLER -x ASSESSMENT[0].pid (the key is pid of the first object in the array pointed to by the ASSESSMENT key of the outer object).
farolanf 4 months ago
Please try the newest version; it supports nested keys. The correct key option for the data in the video would be -x ASSESSMENT[0].pid (the key is pid of the first object in the array pointed to by the ASSESSMENT key of the outer object).
farolanf 4 months ago
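
For reference, a rough self-contained sketch of how a nested -x key is resolved by the getJSON code above (the record and pid value here are illustrative only):

// Illustrative record; real records come from the merged .json files.
const record = '{"ASSESSMENT":[{"pid":"P-1"}]}';
const index = 'ASSESSMENT[0].pid';        // value passed via -x
const obj = JSON.parse(record);
const id = eval('obj.' + index);          // "P-1"
// id then becomes the _id in the {"index":{"_id":"P-1"}} action line.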
@farolanf - That did the trick. The MERGE output for minified and ndjson does not work correctly. When using split-merge -m CA -o MERGED-ASSESSMENT-minified.json -r and selecting option 1 (minified), the merged file should contain a leading [ and trailing ], and one record per row with a trailing comma (,). When selecting option 2 (ndjson), the output should merely contain one record per row, without any leading or trailing brackets and without commas. When using the split-merge -m CA -o MERGED-ASSESSMENT-minified.json -grf ASSESSMENT command, the minified and ndjson merge outputs should contain one record per row, using ASSESSMENT as the object, and all of the records in an array.
ericjarvies 4 months ago
I'm trying to reproduce your problem. First I split the sample with spmer -s sample_records.ndjson -n filename -p folder -k ASSESSMENT. Then I ran spmer -m CA -o merged-assessment-min.json -r and chose 1 (minified-JSON); I get [{"ASSESSMENT":[{...},{...},{...}]}], per line. When I ran spmer -m CA -o merged-assessment-nd.json -r and chose 2 (NDJSON), I get {"ASSESSMENT":[{...},{...},{...}]} per line.
farolanf 4 months ago
@farolanf - What command would I use to merge back into a file that contains all of the records, but only one record per row? Basically, the objective is to be able to split a file and then merge it back into exactly the same form, if need be. So, when merging back and not selecting the -x option, it simply needs to ignore the object name, grab all the records, and output each one on its own line. Perhaps this video will explain it better: http://somup.com/cbn3X8Pi9
ericjarvies 4 months ago
Merge with spmer -m dir -o output.json -rg and select the format. From the video I see that you want to split a JSON array's elements into separate files and later merge the files back into an array; that operation is not supported by the script.
farolanf 4 months ago