Hash large CSV file in browser offline

I need to be able to drag a local file into an HTML page (offline or local) and have it hash all of the columns with SHA-256. We have two other challenges here: we also need to sanitize the data, and there are a ton of rows of data that need to be processed.

The steps here are:
-Clean the data: remove spaces, leading zeros, convert to lower case, remove non-alphanumeric characters or encoding.
-Convert every field (minus the header) to its SHA-256 hash (nulls should still have a hash when processed).
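For illustration, one plausible reading of these cleaning rules (the exact order is up to you; the diacritics handling follows the clarification in the comments below):

// Illustrative only: " 007-Ábc " should end up as "7abc" before hashing
var cleaned = " 007-Ábc "
    .replace(/\s+/g, "")                              // remove spaces
    .replace(/^0+/, "")                               // strip leading zeros
    .toLowerCase()                                    // convert to lower case
    .normalize("NFD").replace(/[\u0300-\u036f]/g, "") // fold diacritics (á -> a)
    .replace(/[^a-z0-9]/g, "");                       // drop non-alphanumeric characters
// cleaned === "7abc"; every cleaned field, including empties, then gets SHA-256 hashed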

The scale of this is likely going to be the actual problem. We are fine assuming Chrome or FF only if needed. We have prepared two files: 100K records and 1mm records. The winner of this contest will primarily be the first person who can process the 1mm rows; secondarily, if multiple people manage it, the person whose code runs fastest will win. The output should just announce to the page that it's done and offer a download link; we don't need to try to draw or show the user the rows or the outcome.

I will pay $100 to the winner and $50 to the second-place person.

As for assumptions about the code: we will run this as an offline zip, so you can assume any libs as long as they can run locally or offline. It would be awesome to show a progress bar, but if that's not reasonable, no problem.

https://www.dropbox.com/s/x3gdtdp4ayy5hyh/CSV-100K_and_1mm.zip?dl=0

  1. So you want the output to be a downloadable CSV file with SHA-256 fields but keep the text headers as they are?
  2. Please clarify "encoding" in "remove non-alphanumeric characters or encoding".
LayZee 2 months ago
1 - yes
2 - Sorry, I could have been clearer. Character encoding like this: https://stackoverflow.com/questions/990904/remove-accents-diacritics-in-a-string-in-javascript where Á gets converted to A.
Qdev 2 months ago
How about a Node.js program that you pass a CSV filename to, which then outputs a hashed CSV file in the same directory? That should give us more performance since we need no browser chrome.
LayZee 2 months ago
This would require the user to install and run Node, right? If yes, unfortunately I can't head that direction, or we would have cooked this up in Python or something. Trying for something I can run in a browser. Of note, we did get some code running internally, but it seems to die after 200K rows. Here is the code in case it can be a springboard to better ideas! https://crts-assets.s3.amazonaws.com/Hash_Processor.zip
Qdev 2 months ago
Will you be running the script locally (offline) from a local web server? I mean a web server that will be serving the HTML and the rest of the offline files?
kostasx 2 months ago
Locally, good question!
Qdev 2 months ago
Would it be OK to break up the output (hashed CSV file) into smaller chunks and download these sequentially? Let's say 10 CSV files instead of one big one? So the output files for the hash-1000K.csv input would be hashed-1000k-001.csv, hashed-1000k-002.csv, hashed-1000k-003.csv, etc.
kostasx 2 months ago
Okay everyone, here is how things ended up in testing. For transparency, I bonused everyone since this helped us out a ton and we learned a couple of things from the final solutions.
LayZee - 00:28:86 - winner - $100
CyteBode - 00:49:03 - 2nd place - $50
kostasx - 1:00:49 - 3rd place - $25
alixaxel - 4th place - $10
Thanks everyone for posting solutions. This was great!
Qdev 2 months ago
Thank you, it was a fun assignment. Too rarely do I get to work on performance-critical tasks.
LayZee 2 months ago
awarded to LayZee
Tags
javascript


5 Solutions


A solution that can process 1M records is pretty easy actually.

  • Get rid of sha256.js.
  • Change step: function(results, parser) { to step: async function(results, parser) {.
  • Change the two sha256( calls to await sha256(.

Use the native SubtleCrypto hashing API by implementing the sha256() function:

async function sha256(message) {
    // encode as UTF-8
    const msgBuffer = new TextEncoder().encode(message);

    // hash the message
    const hashBuffer = await crypto.subtle.digest("SHA-256", msgBuffer);

    // convert ArrayBuffer to Array
    const hashArray = Array.from(new Uint8Array(hashBuffer));

    // convert bytes to hex string
    const hashHex = hashArray
        .map(b => ("00" + b.toString(16)).slice(-2))
        .join("");

    return hashHex;
}
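
As a quick sanity check, the SHA-256 of the empty string is a well-known constant, so you can verify the implementation like this:

sha256("").then(function (hex) {
    // "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
    console.log(hex);
});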

It took a while (like 5 minutes, most of which was PapaParse), but I got the hashes on a low-end laptop.

Here's the link to the code if you want to try it out (expires in 24 hours).

There's a lot of other opportunities for optimizations, but it's late and I have to sleep now. :)

For better performance I would also suggest: 1. Use an array for the outputdata variable and push row strings into it; at the end, just do outputdata.join('\n'). 2. Don't evaluate header.toLowerCase().indexOf("account") while processing each row and column; just store the column index at the beginning.
Researcher 2 months ago
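A minimal sketch of those two suggestions (hypothetical names: headers and hashedFields stand in for the stub's own variables, which aren't shown here):

// 1. Accumulate rows in an array and join once at the end,
//    instead of concatenating to an ever-growing string.
var outputdata = [];

// 2. Locate the "account" column once, up front, rather than
//    re-checking every header on every row.
var accountIndex = headers.findIndex(function (h) {
    return h.toLowerCase().indexOf("account") !== -1;
});

// per row:
outputdata.push(hashedFields.join(","));

// once, at the end:
var csv = outputdata.join("\n");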
As always alixaxel thanks for the submission. Will test this out tomorrow morning! Thx in advance.
Qdev 2 months ago
Some of the cleaning tasks are missing.
LayZee 2 months ago

Here is my solution, with drag and drop, and a progress bar. I wrote the code without looking at the provided stub, so the cleanup logic in my cleanData function follows what you described in the requirements more closely than the stripNonAlphaNumeric function in the stub, which only removes non-alphanumeric characters.

Dependencies

Put these three .js files (file-wrapper.js, line-navigator.js, and sha256.js) in a folder named js, as well as the following JavaScript file:

script.js

function splitCSVLine(line, buffer, fn) {
  var values = line.split(",");
  var accum = "";

  var j = 0;
  for (var i = 0; i < values.length; i++) {
    var value = values[i];
    if (accum == "") {
      if (value.startsWith('"') && !(value.endsWith('"') && value.length > 1)) {
        // Opening of a quoted field that contains a comma
        accum = value;
      } else {
        // Plain field, or a quoted field without any embedded comma
        buffer[j++] = fn(value);
      }
    } else {
      // Continuation of a quoted field; restore the comma we split on
      accum += "," + value;
      if (value.endsWith('"')) {
        buffer[j++] = fn(accum);
        accum = "";
      }
    }
  }
  return j;
}

function cleanData(entry) {
  if (entry == "") {
    return entry;
  }
  // Remove whitespace
  entry = entry.replace(/\s+/g, '');
  // Remove leading zeroes
  entry = entry.replace(/^0+/, '');
  // Convert to lower
  entry = entry.toLowerCase();
  // Clean up the encoding
  entry = entry.normalize("NFD").replace(/[\u0300-\u036f]/g, '');
  // Remove non-alphanumeric characters
  entry = entry.replace(/\W|_/g, '');
  return entry;
}

var memo = {};
function process_memo(str) {
  return memo[str] || (memo[str] = sha256(cleanData(str)));
}

function process_normal(str) {
  return sha256(cleanData(str));
}

var dropDiv = document.getElementById("drop");
var statusSpan = document.getElementById("status");
var progressBar = document.getElementById("bar");
var progressBarBG = document.getElementById("progress");
var resultsDiv = document.getElementById("results");
var processing = false;

function handleFileSelect(e) {
  e.stopPropagation();
  e.preventDefault();

  if (processing) {
    return;
  } else {
    processing = true;
  }

  var files = e.dataTransfer.files;

  for (var i = 0, f; f = files[i]; i++) {
    var navigator = new LineNavigator(f);
    var fname = f.name;

    var fname_arr = fname.split(".")
    if (fname_arr[fname_arr.length - 1].toLowerCase() != "csv") {
      continue;
    }

    statusSpan.innerHTML = "Processing " + fname + "...";
    progressBarBG.style.display = "block";

    // Read the headers
    navigator.readLines(0, 1, function(err, index, lines, isEof, progress) {
      if (err) throw err;

      var headers = lines[0];
      var columns = (headers.match(/,/g) || []).length;

      var line_results = Array(columns)
      columns = splitCSVLine(lines[0], line_results, x => x);

      if (isEof) return;

      var results = [headers, "\r\n"];
      var lastProgress = -1;
      function linesReadHandler(err, index, lines, isEof, progress) {
        if (err) throw err;

        if (progress != lastProgress) {
          progressBar.style.width = progress + "%";
          lastProgress = progress;
        }

        // Split the row and process the entries
        for (var j = 0; j < lines.length; j++) {
          var count = splitCSVLine(lines[j], line_results, process_memo);
          if (count != columns) {
            continue;
          }
          results.push(line_results.join(","));
          results.push("\r\n");
        }

        if (isEof) {
          // Update the status
          statusSpan.innerHTML = "Saving to blob...";

          // Create the blob
          var data = new Blob(results, {"type": "text/csv"});
          var csvFile = window.URL.createObjectURL(data);

          // Create a link pointing to the blob's URL
          var div = document.createElement("div");
          var dlAnchor = document.createElement("a");

          dlAnchor.download = "result_" + fname;
          dlAnchor.innerHTML = dlAnchor.download;
          dlAnchor.href = csvFile;

          div.appendChild(dlAnchor)
          resultsDiv.appendChild(div);

          // Reset the status and hide the progress bar
          statusSpan.innerHTML = "Done! You can drag and drop another CSV file.";
          progressBarBG.style.display = "none";
          processing = false;

          // Uncomment the following line so the file gets downloaded automatically
          //dlAnchor.click();

          return;
        }
        navigator.readSomeLines(index + lines.length, linesReadHandler);
      }

      // Read the rest of the file
      navigator.readSomeLines(1, linesReadHandler);
    });

    break;
  }
}

function handleDragOver(e) {
  e.stopPropagation();
  e.preventDefault();
}

dropDiv.addEventListener("dragover", handleDragOver, false);
dropDiv.addEventListener("drop", handleFileSelect, false);

Then create the following HTML file, next to the js folder:

index.html

<!DOCTYPE html>

<html>
  <head>
    <meta charset="utf-8"/>
    <title>SHA256 CSV Entries Hasher</title>
    <style>
      body {
        overflow: hidden;
        margin: 0;
        padding: 0;
      }

      #drop {
        font-size: xx-large;
        text-align: center;
        font-family: arial, sans-serif;
        font-weight: bold;
      }

      .outer {
        display: table;
        position: absolute;
        height: 100%;
        width: 100%;
      }

      .middle {
        display: table-cell;
        vertical-align: middle;
      }

      .inner {
        margin-left: auto;
        margin-right: auto;
        width: 800px;
      }

      #progress {
        display: none;
        width: 100%;
        background-color: #eee;
      }

      #bar {
        width: 0%;
        height: 32px;
        background-color: #08f;
      }
    </style>
    <script type="text/javascript" src="js/file-wrapper.js"></script>
    <script type="text/javascript" src="js/line-navigator.js"></script>
    <script type="text/javascript" src="js/sha256.js"></script>
  </head>
  <body>
    <div class="outer" id="drop">
      <div class="middle">
        <div class="inner">
          <span id="status">Please drag and drop a CSV file.</span>
          <div id="progress">
            <div id="bar"></div>
          </div>
          <div id="results">
          </div>
        </div>
      </div>
    </div>

    <script type="text/javascript" src="js/script.js"></script>
  </body>
</html>

Performance

Instead of using a library such as Papa Parse to parse the CSV, I'm reading the CSV in chunks with line-navigator and processing the lines manually. This provides about a 2x increase in performance. On my seven-year-old laptop, processing the 1000K file only takes 1m40s with Firefox 62.0 64-bit.

Furthermore, I've noticed that many of the values occur more than once (notably the empty string), so I'm using memoization to get another boost in performance. Thanks to it, the time goes down to 1m15s. However, if the real data has more unique values, it would be better not to use memoization, by replacing process_memo with process_normal in splitCSVLine(lines[j], line_results, process_memo).

Memory usage is fairly high. With the 1000K file, by the time it's fully parsed, it goes up to about 2.5 GB before momentarily ballooning to 7 GB+ (likely even reaching into swap) during the creation of the blob. As such, it would be necessary to use a 64-bit OS and a 64-bit browser, on a PC with 8 GB of RAM or more.

For some reason, my solution makes Chrome slow down to a crawl, which doesn't happen with Firefox. I think it's due to Chrome being overly aggressive with memory management. As such, I would advise using only Firefox.

Edit: Refactored the JavaScript code out of the HTML file into its own .js file. I keep forgetting that Bountify's HTML syntax highlighter doesn't like JavaScript.

RAM usage gets huge because you are concatenating to an eventually large string.
LayZee 2 months ago
I'm not concatenating to a string, I'm pushing to an array. I'm seeing the same momentary ballooning of RAM usage in your solution when the blob gets created.
CyteBode 2 months ago
I see. Good call. I tried concatenating to a growing string and it was not working out well at all.
LayZee 2 months ago
Yeah I tried doing that as well, but it quickly gets into an AllocationError even if the RAM usage isn't so high.
CyteBode 2 months ago
Hey there, can you help me with this? Maybe I built the directory wrong, but when I run it, it just gives me my same CSV back: https://www.dropbox.com/s/1k93t230a0uwo5q/js.zip?dl=0
Thx!
Qdev 2 months ago
You saved script.js as an empty file, so no JavaScript is running. It's giving you the same file because the browser is trying to open the file as it normally would.
CyteBode 2 months ago
Ha, let's pretend that didn't happen :) Thx for the quick reply.
Qdev 2 months ago
Thanks for the tip.
CyteBode 2 months ago
Winning solution

The complete solution with package dependencies is on Plunker but also works when opening index.html from a local directory.

Works in Chrome and Firefox but is optimized for Chrome. It uses a decent amount of CPU and RAM resources (4 GB). Completes 1 million rows in 32 seconds in Chrome on my Windows laptop.

When you pick a new file, it releases the previously hashed CSV file for garbage collection to prevent memory leaks. It outputs a file with separator and newline character matching the input.

My sanitizing and hashing might differ from yours since I started developing before you posted your partial solution, but it is easy to replace with your own.

HTML (index.html)

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <meta http-equiv="X-UA-Compatible" content="ie=edge">
  <title>Hash large CSV file in browser offline</title>
</head>
<body>
  <input id="csv-control" type="file" accept=".csv" />
  <a id="download-link"
    target="_blank"
    style="display: block; visibility: hidden;"></a>

  <pre><code id="log"></code></pre>

  <script src="node_modules/js-sha256/src/sha256.js"></script>
  <script src="node_modules/line-navigator/file-wrapper.js"></script>
  <script src="node_modules/line-navigator/line-navigator.js"></script>
  <script src="node_modules/papaparse/papaparse.js"></script>
  <script src="index.js"></script>
</body>
</html>

JavaScript (index.js)

'use strict';

(function () {
  function clearLog() {
    logElement.innerText = '';
  }

  function hideDownloadLink() {
    downloadLinkElement.style.visibility = 'hidden';
  }

  function log(message) {
    const timestamp = new Date();
    logElement.innerText +=
      `\n${timestamp.toLocaleTimeString()}.${timestamp.getMilliseconds().toString().padStart(3, '0')}: ${message}`;
  }

  function onFileInput(event) {
    hideDownloadLink();
    /** @type {FileList} */
    const fileList = csvControl.files;
    const inputFile = fileList.item(0);

    if (outputObjectUrl !== undefined) {
      URL.revokeObjectURL(outputObjectUrl);
      outputObjectUrl = undefined;
    }

    log(`Processing input file '${inputFile.name}'...`);
    parseCsv(
      inputFile,
      () => log('Input file processed. Generating output file...'))
      .then(
        csvLines => {
          const outputFile = new Blob(csvLines, { type: 'text/csv' });
          outputObjectUrl = URL.createObjectURL(outputFile);
          downloadLinkElement.href = outputObjectUrl;
          const nameOfOutputFile = outputFilename(inputFile.name);
          downloadLinkElement.download = nameOfOutputFile;
          downloadLinkElement.innerText = `Download ${nameOfOutputFile}`;
          showDownloadLink();

          log(`Output file '${nameOfOutputFile}' generated`);
        },
        error => {
          console.error(error);
          log('Error when processing input file. See browser console for details.');
        });
  }

  /**
   * @param {string} inputFilename 
   * @return {string}
   */
  function outputFilename(inputFilename) {
    const fileExtension = /\.\w+$/;
    const hashInFix = '.sha256';

    return inputFilename.match(fileExtension)
      ? inputFilename.replace(fileExtension, hashInFix + '$&')
      : inputFilename + hashInFix;
  }

  /**
   * @param {File} csvFile
   * @param {Function} beforeOutput
   * @returns {Promise<string[]>}
   */
  function parseCsv(csvFile, beforeOutput) {
    return new Promise((resolve, reject) => {
      const csvLines = [];
      /** @type {string} */
      let delimiter;
      /** @type {string} */
      let newline;

      Papa.parse(csvFile, {
        complete(result, inputFile) {
          if (result.errors.length > 0) {
            reject(result.errors);

            return;
          }

          beforeOutput();

          readCsvHeader(inputFile)
            .then(fields => {
              csvLines[0] = fields
                .join(delimiter)
                + newline;

              resolve(csvLines);
            });
        },
        error: (error, inputFile) => {
          reject(error);
        },
        step(result, parser) {
          /** @type {string[][]} */
          const rows = result.data;

          if (delimiter === undefined) {
            delimiter = result.meta.delimiter;
            newline = result.meta.linebreak;
          }

          const csvLine = rows.map(row =>
            row.map(cell =>
              (cell !== '')
                ? sha256(sanitize(cell))
                : emptyHash)
              .join(delimiter)
            + newline)
            .join('');

          csvLines.push(csvLine);
        },
      });
    });
  }

  /**
  * @param {File} csvFile
  * @returns {Promise<string[]>} Resolves to the CSV header.
  */
  function readCsvHeader(csvFile) {
    return new Promise((resolve, reject) => {
      const inputFileLineReader = new LineNavigator(csvFile);
      const readFromStart = 0;
      const readFirstLine = 1;
      inputFileLineReader.readLines(
        readFromStart,
        readFirstLine,
        (error, _index, [firstLine]) => {
          if (error) {
            reject(error);

            return;
          }

          Papa.parse(firstLine, {
            complete: result => {
              resolve(result.meta.fields);
            },
            header: true,
          });
        });
    });
  }

  /**
  * @param {string} text
  * @returns {string} Normalized text.
  */
  function removeDiacritics(text) {
    return text.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
  }

  /**
  * @param {string} cellValue
  * @returns {string} Sanitized value.
  */
  function sanitize(cellValue) {
    return stripNonAlphaNumericalCharacters(
      removeDiacritics(stripLeadingZeroes(cellValue.trim()))).toLowerCase();
  }

  function showDownloadLink() {
    downloadLinkElement.style.visibility = 'visible';
  }

  /**
  * @param {string} text
  * @returns {string} Sanitized text
  */
  function stripLeadingZeroes(text) {
    return text.replace(/^0+/, '');
  }

  /**
  * @param {string} text
  * @returns {string} Sanitized text
  */
  function stripNonAlphaNumericalCharacters(text) {
    return text.replace(/[^A-Za-z0-9]/g, '');
  }

  /** @type {HTMLInputElement} */
  const csvControl = document.getElementById('csv-control');
  /** @type {HTMLAnchorElement} */
  const downloadLinkElement = document.getElementById('download-link');
  const logElement = document.getElementById('log');
  /**
  * SHA256 hash of empty string.
  * @type {string}
  */
  const emptyHash = sha256('');
  /** @type {string} */
  let outputObjectUrl;

  document.addEventListener('DOMContentLoaded', () => {
    csvControl.addEventListener('change', onFileInput);
  });
}());

I believe that your solution is computing hashes for the header row. You can try it with a test file with only one data row. There will be two rows of results instead of one.
CyteBode 2 months ago
Thanks, that bug has been fixed.
LayZee 2 months ago
I've checked the code and used it to produce the final hashed CSV file from the 1000K input CSV. Then I did a check on the first line, but the hashes don't match. For example: on line #2 of hash-1000K.csv, the email "nmoon0@amazon.co.jp" must compute to 6D7BFEA7B01A00F475CEE0FCC1C4F268136730F73AB57C9EF2055EE19202BD9D, but the hashed entry is c9547b84b1249a4d085dd1f75cdb62840eba29455f3cfc2f31bc4b2239a288d7. Have you tested the results in some way to ensure that the hashes correspond to the correct results? Am I missing something here?
kostasx 2 months ago
@kostasx You're getting that hash because you're not cleaning up the value as specified. Once cleaned up, it becomes "nmoon0amazoncojp" which computes to the hash LayZee and I are getting.
CyteBode 2 months ago
Thanks @CyteBode. It's getting late in here... :)
kostasx 2 months ago

Here is my solution. Around 35 seconds to process and download the file, using streams to read the input and write the output file.

Link: https://drive.google.com/open?id=1-7iHHFrQWDXtSdmf7IQVtZTDx0AefRVv

Notes:

• Using LineReader.js to read large input files (1000K) via streams.
• Using StreamSaver.js to write large output files (the 1000K CSV input creates a 715 MB output file) via streams, avoiding browser memory issues (see the sketch below).
• Used local testing utilities to ensure that all input entries correspond to the correct output hashed entries.
• Using a progress bar to indicate processing.
• Using a modified version of the js-sha256 library for the fastest hash generation possible.
• Using a modified version of the CSV file for the fastest performance.
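
The solution itself is only available via the link above, but the stream-writing side might look roughly like this (a sketch, not the posted code: StreamSaver.js's createWriteStream API is real, while line-navigator stands in for LineReader.js, whose exact API isn't shown here, and sanitize() is assumed to implement the bounty's cleaning rules):

var encoder = new TextEncoder();
var writer = streamSaver.createWriteStream("hashed-1000K.csv").getWriter();

function processFile(file) {
    var navigator = new LineNavigator(file);
    function handler(err, index, lines, isEof) {
        if (err) throw err;
        var out = lines.map(function (line) {
            // Naive split: ignores quoted commas, fine for a sketch
            return line.split(",").map(function (v) {
                return sha256(sanitize(v));
            }).join(",");
        }).join("\r\n") + "\r\n";
        // write() returns a promise; waiting on it keeps chunks from piling up in memory
        writer.write(encoder.encode(out)).then(function () {
            if (isEof) { writer.close(); return; }
            navigator.readSomeLines(index + lines.length, handler);
        });
    }
    navigator.readSomeLines(1, handler); // start past the header (emit it unhashed in real code)
}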

Bravo! Why not use the native crypto API and squeeze out a bit of extra performance?
alixaxel 2 months ago
Well, I did check a bunch of algorithms and it seems that the js-sha256 library is faster than the native one. Also, being synchronous made it easier to work with. At least in the initial tests, the library performed better than the native one.
kostasx 2 months ago
StreamSaver is a neat find, but it relies on such an ugly hack (a popup to mitm.html) to work! Furthermore, I'm having some problems with your solution. The download initiates in Chrome, but not at all in Firefox. With the 1000K file, processing hangs at around a third of the way in both browsers. Memory usage in Firefox is HUGE (~3 GB at only a third of the way), but it's fine in Chrome. Since it hangs with the 1000K file, I could only test the speed on my computer with the 100K file. In Chrome: 16s (you) vs 21s (me). In Firefox: 7s for both.
CyteBode 2 months ago
Yes, I only tested the code in Chrome; I had no idea how it behaves in FF. A couple of other ideas that I am thinking of testing next are utilizing web workers, splitting the output file into chunks (still no comment from the author), and maybe implementing a gpu.js version of the hash algorithm to make the code take advantage of the GPU.
kostasx 2 months ago
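For illustration, the web-worker idea might take roughly this shape (file names and the message format are hypothetical):

// worker.js: hash a batch of pre-sanitized values off the main thread
importScripts("node_modules/js-sha256/src/sha256.js");
self.onmessage = function (e) {
    self.postMessage(e.data.map(function (value) {
        return sha256(value);
    }));
};

// main thread: fan batches of values out, collect hashed batches back in order
var worker = new Worker("worker.js");
worker.onmessage = function (e) {
    // e.data is the array of hex digests for the batch
};
worker.postMessage(["nmoon0amazoncojp", ""]);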
That still leaves the hanging. It's hanging for me on the two computers running Windows I've tried it on, but I haven't tested with Linux or macOS. It puzzles me that it works for you but not for me. Also, looking at your code, you don't seem to be doing the data cleanup as required in the specs. I would be shaving off a fair bit of time if I weren't doing it either.
CyteBode 2 months ago
You're right, that part is missing. I tried to quickly grep for special characters but didn't come up with any in the input files. I will put in the sanitization code as well and see where the performance gets me. I am also testing the code on a Mac, by the way.
kostasx 2 months ago
All right, I just tried on a Hackintosh and it's hanging at the same spot in Firefox, but working fine in Chrome.
CyteBode 2 months ago

Bummer, I didn't expect this to end so fast; I was working on multiple versions of the solution without posting any (my bad).

Anyways, since this was a pretty fun problem to solve, here's my solution:

Plunkr source (it's better to copy the preview URL and open it in a separate tab; Plunkr seems to fail loading some of the dependencies 50% of the time)

This solution runs better in Chrome.

Dependencies:

LineNavigator

js-sha256

From benchmarking on my system, this performs ~26% faster than the winning solution (including blob creation, although I assume the percentage difference would be lower on higher-end systems), and RAM usage is about half, spiking to ~1.6 GB when the blob is created.

It's probably possible to create an even better solution by combining some of the tricks I use here with the winning solution.

I can comment on certain parts of the solution, if needed.

I started off with StreamSaver.js, just like kostasx did, but pretty quickly realized how hacky the whole implementation is. Plus, it seemed to add a nasty overhead, so I dropped the idea and went on with the current, posted solution.

I had started working on a third version of the solution, which seemed to be more than a 50% speed improvement over this one, but it all fell apart when I remembered that I should remove diacritics. Since I was working purely with byte arrays, that meant I would have to convert everything to a string at some point and lose most of the performance gains.
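
For reference, the byte-array approach might look like this (a sketch: js-sha256 accepts typed arrays directly, and fieldStart/fieldEnd are placeholders for whatever the delimiter scan produces):

// Hash a field's raw bytes without ever materializing a string.
// This only works while sanitization can stay byte-level (ASCII);
// diacritics folding forces a string round-trip, as noted above.
var bytes = new Uint8Array(arrayBuffer); // e.g. from FileReader.readAsArrayBuffer
var fieldBytes = bytes.subarray(fieldStart, fieldEnd);
var hex = sha256(fieldBytes); // js-sha256 accepts Uint8Array input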

Thanks for another fun bounty, Qdev!

EDIT: Removed live preview link, didn't realize how quickly it would expire.

Man, that is low-level 😃 With your Plunkr, Chrome completes in 47 seconds on my laptop (with some cleaning tasks missing, as you have noted).
LayZee 2 months ago
This solution should actually have all the cleaning tasks, unless I'm missing something.
The version that didn't have the diacritics check was the absolute low-level solution, and I didn't post it here.
Wuddrum 2 months ago