Collapse list of words to get unique root spellings - javascript
New here? Learn about Bountify and follow @bountify to get notified of new bounties! x

We have a bunch of spellings of words from a log. Here we have a list of ways people spell abercrombie & fitch , we are trying to arrive at the number of unique combinations of the words but in some cases people didnt finish typing so we need to be able to aggregate unique root spellings. I know this is confusing but let me show some examples

John
Johnathan
Johna
Joh
Jothnat

in the above 3 we would collapse these to say "Johnathan" is the unique root word all of these were attempting to spell. now if we do something more complex like

Sam
Samantha
Sammy

we would distill this to say there are 2 available roots Samantha and Sammy, as Sam would be collapsed to one of the two longer terms.

My list of real terms is here
https://gist.github.com/quotient/b4707732ac4d0bf07e95491cc8a180bb

We want to be able to paste in a bunch of terms like this to a text box and have it output all of the unique root terms from it. Lets assume case doesn't matter in your final solution.

What about the duplicate characters in the same word? I'm seeing you're not taking them in count are you? ( see example 1)
Chlegou 11 days ago
hey there, can you point me to a specific example of your question? sorry im not following it exactly and i want to make sure i am helping.
Qdev 11 days ago
'Johnathan '(1) and 'Jothnat'(2) both contains same characters, but not the same occurrences of characters. (1) contain 1 'T' but (2) contain 2 'T' for example. so depending on that, these could be different routes. Also, should spaces, numbers and special chars like '&' should be counted? or only [a-z,A-Z] ?
Chlegou 11 days ago
awarded to Wuddrum
Tags
javascript

Crowdsource coding tasks.

4 Solutions

Winning solution

Hey Qdev, here's a simple solution I came up with:

Demo: https://wuddrum.github.io/collapse-words/

Source: https://github.com/Wuddrum/collapse-words

The solution is assuming that in the first sample that you provided Jothnat is a typo and is actually meant as Johnat. If not then the solution would require more complexity.

Currently not trimming leading and trailing spaces since that might mean that the user didn't finish a different root (as opposed to one with trimmed spaces). It's a very minor edit if you actually want the spaces to be trimmed.
Wuddrum 11 days ago

<style>
textarea{
  width: 100%;
  height: 400px;
}
</style>

<textarea id="input"></textarea>
<input type="button" value="filter" id="button">
<textarea id="output"></textarea>

<script>
button.onclick = function() {
output.value = input.value
  .split(/[\r\n]+/)
  .sort()
  .reduce((filtered, text) => {
    if(!text.startsWith(filtered[0])) 
      filtered.push(filtered[0]);
    filtered[0] = text; 
    return filtered
  }, [''])
  .join('\n')
}
</script>

You can test it at :https://codepen.io/anon/pen/oOYoPd

function processTextFile(file)
{
    var rawFile = new XMLHttpRequest();
    rawFile.open("GET", file, false);
    rawFile.onreadystatechange = function ()
    {
        if(rawFile.readyState === 4)
        {
            if(rawFile.status === 200 || rawFile.status == 0)
            {
                var allText = rawFile.responseText;
                var roots = getRoots(allText);
                alert(roots);
            }
        }
    }
    rawFile.send(null);
}


function getRoots(text) {
    var lines = text.split("\n").map( (x) => (x.toLowerCase().trim()));
    var roots = [];
    for (line of lines) {
        var absorbed = false
        var rootIndex = 0
        for (rootIndex; rootIndex < roots.length; rootIndex++) { 
            var root = roots[rootIndex];
            if (root.length > line.length) {
                if (root.slice(0,line.length) == line) {
                    //  the root is already there
                    absorbed = true;
                } 
            } else {
                if (line.slice(0,root.length) == root) {
                    //  We replace the root
                    roots[rootIndex] = line;
                    absorbed = true;
                } 
            }

        }
        if (!absorbed) {
            roots.push(line);
        }
    }
    return roots;
}

Hi there! :)

it was challenging, but here it is my solution :

i have made it display the root spellings of all assets, with an option of trimming whatever you want!!

generic solution is the best!

with a regex you configure, you could say which characters could be your root spelling dictionary.

e.g.: /[^a-zA-Z]+/g : will make use of letters only, ignoring any sperial caracters and numbers.

the demo is using only letters to find the root spellings

demo: https://chlegou.github.io/rootSpellings/index.html

source: https://github.com/chlegou/rootSpellings/

Enjoy! ;)

View Timeline