Use a neural network to analyze a dataset and generate output

I have a dataset of 750+ names. I need someone to use a neural network to analyze this dataset and generate its own names, using the tech described in this post:

https://goo.gl/pM6gQr

Specifically Torch and char-rnn. Hoping to generate 5000+ outputs. If this works I have some more datasets to try, and we can add those as additional contracts.

The first dataset is: https://goo.gl/EAbm2H

Hi there. I find this interesting because I'm getting into machine learning a bit :). I have an idea for an algorithm: at each step it takes a noun and an adjective (or a single word, if you are looking for short names) from a word-list generator and compares their similarity. If they are close, we keep them; otherwise, we keep searching. We can also decide the length of the name (how many words are in it), and we can keep it running for a day. I think we could get 10,000+ names. (I'm in an exam period, so I will follow this bounty and the other solutions. Let me know if you are interested in my algorithm idea, and we will see how it goes.) Best wishes.
Chlegou 22 days ago
Hi Chlegou. Thanks for your response. I am more interested in someone using an established method (Like the Torch/char-rnn combo mentioned in the link) than writing a new one.
moiremusic 22 days ago
Hello moiremusic. As I understand the bounty above, you want to use char-rnn to generate 5000+ names, right?
Codeword 22 days ago
@Codeword yes, using the dataset provided in the second link.
moiremusic 22 days ago
@Shane. You seem to know your way around this stuff, so no, Torch would not be a requirement. As long as you’re sure you could obtain the desired results (which it sounds like you’re confident you can).
moiremusic 22 days ago
awarded to CyteBode

1 Solution

Winning solution

The given dataset is too small. As stated on the GitHub page for char-rnn, "1MB is already considered very small". The dataset is only 8KB, so it's about 125 times smaller than even that.

I spun up a virtual machine running Ubuntu 14.04 and installed all the prerequisites for running torch-rnn (the faster reimplementation of char-rnn). I then prepared the data and trained the neural network as per the instructions on the GitHub page for torch-rnn. After a few warnings about the test data being too small, the number of iterations was automatically set to 100, and training took about 5 minutes to complete.

Once training had finished, I sampled a few thousand characters and it was all gibberish! E.g.: Brycesponnguceieakte, zl eWleorl, dvtrsMeky cRae dnereFo e auyet odre...

One hackish workaround I found was to create a 1MB dataset by repeating multiple shuffled copies of the original list of names. I then let the neural network train on that new dataset for 1000 iterations. After that, I sampled the neural network, and it wasn't gibberish anymore!
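
The repetition workaround described above could look roughly like the following Python sketch. The function name `build_repeated_dataset`, the `target_bytes` parameter, and the fixed seed are illustrative assumptions; the post does not show how the 1MB file was actually built.

```python
import random


def build_repeated_dataset(names, target_bytes=1_000_000, seed=0):
    """Concatenate independently shuffled copies of `names` until the
    UTF-8 encoded text is at least `target_bytes` long (one name per line)."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    cleaned = [n.strip() for n in names if n.strip()]
    if not cleaned:
        raise ValueError("names must contain at least one non-empty entry")

    chunks = []
    size = 0
    while size < target_bytes:
        copy = cleaned[:]
        rng.shuffle(copy)  # each copy gets its own ordering
        block = "\n".join(copy) + "\n"
        chunks.append(block)
        size += len(block.encode("utf-8"))
    return "".join(chunks)


# Tiny demonstration with made-up names and a small target size.
demo_text = build_repeated_dataset(["Alder", "Briony", "Cassia"], target_bytes=200)
print(len(demo_text.encode("utf-8")))
```

The resulting text would be written to a file and preprocessed for training as usual. Repetition adds no new information; it only gives the network more passes over the same names.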

However, I saw a lot of entries that were already present in the original dataset. To get around that, I wrote a Python script that keeps sampling from the neural network and ignores any name that is present in either the original dataset or the new list of names, until it reaches at least 5000 names.
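
A minimal sketch of such a filtering loop is shown below; it is not the actual script, which was not posted. The `sample_batch` callable is a hypothetical stand-in for whatever invokes the trained network and returns a list of candidate names, and `max_batches` is an added safety cap so the loop cannot run forever.

```python
import random


def collect_new_names(sample_batch, original_names, target=5000, max_batches=10_000):
    """Call `sample_batch()` repeatedly, keeping only names that are neither
    in `original_names` nor already collected, until `target` is reached."""
    seen = {name.strip() for name in original_names}
    fresh = []
    for _ in range(max_batches):
        for raw in sample_batch():
            name = raw.strip()
            if not name or name in seen:
                continue  # empty line, original name, or earlier duplicate
            seen.add(name)
            fresh.append(name)
            if len(fresh) >= target:
                return fresh
    return fresh  # cap reached before enough new names were found


# Stand-in sampler (an assumption for the demo): random three-syllable names.
_rng = random.Random(1)
_syllables = ["ka", "lo", "mi", "ren", "tor", "vel"]


def fake_sample_batch():
    return [
        "".join(_rng.choice(_syllables) for _ in range(3)).capitalize()
        for _ in range(20)
    ]


names = collect_new_names(fake_sample_batch, ["Kalomi"], target=50)
print(len(names))
```

In a real run, `sample_batch` would wrap a call to the sampler of the trained model and split its output into lines.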

Here is the resulting list of >5000 names: https://pastebin.com/gkcCfy1G (set to expire in a week)

I'll be letting the neural network train for a few more hours (it can do up to 16800 iterations for this dataset). I'm curious to see whether that'll make for better output, or if it'll just overfit the original dataset.

I don't know if you require any code, or just the resulting list. If the former, I'll edit my solution with what you may need.

Oh my god these are amazing. Awarding the bounty. Would still love to see the remaining output when it's done, but this is exactly what I was after. Thanks!
moiremusic 22 days ago
Thanks for the quick verification! I let the training run some more, but I noticed that it mostly made the neural network output the same values as earlier entries more and more, which is likely due to my workaround of repeating the dataset. For the first few hundred iterations, most values were new, but there was a lot of gibberish. At 1000 iterations, values were more acceptable, and about 55% of them repeated earlier entries. At 2000 iterations it was 85%, and at 3000 it was 93%. This made the sampling process take longer and longer (e.g. 2x as long at 50%, 10x at 90%, etc.). Here's the resulting list at 3000 iterations: https://pastebin.com/xcGe9LZv (will expire in a week). I can't really say if there's an appreciable increase in quality. Since the training had already run for a while, I just stopped it at 3000.
CyteBode 22 days ago
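
The slowdown figures in the comment above (2x at 50%, 10x at 90%) match a simple model: if each sampled name is a duplicate with probability p, the expected number of samples per new name is 1 / (1 - p). The sketch below assumes p is constant and samples are independent, which is only an approximation since the duplicate rate also rises as the list of collected names grows. The function name is illustrative.

```python
def expected_samples_per_new_name(duplicate_rate):
    """Expected number of samples needed to obtain one new name when a
    fraction `duplicate_rate` of samples are rejected (geometric model)."""
    if not 0.0 <= duplicate_rate < 1.0:
        raise ValueError("duplicate_rate must be in [0, 1)")
    return 1.0 / (1.0 - duplicate_rate)


# Rates quoted in the thread: 50% and 90% as examples, 55%/85%/93% as measured.
for rate in (0.50, 0.55, 0.85, 0.90, 0.93):
    factor = expected_samples_per_new_name(rate)
    print(f"{rate:.0%} duplicates: about {factor:.1f}x samples per new name")
```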