Randomly sample with replacement from this Series of DataFrames

The function should look like this:

def random_samples(data=None, timesteps=100, batch_size=100):
    return random_samples

Series and DataFrame refer to the Pandas datastructures.

Inputs

data will be in the form of a Series of string-indexed DataFrames. These DataFrames each have a different number of rows and the same number of columns, are indexed by date, and their columns hold floats (preferably 32-bit NumPy floats).

timesteps refers to the number of rows that should be taken for each sample.

batch_size refers to the number of samples that should be returned.

random_samples is a numpy array of these samples, with shape (batch_size, timesteps, columns); there will generally be about 30 columns. The date indexes should not be included, only the float values.

The data should be randomly sampled with replacement: each sample is a random contiguous window of timesteps rows taken from one of the DataFrames in the Series.
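For concreteness, here is a minimal sketch of the expected input and output (the string labels and the make_frame helper are illustrative, not part of the spec):

import numpy as np
import pandas as pd

# A Series of string-indexed DataFrames with differing row counts,
# identical float32 columns, and date indexes.
def make_frame(n, k=30):
    return pd.DataFrame(np.random.randn(n, k).astype(np.float32),
                        index=pd.date_range('1/1/2000', periods=n))

data = pd.Series({'A': make_frame(500), 'B': make_frame(800)})

# random_samples(data, timesteps=100, batch_size=100) should then
# return a numpy array of shape (100, 100, 30).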

I am open to suggestions on a better way to structure the data itself as well (hierarchical indexing, for example, although I'm not sure that works if the index lengths all differ).
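On the hierarchical-indexing question: a MultiIndex does not require the per-key lengths to match, so one possible alternative layout (a sketch, reusing the illustrative make_frame helper above) is a single frame keyed by name:

# pd.concat with keys builds a (name, date) MultiIndex; the groups may
# have different lengths.
combined = pd.concat([make_frame(500), make_frame(800)], keys=['A', 'B'])
combined.loc['A']  # recovers the original 500-row frame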

Many thanks to anyone who can figure out how to do this efficiently! Speed is key here; a slow solution will not work, as many samples will have to be returned quickly. Slicing each sample individually by a randomly generated starting index didn't appear to be fast enough: profiling my attempts showed the slicing taking up most of the time.
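For reference, the per-sample slicing approach described above might look like the following sketch (naive_samples is an illustrative name, not the asker's actual code); each .iloc slice pays pandas indexing overhead, which is consistent with the profiling results mentioned:

import numpy as np

def naive_samples(data, timesteps=100, batch_size=100):
    # One pandas slice per sample: the per-call indexing overhead
    # dominates the runtime. Assumes each frame has >= timesteps rows.
    samples = []
    for _ in range(batch_size):
        df = data.iloc[np.random.randint(len(data))]  # pick a frame
        start = np.random.randint(len(df) - timesteps + 1)
        samples.append(df.iloc[start:start + timesteps].values)
    return np.array(samples)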

Deliverables

Just the code for a function that does this (with some consideration given to the speed/efficiency of the solution)

awarded to wesm


1 Solution

Winning solution

This should work with the way you've structured the data. Either way, the vast majority of the time should be spent concatenating the subselections into the result array.

import pandas as pd
import numpy as np

def random_samples(data=None, timesteps=100, batch_size=100):
    # Extract the raw float arrays from the frames once, up front.
    frame_values = [x.values for x in data]
    lengths = np.array([len(x) for x in frame_values])

    # Pick a frame (with replacement) for every sample in the batch.
    frame_selections = np.random.randint(0, len(data), size=batch_size)

    # Draw a valid window start within each selected frame. This assumes
    # each frame has at least `timesteps` rows; a shorter frame would
    # produce a short window and break the final concatenation.
    max_bounds = np.maximum(0, lengths[frame_selections] - timesteps)
    starts = (np.random.rand(batch_size) * (max_bounds + 1)).astype(int)

    # Slice each window and add a new leading axis so the windows can
    # be stacked along the batch dimension.
    samples = []
    for i, start in zip(frame_selections, starts):
        samples.append(frame_values[i][start:start + timesteps][None, :, :])

    return np.concatenate(samples, axis=0)

# Build 100 test frames with between 200 and 2000 rows each and K=30
# float32 columns, indexed by date.
K = 30
frames = []
for n in np.random.randint(200, 2000, size=100):
    df = pd.DataFrame(np.random.randn(n, K).astype(np.float32),
                      index=pd.date_range('1/1/2000', periods=n))
    frames.append(df)

result = random_samples(frames, batch_size=200)  # shape (200, 100, 30)
Thank you for this; this method was a great deal faster than what I was doing before, particularly as you pull larger and larger batches.
davebs almost 8 years ago
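If the set of frames is fixed across many calls, a further speedup is possible (a sketch, not part of the accepted answer; random_samples_flat is an illustrative name): concatenate all frame values into one contiguous array once, then gather every window with a single fancy-indexing operation instead of a Python-level loop. This assumes every frame has at least timesteps rows.

import numpy as np

def random_samples_flat(data, timesteps=100, batch_size=100):
    # One-time cost (hoist this out of the function to amortize it):
    # stack every frame's values and record where each frame starts.
    values = np.concatenate([x.values for x in data], axis=0)
    lengths = np.array([len(x) for x in data])
    offsets = np.concatenate([[0], np.cumsum(lengths)[:-1]])

    # Choose frames with replacement, then valid window starts within
    # them (assumes every frame has at least `timesteps` rows).
    frame_idx = np.random.randint(0, len(lengths), size=batch_size)
    starts = offsets[frame_idx] + (
        np.random.rand(batch_size) * (lengths[frame_idx] - timesteps + 1)
    ).astype(int)

    # A (batch_size, timesteps) index array gathers all windows in one
    # fancy-indexing call, yielding (batch_size, timesteps, columns).
    return values[starts[:, None] + np.arange(timesteps)]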