
The function should look like this:

```
def random_samples(data=None, timesteps=100, batch_size=100):
    return random_samples
```

Series and DataFrame refer to the Pandas datastructures.

**Inputs**

**data** will be in the form of a Series of string-indexed DataFrames. These DataFrames each have a different number of rows but the same number of columns; they are indexed by date, and the column values are floats (preferably 32-bit NumPy floats).

**timesteps** refers to the number of rows that should be taken for each sample.

**batch_size** refers to the number of samples that should be returned.

**random_samples** is a NumPy array of these samples, with shape **(batch_size, timesteps, columns)**. The number of columns will generally be around 30. Date indexes should not be included, only the float values.

The data should be randomly sampled with replacement. The random samples should be random windows of size **timesteps** of time sequences taken from the various DataFrames included in the Series.
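For concreteness, the input described above could be constructed like this (the frame names, row counts, and column count are purely illustrative):

```
import numpy as np
import pandas as pd

# Illustrative only: three frames with different row counts but the
# same four float32 columns, wrapped in a string-indexed Series.
frames = {}
for name, n_rows in [("AAA", 250), ("BBB", 400), ("CCC", 120)]:
    frames[name] = pd.DataFrame(
        np.random.randn(n_rows, 4).astype(np.float32),
        index=pd.date_range("2000-01-01", periods=n_rows),
    )

data = pd.Series(frames)  # Series of string-indexed DataFrames
```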

I am open to suggestions on a better way to structure the data itself as well (hierarchical indexing, for example, although I'm not sure this works if the index lengths are all different?).
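On the hierarchical-indexing question: unequal index lengths are not a problem. For instance, `pd.concat` with `keys` stacks frames of different row counts into a single MultiIndexed DataFrame (the names and sizes here are illustrative):

```
import numpy as np
import pandas as pd

# Two frames with different numbers of rows.
a = pd.DataFrame(np.random.randn(3, 2).astype(np.float32),
                 index=pd.date_range("2000-01-01", periods=3))
b = pd.DataFrame(np.random.randn(5, 2).astype(np.float32),
                 index=pd.date_range("2000-01-01", periods=5))

# keys= builds a (name, date) MultiIndex; unequal lengths are fine.
stacked = pd.concat([a, b], keys=["a", "b"])
print(stacked.index.nlevels)   # 2
print(stacked.loc["b"].shape)  # (5, 2)
```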

Many thanks to anyone who can figure out how to do this efficiently! Speed is key here, so a slow solution will not work: many samples will have to be returned quickly. Individual slicing by randomly generated starting indexes didn't appear to be fast enough; profiling my attempts showed the slicing taking up most of the time.

**Deliverables**

Just the code for a function that does this (with some consideration taken for the speed/efficiency of the solution).

## 1 Solution

This should work with the way you've structured the data. Either way, the vast majority of the time *should* be spent concatenating the subselections into the result array.

```
import numpy as np
import pandas as pd


def random_samples(data=None, timesteps=100, batch_size=100):
    # Pull the raw float arrays out of the DataFrames once, up front.
    frame_values = [x.values for x in data]
    lengths = np.array([len(x) for x in frame_values])
    # Pick a frame (with replacement) for each sample in the batch.
    frame_selections = np.random.randint(0, len(data), size=batch_size)
    # Latest valid window start in each chosen frame. Note this assumes
    # every frame has at least `timesteps` rows; shorter frames would
    # yield ragged slices and make the concatenate below fail.
    max_bounds = np.maximum(0, lengths[frame_selections] - timesteps)
    starts = (np.random.rand(batch_size) * (max_bounds + 1)).astype(int)
    samples = []
    for i, start in zip(frame_selections, starts):
        samples.append(frame_values[i][start:start + timesteps][None, :, :])
    return np.concatenate(samples, axis=0)


# Demo: 100 frames with 200-2000 rows each and K = 30 float32 columns.
K = 30
frames = []
for n in np.random.randint(200, 2000, size=100):
    df = pd.DataFrame(np.random.randn(n, K).astype(np.float32),
                      index=pd.date_range('1/1/2000', periods=n))
    frames.append(df)

result = random_samples(frames, batch_size=200)  # shape (200, 100, 30)
```
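If the per-sample slicing loop still dominates in profiling, one possible variant (a sketch under the same assumptions, not benchmarked here) is to concatenate all frames into one array once, precompute every valid window start, and gather the whole batch with a single fancy-indexing call, eliminating the Python-level loop:

```
import numpy as np
import pandas as pd


def random_samples_fast(data=None, timesteps=100, batch_size=100):
    # Flatten all frames into one (N_total, K) array, once.
    values = np.concatenate([df.values for df in data], axis=0)
    lengths = np.array([len(df) for df in data])
    offsets = np.concatenate(([0], np.cumsum(lengths)[:-1]))
    # Number of valid window starts per frame; frames shorter than
    # `timesteps` simply contribute none (instead of breaking).
    n_starts = np.maximum(lengths - timesteps + 1, 0)
    # All valid absolute start positions in the concatenated array.
    # Windows never cross frame boundaries by construction.
    valid = np.concatenate([off + np.arange(n)
                            for off, n in zip(offsets, n_starts)])
    starts = np.random.choice(valid, size=batch_size)  # with replacement
    idx = starts[:, None] + np.arange(timesteps)       # (batch, timesteps)
    return values[idx]                                 # (batch, timesteps, K)


frames = [pd.DataFrame(np.random.randn(n, 5).astype(np.float32),
                       index=pd.date_range('2000-01-01', periods=n))
          for n in (50, 120, 300)]
out = random_samples_fast(frames, timesteps=20, batch_size=7)
print(out.shape)  # (7, 20, 5)
```

The one-time concatenation costs memory (a full copy of the data), but after that each batch is a single vectorized gather, which is typically much faster than per-sample slicing when `batch_size` is large.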