Best AWS cloud approach to processing a specific file - just idea no code

Each week I get a new file and I need to process it. I'm struggling to make sure we use the best "cloud" approach to processing it, so we don't end up with servers and a ton of custom code when there might be something out of the box we're missing.

Here is my use case. Each week I get a 10–20 GB CSV file with 10–50 million rows of data in it (all product store data). It has 22 columns and tons of duplicate rows, and we need to extract 3 specific files from it, each with 2–3 of the columns, deduplicated by those columns. For example, I need a list of just the unique combinations in columns 1 & 3, or all of the unique product names, or a list of all of the stores. I don't want millions of rows with the same store name in them.
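Conceptually, each extraction is a single streaming deduplication pass over the file. A minimal sketch in Python (the function name, column indices, and file paths are illustrative placeholders, not from the original post):

```python
import csv

def extract_unique(src_path, out_path, cols):
    """Stream a large CSV and write only the unique combinations
    of the requested column indices to out_path."""
    seen = set()
    with open(src_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow([header[c] for c in cols])  # keep only the wanted columns
        for row in reader:
            key = tuple(row[c] for c in cols)
            if key not in seen:
                seen.add(key)
                writer.writerow(key)
```

Note that for 50 million rows the `seen` set can still consume several GB of RAM, which is part of why a single Lambda invocation struggles with this workload.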

Normally we process data with Lambdas, but in this case the file is too big and the job would likely exceed the 5-minute maximum a Lambda can run. I would like this bounty to be a collection of good ideas for ingesting the file from S3, processing it, and placing a number of the extractions mentioned above back into S3.

What aws services would you use?
How long do you think it would take to process this?
What sort of risks do you think this approach has?
What programming language is best suited for this operation/task? (based on raw speed, if required by the overall solution)

I'll award the best idea in our view, and bonus up to 3 other ideas $10 each.

awarded to 5osxcwbf


2 Solutions

Can we split this huge file? It would simplify your problem a lot. After processing each small file you have to merge the results, but it's worth it.

If AWS Lambda is unable to split the file, you can do it on a spot instance. After this step, you can continue with your Lambda-based solution.

Good question - we can't split it at write time since it comes from a 3rd party. Once we get it we can split it, but I don't think a Lambda can tackle it, for the file size alone.
Qdev almost 6 years ago
Check the Linux split command. Its performance is pretty good; it may well be enough for a 20 GB file.
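One caveat with `split -l` is that only the first chunk keeps the CSV header. A header-preserving split is also a few lines of Python, if that turns out to matter (the function name and chunk size are illustrative assumptions):

```python
import csv

def split_csv(path, rows_per_chunk, prefix):
    """Split a CSV into chunks of rows_per_chunk data rows each,
    copying the header into every chunk. Returns the chunk paths."""
    paths = []
    out = writer = None
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i % rows_per_chunk == 0:  # start a new chunk
                if out:
                    out.close()
                chunk_path = f"{prefix}{len(paths):04d}.csv"
                paths.append(chunk_path)
                out = open(chunk_path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
            writer.writerow(row)
    if out:
        out.close()
    return paths
```

Each chunk can then be handed to a separate Lambda invocation, with a final merge/dedup step over the per-chunk outputs.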
iurisilvio almost 6 years ago
Winning solution

As you noticed, Lambda isn't really designed for what you are trying to do.
The solution offered by iurisilvio is pretty clunky and not really native to the AWS cloud environment.

If you want to stay on the Amazon AWS platform (you shouldn't), you can use something like Data Pipeline:

AWS Data Pipeline is a web service that you can use to automate the movement and transformation of data. With AWS Data Pipeline, you can define data-driven workflows, so that tasks can be dependent on the successful completion of previous tasks.

(since it supports CSV ingestion)

You can use Data Pipeline in combination with plain old EC2 instances, with specialized Elastic MapReduce (EMR) clusters, or perhaps combine it with Redshift.

But this will require running servers and some custom code, costing you time and money.

This is why I highly suggest you look into Google BigQuery, since it seems ideal for your use case:

a) Supports CSV ingestion out of the box

b) Queries are written in the BigQuery SQL dialect, very similar to standard SQL

c) Querying tables with ~50 million rows rarely takes more than a few seconds

d) Pricing is storage ($0.02/GB-month) plus processing ($5/TB of data scanned)

e) Setup should be easy, and a free trial exists

Since each query only needs to scan a fraction of the columns (BigQuery bills only the columns a query touches), your querying cost will be negligible.
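Each extraction then collapses to a single `SELECT DISTINCT`. As a local sketch of that pattern (using Python's built-in sqlite3 as a stand-in for BigQuery, with an assumed, simplified 3-column schema instead of the real 22 columns):

```python
import sqlite3

# In-memory table standing in for the weekly product-store file.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (item TEXT, price REAL, store TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [("apple", 1.0, "Store A"),
     ("apple", 1.2, "Store A"),   # duplicate store, different price
     ("pear", 0.8, "Store B"),
     ("pear", 0.8, "Store B")],   # exact duplicate row
)

# Each requested extraction is one deduplicating query,
# just as it would be in BigQuery's SQL dialect.
unique_stores = sorted(r[0] for r in conn.execute("SELECT DISTINCT store FROM products"))
unique_item_store = sorted(conn.execute("SELECT DISTINCT item, store FROM products"))
```

The same two queries, pointed at a BigQuery table loaded from the CSV, would produce the deduplicated extract files directly.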

Agreed! Lambda is not the best solution. I tried to keep most of what he already has, but that's not how AWS should be used. I've never used AWS Data Pipeline; I'll try it some day, but it looks like what he needs.
iurisilvio almost 6 years ago
I hope Qdev has solved his problem and is able to accept my answer, and perhaps hand out a tip to you so you can experiment for free with Data Pipeline on AWS :]
5osxcwbf almost 6 years ago