Issue
So I have a very large CSV file in my S3 bucket (2M+ lines) and I want to import it into DynamoDB.
What I tried:
- Lambda: I managed to get the Lambda function to work, but only around 120k lines were imported to DynamoDB before the function timed out.
- Data Pipeline: the pipeline got stuck on "waiting for runner" and then stopped completely.
Solution
Here's a serverless approach to process the large .csv in small chunks with two Lambdas and an SQS queue:
- Using a one-off Reader Lambda, extract the primary key of every record with S3 Select, querying the .csv in place: SELECT s.primary_key FROM S3Object s. See the SelectObjectContent API for details.
- The Reader Lambda puts the primary keys into an SQS queue. Add a Dead Letter Queue to capture errors.
- Add the queue as the Writer Lambda's event source. Enable batching. Limit concurrency if desired.
- Parallel Writer Lambda invocations each fetch the records for their batch of primary keys from the .csv using S3 Select: SELECT * FROM S3Object s WHERE s.primary_key IN ('id1', 'id2', 'id3')
- The Writer Lambda writes its batch of records to the DynamoDB table.
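The Reader side of the steps above can be sketched as a Lambda handler in Python with boto3. The bucket name, object key, queue URL, and the primary_key column are hypothetical placeholders, and the CSV is assumed to have a header row:

```python
import json
from itertools import islice

try:
    import boto3  # present in the Lambda runtime; optional for local testing of the helper
except ImportError:
    boto3 = None

BUCKET = "my-bucket"          # hypothetical bucket name
KEY = "data/records.csv"      # hypothetical object key
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/keys-queue"  # hypothetical


def batched(iterable, n):
    """Yield successive lists of up to n items."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk


def handler(event, context):
    """One-off Reader: stream primary keys out of the CSV and enqueue them."""
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    # S3 Select scans the object server-side; only the key column crosses the wire
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression="SELECT s.primary_key FROM S3Object s",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )

    # Join the event-stream payload before splitting: a JSON line may span two events
    payload = b"".join(
        ev["Records"]["Payload"] for ev in resp["Payload"] if "Records" in ev
    )
    keys = [json.loads(line)["primary_key"] for line in payload.decode().splitlines()]

    # SQS accepts at most 10 messages per SendMessageBatch call
    for chunk in batched(keys, 10):
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": str(i), "MessageBody": k} for i, k in enumerate(chunk)],
        )
```

For a 2M-row file the key list may exceed what one enqueue loop can push within the Lambda timeout, so in practice you may want to paginate with a ScanRange or fan the enqueueing out as well.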
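The Writer side might look like the sketch below: the SQS event delivers a batch of primary keys, the handler fetches just those rows with S3 Select, then writes them to DynamoDB. The bucket, key, table name, and primary_key column are again hypothetical placeholders:

```python
import json

try:
    import boto3  # present in the Lambda runtime; optional for local testing of the helper
except ImportError:
    boto3 = None

BUCKET = "my-bucket"        # hypothetical bucket name
KEY = "data/records.csv"    # hypothetical object key
TABLE = "my-table"          # hypothetical DynamoDB table name


def build_expression(keys):
    """Quote the batch's primary keys into an S3 Select IN clause."""
    quoted = ", ".join("'" + k.replace("'", "''") + "'" for k in keys)
    return f"SELECT * FROM S3Object s WHERE s.primary_key IN ({quoted})"


def handler(event, context):
    """Writer: fetch the batch's records from the CSV and write them to DynamoDB."""
    # Each SQS message body is one primary key, as enqueued by the Reader
    keys = [record["body"] for record in event["Records"]]

    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression=build_expression(keys),
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )
    payload = b"".join(
        ev["Records"]["Payload"] for ev in resp["Payload"] if "Records" in ev
    )
    items = [json.loads(line) for line in payload.decode().splitlines()]

    # batch_writer handles the 25-item BatchWriteItem limit and retries for us
    table = boto3.resource("dynamodb").Table(TABLE)
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)
```

If the handler raises, SQS redelivers the batch and, after the configured retry count, moves it to the Dead Letter Queue, which is what makes the failed-batch capture described above work.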
Answered By - fedonev
Answer Checked By - Willingham (PHPFixing Volunteer)