Issue
So I have a very large CSV file in my S3 bucket (2M+ lines) and I want to import it into DynamoDB.
What I tried:
- Lambda: I managed to get the Lambda function to work, but only around 120k lines were imported to DynamoDB before the function timed out.
- Data Pipeline: the pipeline got stuck on "waiting for runner" and then stopped completely.
Solution
Here's a serverless approach to process the large .csv in small chunks with two Lambdas and an SQS queue:
- Using a one-off Reader Lambda, extract the primary key of every record with S3 Select, querying the .csv in place: SELECT s.primary_key FROM S3Object s. See the SelectObjectContent API for details.
- The Reader Lambda puts the primary keys into an SQS queue. Add a Dead Letter Queue to capture errors.
- Add the queue as the Writer Lambda's event source. Enable batching. Limit concurrency if desired.
- Parallel Writer Lambda invocations each fetch the records for their batch of primary keys from the .csv using S3 Select: SELECT * FROM S3Object s WHERE s.primary_key IN ('id1', 'id2', 'id3')
- The Writer Lambda writes its batch of records to the DynamoDB table.
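The Reader side of the steps above can be sketched as a Lambda handler in Python with boto3. The bucket name, object key, queue URL, and the primary_key column are hypothetical placeholders, and the CSV is assumed to have a header row:

```python
import json
from itertools import islice

try:
    import boto3  # present in the Lambda runtime; optional for local testing of the helper
except ImportError:
    boto3 = None

BUCKET = "my-bucket"          # hypothetical bucket name
KEY = "data/records.csv"      # hypothetical object key
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/keys-queue"  # hypothetical


def batched(iterable, n):
    """Yield successive lists of up to n items."""
    it = iter(iterable)
    while chunk := list(islice(it, n)):
        yield chunk


def handler(event, context):
    """One-off Reader: stream primary keys out of the CSV and enqueue them."""
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    # S3 Select scans the object server-side; only the key column crosses the wire
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression="SELECT s.primary_key FROM S3Object s",
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )

    # Join the event-stream payload before splitting: a JSON line may span two events
    payload = b"".join(
        ev["Records"]["Payload"] for ev in resp["Payload"] if "Records" in ev
    )
    keys = [json.loads(line)["primary_key"] for line in payload.decode().splitlines()]

    # SQS accepts at most 10 messages per SendMessageBatch call
    for chunk in batched(keys, 10):
        sqs.send_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": str(i), "MessageBody": k} for i, k in enumerate(chunk)],
        )
```

For a 2M-row file the key list may exceed what one enqueue loop can push within the Lambda timeout, so in practice you may want to paginate with a ScanRange or fan the enqueueing out as well.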
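The Writer side might look like the sketch below: the SQS event delivers a batch of primary keys, the handler fetches just those rows with S3 Select, then writes them to DynamoDB. The bucket, key, table name, and primary_key column are again hypothetical placeholders:

```python
import json

try:
    import boto3  # present in the Lambda runtime; optional for local testing of the helper
except ImportError:
    boto3 = None

BUCKET = "my-bucket"        # hypothetical bucket name
KEY = "data/records.csv"    # hypothetical object key
TABLE = "my-table"          # hypothetical DynamoDB table name


def build_expression(keys):
    """Quote the batch's primary keys into an S3 Select IN clause."""
    quoted = ", ".join("'" + k.replace("'", "''") + "'" for k in keys)
    return f"SELECT * FROM S3Object s WHERE s.primary_key IN ({quoted})"


def handler(event, context):
    """Writer: fetch the batch's records from the CSV and write them to DynamoDB."""
    # Each SQS message body is one primary key, as enqueued by the Reader
    keys = [record["body"] for record in event["Records"]]

    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=BUCKET,
        Key=KEY,
        ExpressionType="SQL",
        Expression=build_expression(keys),
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        OutputSerialization={"JSON": {}},
    )
    payload = b"".join(
        ev["Records"]["Payload"] for ev in resp["Payload"] if "Records" in ev
    )
    items = [json.loads(line) for line in payload.decode().splitlines()]

    # batch_writer handles the 25-item BatchWriteItem limit and retries for us
    table = boto3.resource("dynamodb").Table(TABLE)
    with table.batch_writer() as writer:
        for item in items:
            writer.put_item(Item=item)
```

If the handler raises, SQS redelivers the batch and, after the configured retry count, moves it to the Dead Letter Queue, which is what makes the failed-batch capture described above work.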
Answered By - fedonev
Answer Checked By - Willingham (PHPFixing Volunteer)