How to quickly load a large amount of records to AWS DynamoDB

This article describes a journey to make a Node.js program inserting ~1.7M records to DynamoDB faster, reducing the running time from the initial ~52 hours down to ~5 minutes. The source code for this program is available here.

Problem to solve

As part of an on-going project, I need to create a look-up database of geo locations for all UK postcodes which number ~1.7M records. Being a key-value data store that can scale horizontally on demand, AWS DynamoDB is the perfect candidate for the use case.

Information about each postcode is stored as an object in DynamoDB, with postcode being the key and latitude and longitude as attributes. The latitude/longitude information of each UK postcode is publicly available and updated on regular basis here.

Given that:

There are more than 1.7 million (1,770,724 as at November 2020 ) unit postcodes. This number changes regularly as around 2,750 postcodes are created and 2,500 are terminated each month.

This program is expected to execute on a regular basis, perhaps once per month, to keep the data up to date.

Lessons

Follows is the summary of some of the lessons during the process:

  1. readline module (part of Node.js core) provides a convenient way to read text file line by line. The for await .. of loop enabled by readline allows the program to be async/await and therefore easier to read (i.e. than using callbacks).
  2. Use the .promise() method in AWS Javascript SDK to return a promise that can be used in async/await code.
  3. BatchWriteItem() inserts items in the batch and overwrite existing objects with same keys. It does not do partial item update and has the limit of 25 objects or 16MB of data.
  4. AWS.DynamoDB.DocumentClient class handles concurrency under the hood, allowing multiple requests to DynamoDB to be made at the same time.
  5. Each AWS Lambda has access to 512MB of data storage under /temp directory.
  6. Enable HTTP keep-alive can help make multiple requests to DynamoDB (much) faster. In this particular case, the execution time is cut by half with this change. Remember, since AWS SDK v2.463.0, all you need to do is to set the environment variableAWS_NODEJS_CONNECTION_REUSE_ENABLED to 1.
  7. zlib module in Node.js core works with gzip standard/format, which is popular in *nix systems. It does not work with zip format which is popular in Windows systems though.

Create a DynamoDB table

DynamoDB requires tables to be created explicitly with a specified schema before data can be inserted. Although the schema only needs to cover the keys (hash key and range key), each actual object can have any properties. In this fixed schema vs. schemaless distribution, DynamoDB is somewhere between SQL and MongoDB.

There are a number of ways to create a DynamoDB table, including using AWS web console, running a CloudFormation template or executing a AWS CLI command.

Following is a Node.js program using AWS Javascript SDK to create the Postcodes table in DynamoDB.

Downloading source file

We need the source file containing UK postcodes and geo information for the program to run. You can download the file and unzip it manually from here, or run the following two commands on a Mac OS’ terminal:

First attempt: Processing line by line

I make use of readline module in Node.js core to read the source text file line by line. The module provides an asynchronous iterator which can be used with a for await .. of loop.

Using the .promise() method by AWS SDK to return a promise, I managed to make the whole program async/await and avoid any callbacks 🎉

Once read, each line is converted into an object to be inserted into DynamoDB. There are guardrails to filter out nil objects are there are a small number of UK postcodes with geo information.

This program works, but it takes a (very long) time to iterate through the whole UK postcodes. Running from my development machine with relatively fast internet access (110Mbps download, 9.9Mbps upload), the program took 107 seconds to send the first 1000 postcodes from the list to DynamoDB. The extrapolated time taken to finish off the whole file of 1,769,501 items is 189,336 seconds = 52.6 hours!

An observation from the running of this program is inserting data to DynamoDB is the overwhelmingly time consuming part. Indeed it takes Node.js only ~2.6 seconds to go through the whole of ukpostcodes.csv file and convert each line into an object. The rest of the time is taken for round trips to DynamoDB.

We can reduce the number of round trips by making use of BatchWriteItem .

Second attempt: Writing to DynamoDB in batches

According to DynamoDB’s documentation

A single call to BatchWriteItem can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.

As each item in this program is a small object, the limit to take note here is 25 items per batch request.

I updated the program to accumulate postcode objects into an array. When the array reaches 25 items, it is converted into a batch request and written into DynamoDB.

Loading 1000 items, i.e. 40 batches of 25, into DynamoDB takes ~4.5 seconds. The use of BatchWriteItem makes our program ~24 times faster! Indeed, given the size of our objects, there is little time difference between inserting one object and inserting a batch of 25 objects.

The improvement is great, however, it would still take more than 2h for the whole UK postcodes file to be processed. Time for more optimisations.

Third attempt: Making concurrent requests to DynamoDB

The previous version of the program is non-concurrent. It reads 25 lines of text and convert them into objects and make a request to DynamoDB. Whilst the request is being made, the program does an await, which means nothing is being done. At any one time, a maximum of one (batch) request is made to DynamoDB.

Javascript is a single-threaded language, which means no parallelism. But it does not stop us from introducing some concurrency. This is what AWS’ documentation has to say about the subject:

If you use a programming language that supports concurrency, you can use threads to write items in parallel. Your application must include the necessary logic to manage the threads. With languages that don’t support threading, you must update or delete the specified items one at a time. In both situations, BatchWriteItem performs the specified put and delete operations in parallel, giving you the power of the thread pool approach without having to introduce complexity into your application.

What we can do is not to await for every request to DynamoDB. That way, saveToDynamoDB() method would start a request and immediately return, allowing the main program to carry on with the execution whilst the request to DynamoDB is happening in the background. It is the AWS.DynamoDB.DocumentClient class that would handle the concurrent requests for us.

This is one method to achieve concurrency without explicit parallelism.

We need to be careful at this point though. If we don’t ever await, our main program may finish whilst some requests have not reached DynamoDB. In such case, those requests will be ignored and some UK postcodes would go missing.

The parameter concurrentRequests is introduced to control the level of write requests concurrency. The main program would stop and await a bulk of concurrentRequests i.e. 40, requests to DynamoDB. This is an important lever for us to control concurrency to avoid exhausting the provisioned throughput of the DynamoDB table.

Setting the value of concurrentRequests to 1 and the program behaves like the previous version, i.e. no concurrency. The main program would stop and wait for every request to return before carrying on. Setting the value to a large enough value, i.e. 1.7M / 4 = 68,000, and the program would only only wait once at the end, the concurrency level would be solely in the default settings of the AWS.DynamoDB.DocumentClient class.

A benchmark with 40 currentRequests indicates it takes an average of ~0.5 second for 1000 items to be inserted into DynamoDB 🚀. This is an order of magnitude faster than the non-concurrent attempt.

At this point, a complete load of all UK postcodes to DynamoDB from my development machine would take ~15 minutes. Previously, I mentioned that it took Node.js only 2.6 seconds to go through the whole text file convert each line into an object. The bulk of the 15 mins time taken is for data to be sent on my ADSL to DynamoDB.

We need to bring the compute medium closer to the data storage.

Fourth attempt: Bringing the data closer to DynamoDB

The main idea here is rather simple: Rather than running the process on a development machine to upload the data to DynamoDB and be subjected to limited data uploading bandwidth, I can run the process in AWS itself to benefit from the intra-data-centre connectivity.

For this kind of data processing, there is no better compute medium than Lambdas. The current running time of ~15 mins for the latest version of this problem brings it right into the Lambda’s territory (of which the max time out is coincidentally 15 mins at the time of writing). Although our bet here is the low-latency connectivity between Lambda and DynamoDB would cut those 15 mins further.

We will make use a the wonderful Serverless framework to help with infrastructure as code. The main program to read the source file and insert batches of postcodes into DynamoDB concurrently is the same as before. Although a few changes to the program have been made:

  1. The Postcodes table in DynamoDB is provisioned as part of a CloudFormation template in serverless.yml. The table will be created when the project is deployed the first time.
  2. The methods to download the ukpostcodes.zip file and unzip it are written in Node.js and are triggered every time the Lambda is invoked.
  3. We make use of the /tmp folder of the Lambda’s runtime to store temporary files.

Once the project has been deployed, a lambda with the name of etl-ddb-dev-loadPostcodes will be created. The whole postcodes loading process can be triggered using AWS SDK:

It takes the Lambda ~301 seconds to finish off the whole UK postcodes job, including downloading the file and unzipping it. This is 3 times faster than the best I could manage on my development machine 🎉

The source code of the program with a detailed README is available here.

Fourth and a quarter attempt

I have heard about Yan Cui’s magic tip to make multiple operations to DynamoDB faster but have not got a chance to put it into action.

The premise of the optimisation is simple: Node.js’s default HTTP agent does not recycle TCP connections. A new TCP connection is set up per HTTP request. The initial three-way TCP handshake is relatively expensive, sometimes more so than the operation itself.

In our case, there are thousands of operations to be performed, and therefore, thousands of expensive handshakes to be established unnecessarily.

Since we use the latest AWS SDK that is provided inside the Lambda runtime, all we need to do to enforce TCP connection recycling is setting the environment variable AWS_NODEJS_CONNECTION_REUSE_ENABLED to 1

The magic setting works like a valid spell. It shrinks the time taken to nearly half, down to ~154 seconds. This is the most simple optimisation that should always be done for any Node.js program working with DynamoDB.

Final words

There are things we can do to further optimise the process. There always are. One idea is to run multiple processes in parallel for different part of the postcode file. DynamoDB automatically scales out horizontally so we don’t need to worry about the database being the bottle neck. However, I should stop here as final result is good enough for my case.

If you follow till here 🙏 please let me know if you find any of the techniques here useful for you.

Software developer

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store