This article describes a journey to make a Node.js program that inserts ~1.7M records into DynamoDB faster, reducing the running time from an initial ~52 hours down to ~5 minutes. The source code for this program is available here.
Problem to solve
As part of an ongoing project, I need to create a look-up database of geo locations for all UK postcodes, which number ~1.7M records. Being a key-value data store that can scale horizontally on demand, AWS DynamoDB is the perfect candidate for this use case.
Information about each postcode is stored as an object in DynamoDB, with `postcode` being the key and `latitude`/`longitude` as attributes. The latitude/longitude information of each UK postcode is publicly available and updated on a regular basis here.
There are more than 1.7 million (1,770,724 as of November 2020) unit postcodes. This number changes regularly, as around 2,750 postcodes are created and 2,500 are terminated each month.
This program is expected to execute on a regular basis, perhaps once per month, to keep the data up to date.
The following is a summary of some of the lessons learned during the process:

- The `readline` module (part of Node.js core) provides a convenient way to read a text file line by line. The `for await .. of` loop enabled by `readline` allows the program to use `async/await` and is therefore easier to read than using callbacks.
- `BatchWriteItem()` inserts items in a batch and overwrites existing objects with the same keys. It does not do partial item updates, and it has a limit of 25 objects or 16MB of data per call.
- The `AWS.DynamoDB.DocumentClient` class handles concurrency under the hood, allowing multiple requests to DynamoDB to be made at the same time.
- Each AWS Lambda has access to 512MB of data storage under `/tmp`.
- Enabling HTTP keep-alive can make multiple requests to DynamoDB (much) faster. In this particular case, the execution time was cut in half by this change. Since AWS SDK v2.463.0, all you need to do is set the environment variable `AWS_NODEJS_CONNECTION_REUSE_ENABLED`.
- The `zlib` module in Node.js core works with the `gzip` standard/format, which is popular on *nix systems. It does not work with the `zip` format, which is popular on Windows systems, though.
Create a DynamoDB table
DynamoDB requires tables to be created explicitly, with a specified schema, before data can be inserted. Although the schema only needs to cover the keys (hash key and range key), each actual object can have any properties. On this fixed-schema vs. schemaless spectrum, DynamoDB sits somewhere between SQL and MongoDB.
There are a number of ways to create a DynamoDB table, including using the AWS web console, running a CloudFormation template or executing an AWS CLI command.
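For illustration, the CLI route might look like this (the `Postcodes` table name and `postcode` hash key match the rest of the article; on-demand billing is my own choice here to avoid specifying provisioned throughput):

```shell
aws dynamodb create-table \
  --table-name Postcodes \
  --attribute-definitions AttributeName=postcode,AttributeType=S \
  --key-schema AttributeName=postcode,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST \
  --region eu-west-1
```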
Downloading the source file
We need the source file containing UK postcodes and geo information for the program to run. You can download the file and unzip it manually from here, or run the following two commands in a macOS terminal:
```shell
$ curl https://www.freemaptools.com/download/full-postcodes/ukpostcodes.zip --output ukpostcodes.zip
$ unzip -a ukpostcodes.zip
```
First attempt: Processing line by line
I make use of the `readline` module in Node.js core to read the source text file line by line. The module provides an asynchronous iterator which can be used with a `for await .. of` loop. Combined with the `.promise()` method provided by the AWS SDK to turn each request into a promise, I managed to make the whole program `async/await` and avoid any callbacks 🎉
Once read, each line is converted into an object to be inserted into DynamoDB. There are guardrails to filter out nil objects, as there are a small number of UK postcodes without geo information.
This program works, but it takes a (very long) time to iterate through the whole set of UK postcodes. Running from my development machine with relatively fast internet access (110Mbps download, 9.9Mbps upload), the program took 107 seconds to send the first 1,000 postcodes from the list to DynamoDB. The extrapolated time to finish the whole file of 1,769,501 items is 189,336 seconds = 52.6 hours!
An observation from running this program is that inserting data into DynamoDB is overwhelmingly the most time-consuming part. Indeed, it takes Node.js only ~2.6 seconds to go through the whole ukpostcodes.csv file and convert each line into an object. The rest of the time is taken up by round trips to DynamoDB.
We can reduce the number of round trips by making use of `BatchWriteItem`.
Second attempt: Writing to DynamoDB in batches
According to DynamoDB’s documentation:

> A single call to `BatchWriteItem` can write up to 16 MB of data, which can comprise as many as 25 put or delete requests. Individual items to be written can be as large as 400 KB.
As each item in this program is a small object, the limit to note here is 25 items per batch request.
I updated the program to accumulate postcode objects into an array. When the array reaches 25 items, it is converted into a batch request and written into DynamoDB.
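A sketch of that batching logic follows. The helper names `toBatchRequest` and `loadInBatches` are mine; the params shape is the one `DocumentClient.batchWrite()` expects in aws-sdk v2:

```javascript
// BatchWriteItem accepts at most 25 put/delete requests per call.
const BATCH_SIZE = 25

// Shape up to 25 items into the params expected by DocumentClient.batchWrite().
const toBatchRequest = (items) => ({
  RequestItems: {
    Postcodes: items.map((item) => ({ PutRequest: { Item: item } })),
  },
})

// items can be any (async) iterable of postcode objects, e.g. produced by
// the readline loop from the first attempt.
const loadInBatches = async (items) => {
  const AWS = require('aws-sdk') // required lazily; assumes aws-sdk v2
  const documentClient = new AWS.DynamoDB.DocumentClient({ region: 'eu-west-1' })
  let buffer = []
  for await (const item of items) {
    buffer.push(item)
    if (buffer.length === BATCH_SIZE) {
      await documentClient.batchWrite(toBatchRequest(buffer)).promise()
      buffer = []
    }
  }
  // Flush the final, possibly partial, batch.
  if (buffer.length > 0) {
    await documentClient.batchWrite(toBatchRequest(buffer)).promise()
  }
}
```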
Loading 1,000 items, i.e. 40 batches of 25, into DynamoDB takes ~4.5 seconds. The use of `BatchWriteItem` makes our program ~24 times faster! Indeed, given the size of our objects, there is little time difference between inserting one object and inserting a batch of 25 objects.
The improvement is great; however, it would still take more than two hours to process the whole UK postcodes file. Time for more optimisations.
Third attempt: Making concurrent requests to DynamoDB
The previous version of the program is non-concurrent. It reads 25 lines of text, converts them into objects and makes a request to DynamoDB. Whilst the request is being made, the program does an `await`, which means nothing else is being done. At any one time, a maximum of one (batch) request is in flight to DynamoDB.
DynamoDB’s documentation again:

> If you use a programming language that supports concurrency, you can use threads to write items in parallel. Your application must include the necessary logic to manage the threads. With languages that don’t support threading, you must update or delete the specified items one at a time. In both situations, `BatchWriteItem` performs the specified put and delete operations in parallel, giving you the power of the thread pool approach without having to introduce complexity into your application.
What we can do is not `await` every request to DynamoDB. That way, the `saveToDynamoDB()` method starts a request and immediately returns, allowing the main program to carry on with its execution whilst the request to DynamoDB happens in the background. It is the `AWS.DynamoDB.DocumentClient` class that handles the concurrent requests for us. This is one way to achieve concurrency without explicit parallelism.
We need to be careful at this point though. If we never `await`, our main program may finish whilst some requests have not yet reached DynamoDB. In that case, those requests would be lost and some UK postcodes would go missing.
A variable `concurrentRequests` is introduced to control the level of write-request concurrency. The main program stops and `await`s a bulk of `concurrentRequests`, e.g. 40, requests to DynamoDB. This is an important lever for us to control concurrency and avoid exhausting the provisioned throughput of the DynamoDB table.
Setting the value of `concurrentRequests` to 1 makes the program behave like the previous version, i.e. with no concurrency: the main program stops and waits for every request to return before carrying on. Setting it to a large enough value, i.e. 1.7M / 25 = 68,000, means the program would only wait once at the end, and the concurrency level would be determined solely by the default settings of the `AWS.DynamoDB.DocumentClient`.
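The control flow described above can be sketched as follows. `saveToDynamoDB` stands for the batch write from the previous attempt, and the function name `loadConcurrently` is mine:

```javascript
const concurrentRequests = 40

const loadConcurrently = async (batches, saveToDynamoDB) => {
  let pending = []
  for (const batch of batches) {
    // Start the request but do NOT await it yet; it proceeds in the background.
    pending.push(saveToDynamoDB(batch))
    if (pending.length === concurrentRequests) {
      // The lever: stop and wait for the whole bulk before firing more.
      await Promise.all(pending)
      pending = []
    }
  }
  // Final await so no in-flight request is lost when the program exits.
  await Promise.all(pending)
}
```

Note the final `Promise.all`: without it, any batches fired after the last full bulk would be abandoned when the program finishes, which is exactly the missing-postcodes hazard described above.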
A benchmark with `concurrentRequests` set to 40 indicates it takes an average of ~0.5 seconds for 1,000 items to be inserted into DynamoDB 🚀. This is an order of magnitude faster than the non-concurrent attempt.

At this point, a complete load of all UK postcodes into DynamoDB from my development machine would take ~15 minutes. Previously, I mentioned that it took Node.js only 2.6 seconds to go through the whole text file and convert each line into an object. The bulk of the 15 minutes is spent sending data over my ADSL connection to DynamoDB.
We need to bring the compute medium closer to the data storage.
Fourth attempt: Bringing the data closer to DynamoDB
The main idea here is rather simple: Rather than running the process on a development machine to upload the data to DynamoDB and be subjected to limited data uploading bandwidth, I can run the process in AWS itself to benefit from the intra-data-centre connectivity.
For this kind of data processing, there is no better compute medium than Lambda. The current running time of ~15 minutes for the latest version of the program brings it right into Lambda territory (whose maximum timeout, coincidentally, is 15 minutes at the time of writing). Our bet here is that the low-latency connectivity between Lambda and DynamoDB will cut those 15 minutes further.
We will make use of the wonderful Serverless framework to help with infrastructure as code. The main program to read the source file and insert batches of postcodes into DynamoDB concurrently is the same as before, although a few changes have been made:
- The `Postcodes` table in DynamoDB is provisioned as part of a CloudFormation template in `serverless.yml`. The table will be created when the project is deployed for the first time.
- The methods to download the `ukpostcodes.zip` file and unzip it are written in Node.js and are triggered every time the Lambda is invoked.
- We make use of the `/tmp` folder of the Lambda’s runtime to store temporary files.
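The table provisioning might look something like this in `serverless.yml` (a sketch; the logical resource name and the on-demand billing mode are my assumptions):

```yaml
resources:
  Resources:
    PostcodesTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: Postcodes
        AttributeDefinitions:
          - AttributeName: postcode
            AttributeType: S
        KeySchema:
          - AttributeName: postcode
            KeyType: HASH
        BillingMode: PAY_PER_REQUEST
```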
Once the project has been deployed, a Lambda with the name `etl-ddb-dev-loadPostcodes` will be created. The whole postcode-loading process can be triggered using the AWS CLI:
```shell
aws lambda invoke --region eu-west-1 \
  --function-name etl-ddb-dev-loadPostcodes out --log-type Tail \
  --query 'LogResult' --output text | base64 -d
```
It takes the Lambda ~301 seconds to finish the whole UK postcodes job, including downloading and unzipping the file. This is 3 times faster than the best I could manage on my development machine 🎉
The source code of the program with a detailed README is available here.
Fourth and a quarter attempt
I had heard about Yan Cui’s magic tip to make multiple operations to DynamoDB faster but had not had a chance to put it into action.
The premise of the optimisation is simple: Node.js’s default HTTP agent does not recycle TCP connections; a new TCP connection is set up per HTTP request. The initial three-way TCP handshake is relatively expensive, sometimes more so than the operation itself.

In our case, there are thousands of operations to be performed and, therefore, thousands of expensive handshakes performed unnecessarily.
Since we use the latest AWS SDK provided inside the Lambda runtime, all we need to do to enable TCP connection reuse is set the environment variable `AWS_NODEJS_CONNECTION_REUSE_ENABLED` to `1`:

```javascript
process.env.AWS_NODEJS_CONNECTION_REUSE_ENABLED = '1'
```
The magic setting works like a charm. It shrinks the time taken to nearly half, down to ~154 seconds. This is the simplest optimisation and should always be applied to any Node.js program working with DynamoDB.
There are things we can do to further optimise the process. There always are. One idea is to run multiple processes in parallel for different parts of the postcode file. DynamoDB scales out horizontally on demand, so we don’t need to worry about the database being the bottleneck. However, I should stop here as the final result is good enough for my case.
If you have followed till here 🙏 please let me know if you find any of the techniques useful.