Lusha is a hypergrowth company: in the past year we have grown rapidly in number of employees, customers, scale, traffic, and more.
In the engineering department, we often face challenges that come with this fast-paced growth in scale, traffic, and ever-growing data. One such challenge occurred when a major feature that uploads large CSV files to Amazon S3 stopped working because of performance and memory issues.
In this blog post, I will describe the challenge we had with this feature, which directly impacted Lusha’s customers.
The Problem
In the Lusha dashboard UI, customers can add contacts to their personal saved lists. These lists can potentially contain a large number of contacts, and the dashboard includes a button to export them all to a CSV file.
When the customer presses Export to CSV, they are asked to type their email address and then receive an email with a download link to a CSV file containing all of their contact information.
Behind the scenes, we send a RabbitMQ message containing all the UI filters the customer selected. On the backend, a server consumes the message, pulls the data from our database (we use Cassandra as our main contact database), generates the CSV file on disk, uploads the file to Amazon S3, and sends the customer an email with a download link.
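For illustration, that message could look something like the sketch below (the exact field names are an assumption, not our actual payload):

```javascript
// Hypothetical shape of the RabbitMQ export message -- field names are illustrative
const exportRequest = {
  email: 'customer@example.com',  // where the download link will be sent
  filters: {
    listId: 'saved-list-123',     // which saved list to export
    // ...plus any other UI filters the customer selected
  },
};
```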
Here’s an example of a code snippet on the backend server:
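The sketch below assumes the cassandra-driver, csv-stringify, and AWS SDK packages; the client setup, table, bucket, and file names are illustrative, not our actual configuration:

```javascript
// A sketch of the original flow -- everything is buffered in memory before the upload
const cassandra = require('cassandra-driver');
const { stringify } = require('csv-stringify');
const fs = require('fs');
const AWS = require('aws-sdk');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'contacts_ks',
});
const s3 = new AWS.S3();

const contacts = []; // global variable that accumulates every row

function exportContacts(filters, done) {
  const query = 'SELECT * FROM contacts_by_list WHERE list_id = ?';

  client.eachRow(query, [filters.listId], { prepare: true, autoPage: true },
    (n, row) => {
      // every result from the Cassandra query is pushed into the global array
      contacts.push(row);
    },
    (err) => {
      if (err) return done(err);
      // the whole in-memory array is turned into one CSV string and written to disk
      stringify(contacts, { header: true }, (err, csv) => {
        if (err) return done(err);
        fs.writeFileSync('/tmp/contacts.csv', csv);
        // the file is then uploaded to S3; the email with the link is sent afterwards
        s3.upload({
          Bucket: 'exports-bucket',
          Key: `contacts-${filters.listId}.csv`,
          Body: fs.createReadStream('/tmp/contacts.csv'),
        }, done);
      });
    });
}
```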
Pay attention to how every result from the Cassandra query is pushed into a global variable named contacts.
The code works fine for hundreds of results, but it’s not designed to handle hundreds of thousands. When faced with this much data, the Node.js process crashes with an error message: JavaScript heap out of memory.
The Solution
Node.js stream API to the rescue!
- Use the Cassandra driver’s stream API to fetch the results (and drop the global variable that stores the entire result set in memory),
- Create the CSV using csv-stringify, which supports the stream API,
- Use a Node.js stream.PassThrough to upload the CSV directly to AWS S3.
Let’s see what the complete streaming flow looks like:
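Below is a minimal sketch of that flow, again assuming the cassandra-driver, csv-stringify, and AWS SDK packages (client setup, table, and bucket names are illustrative):

```javascript
// A sketch of the streaming flow -- rows are piped from Cassandra through csv-stringify
// straight into the S3 upload, so only a small buffer lives in memory at any time
const cassandra = require('cassandra-driver');
const { stringify } = require('csv-stringify');
const { PassThrough } = require('stream');
const AWS = require('aws-sdk');

const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'contacts_ks',
});
const s3 = new AWS.S3();

async function exportContacts(filters) {
  const query = 'SELECT * FROM contacts_by_list WHERE list_id = ?';

  // 1. Fetch the results with the driver's stream API -- rows arrive page by page
  const rowStream = client.stream(query, [filters.listId], { prepare: true, fetchSize: 1000 });

  // 2. Turn each row into a CSV line on the fly
  const csvStream = stringify({ header: true });

  // 3. Pipe everything through a PassThrough that S3 consumes as the upload body
  const uploadStream = new PassThrough();
  rowStream.pipe(csvStream).pipe(uploadStream);

  const upload = s3.upload({
    Bucket: 'exports-bucket',
    Key: `contacts-${filters.listId}.csv`,
    Body: uploadStream,
    ContentType: 'text/csv',
  });

  // once the upload finishes we get the file's URL, which goes into the customer's email
  const { Location } = await upload.promise();
  return Location;
}
```

The PassThrough stream is what lets s3.upload consume the CSV as it is produced, so the full file never has to sit on disk or in memory.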
In conclusion
In a hypergrowth company that deals with ever-growing quantities of data and large-scale services, the limits of your code’s processing capacity become a real concern, and you need to think about memory, CPU, and performance.
The Node.js stream API is powerful: it can minimize memory consumption, improve performance, and it is very simple to use.
I hope this will help you solve the next memory or performance issue you encounter at your company.