
Reading a CSV File into an Array
Simply put, a CSV (Comma-Separated Values) file is a text file format that uses commas to separate individual values. In
this tutorial, I will demonstrate how to use Node.js streams to read CSV files and convert them into two-dimensional
arrays of strings (Array<Array<string>>). We will use the Pokemon file as an example.
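To make the target shape concrete, here is a minimal sketch of turning CSV text into an Array<Array<string>>; the two Pokemon rows are invented sample data, not the real file:

```typescript
// A tiny, made-up sample in the shape of the Pokemon CSV.
const csvText = "name,type\nPikachu,Electric\nBulbasaur,Grass";

// Split into lines, then each line into columns: Array<Array<string>>.
const rows: string[][] = csvText.split("\n").map((line) => line.split(","));
// rows[0] is the header row: ["name", "type"]
```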
1. Reading a CSV File
In Node.js, the fs module offers two approaches for interacting with files, through fs and fs/promises. The first
follows the standard POSIX function model, with callback-based functions and synchronous *Sync variants, while the
second provides a promise-based API. It is crucial to note that these operations are not synchronized or "thread-safe",
requiring caution with concurrent operations.
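As a quick illustration of the two flavors, the sketch below reads the same file with both APIs; the temp-file path and its contents are made up for the demo:

```typescript
import { readFileSync, writeFileSync } from "fs";
import { readFile } from "fs/promises";
import { join } from "path";
import { tmpdir } from "os";

// Hypothetical demo file; any small text file would do.
const demoPath = join(tmpdir(), "fs-demo.csv");
writeFileSync(demoPath, "name,type\nPikachu,Electric");

// POSIX-style synchronous read: blocks until the whole file is in memory.
const syncContent = readFileSync(demoPath, "utf-8");

// Promise-based read from fs/promises: same result, without blocking the event loop.
const asyncContent = await readFile(demoPath, "utf-8");
```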
1.1 Using ReadStream to Read CSV Files
ReadStream, a concrete implementation of Stream in Node.js, simplifies handling streaming data. Streams can be readable, writable, or duplex; ReadStream is a readable stream that pulls data from a file efficiently. Let's look at a practical example of how to use it:
import { createReadStream } from "fs";

const csvData = await createReadStream(pathToFile, {
  encoding: "utf-8",
}).toArray();
This approach produces an array of string chunks; for a small file, that is often a single element containing the whole
file (note that Readable.toArray() is still marked experimental in Node.js). Although simple, it is not the most
efficient option for large files, as it loads all the content into memory. For use cases where this is not a problem, it
is possible to split the result into an array of lines using the split method:
import { createReadStream } from "fs";

const NEW_LINE = "\n";
const COMMA_DELIMITER = ",";

const csvData: string[] = await createReadStream(pathToFile, {
  encoding: "utf-8",
}).toArray();

// Join the chunks before splitting, so a line that spans two chunks stays whole.
const rows = csvData.join("").split(NEW_LINE).map((line) => line.split(COMMA_DELIMITER));
To process the file in parts, or "chunks", we use the Transform class. This class allows converting data chunks from
the ReadStream into a specific format. In our context, we will transform the buffer into an array of lines, and
subsequently each line into an array of columns:
import { Transform, TransformCallback } from "stream";

class LineTransform extends Transform {
  #lastLine = "";

  _transform(chunk: Buffer, encoding: BufferEncoding, callback: TransformCallback) {
    const chunkAsString = (this.#lastLine + chunk.toString()).split(NEW_LINE);
    this.#lastLine = chunkAsString.pop() || "";
    const lines = chunkAsString.map((line) => line.split(COMMA_DELIMITER));
    this.push(lines, encoding);
    callback();
  }

  _flush(callback: TransformCallback): void {
    if (this.#lastLine.length) {
      this.push([this.#lastLine.split(COMMA_DELIMITER)]);
    }
    callback();
  }
}
The _transform method handles a data chunk, converting it into an array of lines before forwarding it to the next
stream, while _flush is called at the end of reading to process any remaining data. An important detail is that a chunk
may end in the middle of a line; the #lastLine field therefore stores the final, possibly incomplete line of each chunk
so it can be prepended to the next one, preserving data integrity.
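To see this boundary handling in action, the sketch below feeds the transform two chunks that cut a row in half; the class is repeated from above so the snippet runs on its own, and the Pokemon rows are invented sample data:

```typescript
import { Transform, TransformCallback } from "stream";
import { once } from "events";

const NEW_LINE = "\n";
const COMMA_DELIMITER = ",";

// Same LineTransform as above, repeated so this snippet is self-contained.
class LineTransform extends Transform {
  #lastLine = "";

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback) {
    const parts = (this.#lastLine + chunk.toString()).split(NEW_LINE);
    this.#lastLine = parts.pop() || "";
    this.push(parts.map((line) => line.split(COMMA_DELIMITER)));
    callback();
  }

  _flush(callback: TransformCallback): void {
    if (this.#lastLine.length) {
      this.push([this.#lastLine.split(COMMA_DELIMITER)]);
    }
    callback();
  }
}

const transform = new LineTransform({ objectMode: true });
const rows: string[][] = [];
transform.on("data", (lines: string[][]) => rows.push(...lines));

// The second chunk starts mid-row: "Elec" + "tric" must be stitched back together.
transform.write("name,type\nPikachu,Elec");
transform.write("tric\nBulbasaur,Grass");
transform.end();
await once(transform, "end");
// rows: [["name","type"], ["Pikachu","Electric"], ["Bulbasaur","Grass"]]
```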
Integrating ReadStream with LineTransform, it is possible to read the file in chunks efficiently. A few configurations
stand out: objectMode is enabled so the stream carries parsed objects instead of buffers, highWaterMark is lowered to
512 bytes (the default is 16 KB) to control chunk size, and the iterator method is used to iterate over the processed
chunks.
import { createReadStream } from "fs";
import { join } from "path";

const csvToRow = new LineTransform({
  objectMode: true,
});

const readFileWithReadStream = async (fileFolder: string) => {
  const pathToFile = join(process.cwd(), fileFolder);
  const readStreamTransform = createReadStream(pathToFile, {
    encoding: "utf-8",
    highWaterMark: 512,
  }).pipe(csvToRow);
  const streamIterator = readStreamTransform.iterator();
  for await (const chunk of streamIterator) {
    // process chunk (a string[][] of parsed rows)
  }
};
I hope this guide has been informative and useful for your Node.js applications. This tutorial illustrates just one facet of the possibilities offered by file handling in Node.js, anticipating more advanced techniques and optimizations that will be covered in future content.