
Reading a CSV File into an Array
Simply put, a CSV (Comma-Separated Values) file is a text file format that uses commas to separate individual values. In
this tutorial, I will demonstrate how to use Node.js streams to read CSV files and convert them into two-dimensional
arrays of strings (Array<Array<string>>). We will use the Pokemon file as an example.
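To make the target shape concrete, here is a minimal sketch of turning CSV text into an Array<Array<string>>; the two Pokemon rows are invented sample data, not the real file:

```typescript
// A tiny, made-up sample in the shape of the Pokemon CSV.
const csvText = "name,type\nPikachu,Electric\nBulbasaur,Grass";

// Split into lines, then each line into columns: Array<Array<string>>.
const rows: string[][] = csvText.split("\n").map((line) => line.split(","));
// rows[0] is the header row: ["name", "type"]
```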
1. Reading a CSV File
In Node.js, the fs module offers two approaches for interacting with files, through fs and fs/promises. The first
follows the standard POSIX function model, with callback-based functions and synchronous *Sync variants, while the
second provides a promise-based API. It is crucial to note that these operations are not synchronized or "thread-safe",
requiring caution with concurrent operations.
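As a quick illustration of the two flavors, the sketch below reads the same file with both APIs; the temp-file path and its contents are made up for the demo:

```typescript
import { readFileSync, writeFileSync } from "fs";
import { readFile } from "fs/promises";
import { join } from "path";
import { tmpdir } from "os";

// Hypothetical demo file; any small text file would do.
const demoPath = join(tmpdir(), "fs-demo.csv");
writeFileSync(demoPath, "name,type\nPikachu,Electric");

// POSIX-style synchronous read: blocks until the whole file is in memory.
const syncContent = readFileSync(demoPath, "utf-8");

// Promise-based read from fs/promises: same result, without blocking the event loop.
const asyncContent = await readFile(demoPath, "utf-8");
```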
1.1 Using ReadStream to Read CSV Files
ReadStream, a concrete implementation of Stream in Node.js, simplifies handling streaming data. Streams can be readable, writable, or duplex; ReadStream is a readable stream that pulls data from a file efficiently. Let's look at a practical example of how to use it:
import { createReadStream } from "fs";

const csvData = await createReadStream(pathToFile, {
  encoding: "utf-8",
}).toArray();
This approach produces an array of string chunks; for a small file, that is often a single element containing the whole
file (note that Readable.toArray() is still marked experimental in Node.js). Although simple, it is not the most
efficient option for large files, as it loads all the content into memory. For use cases where this is not a problem, it
is possible to split the result into an array of lines using the split method:
import { createReadStream } from "fs";

const NEW_LINE = "\n";
const COMMA_DELIMITER = ",";

const csvData: string[] = await createReadStream(pathToFile, {
  encoding: "utf-8",
}).toArray();

// Join the chunks before splitting, so a line that spans two chunks stays whole.
const rows = csvData.join("").split(NEW_LINE).map((line) => line.split(COMMA_DELIMITER));
To process the file in parts, or "chunks", we use the Transform class. This class allows converting data chunks from
the ReadStream into a specific format. In our context, we will transform the buffer into an array of lines, and
subsequently each line into an array of columns:
import { Transform, TransformCallback } from "stream";

class LineTransform extends Transform {
  #lastLine = "";

  _transform(chunk: Buffer, encoding: BufferEncoding, callback: TransformCallback) {
    const chunkAsString = (this.#lastLine + chunk.toString()).split(NEW_LINE);
    this.#lastLine = chunkAsString.pop() || "";
    const lines = chunkAsString.map((line) => line.split(COMMA_DELIMITER));
    this.push(lines, encoding);
    callback();
  }

  _flush(callback: TransformCallback): void {
    if (this.#lastLine.length) {
      this.push([this.#lastLine.split(COMMA_DELIMITER)]);
    }
    callback();
  }
}
The _transform method handles a data chunk, converting it into an array of lines before forwarding it to the next
stream, while _flush is called at the end of reading to process any remaining data. An important detail is that a chunk
may end in the middle of a line; the #lastLine field therefore stores the final, possibly incomplete line of each chunk
so it can be prepended to the next one, preserving data integrity.
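To see this boundary handling in action, the sketch below feeds the transform two chunks that cut a row in half; the class is repeated from above so the snippet runs on its own, and the Pokemon rows are invented sample data:

```typescript
import { Transform, TransformCallback } from "stream";
import { once } from "events";

const NEW_LINE = "\n";
const COMMA_DELIMITER = ",";

// Same LineTransform as above, repeated so this snippet is self-contained.
class LineTransform extends Transform {
  #lastLine = "";

  _transform(chunk: Buffer, _encoding: BufferEncoding, callback: TransformCallback) {
    const parts = (this.#lastLine + chunk.toString()).split(NEW_LINE);
    this.#lastLine = parts.pop() || "";
    this.push(parts.map((line) => line.split(COMMA_DELIMITER)));
    callback();
  }

  _flush(callback: TransformCallback): void {
    if (this.#lastLine.length) {
      this.push([this.#lastLine.split(COMMA_DELIMITER)]);
    }
    callback();
  }
}

const transform = new LineTransform({ objectMode: true });
const rows: string[][] = [];
transform.on("data", (lines: string[][]) => rows.push(...lines));

// The second chunk starts mid-row: "Elec" + "tric" must be stitched back together.
transform.write("name,type\nPikachu,Elec");
transform.write("tric\nBulbasaur,Grass");
transform.end();
await once(transform, "end");
// rows: [["name","type"], ["Pikachu","Electric"], ["Bulbasaur","Grass"]]
```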
Integrating ReadStream with LineTransform, it is possible to read the file in chunks efficiently. A few configurations
stand out: objectMode is enabled so the stream carries parsed objects instead of buffers, highWaterMark is lowered to
512 bytes (the default is 16 KB) to control chunk size, and the iterator method is used to iterate over the processed
chunks.
import { createReadStream } from "fs";
import { join } from "path";

const csvToRow = new LineTransform({
  objectMode: true,
});

const readFileWithReadStream = async (fileFolder: string) => {
  const pathToFile = join(process.cwd(), fileFolder);
  const readStreamTransform = createReadStream(pathToFile, {
    encoding: "utf-8",
    highWaterMark: 512,
  }).pipe(csvToRow);
  const streamIterator = readStreamTransform.iterator();
  for await (const chunk of streamIterator) {
    // process chunk (a string[][] of parsed rows)
  }
};
I hope this guide has been informative and useful for your Node.js applications. This tutorial illustrates just one facet of the possibilities offered by file handling in Node.js, anticipating more advanced techniques and optimizations that will be covered in future content.