Chunks

This article continues the discussion started in OCaml handles, where we explored the idea of a process and out_channel. In this article, we will explore in_channel. When we move house, it is common to pack our belongings into boxes; we can place several items in a box, which makes transportation easier. Imagine the work it would take to carry a pile of items one by one. Similarly, a program is better off handling data in bulk, as this allows it to deal with multiple elements at once.

Previously, we used out_channel to write to a file. Its direct counterpart for reading is the input channel, in_channel. Let us explore how we can read from a file. The first step is to understand how to open the input channel:

val open_in : string -> in_channel = <fun>
let open_in name =
  open_in_gen [Open_rdonly; Open_text] 0 name

Looking at the signature of the open_in function, we see that it receives a string, the file name, and returns an in_channel. The function is implemented in terms of open_in_gen, a more general function that also takes a list of open_flag values. The Open_rdonly flag indicates that the file will be opened for reading only, and Open_text indicates that it will be opened in text mode. The second argument is an integer that represents the file permissions; it only matters when a file is created, so it is ignored here, where we open an existing file for reading.
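To make the relationship between the two functions concrete, here is a minimal sketch. The names open_for_reading and try_open_in are hypothetical helpers introduced only for illustration; the second one relies on the fact that open_in raises Sys_error when the file cannot be opened.

```ocaml
(* Equivalent to open_in: read-only, text mode, permissions unused. *)
let open_for_reading (name : string) : in_channel =
  open_in_gen [Open_rdonly; Open_text] 0 name

(* Hypothetical helper: turn the Sys_error raised by open_in
   (for example when the file does not exist) into an option. *)
let try_open_in (name : string) : in_channel option =
  try Some (open_in name) with Sys_error _ -> None
```

Returning an option is only one possible design; letting Sys_error propagate is just as valid when the caller cannot do anything useful without the file.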

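After opening the channel, the next step is to read from it. One way to read a whole value at once is the input_value function:

val input_value : in_channel -> 'a

As we see in the signature, it receives an in_channel and returns a value of type 'a. It reads back a value that was previously written with output_value, and it cannot check that the data really has the type we ask for, so the type annotation on the result is our responsibility. Combining both, we can read files: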

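let read_greeting_file () =
  let dir = get_dir "greetings" in
  (* Filename.concat inserts the right directory separator for the platform *)
  let file = Filename.concat dir "greeting.txt" in
  let ic = open_in file in
  (* Read the stored value, then close the channel before returning it *)
  let content : string = input_value ic in
  close_in ic;
  content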

Just as we closed the out_channel with close_out, we close the in_channel with close_in.
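The pairing between output_value and input_value can be sketched as a round trip through a temporary file. The function roundtrip_greeting is a hypothetical name, and the use of binary-mode channels (open_out_bin and open_in_bin) is a choice made for this sketch because the stored data is binary rather than text.

```ocaml
(* Write a string with output_value, then read it back with input_value. *)
let roundtrip_greeting (greeting : string) : string =
  (* Filename.temp_file creates an empty temporary file and returns its path *)
  let path = Filename.temp_file "greeting" ".bin" in
  let oc = open_out_bin path in
  output_value oc greeting;
  close_out oc;
  let ic = open_in_bin path in
  (* The annotation is a promise to the compiler; it is not checked at run time *)
  let content : string = input_value ic in
  close_in ic;
  Sys.remove path;
  content
```

Reading the value back with a different annotation would still compile, which is why input_value should only be used on files whose contents we control.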

Huge Files

The previous example loads the entire file into memory, but what if the file is very large? There will likely not be enough memory to load the entire file. To handle large files, we can read the file in chunks. One way to do this is by using input_line:

val input_line: in_channel -> string

As we see in the signature, it receives an in_channel and returns a string. The function reads a line from the input channel and returns it without the newline character. If the end of the file is reached, the function raises the End_of_file exception. Thus, we can read a file line by line (this version collects the lines in a list for simplicity; for a truly huge file we would process each line as it arrives instead):

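let read_file_line_by_line ic =
  (* Matching on the exception keeps the recursive call in tail position,
     so a file with many lines does not exhaust the stack *)
  let rec read_lines acc =
    match input_line ic with
    | line -> read_lines (line :: acc)
    | exception End_of_file -> List.rev acc
  in
  let lines = read_lines [] in
  close_in ic;
  lines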
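When the file is too large to keep every line in a list, one option is to fold a function over the lines so that only one line is held at a time. The sketch below assumes OCaml 4.08 or later for Fun.protect; fold_lines and count_lines are hypothetical names used for illustration.

```ocaml
(* Fold f over the lines of a file, holding only one line at a time. *)
let fold_lines (f : 'acc -> string -> 'acc) (init : 'acc) (filename : string) : 'acc =
  let ic = open_in filename in
  let rec loop acc =
    match input_line ic with
    | line -> loop (f acc line)
    | exception End_of_file -> acc
  in
  (* Fun.protect closes the channel even if f raises *)
  Fun.protect ~finally:(fun () -> close_in ic) (fun () -> loop init)

(* Example use: count the lines without storing them. *)
let count_lines (filename : string) : int =
  fold_lines (fun n _ -> n + 1) 0 filename
```

The accumulator can be anything: a counter, a running checksum, or a small summary of the lines seen so far.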

Another way to read files in chunks is by using input:

val input : in_channel -> bytes -> int -> int -> int

In the signature, we see that it receives four arguments: an in_channel; a bytes value, which is the buffer where the data will be stored; an integer, which is the offset in the buffer where storing starts; and another integer, which is the maximum number of bytes to read. It returns the number of bytes actually read. The function reads up to n bytes from the input channel and stores them in the buffer starting at the offset; it may read fewer than n bytes even before the end of the file, so we should always use the returned count. For a nonzero request, a return value of 0 means the end of the file has been reached. Thus, we can read a file in chunks:

let read_file_in_chunks filename chunk_size =
  let ic = open_in filename in
  (* Allocate the buffer once and reuse it for every chunk *)
  let buffer = Bytes.create chunk_size in
  let rec loop () =
    (* input returns how many bytes it actually read; 0 means end of file *)
    let bytes_read = input ic buffer 0 chunk_size in
    if bytes_read > 0 then begin
      (* Process the chunk here; as an example, print it unchanged *)
      print_string (Bytes.sub_string buffer 0 bytes_read);
      loop ()
    end
  in
  (* Close the channel whether the loop finishes or raises (OCaml 4.08+) *)
  Fun.protect ~finally:(fun () -> close_in ic) loop
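The same loop can be separated from the processing step by passing the work as a callback. This is only a sketch: iter_chunks and file_size_by_chunks are hypothetical names, the default chunk size of 4096 is arbitrary, and the file is opened in binary mode so that byte counts match the file on every platform.

```ocaml
(* Call f on each chunk of the file. f receives the shared buffer and the
   number of valid bytes in it; it must not keep the buffer for later use. *)
let iter_chunks ?(chunk_size = 4096) (f : bytes -> int -> unit) (filename : string) : unit =
  let ic = open_in_bin filename in
  let buffer = Bytes.create chunk_size in
  let rec loop () =
    let n = input ic buffer 0 chunk_size in
    if n > 0 then begin
      f buffer n;
      loop ()
    end
  in
  Fun.protect ~finally:(fun () -> close_in ic) loop

(* Example use: add up the chunk sizes to get the file size in bytes. *)
let file_size_by_chunks (filename : string) : int =
  let total = ref 0 in
  iter_chunks ~chunk_size:4 (fun _ n -> total := !total + n) filename;
  !total
```

Because the buffer is reused, a callback that needs to keep the data should copy it, for instance with Bytes.sub_string buffer 0 n.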

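OCaml supports reading large files. The regular channel functions report sizes and positions as int, so the largest file they can address is max_int bytes: about 1 GB on 32-bit systems, where max_int is 2^30 - 1, and far more than any real file on 64-bit systems, where it is 2^62 - 1. The LargeFile module provides versions of the size, position, and seek functions that use int64, so they also work on these larger files on 32-bit systems. Using this module is left as an exercise for the reader. 😝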
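As a starting point for that exercise, here is a minimal sketch that uses LargeFile.in_channel_length and LargeFile.seek_in, both of which work with int64 values. The helper names file_length_int64 and read_byte_at are hypothetical, and Fun.protect assumes OCaml 4.08 or later.

```ocaml
(* Size of a file as an int64, so it is not limited by max_int. *)
let file_length_int64 (filename : string) : int64 =
  let ic = open_in_bin filename in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> LargeFile.in_channel_length ic)

(* Read the single byte stored at an int64 offset.
   Raises End_of_file if the offset is past the end of the file. *)
let read_byte_at (filename : string) (offset : int64) : char =
  let ic = open_in_bin filename in
  Fun.protect ~finally:(fun () -> close_in ic) (fun () ->
    LargeFile.seek_in ic offset;
    input_char ic)
```

On a 64-bit system the plain in_channel_length and seek_in behave the same for any realistic file; the int64 versions matter mostly when the code must also run on 32-bit targets.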