Exploring Static Data Race Freedom in OxCaml

I’m going to show you how data race freedom modes work in the experimental OxCaml branch of OCaml. I was recently part of a Tarides project testing them out; with the goal of evaluating their potential for making future multi-domain OCaml tools reality (those of you on our mailing list may know what I'm talking about!).

This post will showcase some of what we learned and should be interesting for anyone curious about programming using multiple domains, OxCaml, editors, or just OCaml in general!

What are Data Races? With Examples!

A data race is a specific kind of race condition. So let's first define race conditions.

A race condition occurs when the behavior of a software system depends on the relative timing of events, such as the execution of threads. For example, the following code snippet demonstrates a race condition:

let () =
  let d1 = Domain.spawn (fun () -> print_endline "Hello from domain 1") in
  let d2 = Domain.spawn (fun () -> print_endline "Hello from domain 2") in
  Domain.join d1;
  Domain.join d2

The output of this code is non-deterministic, as it depends on the scheduling of the domains. It could print "Hello from domain 1" followed by "Hello from domain 2", or vice versa. This however is not a bug: the behaviour is expected.

Data races are race conditions that are way more sneaky: their behaviour does not only depend on some non-deterministic scheduling, but also on some out-of-control optimisations from the compiler and from the hardware. For example, consider the following code snippet:

let a = ref 0 and b = ref 0
 
let d1 () =
   b := 1; 
   !a

let d2 () =
   a := 1; 
   !b

let main () =
  let h1 = Domain.spawn d1 in
  let h2 = Domain.spawn d2 in
  Domain.join h1, Domain.join h2

Here are the expected results: 0, 1 (d1 runs first); 1, 0 (d2 runs first); and 1, 1 (any other interleaving). They are predictable because each one matches a possible interleaving of the operations performed by the two domains.

However, another output is also possible: 0, 0. It cannot be explained by any interleaving of the program as written, only by a reordering of instructions (due to a compiler or hardware optimisation). This can happen here because this program contains a data race.

By definition, data races occur when:

Two or more domains run in parallel,
at least two access the same mutable value,
and at least one of them writes to it
without a synchronisation mechanism (like locks or atomic operations).

Non-atomic mutable values in OCaml are references, mutable fields of records, and arrays.

How Do We Avoid Data Races in OCaml?

Obviously, specific tools exist to detect data races. ThreadSanitizer, for example, detects data races at runtime.

In OCaml, there are also libraries like qcheck-lin and qcheck-stm that help by randomly generating tests and checking for linearisability. These types of libraries are very useful when combined with TSan, which requires a lot of tests to be effective.

None of these solutions are static however: they catch data races at runtime, and as data races are non-deterministic, they can still be missed. Other ways of preventing data races is, well, simply by not sharing mutable states between domains. However, this would imply we strongly restrict how we use multicore programming in OCaml, which would not be ideal.

Statically Preventing Data Races in OxCaml

So, let's run with a very simple example:

let foo () = 
    let r = ref 42 in
    let d1 = Domain.spawn (fun () -> r := 1)in 
    let d2 = Domain.spawn (fun () -> !r) in

    Domain.join d1, Domain.join d2

In the above scenario, two domains access the mutable value r in parallel without any synchronisation mechanism (like locks or atomics), and at least one of those accesses is a write. This is a data race.

Next, we are going to try to port this code to OxCaml, and see how the mode system prevents it from compiling. Finally, in the next section, we will see how to fix our code to make it compile in OxCaml.

Preventing Data Races With Modes

Two mode axes are needed to ensure data race freedom: portability and contention.

A portable function (or a portable value carrying a function) can be used in a parallel context. For example, Multicore.spawn (that replaces Domain.spawn) requires as its first argument a portable function, as this function could run in a parallel context. The portability axis has 3 values: portable, shareable and nonportable. By default, values are nonportable: a portable annotation is needed to make them portable.

Here are some syntax examples for the portable mode:

let f : (int -> int) @ portable = fun x -> x + 1

let (f @ portable) x = x + 1

(* in the signature *)
module type A = sig
  val f : int -> int @@ portable
end

(* if all the functions of a module are portable *)
module type B = sig @@ portable
  val f : int -> int
  val g : bool -> bool
end

To ensure data race freedom, all functions that provide some form of parallelism must take portable functions as arguments. Also, for backwards compatibility, Domain.spawn is still available in OxCaml, but it's not annotated with the correct modes: it does not provide any DRF guarantee (you get a warning if you use it in OxCaml).

For the following example, we will first use Multicore.spawn and later Parallel.fork_join2 (which also requires a portable function), both of which provide parallelism in OxCaml.

Here is the signature of Multicore.spawn:

val spawn :  ('a : value_or_null).
     ('a @ contended once portable unique -> unit) @ once portable unyielding
  -> 'a @ contended once portable unique
  -> 'a spawn_result @ contended once portable unique

But we can simplify it for our needs. We will use the following function:

val spawn: (unit -> unit) @ portable -> unit

Contention

The contention axis has 3 values: contended, shared, and uncontended. This axis is only pertinent for the non-atomic mutable values, i.e. the ones that can cause a data race. Here is what each value means:

contended: no read or write is allowed.
shared: only reads are allowed; writes are forbidden.
uncontended: reads and writes are both allowed.

The default value is uncontended: a value becomes contended when it is captured by a portable function. The idea is simple: if a function is portable, it can be used in parallel, so if it captures a mutable value, to prevent data races, this value should not be accessed; it becomes contended. We will talk about shared (and shareable for the portability axis) later.

Let's see how this works in the following example:

let foo () =
  let r = ref 42 in
  (* r is a mutable value and is uncontended by default. *)

  let (work @ portable) () = r := 1 in
  (* r is contended because it is accessed in a portable function. 
     The `write` ( `:=`) operation requires an `uncontended` value: this does not compile *)
  
  spawn work

This example does not compile because r is contended and the write operation requires an uncontended value. We could try to keep r uncontended by removing the portable annotation on work, but then we would not be able to use it in spawn.

let foo () = 
    let r = ref 42 in
    (* a is uncontended *)

    let work () =  r := 1 in
    (* a is uncontended and `work` is at its default value: nonportable *)

    spawn work
    (* `spawn` requires a portable function, but `work` is nonportable (default value) : 
         this does not compile.*)

All right, so you could say that in this example there are actually no data races since there is just a single access to a. However, there is no way for the spawn function to know about that: it just knows that it may run concurrently with another function and thus must prevent any risk of a data race. Let's not forget that this is a dummy example, there is also no reason for parallelism. We could have written:

let foo () =
  let r = ref 42 in

  let (work @ portable) () = r := 1 in

  spawn work;
  spawn work

Here there is some parallelism, and still the error message will be the same.

And this is how modes prevent data races statically: portability defines which functions can be run in parallel, contention restricts what those functions can do with the mutable values they capture, and together they ensure that no two domains can access the same mutable value.

The next section explores shareable and shared, two finer-grained mode values that relax this in a controlled way to allow concurrent reads.

Bonus: What About `shareable` and `shared`?

Note 1: As explained above, portable/nonportable and contended/uncontended are already sufficient to statically prevent data races. The shareable and shared modes are more fine-grained annotations that allow you to perform concurrent read operations on shared mutable values.

Note 2: The current published version of OxCaml does not completely support shareable and shared annotations. In particular, there is no Reference module with the right annotations for the read operation. You will have to write your own reference module to try the following code snippets! I will show you how to do this at the end of the article.

The shared and shareable modes are based on a simple observation: two concurrent read operations on a mutable value are not a data race, so they should be permitted. A shared value can be read but not written, and an uncontended mutable value captured by a shareable function becomes shared.

let foo () =
  let open Ref in 
  (* As explained above, for now, you need a homemade Reference module to make this work. *)

  let r = ref 42 in
  (* r is uncontended *)

  let (work @ shareable) () = !r in
  (* r is shared because it is captured in a shareable function. 
     The `read` operation requires a `shared` value: this does compile *)

  spawn work
  (* Does not compile because `spawn` requires a portable function, but `work` is shareable. *)

There is a subtlety: spawn requires a portable function, not a shareable one. The reason is that spawn cannot anticipate what the calling domain will do after the new domain starts. To stay safe regardless of the caller's behaviour, spawn enforces the strictest mode available: portable. Let's check this with the following example:

let foo () =
  let open Ref in 

  let r = ref 42 in
  (* r is uncontended *)

  let (work @ shareable) () = !r in

  spawn work;
  
  (* work can still be running while the main domain runs: *)
  r := 2

Even though work only reads, it races with the r := 2 in the calling domain. This is exactly why spawn requires portable and not shareable: the constraint cannot be weakened.

We actually need another way to provide parallelism with more control: Parallel.fork_join2. The fork_join2 function is blocking: it runs two functions potentially in parallel and waits for both to finish. Here is its (simplified) signature:

val fork_join2
  :  t
  -> (t  -> 'a) @ shareable
  -> (t  -> 'b) @ shareable
  -> #('a * 'b)

We have well contained parallelism here: only the two functions passed as arguments to fork_join2 can run in parallel, and they are both shareable, meaning they can only read shared mutable values, but not write to them, which is not a data race! Let's try it:

let foo par =
  let open Ref in
  let r = create 0 in

  let (read @ shareable) _par = !r in

  let #(_, _) = Parallel.fork_join2 par read read in
  ()

There are no data races here since the two concurrent accesses to r are read operations, so this should compile, and it does because read is a shareable function.

On the opposite, the following code has a data race and does not compile:

let foo par =
  let open Ref in
  let r = create 0 in

  let (read @ shareable) _par = !r in
  let write _par = r := 1 in
  (* If we try to make `write` portable or shareable, we would get a mode error
   on `r` as it will become respectively contended or shared but the (:=) operation
   requires it to be uncontended. *)

  let #(_, _) = Parallel.fork_join2 par read write in
  (* Doesn't compile: `write` is nonportable but `fork_join2` requires both functions
     to be at least shareable. *)
  ()

Sharing in OxCaml

So, how can we make our simple example compile in OxCaml? Remember, we want to perform a read and a write operation on a shared mutable value in parallel:

let foo () = 
    let r = ref 42 in
    let d1 = Domain.spawn (fun () -> r := 1)in 
    let d2 = Domain.spawn (fun () -> !r) in

    Domain.join d1, Domain.join d2

We have to prove to the compiler that there is no data race. As a reminder, the recipe for a data race are the following

Two or more domains run in parallel,
at least two access the same mutable value,
and at least one of them writes to it
without a synchronisation mechanism (like locks or atomic operations).

1, 2, and 3 are what we are trying to do, so we need to break the 4th rule by providing a synchronisation mechanism. For our example, atomic is the easiest path: we can just replace ref with Atomic and we are good to go!

let foo () =
  let r = Atomic.make 0 in

  let read () = Atomic.get r |> ignore in
  let write () = Atomic.set r 1 in

  spawn read;
  spawn write

This compiles!

What about locks? In OxCaml, locks are actually provided through the Capsule API. Explaining it fully is beyond the scope of this article, but the general idea is the following: to perform operations on mutable values, they need to be enclosed in a capsule. The capsule can be shared between domains without getting contended (meaning it is still possible to perform operations on it). It looks like it may create data races, since it allows us to share mutable values between domains, but actually, it is pretty hard to open the capsule to perform operations on its contents.

For the operations we are trying to do, we actually need the most restrictive capsule opener: a mutex. The API for mutex is not that different to the ones in mainline OCaml. There is one big difference however: the mutable value lives entirely inside the capsule. The only way to reach it is by opening the capsule, which here means going through the mutex (via with_lock).

To illustrate:

let foo () =
  let open Await in
  let await = Await_blocking.await Terminator.never in
  let capsule_with_lock = Capsule.With_mutex.create (fun () -> ref 42) in
(* The reference is created in a capsule that can only be opened after acquiring the mutex.
    `capsule_with_lock` contains both the capsule and its associated mutex. *)

  let read r = !r in
  let write r v = r := v in

  spawn (fun () ->
      Capsule.With_mutex.with_lock await capsule_with_lock ~f:(fun r -> read r)
      (* `with_lock` is equivalent to `Mutex.protect` in OCaml.*));
  spawn (fun () ->
      Capsule.With_mutex.with_lock await capsule_with_lock ~f:(fun r ->
          write r 1))

Advanced: Interfacing OCaml Code in OxCaml

Some projects mix OxCaml with plain OCaml: external libraries, vendored modules, or legacy code we cannot (or do not want to) rewrite. When that code holds a mutable state, we face the data race problem again, but with a twist: we cannot add mode annotations since the code is not ours to modify.

Let's go back to our running example, but this time assume that the read and write operations come from a vendored module we do not want to modify:

module Vendored : sig
  val read : unit -> int
  val write : int -> unit
end = struct
  let r = ref 42
  let read () = !r
  let write v = r := v
End

Replacing ref with Atomic (as we did earlier) is not an option here, since we are not allowed to touch the vendored code.

From the OxCaml side, all the functions exported by Vendored are nonportable (the default value) and cannot be called from a portable context (e.g. inside a spawn or a fork_join2 callback). OxCaml provides an unsafe escape hatch with Obj.magic_portable which would make them callable, but it's Obj.magic and so it's ‘cheating’. Also, it gives the compiler no way to prevent two domains from calling them concurrently, so the mutable state inside Vendored.t would still be racy.

What we actually need is a way to leverage OxCaml's mode system to enforce mutual exclusion on vendored function calls at compile time, without modifying the vendored code.

The trick is to encode the mutex discipline directly in the signature of the wrapper functions. We create a mutex, and require a token of type k Capsule.Access.t as an argument of every wrapper function:

module Lock = Capsule.Mutex.Create ()
(* [Lock] exposes two things: a fresh type [k], and a mutex of type
   [k Mutex.t]. Because [k] is fresh and tied to this mutex, a value
   of type [k Capsule.Access.t] can only be produced by acquiring it. *)
type k = Lock.k

module Wrapper : sig @@ portable
  val read : access:k Capsule.Access.t -> int
  val write : access:k Capsule.Access.t -> int -> unit
end = struct
  let read_ = Obj.magic_portable Vendored.read
  let read ~access:_ = read_ ()
  (* [access] is unused at runtime, but the caller must supply one,
     and the only way to obtain one is to hold [Lock.mutex]. *)

  let write_ = Obj.magic_portable Vendored.write
  let write ~access:_ v = write_ v
(* Same [k] as [read]: both functions require [Lock.mutex] to be held when called. *)
end

By taking ~access as an argument, the wrapper signature forces callers to hold Lock.mutex before calling any of its functions – the token is a compile-time proof of authorisation. Obj.magic_portable is still used internally to bridge the vendored code, but it is now safely contained: the wrapper is the trust boundary, and the compiler enforces the mutex discipline everywhere else.

On the caller side, concurrent access to the vendored state now looks like this:

let foo () =
  let open Await in
  let await = Await_blocking.await Terminator.never in

  let safe_write () =
    Mutex.with_key await Lock.mutex ~f:(fun key ->
        Capsule.Expert.Key.access key ~f:(fun access ->
            Wrapper.write ~access 1))
  in
  let safe_read () =
    Mutex.with_key await Lock.mutex ~f:(fun key ->
        Capsule.Expert.Key.access key ~f:(fun access ->
            let _ = Wrapper.read ~access in
            ()))
  in
  spawn safe_write;
  spawn safe_read

Both domains can call read and write, but only under the protection of the mutex: this code is data race free!

This pattern generalises to any non-OxCaml dependency with one important caveat: the guarantee is only as good as the wrapper. A function missing from the wrapper can still be called directly via Obj.magic_portable, bypassing the mutex. The wrapper must therefore be audited manually, ideally complemented by runtime tools like TSan.

Solutions to the Reference Module Exercise

Here, we provide two implementations of the reference module that use shared. First, an implementation for int ref only:

module Ref : sig @@ portable
  type t = { mutable content : int }

  val ref : int -> t
  val ( ! ) : t @ shared -> int
  val ( =: ) : t -> int -> unit
end = struct
  type t = { mutable content : int }
  let ref x = { content = x }

  let ( ! ) r = r.content

  let ( =: ) r v = r.content <- v
end

A more general version:

module Ref : sig @@ portable
  type ('a : value mod contended shareable) t = { mutable content : 'a }

  val ref : 'a -> 'a t
  val ( ! ) : 'a t @ shared -> 'a
  val ( =: ) : 'a t -> 'a -> unit
end = struct
  type ('a : value mod contended shareable) t = { mutable content : 'a }
  let ref x = { content = x }

  let ( ! ) r = r.content

  let ( =: ) r v = r.content <- v
end

The additional annotation on 'a defines the kind of 'a. What it means is that our reference can not carry mutable values or nonportable functions. We need this restriction because of the read function that returns an 'a, and remember, we want to be able to call it from two different domains. So we could write the following:

type p = { mutable x : int }

let foo () =
  let open Ref in
  let r = ref { x = 0 } in
  (* Error message here: { x = 0 } is of type p that has a kind `mutable_data`, 
    incompatible with `value mod contended shareable` *)

  spawn (fun () ->
      let t = !r in
      t.x <- 1);
  spawn (fun () ->
      let t = !r in
      t.x <- 2)

There are two concurrent write operations on the same mutable value t.x, with no synchronisation mechanism: this is a data race. The kind annotation statically prevents us from writing this code, as the kind of p is mutable_data: it is incompatible with our annotation, and therefore this code does not compile.

Stay in Touch

Hopefully this gives you a better understanding of what data race freedom is, why it matters, and how it works in OCaml and OxCaml! If you’re trying out OxCaml’s data race free modes and have thoughts and feedback, please share them with us on the OCaml Discuss forum.

Connect with Tarides on Bluesky, Mastodon, and LinkedIn or sign up to our mailing list to stay updated on our latest projects. We look forward to hearing from you!

Exploring Static Data Race Freedom in OxCaml

What are Data Races? With Examples!

How Do We Avoid Data Races in OCaml?

Statically Preventing Data Races in OxCaml

Preventing Data Races With Modes

Contention

Bonus: What About `shareable` and `shared`?

Sharing in OxCaml

Advanced: Interfacing OCaml Code in OxCaml

Solutions to the Reference Module Exercise

Stay in Touch

Open-Source Development

Explore Commercial Opportunities

Stay Updated on OCaml and MirageOS!

Subscription Succesful

Subscription Succesful

Exploring Static Data Race Freedom in OxCaml

What are Data Races? With Examples!

How Do We Avoid Data Races in OCaml?

Statically Preventing Data Races in OxCaml

Preventing Data Races With Modes

Contention

Bonus: What About shareable and shared?

Sharing in OxCaml

Advanced: Interfacing OCaml Code in OxCaml

Solutions to the Reference Module Exercise

Stay in Touch

Open-Source Development

Explore Commercial Opportunities

Stay Updated on OCaml and MirageOS!

Bonus: What About `shareable` and `shared`?