Benchmarks
ezffi's design goal is that calling a Rust function through its generated FFI binding should cost the same as calling it directly from Rust, whenever that is theoretically possible. All the benches I currently have live in the benches/ directory. There's still a lot left to bench and plenty of performance improvements to make, but so far the macro is doing really well: 0% overhead when called from C on every bench I have right now. In this chapter we'll explore how the benches are structured and the results I obtained.
See BENCH_RESULTS.md at the root of the repo for the latest numbers; it is updated periodically.
How each bench is structured
Each bench is a small crate under benches/<name>/ with a fixed layout:
benches/<name>/
├── Cargo.toml — `lib` + `staticlib`, plus a `native` Rust bin target
├── build.rs — runs cbindgen (only when `--features ezffi` is on)
├── cbindgen.toml — header generation config
├── src/
│ ├── lib.rs — the function/type under test, gated with
│ │ `#[cfg_attr(feature = "ezffi", ezffi::export)]`
│ └── bin.rs — the Rust "native" runner
├── c/runner.c — the C runner
└── DESCRIPTION.md — one-line summary embedded into the results doc
The whole point is that the same Rust source runs three ways:
| label | how it's built | what it measures |
|---|---|---|
| rust | cargo build --profile bench-release | pure-Rust baseline, no macro expansion |
| rust+ezffi | cargo build --profile bench-release --features ezffi | pure-Rust baseline, with macro expansion |
| c | C linked against rust+ezffi staticlib via clang -O3 -flto=thin | same crate called from a C binary across the FFI |
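To make the rust vs rust+ezffi split concrete, here is a minimal sketch of what a bench's src/lib.rs could look like. The `add` function and its signature are only an illustration, not copied from the repo; the one part taken from the layout above is the `cfg_attr` gate. Without `--features ezffi` the attribute is dropped and the crate compiles as plain Rust, which is why this snippet runs on its own.

```rust
// Sketch of a bench's src/lib.rs. `add` is a hypothetical function under test.
// With `--features ezffi`, cfg_attr applies `ezffi::export` and the macro expands.
// Without the feature, the attribute is removed and this is plain Rust.
#[cfg_attr(feature = "ezffi", ezffi::export)]
pub fn add(a: u64, b: u64) -> u64 {
    // Wrapping arithmetic so overflow behaves the same in debug and release builds.
    a.wrapping_add(b)
}

// A real lib.rs would not have a main; it is here only so the sketch is runnable.
fn main() {
    // Plain-Rust call path, i.e. the `rust` row of the table.
    println!("{}", add(2, 3));
}
```

The same source file backs all three rows; only the build flags and the caller change.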
Each runner reads three env vars (ITERATIONS, NUMA, NUMB) at startup so the values can't be constant-folded, runs a warmup pass, then times the measured loop with Instant::now() in Rust or clock_gettime in C. A side-channel accumulator, acc, is printed after the loop. Rust and C must agree on it byte-for-byte; that is the semantic sanity check across the boundary.
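As a rough sketch of that runner shape on the Rust side: the helper names (`env_u64`, `run`), the fallback defaults, the warmup length, and the output format below are all assumptions made for illustration, and `add` stands in for whatever the bench actually measures. It also leaves out the black_box guard, so treat it as the skeleton only.

```rust
use std::env;
use std::time::Instant;

// Hypothetical function under test (stand-in for the crate's real export).
pub fn add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

// Read a u64 from the environment at startup so the compiler cannot
// treat the inputs as compile-time constants. The fallback is an assumption.
fn env_u64(name: &str, default: u64) -> u64 {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

// The loop body; `acc` is the side-channel accumulator.
fn run(iterations: u64, a: u64, b: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..iterations {
        acc = acc.wrapping_add(add(a.wrapping_add(i), b));
    }
    acc
}

fn main() {
    // Sketch defaults are small so it finishes quickly; they are not the repo's defaults.
    let iterations = env_u64("ITERATIONS", 1_000_000);
    let a = env_u64("NUMA", 7);
    let b = env_u64("NUMB", 13);

    // Warmup pass (length chosen arbitrarily here), result kept so it is not discarded.
    let warm = run(iterations / 10, a, b);

    // Measured loop.
    let start = Instant::now();
    let acc = run(iterations, a, b);
    let elapsed = start.elapsed();

    // `acc` is what the C runner would have to reproduce exactly.
    println!("warm={} acc={} elapsed_ns={}", warm, acc, elapsed.as_nanos());
}
```

Printing acc after the timed region keeps the I/O out of the measurement while still giving the two runners something to compare.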
For workloads where rustc would otherwise solve the loop in closed form (any purely-linear acc += i + b), a single black_box(i) per iteration is enough to keep each iteration opaque to the optimiser without distorting what's measured.
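Here is a small sketch of that guard using `std::hint::black_box`. The function names are mine, and `closed_form` is included only to show the formula an optimiser is free to substitute for the unguarded loop; whether it actually does so depends on the compiler version and flags.

```rust
use std::hint::black_box;

// Loop guarded by one black_box per iteration: the optimiser has to treat
// each `i` as unknown, so it cannot collapse the loop into a formula.
fn run_opaque(iterations: u64, b: u64) -> u64 {
    let mut acc = 0u64;
    for i in 0..iterations {
        let i = black_box(i);
        acc = acc.wrapping_add(i.wrapping_add(b));
    }
    acc
}

// What the unguarded loop is mathematically equal to (mod 2^64):
// sum over i in 0..n of (i + b) = n*(n-1)/2 + n*b.
fn closed_form(n: u64, b: u64) -> u64 {
    // Halve whichever factor is even before multiplying, so the division is exact.
    let tri = if n % 2 == 0 {
        (n / 2).wrapping_mul(n.wrapping_sub(1))
    } else {
        n.wrapping_mul((n - 1) / 2)
    };
    tri.wrapping_add(n.wrapping_mul(b))
}

fn main() {
    let (n, b) = (1_000u64, 13u64);
    // Same value either way; only the amount of work done differs.
    println!("loop={} formula={}", run_opaque(n, b), closed_form(n, b));
}
```

The guard changes how the result is computed, not the result itself, so the acc check between Rust and C still holds.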
Here is a quick snippet of the profile config I'm using:
[profile.bench-release]
inherits = "release"
lto = "thin"
Running them
From the repo root:
benches/run.sh
The first run builds a small Docker image (benches/Dockerfile: clang plus rust-lld from rustup, so the LLVM versions line up for cross-language LTO). After that it just runs. The host's repo is bind-mounted into the container, so source edits are picked up without rebuilding the image.
ITERATIONS, NUMA, and NUMB can be overridden from the environment if you want different inputs. The default is 1_000_000_000 iterations with arbitrary constants.
When the script finishes it overwrites BENCH_RESULTS.md at the repo root with the latest numbers plus the host's toolchain block. That file is meant to be committed alongside releases.
Adding a new bench
Copy any of the existing dirs (add-fn is the simplest), rename, edit src/lib.rs with the function or type you want to measure, mirror it in src/bin.rs and c/runner.c, and write a one-line DESCRIPTION.md. The workspace glob (benches/*) and run.sh's directory loop pick it up automatically on the next run.
Bench results summary
So far, I measure 0% overhead on every single bench when calling from C. The only case where I've seen heavy overhead is in native Rust with the macro active (the rust+ezffi run), and oddly enough, that overhead disappears once the C binary links against the same crate (cross-language ThinLTO recovers the missed optimisation). I'll be investigating that in more depth and adding more benches to catch the cases where it shows up.