Benchmarks

ezffi's design goal is that calling a Rust function through its generated FFI binding should cost the same as calling it directly from Rust, wherever that is theoretically possible. All the benches I currently have live in the benches/ directory. There's still a lot left to bench and plenty of performance improvements to make, but so far the macro is doing really well: 0% overhead on every bench I have right now when called from C. This chapter covers how the benches are structured and the results I obtained.

See BENCH_RESULTS.md at the root of the repo for the latest numbers; it's updated periodically.

How each bench is structured

Each bench is a small crate under benches/<name>/ with a fixed layout:

benches/<name>/
├── Cargo.toml         — `lib` + `staticlib`, plus a `native` Rust bin target
├── build.rs           — runs cbindgen (only when `--features ezffi` is on)
├── cbindgen.toml      — header generation config
├── src/
│   ├── lib.rs         — the function/type under test, gated with
│   │                    `#[cfg_attr(feature = "ezffi", ezffi::export)]`
│   └── bin.rs         — the Rust "native" runner
├── c/runner.c         — the C runner
└── DESCRIPTION.md     — one-line summary embedded into the results doc
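
As a rough illustration of the gating in src/lib.rs, here is a minimal sketch. The function name `add`, its signature and its body are hypothetical and not copied from any bench in the repo; only the `cfg_attr` line comes from the layout above.

```rust
// Hypothetical src/lib.rs for a bench crate (illustrative only).

// With `--features ezffi` this attribute becomes `#[ezffi::export]`.
// Without the feature it expands to nothing, leaving a plain Rust function.
#[cfg_attr(feature = "ezffi", ezffi::export)]
pub fn add(a: u64, b: u64) -> u64 {
    // Wrapping arithmetic, so a long accumulation loop cannot panic on overflow.
    a.wrapping_add(b)
}
```

Because the attribute is behind `cfg_attr`, the same file compiles with or without the ezffi dependency being active, which is what lets one source serve every build variant.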

The whole point is that the same Rust source runs three ways:

| label | how it's built | what it measures |
| --- | --- | --- |
| rust | cargo build --profile bench-release | pure-Rust baseline, no macro expansion |
| rust+ezffi | cargo build --profile bench-release --features ezffi | pure-Rust baseline, with macro expansion |
| c | C linked against rust+ezffi staticlib via clang -O3 -flto=thin | same crate called from a C binary across the FFI |

Each runner reads three env vars (ITERATIONS, NUMA, NUMB) at startup so the values aren't constant-folded, runs a warmup pass, then times the measured loop with Instant::now() on the Rust side and clock_gettime on the C side. A side-channel accumulator, acc, is printed after the loop. Rust and C must agree byte-for-byte on it; that's the semantic sanity check across the boundary.
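
The runner shape described above could look roughly like the sketch below. This is not the repo's src/bin.rs: the helper names `env_u64` and `run`, the fallback values, the warmup size and the output format are all assumptions made for illustration, and `add` is a stand-in for whatever the bench's lib.rs exports.

```rust
use std::env;
use std::time::Instant;

// Stand-in for the function under test (hypothetical; a real runner would
// import it from the bench crate's lib.rs).
fn add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

// Read a u64 from the environment, falling back to `default` when the
// variable is unset or does not parse. The fallbacks used below are placeholders.
fn env_u64(name: &str, default: u64) -> u64 {
    env::var(name)
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(default)
}

// Call the function `iterations` times and fold every result into `acc`.
// Note: this body is linear in `i`, so an optimiser may solve it in closed form.
fn run(iterations: u64, a: u64, b: u64) -> u64 {
    let mut acc: u64 = 0;
    for i in 0..iterations {
        acc = acc.wrapping_add(add(a.wrapping_add(i), b));
    }
    acc
}

fn main() {
    // Smaller than the documented default, to keep the sketch quick to run.
    let iterations = env_u64("ITERATIONS", 1_000_000);
    let a = env_u64("NUMA", 3);
    let b = env_u64("NUMB", 4);

    // Warmup pass: same work, untimed.
    let warm = run(iterations / 10, a, b);

    // Measured loop.
    let start = Instant::now();
    let acc = run(iterations, a, b);
    let elapsed = start.elapsed();

    // Printing both accumulators keeps the work observable and gives the
    // value that the C runner would have to match.
    println!("acc={} warm={} elapsed_ns={}", acc, warm, elapsed.as_nanos());
}
```

Reading the inputs at runtime is what stops the compiler from treating them as constants; the printed accumulator is the value to compare across runners.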

For workloads where rustc would otherwise solve the loop in closed form (any purely-linear acc += i + b), a single black_box(i) per iteration is enough to keep each iteration opaque to the optimiser without distorting what's measured.
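
A minimal sketch of that trick, assuming a hypothetical `add` function as the workload (the names `add` and `run_opaque` are illustrative, not taken from the repo):

```rust
use std::hint::black_box;

// Hypothetical function under test.
fn add(a: u64, b: u64) -> u64 {
    a.wrapping_add(b)
}

// Accumulate add(i, b) over the loop. Without black_box the sum is linear
// in `i` and the optimiser could replace the whole loop with a formula.
fn run_opaque(iterations: u64, b: u64) -> u64 {
    let mut acc: u64 = 0;
    for i in 0..iterations {
        // black_box returns `i` unchanged but asks the optimiser to treat the
        // value as unknown, so each iteration has to be evaluated.
        acc = acc.wrapping_add(add(black_box(i), b));
    }
    acc
}

fn main() {
    println!("acc={}", run_opaque(1_000, 10));
}
```

`black_box` is an identity function, so the accumulator is the same as it would be without it; only the optimiser's view of the loop changes.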

Here is a quick snippet of the profile config I'm using:

[profile.bench-release]
inherits = "release"
lto = "thin"

Running them

From the repo root:

benches/run.sh

The first time, this builds a small Docker image (benches/Dockerfile: clang plus rust-lld from rustup, so the LLVM versions line up for cross-language LTO). After that it just runs. The host's repo is bind-mounted into the container, so source edits are picked up without rebuilding the image.

ITERATIONS, NUMA and NUMB can be overridden from the environment if you want different inputs. The default is 1_000_000_000 iterations with arbitrary constants.

When the script finishes it overwrites BENCH_RESULTS.md at the repo root with the latest numbers plus the host's toolchain block. That file is meant to be committed alongside releases.

Adding a new bench

Copy any of the existing dirs (add-fn is the simplest), rename, edit src/lib.rs with the function or type you want to measure, mirror it in src/bin.rs and c/runner.c, and write a one-line DESCRIPTION.md. The workspace glob (benches/*) and run.sh's directory loop pick it up automatically on the next run.

Bench results summary

So far, I have 0% overhead on every single bench when calling from C. The only case where I've seen heavy overhead is in native Rust with the macro active, and oddly enough, that overhead disappears once the C binary links against the same crate (cross-language ThinLTO recovers the missed optimisation). I'll be investigating that in more depth and adding more tests to catch the cases where it shows up.