v0.1 · LLVM backend

rf: The direct systems language

Built for simulations, engines, and data-parallel code. The compiler handles layout, aliasing, and parallelism so you don't have to.

simulate.rf
import "mem"

struct #soa Particle {
    pos: [4]r32,
    vel: [4]r32,
    mass: r32,
}

simulate (rw p: []#cacheline Particle, dt: r32) {
    @dispatch(threads = 4)
    for e in p {
        e.pos += e.vel * dt
        e.vel.y -= 9.8 * dt
        if e.pos.y < 0 { e.pos.y = 0 }
    }
}

Get early access.

rf isn't public yet. Join the waitlist and we'll let you know when the first compiler drops.

No spam. One email when it ships.

Why rf.

Every feature exists because expert C++ programmers already write it by hand.
We just made the compiler do it instead.

Access qualifiers → free noalias

Tag parameters r, w, rw.

You get automatic readonly, writeonly, and noalias on every LLVM pointer — SIMD vectorization without a single pragma.

vs C++
T* __restrict pos, const T* __restrict vel
on every parameter, forever
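For reference, this is roughly what the hand-annotated C++ side looks like as a complete function. It is a minimal sketch: the name `integrate` and its parameters are illustrative assumptions, not rf output, and `__restrict` is a compiler extension (GCC, Clang, MSVC) rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of rf: the hand-written counterpart of a
// signature with one read-write slice and one read-only slice.
// __restrict is a compiler extension, not standard C++.
void integrate(float* __restrict pos, const float* __restrict vel,
               std::size_t n, float dt) {
    // The caller promises pos and vel never overlap, so the optimizer may
    // vectorize the loop without emitting a runtime overlap check.
    for (std::size_t i = 0; i < n; ++i) {
        pos[i] += vel[i] * dt;
    }
}
```

Nothing in C++ verifies the promise: passing overlapping pointers is undefined behavior, and the compiler accepts the call without a diagnostic.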

Struct-of-arrays → cache-hot loops

Annotate any struct with #soa.

You get separate contiguous arrays per field. Hot loops only touch the fields they read. Zero manual pointer management.

vs C++
float *px,*py,*pz,*vx,*vy,*vz; // 6 mallocs, 6 frees, manual indexing
~30 lines of boilerplate per struct
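To make the layout concrete, here is a small C++ sketch of the struct-of-arrays pattern, using `std::vector` instead of raw `malloc` to keep it short. The type `ParticlesSoA` and the function `apply_gravity` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical struct-of-arrays container: one contiguous array per field.
struct ParticlesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;

    explicit ParticlesSoA(std::size_t n)
        : px(n), py(n), pz(n), vx(n), vy(n), vz(n) {}

    std::size_t size() const { return px.size(); }
};

// A loop that only needs the y components touches two of the six arrays;
// the other four never enter the cache.
void apply_gravity(ParticlesSoA& p, float dt) {
    for (std::size_t i = 0; i < p.size(); ++i) {
        p.vy[i] -= 9.8f * dt;
        p.py[i] += p.vy[i] * dt;
    }
}
```

Every new field means another member, another constructor entry, and another index expression, which is the bookkeeping `#soa` is meant to absorb.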

Disjoint contracts → proven aliasing

Declare #ensure disjoint(a, b) on a function.

You get llvm.assume inserted automatically. The compiler proves your slices don't overlap. If it can't, it's a compile error.

vs C++
__builtin_assume(!ranges_overlap(a, n, b, m))
manually at every call site
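The C++ line above calls `ranges_overlap` without defining it. One plausible definition is sketched below; the signature is an assumption inferred from the call, and the check runs at run time rather than being proven by the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper matching the ranges_overlap(a, n, b, m) call shape.
// Returns true when [a, a+n) and [b, b+m) share at least one element.
template <typename T>
bool ranges_overlap(const T* a, std::size_t n, const T* b, std::size_t m) {
    if (n == 0 || m == 0) {
        return false;  // an empty range overlaps nothing
    }
    // Compare addresses as integers: relational operators on pointers into
    // different arrays give unspecified results in C++.
    const auto a0 = reinterpret_cast<std::uintptr_t>(a);
    const auto b0 = reinterpret_cast<std::uintptr_t>(b);
    const auto a1 = a0 + n * sizeof(T);
    const auto b1 = b0 + m * sizeof(T);
    return a0 < b1 && b0 < a1;
}
```

A caller would typically guard with `if (!ranges_overlap(a, n, b, m)) { ... }` before invoking a `__restrict` function, and nothing reminds them when they forget.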

Cacheline types → no false sharing

Declare []#cacheline T or #align(64).

You get automatic stride computation, aligned allocation, and zero false-sharing between threads. The compiler knows the stride and optimizes indexing.

vs C++
posix_memalign(&ptr, 64, n * 64); // stride tracked manually
easy to get wrong, impossible to verify
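A common way to get one element per cache line in C++17 is to over-align the element type. The sketch below assumes a 64-byte line, which is typical on x86-64 but not universal; `PaddedCounter` and `make_counters` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size; 64 bytes is typical on x86-64 but not universal.
constexpr std::size_t kCacheLine = 64;

// Hypothetical padded element: alignas rounds sizeof up to a multiple of
// the alignment, so neighbouring elements never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one element per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// Since C++17, std::vector's default allocator honours over-aligned types.
std::vector<PaddedCounter> make_counters(std::size_t n) {
    return std::vector<PaddedCounter>(n);
}
```

The stride here is simply `sizeof(PaddedCounter)`, so threads writing to different elements never touch the same line.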

@dispatch → parallelism as syntax

Annotate any loop with @dispatch(threads=4).

You get OpenMP-level parallelism without pragmas. Chunked scheduling, thread-local state, and task-parallel @sync regions — all compiler-managed.

vs C++
#pragma omp parallel for schedule(static, chunk)
compiler-specific, no type checking
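Chunked static scheduling comes down to a little index arithmetic. The sketch below computes which index ranges a given thread would process when fixed-size chunks are dealt round-robin in thread order, the assignment rule OpenMP uses for `schedule(static, chunk)`. Whether rf's `@dispatch` uses the same rule is an assumption, and `chunks_for_thread` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open index range [begin, end).
struct Chunk {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: the chunks thread `tid` processes when `n` items are
// split into fixed-size chunks dealt round-robin across `threads` threads.
std::vector<Chunk> chunks_for_thread(std::size_t n, std::size_t chunk,
                                     std::size_t threads, std::size_t tid) {
    assert(chunk > 0 && threads > 0 && tid < threads);
    std::vector<Chunk> out;
    for (std::size_t begin = tid * chunk; begin < n; begin += threads * chunk) {
        // The final chunk may be shorter than `chunk`.
        const std::size_t end = (begin + chunk < n) ? begin + chunk : n;
        out.push_back({begin, end});
    }
    return out;
}
```

With 1000 items, a chunk size of 100, and 4 threads, thread 1 receives the ranges starting at 100, 500, and 900.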

ct intents → guaranteed constant folding

Declare parameters with ct (compile-time constant).

You get a compiler-enforced guarantee that the argument is known at compile time. No template metaprogramming. No constexpr cascades.

vs C++
template<int N> void foo() or consteval
comptime as a parameter, not a function attribute
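For comparison, this is what the template route looks like in C++17 for a small integer power function. The name `ipow` is a hypothetical example; the point is that the constant has to travel through the template parameter list instead of the ordinary argument list.

```cpp
#include <cassert>

// Hypothetical example: N must be a compile-time constant, so the recursion
// is resolved during compilation instead of looping at run time.
template <int N>
constexpr float ipow(float x) {
    static_assert(N >= 0, "exponent must be non-negative");
    if constexpr (N == 0) {
        return 1.0f;
    } else {
        return x * ipow<N - 1>(x);
    }
}

// With constant arguments the whole call is evaluated at compile time.
static_assert(ipow<3>(2.0f) == 8.0f, "2^3");
```

Callers write `ipow<4>(x)` rather than `ipow(x, 4)`, and every distinct `N` produces a separate instantiation.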

Write less, ship faster.

What takes 30 lines of C++ is one line in rf. The compiler handles the rest.

Memory layout is a type annotation

Mark struct layout at the type level — AOS, SOA, or cacheline-aligned.

You get optimal memory layout for your access pattern. Hot fields stay hot. SIMD lanes stay full. Zero runtime overhead.

rf — 7 lines
struct Vec3 { x, y, z: r32 }

struct #soa Entity {
    pos: Vec3,
    vel: Vec3,
}

entities = mem.alloc(Entity, 4096)   -- 6 arrays, 1 call
C++ — 35+ lines
float* px = (float*)malloc(n * sizeof(float));
float* py = (float*)malloc(n * sizeof(float));
float* pz = (float*)malloc(n * sizeof(float));
float* vx = (float*)malloc(n * sizeof(float));
float* vy = (float*)malloc(n * sizeof(float));
float* vz = (float*)malloc(n * sizeof(float));
// + 6 frees + manual index arithmetic

Parallelism is one word

Data-parallel and task-parallel with @dispatch and @sync.

You get threaded execution without touching pthreads, OpenMP pragmas, or thread pools. Add threads later. Remove them later. Your code doesn't change.

rf — 9 lines
@sync {
    @dispatch process_physics(positions, velocities)
    @dispatch process_ai(entities)
}

@dispatch(threads = 4, chunk = 256)
for w in workers {
    w.counter += 1
}
C++ — 20+ lines
// task parallelism
#pragma omp parallel sections
{
    #pragma omp section
    { process_physics(positions, velocities); }
    #pragma omp section
    { process_ai(entities); }
}

// data parallelism
#pragma omp parallel for schedule(static, 256) num_threads(4)
for (int i = 0; i < n; ++i) {
    workers[i].counter += 1;
}

Aliasing is a contract, not a prayer

Declare disjointness with #ensure disjoint(a, b). Prove it with @disjoint(a, b).

You get the compiler checking your work. If slices might overlap, it won't compile. Most of the time you don't think about it — the compiler just handles it.

rf — compiler-enforced
-- callee: declares the contract
transform (rw a: []s64, rw b: []s64)
    #ensure disjoint(a, b)
{ }

-- caller: proves the contract holds
if @disjoint(a, b) { transform(a, b) }
-- ^ compile error if not provable
C++ — unchecked
// callee: cross your fingers
void transform(T* __restrict a,
               const T* __restrict b,
               size_t n) { ... }

// caller: no compiler enforcement
__builtin_assume(!overlap(a, n, b, n));
// ^ forget this → no compiler warning
transform(a, b, n);

Types at a glance.

| Category | What you write | What you get |
|---|---|---|
| Integers | s8 s16 s32 s64 u8 u16 u32 u64 | LLVM integer of exact width |
| Floats | r32 r64 | IEEE 754 single/double |
| Boolean | bool | i1 with zero-initialization |
| Pointer | rawptr | Untyped 64-bit pointer |
| Dense slices | []T | {ptr, len} fat pointer |
| Cacheline slices | []#cacheline T | {ptr, len, stride} fat pointer + auto-padding |
| Fixed arrays | [N]T | Stack-allocated, bounds-checked |
| Comptime | $T | Monomorphized per call site |
| Structs | #aos (default) · #soa · #align(N) | Layout chosen at declaration time |
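The two slice rows can be pictured as small C++ structs. This is only a mental model under stated assumptions: the field names, a stride measured in bytes, and the bounds checks via `assert` are all illustrative and say nothing about rf's real ABI.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical C++ mirror of a dense slice: {ptr, len}.
template <typename T>
struct Slice {
    T* ptr;
    std::size_t len;

    T& operator[](std::size_t i) const {
        assert(i < len);  // bounds check
        return ptr[i];
    }
};

// Hypothetical mirror of a cacheline slice: {ptr, len, stride}.
// Assumes stride is in bytes, so element i lives at ptr + i * stride.
template <typename T>
struct StridedSlice {
    unsigned char* ptr;
    std::size_t len;
    std::size_t stride;

    T& operator[](std::size_t i) const {
        assert(i < len);  // bounds check
        return *reinterpret_cast<T*>(ptr + i * stride);
    }
};
```

With a 64-byte stride over `float` data, element `i` sits 16 floats after element `i - 1`, which is the `i * 16` index math that appears in the hand-written C++ versions.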

rf vs C++.

Same output. Radically less input.

| Task | rf | LOC | C++ | LOC |
|---|---|---|---|---|
| Restrict pointers | r a: []T | 1 | const T* __restrict a | 1 |
| Disjoint proof | if @disjoint(a,b) { f(a,b) } | 1 | __builtin_assume(!overlap(a,n,b,m)) | 1 |
| Struct-of-arrays | struct #soa E { x,y,z: r32 } | 4 | float *Ex,*Ey,*Ez; // 3 mallocs + 3 frees + index math | 35 |
| Cacheline padding | []#cacheline T | 1 | posix_memalign + manual stride + aligned free | 10 |
| Parallel for | @dispatch for x in xs { ... } | 2 | #pragma omp parallel for schedule(static, N) | 8 |
| Task parallelism | @sync { @dispatch a(); @dispatch b() } | 4 | #pragma omp parallel sections { ... } | 12 |
| Compile-time param | fn foo(ct n: s64) | 1 | template<int N> void foo() or consteval | 3 |
| SIMD by default | fn update(rw pos: []T, r vel: []T) | 1 | __restrict + __builtin_assume + pragma simd | 6 |
| Deferred cleanup | defer close(f) | 1 | RAII wrapper class + destructor | 8 |
| Multi-slice iteration | for a,b,c in xs,ys,zs { ... } | 1 | for(size_t i=0;i<n;++i){ a=xs[i];b=ys[i];c=zs[i]; } | 4 |
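The "RAII wrapper class + destructor" entry in the deferred-cleanup row usually means a scope guard along these lines. This is a minimal C++17 sketch; `Defer` and `run_with_cleanup` are hypothetical names, not taken from any library.

```cpp
#include <cassert>
#include <utility>

// Hypothetical scope guard: runs the stored callable when the guard goes
// out of scope, on every exit path from the enclosing block.
template <typename F>
class Defer {
public:
    explicit Defer(F f) : f_(std::move(f)) {}
    ~Defer() { f_(); }
    Defer(const Defer&) = delete;
    Defer& operator=(const Defer&) = delete;

private:
    F f_;
};

// Records the order in which the body and the deferred cleanup run.
int run_with_cleanup() {
    int trace = 0;
    {
        Defer cleanup([&trace] { trace = trace * 10 + 2; });  // runs last
        trace = trace * 10 + 1;                               // runs first
    }
    return trace;  // 12: body (1), then cleanup (2)
}
```

The class only needs to be written once, but it still has to live somewhere in the codebase, which is what the table's line count for the C++ side reflects.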

Case studies.

Fair comparisons. rf vs competent C++. Assembly-verified.

01

Alias-free physics update

noalias without __restrict — checked at every call site

rf — 3 lines, noalias by default
simulate (rw pos: [][3]r32, vel: [][3]r32, dt: r32) {
    for p, v in pos, vel {
        p = p + v * dt
    }
}
-- rw pos → noalias pos ptr in LLVM IR
-- vel (default r) → noalias readonly
-- [3]r32 → <3 x float> vector type
-- v * dt → scalar broadcast (fmul <3 x float>, float)
C++ — __restrict, manual fields
struct Vec3 { float x, y, z; };

// __restrict: same noalias IR. Manual annotation.
void simulate(Vec3* __restrict pos,
              const Vec3* __restrict vel,
              int n, float dt) {
    for (int i = 0; i < n; i++) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}
// Without __restrict: the compiler must assume pos and vel may overlap,
// so it cannot hoist or combine the vel[i] loads across stores to pos[i].

LLVM IR — rf

define void @simulate(
    ptr noalias %pos, i64 %len,
    ptr noalias readonly %vel, i64 %len2,
    float %dt)

LLVM IR — C++ (naive)

define void @simulate_naive(
    ptr %pos, ptr %vel,
    i32 %n, float %dt)     -- no noalias!
rf gives optimal noalias IR by default, and the semantic analyzer checks it at every call site. In C++ you get the same IR with __restrict, but it is a manual, unchecked annotation.
02

SOA with cacheline padding

2 lines of type definition vs ~40 lines of manual memory management

rf — 2-line layout
struct #soa Particle {
    pos: [3]r32,
    vel: [3]r32,
    mass: r32,
}

-- []#cacheline = auto 64-byte stride
update (rw ents: []#cacheline Particle, dt: r32) {
    for e in ents {
        e.pos = e.pos + e.vel * dt
        e.mass = e.mass * 0.999
    }
}
C++ — ~40 lines of manual layout
struct Particles {
    float* px; float* py; float* pz;
    float* vx; float* vy; float* vz;
    float* mass; int count;
};

// alloc each field with stride math
float* px = static_cast<float*>(_aligned_malloc(n * 64, 64));
float* py = static_cast<float*>(_aligned_malloc(n * 64, 64));
// ... 5 more allocs, 7 frees, manual index * 16

Assembly — both produce identical stride-GEP

getelementptr float, ptr %px, i64 %idx
; rf: compiler computes stride from #soa + #cacheline
; C++: programmer computes stride as i * 16
Identical assembly. The difference is source code: rf's 2-line struct definition replaces ~40 lines of manual allocation, stride math, and cleanup. Adding a field in rf is 1 line. In C++ it's 5+ edits across three code regions.
03

Bounds-proven disjointness

The compiler tracks offset+length per slice — and proves non-overlap at compile time

rf — 3 scenarios, 0 bugs
struct T { x: r32 }

update (rw a: []T, rw b: []T)
    #ensure disjoint(a, b)
{ for i in 0..<b.len { a[i].x = a[i].x + b[i].x } }  -- iterate over b, which may be shorter than a

buf = mem.alloc(T, 200)

-- (1) non-overlapping: no guard needed
a = buf[0..100]        -- offset=0, len=100
b = buf[101..105]      -- offset=101, len=4
update(a, b)            -- OK: bounds prove 0+100 <= 101

-- (2) overlapping: COMPILE ERROR
c = buf[50..150]      -- offset=50, len=100
update(a, c)            -- ERROR: 0..100 overlaps 50..150

-- (3) unknown provenance: guard required
x = some_fn()           -- provenance unknown
@assert(@disjoint(a, x))
update(a, x)            -- OK: guard covers unknown x
C++ — __restrict, no range tracking
// __restrict: same noalias IR, manual check
void update(T* __restrict a, T* __restrict b,
            int n) {
    for (int i = 0; i < n; i++)
        a[i].x += b[i].x;
}

T buf[200];
update(buf, buf + 101, 4);    // OK
update(buf, buf + 50, 100);    // compiles fine, no warning
// ^ rf catches this at compile time:
//   bounds prove 0..100 overlaps 50..150

LLVM IR — both generate noalias

define void @update(
    ptr noalias %a, i64 %len_a,
    ptr noalias %b, i64 %len_b)

-- Same IR in both. The difference is enforcement:
-- rf tracks offset+len per slice variable and
-- catches overlap automatically.
Both produce optimal noalias IR. The difference is that rf tracks every slice's offset and length, and catches overlapping sub-slices at compile time — no annotation needed. Non-overlapping sub-slices from the same allocation work without a guard. In C++ with __restrict, there's no range tracking — overlapping calls compile without complaint.
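The bounds reasoning in scenarios (1) and (2) is ordinary interval arithmetic on offset and length. The sketch below models it with a `constexpr` function so the checks run at compile time. `Range` and `disjoint` are hypothetical names, and this illustrates the arithmetic only, not rf's actual analysis.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model: a sub-slice of one allocation is described by its
// offset and length, i.e. the half-open range [offset, offset + len).
struct Range {
    std::size_t offset;
    std::size_t len;
};

// Two ranges from the same allocation are disjoint when one ends at or
// before the point where the other begins.
constexpr bool disjoint(Range a, Range b) {
    return a.offset + a.len <= b.offset || b.offset + b.len <= a.offset;
}

// Scenario (1): buf[0..100] vs buf[101..105], 0 + 100 <= 101, so disjoint.
static_assert(disjoint({0, 100}, {101, 4}), "scenario 1 is disjoint");
// Scenario (2): buf[0..100] vs buf[50..150], neither bound holds, so overlap.
static_assert(!disjoint({0, 100}, {50, 100}), "scenario 2 overlaps");
```

Scenario (3) is the case this arithmetic cannot cover: without a known offset and length relative to the same allocation, only a runtime guard can establish disjointness.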
04

Parallel dispatch with cacheline safety

Type-safe parallelism — no pragmas, no manual stride math

rf — 3-line annotations
struct #soa Entity {
    health: r32,
    pos: [3]r32,
    vel: [3]r32,
}

simulate (rw ents: []#cacheline Entity, dt: r32) {
    @dispatch(threads = 4, chunk = 64)
    for e in ents {
        e.health = e.health - 0.1 * dt
        e.pos = e.pos + e.vel * dt
        if e.health <= 0 { e.health = 0 }
    }
}
C++ — manual everything
// Manual SOA + cacheline padding (see case 02)
// Then add OpenMP pragma:
void simulate(Entities& ents, float dt) {
    #pragma omp parallel for schedule(static, 64) num_threads(4)
    for (int i = 0; i < ents.count; i++) {  // count field as in case 02
        int off = i * 16;  // manual stride: 16 floats = 64 bytes
        ents.health[off] -= 0.1f * dt;
        // ... same manual stride for pos += vel * dt
        if (ents.health[off] <= 0) ents.health[off] = 0;
    }
}

Assembly — both generate GOMP_loop_static_start

call GOMP_loop_static_start
call GOMP_loop_static_next
; rf: auto-generated from @dispatch + []#cacheline
; C++: from #pragma omp parallel for with manual stride
Same OpenMP calls, same assembly. In rf, the type system handles false sharing ([]#cacheline), thread aliasing (access qualifiers), and chunk boundaries — you write the business logic, the compiler manages the rest. In C++, these are programmer-managed: stride math, padding, and chunk boundaries all need to be correct by hand.
05

Array programming as portable SIMD

[4]r32 is a native vector type — same assembly as hand-written SSE intrinsics

rf — 1 line, portable SIMD
f (a: [4]r32, b: [4]r32, s: r32) -> [4]r32 {
    return a * s + b
}
-- [4]r32 -> <4 x float> in LLVM IR
-- a * s -> scalar broadcast (fmul <4 x float>, float)
-- + b  -> vector add (fadd <4 x float>)
-- Compiles to mulps + addps on x86
-- Same rf code works on ARM NEON, WASM SIMD
C++ — SSE intrinsics, x86-only
// Same assembly as rf. Not portable.
// __m128 doesn't exist on ARM.
__m128 f(__m128 a, __m128 b, float s) {
    return _mm_add_ps(
        _mm_mul_ps(a, _mm_set1_ps(s)), b);
}

// Naive C++: relies on auto-vectorizer
void f(float* a, float* b, float s,
       float* out) {
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * s + b[i];
}  // may or may not vectorize

LLVM IR — RF

define <4 x float> @f(
    <4 x float> %a, <4 x float> %b,
    float %s)

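  %s.ins = insertelement <4 x float> poison, float %s, i64 0
  %s.splat = shufflevector <4 x float> %s.ins, <4 x float> poison, <4 x i32> zeroinitializer
  %mul = fmul <4 x float> %a, %s.splat
  %add = fadd <4 x float> %b, %mul
  ret <4 x float> %add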

x86 Assembly — both produce identical SIMD

shufps  $0, %xmm2, %xmm0    ; broadcast s
mulps   (%rcx), %xmm0       ; a * s
addps   (%rdx), %xmm0       ; + b
; rf generates this from [4]r32 types
; C++ needs __m128 + intrinsics (x86-only)
rf's [4]r32 is a portable SIMD type: it lowers to <4 x float> in LLVM IR and compiles to mulps + addps on x86, and to fmul + fadd on ARM NEON. The same rf code works on x86, ARM NEON, and WASM SIMD. In C++, getting guaranteed SIMD requires platform-specific intrinsics (__m128, _mm_add_ps) or trusting the auto-vectorizer.
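C++ has a middle ground between intrinsics and the auto-vectorizer: the GCC/Clang vector extensions, which offer a similar "vector as a value type" model. The sketch below relies on that non-standard extension, which MSVC does not provide; `v4f` and `scale_add` are hypothetical names.

```cpp
#include <cassert>

// GCC/Clang vector extension (not standard C++, not available on MSVC):
// four packed floats, 16 bytes wide.
typedef float v4f __attribute__((vector_size(16)));

// Element-wise a * s + b. The scalar is splatted into all four lanes
// explicitly so the arithmetic is plain vector-by-vector.
v4f scale_add(v4f a, v4f b, float s) {
    const v4f sv = {s, s, s, s};
    return a * sv + b;
}
```

The source is portable across targets those compilers support, but not across compilers, which is the trade-off against a language-level vector type.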