rf — the direct systems language

Why rf.

Every feature exists because expert C++ programmers already write it by hand.
We just made the compiler do it instead.


Access qualifiers → free noalias

Tag parameters r, w, rw.

You get automatic readonly, writeonly, and noalias on every LLVM pointer — SIMD vectorization without a single pragma.

vs C++: T* __restrict pos, const T* __restrict vel (on every parameter, forever)


Struct-of-arrays → cache-hot loops

Annotate any struct with #soa.

You get separate contiguous arrays per field. Hot loops only touch the fields they read. Zero manual pointer management.

vs C++: float *px,*py,*pz,*vx,*vy,*vz; // 6 mallocs, 6 frees, manual indexing (~30 lines of boilerplate per struct)


Disjoint contracts → proven aliasing

Declare #ensure disjoint(a, b) on a function.

You get llvm.assume inserted automatically. The compiler proves your slices don't overlap. If it can't, it's a compile error.

vs C++: __builtin_assume(!ranges_overlap(a, n, b, m)), written manually at every call site


Cacheline types → no false sharing

Declare []#cacheline T or #align(64).

You get automatic stride computation, aligned allocation, and zero false-sharing between threads. The compiler knows the stride and optimizes indexing.

vs C++: posix_memalign(&ptr, 64, n * 64); // stride tracked manually (easy to get wrong, impossible to verify)


@dispatch → parallelism as syntax

Annotate any loop with @dispatch(threads=4).

You get OpenMP-level parallelism without pragmas. Chunked scheduling, thread-local state, and task-parallel @sync regions — all compiler-managed.

vs C++: #pragma omp parallel for schedule(static, chunk) (compiler-specific, no type checking)


ct intents → guaranteed constant folding

Declare parameters with ct (compile-time constant).

You get a compiler-enforced guarantee that the argument is known at compile time. No template metaprogramming. No constexpr cascades.

vs C++: template<int N> void foo() or consteval. In rf, comptime is a property of the parameter, not a function attribute.
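For reference, the C++ side of this comparison can be sketched as follows. The names `scale_by` and `scale_runtime_input` are hypothetical and exist only for illustration; the point is that the compile-time value has to travel as a template argument, separate from the ordinary parameter list.

```cpp
#include <cassert>

// Hypothetical example: the factor N must be known at compile time,
// so it is passed as a template argument instead of a normal parameter.
template <int N>
constexpr int scale_by(int x) {
    static_assert(N > 0, "N must be a positive compile-time constant");
    return x * N;
}

// The call below is evaluated entirely by the compiler.
static_assert(scale_by<4>(3) == 12, "folded at compile time");

// Runtime values still flow through the ordinary parameter.
inline int scale_runtime_input(int x) {
    assert(x >= 0);  // ordinary runtime precondition
    return scale_by<4>(x);
}
```

Passing a runtime variable where `N` is expected fails to compile, which is the same guarantee the ct intent is described as providing.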

Write less, ship faster.

What takes 30 lines of C++ is one line in rf. The compiler handles the rest.

Memory layout is a type annotation

Mark struct layout at the type level — AOS, SOA, or cacheline-aligned.

You get optimal memory layout for your access pattern. Hot fields stay hot. SIMD lanes stay full. Zero runtime overhead.

rf — 7 lines

struct Vec3 { x, y, z: r32 }

struct #soa Entity {
    pos: Vec3,
    vel: Vec3,
}

entities = mem.alloc(Entity, 4096)   -- 6 arrays, 1 call

C++ — 35+ lines

float* px = (float*)malloc(n * sizeof(float));
float* py = (float*)malloc(n * sizeof(float));
float* pz = (float*)malloc(n * sizeof(float));
float* vx = (float*)malloc(n * sizeof(float));
float* vy = (float*)malloc(n * sizeof(float));
float* vz = (float*)malloc(n * sizeof(float));
// + 6 frees + manual index arithmetic
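To make the elided "+ 6 frees + manual index arithmetic" concrete, here is a minimal sketch of a hand-written struct-of-arrays container. It assumes `std::vector` in place of raw `malloc`/`free`, and the names `EntitiesSoA` and `integrate_x` are hypothetical, not part of rf or the original comparison.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hand-written struct-of-arrays storage for an entity with pos and vel.
// Each component lives in its own contiguous array.
struct EntitiesSoA {
    std::vector<float> px, py, pz;
    std::vector<float> vx, vy, vz;

    explicit EntitiesSoA(std::size_t n)
        : px(n), py(n), pz(n), vx(n), vy(n), vz(n) {}

    std::size_t size() const { return px.size(); }
};

// A hot loop that reads and writes only the x components.
// py, pz, vy and vz are never touched, so they stay out of cache.
inline void integrate_x(EntitiesSoA& e, float dt) {
    assert(e.px.size() == e.vx.size());
    for (std::size_t i = 0; i < e.size(); ++i)
        e.px[i] += e.vx[i] * dt;
}
```

Adding a field to this container means a new member, a new constructor initializer, and edits to every loop that should see it, which is the boilerplate the `#soa` annotation is described as removing.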

Parallelism is one word

Data-parallel and task-parallel with @dispatch and @sync.

You get threaded execution without touching pthreads, OpenMP pragmas, or thread pools. Add threads later. Remove them later. Your code doesn't change.

rf — 9 lines

@sync {
    @dispatch process_physics(positions, velocities)
    @dispatch process_ai(entities)
}

@dispatch(threads = 4, chunk = 256)
for w in workers {
    w.counter += 1
}

C++ — 20+ lines

// task parallelism
#pragma omp parallel sections
{
    #pragma omp section
    { process_physics(positions, velocities); }
    #pragma omp section
    { process_ai(entities); }
}

// data parallelism
#pragma omp parallel for schedule(static, 256) num_threads(4)
for (int i = 0; i < n; ++i) {
    workers[i].counter += 1;
}

Aliasing is a contract, not a prayer

Declare disjointness with #ensure disjoint(a, b). Prove it with @disjoint(a, b).

You get the compiler checking your work. If slices might overlap, it won't compile. Most of the time you don't think about it — the compiler just handles it.

rf — compiler-enforced

-- callee: declares the contract
transform (rw a: []s64, rw b: []s64)
    #ensure disjoint(a, b)
{ }

-- caller: proves the contract holds
if @disjoint(a, b) { transform(a, b) }
-- ^ compile error if not provable

C++ — unchecked

// callee: cross your fingers
void transform(T* __restrict a,
               T* __restrict b,
               size_t n) { ... }

// caller: no compiler enforcement
__builtin_assume(!overlap(a, n, b, n));
// ^ forget this → no compiler warning
transform(a, b, n);
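The C++ snippet above calls an `overlap` helper that it does not define. One possible shape for it is sketched below, as an assumption rather than a known implementation; `transform_checked` is likewise a hypothetical wrapper that shows where the check would sit at a call site. Unlike a compile-time proof, this check only runs when assertions are enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true when [a, a+n) and [b, b+m) share any element.
// Addresses are compared as integers because relational comparison of
// pointers into unrelated objects is unspecified in C++.
template <typename T>
bool overlap(const T* a, std::size_t n, const T* b, std::size_t m) {
    if (n == 0 || m == 0) return false;  // an empty range overlaps nothing
    const auto a0 = reinterpret_cast<std::uintptr_t>(a);
    const auto b0 = reinterpret_cast<std::uintptr_t>(b);
    const auto a1 = a0 + n * sizeof(T);
    const auto b1 = b0 + m * sizeof(T);
    return a0 < b1 && b0 < a1;
}

// Hypothetical call-site wrapper: checks the contract, then does the work.
template <typename T>
void transform_checked(T* a, T* b, std::size_t n) {
    assert(!overlap(a, n, b, n) && "a and b must be disjoint");
    for (std::size_t i = 0; i < n; ++i)
        a[i] += b[i];
}
```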

Types at a glance.

Category	What you write	What you get
Integers	`s8` `s16` `s32` `s64` `u8` `u16` `u32` `u64`	LLVM integer of exact width
Floats	`r32` `r64`	IEEE 754 single/double
Boolean	`bool`	i1 with zero-initialization
Pointer	`rawptr`	Untyped 64-bit pointer
Dense slices	`[]T`	`{ptr, len}` fat pointer
Cacheline slices	`[]#cacheline T`	`{ptr, len, stride}` fat pointer + auto-padding
Fixed arrays	`[N]T`	Stack-allocated, bounds-checked
Comptime	`$T`	Monomorphized per call site
Structs	`#aos` (default) · `#soa` · `#align(N)`	Layout chosen at declaration time

rf vs C++.

Same output. Radically less input.

Task	rf	LOC	C++	LOC
Restrict pointers	`r a: []T`	1	`const T* __restrict a`	1
Disjoint proof	`if @disjoint(a,b) { f(a,b) }`	1	`__builtin_assume(!overlap(a,n,b,m))`	1
Struct-of-arrays	`struct #soa E { x,y,z: r32 }`	4	`float *Ex,*Ey,*Ez; // 3 mallocs + 3 frees + index math`	35
Cacheline padding	`[]#cacheline T`	1	`posix_memalign + manual stride + aligned free`	10
Parallel for	`@dispatch for x in xs { ... }`	2	`#pragma omp parallel for schedule(static, N)`	8
Task parallelism	`@sync { @dispatch a(); @dispatch b() }`	4	`#pragma omp parallel sections { ... }`	12
Compile-time param	`fn foo(ct n: s64)`	1	`template<int N> void foo()` or `consteval`	3
SIMD by default	`fn update(rw pos: []T, r vel: []T)`	1	`__restrict + __builtin_assume + pragma simd`	6
Deferred cleanup	`defer close(f)`	1	`RAII wrapper class + destructor`	8
Multi-slice iteration	`for a,b,c in xs,ys,zs { ... }`	1	`for(size_t i=0;i<n;++i){ a=xs[i];b=ys[i];c=zs[i]; }`	4
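The "Deferred cleanup" row counts roughly eight lines for an RAII wrapper. A minimal sketch of such a wrapper is shown here; `ScopeGuard` and `work` are hypothetical names used only for illustration, and the counter stands in for a real resource such as a file handle.

```cpp
#include <cassert>
#include <utility>

// Hypothetical scope guard: runs the stored callable when it leaves scope.
template <typename F>
class ScopeGuard {
public:
    explicit ScopeGuard(F f) : f_(std::move(f)) {}
    ~ScopeGuard() { f_(); }
    ScopeGuard(const ScopeGuard&) = delete;
    ScopeGuard& operator=(const ScopeGuard&) = delete;

private:
    F f_;
};

// The cleanup runs on every return path, early or not.
inline int work(int& closed, bool early) {
    auto on_exit = [&closed] { ++closed; };
    ScopeGuard<decltype(on_exit)> guard(on_exit);
    if (early) return 1;
    return 2;
}
```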

Case studies.

Fair comparisons. rf vs competent C++. Assembly-verified.

Alias-free physics update

noalias without __restrict — checked at every call site

rf — 3 lines, noalias by default

simulate (rw pos: [][3]r32, vel: [][3]r32, dt: r32) {
    for p, v in pos, vel {
        p = p + v * dt
    }
}
-- rw pos → noalias pos ptr in LLVM IR
-- vel (default r) → noalias readonly
-- [3]r32 → <3 x float> vector type
-- v * dt → scalar broadcast (fmul <3 x float>, float)

C++ — __restrict, manual fields

struct Vec3 { float x, y, z; };

// __restrict: same noalias IR. Manual annotation.
void simulate(Vec3* __restrict pos,
                const Vec3* __restrict vel,
                int n, float dt) {
    for (int i = 0; i < n; i++) {
        pos[i].x += vel[i].x * dt;
        pos[i].y += vel[i].y * dt;
        pos[i].z += vel[i].z * dt;
    }
}
// Without __restrict: compiler reloads vel[i]
// every iteration, assumes overlap possible

LLVM IR — rf

define void @simulate(
    ptr noalias %pos, i64 %len,
    ptr noalias readonly %vel, i64 %len2,
    float %dt)

LLVM IR — C++ (naive)

define void @simulate_naive(
    ptr %pos, ptr %vel,
    i32 %n, float %dt)     -- no noalias!

rf emits noalias on both pointers by default, and the semantic analyser checks the claim at every call site. In C++ you get the same IR with __restrict, but it is a manual annotation that nothing verifies.

SOA with cacheline padding

2 lines of type definition vs ~40 lines of manual memory management

rf — 2-line layout

struct #soa Particle {
    pos: [3]r32,
    vel: [3]r32,
    mass: r32,
}

-- []#cacheline = auto 64-byte stride
update (rw ents: []#cacheline Particle, dt: r32) {
    for e in ents {
        e.pos = e.pos + e.vel * dt
        e.mass = e.mass * 0.999
    }
}

C++ — ~40 lines of manual layout

struct Particles {
    float* px; float* py; float* pz;
    float* vx; float* vy; float* vz;
    float* mass; int count;
};

// alloc each field with stride math
float* px = (float*)_aligned_malloc(n * 64, 64);
float* py = (float*)_aligned_malloc(n * 64, 64);
// ... 5 more allocs, 7 frees, manual index * 16

Assembly — both produce identical stride-GEP

%p = getelementptr float, ptr %px, i64 %idx
; rf: compiler computes stride from #soa + #cacheline
; C++: programmer computes stride as i * 16

Identical assembly. The difference is source code: rf's 2-line struct definition replaces ~40 lines of manual allocation, stride math, and cleanup. Adding a field in rf is 1 line. In C++ it's 5+ edits across three code regions.
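The "manual index * 16" stride math can be sketched as below. This is an illustration under stated assumptions: a 64-byte cacheline, 4-byte floats, and C++17 aligned `operator new` instead of `_aligned_malloc`. The names `PaddedFloats`, `alloc_padded`, `at`, and `free_padded` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

constexpr std::size_t kCacheline = 64;                       // bytes, assumed
constexpr std::size_t kStride = kCacheline / sizeof(float);  // floats per slot
static_assert(kStride == 16, "sketch assumes 4-byte float");

// One float per cacheline: element i lives at base[i * kStride].
struct PaddedFloats {
    float* base;
    std::size_t count;
};

inline PaddedFloats alloc_padded(std::size_t n) {
    // Aligned operator new (C++17) returns storage aligned to 64 bytes.
    void* p = ::operator new(n * kCacheline, std::align_val_t(kCacheline));
    return PaddedFloats{static_cast<float*>(p), n};
}

// The stride multiplication the programmer must remember at every access.
inline float& at(PaddedFloats& a, std::size_t i) {
    assert(i < a.count);
    return a.base[i * kStride];
}

inline void free_padded(PaddedFloats& a) {
    ::operator delete(a.base, std::align_val_t(kCacheline));
    a.base = nullptr;
    a.count = 0;
}
```

Forgetting the `* kStride` in a single access still compiles and silently reads the padding, which is the class of mistake the typed stride is meant to rule out.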

Bounds-proven disjointness

The compiler tracks offset+length per slice — and proves non-overlap at compile time

rf — 3 scenarios, 0 bugs

struct T { x: r32 }

update (rw a: []T, rw b: []T)
    #ensure disjoint(a, b)
{ for i in 0..<a.len { a[i].x = a[i].x + b[i].x } }

buf = mem.alloc(T, 200)

-- (1) non-overlapping: no guard needed
a = buf[0..100]        -- offset=0, len=100
b = buf[101..105]      -- offset=101, len=4
update(a, b)            -- OK: bounds prove 0+100 <= 101

-- (2) overlapping: COMPILE ERROR
c = buf[50..150]      -- offset=50, len=100
update(a, c)            -- ERROR: 0..100 overlaps 50..150

-- (3) unknown provenance: guard required
x = some_fn()           -- provenance unknown
@assert(@disjoint(a, x))
update(a, x)            -- OK: guard covers unknown x

C++ — __restrict, no range tracking

// __restrict: same noalias IR, manual check
void update(T* __restrict a, T* __restrict b,
                int n) {
    for (int i = 0; i < n; i++)
        a[i].x += b[i].x;
}

T buf[200];
update(buf, buf + 101, 4);    // OK
update(buf, buf + 50, 100);    // compiles fine, no warning
// ^ rf catches this at compile time:
//   bounds prove 0..100 overlaps 50..150

LLVM IR — both generate noalias

define void @update(
    ptr noalias %a, i64 %len_a,
    ptr noalias %b, i64 %len_b)

-- Same IR in both. The difference is enforcement:
-- rf tracks offset+len per slice variable
-- and catches overlap automatically.

Both produce optimal noalias IR. The difference is that rf tracks every slice's offset and length, and catches overlapping sub-slices at compile time — no annotation needed. Non-overlapping sub-slices from the same allocation work without a guard. In C++ with __restrict, there's no range tracking — overlapping calls compile without complaint.
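The offset and length reasoning in scenarios (1) and (2) can be modelled in a few lines of C++. This is only a sketch of the arithmetic, assuming half-open ranges as the `offset`/`len` comments suggest; it is not how the rf compiler is implemented, and `Range`, `disjoint`, `ra`, `rb`, and `rc` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a sub-slice: offset and length into one allocation.
struct Range {
    std::size_t offset;
    std::size_t len;
};

// Half-open ranges [offset, offset + len) are disjoint when one ends
// at or before the point where the other begins.
constexpr bool disjoint(Range a, Range b) {
    return a.offset + a.len <= b.offset || b.offset + b.len <= a.offset;
}

constexpr Range ra{0, 100};   // buf[0..100]
constexpr Range rb{101, 4};   // buf[101..105]
constexpr Range rc{50, 100};  // buf[50..150]

// Scenario (1): 0 + 100 <= 101, so the call is accepted.
static_assert(disjoint(ra, rb), "a and b cannot overlap");
// Scenario (2): 0..100 and 50..150 share elements 50..99.
static_assert(!disjoint(ra, rc), "a and c overlap");
```

Scenario (3) has no compile-time equivalent here: when the bounds are not constants, the same function has to run as a runtime check.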

Parallel dispatch with cacheline safety

Type-safe parallelism — no pragmas, no manual stride math

rf — 3-line annotations

struct #soa Entity {
    health: r32,
    pos: [3]r32,
    vel: [3]r32,
}

simulate (rw ents: []#cacheline Entity, dt: r32) {
    @dispatch(threads = 4, chunk = 64)
    for e in ents {
        e.health = e.health - 0.1 * dt
        e.pos = e.pos + e.vel * dt
        if e.health <= 0 { e.health = 0 }
    }
}


C++ — manual everything

// Manual SOA + cacheline padding (see the SOA case study above)
// Then add OpenMP pragma:
void simulate(Entities& ents, float dt) {
    #pragma omp parallel for schedule(static, 64) num_threads(4)
    for (int i = 0; i < ents.count; i++) {  // assumes Entities carries a count field
        int off = i * 16;  // manual stride
        ents.health[off] -= 0.1f * dt;
        if (ents.health[off] <= 0) ents.health[off] = 0;
    }
}

Assembly — both generate GOMP_loop_static_start

call GOMP_loop_static_start
call GOMP_loop_static_next
; rf: auto-generated from @dispatch + []#cacheline
; C++: from #pragma omp parallel for with manual stride

Same OpenMP calls, same assembly. In rf, the type system handles false sharing ([]#cacheline), thread aliasing (access qualifiers), and chunk boundaries — you write the business logic, the compiler manages the rest. In C++, these are programmer-managed: stride math, padding, and chunk boundaries all need to be correct by hand.
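What "chunk boundaries" means under `schedule(static, 64)` with four threads can be sketched with a small model. It follows the OpenMP rule that fixed-size chunks are handed out round-robin in thread order; that `@dispatch(threads = 4, chunk = 64)` behaves the same way is an assumption based on the identical runtime calls shown above. `owner_thread` and `iterations_for` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Which thread runs iteration i under a static schedule with a fixed
// chunk size: chunk k goes to thread k % threads.
constexpr std::size_t owner_thread(std::size_t i, std::size_t chunk,
                                   std::size_t threads) {
    return (i / chunk) % threads;
}

// All iterations in [0, n) that a given thread executes.
inline std::vector<std::size_t> iterations_for(std::size_t tid, std::size_t n,
                                               std::size_t chunk,
                                               std::size_t threads) {
    assert(chunk > 0 && threads > 0);
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < n; ++i)
        if (owner_thread(i, chunk, threads) == tid) out.push_back(i);
    return out;
}
```

Iterations 63 and 64 belong to different threads, so without padding their elements could share a cacheline; with one element per 64-byte slot they cannot.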

Array programming as portable SIMD

[4]r32 is a native vector type — same assembly as hand-written SSE intrinsics

rf — 1 line, portable SIMD

f (a: [4]r32, b: [4]r32, s: r32) -> [4]r32 {
    return a * s + b
}
-- [4]r32 -> <4 x float> in LLVM IR
-- a * s -> scalar broadcast (fmul <4 x float>, float)
-- + b  -> vector add (fadd <4 x float>)
-- Compiles to mulps + addps on x86
-- Same rf code works on ARM NEON, WASM SIMD

C++ — SSE intrinsics, x86-only

// Same assembly as rf. Not portable.
// __m128 doesn't exist on ARM.
__m128 f(__m128 a, __m128 b, float s) {
    return _mm_add_ps(
        _mm_mul_ps(a, _mm_set1_ps(s)), b);
}

// Naive C++: relies on auto-vectorizer
void f(float* a, float* b, float s,
         float* out) {
    for (int i = 0; i < 4; i++)
        out[i] = a[i] * s + b[i];
}  // may or may not vectorize

LLVM IR — RF

define <4 x float> @f(
    <4 x float> %a, <4 x float> %b,
    float %s)

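  ; splat s across all four lanes, then multiply
  %s.ins = insertelement <4 x float> poison, float %s, i64 0
  %s.vec = shufflevector <4 x float> %s.ins, <4 x float> poison, <4 x i32> zeroinitializer
  %mul = fmul <4 x float> %a, %s.vec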
  %add = fadd <4 x float> %b, %mul
  ret <4 x float> %add

x86 Assembly — both produce identical SIMD

shufps  $0, %xmm2, %xmm2    ; broadcast s to all four lanes
mulps   (%rcx), %xmm2       ; a * s
addps   (%rdx), %xmm2       ; + b
movaps  %xmm2, %xmm0        ; result returned in xmm0
; rf generates this from [4]r32 types
; C++ needs __m128 + intrinsics (x86-only)

rf's [4]r32 is a portable SIMD type: it lowers to <4 x float> in LLVM IR and compiles to mulps + addps on x86 and to fmul + fadd on ARM NEON. The same rf source targets x86, ARM NEON, and WASM SIMD without changes. In C++, getting guaranteed SIMD requires platform-specific intrinsics (__m128, _mm_add_ps) or trusting the auto-vectorizer.
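As a reference for what `a * s + b` computes lane by lane, here is a plain scalar version in portable C++. It is a semantic sketch only: nothing guarantees that a compiler vectorizes it, which is the trade-off described above. `Vec4` and `f_ref` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Scalar reference for a * s + b: s is applied to every lane,
// then b is added lane by lane.
inline Vec4 f_ref(const Vec4& a, const Vec4& b, float s) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i)
        out[i] = a[i] * s + b[i];
    return out;
}
```

A reference like this is useful for testing an intrinsics version against, since both should agree on every lane.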
