Built for simulations, engines, and data-parallel code. The compiler handles layout, aliasing, and parallelism so you don't have to.
import "mem"
struct #soa Particle {
pos: [4]r32,
vel: [4]r32,
mass: r32,
}
simulate (rw p: []#cacheline Particle, dt: r32) {
@dispatch(threads = 4)
for e in p {
e.pos += e.vel * dt
e.vel.y -= 9.8 * dt
if e.pos.y < 0 { e.pos.y = 0 }
}
}
rf isn't public yet. Join the waitlist and we'll let you know when the first compiler drops.
Every feature exists because expert C++ programmers already write it by hand.
We just made the compiler do it instead.
Tag parameters r, w, rw.
You get automatic readonly, writeonly, and noalias on every LLVM pointer — SIMD vectorization without a single pragma.
T* __restrict pos, const T* __restrict vel
on every parameter, forever
Annotate any struct with #soa.
You get separate contiguous arrays per field. Hot loops only touch the fields they read. Zero manual pointer management.
float *px,*py,*pz,*vx,*vy,*vz; // 6 mallocs, 6 frees, manual indexing
~30 lines of boilerplate per struct
Declare #ensure disjoint(a, b) on a function.
You get llvm.assume inserted automatically. The compiler proves your slices don't overlap. If it can't, it's a compile error.
__builtin_assume(!ranges_overlap(a, n, b, m))
manually at every call site
Declare []#cacheline T or #align(64).
You get automatic stride computation, aligned allocation, and zero false-sharing between threads. The compiler knows the stride and optimizes indexing.
posix_memalign(&ptr, 64, n * 64); // stride tracked manually
easy to get wrong, impossible to verify
Annotate any loop with @dispatch(threads=4).
You get OpenMP-level parallelism without pragmas. Chunked scheduling, thread-local state, and task-parallel @sync regions — all compiler-managed.
#pragma omp parallel for schedule(static, chunk)
compiler-specific, no type checking
Declare parameters with ct (compile-time constant).
You get a compiler-enforced guarantee that the argument is known at compile time. No template metaprogramming. No constexpr cascades.
template<int N> void foo() or consteval
comptime as a parameter, not a function attribute
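For reference, the C++ template route named above can be sketched as follows. This is only an illustration of the C++ side: `scale_by` and `buffer_len` are made-up names, and nothing here is rf syntax.

```cpp
#include <array>
#include <cstddef>

// N is a template parameter, so the caller must supply a compile-time
// constant. Passing a runtime variable as N does not compile.
template <int N>
constexpr int scale_by(int x) {
    static_assert(N > 0, "N must be positive");
    return x * N;
}

// The result is a constant expression, so it can size an array.
inline std::size_t buffer_len() {
    std::array<int, scale_by<4>(8)> buf{};
    return buf.size();
}
```

The constant lives in the template argument list rather than in the ordinary parameter list, which is the contrast the `ct` qualifier is drawing.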
What takes 30 lines of C++ is one line in rf. The compiler handles the rest.
Mark struct layout at the type level — AOS, SOA, or cacheline-aligned.
You get optimal memory layout for your access pattern. Hot fields stay hot. SIMD lanes stay full. Zero runtime overhead.
struct Vec3 { x, y, z: r32 }
struct #soa Entity {
pos: Vec3,
vel: Vec3,
}
entities = mem.alloc(Entity, 4096) -- 6 arrays, 1 call
float* px = (float*)malloc(n * sizeof(float));
float* py = (float*)malloc(n * sizeof(float));
float* pz = (float*)malloc(n * sizeof(float));
float* vx = (float*)malloc(n * sizeof(float));
float* vy = (float*)malloc(n * sizeof(float));
float* vz = (float*)malloc(n * sizeof(float));
// + 6 frees + manual index arithmetic
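As a rough picture of the layout that `#soa` describes, here is a hand-written C++ equivalent. It is a sketch under assumptions: `EntitiesSoA` and `integrate_x` are hypothetical names, and `std::vector` stands in for the raw `malloc` calls so the example stays short and leak-free.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical hand-written counterpart of `struct #soa Entity`:
// one contiguous array per scalar field instead of one array of structs.
struct EntitiesSoA {
    std::vector<float> px, py, pz;  // pos.x, pos.y, pos.z
    std::vector<float> vx, vy, vz;  // vel.x, vel.y, vel.z

    explicit EntitiesSoA(std::size_t n)
        : px(n), py(n), pz(n), vx(n), vy(n), vz(n) {}

    std::size_t size() const { return px.size(); }
};

// A loop that integrates only the x axis reads and writes two of the
// six arrays; the other four are never touched.
inline void integrate_x(EntitiesSoA& e, float dt) {
    for (std::size_t i = 0; i < e.size(); ++i) {
        e.px[i] += e.vx[i] * dt;
    }
}
```

Even with `std::vector` handling allocation, every field still has to be listed by hand in the struct, the constructor, and each loop.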
Data-parallel and task-parallel with @dispatch and @sync.
You get threaded execution without touching pthreads, OpenMP pragmas, or thread pools. Add threads later. Remove them later. Your code doesn't change.
@sync {
@dispatch process_physics(positions, velocities)
@dispatch process_ai(entities)
}
@dispatch(threads = 4, chunk = 256)
for w in workers {
w.counter += 1
}
// task parallelism
#pragma omp parallel sections
{
#pragma omp section
{ process_physics(positions, velocities); }
#pragma omp section
{ process_ai(entities); }
}
// data parallelism
#pragma omp parallel for schedule(static, 256) num_threads(4)
for (int i = 0; i < n; ++i) {
workers[i].counter += 1;
}
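To make the `chunk` parameter concrete, the sketch below computes which iterations one thread would receive under OpenMP's `schedule(static, chunk)` rule, where chunks are handed out round-robin in thread order. Whether `@dispatch` uses exactly this assignment is an assumption; `Range` and `chunks_for_thread` are illustrative names.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Range {
    std::size_t begin;
    std::size_t end;  // half-open: [begin, end)
};

// Chunks that thread `tid` processes under a static round-robin schedule.
// Chunk c covers [c * chunk, (c + 1) * chunk) and goes to thread c % threads.
// Assumes chunk > 0 and threads > 0.
inline std::vector<Range> chunks_for_thread(std::size_t n, std::size_t chunk,
                                            std::size_t threads, std::size_t tid) {
    std::vector<Range> out;
    for (std::size_t start = tid * chunk; start < n; start += threads * chunk) {
        out.push_back({start, std::min(n, start + chunk)});
    }
    return out;
}
```

With 1000 iterations, `chunk = 256`, and 4 threads, the last thread receives the short tail chunk starting at 768.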
Declare disjointness with #ensure disjoint(a, b). Prove it with @disjoint(a, b).
You get the compiler checking your work. If slices might overlap, it won't compile. Most of the time you don't think about it — the compiler just handles it.
-- callee: declares the contract
transform (rw a: []s64, rw b: []s64)
#ensure disjoint(a, b)
{ }
-- caller: proves the contract holds
if @disjoint(a, b) { transform(a, b) }
-- ^ compile error if not provable
// callee: cross your fingers
void transform(T* __restrict a,
const T* __restrict b,
size_t n) { ... }
// caller: no compiler enforcement
__builtin_assume(!overlap(a, n, b, n));
// ^ forget this → no compiler warning
transform(a, b, n);
| Category | What you write | What you get |
|---|---|---|
| Integers | s8 s16 s32 s64 u8 u16 u32 u64 | LLVM integer of exact width |
| Floats | r32 r64 | IEEE 754 single/double |
| Boolean | bool | i1 with zero-initialization |
| Pointer | rawptr | Untyped 64-bit pointer |
| Dense slices | []T | {ptr, len} fat pointer |
| Cacheline slices | []#cacheline T | {ptr, len, stride} fat pointer + auto-padding |
| Fixed arrays | [N]T | Stack-allocated, bounds-checked |
| Comptime | $T | Monomorphized per call site |
| Structs | #aos (default) · #soa · #align(N) | Layout chosen at declaration time |
Same output. Radically less input.
| Task | rf | LOC | C++ | LOC |
|---|---|---|---|---|
| Restrict pointers | r a: []T | 1 | const T* __restrict a | 1 |
| Disjoint proof | if @disjoint(a,b) { f(a,b) } | 1 | __builtin_assume(!overlap(a,n,b,m)) | 1 |
| Struct-of-arrays | struct #soa E { x,y,z: r32 } | 4 | float *Ex,*Ey,*Ez; // 3 mallocs + 3 frees + index math | 35 |
| Cacheline padding | []#cacheline T | 1 | posix_memalign + manual stride + aligned free | 10 |
| Parallel for | @dispatch for x in xs { ... } | 2 | #pragma omp parallel for schedule(static, N) | 8 |
| Task parallelism | @sync { @dispatch a(); @dispatch b() } | 4 | #pragma omp parallel sections { ... } | 12 |
| Compile-time param | fn foo(ct n: s64) | 1 | template<int N> void foo() or consteval | 3 |
| SIMD by default | fn update(rw pos: []T, r vel: []T) | 1 | __restrict + __builtin_assume + pragma simd | 6 |
| Deferred cleanup | defer close(f) | 1 | RAII wrapper class + destructor | 8 |
| Multi-slice iteration | for a,b,c in xs,ys,zs { ... } | 1 | for(size_t i=0;i<n;++i){ a=xs[i];b=ys[i];c=zs[i]; } | 4 |
Fair comparisons. rf vs competent C++. Assembly-verified.
noalias without __restrict — checked at every call site
simulate (rw pos: [][3]r32, vel: [][3]r32, dt: r32) {
for p, v in pos, vel {
p = p + v * dt
}
}
-- rw pos → noalias pos ptr in LLVM IR
-- vel (default r) → noalias readonly
-- [3]r32 → <3 x float> vector type
-- v * dt → scalar broadcast (fmul <3 x float>, float)
struct Vec3 { float x, y, z; };
// __restrict: same noalias IR. Manual annotation.
void simulate(Vec3* __restrict pos,
const Vec3* __restrict vel,
int n, float dt) {
for (int i = 0; i < n; i++) {
pos[i].x += vel[i].x * dt;
pos[i].y += vel[i].y * dt;
pos[i].z += vel[i].z * dt;
}
}
// Without __restrict: compiler reloads vel[i]
// every iteration, because overlap is possible
define void @simulate(
ptr noalias %pos, i64 %len,
ptr noalias readonly %vel, i64 %len2,
float %dt)
define void @simulate_naive(
ptr %pos, ptr %vel,
i32 %n, float %dt) ; no noalias!
rf emits noalias IR by default, and the semantic analyser checks it at every call site. In C++ you get the same IR with __restrict, but the annotation is manual and nothing verifies it.
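A small C++ example shows why the compiler has to reload through a pointer when aliasing is possible. `add_twice` is a made-up function for illustration; it uses plain pointers with no `__restrict`, so passing the same address twice is legal.

```cpp
// Adds *b to *a twice. Without __restrict, a and b may point at the
// same int, so the compiler cannot keep *b in a register across the store.
inline void add_twice(int* a, const int* b) {
    *a += *b;
    *a += *b;  // *b is re-read: the first store may have changed it
}
```

With distinct arguments holding 1 and 1, the target ends at 3. When both pointers refer to the same variable holding 1, it ends at 4, which is the behaviour the optimizer must preserve unless it is told the pointers never alias.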
2 lines of type definition vs ~40 lines of manual memory management
struct #soa Particle {
pos: [3]r32,
vel: [3]r32,
mass: r32,
}
-- []#cacheline = auto 64-byte stride
update (rw ents: []#cacheline Particle, dt: r32) {
for e in ents {
e.pos = e.pos + e.vel * dt
e.mass = e.mass * 0.999
}
}
struct Particles {
float* px; float* py; float* pz;
float* vx; float* vy; float* vz;
float* mass; int count;
};
// alloc each field with stride math
float* px = static_cast<float*>(_aligned_malloc(n * 64, 64));
float* py = static_cast<float*>(_aligned_malloc(n * 64, 64));
// ... 5 more allocs, 7 frees, manual index * 16
gep float, float* %px, i64 %idx
; rf: compiler computes stride from #soa + #cacheline
; C++: programmer computes stride as i * 16
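The padding half of this comparison can be sketched in standard C++ with `alignas`. The example assumes a 64-byte cache line, which is common but not guaranteed by the language, and assumes C++17 or later so that `std::vector` honours the over-alignment. `PaddedCounter` and `stride_bytes` are illustrative names.

```cpp
#include <cstddef>
#include <vector>

// One counter per cache line. The 64-byte figure is an assumption about
// the target CPU, not something the C++ standard promises.
struct alignas(64) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == 64, "each element fills one line");
static_assert(alignof(PaddedCounter) == 64, "elements start on a line boundary");

// Byte distance between neighbouring elements, i.e. the stride.
// Requires at least two elements.
inline std::size_t stride_bytes(const std::vector<PaddedCounter>& v) {
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&v[1]) -
        reinterpret_cast<const char*>(&v[0]));
}
```

Because the padding is part of the type, `sizeof` already equals the stride and ordinary indexing works. The manual `i * 16` arithmetic only appears when the padding is kept outside the type, as in the raw `float*` version.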
The compiler tracks offset+length per slice — and proves non-overlap at compile time
struct T { x: r32 }
update (rw a: []T, rw b: []T)
#ensure disjoint(a, b)
{ for i in 0..<a.len { a[i].x = a[i].x + b[i].x } }
buf = mem.alloc(T, 200)
-- (1) non-overlapping: no guard needed
a = buf[0..100] -- offset=0, len=100
b = buf[101..105] -- offset=101, len=4
update(a, b) -- OK: bounds prove 0+100 <= 101
-- (2) overlapping: COMPILE ERROR
c = buf[50..150] -- offset=50, len=100
update(a, c) -- ERROR: 0..100 overlaps 50..150
-- (3) unknown provenance: guard required
x = some_fn() -- provenance unknown
@assert(@disjoint(a, x))
update(a, x) -- OK: guard covers unknown x
// __restrict: same noalias IR, manual check
void update(T* __restrict a, T* __restrict b,
int n) {
for (int i = 0; i < n; i++)
a[i].x += b[i].x;
}
T buf[200];
update(buf, buf + 101, 4); // OK
update(buf, buf + 50, 100); // compiles fine, no warning
// ^ rf catches this at compile time:
// bounds prove 0..100 overlaps 50..150
define void @update(
ptr noalias %a, i64 %len_a,
ptr noalias %b, i64 %len_b)
; Same IR in both. The difference is enforcement:
; rf tracks offset+len per slice variable
; and catches overlap automatically.
Both versions produce the same noalias IR. The difference is that rf tracks every slice's offset and length and catches overlapping sub-slices at compile time, with no annotation needed. Non-overlapping sub-slices from the same allocation work without a guard. In C++ with __restrict there is no range tracking, so overlapping calls compile without complaint.
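The bounds arithmetic behind the three cases can be written out as a compile-time check. This is a sketch of the reasoning, not of rf's implementation: `SubRange` and `disjoint` are hypothetical names, and the ranges are the offsets and lengths quoted in the rf example.

```cpp
#include <cstddef>

// A sub-slice of one parent allocation, described by offset and length.
struct SubRange {
    std::size_t offset;
    std::size_t len;
};

// Two half-open ranges are disjoint when one ends at or before
// the point where the other begins.
constexpr bool disjoint(SubRange x, SubRange y) {
    return x.offset + x.len <= y.offset || y.offset + y.len <= x.offset;
}

constexpr SubRange slice_a{0, 100};   // buf[0..100]
constexpr SubRange slice_b{101, 4};   // buf[101..105]
constexpr SubRange slice_c{50, 100};  // buf[50..150]

static_assert(disjoint(slice_a, slice_b), "0 + 100 <= 101");
static_assert(!disjoint(slice_a, slice_c), "0..100 and 50..150 share 50..100");
```

In C++ this only works if the programmer carries the offsets around and remembers to write the assertion. The claim in the text is that rf attaches this bookkeeping to every slice variable.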
Type-safe parallelism — no pragmas, no manual stride math
struct #soa Entity {
health: r32,
pos: [3]r32,
vel: [3]r32,
}
simulate (rw ents: []#cacheline Entity, dt: r32) {
@dispatch(threads = 4, chunk = 64)
for e in ents {
e.health = e.health - 0.1 * dt
e.pos = e.pos + e.vel * dt
if e.health <= 0 { e.health = 0 }
}
}
// Manual SOA + cacheline padding (see case 02)
// Then add OpenMP pragma:
void simulate(Entities& ents, float dt) {
#pragma omp parallel for schedule(static, 64) num_threads(4)
for (int i = 0; i < ents.count; i++) { // assumes Entities has a count field, like Particles
int off = i * 16; // manual stride
ents.health[off] -= 0.1f * dt;
if (ents.health[off] <= 0) ents.health[off] = 0;
}
}
call GOMP_loop_static_start
call GOMP_loop_static_next
; rf: auto-generated from @dispatch + []#cacheline
; C++: from #pragma omp parallel for with manual stride
rf takes care of stride ([]#cacheline), thread aliasing (access qualifiers), and chunk boundaries: you write the business logic, the compiler manages the rest. In C++, these are programmer-managed: stride math, padding, and chunk boundaries all need to be correct by hand.
[4]r32 is a native vector type — same assembly as hand-written SSE intrinsics
f (a: [4]r32, b: [4]r32, s: r32) -> [4]r32 {
return a * s + b
}
-- [4]r32 -> <4 x float> in LLVM IR
-- a * s -> scalar broadcast (fmul <4 x float>, float)
-- + b -> vector add (fadd <4 x float>)
-- Compiles to mulps + addps on x86
-- Same rf code works on ARM NEON, WASM SIMD
// Same assembly as rf. Not portable.
// __m128 doesn't exist on ARM.
__m128 f(__m128 a, __m128 b, float s) {
return _mm_add_ps(
_mm_mul_ps(a, _mm_set1_ps(s)), b);
}
// Naive C++: relies on auto-vectorizer
void f(float* a, float* b, float s,
float* out) {
for (int i = 0; i < 4; i++)
out[i] = a[i] * s + b[i];
} // may or may not vectorize
define <4 x float> @f(
<4 x float> %a, <4 x float> %b,
float %s)
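; broadcast s to all four lanes, then multiply
%s.ins = insertelement <4 x float> poison, float %s, i64 0
%s.splat = shufflevector <4 x float> %s.ins, <4 x float> poison, <4 x i32> zeroinitializer
%mul = fmul <4 x float> %a, %s.splat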
%add = fadd <4 x float> %b, %mul
ret <4 x float> %add
shufps $0, %xmm2, %xmm0 ; broadcast s
mulps (%rcx), %xmm0 ; a * s
addps (%rdx), %xmm0 ; + b
; rf generates this from [4]r32 types
; C++ needs __m128 + intrinsics (x86-only)
[4]r32 is a portable SIMD type: it lowers to <4 x float> in LLVM IR and compiles to mulps + addps on x86 and to fmul + fadd on ARM NEON. The same rf code works across x86, ARM NEON, and WASM SIMD. In C++, getting guaranteed SIMD requires platform-specific intrinsics (__m128, _mm_add_ps) or trusting the auto-vectorizer.
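For contrast with the intrinsics version, here is a portable C++ sketch of the same `a * s + b` operation on a four-float value. `Vec4` and `scale_add` are illustrative names. Unlike the intrinsics, nothing in this source guarantees SIMD instructions, so it falls under the "trusting the auto-vectorizer" case.

```cpp
#include <array>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Computes a * s + b element by element. Whether this becomes
// mulps + addps (or a NEON equivalent) is left to the optimizer.
inline Vec4 scale_add(const Vec4& a, const Vec4& b, float s) {
    Vec4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = a[i] * s + b[i];
    }
    return out;
}
```

The function compiles on any target, which is the portability half of the comparison. The guaranteed-vector half is what the `[4]r32` type is claimed to add.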