Performance

#104 Apr 2026

104. Vec::swap_remove — O(1) Removal When Order Doesn't Matter

Calling Vec::remove(i) shifts every element after i left by one — that’s O(n) per call. If you don’t care about order, swap_remove does the same job in O(1).

The hidden cost of Vec::remove

remove takes the value out of the vector, but to keep the rest in order it has to shift every following element left by one slot. On a long vector, removing near the front is a giant memcpy:

1
2
3
4
5
let mut v = vec![1, 2, 3, 4, 5];
let removed = v.remove(1);

assert_eq!(removed, 2);
assert_eq!(v, [1, 3, 4, 5]);

Inside a loop — “remove every item that matches X” — that O(n) cost compounds into O(n²). A tight loop that should be linear silently becomes quadratic.

swap_remove: trade order for speed

swap_remove(i) removes the element at index i by swapping it with the last element, then popping the tail. No shifting, no memcpy chain — just one swap and a pop:

1
2
3
4
5
6
let mut v = vec![1, 2, 3, 4, 5];
let removed = v.swap_remove(1);

assert_eq!(removed, 2);
// 5 took 2's spot — order is gone, but the op was O(1).
assert_eq!(v, [1, 5, 3, 4]);

If your Vec is really a set — entities in a game world, pending jobs in a worker pool, items keyed by id elsewhere — order rarely matters. swap_remove is free.

Removing many items in place

The classic mistake: iterate forward calling remove. swap_remove lets you do it without the O(n²) blow-up. Walk by index and don’t bump the cursor when you remove:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
let mut nums = vec![1, 2, 3, 4, 5, 6, 7, 8];

let mut i = 0;
while i < nums.len() {
    if nums[i] % 2 == 0 {
        nums.swap_remove(i);
        // Don't increment — a new element is now at position i.
    } else {
        i += 1;
    }
}

nums.sort(); // canonicalize for the assert
assert_eq!(nums, [1, 3, 5, 7]);

For bulk predicate-based filtering, Vec::retain stays cleaner (and is also O(n)). Reach for swap_remove when you have a specific index you want gone in O(1).

With structs

swap_remove returns the removed value, so you can hand it off to another collection on the way out:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
#[derive(PartialEq, Debug)]
struct Job { id: u32 }

let mut queue = vec![
    Job { id: 1 },
    Job { id: 2 },
    Job { id: 3 },
    Job { id: 4 },
];

let pulled = queue.swap_remove(0);

assert_eq!(pulled, Job { id: 1 });
assert_eq!(queue.len(), 3);

When to reach for it

Use swap_remove whenever you’d write remove(i) and the surrounding code doesn’t depend on element order. Order-preserving removal still needs remove, but most “remove one item from this Vec” code is order-agnostic — and swap_remove turns an O(n) pothole into a one-line O(1) win. Stable since Rust 1.0, one of those quietly-essential APIs that’s easy to forget exists.

98. sort_by_cached_key — Stop Recomputing Expensive Sort Keys

sort_by_key sounds like it computes the key once per element. It doesn’t — it calls your closure at every comparison, so an n-element sort can pay for that key O(n log n) times. If the key is expensive, sort_by_cached_key is the fix you’ve been looking for.

The trap

The signature reads nicely: “sort by this key.” The implementation, less so — the closure fires on every comparison, not once per element:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
use std::cell::Cell;

let mut items = vec!["banana", "fig", "apple", "cherry", "date"];
let calls = Cell::new(0);

items.sort_by_key(|s| {
    calls.set(calls.get() + 1);
    // Pretend this is a heavy computation: allocating, hashing,
    // parsing, calling a regex, opening a file, etc.
    s.to_string()
});

// 5 elements, but the key ran way more than 5 times.
assert!(calls.get() > items.len());
assert_eq!(items, ["apple", "banana", "cherry", "date", "fig"]);

For identity keys that cost a pointer-deref, nobody cares. For anything that allocates.to_string(), .to_lowercase(), format!(...), a regex capture, a trimmed-and-lowered filename — the cost compounds quickly. I’ve seen a profile where 80% of total runtime was the key closure being called 40,000 times to sort 2,000 items.

The fix

slice::sort_by_cached_key runs your closure exactly once per element, stashes the results in a scratch buffer, then sorts against the cache. This is the Schwartzian transform, wrapped up in a method call:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
use std::cell::Cell;

let mut items = vec!["banana", "fig", "apple", "cherry", "date"];
let calls = Cell::new(0);

items.sort_by_cached_key(|s| {
    calls.set(calls.get() + 1);
    s.to_string()
});

// Exactly one call per element — no matter how big the slice is.
assert_eq!(calls.get(), items.len());
assert_eq!(items, ["apple", "banana", "cherry", "date", "fig"]);

Same result, linear key-function calls. The memory trade is a Vec<(K, usize)> the size of the slice — cheap next to the cost of re-running an allocating closure on every compare.

When to reach for which

The rule is about where your time goes, not how fancy the key looks:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
let mut nums = vec![5u32, 2, 8, 1, 9, 3];

// Trivial key: sort_by_key is fine (and avoids the scratch alloc).
nums.sort_by_key(|n| *n);
assert_eq!(nums, [1, 2, 3, 5, 8, 9]);

// Expensive key: sort_by_cached_key wins.
let mut files = vec!["Cargo.TOML", "src/MAIN.rs", "README.md", "build.RS"];
files.sort_by_cached_key(|path| path.to_lowercase());
assert_eq!(files, ["build.RS", "Cargo.TOML", "README.md", "src/MAIN.rs"]);

Use sort_by_key for cheap, Copy-ish keys. Use sort_by_cached_key the moment your closure allocates, hashes, parses, or otherwise does real work — it’s the difference between O(n log n) and O(n) calls to that closure.

90. black_box — Stop the Compiler From Erasing Your Benchmarks

Your benchmark ran in 0 nanoseconds? Congratulations — the compiler optimised away the code you were trying to measure. std::hint::black_box prevents that by hiding values from the optimiser.

The problem: the optimiser is too smart

The Rust compiler aggressively eliminates dead code. If it can prove a result is never used, it simply removes the computation:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
fn sum_range(n: u64) -> u64 {
    (0..n).sum()
}

fn main() {
    let start = std::time::Instant::now();
    let result = sum_range(10_000_000);
    let elapsed = start.elapsed();

    // Without using `result`, the compiler may skip the entire computation
    println!("took: {elapsed:?}");

    // Force the result to be "used" so the above isn't optimised out
    assert!(result > 0);
}

In release mode, the compiler can see through this and may still optimise the loop away — or even compute the answer at compile time. Your benchmark reports near-zero time, and you learn nothing.

Enter black_box

std::hint::black_box takes a value and returns it unchanged, but the compiler treats it as an opaque barrier — it can’t see through it, so it can’t optimise away whatever produced that value:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
use std::hint::black_box;

fn sum_range(n: u64) -> u64 {
    (0..n).sum()
}

fn main() {
    let start = std::time::Instant::now();
    let result = sum_range(black_box(10_000_000));
    let _ = black_box(result);
    let elapsed = start.elapsed();

    println!("sum = {result}, took: {elapsed:?}");
}

Two black_box calls do the trick:

  1. Wrap the input — prevents the compiler from constant-folding the argument
  2. Wrap the output — prevents dead-code elimination of the computation

Before and after

Without black_box (release mode):

1
sum = 49999995000000, took: 83ns     ← suspiciously fast

With black_box (release mode):

1
sum = 49999995000000, took: 5.612ms  ← actual work

It works on any type

black_box is generic — it works on integers, strings, structs, references, whatever:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
use std::hint::black_box;

fn main() {
    // Hide a vector from the optimiser
    let data: Vec<i32> = black_box(vec![1, 2, 3, 4, 5]);
    let total: i32 = data.iter().sum();
    let total = black_box(total);

    assert_eq!(total, 15);
}

Micro-benchmark recipe

Here’s a minimal pattern for quick-and-dirty micro-benchmarks:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
use std::hint::black_box;
use std::time::Instant;

fn fibonacci(n: u32) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => fibonacci(n - 1) + fibonacci(n - 2),
    }
}

fn main() {
    let iterations = 100;
    let start = Instant::now();
    for _ in 0..iterations {
        black_box(fibonacci(black_box(30)));
    }
    let elapsed = start.elapsed();
    println!("{iterations} runs in {elapsed:?} ({:?}/iter)", elapsed / iterations);
}

Without black_box, the compiler could hoist the pure function call out of the loop or eliminate it entirely. With it, each iteration does real work.

When to use it

Reach for black_box whenever you’re timing code and the results look suspiciously fast. It’s also the foundation that benchmarking frameworks like criterion and the built-in #[bench] use under the hood.

It’s not a full benchmarking harness — for serious measurement you still want warmup, statistics, and outlier detection. But when you need a quick sanity check, black_box + Instant gets the job done.

Available since Rust 1.66 on stable.

89. cold_path — Tell the Compiler Which Branch Won't Happen

Your error-handling branch fires once in a million calls, but the compiler doesn’t know that. core::hint::cold_path lets you mark unlikely branches so the optimiser can focus on the hot path.

Why branch layout matters

Modern CPUs predict which way a branch will go and speculatively execute instructions ahead of time. When the prediction is right, execution flies. When it’s wrong, the pipeline stalls.

Compilers already try to guess which branches are hot, but they don’t always get it right — especially when both sides of an if look equally plausible from a static analysis perspective. That’s where cold_path comes in.

The basics

Call cold_path() at the start of a branch that is rarely taken:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
use std::hint::cold_path;

fn process(value: Option<u64>) -> u64 {
    if let Some(v) = value {
        v * 2
    } else {
        cold_path();
        log_miss();
        0
    }
}

fn log_miss() {
    // imagine logging or metrics here
}

The compiler now knows the else arm is unlikely. It can lay out the machine code so the hot path (the Some arm) has no jumps, keeping it in the instruction cache and the branch predictor happy.

In match expressions

cold_path works well in match arms too — mark the rare variants:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
use std::hint::cold_path;

fn handle_status(code: u16) -> &'static str {
    match code {
        200 => "ok",
        301 => "moved",
        404 => { cold_path(); "not found" }
        500 => { cold_path(); "server error" }
        _   => { cold_path(); "unknown" }
    }
}

Only the branches you expect to be common stay on the fast track.

Building likely and unlikely helpers

If you’ve used C/C++, you might miss __builtin_expect. With cold_path you can build the same thing:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
use std::hint::cold_path;

#[inline(always)]
const fn likely(b: bool) -> bool {
    if !b { cold_path(); }
    b
}

#[inline(always)]
const fn unlikely(b: bool) -> bool {
    if b { cold_path(); }
    b
}

fn check_bounds(index: usize, len: usize) -> bool {
    if unlikely(index >= len) {
        panic!("out of bounds: {} >= {}", index, len);
    }
    true
}

Now you can annotate conditions directly instead of marking individual branches.

A word of caution

Misusing cold_path on a branch that actually runs often can hurt performance — the compiler will deprioritise it, and you’ll get more pipeline stalls, not fewer. Always benchmark before sprinkling hints around. Profile first, hint second.

The bottom line

cold_path is a zero-cost, zero-argument function that tells the optimiser what you already know: this branch is the exception, not the rule. It’s a small tool, but in hot loops and latency-sensitive code, it can make a measurable difference.

Stabilised in Rust 1.95 (April 2026).

76. slice::as_chunks — Split Slices into Fixed-Size Arrays

You’re calling .chunks(4) and immediately doing chunk.try_into().unwrap() to get an array. as_chunks gives you &[[T; N]] directly — a slice of properly typed arrays, plus the remainder.

The Problem

When you use .chunks(N), each chunk is a &[T] — a dynamically sized slice. The compiler doesn’t know its length, so you’re stuck converting manually:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
fn sum_pairs(data: &[i32]) -> Vec<i32> {
    data.chunks(2)
        .filter(|c| c.len() == 2) // skip incomplete last chunk
        .map(|c| c[0] + c[1])     // runtime indexing, no guarantees
        .collect()
}

fn main() {
    let values = [1, 2, 3, 4, 5];
    assert_eq!(sum_pairs(&values), vec![3, 7]);
}

That works, but the compiler can’t verify your index access at compile time. You’re also throwing away the last chunk if it’s incomplete, with no easy way to inspect it.

After: as_chunks

Stabilized in Rust 1.88, as_chunks splits a slice into a &[[T; N]] and a remainder &[T] in one call:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
fn sum_pairs(data: &[i32]) -> Vec<i32> {
    let (chunks, _remainder) = data.as_chunks::<2>();
    chunks.iter().map(|[a, b]| a + b).collect()
}

fn main() {
    let values = [1, 2, 3, 4, 5];
    assert_eq!(sum_pairs(&values), vec![3, 7]);

    // The remainder is available too
    let (chunks, remainder) = values.as_chunks::<2>();
    assert_eq!(chunks, &[[1, 2], [3, 4]]);
    assert_eq!(remainder, &[5]);
}

Each chunk is &[i32; 2], so you can pattern-match [a, b] directly. The compiler knows the size — no bounds checks, no panics.

Processing Fixed-Width Records

Parsing binary data with fixed-width fields is where as_chunks shines. Imagine RGB pixel data packed as bytes:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
fn brighten(pixels: &mut [u8], amount: u8) {
    let (chunks, _) = pixels.as_chunks_mut::<3>();
    for [r, g, b] in chunks {
        *r = r.saturating_add(amount);
        *g = g.saturating_add(amount);
        *b = b.saturating_add(amount);
    }
}

fn main() {
    let mut pixels = [100, 150, 200, 50, 60, 70, 255, 128, 0];
    brighten(&mut pixels, 30);
    assert_eq!(pixels, [130, 180, 230, 80, 90, 100, 255, 158, 30]);
}

No manual stride arithmetic. Each iteration gives you exactly 3 bytes, pattern-matched into r, g, b.

Don’t Lose the Remainder

Unlike chunks_exact() where you call .remainder() on the iterator after consuming it, as_chunks returns the remainder upfront:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
fn main() {
    let data = [10, 20, 30, 40, 50, 60, 70];

    let (fours, rest) = data.as_chunks::<4>();
    assert_eq!(fours.len(), 1);       // one complete chunk: [10, 20, 30, 40]
    assert_eq!(fours[0], [10, 20, 30, 40]);
    assert_eq!(rest, &[50, 60, 70]);  // leftover elements

    // as_rchunks starts from the right instead
    let (rest, fours) = data.as_rchunks::<4>();
    assert_eq!(rest, &[10, 20, 30]);
    assert_eq!(fours[0], [40, 50, 60, 70]);
}

as_rchunks is the mirror — it aligns chunks to the end, putting the remainder at the front. Useful when your trailing data is the structured part (e.g., a checksum or footer).

The Full Family

All stabilized in Rust 1.88, these come in immutable and mutable variants:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
fn main() {
    let data: &[u8] = &[1, 2, 3, 4, 5, 6, 7];

    // as_chunks — align from left, remainder on right
    let (chunks, rem) = data.as_chunks::<3>();
    assert_eq!(chunks, &[[1, 2, 3], [4, 5, 6]]);
    assert_eq!(rem, &[7]);

    // as_rchunks — align from right, remainder on left
    let (rem, chunks) = data.as_rchunks::<3>();
    assert_eq!(rem, &[1]);
    assert_eq!(chunks, &[[2, 3, 4], [5, 6, 7]]);

    // Mutable versions: as_chunks_mut, as_rchunks_mut
    let mut buf = [0u8; 6];
    let (chunks, _) = buf.as_chunks_mut::<2>();
    chunks[0] = [0xCA, 0xFE];
    chunks[1] = [0xBA, 0xBE];
    chunks[2] = [0xDE, 0xAD];
    assert_eq!(buf, [0xCA, 0xFE, 0xBA, 0xBE, 0xDE, 0xAD]);
}

Whenever you’re reaching for .chunks(N) with a compile-time constant, as_chunks::<N>() gives you stronger types, better ergonomics, and the remainder without gymnastics.

75. select_nth_unstable — Find the Kth Element Without Sorting

You’re sorting an entire Vec just to grab the median or the top 3 elements. select_nth_unstable does it in O(n) — no full sort required.

The classic approach: sort everything, then index:

1
2
3
4
let mut scores = vec![82, 45, 99, 67, 73, 91, 55];
scores.sort();
let median = scores[scores.len() / 2];
assert_eq!(median, 73);

That’s O(n log n) to answer an O(n) question. select_nth_unstable uses a partial-sort algorithm (quickselect) to put the element at a given index into its final sorted position — everything before it is smaller or equal, everything after is greater or equal:

1
2
3
4
let mut scores = vec![82, 45, 99, 67, 73, 91, 55];
let mid = scores.len() / 2;
scores.select_nth_unstable(mid);
assert_eq!(scores[mid], 73);

The method returns three mutable slices — elements below, the pivot element, and elements above — so you can work with each partition directly:

1
2
3
4
5
6
7
let mut data = vec![10, 80, 30, 90, 50, 70, 20];
let (lower, median, upper) = data.select_nth_unstable(3);
// lower contains 3 elements all <= *median
// upper contains 3 elements all >= *median
assert_eq!(*median, 50);
assert!(lower.iter().all(|&x| x <= 50));
assert!(upper.iter().all(|&x| x >= 50));

Need the top 3 scores without sorting the full list? Select the boundary, then sort only the small tail:

1
2
3
4
5
6
let mut scores = vec![82, 45, 99, 67, 73, 91, 55];
let k = scores.len() - 3;
scores.select_nth_unstable(k);
let top_3 = &mut scores[k..];
top_3.sort_unstable(); // sort only 3 elements, not all 7
assert_eq!(top_3, &[82, 91, 99]);

Custom ordering works too. Find the median by absolute value:

1
2
3
4
let mut vals = vec![-20, 5, -10, 15, -3, 8, 1];
let mid = vals.len() / 2;
vals.select_nth_unstable_by_key(mid, |v| v.abs());
assert_eq!(vals[mid].abs(), 8);

The “unstable” in the name means equal elements might be reordered (like sort_unstable) — it doesn’t mean the API is experimental. This is stable since Rust 1.49, available on any [T] where T: Ord.

36. Cow<str> — Clone on Write

Stop cloning strings “just in case” — Cow<str> lets you borrow when you can and clone only when you must.

The problem

You’re writing a function that sometimes needs to modify a string and sometimes doesn’t. The easy fix? Clone every time:

1
2
3
4
5
6
7
fn ensure_greeting(name: &str) -> String {
    if name.starts_with("Hello") {
        name.to_string() // unnecessary clone!
    } else {
        format!("Hello, {name}!")
    }
}

This works, but that first branch allocates a brand-new String even though name is already perfect as-is. In a hot loop, those wasted allocations add up.

Enter Cow<str>

Cow stands for Clone on Write. It holds either a borrowed reference or an owned value, and only clones when you actually need to mutate or take ownership:

1
2
3
4
5
6
7
8
9
use std::borrow::Cow;

fn ensure_greeting(name: &str) -> Cow<str> {
    if name.starts_with("Hello") {
        Cow::Borrowed(name) // zero-cost: just wraps the reference
    } else {
        Cow::Owned(format!("Hello, {name}!"))
    }
}

Now the happy path (name already starts with “Hello”) does zero allocation. The caller gets a Cow<str> that derefs to &str transparently — most code won’t even notice the difference.

Using Cow values

Because Cow<str> implements Deref<Target = str>, you can use it anywhere a &str is expected:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
use std::borrow::Cow;

fn ensure_greeting(name: &str) -> Cow<str> {
    if name.starts_with("Hello") {
        Cow::Borrowed(name)
    } else {
        Cow::Owned(format!("Hello, {name}!"))
    }
}

fn main() {
    let greeting = ensure_greeting("Hello, world!");
    assert_eq!(&*greeting, "Hello, world!");

    // Call &str methods directly on Cow
    assert!(greeting.contains("world"));

    // Only clone into String when you truly need ownership
    let _owned: String = greeting.into_owned();

    let greeting2 = ensure_greeting("Rust");
    assert_eq!(&*greeting2, "Hello, Rust!");
}

When to reach for Cow

Cow shines in these situations:

  • Conditional transformations — functions that modify input only sometimes (normalization, trimming, escaping)
  • Config/lookup values — return a static default or a dynamically built string
  • Parser outputs — most tokens are slices of the input, but some need unescaping

The Cow type works with any ToOwned pair, not just strings. You can use Cow<[u8]>, Cow<Path>, or Cow<[T]> the same way.

Quick reference

OperationCost
Cow::Borrowed(s)Free — wraps a reference
Cow::Owned(s)Whatever creating the owned value costs
*cow (deref)Free
cow.into_owned()Free if already owned, clones if borrowed
cow.to_mut()Clones if borrowed, then gives &mut access