Performance Impacts of Hybrid Cores

All cores are equal, but some are more equal than others.


I was reading about how modern Intel CPUs mix two types of cores, performance cores and efficiency cores, with different maximum clock speeds to get better power efficiency. For example, on my desktop (Intel i9-14900K):

[I] (base) aneesh@earth:~/a/_/2025$ lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ       MHZ
  0    0      0    0 0:0:0:0          yes 5700.0000 800.0000  972.4320
  1    0      0    0 0:0:0:0          yes 5700.0000 800.0000 1062.0959
  2    0      0    1 4:4:1:0          yes 5700.0000 800.0000 1100.0110
  3    0      0    1 4:4:1:0          yes 5700.0000 800.0000  800.0000
  4    0      0    2 8:8:2:0          yes 5700.0000 800.0000  800.0000
  5    0      0    2 8:8:2:0          yes 5700.0000 800.0000  800.0000
  6    0      0    3 12:12:3:0        yes 5700.0000 800.0000  889.0060
  7    0      0    3 12:12:3:0        yes 5700.0000 800.0000  800.0000
  8    0      0    4 16:16:4:0        yes 6000.0000 800.0000 1053.1350
  9    0      0    4 16:16:4:0        yes 6000.0000 800.0000 1065.4250
 10    0      0    5 20:20:5:0        yes 6000.0000 800.0000 1100.0000
 11    0      0    5 20:20:5:0        yes 6000.0000 800.0000  800.0000
 12    0      0    6 24:24:6:0        yes 5700.0000 800.0000 1048.7610
 13    0      0    6 24:24:6:0        yes 5700.0000 800.0000  800.0000
 14    0      0    7 28:28:7:0        yes 5700.0000 800.0000  800.0000
 15    0      0    7 28:28:7:0        yes 5700.0000 800.0000  800.0000
 16    0      0    8 32:32:8:0        yes 4400.0000 800.0000  800.0000
 17    0      0    9 33:33:8:0        yes 4400.0000 800.0000  800.0000
 18    0      0   10 34:34:8:0        yes 4400.0000 800.0000  800.1590
 19    0      0   11 35:35:8:0        yes 4400.0000 800.0000  800.0000
 20    0      0   12 36:36:9:0        yes 4400.0000 800.0000  800.1810
 21    0      0   13 37:37:9:0        yes 4400.0000 800.0000  800.0000
 22    0      0   14 38:38:9:0        yes 4400.0000 800.0000  800.0000
 23    0      0   15 39:39:9:0        yes 4400.0000 800.0000  799.5310
 24    0      0   16 40:40:10:0       yes 4400.0000 800.0000  800.0000
 25    0      0   17 41:41:10:0       yes 4400.0000 800.0000  800.0000
 26    0      0   18 42:42:10:0       yes 4400.0000 800.0000  800.0000
 27    0      0   19 43:43:10:0       yes 4400.0000 800.0000  800.2030
 28    0      0   20 44:44:11:0       yes 4400.0000 800.0000  800.0000
 29    0      0   21 45:45:11:0       yes 4400.0000 800.0000  800.0000
 30    0      0   22 46:46:11:0       yes 4400.0000 800.0000  800.0000
 31    0      0   23 47:47:11:0       yes 4400.0000 800.0000  800.0000
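
The MAXMHZ and MINMHZ columns come from cpufreq's files under sysfs, so the same grouping can be pulled programmatically. Here's a minimal sketch (not part of the benchmark below) that prints each CPU's maximum frequency; the sysfs value is in kHz:

use std::fs;

fn main() {
    // Walk /sys/devices/system/cpu/cpuN until we run out of CPUs.
    for cpu in 0.. {
        let path = format!("/sys/devices/system/cpu/cpu{cpu}/cpufreq/cpuinfo_max_freq");
        match fs::read_to_string(&path) {
            Ok(contents) => {
                let khz: u64 = contents.trim().parse().unwrap();
                println!("cpu{cpu}: {} MHz max", khz / 1000);
            }
            // No such CPU directory: we've seen them all.
            Err(_) => break,
        }
    }
}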

All of this is useful info, but not very fun or empirical. The CORE column shows that CPUs 0-15 are pairs of hyperthreads sharing eight physical cores that can boost to 5.7-6.0 GHz, while CPUs 16-31 are sixteen single-threaded cores capped at 4.4 GHz. I wanted to see if I could actually observe the performance difference between the core types, so I wrote the following benchmark:

// Deliberately naive recursive Fibonacci: purely CPU-bound work with a tiny memory footprint.
fn fib(n: usize) -> u64 {
    match n {
        0 => 1,
        1 => 1,
        _ => fib(n - 1) + fib(n - 2),
    }
}

fn main() {
    let timer = std::time::Instant::now();
    println!("{}", fib(42));
    println!("elapsed: {:?}", timer.elapsed());
}

In Linux, a process can be pinned to a specific core with taskset. Using hyperfine, I ran the benchmark (built in release mode) pinned to each core in turn. Here are the results:

[I] (base) aneesh@earth:~/t/cputest$ hyperfine -P cpu 0 31 "taskset -c {cpu} ./target/release/cputest"
Benchmark 1: taskset -c 0 ./target/release/cputest
  Time (mean ± σ):     417.4 ms ±   2.9 ms    [User: 418.0 ms, System: 0.3 ms]
  Range (min … max):   416.1 ms … 425.7 ms    10 runs
 
...

Benchmark 8: taskset -c 7 ./target/release/cputest
  Time (mean ± σ):     418.1 ms ±   1.2 ms    [User: 418.7 ms, System: 0.0 ms]
  Range (min … max):   416.5 ms … 420.1 ms    10 runs
 
...

Benchmark 17: taskset -c 16 ./target/release/cputest
  Time (mean ± σ):     738.4 ms ±   0.2 ms    [User: 738.9 ms, System: 0.0 ms]
  Range (min … max):   738.1 ms … 738.5 ms    10 runs
 
... 

Summary
  taskset -c 9 ./target/release/cputest ran
    1.00 ± 0.00 times faster than taskset -c 8 ./target/release/cputest
    1.01 ± 0.00 times faster than taskset -c 10 ./target/release/cputest
    1.01 ± 0.00 times faster than taskset -c 11 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 1 ./target/release/cputest
    1.05 ± 0.01 times faster than taskset -c 0 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 13 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 7 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 3 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 5 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 15 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 12 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 6 ./target/release/cputest
    1.05 ± 0.00 times faster than taskset -c 14 ./target/release/cputest
    1.06 ± 0.00 times faster than taskset -c 4 ./target/release/cputest
    1.06 ± 0.01 times faster than taskset -c 2 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 20 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 18 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 16 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 21 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 27 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 22 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 19 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 30 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 17 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 29 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 25 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 28 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 26 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 31 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 24 ./target/release/cputest
    1.86 ± 0.00 times faster than taskset -c 23 ./target/release/cputest

This is in line with what we saw from lscpu's output: CPUs 8-11 (the two physical cores with a 6.0 GHz max) are the fastest, and are about 5% faster than the other twelve CPUs in 0-15, which matches the 6.0 / 5.7 GHz clock ratio. The efficiency cores (CPUs 16-31) are much slower, taking about 1.86x as long as the fastest cores. That's a bigger gap than the roughly 1.36x you'd predict from the 6.0 vs 4.4 GHz max clocks alone, which suggests the E-cores also get less done per cycle on this workload.

As far as I know, the default Linux scheduler takes only limited account of each core's capabilities when placing tasks, and many parallel-processing libraries don't consider them at all. I'd like to play around with explicit core affinity sometime: it might be interesting to move lower-priority or IO-bound tasks to a slower core and save the fastest cores for the bulk of the compute. Regardless, this was an interesting experiment, and I gained a deeper understanding of the hardware I use daily.
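
For reference, here's a rough sketch of what explicit pinning could look like from inside a Rust program on Linux, going through the sched_setaffinity(2) syscall via the libc crate; the CPU numbers are just examples taken from the lscpu output above:

fn pin_to_cpu(cpu: usize) -> std::io::Result<()> {
    // Build a CPU set containing only the requested CPU and apply it to the
    // calling thread (pid 0 means "the current thread").
    unsafe {
        let mut set: libc::cpu_set_t = std::mem::zeroed();
        libc::CPU_ZERO(&mut set);
        libc::CPU_SET(cpu, &mut set);
        if libc::sched_setaffinity(0, std::mem::size_of::<libc::cpu_set_t>(), &set) != 0 {
            return Err(std::io::Error::last_os_error());
        }
    }
    Ok(())
}

fn main() {
    // Keep the heavy compute on one of the 6.0 GHz cores...
    pin_to_cpu(8).expect("failed to set affinity");
    // ...and push background or IO-bound work onto an efficiency core.
    let bg = std::thread::spawn(|| {
        pin_to_cpu(16).expect("failed to set affinity");
        // low-priority work goes here
    });
    // heavy compute goes here
    bg.join().unwrap();
}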

Written on January 3, 2025