Understanding the Energy Footprint of Your Code

This deep dive explores how software energy consumption affects real-world sustainability, complete with practical measurements using Rust and RAPL.

Ever stared at your cloud provider’s energy consumption report and wondered, "Could my code be part of the problem?" Or perhaps you’ve optimized an algorithm for speed, only to ask: "Does faster code also mean greener code?"

As software engineers, we've traditionally focused on two primary metrics: performance (time complexity) and space efficiency (space complexity). We optimize our algorithms to run faster and use less memory, often treating these as the ultimate measures of code quality. But in an era where data centers consume ~1.3% of global electricity—a figure projected to double by 2026 (source)—it’s time to rethink what "efficient" code really means.

While hardware advancements drive efficiency, software decisions also play a critical role. For instance, choosing between a naive loop and SIMD-optimized code can dramatically alter energy usage. The relationship between performance optimization and energy efficiency isn't always straightforward. Sometimes, faster code draws more power but finishes so quickly that the total energy consumed is lower. Other times, aggressive optimization leads to higher overall energy usage because the CPU spends more time in high-power states.

Let's explore how to measure, analyze, and understand your code's energy footprint.

Why Energy Efficiency Matters in Code

Modern applications rarely operate in isolation. A single API call might trigger:

Database queries
Network transfers
Parallel processing across CPU cores

Each operation consumes energy. While individual optimizations may seem trivial, at scale—across millions of users or petabytes of data—they compound into significant environmental and financial costs.

Consider this example:

# Naive sum (loop)
total = 0
for x in large_array:
    total += x

# Optimized sum (vectorized)
import numpy as np
total = np.sum(large_array)

The vectorized version can leverage SIMD (Single Instruction, Multiple Data) instructions, which process several values per instruction (for example, 16 float32 values in one 512-bit AVX-512 register). But does this speed boost translate to energy savings? Let's measure it.

Measuring Energy Consumption: The Basics

Tools of the Trade

On Linux systems, two primary methods exist for software-based energy measurement:

RAPL (Running Average Power Limit)

Intel/AMD CPUs expose energy counters via MSRs (Model-Specific Registers).
Accessed Through: cat '/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj'

Time-Based Estimation

Fallback for non-x86 CPUs (ARM, RISC-V).
Energy ≈ Time × TDP (Thermal Design Power) × Utilization.
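As a rough illustration of the time-based fallback, the formula above can be written as a small Python helper. The function name and the sample values (a 65 W TDP at 50% utilization) are assumptions chosen for illustration. The result is only a coarse estimate, because TDP is a thermal design figure rather than measured power.

```python
def estimate_energy_joules(duration_s: float, tdp_watts: float, utilization: float) -> float:
    """Estimate energy as Time x TDP x Utilization (a coarse approximation)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return duration_s * tdp_watts * utilization


# Hypothetical example: a 2-second task on a 65 W TDP CPU at 50% utilization
estimate = estimate_energy_joules(duration_s=2.0, tdp_watts=65.0, utilization=0.5)
print(f"Estimated energy: {estimate:.1f} J")
```

Because utilization and TDP are both approximations, this kind of estimate is better suited to comparing two versions of the same code on the same machine than to reporting absolute figures.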

Key Metrics

Metric	Formula	Unit	Purpose
Power	Energy/Time	Watts	Instantaneous 'rate' of energy.
Energy	Power x Time	Joules	Total consumed over task duration.
Energy per Operation	Energy/Operations	J/Op	Energy cost per operation.
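The formulas in the table can be sketched in a few lines of Python. The two runs below use made-up numbers (not measurements) to show how a run that draws more power can still consume less energy if it finishes sooner. The function names are illustrative.

```python
def average_power_watts(energy_j: float, duration_s: float) -> float:
    """Power = Energy / Time."""
    return energy_j / duration_s


def energy_per_op_joules(energy_j: float, operations: int) -> float:
    """Energy per operation = Energy / Operations."""
    return energy_j / operations


# Hypothetical runs of the same 1,000,000-operation task: (energy in J, time in s)
runs = {
    "slow, low power": (40.0, 4.0),
    "fast, high power": (25.0, 1.0),
}

for name, (energy_j, duration_s) in runs.items():
    power = average_power_watts(energy_j, duration_s)
    per_op = energy_per_op_joules(energy_j, 1_000_000)
    print(f"{name}: {power:.1f} W, {per_op * 1e6:.1f} uJ/op")
```

In this made-up case the fast run draws 2.5 times the power of the slow one but uses less total energy, which is why energy (not power alone) is the metric to compare.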

Let's Measure: Practical Implementation

To achieve reliable energy measurements, we need precise control over hardware resources and predictable execution patterns. This makes language choice crucial. I chose Rust for this implementation because:

Rust's ownership system and compile-time checks allow us to write high-level, safe code that compiles down to efficient machine code with no runtime overhead.
Unsafe blocks provide controlled access to low-level features (like CPU affinity calls and SIMD intrinsics) while maintaining memory safety elsewhere. This allows precise control over CPU features without sacrificing code safety:

unsafe {
    // CPU affinity: add `core` to the CPU set used for pinning
    CPU_SET(core, &mut set);
    // SIMD operations
    _mm512_storeu_ps(sum_array.as_mut_ptr(), sum);
}

Unlike garbage-collected languages, Rust's deterministic memory management ensures no unexpected pauses or energy spikes from GC cycles during measurements.

Here's the complete implementation demonstrating these principles:

// Enable experimental features for AVX-512 SIMD instructions (requires a nightly toolchain)
#![feature(stdarch_x86_avx512)]
#![feature(avx512_target_feature)]

use std::error::Error;
use std::fs;
use std::process::Command;
use std::thread;
use std::time::{Duration, Instant};
use libc::{sched_setaffinity, cpu_set_t, CPU_SET}; // For CPU core isolation
use rayon::prelude::*; // For parallel iteration
use std::arch::x86_64::{_mm512_add_ps, _mm512_loadu_ps, _mm512_setzero_ps, _mm512_storeu_ps};

/// RAPL provides hardware-level energy consumption data through MSRs (Model Specific Registers)
struct RaplEnergy {
    domains: Vec<(String, u64)>,
    timestamp: Instant,
}

impl RaplEnergy {
    /// Creates a RaplEnergy instance by reading the current energy values
    fn new() -> Result<Self, Box<dyn Error>> {
        let mut domains = Vec::new();
        // Read energy values from the powercap interface
        // This is more portable than direct MSR access
        for entry in fs::read_dir("/sys/class/powercap")? {
            let path = entry?.path();
            if path.join("name").exists() {
                let name = fs::read_to_string(path.join("name"))?.trim().to_string();
                // Read energy value in microjoules
                let energy = fs::read_to_string(path.join("energy_uj"))?.trim().parse()?;
                domains.push((name, energy));
            }
        }
        Ok(Self { domains, timestamp: Instant::now() })
    }

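    /// Calculates the energy difference between two measurements.
    /// Call as `end.delta(&start)`; returns the difference in joules.
    fn delta(&self, other: &Self) -> f64 {
        self.domains.iter().zip(other.domains.iter())
            .map(|((n1, e1), (_, e2))| {
                // A wrapped counter (e1 < e2) is counted as 0 rather than guessed
                let delta = if e1 > e2 { e1 - e2 } else { 0 };
                match n1.as_str() {
                    // "core" is a sub-domain of the package, so its energy is already
                    // part of "package-0"; summing both would count it twice
                    "package-0" => delta as f64 * 1e-6, // CPU package energy (µJ to J)
                    "dram" => delta as f64 * 1e-6, // Memory energy (µJ to J)
                    _ => 0.0 // Ignore other domains
                }
            }).sum()
    }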
}

/// Prepares the system for more consistent energy measurements (requires root)
fn prepare_system() -> Result<(), Box<dyn Error>> {
    // Set CPU governor to performance mode (maximum frequency)
    Command::new("cpupower")
        .args(["frequency-set", "-g", "performance"])
        .status()?;
    // Disable address space randomization for consistent memory access patterns
    fs::write("/proc/sys/kernel/randomize_va_space", "0")?;
    // Clear system caches to start measurements from a known state
    fs::write("/proc/sys/vm/drop_caches", "3")?;
    Ok(())
}

/// Pins the current thread to a specific CPU core
fn isolate_cpu(core: usize) -> Result<(), Box<dyn Error>> {
    // Create a CPU set with only the specified core enabled
    let mut set = unsafe { std::mem::zeroed::<cpu_set_t>() };
    unsafe { CPU_SET(core, &mut set) };
    // Set the calling thread's CPU affinity (pid 0 means the current thread)
    if unsafe { sched_setaffinity(0, std::mem::size_of_val(&set), &set) } != 0 {
        return Err("Failed to set CPU affinity".into());
    }
    Ok(())
}

/// Benchmarks a function for both time and energy consumption
fn benchmark<F: Fn() -> f32 + Send + Sync>(func: F, num_runs: usize) -> (f64, f64) {
    // Warm up the CPU and caches
    (0..10).for_each(|_| { func(); });
    thread::sleep(Duration::from_secs(2)); // Let system stabilize

    // Run the benchmark multiple times in parallel
    let results: Vec<_> = (0..num_runs).into_par_iter().map(|_| {
        isolate_cpu(0).unwrap(); // Pin to core 0 for consistency
        let start_energy = RaplEnergy::new().unwrap();
        let start_time = Instant::now();
        let result = func();
        let duration = start_time.elapsed();
        let end_energy = RaplEnergy::new().unwrap();
        (duration, end_energy.delta(&start_energy), result)
    }).collect();

    // Calculate median energy consumption
    let mut energies: Vec<f64> = results.iter().map(|(_, e, _)| *e).collect();
    energies.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let energy = energies[energies.len() / 2];

    // Calculate median duration
    let mut durations: Vec<f64> = results.iter().map(|(d, _, _)| d.as_secs_f64()).collect();
    durations.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let duration = durations[durations.len() / 2];

    (duration, energy)
}

/// SIMD-optimized sum function using AVX-512 instructions
#[target_feature(enable = "avx512f")]
unsafe fn simd_sum(data: &[f32]) -> f32 {
    // Initialize a 512-bit zero vector (16 x float32)
    let mut sum = _mm512_setzero_ps();

    // Process 16 elements at a time using SIMD
    data.chunks_exact(16)
        .for_each(|chunk| sum = _mm512_add_ps(sum, _mm512_loadu_ps(chunk.as_ptr())));

    let mut sum_array = [0.0f32; 16];
    _mm512_storeu_ps(sum_array.as_mut_ptr(), sum);
    let mut result = sum_array.iter().sum::<f32>();
    // + remaining elements that didn't fit in SIMD vectors
    result += data.chunks_exact(16).remainder().iter().sum::<f32>();
    result
}

fn main() -> Result<(), Box<dyn Error>> {
    prepare_system()?;

    // test data: array of 100M floats
    let data: Vec<f32> = (0..100_000_000).map(|x| x as f32).collect();

    let config = [
        // using iterator fold
        (
            "Naive",
            Box::new(|| data.iter().fold(0.0, |acc, &x| acc + x)) as Box<dyn Fn() -> f32 + Send + Sync>
        ),
        // SIMD implementation using AVX-512
        (
            "SIMD",
            Box::new(|| unsafe { simd_sum(&data) }) as Box<dyn Fn() -> f32 + Send + Sync>
        ),
    ];

    println!("| Implementation | Time (ms) | Energy (J) | nJ/Element |");
    println!("|----------------|-----------|------------|------------|");

    for (name, func) in &config {
        let (duration, energy) = benchmark(func, 21); // 21 runs for stable median

        println!(
            "| {:<14} | {:8.2} | {:9.3} | {:8.3} |",
            name,
            duration * 1000.0, // to milliseconds
            energy,
            energy / data.len() as f64 * 1e9 // energy per element (nanojoules)
        );
    }

    Ok(())
}

Key Design Decisions

Portable RAPL Access: Reads /sys/class/powercap instead of raw MSRs, so the same code works on any CPU whose kernel driver exposes RAPL through powercap (Intel and recent AMD).
Memory Hierarchy Control:

fs::write("/proc/sys/vm/drop_caches", "3")?;  // Flush page/dentry/inode caches

CPU Isolation:

isolate_cpu(0)?;  // Pin benchmark to CPU core 0

This minimizes interference from other processes which could skew energy measurements.

CPU Frequency Scaling:

Command::new("cpupower")
    .args(["frequency-set", "-g", "performance"])
    .status()?;

Ensures the CPU runs at maximum frequency for consistent measurements.

VA Space Randomization:

fs::write("/proc/sys/kernel/randomize_va_space", "0")?;

Disables address space layout randomization to reduce memory access variability.

Statistical Rigor: Uses 21-run median values to filter out transient noise while maintaining practical runtimes via Rayon’s parallelism.
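To see why a median is preferred over a mean here, consider the following minimal Python sketch. The sample values are invented for illustration: four similar readings plus one run that was disturbed by a background task.

```python
import statistics

# Hypothetical energy readings in joules; the last one caught a background task
energy_samples_j = [9.1, 9.2, 9.0, 9.3, 15.8]

median_j = statistics.median(energy_samples_j)
mean_j = statistics.mean(energy_samples_j)

print(f"median: {median_j:.2f} J, mean: {mean_j:.2f} J")
```

The single outlier pulls the mean well above the typical reading, while the median stays with the bulk of the samples. An odd run count such as 21 also means the median is an actual measured value rather than an average of two.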

Case: SIMD vs Naive Sum

Test Environment

Specifications

CPU: Intel(R) Xeon(R) CPU E3-1240 V2 @ 3.40GHz
- 4 cores / 8 threads
- 8MB L3 Cache
Memory: 4 x 8GiB DIMM DDR3
- 1600 MHz frequency
- 0.6 ns cycle time
- Dual-channel configuration
Test Data: 100M float32 elements
- Total size: 400MB (exceeds L3 cache)
- Sequential access pattern
- Memory bandwidth: ~25.6 GB/s theoretical maximum
Operating System: Ubuntu 24.04 LTS
Compiler: rustc 1.84.0

Results:

Implementation	Time (ms)	Energy (J)	Energy per Element (nJ)
Naive	670.65	29.633	296.334
SIMD	221.22	9.123	91.23

The SIMD implementation is about 3x faster and uses about 3.25x less energy than the naive loop. This translates to significant savings in large-scale data processing.
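The ratios can be re-derived from the table with a short Python sketch. The average power figures it prints are derived from the reported time and energy (energy divided by time), not separately measured values.

```python
# Values copied from the results table above
results = {
    "Naive": {"time_ms": 670.65, "energy_j": 29.633},
    "SIMD": {"time_ms": 221.22, "energy_j": 9.123},
}

speedup = results["Naive"]["time_ms"] / results["SIMD"]["time_ms"]
energy_ratio = results["Naive"]["energy_j"] / results["SIMD"]["energy_j"]

print(f"Speedup: {speedup:.2f}x, energy ratio: {energy_ratio:.2f}x")

for name, r in results.items():
    # Average power = energy / time, with time converted from ms to s
    avg_power_w = r["energy_j"] / (r["time_ms"] / 1000.0)
    print(f"{name}: average power {avg_power_w:.1f} W")
```

The derived average power is similar for both versions (roughly 44 W versus 41 W), which suggests that most of the energy saving in this test comes from the shorter run time rather than from a lower power draw.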

Hardware-Specific Analysis

Vector Processing

Newer architectures like Apple M4 could potentially triple this performance

Memory Subsystem Impact

DDR3-1600 bandwidth can become the bottleneck; modern DDR5 systems could show 2-3x better results
Dual-channel configuration helps reduce the memory bottleneck

Energy Efficiency Breakdown

The 3.25x energy improvement can be attributed to:

Reduced CPU Active Time:

Fewer instructions executed
Shorter active duty cycle

Memory Access Efficiency

Better cache utilization
Fewer DRAM operations
More efficient memory access patterns

These results demonstrate that SIMD optimization not only improves performance but provides even better energy efficiency gains.

Challenges in Accurate Measurement

Operating systems are busy environments. At any moment, your machine is running background tasks, scheduling processes, and managing memory. This system noise can interfere with energy measurements through:

Context Switching Overhead:

Process interruptions require CPU state saves/restores
Cache line invalidations trigger energy-expensive reloads

Memory Hierarchy Effects:

Page cache state affects access patterns
NUMA effects vary energy costs
TLB misses add unpredictable overhead

Hardware Limitations:

RAPL Accuracy: ±1-3% error on modern Intel CPUs, may vary by model
Counter Granularity: Energy counters update every ~1ms. Tasks completing faster than this suffer measurement inaccuracies.
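The counter limitations above can be handled defensively when reading RAPL from a script. The following Python sketch assumes the powercap path shown earlier and the standard `max_energy_range_uj` attribute next to it. The helper names and the `min_updates=100` threshold are illustrative assumptions, and the wraparound handling covers only a single wrap between two readings.

```python
from pathlib import Path

# Package 0 zone, as used earlier in the article
RAPL_ZONE = Path("/sys/class/powercap/intel-rapl/intel-rapl:0")


def read_counter_uj(zone: Path = RAPL_ZONE) -> int:
    """Read the current energy counter in microjoules (may require root)."""
    return int((zone / "energy_uj").read_text().strip())


def read_max_range_uj(zone: Path = RAPL_ZONE) -> int:
    """Read the value at which the energy counter wraps around."""
    return int((zone / "max_energy_range_uj").read_text().strip())


def energy_delta_joules(start_uj: int, end_uj: int, max_range_uj: int) -> float:
    """Difference between two readings in joules, allowing for one wraparound."""
    if end_uj >= start_uj:
        delta_uj = end_uj - start_uj
    else:
        # The counter wrapped: count up to the maximum, then from zero to end
        delta_uj = (max_range_uj - start_uj) + end_uj
    return delta_uj * 1e-6


def is_measurable(duration_s: float, update_interval_s: float = 1e-3,
                  min_updates: int = 100) -> bool:
    """Check that a task spans enough counter updates to be measured reliably."""
    return duration_s >= update_interval_s * min_updates
```

A task that fails the `is_measurable` check can be repeated in a loop and measured as a batch, dividing the total energy by the number of repetitions.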

Conclusion

Code energy efficiency might seem like a niche concern. After all, what difference could a few processor instructions make? But in today's world of massive-scale computing, these small optimizations compound into significant real-world impact, as demonstrated by the comparison between naive and SIMD-optimized summation:

Implementation	Time (ms)	Energy (J)	Energy per Element (nJ)
Naive	670.65	29.633	296.334
SIMD	221.22	9.123	91.23

The SIMD implementation is 3x faster and consumes 3.25x less energy per operation. While these savings might seem small for a single execution, they compound dramatically at scale.

Real-World Impact

Imagine this code is part of an HTTP server handling 1 million requests per day, with each request running the summation once. What does that mean for real-world applications? Let's calculate the annual savings:

daily_ops = 1_000_000
days_per_year = 365

joules_per_wh = 3600  # 1 Wh = 3,600 J
joules_per_kwh = 1000 * joules_per_wh  # 1 kWh = 3,600,000 J

naive_annual_energy = 29.633 * daily_ops * days_per_year / joules_per_kwh  # in kWh, ~3,004.46
simd_annual_energy = 9.123 * daily_ops * days_per_year / joules_per_kwh  # in kWh, ~924.97

annual_energy_saved = naive_annual_energy - simd_annual_energy  # in kWh, ~2,079.49

# For the United States in 2023, the average emissions factor is approximately
# 0.85 pounds (0.385 kg) of CO2 per kWh
# For more: https://www.epa.gov/egrid
co2_saved = annual_energy_saved * 0.385  # in kg, ~800.6

Annual Savings:

Energy saved: about 2,079 kWh
CO₂ savings: Assuming the average carbon intensity of electricity is 0.385 kg CO₂/kWh, this translates to about 800 kg of CO₂ saved annually

2,079 kWh of energy saved can:

Charge an iPhone 15 approximately 138,600 times (at 15 Wh per full charge):
- That is enough to charge one iPhone daily for about 380 years
Run an air conditioner (1500W) for about 1,386 hours (or ~58 days continuously)

About 800 kg of CO₂ savings is equivalent to:

About 32 mature trees absorbing CO₂ for one year (average tree absorbs ~25 kg CO₂/year)
Avoiding approximately 3,200 kilometers of driving in an average passenger car (assuming ~250g CO₂/km)

This may not seem like much at first glance, but it is the saving from one small function in one service. Imagine the cumulative impact across:

Entire codebases
Multiple services
Thousands of servers
Millions of deployments

The next time you're deciding whether to optimize that "hot" function, remember: you're not just making your code faster - you're contributing to a more sustainable digital future.

What makes this perspective particularly powerful is that energy efficiency often aligns with other software engineering goals. Faster code typically means more satisfied users, lower infrastructure costs, and, as we've seen, reduced environmental impact. It's a rare win-win-win scenario in software engineering.