Now results are
Method | Mean | Error | StdDev
---|---|---|---
MasterBranch | 241.94 ns | 4.832 ns | 5.564 ns
PrBranch | 24.65 ns | 0.305 ns | 0.270 ns
But now the feedback I got is that BenchmarkTest.precomputed
might be stored in CPU-registry or some other fast cache. Then the benchmark results would be incorrect because PrBranch
benchmark is essentially "cheating". In reality, many processes/threads/etc. compete to store data in the fast cache, as it is a scarce resource.
Do you agree with the feedback I got? Could anyone give me a tip how to improve the benchmark a bit?
Thank you!
","upvoteCount":1,"answerCount":1,"acceptedAnswer":{"@type":"Answer","text":"Take 3 (I told you it's difficult!)
\n\n
public class BenchmarkTest\n{\n private static readonly decimal[] precomputedCache = new decimal[]\n {\n 1m,\n 0.1m,\n 0.01m,\n 0.001m,\n 0.0001m,\n 0.00001m,\n 0.000001m,\n 0.0000001m,\n 0.00000001m,\n 0.000000001m,\n 0.0000000001m,\n 0.00000000001m,\n 0.000000000001m,\n 0.0000000000001m,\n 0.00000000000001m,\n 0.000000000000001m,\n 0.0000000000000001m,\n 0.00000000000000001m,\n 0.000000000000000001m,\n 0.0000000000000000001m,\n };\n\n [Benchmark]\n public decimal Calculate()\n {\n decimal r = 0;\n\n for (int decimals = 0; decimals < 5; decimals++)\n {\n r += (decimal) Math.Pow(10, -decimals);\n }\n\n return r;\n }\n\n [Benchmark]\n public decimal Cache()\n {\n decimal r = 0;\n\n for (int decimals = 0; decimals < 5; decimals++)\n {\n r += precomputedCache[decimals];\n }\n\n return r;\n }\n\n [Benchmark]\n public decimal Ram()\n {\n decimal r = 0;\n\n for (int decimals = 0; decimals < 5; decimals++)\n {\n CacheHelper.FlushCache();\n r += precomputedCache[decimals];\n }\n\n return r;\n }\n\n [Benchmark]\n public void FlushCacheOverhead()\n {\n for (int decimals = 0; decimals < 5; decimals++)\n {\n CacheHelper.FlushCache();\n }\n }\n}\n\nstatic class CacheHelper\n{\n\n private static readonly byte[] cacheFlusher1 = GetCacheFlusher();\n private static readonly byte[] cacheFlusher2 = GetCacheFlusher();\n\n private static byte[] GetCacheFlusher()\n {\n Processor.GetPerCoreCacheSizes(out var l1, out var l2, out var l3);\n var totalCacheSize = l1 + l2 + l3;\n return new byte[totalCacheSize];\n }\n\n // You can probably get the actual cache line size from the processor information, but I just used the common 64 size because I'm lazy.\n const int cacheLineSize = 64;\n // Write to field to prevent dead code elimination.\n public static byte holder;\n private static bool flusher1;\n\n public static void FlushCache()\n {\n flusher1 = !flusher1;\n var array = flusher1 ? 
cacheFlusher1 : cacheFlusher2;\n // Touch every cache line to flush the cache.\n for (int i = 0, max = array.Length; i < max; i += cacheLineSize)\n {\n holder = array[i];\n }\n }\n}\n\nclass Processor\n{\n [DllImport(\"kernel32.dll\")]\n public static extern int GetCurrentThreadId();\n\n //[DllImport(\"kernel32.dll\")]\n //public static extern int GetCurrentProcessorNumber();\n\n [StructLayout(LayoutKind.Sequential, Pack = 4)]\n private struct GROUP_AFFINITY\n {\n public UIntPtr Mask;\n\n [MarshalAs(UnmanagedType.U2)]\n public ushort Group;\n\n [MarshalAs(UnmanagedType.ByValArray, SizeConst = 3, ArraySubType = UnmanagedType.U2)]\n public ushort[] Reserved;\n }\n\n [DllImport(\"kernel32\", SetLastError = true)]\n private static extern Boolean SetThreadGroupAffinity(IntPtr hThread, ref GROUP_AFFINITY GroupAffinity, ref GROUP_AFFINITY PreviousGroupAffinity);\n\n [StructLayout(LayoutKind.Sequential)]\n public struct PROCESSORCORE\n {\n public byte Flags;\n };\n\n [StructLayout(LayoutKind.Sequential)]\n public struct NUMANODE\n {\n public uint NodeNumber;\n }\n\n public enum PROCESSOR_CACHE_TYPE\n {\n CacheUnified,\n CacheInstruction,\n CacheData,\n CacheTrace\n }\n\n [StructLayout(LayoutKind.Sequential)]\n public struct CACHE_DESCRIPTOR\n {\n public byte Level;\n public byte Associativity;\n public ushort LineSize;\n public uint Size;\n public PROCESSOR_CACHE_TYPE Type;\n }\n\n [StructLayout(LayoutKind.Explicit)]\n public struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION_UNION\n {\n [FieldOffset(0)]\n public PROCESSORCORE ProcessorCore;\n [FieldOffset(0)]\n public NUMANODE NumaNode;\n [FieldOffset(0)]\n public CACHE_DESCRIPTOR Cache;\n [FieldOffset(0)]\n private UInt64 Reserved1;\n [FieldOffset(8)]\n private UInt64 Reserved2;\n }\n\n public enum LOGICAL_PROCESSOR_RELATIONSHIP\n {\n RelationProcessorCore,\n RelationNumaNode,\n RelationCache,\n RelationProcessorPackage,\n RelationGroup,\n RelationAll = 0xffff\n }\n\n public struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION\n 
{\n#pragma warning disable 0649\n public UIntPtr ProcessorMask;\n public LOGICAL_PROCESSOR_RELATIONSHIP Relationship;\n public SYSTEM_LOGICAL_PROCESSOR_INFORMATION_UNION ProcessorInformation;\n#pragma warning restore 0649\n }\n\n [DllImport(@\"kernel32.dll\", SetLastError = true)]\n public static extern bool GetLogicalProcessorInformation(IntPtr Buffer, ref uint ReturnLength);\n\n private const int ERROR_INSUFFICIENT_BUFFER = 122;\n\n private static SYSTEM_LOGICAL_PROCESSOR_INFORMATION[] _logicalProcessorInformation = null;\n\n public static SYSTEM_LOGICAL_PROCESSOR_INFORMATION[] LogicalProcessorInformation\n {\n get\n {\n if (_logicalProcessorInformation != null)\n return _logicalProcessorInformation;\n\n uint ReturnLength = 0;\n\n GetLogicalProcessorInformation(IntPtr.Zero, ref ReturnLength);\n\n if (Marshal.GetLastWin32Error() == ERROR_INSUFFICIENT_BUFFER)\n {\n IntPtr Ptr = Marshal.AllocHGlobal((int) ReturnLength);\n try\n {\n if (GetLogicalProcessorInformation(Ptr, ref ReturnLength))\n {\n int size = Marshal.SizeOf(typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));\n int len = (int) ReturnLength / size;\n _logicalProcessorInformation = new SYSTEM_LOGICAL_PROCESSOR_INFORMATION[len];\n IntPtr Item = Ptr;\n\n for (int i = 0; i < len; i++)\n {\n _logicalProcessorInformation[i] = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION) Marshal.PtrToStructure(Item, typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));\n Item += size;\n }\n\n return _logicalProcessorInformation;\n }\n }\n finally\n {\n Marshal.FreeHGlobal(Ptr);\n }\n }\n return null;\n }\n }\n\n public static void GetPerCoreCacheSizes(out Int64 L1, out Int64 L2, out Int64 L3)\n {\n L1 = 0;\n L2 = 0;\n L3 = 0;\n\n var info = Processor.LogicalProcessorInformation;\n foreach (var entry in info)\n {\n if (entry.Relationship != Processor.LOGICAL_PROCESSOR_RELATIONSHIP.RelationCache)\n continue;\n Int64 mask = (Int64) entry.ProcessorMask;\n if ((mask & (Int64) 1) == 0)\n continue;\n var cache = entry.ProcessorInformation.Cache;\n 
switch (cache.Level)\n {\n case 1: L1 = L1 + cache.Size; break;\n case 2: L2 = L2 + cache.Size; break;\n case 3: L3 = L3 + cache.Size; break;\n default:\n break;\n }\n }\n }\n}
Method | Mean | Error | StdDev
---|---|---|---
Calculate | 457.24 ns | 2.536 ns | 2.372 ns
Cache | 55.43 ns | 0.410 ns | 0.364 ns
Ram | 5,026,508.40 ns | 18,058.436 ns | 14,098.839 ns
FlushCacheOverhead | 5,019,991.08 ns | 26,890.458 ns | 20,994.301 ns
You can see I flushed the cpu cache before each access. This of course has its own overhead, so I measured that overhead in its own benchmark. Subtracting that overhead we get these results:
Method | Mean | Error | StdDev
---|---|---|---
Calculate | 457.24 ns | 2.536 ns | 2.372 ns
Cache | 55.43 ns | 0.410 ns | 0.364 ns
Ram | 6,517.32 ns | ? | ?
It looks like this time ram is ~117x slower than cache. That's much more believable to me considering the common 10-100x.
","upvoteCount":0,"url":"https://github.com/dotnet/BenchmarkDotNet/discussions/2513#discussioncomment-8259443"}}}-
I wrote a very simple benchmark for comparing whether it makes sense to precompute values.
using System;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
namespace Benchmark;
/// <summary>
/// Console entry point that launches the benchmark suite.
/// </summary>
public static class Program
{
    /// <summary>
    /// Entry point of the benchmark.
    /// </summary>
    public static void Main() => BenchmarkRunner.Run<BenchmarkTest>();
}
/// <summary>
/// Benchmark class for a comparison of current code and a proposed patch.
/// </summary>
/// <seealso href="https://benchmarkdotnet.org/articles/guides/"/>
public class BenchmarkTest
{
    // Table of powers of ten from 10^0 down to 10^-19, indexed by the number
    // of decimal places.
    private static readonly decimal[] precomputed =
    {
        1m,
        0.1m,
        0.01m,
        0.001m,
        0.0001m,
        0.00001m,
        0.000001m,
        0.0000001m,
        0.00000001m,
        0.000000001m,
        0.0000000001m,
        0.00000000001m,
        0.000000000001m,
        0.0000000000001m,
        0.00000000000001m,
        0.000000000000001m,
        0.0000000000000001m,
        0.00000000000000001m,
        0.000000000000000001m,
        0.0000000000000000001m,
    };

    /// <summary>
    /// Executed once before any benchmark runs; intentionally empty.
    /// </summary>
    [GlobalSetup]
    public void GlobalSetup()
    {
    }

    /// <summary>
    /// Current (master) implementation: derives each power of ten on the fly
    /// via <see cref="Math.Pow"/> and converts the double result to decimal.
    /// </summary>
    [Benchmark]
    public decimal MasterBranch()
    {
        var total = 0m;
        for (var exponent = 0; exponent < 5; exponent++)
        {
            total += (decimal)Math.Pow(10, -exponent);
        }
        return total;
    }

    /// <summary>
    /// Proposed (PR) implementation: looks the power of ten up in the
    /// precomputed table instead of recomputing it.
    /// </summary>
    [Benchmark]
    public decimal PrBranch()
    {
        var total = 0m;
        for (var exponent = 0; exponent < 5; exponent++)
        {
            total += precomputed[exponent];
        }
        return total;
    }
} Now results are
But now the feedback I got is that the precomputed array might be kept in a CPU register or some other fast cache. Do you agree with the feedback I got? Could anyone give me a tip on how to improve the benchmark a bit? Thank you! |
Beta Was this translation helpful? Give feedback.
-
The feedback is correct. It is a difficult thing to measure the speed of cache vs ram in C#. I gave it a try and got these results:
Code
/// <summary>
/// Benchmark comparing three ways of obtaining small powers of ten:
/// computing them (Calculate), reading a single cache-resident table (Cache),
/// and reading spread-out copies intended to live only in RAM (Ram).
/// </summary>
public class BenchmarkTest
{
// Number of table entries each benchmark reads per invocation.
const int CountToMeasure = 5;
// Single table expected to stay hot in the CPU cache across iterations.
private static readonly decimal[] precomputedCache = new decimal[]
{
1m,
0.1m,
0.01m,
0.001m,
0.0001m,
0.00001m,
0.000001m,
0.0000001m,
0.00000001m,
0.000000001m,
0.0000000001m,
0.00000000001m,
0.000000000001m,
0.0000000000001m,
0.00000000000001m,
0.000000000000001m,
0.0000000000000001m,
0.00000000000000001m,
0.000000000000000001m,
0.0000000000000000001m,
};
// Many copies of the table, padded apart so that reading a different copy
// each iteration should miss the CPU caches and hit main memory.
private static readonly decimal[][] precomputedRAM = GetPrecomputedRAM();
// Keeps the spacer arrays reachable so the GC cannot collect the padding
// between the table copies.
private static List<byte[]> spacers;
/// <summary>
/// Builds enough copies of <see cref="precomputedCache"/> (with spacer
/// allocations between them) to exceed the total L1+L2+L3 cache size.
/// </summary>
private static decimal[][] GetPrecomputedRAM()
{
Processor.GetPerCoreCacheSizes(out var l1, out var l2, out var l3);
var totalCacheSize = l1 + l2 + l3;
// Estimated bytes per copy. NOTE(review): the original comment said a
// decimal is 16 bytes, but the code multiplies by 128 — the overestimate
// only produces extra copies, which is harmless; confirm intent.
var precomputedSize = (precomputedCache.Length * 128) + IntPtr.Size * 3;
// We copy the array as many times as needed to fill the cache.
int copyCount = 0;
for (int cacheFillSize = 0; cacheFillSize < totalCacheSize; cacheFillSize += precomputedSize)
{
++copyCount;
}
// + CountToMeasure more to be sure that we're pulling from RAM and not cache and that we have a minimum size.
copyCount += CountToMeasure;
var array = new decimal[copyCount][];
spacers = new List<byte[]>();
for (int i = 0; i < copyCount; ++i)
{
// Allocate a 4 KB (64*64) spacer between copies to stop the CPU from
// pulling multiple copies into the cache with a single read. The real
// value could be read from the processor information; 4 KB is simply a
// common, safe overestimate.
spacers.Add(new byte[64*64]);
// Copy the cache.
array[i] = precomputedCache.ToArray();
}
return array;
}
// Rotates through the RAM copies so consecutive iterations touch different,
// presumably-uncached, memory.
int currentIndex = 0;
// Safe to read precomputedRAM here: C# runs static field initializers in
// textual order, and precomputedRAM is declared above.
static readonly int ramMod = precomputedRAM.Length;
/// <summary>
/// Baseline: computes each power of ten with Math.Pow. The index
/// bookkeeping mirrors Ram() so all three benchmarks do the same extra work.
/// </summary>
[Benchmark]
public decimal Calculate()
{
decimal r = 0;
for (int decimals = 0; decimals < CountToMeasure; decimals++)
{
r += (decimal) Math.Pow(10, -decimals);
// We do the index calculations to match the extra work of Ram.
++currentIndex;
currentIndex %= ramMod;
}
return r;
}
/// <summary>
/// Reads the single table, which should be served from the CPU cache.
/// </summary>
[Benchmark]
public decimal Cache()
{
decimal r = 0;
for (int decimals = 0; decimals < CountToMeasure; decimals++)
{
r += precomputedCache[decimals];
// We do the index calculations to match the extra work of Ram.
++currentIndex;
currentIndex %= ramMod;
}
return r;
}
/// <summary>
/// Reads a different table copy each iteration, aiming to measure RAM
/// (cache-miss) latency rather than cache-hit latency.
/// </summary>
[Benchmark]
public decimal Ram()
{
decimal r = 0;
for (int decimals = 0; decimals < CountToMeasure; decimals++)
{
r += precomputedRAM[currentIndex][decimals];
++currentIndex;
currentIndex %= ramMod;
}
return r;
}
}
/// <summary>
/// Win32 interop helpers for querying the logical-processor topology via
/// <c>GetLogicalProcessorInformation</c>, used here to read the L1/L2/L3
/// cache sizes seen by the first logical processor.
/// </summary>
class Processor
{
    [DllImport("kernel32.dll")]
    public static extern int GetCurrentThreadId();

    //[DllImport("kernel32.dll")]
    //public static extern int GetCurrentProcessorNumber();

    // Mirrors the native GROUP_AFFINITY structure (layout must not change).
    [StructLayout(LayoutKind.Sequential, Pack = 4)]
    private struct GROUP_AFFINITY
    {
        public UIntPtr Mask;

        [MarshalAs(UnmanagedType.U2)]
        public ushort Group;

        [MarshalAs(UnmanagedType.ByValArray, SizeConst = 3, ArraySubType = UnmanagedType.U2)]
        public ushort[] Reserved;
    }

    [DllImport("kernel32", SetLastError = true)]
    private static extern Boolean SetThreadGroupAffinity(IntPtr hThread, ref GROUP_AFFINITY GroupAffinity, ref GROUP_AFFINITY PreviousGroupAffinity);

    /// <summary>Mirrors the native PROCESSORCORE record.</summary>
    [StructLayout(LayoutKind.Sequential)]
    public struct PROCESSORCORE
    {
        public byte Flags;
    };

    /// <summary>Mirrors the native NUMANODE record.</summary>
    [StructLayout(LayoutKind.Sequential)]
    public struct NUMANODE
    {
        public uint NodeNumber;
    }

    /// <summary>Native PROCESSOR_CACHE_TYPE enumeration.</summary>
    public enum PROCESSOR_CACHE_TYPE
    {
        CacheUnified,
        CacheInstruction,
        CacheData,
        CacheTrace
    }

    /// <summary>Mirrors the native CACHE_DESCRIPTOR record (Size is in bytes).</summary>
    [StructLayout(LayoutKind.Sequential)]
    public struct CACHE_DESCRIPTOR
    {
        public byte Level;
        public byte Associativity;
        public ushort LineSize;
        public uint Size;
        public PROCESSOR_CACHE_TYPE Type;
    }

    /// <summary>
    /// Union member of SYSTEM_LOGICAL_PROCESSOR_INFORMATION; all fields
    /// overlap at offset 0 as in the native definition.
    /// </summary>
    [StructLayout(LayoutKind.Explicit)]
    public struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION_UNION
    {
        [FieldOffset(0)]
        public PROCESSORCORE ProcessorCore;
        [FieldOffset(0)]
        public NUMANODE NumaNode;
        [FieldOffset(0)]
        public CACHE_DESCRIPTOR Cache;
        [FieldOffset(0)]
        private UInt64 Reserved1;
        [FieldOffset(8)]
        private UInt64 Reserved2;
    }

    /// <summary>Native LOGICAL_PROCESSOR_RELATIONSHIP enumeration.</summary>
    public enum LOGICAL_PROCESSOR_RELATIONSHIP
    {
        RelationProcessorCore,
        RelationNumaNode,
        RelationCache,
        RelationProcessorPackage,
        RelationGroup,
        RelationAll = 0xffff
    }

    /// <summary>Mirrors the native SYSTEM_LOGICAL_PROCESSOR_INFORMATION record.</summary>
    public struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION
    {
#pragma warning disable 0649
        public UIntPtr ProcessorMask;
        public LOGICAL_PROCESSOR_RELATIONSHIP Relationship;
        public SYSTEM_LOGICAL_PROCESSOR_INFORMATION_UNION ProcessorInformation;
#pragma warning restore 0649
    }

    [DllImport(@"kernel32.dll", SetLastError = true)]
    public static extern bool GetLogicalProcessorInformation(IntPtr Buffer, ref uint ReturnLength);

    private const int ERROR_INSUFFICIENT_BUFFER = 122;

    // Cached result of the (relatively expensive) native query.
    private static SYSTEM_LOGICAL_PROCESSOR_INFORMATION[] _logicalProcessorInformation = null;

    /// <summary>
    /// Lazily queries and caches the system's logical-processor information.
    /// Returns <c>null</c> when the native call fails; callers must handle that.
    /// </summary>
    public static SYSTEM_LOGICAL_PROCESSOR_INFORMATION[] LogicalProcessorInformation
    {
        get
        {
            if (_logicalProcessorInformation != null)
                return _logicalProcessorInformation;

            uint ReturnLength = 0;

            // First call with a null buffer just reports the required size.
            GetLogicalProcessorInformation(IntPtr.Zero, ref ReturnLength);

            if (Marshal.GetLastWin32Error() == ERROR_INSUFFICIENT_BUFFER)
            {
                IntPtr Ptr = Marshal.AllocHGlobal((int) ReturnLength);
                try
                {
                    if (GetLogicalProcessorInformation(Ptr, ref ReturnLength))
                    {
                        int size = Marshal.SizeOf(typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
                        int len = (int) ReturnLength / size;
                        _logicalProcessorInformation = new SYSTEM_LOGICAL_PROCESSOR_INFORMATION[len];
                        IntPtr Item = Ptr;

                        // Marshal each fixed-size record out of the native buffer.
                        for (int i = 0; i < len; i++)
                        {
                            _logicalProcessorInformation[i] = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION) Marshal.PtrToStructure(Item, typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
                            Item += size;
                        }

                        return _logicalProcessorInformation;
                    }
                }
                finally
                {
                    Marshal.FreeHGlobal(Ptr);
                }
            }
            return null;
        }
    }

    /// <summary>
    /// Sums the reported L1/L2/L3 cache sizes (in bytes) for cache entries
    /// whose processor mask includes the first logical processor (bit 0).
    /// All three outputs are zero when the topology lookup fails.
    /// </summary>
    public static void GetPerCoreCacheSizes(out Int64 L1, out Int64 L2, out Int64 L3)
    {
        L1 = 0;
        L2 = 0;
        L3 = 0;

        var info = Processor.LogicalProcessorInformation;
        // Fix: LogicalProcessorInformation returns null when the Win32 call
        // fails; the original code iterated it unguarded and would throw a
        // NullReferenceException. Report zero sizes instead.
        if (info == null)
        {
            return;
        }
        foreach (var entry in info)
        {
            if (entry.Relationship != Processor.LOGICAL_PROCESSOR_RELATIONSHIP.RelationCache)
                continue;
            Int64 mask = (Int64) entry.ProcessorMask;
            if ((mask & (Int64) 1) == 0)
                continue;
            var cache = entry.ProcessorInformation.Cache;
            switch (cache.Level)
            {
                case 1: L1 = L1 + cache.Size; break;
                case 2: L2 = L2 + cache.Size; break;
                case 3: L3 = L3 + cache.Size; break;
                default:
                    break;
            }
        }
    }
} Credit for the Processor cache-size interop code goes to its original author. And honestly, I don't even trust my own results, because it's common knowledge that RAM is 10-100x slower than cache. |
Beta Was this translation helpful? Give feedback.
Take 3 (I told you it's difficult!)
Code