
~270 times faster writing (or maybe not writing?) with arr[i] than with *(arr+i)? #540


Description


I just stumbled into this and it's not actually causing me a specific problem. Bug or not? Maybe.

  • [x] Checked for duplicates
    I tried to.

Describe the bug

Looping over array writes using arr[i] is about 270 times faster than using *(arr+i). I am not claiming this is definitely a real difference in write performance; that's complicated (see the additional context below).

Expected behavior

I think there's room to quibble, especially for an interpreter, about what "identical" means below, but I'd expect much more similar results.

According to the standard (7.6.1.2 in a recent draft):

"The expression E1[E2] is identical (by definition) to *((E1)+(E2)), except that in the case of an array operand, the result is an lvalue if that operand is an lvalue and an xvalue otherwise."

To Reproduce

Just run this in cling:

#include <iostream>
#include <chrono>
// Include necessary libraries


void benchmark() {
    const int size = (int)(1E7);
    volatile int* arr = new volatile int[size];  // Create a dynamic array
    const int multiplier = 10;  // loop the loop this many times

    //prefill to page memory maybe:
    for (int i = 0; i < size; i++) {
        arr[i] = size-i; 
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++){
        for (int i = 0; i < size; ++i) {
            arr[i] = i;
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::chrono::duration<double> duration = std::chrono::duration<double>(duration_ns);
    std::cout << "Indexed:      " << duration.count() << " seconds" << std::endl;

    
    volatile int* arr2 = new volatile int[size];  // Create a dynamic array

    start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++){
        for (int i = 0; i < size; ++i) {
            // arr2[i] = i;  // swap these two lines for the control test
            *(arr2+i) = i;
        }
    }
    end = std::chrono::high_resolution_clock::now();
    duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    duration = std::chrono::duration<double>(duration_ns);
    std::cout << "Pointer math  " << duration.count() << " seconds" << std::endl;
}

int main () {
        benchmark();
}

I get some fluctuation, but this is typical output:

Indexed:      0.104759 seconds
Pointer math  27.4305 seconds
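
Back-of-the-envelope with those numbers (10 outer iterations × 1e7 inner iterations = 1e8 stores per case):

    total stores per case: multiplier * size = 10 * 1e7 = 1e8
    indexed:       0.105 s / 1e8 stores ≈ 1 ns per store
    pointer math:  27.4 s  / 1e8 stores ≈ 274 ns per store

The indexed case is in the range you'd expect for sequential stores from compiled code, while the pointer-math case is at least consistent with hundreds of nanoseconds of per-element overhead rather than a real difference in store throughput.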

Setup

   ------------------------------------------------------------------
  | Welcome to ROOT 6.30/04                        https://root.cern |
  | (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
  | Built for linuxx8664gcc on Jul 26 2024, 08:45:57                 |
  | From heads/master@tags/v6-30-04                                  |
  | With c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0                   |
  | Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q'  |
   ------------------------------------------------------------------

I compiled ROOT myself. This is an Ubuntu VM running in VMware Workstation on Windows 10. I tried to keep the system load low, and I was able to reproduce the results.

Additional context

For what it's worth, I also made a version that used a memory-mapped file for the allocation, to maybe prevent the loop writes from being optimized away, and the results were the same.
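
Roughly, that version swaps the allocation for a memory mapping, something like the sketch below (not the exact code I ran; an anonymous mapping is shown for brevity, while a file-backed one would open a file and pass its descriptor instead of MAP_ANONYMOUS):

    #include <sys/mman.h>

    // Sketch: replace "new volatile int[size]" with a memory mapping.
    volatile int* arr = static_cast<volatile int*>(
        mmap(nullptr, size * sizeof(int),
             PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    // ... same timed loops as above ...
    munmap(const_cast<int*>(arr), size * sizeof(int));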

Actually understanding what's going on here is not so simple. The only way I know to be certain the writes are not optimized away is to isolate them in separately compiled binary functions, so that the interpreter cannot know what the callee is doing. I wrote a version like that as well (see the sketch below). To avoid too much function-call overhead, I compiled the inner loops as separate compilation units (.so files) using cling, and interpreted the outer code and outer loop. That gave 0.4 s for both cases, which is also close to what I get with g++ at the default optimization level.
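
The shape of that modular version is roughly this (a sketch with hypothetical names, not the exact files; build the inner loops into a shared object however you prefer, then dlopen it from the interpreted driver):

    // inner.cxx -- compile to a shared object, e.g.:
    //   g++ -O0 -fPIC -shared inner.cxx -o inner.so
    extern "C" void write_indexed(volatile int* a, int n) {
        for (int i = 0; i < n; ++i) a[i] = i;
    }
    extern "C" void write_pointer(volatile int* a, int n) {
        for (int i = 0; i < n; ++i) *(a + i) = i;
    }

    // driver (interpreted): load the library, time only the outer loop
    #include <dlfcn.h>
    using write_fn = void (*)(volatile int*, int);
    void* lib = dlopen("./inner.so", RTLD_NOW);
    auto write_indexed = reinterpret_cast<write_fn>(dlsym(lib, "write_indexed"));
    // ... call write_indexed(arr, size) "multiplier" times inside the timed region ...

Because the interpreter cannot see into the .so, it cannot elide the stores, and any per-call overhead is paid only multiplier times rather than once per element.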

So it's tempting to say that, for the posted version, (a) the interpreter is just really slow, and (b) it optimizes away the writes in the indexed case. Even so, this is still unexpected behavior to me in that it treats the two forms differently.

But this may not be the case. The 0.1 s posted result is not the same as removing the loops entirely; that gives about 1 ns. Also, the time scales proportionally with the total number of writes, for both the inner and outer loops, so the loops themselves are surely not being optimized away. The actual writes might be. But in the modular version (which I can provide), g++ -O3 gives about 0.04 s for each form, and that again scales correctly with both loop lengths. This tempts the conclusion that the 0.1 s in the original result could be real, and that the difference is down to optimization, but not to optimizing away the writes.

The total range here, real or not, runs from 0.04 s to 27 s, a factor of about 675.
