tag:blogger.com,1999:blog-88918849561655800802025-07-22T03:10:22.646+02:00Database ArchitectsA blog by and for database architects.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.comBlogger48125tag:blogger.com,1999:blog-8891884956165580080.post-73305066858596126522024-12-27T17:57:00.000+01:002024-12-27T17:57:37.600+01:00Advent of Code 2024 in pure SQL<p>&nbsp;On a whim I decided to do this year's <a href="https://adventofcode.com/2024">advent of code</a> in pure SQL. That was an interesting experience that I can recommend to everybody because it forces you to think differently about the problems. And I can report that <a href="https://github.com/neumannt/aoc24">it was possible to solve every problem in pure SQL</a>.</p><p>In many cases SQL was actually surprisingly pleasant to use. The full solution for day 10 (including the puzzle input) is shown below:</p><p><br /></p> <!--HTML generated using hilite.me--><div style="background: rgb(255, 255, 255); border-color: gray; border-image: none; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;"><table><tbody><tr><td><pre style="line-height: 125%; margin: 0px;"> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30</pre></td><td><pre style="line-height: 125%; margin: 0px;"><span style="color: #008800; font-weight: bold;">with</span> <span style="color: #008800; font-weight: bold;">recursive</span> aoc10_input(i) <span style="color: #008800; font-weight: bold;">as</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="background-color: #fff0f0;">'</span> <span style="background-color: #fff0f0;">89010123</span> <span style="background-color: #fff0f0;">78121874</span> <span style="background-color: #fff0f0;">87430965</span> <span style="background-color: #fff0f0;">96549874</span> <span style="background-color: #fff0f0;">45678903</span> <span style="background-color: #fff0f0;">32019012</span> <span style="background-color: #fff0f0;">01329801</span> <span style="background-color: #fff0f0;">10456732</span> <span style="background-color: #fff0f0;">'</span>), lines(y,line) <span style="color: #008800; font-weight: bold;">as</span> ( <span style="color: #008800; font-weight: bold;">select</span> <span style="color: #0000dd; font-weight: bold;">0</span>, substr(i,<span style="color: #0000dd; font-weight: bold;">1</span>,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> i)<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>), substr(i,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> i)<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">from</span> aoc10_input <span style="color: #008800; font-weight: bold;">union</span> <span style="color: #008800; font-weight: bold;">all</span> <span style="color: #008800; font-weight: bold;">select</span> y<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>,substr(r,<span style="color: #0000dd; font-weight: bold;">1</span>,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: 
#fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> r)<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>), substr(r,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> r)<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">from</span> lines l(y,l,r) <span style="color: #008800; font-weight: bold;">where</span> <span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> r)<span style="color: #333333;">&gt;</span><span style="color: #0000dd; font-weight: bold;">0</span> ), field(x,y,v) <span style="color: #008800; font-weight: bold;">as</span> ( <span style="color: #008800; font-weight: bold;">select</span> x,y,ascii(substr(line,x::<span style="color: #007020;">integer</span>,<span style="color: #0000dd; font-weight: bold;">1</span>))<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">48</span> <span style="color: #008800; font-weight: bold;">from</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #333333;">*</span> <span style="color: #008800; font-weight: bold;">from</span> lines l <span style="color: #008800; font-weight: bold;">where</span> line<span style="color: #333333;">&lt;&gt;</span><span style="background-color: #fff0f0;">''</span>) s, <span style="color: #008800; font-weight: bold;">lateral</span> generate_series(<span style="color: #0000dd; font-weight: bold;">1</span>,<span style="color: #008800; font-weight: bold;">length</span>(line)) <span style="color: #008800; font-weight: bold;">g</span>(x) ), paths(x,y,v,sx,sy) <span style="color: #008800; font-weight: bold;">as</span> ( <span style="color: #008800; font-weight: bold;">select</span> x,y,<span style="color: #0000dd; font-weight: bold;">9</span>,x,y <span style="color: #008800; font-weight: bold;">from</span> field <span style="color: #008800; font-weight: bold;">where</span> v <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">9</span> <span style="color: #008800; font-weight: bold;">union</span> <span style="color: #008800; font-weight: bold;">all</span> <span style="color: #008800; font-weight: bold;">select</span> f.x,f.y,f.v,p.sx,p.sy <span style="color: #008800; font-weight: bold;">from</span> field f, paths p <span style="color: #008800; font-weight: bold;">where</span> f.v<span style="color: #333333;">=</span>p.v<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #008800; font-weight: bold;">and</span> ((f.x<span style="color: #333333;">=</span>p.x <span style="color: #008800; font-weight: bold;">and</span> <span style="color: #008800; font-weight: bold;">abs</span>(f.y<span style="color: #333333;">-</span>p.y)<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">or</span> (f.y<span style="color: #333333;">=</span>p.y <span style="color: #008800; font-weight: bold;">and</span> <span style="color: #008800; font-weight: bold;">abs</span>(f.x<span style="color: #333333;">-</span>p.x)<span style="color: #333333;">=</span><span style="color: #0000dd; 
font-weight: bold;">1</span>)) <span style="color: #008800; font-weight: bold;">and</span> p.v<span style="color: #333333;">&gt;</span><span style="color: #0000dd; font-weight: bold;">0</span>), results <span style="color: #008800; font-weight: bold;">as</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #333333;">*</span> <span style="color: #008800; font-weight: bold;">from</span> paths <span style="color: #008800; font-weight: bold;">where</span> v<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">0</span>), part1 <span style="color: #008800; font-weight: bold;">as</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #008800; font-weight: bold;">distinct</span> <span style="color: #333333;">*</span> <span style="color: #008800; font-weight: bold;">from</span> results) <span style="color: #008800; font-weight: bold;">select</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #008800; font-weight: bold;">count</span>(<span style="color: #333333;">*</span>) <span style="color: #008800; font-weight: bold;">from</span> part1) <span style="color: #008800; font-weight: bold;">as</span> part1, (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #008800; font-weight: bold;">count</span>(<span style="color: #333333;">*</span>) <span style="color: #008800; font-weight: bold;">from</span> results) <span style="color: #008800; font-weight: bold;">as</span> part2 </pre></td></tr></tbody></table></div> <p>Parsing the input is a bit painful in SQL, but it is not too bad. Lines 1-10 are simply the puzzle input, lines 11-17 split the input into individual lines, and lines 18-21 construct a 2D array from the input. The algorithm itself is pretty short, lines 22-27 perform a recursive traversal of the field, and lines 28-39 extract the puzzle answer from the traversal results. For this kind of small scale traversals SQL works just fine.</p><p>Other days were more painful. <a href="https://github.com/neumannt/aoc24/blob/master/day16.sql">Day 16</a> for example does conceptually a very similar traversal of a field, and it computes the minimal traversal distance for each visited. Expressing that in SQL in easy, but evaluation is wasteful. When replacing the reference input with a real puzzle input the field is quite large, and the recursive query generates and preserves a lot of state, even though we only care about the last iteration of the recursive query. As a consequence you need a machine with over 200GB memory to execute that query, even though most of the computed tuples are irrelevant. We could fix that excessive memory consumption by using&nbsp;<a href="https://www.cidrdb.org/cidr2023/papers/p14-hirn.pdf">iteration semantic</a> during recursion, but that is not widely supported by DBMSes. Umbra could do it, but Postgres and DuckDB cannot, thus I have not used it in my solutions.</p><p>And sometimes the programming model of recursive SQL clashes with what we want to do. On <a href="https://github.com/neumannt/aoc24/blob/master/day23.sql">day 23</a> we had to find the maximum clique in sparse graph. This can be computed reasonably well with the <a href="https://en.wikipedia.org/wiki/Bron%E2%80%93Kerbosch_algorithm">Bron-Kerbosch algorithm</a>, but expressing that in recursive SQL is quite convoluted because the algorithm wants to maintain multiple sets, but recursive SQL only passes a single set along. 
It can be done, but the result <a href="https://github.com/neumannt/aoc24/blob/218e93d12b477e694ab88e2c25c6070c28a6fbf4/day23.sql#L52">does not look pretty</a>.</p><p>This experiment has shown two things to me: 1) it is possible to code quite complex algorithms in SQL, and often the SQL code is surprisingly pleasant, and 2) recursive SQL would be much more efficient and more pleasant to use if we had mechanisms to update state. There is <a href="https://www.cidrdb.org/cidr2025/papers.html">ongoing work</a> on supporting more complex control flow in recursion via a trampoline mechanism, which is very useful, too, but we should definitely look into more complex state manipulation mechanisms. With just a bit of extra functionality, SQL would be quite a solid choice for running complex algorithms directly inside a database.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com6tag:blogger.com,1999:blog-8891884956165580080.post-16529363134222518992024-12-13T08:37:00.002+01:002024-12-13T08:38:51.938+01:00What are important data systems problems, ignored by research?<p>In November, I had the pleasure of attending the <a href="https://cwida.github.io/dbdbd2024/program/">Dutch-Belgian DataBase Day</a>, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some papers that can serve as motivation and starting points for research in these areas.</p> <p>One significant yet understudied problem raised by multiple panelists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are <a href="https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/getreal.pdf">ubiquitous</a>. For instance, <a href="https://www.vldb.org/pvldb/vol17/p3694-saxena.pdf">Amazon Redshift recently reported</a> that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings make up a substantial majority of real-world data. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, surprisingly little research has been dedicated to efficient database-specific string compression. Given the importance of strings for real-world query performance and storage consumption, it is surprising how little research there is on the topic (there <a href="https://sigmodrecord.org/publications/sigmodRecord/2103/pdfs/16_och-gubner.pdf">are</a> <a href="https://15721.courses.cs.cmu.edu/spring2020/papers/09-compression/muller-edbt2014.pdf">some</a> <a href="https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf">exceptions</a>).</p> <p>Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. 
This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks. While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.</p> <p>While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as <a href="https://www.vldb.org/pvldb/vol16/p3335-butrovich.pdf">network connection handling</a> (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though <a href="https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/LookingGlass2_CIDR25.pdf">communication bottlenecks</a> frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.</p> <p>One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.</p> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com2tag:blogger.com,1999:blog-8891884956165580080.post-38379956053648992812024-12-10T15:44:00.001+01:002024-12-10T15:44:52.487+01:00C++ exception performance three years later<p>About three years ago we noticed serious&nbsp;<a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2544r0.html">performance problems in C++ exception unwinding</a>. Due to contention on the unwinding path these became more and more severe the more cores a system had, and unwinding could slow down by orders of magnitude. Due to the constraints of backwards compatibility this contention was not easy to eliminate, and <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2544r0.html">P2544</a>&nbsp;discussed ways to fix this problem via language changes in C++.</p><p>But fortunately people found less invasive solutions. First, Florian Weimer <a href="https://sourceware.org/git/?p=glibc.git;a=commit;h=5d28a8962dcb6ec056b81d730e3c6fb57185a210">changed the glibc</a>&nbsp;to provide a lock-free mechanism to find the (static) unwind tables for a given shared object. 
Which eliminates the most serious contention for "simple" C++ programs. For example in a&nbsp;<a href="https://github.com/neumannt/exceptionperformance/blob/master/main.cpp">micro-benchmark</a>&nbsp;that calls a function with some computations (100 calls to sqrt per function invocation), and which throws with a certain probability, we previously had very poor scalability with increasing core count. With his patch we now see with gcc 14.2 on a dual-socket EPYC 7713 the following performance development (runtime in ms):</p><p><br /> </p><style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"><thead> <tr> <th class="tg-0lax"></th> <th class="tg-0lax">1</th> <th class="tg-0lax">2</th> <th class="tg-0lax">4</th> <th class="tg-0lax">8</th> <th class="tg-0lax">16</th> <th class="tg-0lax">32</th> <th class="tg-0lax">64</th> <th class="tg-0lax">128</th> <th class="tg-0lax">threads</th> </tr></thead> <tbody> <tr> <td class="tg-0lax">0% failure<br /></td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">42</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">0.1% failure</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">32</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">1% failure</td> <td class="tg-0lax">29<br /></td> <td class="tg-0lax">30<br /></td> <td class="tg-0lax">30<br /></td> <td class="tg-0lax">30</td> <td class="tg-0lax">30</td> <td class="tg-0lax">30</td> <td class="tg-0lax">32</td> <td class="tg-0lax">34</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">10% failure</td> <td class="tg-0lax">36</td> <td class="tg-0lax">36</td> <td class="tg-0lax">37</td> <td class="tg-0lax">37</td> <td class="tg-0lax">37</td> <td class="tg-0lax">37<br /></td> <td class="tg-0lax">47<br /></td> <td class="tg-0lax">65</td> <td class="tg-0lax"></td> </tr> </tbody></table> <p></p><p>Which is more or less perfect. 128 threads are a bit slower, but that is to be expected as one EPYC only has 64 cores. With higher failure rates unwinding itself becomes slower but that is still acceptable here. Thus most C++ programs are just fine.</p><p>For&nbsp;<a href="http://umbra-db.com">our use case</a>&nbsp;that is not enough, though. We dynamically generate machine code at runtime, and we want to be able to pass exceptions through generated code. The _dl_find_object mechanism of glibc is not used for JITed code, instead libgcc maintains its own lookup structure. Historically this was a simple list with a global lock, which of course had terrible performance. 
But through&nbsp;<a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">a series of patches</a>&nbsp;we managed to change libgcc into using a <a href="https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-btree.h">lock-free b-tree</a>&nbsp;for maintaining the dynamic unwinding frames. Using a similar experiment to the one above, but now with JIT-generated code (using LLVM 19), we get the following:</p><p><br /></p> <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"><thead> <tr> <th class="tg-0lax"></th> <th class="tg-0lax">1</th> <th class="tg-0lax">2</th> <th class="tg-0lax">4</th> <th class="tg-0lax">8</th> <th class="tg-0lax">16</th> <th class="tg-0lax">32</th> <th class="tg-0lax">64</th> <th class="tg-0lax">128</th> <th class="tg-0lax">threads</th> </tr></thead> <tbody> <tr> <td class="tg-0lax">0% failure<br /></td> <td class="tg-0lax">32</td> <td class="tg-0lax">38</td> <td class="tg-0lax">48</td> <td class="tg-0lax">64</td> <td class="tg-0lax">48</td> <td class="tg-0lax">36</td> <td class="tg-0lax">59</td> <td class="tg-0lax">62</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">0.1% failure</td> <td class="tg-0lax">32</td> <td class="tg-0lax">32</td> <td class="tg-0lax">32</td> <td class="tg-0lax">32<br /></td> <td class="tg-0lax">32<br /></td> <td class="tg-0lax">48<br /></td> <td class="tg-0lax">62</td> <td class="tg-0lax">68<br /></td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">1% failure</td> <td class="tg-0lax">41<br /></td> <td class="tg-0lax">40<br /></td> <td class="tg-0lax">40<br /></td> <td class="tg-0lax">40</td> <td class="tg-0lax">53<br /></td> <td class="tg-0lax">69<br /></td> <td class="tg-0lax">80<br /></td> <td class="tg-0lax">83<br /></td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">10% failure</td> <td class="tg-0lax">123</td> <td class="tg-0lax">113</td> <td class="tg-0lax">103<br /></td> <td class="tg-0lax">116<br /></td> <td class="tg-0lax">128</td> <td class="tg-0lax">127<br /></td> <td class="tg-0lax">131<br /></td> <td class="tg-0lax">214<br /></td> <td class="tg-0lax"></td> </tr> </tbody></table> <p>The numbers have more noise than for statically generated code, but overall observation is the same: Unwinding now scales with the number of cores, and we can safely use C++ exceptions even on machines with large core counts.</p><p>So is everything perfect now? No. First, only gcc has a fully scalable frame lookup mechanism. clang has its own implementation, and as far as I know it still does not scale properly due to a global lock in DwarfFDECache. Note that at least on many Linux distributions clang uses libgcc by default, thus the problem is not immediately obvious there, but a pure llvm/clang build will have scalability problems. And&nbsp; second unwinding through JIT-ed code is a quite a bit slower, which is unfortunate. But admittedly the problem is less severe than shown here, the benchmark with JITed code simply has a stack frame more to unwind due to the way static code and JITed code interact. 
And it makes sense to prioritize static unwinding over dynamic unwinding frames, as most people never JIT-generate code.</p><p>Overall we are now quite happy with the unwinding mechanism. The bottlenecks are gone, and performance is fine even with high core counts. It is still not appropriate for high failure rates, something like <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0709r4.pdf">P709</a>&nbsp;would be better for that, but we can live with the status quo.</p><p><br /></p><p><br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com2tag:blogger.com,1999:blog-8891884956165580080.post-72721001662922311482024-06-06T15:59:00.000+02:002024-06-06T15:59:31.215+02:00B-trees Require Fewer Comparisons Than Balanced Binary Search Trees<p>Due to better access locality, B-trees are faster than binary search trees <i>in practice</i> -- but are they also better <i>in theory</i>? To answer this question, let's look at the number of comparisons required for a search operation. Assuming we store <i>n</i> elements in a binary search tree, the lower bound for the number of comparisons is <i>log<sub>2</sub> n</i> in the worst case. However, this is only achievable for a perfectly balanced tree. Maintaining such a tree's perfect balance during insert/delete operations requires <i>O(n)</i> time in the worst case.<br /><br />Balanced binary search trees, therefore, leave some slack in terms of how balanced they are and have slightly worse bounds. For example, it is well known that an AVL tree guarantees at most <i>1.44 log<sub>2</sub> n</i> comparisons, and a Red-Black tree guarantees <i>2 log<sub>2</sub> n</i> comparisons. In other words, AVL trees require at most 1.44 times the minimum number of comparisons, and Red-Black trees require up to twice the minimum.<br /><br />How many comparisons does a B-tree need? In B-trees with degree <i>k</i>, each node (except the root) has between <i>k</i> and <i>2k</i> children. For <i>k=2</i>, a B-tree is essentially the same data structure as a Red-Black tree and therefore provides the same guarantee of <i>2 log<sub>2</sub> n</i> comparisons. So how about larger, more realistic values of <i>k</i>?<br /><br />To analyze the general case, we start with a B-tree that has the highest possible height for <i>n</i> elements. The height is maximal when each node has only <i>k</i> children (for simplicity, this analysis ignores the special case of underfull root nodes). This implies that the worst-case height of a B-tree is <i>log<sub>k</sub> n</i>. During a lookup, one has to perform a binary search that takes <i>log<sub>2</sub> k</i> comparisons in each of the <i>log<sub>k</sub> n</i> nodes. So in total, we have <i>log<sub>2</sub> k * log<sub>k</sub> n = log<sub>2</sub> n</i> comparisons.<br /><br />This actually matches the best case, and to construct the worst case, we have to modify the tree somewhat. On one (and only one) arbitrary path from the root to a single leaf node, we increase the number of children from <i>k</i> to <i>2k</i>. In this situation, the tree height is still less than or equal to <i>log<sub>k</sub> n</i>, but we now have one worst-case path where we need <i>log<sub>2</sub> 2k</i> (instead of <i>log<sub>2</sub> k</i>) comparisons. 
On this worst-case path, we have <i>log<sub>2</sub> 2k * log<sub>k</sub> n = (log<sub>2</sub> 2k) / (log<sub>2</sub> k) * log<sub>2</sub> n</i> comparisons.<br /><br />Using this formula, we get the following bounds:<br />k=2: 2 log<sub>2</sub> n<br />k=4: 1.5 log<sub>2</sub> n<br />k=8: 1.33 log<sub>2</sub> n<br />k=16: 1.25 log<sub>2</sub> n<br />...<br />k=512: 1.11 log<sub>2</sub> n<br /><br />We see that as k grows, B-trees get closer to the lower bound. For <i>k&gt;=8</i>, B-trees are guaranteed to perform fewer comparisons than AVL trees in the worst case. As <i>k</i> increases, B-trees become more balanced. One intuition for this result is that for larger <i>k</i> values, B-trees become increasingly similar to sorted arrays which achieve the <i>log<sub>2</sub> n</i> lower bound. Practical B-trees often use fairly large values of <i>k</i> (e.g., 100) and therefore offer tight bounds -- in addition to being more cache-friendly than binary search trees.<br /><br />(Caveat: For simplicity, the analysis assumes that <i>log<sub>2</sub> n</i> and <i>log<sub>2</sub> 2k</i> are integers, and that the root has either <i>k</i> or <i>2k</i> entries. Nevertheless, the observation that larger <i>k</i> values lead to tighter bounds should hold in general.)<br /><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-14749265648749186172024-02-19T09:00:00.001+01:002024-02-21T11:20:30.816+01:00SSDs Have Become Ridiculously Fast, Except in the Cloud<p>In recent years, flash-based SSDs have largely replaced disks for most storage use cases. Internally, each SSD consists of many independent flash chips, each of which can be accessed in parallel. Assuming the SSD controller keeps up, the throughput of an SSD therefore primarily depends on the interface speed to the host. In the past six years, we have seen a rapid transition from SATA to PCIe 3.0 to PCIe 4.0 to PCIe 5.0. As a result, there was an explosion in SSD throughput: <img width="90%" src="https://db.in.tum.de/~leis/dbtrends/ssd-bandwidth.svg" /> </p> <p> At the same time, we saw not just better performance, but also more capacity per dollar: </p> <img width="90%" src="https://db.in.tum.de/~leis/dbtrends/ssd-capacity.svg" /> <p> The two plots illustrate the power of a commodity market. The combination of open standards (NVMe and PCIe), huge demand, and competing vendors led to great benefits for customers. Today, top PCIe 5.0 data center SSDs such as the <a href="https://europe.kioxia.com/en-europe/business/ssd/enterprise-ssd.html">Kioxia CM7-R</a> or <a href="https://semiconductor.samsung.com/emea/ssd/enterprise-ssd/pm1743/">Samsung PM1743</a> achieve up to 13 GB/s read throughput and 2.7M+ random read IOPS. Modern servers have around 100 PCIe lanes, making it possible to have a dozen of SSDs (each usually using 4 lanes) in a single server at full bandwidth. For example, in our lab we have a single-socket Zen 4 server with 8 Kioxia CM7-R SSDs, which achieves 100GB/s (!) I/O bandwidth:</p> <img width="90%" src="https://db.in.tum.de/~leis/dbtrends/iostat.png" /> <p> AWS EC2 was an early NVMe pioneer, launching the <a href="https://aws.amazon.com/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/">i3 instance</a> with 8 physically-attached NVMe SSDs in early 2017. At that time, NVMe SSDs were still expensive, and having 8 in a single server was quite remarkable. 
The per-SSD read (2 GB/s) and write (1 GB/s) performance was considered state of the art as well. Another step forward occurred in 2019 with the launch of <a href="https://aws.amazon.com/blogs/aws/new-the-next-generation-i3en-of-i-o-optimized-ec2-instances/">i3en instances</a>, which doubled storage capacity per dollar.</p> <p> Since then, several NVMe instance types, including i4i and im4gn, have been launched. Surprisingly, however, the performance has not increased; seven years after the i3 launch, we are still stuck with 2 GB/s per SSD. Indeed, the venerable i3 and i3en instances basically remain the best EC2 has to offer in terms of IO-bandwidth/$ and SSD-capacity/$, respectively. Personally, I find this very surprising given the SSD bandwidth explosion and cost reductions we have seen on the commodity market. At this point, the performance gap between state-of-the-art SSDs and those offered by major cloud vendors, especially in read throughput, write throughput, and IOPS, is nearing an order of magnitude. (Azure's top NVMe instances are only slightly faster than AWS's.)</p> <p> What makes this stagnation in the cloud even more surprising is that we have seen great advances in other areas. For example, during the same 2017 to 2023 time frame, EC2 network bandwidth exploded, increasing from 10 Gbit/s (c4) to 200 Gbit/s (c7gn). Now, I can only speculate why the cloud vendors have not caught up on the storage side: <ul> <li> One theory is that EC2 intentionally caps the write speed at 1 GB/s to avoid frequent device failure, given the total number of writes per SSD is limited. However, this does not explain why the read bandwidth is stuck at 2 GB/s.</li> <li> A second possibility is that there is no demand for faster storage because very few storage systems can actually exploit tens of GB/s of I/O bandwidth. See our <a href="https://www.vldb.org/pvldb/vol16/p2090-haas.pdf">recent VLDB paper</a>. On the other hand, as long as fast storage devices are not widely available, there is also little incentive to optimize existing systems.</li> <li> A third theory is that if EC2 were to launch fast and cheap NVMe instance storage, it would disrupt the cost structure of its other storage service (in particular EBS). This is, of course, the classic innovator's dilemma, but one would hope that one of the smaller cloud vendors would make this step to gain a competitive edge.</li> </ul> </p> <p> Overall, I'm not fully convinced by any of these three arguments. Actually, I hope that we'll soon see cloud instances with 10 GB/s SSDs, making this post obsolete. </p> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com21tag:blogger.com,1999:blog-8891884956165580080.post-27466172187739107792023-04-09T14:08:00.005+02:002023-11-03T13:09:44.917+01:00The Great CPU Stagnation<p> For at least five decades, <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore's law</a> consistently delivered increasing numbers of transistors. Equally significant, <a href="https://en.wikipedia.org/wiki/Dennard_scaling">Dennard scaling</a> led to each transistor using less energy, enabling higher clock frequencies. This was great, as higher clock frequencies enhanced existing software performance automatically, without necessitating any code rewrite. However, around 2005, Dennard scaling began to falter, and clock frequencies have largely <a href="https://github.com/karlrupp/microprocessor-trend-data">plateaued</a> since then. 
</p> <p> Despite this, Moore's law continued to advance, with the additional available transistors being channeled into creating more cores per chip. The following graph displays the number of cores for the largest available x86 CPU at the time: <img src="https://db.in.tum.de/~leis/dbtrends/multicore.svg" /> <br/> Notice the logarithmic scale: this represents the exponential trend we had become accustomed to, with core counts doubling roughly every three years. Regrettably, when considering cost per core, this impressive trend appears to have stalled, ushering in an era of CPU stagnation. </p> <p> To demonstrate this stagnation, I gathered data from wikichip.org on AMD's Epyc single-socket CPU lineup, introduced in 2017 and now in its fourth generation (<a href="https://en.wikichip.org/wiki/amd/cores/naples">Naples</a>, <a href="https://en.wikichip.org/wiki/amd/cores/rome">Rome</a>, <a href="https://en.wikichip.org/wiki/amd/cores/milan">Milan</a>, <a href="https://en.wikichip.org/wiki/amd/cores/genoa">Genoa</a>): <table cellspacing="0" border="0"> <colgroup span="3" width="109"></colgroup> <colgroup span="3" width="86"></colgroup> <colgroup width="109"></colgroup> <tr> <td height="27" align="center"><b>Model</b></td> <td align="center"><b>Gen</b></td> <td align="center"><b>Launch</b></td> <td align="center"><b>Cores</b></td> <td align="center"><b>GHz</b></td> <td align="center"><b>IPC</b></td> <td align="center"><b>Price</b></td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7351P</td> <td align="left">Naples</td> <td align="right">06/2017</td> <td align="right" sdval="16" sdnum="1033;">16</td> <td align="right" sdval="2.4" sdnum="1033;0;#,##0.0">2.4</td> <td align="right" sdval="1" sdnum="1033;0;0.00">1.00</td> <td align="right" sdval="750" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$750</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7401P</td> <td align="left">Naples</td> <td align="right">06/2017</td> <td align="right" sdval="24" sdnum="1033;">24</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1" sdnum="1033;0;0.00">1.00</td> <td align="right" sdval="1075" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$1,075</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7551P</td> <td align="left">Naples</td> <td align="right">06/2017</td> <td align="right" sdval="32" sdnum="1033;">32</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1" sdnum="1033;0;0.00">1.00</td> <td align="right" sdval="2100" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,100</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7302P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="16" sdnum="1033;">16</td> <td align="right" sdval="3" sdnum="1033;0;#,##0.0">3.0</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="825" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$825</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7402P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="24" sdnum="1033;">24</td> <td align="right" sdval="2.8" sdnum="1033;0;#,##0.0">2.8</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="1250" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$1,250</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7502P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="32" 
sdnum="1033;">32</td> <td align="right" sdval="2.5" sdnum="1033;0;#,##0.0">2.5</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="2300" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,300</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7702P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="64" sdnum="1033;">64</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="4425" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$4,425</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7313P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="16" sdnum="1033;">16</td> <td align="right" sdval="3" sdnum="1033;0;#,##0.0">3.0</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="913" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$913</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7443P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="24" sdnum="1033;">24</td> <td align="right" sdval="2.85" sdnum="1033;0;#,##0.0">2.9</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="1337" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$1,337</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7543P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="32" sdnum="1033;">32</td> <td align="right" sdval="2.8" sdnum="1033;0;#,##0.0">2.8</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="2730" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,730</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7713P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="64" sdnum="1033;">64</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="5010" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$5,010</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9354P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="32" sdnum="1033;">32</td> <td align="right" sdval="3.25" sdnum="1033;0;#,##0.0">3.3</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="2730" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,730</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9454P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="48" sdnum="1033;">48</td> <td align="right" sdval="2.75" sdnum="1033;0;#,##0.0">2.8</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="4598" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$4,598</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9554P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="64" sdnum="1033;">64</td> <td align="right" sdval="3.1" sdnum="1033;0;#,##0.0">3.1</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="7104" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$7,104</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9654P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="96" sdnum="1033;">96</td> 
<td align="right" sdval="2.4" sdnum="1033;0;#,##0.0">2.4</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="10625" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$10,625</td> </tr> </table> </p> <p> Over these past six years, AMD has emerged as the x86 performance per dollar leader. Examining these numbers should provide insight into the state of server CPUs. Let's first observe CPU cores per dollar: <img src="https://db.in.tum.de/~leis/dbtrends/coresperdollar.svg" /> </p> <p> This deviates significantly from the expected exponential improvement graphs. In fact, CPU cores are becoming slightly more expensive over time! Admittedly, newer cores outperform their predecessors. When accounting for both clock frequency and higher IPC, we obtain the following image: </p> <img src="https://db.in.tum.de/~leis/dbtrends/instrperdollar.svg" /> <p> This isn't much better. The performance improvement over a 6-year period is underwhelming when normalized for cost. Similar results can also be observed for <a href="https://databasearchitects.blogspot.com/2021/07/aws-ec2-hardware-trends-2015-2021.html">Intel CPUs in EC2</a>. </p> <p> Lastly, let's examine transistor counts, only taking into account the logic transistors. Despite improved production nodes from 14nm (Naples) over 7nm (Rome/Milan) to 5nm (Genoa), cost-adjusted figures reveal stagnation: <img src="https://db.in.tum.de/~leis/dbtrends/transistorsperdollar.svg" /> </p> <p> In conclusion, the results are disheartening. Rapid and exponential improvements in CPU speed seem to be relics of the past. We now find ourselves in a markedly different landscape compared to the historical norm in computing. The implications could be far-reaching. For example, most software is extremely <a href="https://www.facebook.com/permalink.php?story_fbid=pfbid0iPixEvPJQGzNa6t2x6HUL5TYqfmKGqSgfkBg6QaTyHF5frXQi7eLGxC7uPQv5U5jl&amp;id=100006735798590">inefficient</a> when compared to what hardware can theoretically achieve, and maybe this needs to change. Furthermore, historically specialized chips enjoyed only limited success due to the rapid advancement of commodity CPUs. Perhaps, custom chips will have a much bigger role in the future. </p> <p> P.S. Due to popular demand, here's how the last graph looks like after adjusting for inflation: <img src="https://db.in.tum.de/~leis/dbtrends/transistorsperdollar17.svg" /> </p> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com7tag:blogger.com,1999:blog-8891884956165580080.post-87555303398007314712023-02-07T15:31:00.001+01:002023-02-10T07:47:09.162+01:00Five Decades of Database ResearchSince 1975, over 24 thousand articles have have been published in major database venues (SIGMOD, VLDB/PVLDB, ICDE, EDBT, CIDR, TODS, VLDB Journal, TKDE). The number of papers per year is rising: <img src="https://db.in.tum.de/~leis/dbtrends/papers.svg" /> <br /> <br /> Over time, the topics change. 
Looking at the percentage of keywords appearing in paper titles (in that particular year), we can see interesting trends: <img src="https://db.in.tum.de/~leis/dbtrends/models.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/top.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/ml.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/stream.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/cloud.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/mem.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/class.svg" /> <br /> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-31718660745232360552023-01-23T13:13:00.001+01:002023-02-07T10:56:00.391+01:00For systems, research is development and development is research <p>The <a href="https://www.cidrdb.org/cidr2023/program.html">Conference on Innovative Data Systems Research (CIDR) 2023</a> is over, and as usual both the official program and the informal discussions have been great. CIDR encourages innovative, risky, and controversial ideas as well as honest exchanges. One intensely-discussed talk was the <a href="https://www.youtube.com/watch?v=dv4A2LIFG80">keynote by Hannes Mühleisen</a>, who together with Mark Raasveldt is the brain behind DuckDB. <br /> <br />In the keynote, Hannes lamented the incentives of systems researchers in academia (e.g., papers over running code). He also criticized the often obscure topics database systems researchers work on while neglecting many practical and pressing problems (e.g., top-k algorithms rather than practically-important issues like strings). <a href="https://www.youtube.com/watch?v=DJFKl_5JTnA">Michael Stonebraker has similar </a><a href="https://www.youtube.com/watch?v=DJFKl_5JTnA">thoughts</a> on the database systems community. I share many of these criticisms, but I'm more optimistic regarding what systems research in academia can do, and would therefore like to share my perspective. <br /> <br />Software is different: copying it is free, which has two implications: (1) Most systems are somewhat unique -- otherwise one could have used an existing one. (2) The cost of software is dominated by development effort. I argue that, together, these two observations mean that systems research and system development are two sides of the same coin. <br /> <br />Because developing complex systems is difficult, reinventing the wheel is not a good idea -- it's much better to stand on the proverbial shoulders of giants. Thus, developers should look at the existing literature to find out what others have done, and should experimentally compare existing approaches. Often there are no good solutions for some problems, requiring new inventions, which need to be written up to communicate them to others. Writing will not just allow communication, it will also improve conceptual clarity and understanding, leading to better software. Of course, all these activities (literature review, experiments, invention, writing) are indistinguishable from systems research. <br /> <br />On the other hand, doing systems research without integrating the new techniques into real systems can also lead to problems. Without being grounded by real systems, researchers risk wasting their time on intellectually-difficult, but practically-pointless problems. (And indeed much of what is published at the major database conferences falls into this trap.) 
Building real systems leads to a treasure trove of open problems. Publishing solutions to these often directly results in technological progress, better systems, and adoption by other systems. <br /> <br />To summarize: systems research is (or should be) indistinguishable from systems development. In principle, this methodology could work in both industry and academia. Both places have problematic incentives, but different ones. Industry often has a very short time horizon, which can lead to very incremental developments. Academic paper-counting incentives can lead to lots of papers without any impact on real systems. <br /> <br />Building systems in academia may not be the best strategy to publish the maximum number of papers or citations, but can lead to real-world impact, technological progress, and (in the long run even) academic accolades. The key is therefore to work with people who have shown how to overcome these systemic pathologies, and build systems over a long time horizon. There are many examples such academic projects (e.g., PostgreSQL, C-Store/Vertica, H-Store/VoltDB, ShoreMT, Proteus, Quickstep, Peloton, KÙZU, AsterixDB, MonetDB, Vectorwise, DuckDB, Hyper, LeanStore, and Umbra).</p><p><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com1tag:blogger.com,1999:blog-8891884956165580080.post-49715376636326761732022-06-26T11:00:00.001+02:002022-06-27T12:29:08.518+02:00Making unwinding through JIT-ed code scalable - b-tree operations<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>Now that we have all infrastructure in place, we look at the high-level algorithms. For inserts, we walk down the tree until we hit the leaf-node that should contain the new value. If that node is full, we split the leaf node, and insert a new separator into the parent node to distinguish the two nodes. To avoid propagating that split further up (as the inner node might be full, too, requiring an inner split), we eagerly split full inner nodes when walking down. This guarantees that the parent of a node is never full, which allows us to look at nodes purely from top-to-bottom, which greatly simplifies locking.</p><p>The splits themselves are relatively simple, we just copy the right half of each node into a new node, reduce the size of the original node, and insert a separator into the parent. However two problems require some care 1) we might have to split the root, which does not have a parent itself, and 2) the node split could mean that the value we try to insert could be either in the left or the right node. 
The split functions always update the node iterator to the correct node, and release the lock on the node that is not needed after the split.<br /></p><!--HTML generated using hilite.me--><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Insert a new separator after splitting</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_node_update_separator_after_split</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> old_separator, <span style="color: #333399; font-weight: bold;">uintptr_t</span> new_separator, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>new_right) { <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_inner_slot (n, old_separator); <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>entry_count; index <span style="color: #333333;">&gt;</span> slot; <span style="color: #333333;">--</span>index) n<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; n<span style="color: #333333;">-&gt;</span>content.children[slot].separator <span style="color: #333333;">=</span> new_separator; n<span style="color: #333333;">-&gt;</span>content.children[slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child <span style="color: #333333;">=</span> new_right; n<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>; } <span style="color: #888888;">// Check if we are splitting the root</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_handle_root_split</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>node, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>parent) { <span style="color: #888888;">// We want to keep the root pointer stable to allow for contention</span> <span style="color: #888888;">// free reads. 
Thus, we split the root by first moving the content</span> <span style="color: #888888;">// of the root node to a new node, and then split that new node</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!*</span>parent) { <span style="color: #888888;">// Allocate a new node, this guarantees us that we will have a parent</span> <span style="color: #888888;">// afterwards</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>new_node <span style="color: #333333;">=</span> btree_allocate_node (t, btree_node_is_inner (<span style="color: #333333;">*</span>node)); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>old_node <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>node; new_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> old_node<span style="color: #333333;">-&gt;</span>entry_count; new_node<span style="color: #333333;">-&gt;</span>content <span style="color: #333333;">=</span> old_node<span style="color: #333333;">-&gt;</span>content; old_node<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].separator <span style="color: #333333;">=</span> max_separator; old_node<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child <span style="color: #333333;">=</span> new_node; old_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>; old_node<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> btree_node_inner; <span style="color: #333333;">*</span>parent <span style="color: #333333;">=</span> old_node; <span style="color: #333333;">*</span>node <span style="color: #333333;">=</span> new_node; } } <span style="color: #888888;">// Split an inner node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_split_inner</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>inner, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>parent, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target) { <span style="color: #888888;">// Check for the root</span> btree_handle_root_split (t, inner, parent); <span style="color: #888888;">// Create two inner node</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> right_fence <span style="color: #333333;">=</span> btree_node_get_fence_key (<span style="color: #333333;">*</span>inner); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>left_inner <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>inner; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>right_inner <span style="color: #333333;">=</span> btree_allocate_node (t, <span style="color: #007020;">true</span>); <span style="color: #333399; font-weight: 
bold;">unsigned</span> split <span style="color: #333333;">=</span> left_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; right_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> left_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> split; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">&lt;</span> right_inner<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) right_inner<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> left_inner<span style="color: #333333;">-&gt;</span>content.children[split <span style="color: #333333;">+</span> index]; left_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> split; <span style="color: #333399; font-weight: bold;">uintptr_t</span> left_fence <span style="color: #333333;">=</span> btree_node_get_fence_key (left_inner); btree_node_update_separator_after_split (<span style="color: #333333;">*</span>parent, right_fence, left_fence, right_inner); <span style="color: #008800; font-weight: bold;">if</span> (target <span style="color: #333333;">&lt;=</span> left_fence) { <span style="color: #333333;">*</span>inner <span style="color: #333333;">=</span> left_inner; btree_node_unlock_exclusive (right_inner); } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #333333;">*</span>inner <span style="color: #333333;">=</span> right_inner; btree_node_unlock_exclusive (left_inner); } } <span style="color: #888888;">// Split a leaf node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_split_leaf</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>leaf, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>parent, <span style="color: #333399; font-weight: bold;">uintptr_t</span> fence, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target) { <span style="color: #888888;">// Check for the root</span> btree_handle_root_split (t, leaf, parent); <span style="color: #888888;">// Create two leaf node</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> right_fence <span style="color: #333333;">=</span> fence; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>left_leaf <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>leaf; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>right_leaf <span style="color: #333333;">=</span> btree_allocate_node (t, <span style="color: #007020;">false</span>); <span style="color: #333399; font-weight: bold;">unsigned</span> split <span style="color: #333333;">=</span> left_leaf<span style="color: 
#333333;">-&gt;</span>entry_count <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; right_leaf<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> left_leaf<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> split; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_leaf<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) right_leaf<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> left_leaf<span style="color: #333333;">-&gt;</span>content.entries[split <span style="color: #333333;">+</span> index]; left_leaf<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> split; <span style="color: #333399; font-weight: bold;">uintptr_t</span> left_fence <span style="color: #333333;">=</span> right_leaf<span style="color: #333333;">-&gt;</span>content.entries[<span style="color: #0000dd; font-weight: bold;">0</span>].base <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>; btree_node_update_separator_after_split (<span style="color: #333333;">*</span>parent, right_fence, left_fence, right_leaf); <span style="color: #008800; font-weight: bold;">if</span> (target <span style="color: #333333;">&lt;=</span> left_fence) { <span style="color: #333333;">*</span>leaf <span style="color: #333333;">=</span> left_leaf; btree_node_unlock_exclusive (right_leaf); } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #333333;">*</span>leaf <span style="color: #333333;">=</span> right_leaf; btree_node_unlock_exclusive (left_leaf); } } <span style="color: #888888;">// Insert an entry</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_insert</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">uintptr_t</span> base, <span style="color: #333399; font-weight: bold;">uintptr_t</span> size, <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob) { <span style="color: #888888;">// Sanity check</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>size) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">false</span>; <span style="color: #888888;">// Access the root</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>iter, <span style="color: #333333;">*</span>parent <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; { version_lock_lock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); iter <span style="color: #333333;">=</span> t<span style="color: #333333;">-&gt;</span>root; <span style="color: #008800; font-weight: bold;">if</span> (iter) { btree_node_lock_exclusive (iter); } <span style="color: #008800; 
font-weight: bold;">else</span> { t<span style="color: #333333;">-&gt;</span>root <span style="color: #333333;">=</span> iter <span style="color: #333333;">=</span> btree_allocate_node (t, <span style="color: #007020;">false</span>); } version_lock_unlock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); } <span style="color: #888888;">// Walk down the btree with classic lock coupling and eager splits.</span> <span style="color: #888888;">// Strictly speaking this is not performance optimal, we could use</span> <span style="color: #888888;">// optimistic lock coupling until we hit a node that has to be modified.</span> <span style="color: #888888;">// But that is more difficult to implement and frame registration is</span> <span style="color: #888888;">// rare anyway, we use simple locking for now</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> fence <span style="color: #333333;">=</span> max_separator; <span style="color: #008800; font-weight: bold;">while</span> (btree_node_is_inner (iter)) { <span style="color: #888888;">// Use eager splits to avoid lock coupling up</span> <span style="color: #008800; font-weight: bold;">if</span> (iter<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">==</span> max_fanout_inner) btree_split_inner (t, <span style="color: #333333;">&amp;</span>iter, <span style="color: #333333;">&amp;</span>parent, base); <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_inner_slot (iter, base); <span style="color: #008800; font-weight: bold;">if</span> (parent) btree_node_unlock_exclusive (parent); parent <span style="color: #333333;">=</span> iter; fence <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.children[slot].separator; iter <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.children[slot].child; btree_node_lock_exclusive (iter); } <span style="color: #888888;">// Make sure we have space</span> <span style="color: #008800; font-weight: bold;">if</span> (iter<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">==</span> max_fanout_leaf) btree_split_leaf (t, <span style="color: #333333;">&amp;</span>iter, <span style="color: #333333;">&amp;</span>parent, fence, base); <span style="color: #008800; font-weight: bold;">if</span> (parent) btree_node_unlock_exclusive (parent); <span style="color: #888888;">// Insert in node</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_leaf_slot (iter, base); <span style="color: #008800; font-weight: bold;">if</span> ((slot <span style="color: #333333;">&lt;</span> iter<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">&amp;&amp;</span> (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base <span style="color: #333333;">==</span> base)) { <span style="color: #888888;">// duplicate entry, this should never happen</span> btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">false</span>; } <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>entry_count; index 
<span style="color: #333333;">&gt;</span> slot; <span style="color: #333333;">--</span>index) iter<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry <span style="color: #333333;">*</span>e <span style="color: #333333;">=</span> <span style="color: #333333;">&amp;</span>(iter<span style="color: #333333;">-&gt;</span>content.entries[slot]); e<span style="color: #333333;">-&gt;</span>base <span style="color: #333333;">=</span> base; e<span style="color: #333333;">-&gt;</span>size <span style="color: #333333;">=</span> size; e<span style="color: #333333;">-&gt;</span>ob <span style="color: #333333;">=</span> ob; iter<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>; btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">true</span>; } </pre></div> <p></p><p>Deletion is more complex, as there are more cases. We have to maintain the invariant that each node is at least half full. Just like insertion we have the problem that operations can trickle up, e.g., deleting in element in a node might make it less than half-full, merging that node with a half-full neighbor deletes an entry from the parent, which can make that node less than half-full, etc. We solve that problem by merging while going down: When traversing the tree during element-removal, we check if the current node is less than half full. If yes, we merge/balance it with a neighbor node. If the parent becomes less than half-full that will be fixed at the next traversal. Strictly speaking this means nodes can, at least temporarily, be less than half full, but that is fine for asymptotic complexity, as we are never more than one element below the threshold.</p><p>The merge logic examines that least-full neighbor of the current code. If both nodes together would fit in one node, they are merged and the separator for the left node is removed from the parent. Otherwise, elements are shifted from the less-full node to the other node, which makes both nodes at least half full. 
The separator of the left node is updated after the shift:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Merge (or balance) child nodes</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_merge_node</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">unsigned</span> child_slot, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>parent, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target) { <span style="color: #888888;">// Choose the emptiest neighbor and lock both. The target child is already</span> <span style="color: #888888;">// locked</span> <span style="color: #333399; font-weight: bold;">unsigned</span> left_slot; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>left_node, <span style="color: #333333;">*</span>right_node; <span style="color: #008800; font-weight: bold;">if</span> ((child_slot <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span>) <span style="color: #333333;">||</span> (((child_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #333333;">&lt;</span> parent<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">&amp;&amp;</span> (parent<span style="color: #333333;">-&gt;</span>content.children[child_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">&lt;</span> parent<span style="color: #333333;">-&gt;</span>content.children[child_slot <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child<span style="color: #333333;">-&gt;</span>entry_count))) { left_slot <span style="color: #333333;">=</span> child_slot; left_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].child; right_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child; btree_node_lock_exclusive (right_node); } <span style="color: #008800; font-weight: bold;">else</span> { left_slot <span style="color: #333333;">=</span> child_slot <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>; left_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].child; right_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: 
bold;">1</span>].child; btree_node_lock_exclusive (left_node); } <span style="color: #888888;">// Can we merge both nodes into one node?</span> <span style="color: #333399; font-weight: bold;">unsigned</span> total_count <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333399; font-weight: bold;">unsigned</span> max_count <span style="color: #333333;">=</span> btree_node_is_inner (left_node) <span style="color: #333333;">?</span> max_fanout_inner <span style="color: #333333;">:</span> max_fanout_leaf; <span style="color: #008800; font-weight: bold;">if</span> (total_count <span style="color: #333333;">&lt;=</span> max_count) { <span style="color: #888888;">// Merge into the parent?</span> <span style="color: #008800; font-weight: bold;">if</span> (parent<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">2</span>) { <span style="color: #888888;">// Merge children into parent. This can only happen at the root</span> <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> left_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content.children[index]; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">+</span> left_node<span style="color: #333333;">-&gt;</span>entry_count] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index]; } <span style="color: #008800; font-weight: bold;">else</span> { parent<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> btree_node_leaf; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> left_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content.entries[index]; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: 
#333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">+</span> left_node<span style="color: #333333;">-&gt;</span>entry_count] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index]; } parent<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> total_count; btree_release_node (t, left_node); btree_release_node (t, right_node); <span style="color: #008800; font-weight: bold;">return</span> parent; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #888888;">// Regular merge</span> <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.children[left_node<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index]; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.entries[left_node<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index]; } parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].separator <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].separator; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>; index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">&lt;</span> parent<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; parent<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">--</span>; btree_release_node (t, 
right_node); btree_node_unlock_exclusive (parent); <span style="color: #008800; font-weight: bold;">return</span> left_node; } } <span style="color: #888888;">// No merge possible, rebalance instead</span> <span style="color: #008800; font-weight: bold;">if</span> (left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">&gt;</span> right_node<span style="color: #333333;">-&gt;</span>entry_count) { <span style="color: #888888;">// Shift from left to right</span> <span style="color: #333399; font-weight: bold;">unsigned</span> to_shift <span style="color: #333333;">=</span> (left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> right_node<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) { <span style="color: #333399; font-weight: bold;">unsigned</span> pos <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">-</span> index; right_node<span style="color: #333333;">-&gt;</span>content.children[pos <span style="color: #333333;">+</span> to_shift] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[pos]; } <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content .children[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift <span style="color: #333333;">+</span> index]; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) { <span style="color: #333399; font-weight: bold;">unsigned</span> pos <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">-</span> index; right_node<span style="color: #333333;">-&gt;</span>content.entries[pos <span style="color: #333333;">+</span> to_shift] <span style="color: #333333;">=</span> right_node<span style="color: 
#333333;">-&gt;</span>content.entries[pos]; } <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content .entries[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift <span style="color: #333333;">+</span> index]; } left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-=</span> to_shift; right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+=</span> to_shift; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #888888;">// Shift from right to left</span> <span style="color: #333399; font-weight: bold;">unsigned</span> to_shift <span style="color: #333333;">=</span> (right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> left_node<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.children[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+</span> index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index]; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">+</span> to_shift]; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.entries[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+</span> index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index]; <span 
style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">+</span> to_shift]; } left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+=</span> to_shift; right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-=</span> to_shift; } <span style="color: #333399; font-weight: bold;">uintptr_t</span> left_fence; <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_leaf (left_node)) { left_fence <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[<span style="color: #0000dd; font-weight: bold;">0</span>].base <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>; } <span style="color: #008800; font-weight: bold;">else</span> { left_fence <span style="color: #333333;">=</span> btree_node_get_fence_key (left_node); } parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].separator <span style="color: #333333;">=</span> left_fence; btree_node_unlock_exclusive (parent); <span style="color: #008800; font-weight: bold;">if</span> (target <span style="color: #333333;">&lt;=</span> left_fence) { btree_node_unlock_exclusive (right_node); <span style="color: #008800; font-weight: bold;">return</span> left_node; } <span style="color: #008800; font-weight: bold;">else</span> { btree_node_unlock_exclusive (left_node); <span style="color: #008800; font-weight: bold;">return</span> right_node; } } <span style="color: #888888;">// Remove an entry</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_remove</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">uintptr_t</span> base) { <span style="color: #888888;">// Access the root</span> version_lock_lock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>iter <span style="color: #333333;">=</span> t<span style="color: #333333;">-&gt;</span>root; <span style="color: #008800; font-weight: bold;">if</span> (iter) btree_node_lock_exclusive (iter); version_lock_unlock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>iter) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #888888;">// Same strategy as with insert, walk down with lock coupling and</span> <span style="color: #888888;">// 
merge eagerly</span> <span style="color: #008800; font-weight: bold;">while</span> (btree_node_is_inner (iter)) { <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_inner_slot (iter, base); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.children[slot].child; btree_node_lock_exclusive (next); <span style="color: #008800; font-weight: bold;">if</span> (btree_node_needs_merge (next)) { <span style="color: #888888;">// Use eager merges to avoid lock coupling up</span> iter <span style="color: #333333;">=</span> btree_merge_node (t, slot, iter, base); } <span style="color: #008800; font-weight: bold;">else</span> { btree_node_unlock_exclusive (iter); iter <span style="color: #333333;">=</span> next; } } <span style="color: #888888;">// Remove existing entry</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_leaf_slot (iter, base); <span style="color: #008800; font-weight: bold;">if</span> ((slot <span style="color: #333333;">&gt;=</span> iter<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">||</span> (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base <span style="color: #333333;">!=</span> base)) { <span style="color: #888888;">// not found, this should never happen</span> btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; } <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.entries[slot].ob; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> slot; index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">&lt;</span> iter<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) iter<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; iter<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">--</span>; btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> ob; } </pre></div> <p>Lookups are conceptually simple, we just walk down the b-tree. However we do the traversal using optimistic lock coupling, which means the data could change behind our back at any time. As a consequence, all reads have to be (relaxed) atomic reads, and we have to validate the current lock before acting upon a value that we have read. 
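</p><p>Every step of the lookup below follows the same read-then-validate pattern. As a minimal sketch (using the helper functions from the previous article; the local variable node and the restart label are assumed to come from the surrounding traversal loop, this is not the verbatim patch code):</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">// Sketch: optimistically read one field of a node
uintptr_t lock;
if (!btree_node_lock_optimistic (node, &amp;lock))
  goto restart;  // node is currently locked exclusively by a writer
// copy the value of interest into a local with a relaxed atomic load
unsigned count = __atomic_load_n (&amp;node-&gt;entry_count, __ATOMIC_RELAXED);
if (!btree_node_validate (node, lock))
  goto restart;  // the version changed, the copy may be inconsistent
// only now is it safe to act on the local copy in count
</pre></div><p>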
In case of failures (e.g., concurrent writes during reading), we simply restart the traversal.&nbsp; <br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Find the corresponding entry for the given address</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_lookup</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target_addr) { <span style="color: #888888;">// Within this function many loads are relaxed atomic loads.</span> <span style="color: #888888;">// Use a macro to keep the code reasonable</span> <span style="color: #557799;">#define RLOAD(x) __atomic_load_n (&amp;(x), __ATOMIC_RELAXED)</span> <span style="color: #888888;">// For targets where unwind info is usually not registered through these</span> <span style="color: #888888;">// APIs anymore, avoid any sequential consistent atomics.</span> <span style="color: #888888;">// Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #888888;">// loading/initialization happens-before using that library in other</span> <span style="color: #888888;">// threads (in particular unwinding with that library's functions</span> <span style="color: #888888;">// appearing in the backtraces). Calling that library's functions</span> <span style="color: #888888;">// without waiting for the library to initialize would be racy.</span> <span style="color: #008800; font-weight: bold;">if</span> (__builtin_expect (<span style="color: #333333;">!</span>RLOAD (t<span style="color: #333333;">-&gt;</span>root), <span style="color: #0000dd; font-weight: bold;">1</span>)) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #888888;">// The unwinding tables are mostly static, they only change when</span> <span style="color: #888888;">// frames are added or removed. This makes it extremely unlikely that they</span> <span style="color: #888888;">// change during a given unwinding sequence. Thus, we optimize for the</span> <span style="color: #888888;">// contention free case and use optimistic lock coupling. This does not</span> <span style="color: #888888;">// require any writes to shared state, instead we validate every read. It is</span> <span style="color: #888888;">// important that we do not trust any value that we have read until we call</span> <span style="color: #888888;">// validate again. Data can change at arbitrary points in time, thus we always</span> <span style="color: #888888;">// copy something into a local variable and validate again before acting on</span> <span style="color: #888888;">// the read. 
In the unlikely event that we encounter a concurrent change we</span> <span style="color: #888888;">// simply restart and try again.</span> <span style="color: #997700; font-weight: bold;">restart:</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>iter; <span style="color: #333399; font-weight: bold;">uintptr_t</span> lock; { <span style="color: #888888;">// Accessing the root node requires defending against concurrent pointer</span> <span style="color: #888888;">// changes Thus we couple rootLock -&gt; lock on root node -&gt; validate rootLock</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>version_lock_lock_optimistic (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock), <span style="color: #333333;">&amp;</span>lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; iter <span style="color: #333333;">=</span> RLOAD (t<span style="color: #333333;">-&gt;</span>root); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>version_lock_validate (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock), lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>iter) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #333399; font-weight: bold;">uintptr_t</span> child_lock; <span style="color: #008800; font-weight: bold;">if</span> ((<span style="color: #333333;">!</span>btree_node_lock_optimistic (iter, <span style="color: #333333;">&amp;</span>child_lock)) <span style="color: #333333;">||</span> (<span style="color: #333333;">!</span>version_lock_validate (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock), lock))) <span style="color: #008800; font-weight: bold;">goto</span> restart; lock <span style="color: #333333;">=</span> child_lock; } <span style="color: #888888;">// Now we can walk down towards the right leaf node</span> <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #007020;">true</span>) { <span style="color: #008800; font-weight: bold;">enum</span> node_type type <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>type); <span style="color: #333399; font-weight: bold;">unsigned</span> entry_count <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>entry_count); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>entry_count) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #008800; font-weight: bold;">if</span> (type <span style="color: #333333;">==</span> btree_node_inner) { <span style="color: #888888;">// We cannot call find_inner_slot here because we need (relaxed)</span> <span style="color: #888888;">// atomic reads here</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> <span 
style="color: #0000dd; font-weight: bold;">0</span>; <span style="color: #008800; font-weight: bold;">while</span> ( ((slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #333333;">&lt;</span> entry_count) <span style="color: #333333;">&amp;&amp;</span> (RLOAD (iter<span style="color: #333333;">-&gt;</span>content.children[slot].separator) <span style="color: #333333;">&lt;</span> target_addr)) <span style="color: #333333;">++</span>slot; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>child <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.children[slot].child); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #888888;">// The node content can change at any point in time, thus we must</span> <span style="color: #888888;">// interleave parent and child checks</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> child_lock; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_lock_optimistic (child, <span style="color: #333333;">&amp;</span>child_lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #888888;">// make sure we still point to the correct node after</span> <span style="color: #888888;">// acquiring the optimistic lock</span> <span style="color: #888888;">// Go down</span> iter <span style="color: #333333;">=</span> child; lock <span style="color: #333333;">=</span> child_lock; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #888888;">// We cannot call find_leaf_slot here because we need (relaxed)</span> <span style="color: #888888;">// atomic reads here</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; <span style="color: #008800; font-weight: bold;">while</span> (((slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #333333;">&lt;</span> entry_count) <span style="color: #333333;">&amp;&amp;</span> (RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base) <span style="color: #333333;">+</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].size) <span style="color: #333333;">&lt;=</span> target_addr)) <span style="color: #333333;">++</span>slot; <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry entry; entry.base <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base); entry.size <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].size); entry.ob <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].ob); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, 
lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #888888;">// Check if we have a hit</span> <span style="color: #008800; font-weight: bold;">if</span> ((entry.base <span style="color: #333333;">&lt;=</span> target_addr) <span style="color: #333333;">&amp;&amp;</span> (target_addr <span style="color: #333333;">&lt;</span> entry.base <span style="color: #333333;">+</span> entry.size)) { <span style="color: #008800; font-weight: bold;">return</span> entry.ob; } <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; } } <span style="color: #557799;">#undef RLOAD</span> } </pre></div> <p></p><p>This is the end of the article series discussing the gcc patch for lock-free unwinding. With that patch, we get scalable unwinding even on a machine with 256 hardware contexts. I hope the series helps with understanding the patch, and eventually allows it to be integrated into gcc.<br /></p><p><br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-52030477093284989802022-06-26T10:56:00.000+02:002022-06-26T10:56:00.701+02:00Making unwinding through JIT-ed code scalable - The b-tree<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>We use a&nbsp;<a href="https://en.wikipedia.org/wiki/B-tree">b-tree</a> because it offers fast lookup, good data locality, and a scalable implementation is reasonably easy when using optimistic lock coupling. Nevertheless, a b-tree is a non-trivial data structure. To avoid having one huge article that includes all details of the b-tree, we just discuss the data structures themselves and some helper functions here; the insert/remove/lookup operations will be discussed in the next article.</p><p>A b-tree partitions its elements by value. An inner node contains a sorted list of separator/child pairs, with the guarantee that the elements in the sub-tree rooted at the child pointer will be &lt;= the separator. The leaf nodes contain sorted lists of (base, size, object) entries, where the object is responsible for unwinding entries between base and base+size.&nbsp; A b-tree maintains the invariants that 1) all nodes except the root are at least half full, and 2) all leaf nodes have the same distance to the root. This guarantees us logarithmic lookup costs. 
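To get a feeling for the resulting depth: with the node sizes used by this implementation (an inner fanout of 15 and a leaf fanout of 10, defined below), a tree consisting of a root and one further level of inner nodes can already hold up to 15 * 15 * 10 = 2250 entries, so the tree stays very shallow in practice. 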
Note that we use fence-keys, i.e., the inner nodes have a separator for the right-most entries, too, which is not the case in all b-tree implementations:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// The largest possible separator value</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">const</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> max_separator <span style="color: #333333;">=</span> <span style="color: #333333;">~</span>((<span style="color: #333399; font-weight: bold;">uintptr_t</span>) (<span style="color: #0000dd; font-weight: bold;">0</span>)); <span style="color: #888888;">// Inner entry. The child tree contains all entries &lt;= separator</span> <span style="color: #008800; font-weight: bold;">struct</span> inner_entry { <span style="color: #333399; font-weight: bold;">uintptr_t</span> separator; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>child; }; <span style="color: #888888;">// Leaf entry. Stores an object entry</span> <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry { <span style="color: #333399; font-weight: bold;">uintptr_t</span> base, size; <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob; }; <span style="color: #888888;">// node types</span> <span style="color: #008800; font-weight: bold;">enum</span> node_type { btree_node_inner, btree_node_leaf, btree_node_free }; <span style="color: #888888;">// Node sizes. Chosen such that the result size is roughly 256 bytes</span> <span style="color: #557799;">#define max_fanout_inner 15</span> <span style="color: #557799;">#define max_fanout_leaf 10</span> <span style="color: #888888;">// A btree node</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node { <span style="color: #888888;">// The version lock used for optimistic lock coupling</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock version_lock; <span style="color: #888888;">// The number of entries</span> <span style="color: #333399; font-weight: bold;">unsigned</span> entry_count; <span style="color: #888888;">// The type</span> <span style="color: #008800; font-weight: bold;">enum</span> node_type type; <span style="color: #888888;">// The payload</span> <span style="color: #008800; font-weight: bold;">union</span> { <span style="color: #888888;">// The inner nodes have fence keys, i.e., the right-most entry includes a</span> <span style="color: #888888;">// separator</span> <span style="color: #008800; font-weight: bold;">struct</span> inner_entry children[max_fanout_inner]; <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry entries[max_fanout_leaf]; } content; }; </pre></div> <p></p><p>To simplify the subsequent code we define a number of helper functions that are largely straight-forward, and that allow to distinguish leaf and inner node and provide searching within a node. 
The lock operations directly map to operations on the version lock:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Is an inner node?</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_is_inner</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">==</span> btree_node_inner; } <span style="color: #888888;">// Is a leaf node?</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_is_leaf</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">==</span> btree_node_leaf; } <span style="color: #888888;">// Should the node be merged?</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_needs_merge</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">&lt;</span> (btree_node_is_inner (n) <span style="color: #333333;">?</span> (max_fanout_inner <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>) <span style="color: #333333;">:</span> (max_fanout_leaf <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>)); } <span style="color: #888888;">// Get the fence key for inner nodes</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> <span style="color: #0066bb; font-weight: bold;">btree_node_get_fence_key</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #888888;">// For inner nodes we just return our right-most entry</span> <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>content.children[n<span style="color: 
#333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>].separator; } <span style="color: #888888;">// Find the position for a slot in an inner node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">unsigned</span> <span style="color: #0066bb; font-weight: bold;">btree_node_find_inner_slot</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> value) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>, ec <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>entry_count; index <span style="color: #333333;">!=</span> ec; <span style="color: #333333;">++</span>index) <span style="color: #008800; font-weight: bold;">if</span> (n<span style="color: #333333;">-&gt;</span>content.children[index].separator <span style="color: #333333;">&gt;=</span> value) <span style="color: #008800; font-weight: bold;">return</span> index; <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>entry_count; } <span style="color: #888888;">// Find the position for a slot in a leaf node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">unsigned</span> <span style="color: #0066bb; font-weight: bold;">btree_node_find_leaf_slot</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> value) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>, ec <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>entry_count; index <span style="color: #333333;">!=</span> ec; <span style="color: #333333;">++</span>index) <span style="color: #008800; font-weight: bold;">if</span> (n<span style="color: #333333;">-&gt;</span>content.entries[index].base <span style="color: #333333;">+</span> n<span style="color: #333333;">-&gt;</span>content.entries[index].size <span style="color: #333333;">&gt;</span> value) <span style="color: #008800; font-weight: bold;">return</span> index; <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>entry_count; } <span style="color: #888888;">// Try to lock the node exclusive</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_try_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> version_lock_try_lock_exclusive (<span style="color: 
#333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock)); } <span style="color: #888888;">// Lock the node exclusive, blocking as needed</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_node_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { version_lock_lock_exclusive (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock)); } <span style="color: #888888;">// Release a locked node and increase the version lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_node_unlock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { version_lock_unlock_exclusive (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock)); } <span style="color: #888888;">// Acquire an optimistic "lock". Note that this does not lock at all, it</span> <span style="color: #888888;">// only allows for validation later</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_lock_optimistic</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> <span style="color: #333333;">*</span>lock) { <span style="color: #008800; font-weight: bold;">return</span> version_lock_lock_optimistic (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock), lock); } <span style="color: #888888;">// Validate a previously acquire lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_validate</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> lock) { <span style="color: #008800; font-weight: bold;">return</span> version_lock_validate (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock), lock); } </pre></div> <p></p><p>With that we come to the b-tree itself, which consists of a pointer to the root node, a version lock to protect the root, and a free list:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 
0px;"><span style="color: #888888;">// A btree. Suitable for static initialization, all members are zero at the</span> <span style="color: #888888;">// beginning</span> <span style="color: #008800; font-weight: bold;">struct</span> btree { <span style="color: #888888;">// The root of the btree</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>root; <span style="color: #888888;">// The free list of released node</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>free_list; <span style="color: #888888;">// The version lock used to protect the root</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock root_lock; }; <span style="color: #888888;">// Initialize a btree. Not actually used, just for exposition</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_init</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t) { t<span style="color: #333333;">-&gt;</span>root <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; t<span style="color: #333333;">-&gt;</span>free_list <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; t<span style="color: #333333;">-&gt;</span>root_lock.version_lock <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; }; </pre></div> <p></p><p>We need that free list because readers operate without visible synchronization. If we would simply free() a node, we would risk that a concurrent reader is still looking at that node, even though no relevant data exist on that node. (But the reader does not know this until it did the read). To prevent that, we put freed nodes in the free list, which ensures that the memory location remains valid, and prefer using nodes from the free list when allocating new nodes:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Allocate a node. 
This node will be returned in locked exclusive state</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_allocate_node</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">bool</span> inner) { <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #007020;">true</span>) { <span style="color: #888888;">// Try the free list first</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next_free <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (next_free) { <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_try_lock_exclusive (next_free)) <span style="color: #008800; font-weight: bold;">continue</span>; <span style="color: #888888;">// The node might no longer be free, check that again after acquiring</span> <span style="color: #888888;">// the exclusive lock</span> <span style="color: #008800; font-weight: bold;">if</span> (next_free<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">==</span> btree_node_free) { <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>ex <span style="color: #333333;">=</span> next_free; <span style="color: #008800; font-weight: bold;">if</span> (__atomic_compare_exchange_n ( <span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), <span style="color: #333333;">&amp;</span>ex, next_free<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { next_free<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; next_free<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> inner <span style="color: #333333;">?</span> btree_node_inner <span style="color: #333333;">:</span> btree_node_leaf; <span style="color: #008800; font-weight: bold;">return</span> next_free; } } btree_node_unlock_exclusive (next_free); <span style="color: #008800; font-weight: bold;">continue</span>; } <span style="color: #888888;">// No free node available, allocate a new one</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>new_node <span style="color: #333333;">=</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>) (malloc (<span style="color: #008800; font-weight: bold;">sizeof</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node))); version_lock_initialize_locked_exclusive ( <span style="color: #333333;">&amp;</span>(new_node<span style="color: #333333;">-&gt;</span>version_lock)); <span style="color: #888888;">// initialize the node in locked state</span> new_node<span style="color: 
#333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; new_node<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> inner <span style="color: #333333;">?</span> btree_node_inner <span style="color: #333333;">:</span> btree_node_leaf; <span style="color: #008800; font-weight: bold;">return</span> new_node; } } <span style="color: #888888;">// Release a node. This node must be currently locked exclusively and will</span> <span style="color: #888888;">// be placed in the free list</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_release_node</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>node) { <span style="color: #888888;">// We cannot release the memory immediately because there might still be</span> <span style="color: #888888;">// concurrent readers on that node. Put it in the free list instead</span> node<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> btree_node_free; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next_free <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">do</span> { node<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child <span style="color: #333333;">=</span> next_free; } <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #333333;">!</span>__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), <span style="color: #333333;">&amp;</span>next_free, node, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)); btree_node_unlock_exclusive (node); } </pre></div> <p></p><p>The last remaining infrastructure code is destroying the b-tree. Here, we simply walk the tree recursively and release all nodes. The recursion is safe because the depth is bound logarithmic:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Recursively release a tree. 
The btree is by design very shallow, thus</span> <span style="color: #888888;">// we can risk recursion here</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_release_tree_recursively</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>node) { btree_node_lock_exclusive (node); <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">&lt;</span> node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) btree_release_tree_recursively (t, node<span style="color: #333333;">-&gt;</span>content.children[index].child); } btree_release_node (t, node); } <span style="color: #888888;">// Destroy a tree and release all nodes</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_destroy</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t) { <span style="color: #888888;">// Disable the mechanism before cleaning up</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>old_root <span style="color: #333333;">=</span> __atomic_exchange_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root), <span style="color: #007020;">NULL</span>, __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (old_root) btree_release_tree_recursively (t, old_root); <span style="color: #888888;">// Release all free nodes</span> <span style="color: #008800; font-weight: bold;">while</span> (t<span style="color: #333333;">-&gt;</span>free_list) { <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next <span style="color: #333333;">=</span> t<span style="color: #333333;">-&gt;</span>free_list<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child; free (t<span style="color: #333333;">-&gt;</span>free_list); t<span style="color: #333333;">-&gt;</span>free_list <span style="color: #333333;">=</span> next; } } </pre></div> <p></p><p>This finishes the infrastructure part of the b-tree; the high-level insert/remove/lookup functions are covered <a href="https://databasearchitects.blogspot.com/2022/06/btreeoperations.html">in the next article</a>.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-35356182673644213692022-06-26T10:52:00.002+02:002022-06-26T10:52:43.297+02:00Making unwinding through JIT-ed code scalable - Optimistic Lock Coupling<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>When thinking 
about exception handling it is reasonable to assume that we will have far more unwinding requests than changes to the unwinding tables. In our setup, the tables only change when JITed code is added to or removed from the program.&nbsp; That is always expensive to begin with due the mprotect calls, TLB shootdowns, etc. Thus we can safely assume that we will have at most a few hundred updates per second even in extreme cases, probably far less. Lookups however can easily reach thousands or even millions per second, as we do one lookup per frame.</p><p>This motivates us to use a read-optimized data structure, a b-tree with optimistic lock coupling: Writers use traditional lock coupling (lock parent node exclusive, lock child node exclusive, release parent node, lock child of child, etc.), which works fine as long as there is not too much contention. Readers however have to do something else, as we expect thousands of them. One might be tempted to use a rw-lock for readers, but that does not help. Locking an rw-lock in shared mode causes an atomic write, which makes the threads fight over the cache line of the lock even if there is no (logical) contention.</p><p>Instead, we use version locks, where readers do no write at all:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Common logic for version locks</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock { <span style="color: #888888;">// The lock itself. The lowest bit indicates an exclusive lock,</span> <span style="color: #888888;">// the second bit indicates waiting threads. All other bits are</span> <span style="color: #888888;">// used as counter to recognize changes.</span> <span style="color: #888888;">// Overflows are okay here, we must only prevent overflow to the</span> <span style="color: #888888;">// same value within one lock_optimistic/validate</span> <span style="color: #888888;">// range. Even on 32 bit platforms that would require 1 billion</span> <span style="color: #888888;">// frame registrations within the time span of a few assembler</span> <span style="color: #888888;">// instructions.</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> version_lock; }; <span style="color: #557799;">#ifdef __GTHREAD_HAS_COND</span> <span style="color: #888888;">// We should never get contention within the tree as it rarely changes.</span> <span style="color: #888888;">// But if we ever do get contention we use these for waiting</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">__gthread_mutex_t</span> version_lock_mutex <span style="color: #333333;">=</span> __GTHREAD_MUTEX_INIT; <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">__gthread_cond_t</span> version_lock_cond <span style="color: #333333;">=</span> __GTHREAD_COND_INIT; <span style="color: #557799;">#endif</span> </pre></div> <br /><p></p><p>The version lock consists of a single number, where the lower two bits indicate the lock status. 
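</p><p>The following summary of that bit layout is for exposition only; the macro names below are made up for this article and do not appear in the actual code:</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">
// Exposition only: the bit layout of the version lock word
#define VL_EXCLUSIVE_BIT ((uintptr_t) 1)    /* lowest bit: locked exclusively      */
#define VL_WAITERS_BIT   ((uintptr_t) 2)    /* second bit: some thread is sleeping */
#define VL_COUNTER_MASK  (~((uintptr_t) 3)) /* remaining bits: version counter     */
// An exclusive unlock bumps the counter and clears both low bits, which
// is exactly the (state + 4) &amp; ~3 computation shown further below.
</pre></div><p>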
As we will see below, for exclusive locks we will use them just like a regular mutex, with the addition that the higher bits are incremented on every unlock. If we do get contention we use version_lock_mutex and version_lock_cond for sleeping, but that should be very rare. For readers we do not modify the lock at all but just remember its state. After the read is finished we check the state again. If it changed, we did a racy read, and try again. Note that such a locking mechanism is sometimes called a <a href="https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf">sequence lock</a> in the literature. The great advantage is that readers can run fully parallel, and the performance is excellent as long as writes are uncommon.<br /></p><p>Initializing the lock and trying to acquire the lock in exclusive mode are straightforward:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Initialize in locked state</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">version_lock_initialize_locked_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { vl<span style="color: #333333;">-&gt;</span>version_lock <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>; } <span style="color: #888888;">// Try to lock the node exclusive</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">version_lock_try_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">false</span>; <span style="color: #008800; font-weight: bold;">return</span> __atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); } </pre></div> <p></p><p>We simply set the lock to 1 to initialize a new lock in locked state. The try_lock tries to change the lowest bit, and fails if that is not possible.</p><p>For blocking lock_exclusive calls we first try the same as try_lock. 
If that fails, we acquire the mutex, try to lock again, and sleep if we did not get the lock:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Lock the node exclusive, blocking as needed</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">version_lock_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { <span style="color: #557799;">#ifndef __GTHREAD_HAS_COND</span> <span style="color: #997700; font-weight: bold;">restart:</span> <span style="color: #557799;">#endif</span> <span style="color: #888888;">// We should virtually never get contention here, as frame</span> <span style="color: #888888;">// changes are rare</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>)) { <span style="color: #008800; font-weight: bold;">if</span> (__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) <span style="color: #008800; font-weight: bold;">return</span>; } <span style="color: #888888;">// We did get contention, wait properly</span> <span style="color: #557799;">#ifdef __GTHREAD_HAS_COND</span> __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>version_lock_mutex); state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #007020;">true</span>) { <span style="color: #888888;">// Check if the lock is still held</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>)) { <span style="color: #008800; font-weight: bold;">if</span> (__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>version_lock_mutex); <span style="color: #008800; font-weight: bold;">return</span>; } <span style="color: #008800; 
font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">continue</span>; } } <span style="color: #888888;">// Register waiting thread</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">2</span>)) { <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">2</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) <span style="color: #008800; font-weight: bold;">continue</span>; } <span style="color: #888888;">// And sleep</span> __gthread_cond_wait (<span style="color: #333333;">&amp;</span>version_lock_cond, <span style="color: #333333;">&amp;</span>version_lock_mutex); state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); } <span style="color: #557799;">#else</span> <span style="color: #888888;">// Spin if we do not have condition variables available</span> <span style="color: #888888;">// We expect no contention here, spinning should be okay</span> <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #557799;">#endif</span> } </pre></div> <br /><p></p><p>When sleeping we set the second-lowest bit, too, to indicate waiting threads. The unlock function checks that bit, and wakes up the threads if needed:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Release a locked node and increase the version lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">version_lock_unlock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { <span style="color: #888888;">// increase version, reset exclusive lock bits</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #333399; font-weight: bold;">uintptr_t</span> ns <span style="color: #333333;">=</span> (state <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">4</span>) <span style="color: #333333;">&amp;</span> (<span style="color: #333333;">~</span>((<span style="color: #333399; font-weight: bold;">uintptr_t</span>) <span style="color: #0000dd; font-weight: bold;">3</span>)); state <span style="color: #333333;">=</span> __atomic_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), ns, __ATOMIC_SEQ_CST); <span 
style="color: #557799;">#ifdef __GTHREAD_HAS_COND</span> <span style="color: #008800; font-weight: bold;">if</span> (state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">2</span>) { <span style="color: #888888;">// Wake up waiting threads. This should be extremely rare.</span> __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>version_lock_mutex); __gthread_cond_broadcast (<span style="color: #333333;">&amp;</span>version_lock_cond); __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>version_lock_mutex); } <span style="color: #557799;">#endif</span> } </pre></div> <p></p><p>Readers do not modify the lock at all. When they "lock" in shared mode they store the current state, and check if the state is still the same in the validate function:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Acquire an optimistic "lock". Note that this does not lock at all, it</span> <span style="color: #888888;">// only allows for validation later</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">version_lock_lock_optimistic</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl, <span style="color: #333399; font-weight: bold;">uintptr_t</span> <span style="color: #333333;">*</span>lock) { <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #333333;">*</span>lock <span style="color: #333333;">=</span> state; <span style="color: #888888;">// Acquiring the lock fails when there is currently an exclusive lock</span> <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>); } <span style="color: #888888;">// Validate a previously acquire lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">version_lock_validate</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl, <span style="color: #333399; font-weight: bold;">uintptr_t</span> lock) { <span style="color: #888888;">// Prevent the reordering of non-atomic loads behind the atomic load.</span> <span style="color: #888888;">// Hans Boehm, Can Seqlocks Get Along with Programming Language Memory</span> <span style="color: #888888;">// Models?, Section 4.</span> __atomic_thread_fence (__ATOMIC_ACQUIRE); <span style="color: #888888;">// Check that the node 
is still in the same state</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">return</span> (state <span style="color: #333333;">==</span> lock); } </pre></div> <p></p><p>We fail early if the lock is currently locked exclusive.</p><p>Note that optimistic lock coupling conceptually does the same as classical lock coupling. The main difference is that we have to validate a lock before we can act upon a value that we have read. This means: 1) lock parent, 2) fetch child pointer, 3) validate parent, restart if validation fails, 4) lock child, etc.</p><p>In the next article we look at the <a href="https://databasearchitects.blogspot.com/2022/06/btree.html">b-tree data structure</a> that we use to store the frames.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-34427107927557544452022-06-26T10:48:00.001+02:002022-06-26T10:48:52.446+02:00Making unwinding through JIT-ed code scalable - Replacing the gcc hooks<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>As discussed in the previous article, the gcc mechanism does not scale because it uses a global lock to protect its list of unwinding frames. To solve that problem, we replace that list with a read-optimized b-tree that allows for concurrent reads and writes. In this article we just discuss the patches to gcc necessary to enable that mechanism; the b-tree itself is discussed in subsequent articles.</p><p>We start by replacing the old fast path mechanism with a b-tree root:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: navy; font-weight: bold;">index 8ee55be5675..d546b9e4c43 100644</span> <span style="color: #a00000;">--- a/libgcc/unwind-dw2-fde.c</span> <span style="color: #00a000;">+++ b/libgcc/unwind-dw2-fde.c</span> <span style="color: purple; font-weight: bold;">@@ -42,15 +42,34 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see</span> #endif #endif <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+#include "unwind-dw2-btree.h"</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+static struct btree registered_frames;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+release_registered_frames (void) __attribute__ ((destructor (110)));</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+release_registered_frames (void)</span> <span style="color: #00a000;">+{</span> <span style="color: #00a000;">+ /* Release the b-tree and all frames. 
Frame releases that happen later are</span> <span style="color: #00a000;">+ * silently ignored */</span> <span style="color: #00a000;">+ btree_destroy (&amp;registered_frames);</span> <span style="color: #00a000;">+}</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+get_pc_range (const struct object *ob, uintptr_t *range);</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+init_object (struct object *ob);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+#else</span> <span style="color: #00a000;">+</span> /* The unseen_objects list contains objects that have been registered but not yet categorized in any way. The seen_objects list has had its pc_begin and count fields initialized at minimum, and is sorted by decreasing value of pc_begin. */ static struct object *unseen_objects; static struct object *seen_objects; <span style="color: #a00000;">-#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #a00000;">-static int any_objects_registered;</span> <span style="color: #a00000;">-#endif</span> #ifdef __GTHREAD_MUTEX_INIT static __gthread_mutex_t object_mutex = __GTHREAD_MUTEX_INIT; <span style="color: purple; font-weight: bold;">@@ -78,6 +97,7 @@ init_object_mutex_once (void)</span> static __gthread_mutex_t object_mutex; #endif #endif <span style="color: #00a000;">+#endif</span> </pre></div> <br /><p></p><p></p><p>When the platform supports atomics (ATOMIC_FDE_FAST_PATH), we replace the whole mechanism with one b-tree, whose root is registered_frames. Neither the object lists nor the mutex exist on such platforms. To avoid leaking the memory for the b-tree we register release_registered_frames as a destructor with a very late priority on shutdown. Note that it is still safe to throw exceptions even when the b-tree is gone, as long as unwinding does not have to go through JITed code that was registered here. The destroyed b-tree behaves like an empty b-tree. 
get_pc_range and init_object are forward declared because we need them already when inserting into the b-tree.</p><p>&nbsp;When registering frames we no longer grab a mutex, we immediately store the frames in the b-tree: <br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -99,23 +119,23 @@ __register_frame_info_bases (const void *begin, struct object *ob,</span> ob-&gt;fde_end = NULL; #endif <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+ // Initialize eagerly to avoid locking later</span> <span style="color: #00a000;">+ init_object (ob);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ // And register the frame</span> <span style="color: #00a000;">+ uintptr_t range[2];</span> <span style="color: #00a000;">+ get_pc_range (ob, range);</span> <span style="color: #00a000;">+ btree_insert (&amp;registered_frames, range[0], range[1] - range[0], ob);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); ob-&gt;next = unseen_objects; unseen_objects = ob; <span style="color: #a00000;">-#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #a00000;">- /* Set flag that at least one library has registered FDEs.</span> <span style="color: #a00000;">- Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #a00000;">- loading/initialization happens-before using that library in other</span> <span style="color: #a00000;">- threads (in particular unwinding with that library's functions</span> <span style="color: #a00000;">- appearing in the backtraces). Calling that library's functions</span> <span style="color: #a00000;">- without waiting for the library to initialize would be racy. 
*/</span> <span style="color: #a00000;">- if (!any_objects_registered)</span> <span style="color: #a00000;">- __atomic_store_n (&amp;any_objects_registered, 1, __ATOMIC_RELAXED);</span> <span style="color: #a00000;">-#endif</span> __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> } void <span style="color: purple; font-weight: bold;">@@ -153,23 +173,23 @@ __register_frame_info_table_bases (void *begin, struct object *ob,</span> ob-&gt;s.b.from_array = 1; ob-&gt;s.b.encoding = DW_EH_PE_omit; <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+ // Initialize eagerly to avoid locking later</span> <span style="color: #00a000;">+ init_object (ob);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ // And register the frame</span> <span style="color: #00a000;">+ uintptr_t range[2];</span> <span style="color: #00a000;">+ get_pc_range (ob, range);</span> <span style="color: #00a000;">+ btree_insert (&amp;registered_frames, range[0], range[1] - range[0], ob);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); ob-&gt;next = unseen_objects; unseen_objects = ob; <span style="color: #a00000;">-#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #a00000;">- /* Set flag that at least one library has registered FDEs.</span> <span style="color: #a00000;">- Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #a00000;">- loading/initialization happens-before using that library in other</span> <span style="color: #a00000;">- threads (in particular unwinding with that library's functions</span> <span style="color: #a00000;">- appearing in the backtraces). Calling that library's functions</span> <span style="color: #a00000;">- without waiting for the library to initialize would be racy. */</span> <span style="color: #a00000;">- if (!any_objects_registered)</span> <span style="color: #a00000;">- __atomic_store_n (&amp;any_objects_registered, 1, __ATOMIC_RELAXED);</span> <span style="color: #a00000;">-#endif</span> __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> } </pre></div> <br /><p></p><p>Note that there is a subtle change of logic here: The old mechanism stores all incoming frames as they are in the unseen_objects chain without even looking at them. Then, during unwinding, it inspects the frames to find out the range of program counter values (pc_range), and sorts them in the seen_objects list. This fundamentally requires a lock, as the objects are modified during unwinding. To avoid that, the new code immediately computes the PC range and lets the b-tree do the sorting, which keeps the frames immutable during unwinding.</p><p>When searching for a frame during unwinding we delegate everything to the b-tree, which provides lock-free lookups. 
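</p><p>To make the keying explicit, here is a sketch of how the patched code uses the b-tree; the btree_* calls are the real ones from the diffs above and below, while the surrounding variables are made up for illustration:</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">
/* Sketch only: the PC range computed by get_pc_range acts as the key */
uintptr_t range[2];
get_pc_range (ob, range);

/* registration: store the object under the start of its PC range */
btree_insert (&amp;registered_frames, range[0], range[1] - range[0], ob);

/* unwinding: find the object whose range contains the program counter */
struct object *hit = btree_lookup (&amp;registered_frames, (uintptr_t) pc);

/* deregistration: remove the object by the start of its PC range */
struct object *removed = btree_remove (&amp;registered_frames, range[0]);
</pre></div><p>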
No mutex is acquired on platforms that support atomics:<br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -1033,17 +1161,12 @@ _Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)</span> const fde *f = NULL; #ifdef ATOMIC_FDE_FAST_PATH <span style="color: #a00000;">- /* For targets where unwind info is usually not registered through these</span> <span style="color: #a00000;">- APIs anymore, avoid taking a global lock.</span> <span style="color: #a00000;">- Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #a00000;">- loading/initialization happens-before using that library in other</span> <span style="color: #a00000;">- threads (in particular unwinding with that library's functions</span> <span style="color: #a00000;">- appearing in the backtraces). Calling that library's functions</span> <span style="color: #a00000;">- without waiting for the library to initialize would be racy. */</span> <span style="color: #a00000;">- if (__builtin_expect (!__atomic_load_n (&amp;any_objects_registered,</span> <span style="color: #a00000;">- __ATOMIC_RELAXED), 1))</span> <span style="color: #00a000;">+ ob = btree_lookup (&amp;registered_frames, (uintptr_t) pc);</span> <span style="color: #00a000;">+ if (!ob)</span> return NULL; <span style="color: #a00000;">-#endif</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ f = search_object (ob, pc);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); <span style="color: purple; font-weight: bold;">@@ -1081,6 +1204,7 @@ _Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)</span> fini: __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> if (f) { </pre></div> <br /><p></p><p>De-registering a frame works just the same as registering, we use get_pc_range to get the range and then remove it from the b-tree: <br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -200,16 +220,33 @@ __register_frame_table (void *begin)</span> void * __deregister_frame_info_bases (const void *begin) { <span style="color: #a00000;">- struct object **p;</span> struct object *ob = 0; /* If .eh_frame is empty, we haven't registered. 
*/ if ((const uword *) begin == 0 || *(const uword *) begin == 0) return ob; <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+ // Find the corresponding PC range</span> <span style="color: #00a000;">+ struct object lookupob;</span> <span style="color: #00a000;">+ lookupob.tbase = 0;</span> <span style="color: #00a000;">+ lookupob.dbase = 0;</span> <span style="color: #00a000;">+ lookupob.u.single = begin;</span> <span style="color: #00a000;">+ lookupob.s.i = 0;</span> <span style="color: #00a000;">+ lookupob.s.b.encoding = DW_EH_PE_omit;</span> <span style="color: #00a000;">+#ifdef DWARF2_OBJECT_END_PTR_EXTENSION</span> <span style="color: #00a000;">+ lookupob.fde_end = NULL;</span> <span style="color: #00a000;">+#endif</span> <span style="color: #00a000;">+ uintptr_t range[2];</span> <span style="color: #00a000;">+ get_pc_range (&amp;lookupob, range);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ // And remove</span> <span style="color: #00a000;">+ ob = btree_remove (&amp;registered_frames, range[0]);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); <span style="color: #00a000;">+ struct object **p;</span> for (p = &amp;unseen_objects; *p ; p = &amp;(*p)-&gt;next) if ((*p)-&gt;u.single == begin) { <span style="color: purple; font-weight: bold;">@@ -241,6 +278,8 @@ __deregister_frame_info_bases (const void *begin)</span> out: __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> <span style="color: #00a000;">+</span> gcc_assert (ob); return (void *) ob; } </pre></div> <p></p><p>We are nearly done with our changes to existing gcc code, we just need a few more helper functions for get_pc_range:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -264,7 +303,7 @@ __deregister_frame (void *begin)</span> instead of an _Unwind_Context. */ static _Unwind_Ptr <span style="color: #a00000;">-base_from_object (unsigned char encoding, struct object *ob)</span> <span style="color: #00a000;">+base_from_object (unsigned char encoding, const struct object *ob)</span> { if (encoding == DW_EH_PE_omit) return 0; <span style="color: purple; font-weight: bold;">@@ -821,6 +860,91 @@ init_object (struct object* ob)</span> ob-&gt;s.b.sorted = 1; } <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+/* Get the PC range from FDEs. The code is very similar to</span> <span style="color: #00a000;">+ classify_object_over_fdes and should be kept in sync with</span> <span style="color: #00a000;">+ that. 
The main difference is that classify_object_over_fdes</span> <span style="color: #00a000;">+ modifies the object, which we cannot do here */</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+get_pc_range_from_fdes (const struct object *ob, const fde *this_fde,</span> <span style="color: #00a000;">+ uintptr_t *range)</span> <span style="color: #00a000;">+{</span> <span style="color: #00a000;">+ const struct dwarf_cie *last_cie = 0;</span> <span style="color: #00a000;">+ int encoding = DW_EH_PE_absptr;</span> <span style="color: #00a000;">+ _Unwind_Ptr base = 0;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ for (; !last_fde (ob, this_fde); this_fde = next_fde (this_fde))</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ const struct dwarf_cie *this_cie;</span> <span style="color: #00a000;">+ _Unwind_Ptr mask, pc_begin, pc_range;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ /* Skip CIEs. */</span> <span style="color: #00a000;">+ if (this_fde-&gt;CIE_delta == 0)</span> <span style="color: #00a000;">+ continue;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ this_cie = get_cie (this_fde);</span> <span style="color: #00a000;">+ if (this_cie != last_cie)</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ last_cie = this_cie;</span> <span style="color: #00a000;">+ encoding = get_cie_encoding (this_cie);</span> <span style="color: #00a000;">+ base = base_from_object (encoding, ob);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ const unsigned char *p;</span> <span style="color: #00a000;">+ p = read_encoded_value_with_base (encoding, base, this_fde-&gt;pc_begin,</span> <span style="color: #00a000;">+ &amp;pc_begin);</span> <span style="color: #00a000;">+ read_encoded_value_with_base (encoding &amp; 0x0F, 0, p, &amp;pc_range);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ /* Take care to ignore link-once functions that were removed.</span> <span style="color: #00a000;">+ In these cases, the function address will be NULL, but if</span> <span style="color: #00a000;">+ the encoding is smaller than a pointer a true NULL may not</span> <span style="color: #00a000;">+ be representable. Assume 0 in the representable bits is NULL. 
*/</span> <span style="color: #00a000;">+ mask = size_of_encoded_value (encoding);</span> <span style="color: #00a000;">+ if (mask &lt; sizeof (void *))</span> <span style="color: #00a000;">+ mask = (((_Unwind_Ptr) 1) &lt;&lt; (mask &lt;&lt; 3)) - 1;</span> <span style="color: #00a000;">+ else</span> <span style="color: #00a000;">+ mask = -1;</span> <span style="color: #00a000;">+ if ((pc_begin &amp; mask) == 0)</span> <span style="color: #00a000;">+ continue;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ _Unwind_Ptr pc_end = pc_begin + pc_range;</span> <span style="color: #00a000;">+ if ((!range[0]) &amp;&amp; (!range[1]))</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ range[0] = pc_begin;</span> <span style="color: #00a000;">+ range[1] = pc_end;</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ else</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ if (pc_begin &lt; range[0])</span> <span style="color: #00a000;">+ range[0] = pc_begin;</span> <span style="color: #00a000;">+ if (pc_end &gt; range[1])</span> <span style="color: #00a000;">+ range[1] = pc_end;</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+}</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+/* Get the PC range for lookup */</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+get_pc_range (const struct object *ob, uintptr_t *range)</span> <span style="color: #00a000;">+{</span> <span style="color: #00a000;">+ range[0] = range[1] = 0;</span> <span style="color: #00a000;">+ if (ob-&gt;s.b.sorted)</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ get_pc_range_from_fdes (ob, ob-&gt;u.sort-&gt;orig_data, range);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ else if (ob-&gt;s.b.from_array)</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ fde **p = ob-&gt;u.array;</span> <span style="color: #00a000;">+ for (; *p; ++p)</span> <span style="color: #00a000;">+ get_pc_range_from_fdes (ob, *p, range);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ else</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ get_pc_range_from_fdes (ob, ob-&gt;u.single, range);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+}</span> <span style="color: #00a000;">+#endif</span> <span style="color: #00a000;">+</span> /* A linear search through a set of FDEs for the given PC. This is used when there was insufficient memory to allocate and sort an array. */ </pre></div> <p></p><p>The code finds out the range of possible pc values for the FDE. Conceptually it does nearly the same as the already existing function classify_object_over_fdes. 
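</p><p>A small worked example of what this range computation produces (the addresses are invented for illustration):</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">
/* Invented example: an object with three FDEs covering
     [0x1000, 0x1040), [0x1200, 0x1280) and [0x1100, 0x1180)
   ends up with
     range[0] = 0x1000   (smallest pc_begin seen)
     range[1] = 0x1280   (largest pc_end seen)
   i.e. one contiguous key range for the whole object, even if the
   individual functions leave gaps in between. */
</pre></div><p>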
Unfortunately we cannot reuse that code, as classify_object_over_fdes modifies the object in order to speed up subsequent searches, and we require our data to be immutable.</p><p>In the next article we look at <a href="https://databasearchitects.blogspot.com/2022/06/optimisticlockcoupling.html">optimistic lock coupling</a>, which is the mechanism that we use to allow for lock-free readers.</p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-6412425750200813002022-06-26T10:46:00.001+02:002022-06-26T11:16:10.760+02:00Making unwinding through JIT-ed code scalable<p>&nbsp;Exceptions are a very handy mechanism to propagate errors in C++ programs, but unfortunately <a href="https://isocpp.org/files/papers/P2544R0.html">they do not scale very well</a>. In all common C++ implementations the unwinding mechanism takes a global lock during unwinding, which has disastrous consequences when the number of threads is high. On a machine with 256 hardware contexts we see worse-than-single-threaded behavior even for relatively modest failure rates.</p><p>Fortunately, Florian Weimer fixed one contention point in gcc 12 on systems with glibc 2.35 or newer, which gives us scalable exceptions as long as no JIT-ed code has been registered. Unfortunately, our system does register JIT-ed code, which means exception unwinding in our code base is still single-threaded in practice.</p><p>But we can fix that by teaching gcc to store the unwinding information in a read-optimized b-tree, which allows for fully parallel unwinding without any atomic writes. There is a&nbsp;<a href="https://gcc.gnu.org/pipermail/gcc-patches/2022-June/597256.html">gcc patch</a> that does just that, but unfortunately it is quite involved and difficult to review. This article series thus explains all parts of the patch and shows how a read-optimized b-tree can be implemented lock-free.&nbsp;</p><p>In order to keep the article length somewhat reasonable, the discussion is broken into parts:</p><ol style="text-align: left;"><li><a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">The problem (this article)</a></li><li><a href="https://databasearchitects.blogspot.com/2022/06/replacinggcchooks.html">Replacing the gcc hooks</a>&nbsp;</li><li><a href="https://databasearchitects.blogspot.com/2022/06/optimisticlockcoupling.html">Optimistic Lock Coupling</a></li><li><a href="https://databasearchitects.blogspot.com/2022/06/btree.html">The b-tree</a>&nbsp;</li><li><a href="https://databasearchitects.blogspot.com/2022/06/btreeoperations.html">b-tree operations</a>&nbsp;&nbsp; <br /></li></ol><p>When unwinding exceptions, the unwinder has to find the corresponding unwinding information for every call frame on the stack between the throw and the catch. gcc uses two different mechanisms for that: For ahead-of-time compiled code it asks glibc to find the unwinding information using either dl_iterate_phdr (on older systems) or _dl_find_object (on systems with glibc 2.35 or newer). Note that this mapping is not static, as shared libraries could be added or removed at any time, potentially during a concurrent unwind. For that reason dl_iterate_phdr was protected by a global mutex, which clearly does not scale. _dl_find_object avoids that mutex by using a lock-free data structure.
But that only affects ahead-of-time compiled code, for JITed code libgcc uses <a href="https://github.com/gcc-mirror/gcc/blob/fc259b522c0f8b7bbca8e7adcd3da63330094a34/libgcc/unwind-dw2-fde.c">a different mechanism</a>, which we now discuss:</p><p>For JITed code the emitter has to explicitly register the unwinding information using <span class="pl-en">__register_frame_info_bases (and the very similar </span><span class="pl-en">__register_frame_info_table_bases):</span></p><!--HTML generated using hilite.me--><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow-wrap: normal; overflow: scroll; padding: 0.2em 0.6em; width: 110%; word-wrap: normal;"><pre style="line-height: 125%; margin: 0px; overflow-wrap: normal; word-wrap: normal;"><span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">__register_frame_info_bases</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>begin, <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob, <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>tbase, <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>dbase) { <span style="color: #888888;">/* If .eh_frame is empty, don't register at all. */</span> <span style="color: #008800; font-weight: bold;">if</span> ((<span style="color: #008800; font-weight: bold;">const</span> uword <span style="color: #333333;">*</span>) begin <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span> <span style="color: #333333;">||</span> <span style="color: #333333;">*</span>(<span style="color: #008800; font-weight: bold;">const</span> uword <span style="color: #333333;">*</span>) begin <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span>) <span style="color: #008800; font-weight: bold;">return</span>; ob<span style="color: #333333;">-&gt;</span>pc_begin <span style="color: #333333;">=</span> (<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>)<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>; ob<span style="color: #333333;">-&gt;</span>tbase <span style="color: #333333;">=</span> tbase; ob<span style="color: #333333;">-&gt;</span>dbase <span style="color: #333333;">=</span> dbase; ob<span style="color: #333333;">-&gt;</span>u.single <span style="color: #333333;">=</span> begin; ob<span style="color: #333333;">-&gt;</span>s.i <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; ob<span style="color: #333333;">-&gt;</span>s.b.encoding <span style="color: #333333;">=</span> DW_EH_PE_omit; <span style="color: #557799;">#ifdef DWARF2_OBJECT_END_PTR_EXTENSION</span> ob<span style="color: #333333;">-&gt;</span>fde_end <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; <span style="color: #557799;">#endif</span> init_object_mutex_once (); __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>object_mutex); ob<span style="color: #333333;">-&gt;</span>next <span style="color: 
#333333;">=</span> unseen_objects; unseen_objects <span style="color: #333333;">=</span> ob; <span style="color: #557799;">#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #888888;">/* Set flag that at least one library has registered FDEs.</span> <span style="color: #888888;"> Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #888888;"> loading/initialization happens-before using that library in other</span> <span style="color: #888888;"> threads (in particular unwinding with that library's functions</span> <span style="color: #888888;"> appearing in the backtraces). Calling that library's functions</span> <span style="color: #888888;"> without waiting for the library to initialize would be racy. */</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>any_objects_registered) __atomic_store_n (<span style="color: #333333;">&amp;</span>any_objects_registered, <span style="color: #0000dd; font-weight: bold;">1</span>, __ATOMIC_RELAXED); <span style="color: #557799;">#endif</span> __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>object_mutex); } </pre></div> <p><span class="pl-en">&nbsp;Which conceptually is a simple thing: It grabs a mutex, puts the object information into a list, and released the mutex. Note the code within the </span><span style="color: #557799;">ATOMIC_FDE_FAST_PATH </span><span class="pl-en">block: As we will see below, that unwinding mechanism is very slow. Thus, libgcc tries to avoid taking that path by remembering if any JITed code was registered at all. If not, it stops unwinding immediately. But that mechanism does not help if we do have JITed code.</span></p><p><span class="pl-en">During unwinding, the unwinder calls&nbsp; </span><span style="color: #0066bb; font-weight: bold;">_Unwind_Find_FDE, </span>which traverses the list in order to find the corresponding unwind table for the given address:<br /></p> <div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; scroll-x: auto; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #008800; font-weight: bold;">const</span> fde <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">_Unwind_Find_FDE</span> (<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>pc, <span style="color: #008800; font-weight: bold;">struct</span> dwarf_eh_bases <span style="color: #333333;">*</span>bases) { <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob; <span style="color: #008800; font-weight: bold;">const</span> fde <span style="color: #333333;">*</span>f <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; <span style="color: #557799;">#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #888888;">/* For targets where unwind info is usually not registered through these</span> <span style="color: #888888;"> APIs anymore, avoid taking a global lock.</span> <span style="color: #888888;"> Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #888888;"> loading/initialization happens-before using that library in other</span> <span style="color: #888888;"> threads (in particular unwinding with 
that library's functions</span> <span style="color: #888888;"> appearing in the backtraces). Calling that library's functions</span> <span style="color: #888888;"> without waiting for the library to initialize would be racy. */</span> <span style="color: #008800; font-weight: bold;">if</span> (__builtin_expect (<span style="color: #333333;">!</span>__atomic_load_n (<span style="color: #333333;">&amp;</span>any_objects_registered, __ATOMIC_RELAXED), <span style="color: #0000dd; font-weight: bold;">1</span>)) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #557799;">#endif</span> init_object_mutex_once (); __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>object_mutex); <span style="color: #888888;">/* Linear search through the classified objects, to find the one</span> <span style="color: #888888;"> containing the pc. Note that pc_begin is sorted descending, and</span> <span style="color: #888888;"> we expect objects to be non-overlapping. */</span> <span style="color: #008800; font-weight: bold;">for</span> (ob <span style="color: #333333;">=</span> seen_objects; ob; ob <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>next) <span style="color: #008800; font-weight: bold;">if</span> (pc <span style="color: #333333;">&gt;=</span> ob<span style="color: #333333;">-&gt;</span>pc_begin) { f <span style="color: #333333;">=</span> search_object (ob, pc); <span style="color: #008800; font-weight: bold;">if</span> (f) <span style="color: #008800; font-weight: bold;">goto</span> fini; <span style="color: #008800; font-weight: bold;">break</span>; } <span style="color: #888888;">/* Classify and search the objects we've not yet processed. */</span> <span style="color: #008800; font-weight: bold;">while</span> ((ob <span style="color: #333333;">=</span> unseen_objects)) { <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">**</span>p; unseen_objects <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>next; f <span style="color: #333333;">=</span> search_object (ob, pc); <span style="color: #888888;">/* Insert the object into the classified list. 
*/</span> <span style="color: #008800; font-weight: bold;">for</span> (p <span style="color: #333333;">=</span> <span style="color: #333333;">&amp;</span>seen_objects; <span style="color: #333333;">*</span>p ; p <span style="color: #333333;">=</span> <span style="color: #333333;">&amp;</span>(<span style="color: #333333;">*</span>p)<span style="color: #333333;">-&gt;</span>next) <span style="color: #008800; font-weight: bold;">if</span> ((<span style="color: #333333;">*</span>p)<span style="color: #333333;">-&gt;</span>pc_begin <span style="color: #333333;">&lt;</span> ob<span style="color: #333333;">-&gt;</span>pc_begin) <span style="color: #008800; font-weight: bold;">break</span>; ob<span style="color: #333333;">-&gt;</span>next <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>p; <span style="color: #333333;">*</span>p <span style="color: #333333;">=</span> ob; <span style="color: #008800; font-weight: bold;">if</span> (f) <span style="color: #008800; font-weight: bold;">goto</span> fini; } <span style="color: #997700; font-weight: bold;">fini:</span> __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>object_mutex); <span style="color: #008800; font-weight: bold;">if</span> (f) { <span style="color: #333399; font-weight: bold;">int</span> encoding; _Unwind_Ptr func; bases<span style="color: #333333;">-&gt;</span>tbase <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>tbase; bases<span style="color: #333333;">-&gt;</span>dbase <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>dbase; encoding <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>s.b.encoding; <span style="color: #008800; font-weight: bold;">if</span> (ob<span style="color: #333333;">-&gt;</span>s.b.mixed_encoding) encoding <span style="color: #333333;">=</span> get_fde_encoding (f); read_encoded_value_with_base (encoding, base_from_object (encoding, ob), f<span style="color: #333333;">-&gt;</span>pc_begin, <span style="color: #333333;">&amp;</span>func); bases<span style="color: #333333;">-&gt;</span>func <span style="color: #333333;">=</span> (<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>) func; } <span style="color: #008800; font-weight: bold;">return</span> f; } </pre></div> <p>In the fast path it tries to stop early if no unwinding information had been registered. But if any unwinding frame has been registered by JITed code, it grabs the global object_mutex and does everything effectively single-threaded. This does not scale. Thus, we discuss how to switch gcc to a lock-free lookup structure in the <a href="https://databasearchitects.blogspot.com/2022/06/replacinggcchooks.html">next article</a>.</p><p><br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-39280898467884404382022-04-03T09:59:00.009+02:002022-06-01T07:19:41.690+02:00Cloud Network Traffic Within the Same Region Can Be Very Expensive<p>Everyone knows that the major cloud vendors try to make it easy to get data in, and hard to get it out. What is less known is that high egress cost also applies to outbound traffic within the same region.<br /><br />Let's look at AWS specifically. In AWS, EC2 outbound traffic is only free within the same availability zone (AZ). 
Moving data from one AZ to another in the same region is actually quite expensive:</p><blockquote><i>"IPv4: Data transferred “in“ to and “out“ from Amazon EC2 [...] across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction."</i> <a href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region">source</a><br /></blockquote>This means that transferring 1TB costs $0.01/GB * 1000GB * 2 = $20. For comparison: most inter-region transfers cost $0.02 per GB, but only for outgoing traffic. Thus, remarkably, transferring 1TB from Ohio to Tokyo will cost the same as transferring it within Ohio from <i>us-east-2a</i> to <i>us-east-2b</i>. Two <i>c5n.18xlarge</i> instances communicating with each other at full 100 Gbit speed can theoretically incur network costs of $1,800 per hour (or $1,296,000 per month).<br /><br />Interestingly, S3 can be used to bypass the high traffic cost when moving data between different AZs in the same region because<br /><blockquote><i>"Data transferred directly between Amazon S3 [...] in the same AWS Region is free."</i> <a href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region">source</a><br /></blockquote>Let's see if we can exploit this. Consider again our example where we want to transfer 1TB from <i>us-east-2a</i> to <i>us-east-2b</i>. Instead of two EC2 instances talking directly, we could use an S3 Standard bucket in <i>us-east-2</i>. We first PUT the data into it from <i>us-east-2a</i>, then GET the data from that bucket using <i>us-east-2b</i> instances, and finally delete all objects. If we split our data into 1,000 1GB chunks, we need to pay for only 1,000 PUT and 1,000 GET S3 requests, which would be less than $0.01. Storage cost is also low: assuming a transfer rate of 1GB/s, the data would have to be stored in S3 for less than an hour, which costs about $0.03 (and could be reduced via pipelining). Thus, in total we can transfer 1TB through S3 for less than $0.05 instead of $20.<br /><br />To summarize, inter-AZ traffic within the same region is unreasonably expensive. It seems unlikely that these prices have any resemblance to internal AWS cost -- as is reflected by the fact that AWS services such as S3 allow bypassing this cost.<br /><br /><p></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-83540453226657587442022-01-16T15:07:00.003+01:002022-01-17T10:06:17.281+01:00Are you sure you want to use MMAP in your database management system?<p>Many database management systems carefully manage disk I/O operations and explicitly cache pages in main memory. Operating systems implement a page cache to speed up recurring disk accesses as well, and even allow transparent access to disk files through the mmap system call. Why do most database systems then even implement I/O handling and a caching component if the OS provides these features through mmap? Andrew Pavlo, Andrew Crotty, and I tried to answer this question in a <a href="https://db.cs.cmu.edu/mmap-cidr2022/">CIDR 2022 paper</a>. This is quite a contentious question, as the <a href="https://news.ycombinator.com/item?id=29936104">Hacker News discussion of the paper</a> shows.<br /><br />The paper argues that using <a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap</a> in database systems is almost always a bad idea. 
To implement transactions and crash recovery with mmap, the DBMS has to write any change out-of-place because there is no way to prevent the write-back of a particular page. This makes it impossible to implement classical ARIES-style transactions. Furthermore, data access through mmap can take a handful of nanoseconds (if the data is in the CPU cache) or milliseconds (if it's on disk). If a page is not cached, it will be read through a synchronous page fault and there is no interface for asynchronous I/O. I/O errors, on the other hand, are communicated through signals rather than through a local error code. These problems are caused by mmap's interface, which is too high-level and does not give the database system enough control.<br /><br />In addition to discussing these interface problems, the paper also shows that Linux' page cache and mmap implementation cannot achieve the bandwidth of modern storage devices. One <a href="https://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZWLJ3T8HBLS/">PCIe 4.0 NVMe SSD</a> can read over 6 GB/s and <a href="https://news.samsung.com/global/samsung-develops-high-performance-pcie-5-0-ssd-for-enterprise-servers">upcoming PCIe 5.0 SSDs</a> will almost double this number. To achieve this performance, one needs to schedule hundreds or even thousands (if one has multiple SSDs) of concurrent I/O requests. Doing this in a synchronous fashion by starting hundreds of threads will not work well. Other kernel-level performance issues are single-threaded page eviction and TLB shootdowns. Overall, this is an example of OS evolution lagging behind hardware evolution.<br /><br />The OS has one big advantage over the DBMS though: it has control over the page table. Once a page is mapped, accessing it becomes transparent and as fast as ordinary memory. Any manually-implemented buffer manager, in contrast, will have some form of indirection, which causes some overhead. Pointer swizzling as implemented in <a href="https://leanstore.io/">LeanStore</a> and <a href="https://umbra-db.com/">Umbra</a> is a fast alternative but is also more difficult to implement than a traditional buffer manager and only supports tree-like data structures. Therefore, an interesting question is whether it would be possible to have an mmap-like interface, but with more control and better performance. Generally, I believe this kind of research across different areas should be more common.<br /><br />&nbsp;</p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-47241630820162770592021-07-04T16:23:00.006+02:002023-02-07T13:39:55.748+01:00AWS EC2 Hardware Trends: 2015-2021<p>&nbsp;</p><p>Over the past decade, AWS EC2 has introduced many new instance types with different hardware configurations and prices. This hardware zoo can make it hard to keep track of what is available. In this post we will look at how the EC2 hardware landscape has changed since 2015. This will hopefully help pick the best option for a given task.<br /><br />In the cloud, one can trade money for hardware resources. It therefore makes sense to take an economic perspective and normalize the hardware resources by the instance price. For example, instead of looking at absolute network bandwidth, we will use network bandwidth per dollar. Such metrics also allow us to ignore virtualized slices, reducing the number of instances relevant for the analysis from hundreds to dozens. 
For example, c5n.9xlarge is a virtualized slice of c5n.18xlarge with half the network bandwidth and half the cost.</p><h2 style="text-align: left;">Data Set</h2><p>We use historical data from <a href="https://instances.vantage.sh/">https://instances.vantage.sh/</a> and only consider current-generation Intel machines without GPUs. All prices are for us-east-1 Linux instances. Using these constraints, in July 2021 we can pick from the following instances:<br /> <style type="text/css">.tg-sort-header::-moz-selection{background:0 0}.tg-sort-header::selection{background:0 0}.tg-sort-header{cursor:pointer}.tg-sort-header:after{content:'';float:right;margin-top:7px;border-width:0 5px 5px;border-style:solid;border-color:#404040 transparent;visibility:hidden}.tg-sort-header:hover:after{visibility:visible}.tg-sort-asc:after,.tg-sort-asc:hover:after,.tg-sort-desc:after{visibility:visible;opacity:.4}.tg-sort-desc:after{border-bottom:none;border-width:5px 5px 0}@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;}}</style></p><div class="tg-wrap"><table class="tg" id="tg-T8L1r" style="border-collapse: collapse; border-spacing: 0px;"><thead><tr><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">name</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">vCPU</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">memory [GB]</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">network [Gbit/s]</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">storage</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">price [$/h]</th></tr></thead><tbody><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m4.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">256</td><td style="border-color: black; border-style: solid; border-width: 1px; 
font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3.20</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">h1.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">256</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8x2TB disk</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3.74</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">c5n.18x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">72</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3.89</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: 
center; vertical-align: top; word-break: normal;">d3.8x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">32</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">256</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">24x2TB disk</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.00</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">c5.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.08</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r4.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">488</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; 
border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.26</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.61</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">c5d.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.61</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">i3.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 
10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">488</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8x1.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.00</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5d.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.42</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">d2.8x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">36</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">244</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">10</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">24x2TB disk</td><td 
style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.52</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5n.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.71</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">6.05</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">d3en.12x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">48</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; 
font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">75</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">24x14TB disk</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">6.31</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5dn.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">6.52</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5d.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; 
word-break: normal;">6.91</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5n.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">7.15</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5b.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">7.15</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5dn.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; 
font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8.02</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">i3en.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8x7.5TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">10.85</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">x1e.32x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">128</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3904</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">2x1.9TB SATA</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">26.69</td></tr></tbody></table></div><p><script charset="utf-8">var TGSort=window.TGSort||function(n){"use strict";function r(n){return n?n.length:0}function 
t(n,t,e,o=0){for(e=r(n);o<e;++o)t(n[o],o)}function e(n){return n.split("").reverse().join("")}function o(n){var e=n[0];return t(n,function(n){for(;!n.startsWith(e);)e=e.substring(0,r(e)-1)}),r(e)}function u(n,r,e=[]){return t(n,function(n){r(n)&&e.push(n)}),e}var a=parseFloat;function i(n,r){return function(t){var e="";return t.replace(n,function(n,t,o){return e=t.replace(r,"")+"."+(o||"").substring(1)}),a(e)}}var s=i(/^(?:\s*)([+-]?(?:\d+)(?:,\d{3})*)(\.\d*)?$/g,/,/g),c=i(/^(?:\s*)([+-]?(?:\d+)(?:\.\d{3})*)(,\d*)?$/g,/\./g);function f(n){var t=a(n);return!isNaN(t)&&r(""+t)+1>=r(n)?t:NaN}function d(n){var e=[],o=n;return t([f,s,c],function(u){var a=[],i=[];t(n,function(n,r){r=u(n),a.push(r),r||i.push(n)}),r(i)<r(o)&&(o=i,e=a)}),r(u(o,function(n){return n==o[0]}))==r(o)?e:[]}function v(n){if("TABLE"==n.nodeName){for(var a=function(r){var e,o,u=[],a=[];return function n(r,e){e(r),t(r.childNodes,function(r){n(r,e)})}(n,function(n){"TR"==(o=n.nodeName)?(e=[],u.push(e),a.push(n)):"TD"!=o&&"TH"!=o||e.push(n)}),[u,a]}(),i=a[0],s=a[1],c=r(i),f=c>1&&r(i[0])<r(i[1])?1:0,v=f+1,p=i[f],h=r(p),l=[],g=[],N=[],m=v;m<c;++m){for(var T=0;T<h;++T){r(g)<h&&g.push([]);var C=i[m][T],L=C.textContent||C.innerText||"";g[T].push(L.trim())}N.push(m-v)}t(p,function(n,t){l[t]=0;var a=n.classList;a.add("tg-sort-header"),n.addEventListener("click",function(){var n=l[t];!function(){for(var n=0;n<h;++n){var r=p[n].classList;r.remove("tg-sort-asc"),r.remove("tg-sort-desc"),l[n]=0}}(),(n=1==n?-1:+!n)&&a.add(n>0?"tg-sort-asc":"tg-sort-desc"),l[t]=n;var i,f=g[t],m=function(r,t){return n*f[r].localeCompare(f[t])||n*(r-t)},T=function(n){var t=d(n);if(!r(t)){var u=o(n),a=o(n.map(e));t=d(n.map(function(n){return n.substring(u,r(n)-a)}))}return t}(f);(r(T)||r(T=r(u(i=f.map(Date.parse),isNaN))?[]:i))&&(m=function(r,t){var e=T[r],o=T[t],u=isNaN(e),a=isNaN(o);return u&&a?0:u?-n:a?n:e>o?n:e<o?-n:n*(r-t)});var C,L=N.slice();L.sort(m);for(var E=v;E<c;++E)(C=s[E].parentNode).removeChild(s[E]);for(E=v;E<c;++E)C.appendChild(s[v+L[E-v]])})})}}n.addEventListener("DOMContentLoaded",function(){for(var t=n.getElementsByClassName("tg"),e=0;e<r(t);++e)try{v(t[e])}catch(n){}})}(document)</script> </p><p></p><h2 style="text-align: left;">Compute</h2><p>Using our six-year data set, let's first look at the cost of compute:<br /></p> <img src="https://db.in.tum.de/~leis/cloud/cpuevol.svg" /> <br /> <p>It is quite remarkable that from 2015 to 2021, the cost of compute barely changed. During that six-year time frame, the number of server CPU cores has been growing significantly, which may imply that Intel compute power is currently overpriced in EC2. In the last couple of years EC2 has introduced cheaper AMD and ARM instances, but it's still surprising that AWS chose to keep Intel CPU prices fixed.<br /></p><h2 style="text-align: left;">DRAM Capacity</h2><p>For DRAM, the picture is also quite stagnant:<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/memevol.svg" /> <br /><br />The introduction of the x1e instances improved the situation a bit, but there's been a stagnation since 2018. However, this is less surprising than the CPU situation because DRAM commodity prices in general did not move much.<br /></p><h2 style="text-align: left;">Instance Storage</h2><p>Let's next look at instance storage. EC2 offers instances with disks (about 0.2GB/s bandwidth), SATA SSDs (about 0.5GB/s bandwidth), and NVMe SSDs (about 2GB/s bandwidth). 
The introduction of instances with up to 8 NVMe SSDs in 2017 clearly disrupted IO bandwidth speed (the y-axis unit may look weird for bandwidth but is correct once we normalize by hourly cost):<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/ioevol.svg" /> <br /><br />In terms of capacity per dollar, disk is still king and the d3en instance (<a href="https://aws.amazon.com/about-aws/whats-new/2020/12/introducing-amazon-ec2-d3-and-d3en-the-next-generation-of-dense-hdd-storage-instances/">introduced in December 2020</a>) totally changed the game:<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/ioevol2.svg" /> <br /></p><h2 style="text-align: left;">Network Bandwidth</h2><p>For network bandwidth, we see another major disruption, this time the introduction of 100GBit network instances:<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/networkevol.svg" /> <br /><br />The c5n instance, in particular, is clearly a game changer. It is only marginally more expensive than c5, but its network speed is 4 times faster.<br /></p><h2 style="text-align: left;">Conclusions</h2><p>These results show that the hardware landscape is very fluid and regularly we see major changes like the introduction of NVMe SSDs or 100 GBit networking. Truisms like "in distributed systems network bandwidth is the bottleneck" can become false! (Network latency is of course a different beast.) High-performance systems must therefore take hardware trends into account and adapt to the ever-evolving hardware landscape.<br /><br /><br /> </p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com4tag:blogger.com,1999:blog-8891884956165580080.post-66123729635533362632021-06-18T13:52:00.004+02:002021-06-23T11:26:06.848+02:00What Every Programmer Should Know About SSDs<p>Solid-State Drives (SSDs) based on flash have largely replaced magnetic disks as the standard storage medium. From the perspective of a programmer, SSDs and disks look very similar: both are persistent, enable page-based (e.g., 4KB) access through file systems and system calls, and have large capacities.<br /><br />However, there are also important differences, which become important if one wants to achieve optimal SSD performance. As we will see, SSDs are more complicated and their performance behavior can appear quite mysterious if one simply thinks of them as fast disks. The goal of this post is to provide an understanding of why SSDs behave the way they do, which can help creating software that is capable of exploiting them. (Note that I discuss NAND flash, not Intel Optane memory, which has different characteristics.)<br /><br /></p> <h3 style="text-align: left;">Drives not Disks</h3><p><br />SSDs are often referred to as disks, but this is misleading as they store data on semiconductors instead of a mechanical disk. To read or write from a random block, a disk has to mechanically move its head to the right location, which takes on the order of 10ms. A random read from an SSD, in contrast, takes about 100us &ndash; 100 times faster. This low read latency is the reason why booting from an SSD is so much faster than booting from a disk.<br /><br /></p><h3 style="text-align: left;">Parallelism<br /></h3><p><br />Another important difference between disks and SSDs is that disks have one disk head and perform well only for sequential accesses. 
SSDs, in contrast, consist of dozens or even hundreds of flash chips ("parallel units"), which can be accessed concurrently.<br /><br />SSDs transparently stripe larger files across the flash chips at page granularity, and a hardware prefetcher ensures that sequential scans exploit all available flash chips. However, at the flash level there is not much difference between sequential and random reads. Indeed, for most SSDs it is possible to achieve almost the full bandwidth with random page reads as well. To do this, one has to schedule hundreds of random IO requests concurrently in order to keep all flash chips busy. This can be done by starting lots of threads or using asynchronous IO interfaces such as libaio or io_uring.<br /><br /></p> <h3 style="text-align: left;">Writing</h3><p><br />Things get even more interesting with writes. For example, if one looks at write latency, one may measure results as low as 10us &ndash; 10 times faster than a read. However, latency only appears so low because SSDs are caching writes on volatile RAM. The actual write latency of NAND flash is about 1ms &ndash; 10 times slower than a read. On consumer SSDs, this can be measured by issuing a sync/flush command after the write to ensure that the data is persistent on flash. On most data center/server SSDs, write latency cannot be measured directly: the sync/flush will complete immediately because a battery guarantees persistence of the write cache even in the case of power loss. <br /><br />To achieve high write bandwidth despite the relatively high write latency, writes use the same trick as reads: they access multiple flash chips concurrently. Because the write cache can asynchronously write pages, it is not even necessary to schedule that many writes simultaneously to get good write performance. However, the write latency cannot always be hidden completely: for example, because a write occupies a flash chip 10 times longer than a read, writes cause significant tail latencies for reads to the same flash chip.<br /><br /></p> <h3 style="text-align: left;">Out-Of-Place Writes</h3><p><br />Our understanding is missing one important fact: NAND flash pages cannot be overwritten. Page writes can only be performed sequentially within blocks that have been erased beforehand. These erase blocks have a size of multiple MB and therefore consist of hundreds of pages. On a new SSD, all blocks are erased, and one can directly start appending new data.<br /><br />Updating pages, however, is not so easy. It would be too expensive to erase the entire block just to overwrite a single page in-place. Therefore, SSDs perform page updates by writing the new version of the page to a new location. This means that the logical and physical page addresses are decoupled. A mapping table, which is stored on the SSD, translates logical (software) addresses to physical (flash) locations. This component is also called Flash Translation Layer (FTL).<br /><br />For example, let's assume we have a (toy) SSD with 3 erase blocks, each with 4 pages. 
A sequence of writes to pages P1, P2, P0, P3, P5, P1 may result in the following physical SSD state: <table cellspacing="0" border="1"> <colgroup span="5" width="70"></colgroup> <tr> <td height="27" align="left">Block 0</td> <td align="left">P1 (old)</td> <td align="left">P2</td> <td align="left">P0</td> <td align="left">P3</td> </tr> <tr> <td height="27" align="left">Block 1</td> <td align="left">P5</td> <td align="left">P1</td> <td align="left">&#8594;</td> <td align="left"><br></td> </tr> <tr> <td height="27" align="left">Block 2</td> <td align="left"><br></td> <td align="left"><br></td> <td align="left"><br></td> <td align="left"><br></td> </tr> </table> <br /></p><p></p><p><br /></p> <h3 style="text-align: left;">Garbage Collection</h3><p><br />Using the mapping table and out-of-place write, everything is good until the SSD runs out of free blocks. The old version of overwritten pages must eventually be reclaimed. If we continue our example from above by writing to pages P3, P4, P7, P1, P6, P2, we get the following situation: <table cellspacing="0" border="1"> <colgroup span="5" width="70"></colgroup> <tr> <td height="27" align="left">Block 0</td> <td align="left">P1 (old)</td> <td align="left">P2 (old)</td> <td align="left">P0</td> <td align="left">P3 (old)</td> </tr> <tr> <td height="27" align="left">Block 1</td> <td align="left">P5</td> <td align="left">P1 (old)</td> <td align="left">P3</td> <td align="left">P4</td> </tr> <tr> <td height="27" align="left">Block 2</td> <td align="left">P7</td> <td align="left">P1</td> <td align="left">P6</td> <td align="left">P2</td> </tr> </table> <br /><br />At this point we have no more free erase blocks (even though logically there should still be space). Before one can write another page, the SSD first has to erase a block. In the example, it might be best for the garbage collector to erase block 0, because only one of its pages is still in use. After erasing block 0, we make space for 3 writes and our SSD looks like this: <table cellspacing="0" border="1"> <colgroup span="5" width="70"></colgroup> <tr> <td height="27" align="left">Block 0</td> <td align="left">P0</td> <td align="left">&#8594;<br></td> <td align="left"><br></td> <td align="left"><br></td> </tr> <tr> <td height="27" align="left">Block 1</td> <td align="left">P5</td> <td align="left">P1 (old)</td> <td align="left">P3</td> <td align="left">P4</td> </tr> <tr> <td height="27" align="left">Block 2</td> <td align="left">P7</td> <td align="left">P1</td> <td align="left">P6</td> <td align="left">P2</td> </tr> </table> <br /><br /></p> <h3 style="text-align: left;">Write Amplification and Overprovisioning</h3><p><br />To garbage collect block 0, we had to physically move page P0, even though logically nothing happened with that page. In other words, with flash SSDs the number of physical (flash) writes is generally higher than the number of logical (software) writes. The ratio between the two is called write amplification. In our example, to make space for 3 new pages in block 0, we had to move 1 page. Thus we have 4 physical writes for 3 logical writes, i.e., a write amplification of 1.33.<br /><br />High write amplification decreases performance and reduces flash lifetime. How large write amplification is depends on the access pattern and how full the SSD is. Large sequential writes have low write amplification, while random writes are the worst case.<br /><br />Let's assume our SSD is filled to 50% and we perform random writes. 
In steady state, wherever we erase a block, about half the pages of that block are still in use and have to be copied on average. Thus, write amplification for a fill factor of 50% is 2. In general, worst-case write amplification for a fill factor f is 1/(1-f): <table cellspacing="0" border="1"> <colgroup span="12" width="50"></colgroup> <tr> <td height="27" align="left">f</td> <td align="right" >0.1</td> <td align="right" >0.2</td> <td align="right" >0.3</td> <td align="right" >0.4</td> <td align="right" >0.5</td> <td align="right" >0.6</td> <td align="right" >0.7</td> <td align="right" >0.8</td> <td align="right" >0.9</td> <td align="right" >0.95</td> <td align="right" >0.99</td> </tr> <tr> <td height="27" align="left">WA</td> <td align="right">1.11</td> <td align="right">1.25</td> <td align="right">1.43</td> <td align="right">1.67</td> <td align="right">2.00</td> <td align="right">2.50</td> <td align="right">3.33</td> <td align="right">5</td> <td align="right">10</td> <td align="right">20</td> <td align="right">100</td> </tr> </table> <br /><br />Because write amplification becomes unreasonably high for fill factors close to 1, most SSDs have hidden spare capacity. This overprovisioning is typically 10-20% of the total capacity. Of course, it is also easy to add more overprovisioning by creating an empty partition and never write to it.<br /><br /></p> <h3 style="text-align: left;">Summary and Further Reading</h3><p><br />SSDs have become quite cheap and they have very high performance. For example, a Samsung PM1733 server SSD costs about 200 EUR per TB and promises close to 7 GB/s read and 4 GB/s write bandwidth. Actually achieving such high performance requires knowing how SSDs work and this post described the most important behind-the-scenes mechanisms of flash SSDs.<br /><br />I tried to keep this post short, which meant that I had to simplify things. To learn more, a good starting point is <a href="https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/">this tutorial</a>, which also references some insightful papers. Finally, because SSDs have become so fast, the operating system I/O stack is often the performance bottleneck. Experimental results for Linux can be found in our <a href="http://cidrdb.org/cidr2020/papers/p16-haas-cidr20.pdf">CIDR 2020 paper</a>.<br /><br /><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com7tag:blogger.com,1999:blog-8891884956165580080.post-65711794579513932462020-11-22T17:50:00.000+01:002020-11-22T17:50:41.678+01:00Taming Deep Recursion<p>&nbsp;When operating on hierarchical data structures, it is often convenient to formulate that using pairwise recursive functions. For example, our semantic analysis walks that parse tree recursively and transforms it into an expression tree. This corresponding code looks roughly like this:</p><p></p> <pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; overflow-wrap: normal; word-wrap: normal;"> unique_ptr&lt;Expression&gt; analyzeExpression(AST* astNode) { switch (astNode-&gt;getType()) { case AST::BinaryExpression: return analyzeBinaryExpression(astNode-&gt;as&lt;BinaryExpAST&gt;()); case AST::CaseExpression: return analyzeCaseExpression(astNode-&gt;as&lt;CaseExpAST&gt;()); ... 
} } unique_ptr&lt;Expression&gt; analyzeBinaryExpression(BinaryExpAST* astNode) { auto left = analyzeExpression(astNode-&gt;left); auto right = analyzeExpression(astNode-&gt;right); auto type = inferBinaryType(astNode-&gt;getOp(), left, right); return make_unique&lt;BinaryExpression&gt;(astNode-&gt;getOp(), move(left), move(right), type); } </code></pre> <p>It recursively walks the tree, collects input expressions, infers types, and constructs new expressions. This works beautifully until you encounter a (generated) query with 300,000 expressions, which we did. At that point our program crashed due to stack overflow. Oops. </p><p>Our first mitigation was using <a href="https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html">__builtin_frame_address(0)</a> at the beginning of <i>analyzeExpression</i> to detect excessive stack usage, and to throw an exception if that happens. This prevented the crash, but is not very satisfying. First, it means we refuse a perfectly valid SQL query "just" because it uses 300,000 terms in one expression. And second, we cannot be sure that this is enough. There are several places in the code that recursively walk the algebra tree, and it is hard to predict their stack usage. Even worse, the depth of the tree can change due to optimizations. For example, when a query has 100,000 entries in the from clause, the initial tree is extremely wide but flat. Later, after we have stopped checking for stack overflows, the optimizer might transform that into a tree with 100,000 levels, again leading to stack overflow. Basically, all recursive operations on the algebra tree are dangerous.</p><p>Now common wisdom is to avoid recursion if we cannot bound the maximum depth, and use iteration with an explicit stack instead. We spend quite some time thinking about that approach, but the main problem with that is that it makes the code extremely ugly. The code snippet above is greatly simplified, but even there an explicit stack would be unwieldy and ugly if we have to cover both <i>binaryExpression</i> and <i>caseExpression</i> using one stack. And the code gets cut into tiny pieces due to the control flow inversion required for manual stacks. And all that to defend against something that nearly never happens. We were unhappy with that solution, we wanted something that is minimal invasive and created overhead only in the unlikely case that a user gives us an extremely deep input.</p><p>One mechanism that promises to solve this problem is <a href="https://gcc.gnu.org/wiki/SplitStacks">-fsplit-stack</a>. There, the compiler checks for stack overflows in the function prolog and creates a new stack segment if needed. Great, exactly what we wanted! We can handle deep trees, no code change, and we only create a new stack if we indeed encounter deep recursion. Except that it is not really usable in practice. First, -fsplit-stack is quite slow. We measured 20% overhead in our optimizer when enabling split stacks, and that in cases where we did not create any new stacks at all. When -fsplit-stack does create new stacks it is even worse. This is most likely a deficit of the implementation, one could implement -fsplit-stack much more efficiently, but the current implementation is not encouraging. Even worse, clang produces an&nbsp;<a href="https://bugs.llvm.org/show_bug.cgi?id=19197">internal compiler error</a> when compiling some functions with -fsplit-stack. 
Nobody seems to use this mechanism in production, and after disappointing results we stopped considering -fsplit-stack.<br /></p><p>But the idea of split stacks is clearly good. When encountering deep recursion we will have to switch stacks at some point. After contemplating this problem for some time we realized that&nbsp;<a href="https://www.boost.org/doc/libs/1_74_0/libs/context/doc/html/index.html">boost.context</a> offers the perfect mechanism for switching stacks: It can start a fiber with a given stack, and switching between fibers costs just 19 cycles. By caching additional stacks and their fibers in thread-local data structures we can provide our own implementation of split stacks that is fast and supports arbitrary code. Without compiler support the split stack mechanism is visible, of course, but that is fine in our code. We have only a few entry points like <i>analyzeExpression</i> that will be called over and over again during recursion, and checking there is enough. Code wise the mechanism is not too ugly, it needs two lines of code per recursion head and looks like</p><pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; overflow-wrap: normal; word-wrap: normal;"> unique_ptr&lt;Expression&gt; analyzeExpression(AST* astNode) { if (StackGuard::needsNewStack()) return StackGuard::newStack([=]() { return analyzeExpression(astNode); }); ... // unchanged } </code></pre><p>Note that the lambda argument for <i>newStack</i> will be called from within the new stack, avoiding the stack overflow. When the lambda returns the mechanism will use boost.context to switch back to the previous stack. The performance impact of that mechanism is negligible, as 1) we do not check for overflows all the time but only in the few places that we know are central to the recursive invocations, like <i>analyzeExpression</i> here, 2) stack overflows are extremely rare in practice, and we only pay with one <i>if</i> per invocation is no overflow happens, and 3) even if they do happen, the mechanism is reasonably cheap. We cache the child stacks, and switching to a cached stack costs something like 30 cycles. And we never recurse in a hot loop.<br /></p><p>It took us a while to get there, but now we can handle really large queries without crashing. Just for fun we tried running a query with 100,000 relations in the from clause. Fortunately our optimizer <a href="https://dl.acm.org/doi/10.1145/3183713.3183733">could already handle that</a>, and now the rest of the system can handle it, too. And that with nice, intuitive, recursive code, at the small price of two lines of code per recursion head.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-3081388142334773452020-10-30T17:17:00.002+01:002020-10-30T17:19:05.782+01:00C++ Concurrency Model on x86 for Dummies<p>Since C++11, multi-threaded C++ code has been governed by a rigorous memory model. The model allows implementing concurrent code such as low-level synchronization primitives or lock-free data structures in a portable fashion. To use the memory model, programmers need to do two things: First, they have to use the <i>std::atomic</i> type for concurrently-accessed memory locations. 
Second, each atomic operation requires a memory order argument with six options determining the concurrency semantics in terms of which re-orderings are allowed. (Some operations even allow specifying two memory orders!)<br /><br />While there are a number of attempts to describe the model, I always found the full semantics very hard to understand and consequently concurrent code hard to write and reason about. And since we are talking about low-level concurrent code here, making a mistake (like picking the wrong memory order) can lead to disastrous consequences.<br /><br />Luckily, at least on x86, a small subset of the full C++11 memory model is sufficient. In this post, I'll present such a subset that is sufficient to write high-performance concurrent code on x86. This simplification has the advantage that the resulting code is much more likely to be correct, without leaving any performance on the table. (On non-x86 platforms like ARM, code written based on this simplified model will still be correct, but might potentially be slightly slower than necessary.)&nbsp;</p><p>There are only six things one needs to know to write high-performance concurrent code on x86.<br /></p><p style="text-align: left;"></p><h2 style="text-align: left;">1. Data races are undefined<br /></h2><p style="text-align: left;">If a data race occurs in C++, the behavior of the program is undefined. Let's unpack that statement. A data race can be defined as two or more threads accessing the same memory location with at least one of the accesses being a write. By default (i.e., without using <i>std::atomic</i>), the compiler may assume that no other thread is concurrently modifying memory. This allows the compiler to optimize the code, for example by reordering or optimizing away memory accesses. Consider the following example:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;"><b>void</b> wait(bool* flag) {<br />&nbsp;&nbsp;&nbsp; <b>while</b> (*flag);<br />}</span></span><br /><br />Because data races are undefined, the compiler may assume that the value behind the pointer <i>flag</i> is not concurrently modified by another thread. Using this assumption, <a href="https://godbolt.org/z/8Mxfs5">gcc translates</a> the code to a return statement while <a href="https://godbolt.org/z/34384c">clang translates</a> it to an infinite loop if <i>flag</i> is initially true. Both translations are likely not what the code intends. To avoid undefined code, it is necessary use <i>std::atomic</i> for variables where race conditions may happen:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;"><b>void</b> wait(std::atomic&lt;bool&gt;* flag) {<br />&nbsp;&nbsp;&nbsp; <b>while</b> (*flag); // same as while(flag-&gt;load(std::memory_order_seq_cst));<br />}</span></span><br /><br /><i>*flag</i> is equivalent to <i>flag.load(std::memory_order_seq_cst)</i>, i.e., the default memory order is sequential consistency. Sequential consistency is the strongest memory order guaranteeing that atomic operations are executed in program order. The compiler is not allowed to reorder memory operations or optimize them away.<br /><br /></p><h2 style="text-align: left;">2. Sequentially-consistent loads are fast</h2><p style="text-align: left;"><br />Making the flag atomic may seem expensive, but luckily atomic loads are cheap on x86. 
Indeed, our wait function is <a href="https://godbolt.org/z/ssnGqa">translated</a> to a simple loop with a simple MOV instruction, without any barrier/fence instruction. This is great as it means that on x86 an atomic, sequentially-consistent load can be just as fast as a normal load. It also means that on x86 there is no performance benefit of using any weaker memory order for atomic loads. For loads all memory orders are simply translated to MOV.<br /><br /></p><h2 style="text-align: left;">3. Sequentially-consistent stores are slow</h2><p style="text-align: left;"><br />While sequentially-consistent atomic loads are as cheap as normal loads, this is not the case for stores, as can be observed from the following example:<br /><br /><span style="font-family: courier;"><span style="font-size: x-small;"><b>void</b> unwait(std::atomic&lt;bool&gt;* flag) {<br />&nbsp;&nbsp;&nbsp; *flag = false; // same as flag-&gt;store(false, std::memory_order_seq_cst);<br />}</span></span><br /><br />As with atomic loads, atomic stores are sequentially consistent if no explicit memory order is specified. In clang and gcc 10, the store <a href="https://godbolt.org/z/fn7bjY">translates</a> to an XCHG instruction rather than a MOV instruction (older gcc versions <a href="https://godbolt.org/z/oW7PME">translate</a> it to a MOV plus MFENCE.). XCHG and MFENCE are fairly expensive instructions but are required for sequentially consistent stores on x86. (The CPU's store buffer must be flushed to L1 cache to make the write visible to other threads through cache coherency.)<br /><br /></p><h2 style="text-align: left;">4. Stores that can be delayed can benefit from the <i>release</i> memory order</h2><p style="text-align: left;"><br />Because sequentially-consistent stores are fairly expensive, there are situations where a weaker memory order can improve performance. A common case is when the effect of a store can be delayed. The classical example is unlocking a mutex. The unlocking thread does not have to synchronously wait for the unlocking to become visible, but can continue executing other instructions. Another way of saying this is that it is correct to move instructions into the critical section, but not out of it. In C++, this weaker form of store consistency is available through the release memory order.<br /><br />In our unwait example, the store can indeed be delayed, which is why we can use the release memory order for the store:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;"><b>void</b> unwait(std::atomic&lt;bool&gt;* flag) {<br />&nbsp;&nbsp;&nbsp; flag-&gt;store(false, std::memory_order_release);<br />}</span></span><br /><br />This code is <a href="https://godbolt.org/z/xsdnhj">translated</a> to a simple MOV instruction, which means it can be as efficient as a non-atomic store.<br /><br /></p><h2 style="text-align: left;">5. Some atomic operations are always sequentially consistent</h2><p style="text-align: left;"><br />Besides loads and stores, <i>std::atomic</i> also offers the high-level atomic operations <i>compare_exchange</i>, <i>exchange</i>, <i>add</i>, <i>sub</i>, <i>and</i>, <i>or</i>, <i>xor</i>. On x86, these are always directly translated to sequentially-consistent CPU instructions. This means that there is no performance benefit from specifying weaker memory orders on any of these operations. (An atomic increment with release semantics would be useful in many situations, but alas is not available.)<br /><br /></p><h2 style="text-align: left;">6. 
Scalable code should avoid cache invalidations</h2><p style="text-align: left;"><br />I mentioned above that sequentially-consistent loads and release stores may be as cheap as non-atomic loads/store. Unfortunately, this is not always the case. Because CPUs have per-core caches, the performance of concurrent programs depends on the dynamic access pattern. Every store has to invalidate any cached copies of that cache line on other cores. This can cause parallel implementations to be slower than single-threaded code. Therefore, to write scalable code, it is important to minimize the number of writes to shared memory locations. A positive way of saying this is that as long as the program does not frequently write to memory locations that are frequently being read or written, the program will scale very well. (<a href="http://sites.computer.org/debull/A19mar/p73.pdf">Optimistic lock coupling</a>, for example, is a general-purpose concurrency scheme for synchronizing data structures that exploits this observation.)<br /><br /></p><h2 style="text-align: left;">Summary</h2><p style="text-align: left;"><br />The full C++ memory model is notoriously hard to understand. x86, on the other hand, has a fairly strong memory model (<a href="https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf">x86-TSO</a>) that is quite intuitive: basically everything is in-order, except for writes, which are delayed by the write buffer. Exploiting x86's memory model, I presented a simplified subset of the C++ memory model that is sufficient to write scalable, high-performance concurrent programs on x86.<br /><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-76844031575004020082020-04-28T16:42:00.000+02:002020-04-28T16:42:09.804+02:00Linear Time Liveness AnalysisStandard compiler are usually used with hand-written programs. These programs tend to have reasonably small functions, and can be processed in (nearly) linear time. Generated programs however can be quite large, and compilers sometimes struggle to compile them at all. This can be seen with the following (silly) demonstration script:<br /> <br /> <pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: &quot;arial&quot;; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> import subprocess from timeit import default_timer as timer def doTest(size): with open("foo.cpp", "w") as out: print("int foo(int x) {", file=out) for s in range(size): p="x" if s==0 else f'l{s-1}' print (f'int l{s}; if (__builtin_sadd_overflow({p},1,&amp;l{s})) goto error;', file=out) print(f'return l{size-1};error: throw;}}', file=out); start = timer() subprocess.run(["gcc", "-c", "foo.cpp"]) stop = timer() print(size, ": ", (stop-start)) for size in [10,100,1000,10000,100000]: doTest(size) </code></pre> <br /> It generates one function with n statements of the form "<span style="background-color: #f0f0f0; font-family: &quot;arial&quot;; font-size: 12px;">int lX; if (__builtin_sadd_overflow(lY,1,&amp;lX)) goto error;</span>"&nbsp;which are basically just n additions with overflow checks, and then measures the compile time. The generated code is conceptually a very simple, but it contains a lot of basic blocks due to the large number of ifs. 
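For illustration, with n=3 the script writes a file equivalent to this (foo and the lX locals are exactly the names the script itself generates):<br /> <br /> <pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: &quot;arial&quot;; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">int foo(int x) {
int l0; if (__builtin_sadd_overflow(x,1,&amp;l0)) goto error;
int l1; if (__builtin_sadd_overflow(l0,1,&amp;l1)) goto error;
int l2; if (__builtin_sadd_overflow(l1,1,&amp;l2)) goto error;
return l2;error: throw;}
</code></pre> <br />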
When compiling with gcc we get the following compile times:<br /> <br /> <table> <tbody> <tr><td>n</td><td>10</td><td>100</td><td>1,000</td><td>10,000</td><td>100,000</td></tr> <tr><td>compilation [s]</td><td>0.02</td><td>0.04</td><td>0.19</td><td>34.99</td><td>&gt; 1h</td></tr> </tbody></table> <br /> The compile time is dramatically super linear, gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. This is disastrous when processing generated code, where we cannot easily limit the size of individual functions. In <a href="https://umbra-db.com/">our own system</a>&nbsp;we use neither gcc nor clang for query compilation, but we have same problem, namely compiling large generated code. And super-linear runtime quickly becomes an issue when the input is large.<br /> <br /> One particular important problem in this context is <a href="https://en.wikipedia.org/wiki/Live_variable_analysis">liveness analysis</a>, i.e, figuring out which value is alive at which part of the program. The&nbsp;<a href="https://www.isi.edu/~pedro/Teaching/CSCI565-Spring16/Lectures/DataFlowAnalysis.part3.pdf">textbook solution</a> for that problem involves propagating liveness information for each variable across the blocks, but that is clearly super linear and does not scale to large program sizes. We therefore developed <a href="https://doi.org/10.1109/ICDE.2018.00027">a different approach</a>&nbsp;that we recently refined even more and the I want to present here:<br /> <br /> Instead of propagating liveness sets or bitmasks, we study the control flow graph of the program. For a simple queries with a for loop and an if within the loop it might look like this:<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoGFukGNPxVlLXqhs3jIhqCFYEDm85WvUBcSAWfK3PsO_DhCuYdZxKTBRxo3_zRBhxERNvc0FwA0ofGyb4RYy5scB_GBJ3dG4klUwPrKd0-DEUdGDQoHkwrKR3_SjREAMAZiL_T5b_4qM/s1600/graphviz.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="443" data-original-width="256" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoGFukGNPxVlLXqhs3jIhqCFYEDm85WvUBcSAWfK3PsO_DhCuYdZxKTBRxo3_zRBhxERNvc0FwA0ofGyb4RYy5scB_GBJ3dG4klUwPrKd0-DEUdGDQoHkwrKR3_SjREAMAZiL_T5b_4qM/s320/graphviz.png" width="184" /></a></div> <br /> <br /> Now if we define a variable x in block 0, and use it in block 2, the variable has to be alive on every path between definition and usage. Obviously that includes the blocks 0-2, but that is not enough. We see that there is a loop involving all loops between 1 and 4, and we can take that before coming to 2. Thus, the lifetime has to be extended to include the full loop, and is there range 0-4. If, however, we define a variable in 1 and use it in 2, the range is indeed 1-2, as we do not have to wrap around.<br /> <br /> Algorithmically, we identify all loops in the program, which we can do in <a href="https://dl.acm.org/doi/10.1145/316686.316687">almost linear time</a>, and remember how the loops are nested within each other.&nbsp;Then, we examine each occurrence of a variable (in SSA form). In principle the lifetime is the span from the first to the last occurrences. 
If, however, two occurrences are in different loops, we walk "up" from the lower loop level until we occur in the same loop (or top level), extending the lifetime to cover the full loop while doing so. In this example x in block 0 is top level, while x in block 2 is in loop level 1. Thus, we leave the loop, expand 2 into 1-4, and find the lifetime to be 0-4.<br /> <br /> This assumes, of course, that the block numbers are meaningful. We can guarantee that by first labeling all blocks in&nbsp;<a href="https://en.wikipedia.org/wiki/Depth-first_search#Vertex_orderings">reverse postorder</a>. This guarantees that all dominating blocks will have a lower number than their successors. We can further improve the labeling by re-arranging the blocks such that all blocks within a certain loop are next to each other, keeping the original reverse post order within the loop. This leads to nice, tight liveness intervals. Using just an interval instead of a bitmap is of course less precise, but the block reordering makes sure that the intervals primarily contain the blocks that are indeed involved in the execution.<br /> <br /> Asymptotically such an interval based approach is far superior to a classical liveness propagation. All algorithms involved are linear or almost linear, and we only to have to store two numbers per variable. When handling large, generated code such an approach is mandatory. And even for classical, smaller programs it is quite attractive. I looked around a bit at lectures about compiler construction, and I am somewhat surprised that nobody seems to teach similar techniques to handle large programs. When you cannot control the input size, super linear runtime is not an option.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com3tag:blogger.com,1999:blog-8891884956165580080.post-3953824914729907392020-01-30T14:19:00.002+01:002020-01-30T14:19:56.766+01:00All hash table sizes you will ever need<script src="https://cdn.jsdelivr.net/gh/google/code-prettify@master/loader/run_prettify.js"></script> <br /> <div style="text-align: justify;"> When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset of bits is ok.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Prime numbers are more forgiving concerning the hash function, and we have more choices concerning the size, which leads to less overhead. But using a prime number requires a modulo computation, which is expensive. And we have to find a suitable prime number at runtime, which is not that simple either.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Fortunately we can solve both problems simultaneously, which is what this blog post is about. We can tackle the problem of finding prime numbers by pre-computing suitable numbers with a given maximum distance. For example when when only considering prime numbers that are at least 5% away from each other we can cover the whole space from 0 to 2^64 with just 841 prime numbers. 
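A minimal sketch of this pre-computation could look as follows (purely illustrative: it uses naive trial division and is only meant for small limits; isPrime and buildPrimeList are made-up helper names):<br /> <pre class="prettyprint">#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Toy sketch: collect primes such that consecutive entries are roughly 5% apart.
static bool isPrime(uint64_t n) {
   if (n &lt; 2) return false;
   for (uint64_t d = 2; d * d &lt;= n; ++d)
      if ((n % d) == 0) return false;
   return true;
}

static std::vector&lt;uint64_t&gt; buildPrimeList(uint64_t limit) {
   std::vector&lt;uint64_t&gt; primes;
   uint64_t candidate = 2;
   while (candidate &lt;= limit) {
      uint64_t p = candidate;
      while (!isPrime(p)) ++p;    // next prime at or above the candidate
      primes.push_back(p);
      candidate = p + p / 20 + 1; // the next entry must be at least ~5% larger
   }
   return primes;
}</pre> <br />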
We can solve the performance problem by pre-computing the <a href="https://web.archive.org/web/20190703172151/http://www.hackersdelight.org/magic.htm">magic numbers</a> from <a href="https://en.wikipedia.org/wiki/Hacker%27s_Delight">Hacker's Delight</a> for each prime number in our list, which allows us to use multiplications instead of expensive modulo computations. And we can skip prime numbers with unpleasant magic numbers (i.e., the ones that require an additional add fixup), preferring the next cheap prime number instead.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> The resulting code can be found <a href="http://db.in.tum.de/~neumann/primes.hpp">here</a>. It contains every prime number you will ever need for hash tables, covering the whole 64-bit address space. Usage is very simple: we just ask for a prime number and then perform modulo operations as needed:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> <br /></div> <pre class="prettyprint">class HashTable {
   primes::Prime prime;
   vector&lt;Entry&gt; table;

   public:
   HashTable(uint64_t size) {
      prime = primes::Prime::pick(size);
      table.resize(prime.get());
   }
   ...
   Entry* getEntry(uint64_t hash) { return &amp;table[prime.mod(hash)]; }
   ...
};</pre> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> The performance is quite good. On an AMD 1950X, computing the modulo for 10M values (and computing the sum of the results) takes about 4.7ns per value when using a plain (x%p), but only 0.63ns per value when using p.mod(x).<br /> <br /> Getting this into <a href="https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/std/unordered_map">unordered_map</a> would be useful; it would probably improve the performance quite significantly when we have few cache misses. </div> Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com5tag:blogger.com,1999:blog-8891884956165580080.post-46553705072844277342019-07-24T15:40:00.002+02:002019-10-30T09:40:18.888+01:00Cuckoo Filters with arbitrarily sized tables<div style="text-align: justify;"> <a href="https://www.eecs.harvard.edu/~michaelm/postscripts/cuckoo-conext2014.pdf">Cuckoo Filters</a> are an interesting alternative to <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>. Instead of maintaining a filter bitmap, they maintain a small (cuckoo-)hash table of key signatures, which has several good properties. For example, it stores just the signature of a key instead of the key itself, but it is nevertheless able to move an element to a different position in the case of conflicts.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> This conflict resolution mechanism is quite interesting: Just like in regular cuckoo hash tables, each element has two potential positions where it can be placed, a primary position i1 and a secondary position i2. These can be computed as follows:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i1 = hash(x)</div> <div style="text-align: justify;"> i2 = i1 xor hash(signature(x))</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Remember that the cuckoo filter stores only the (small) signature(x), not x itself.
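For concreteness, a toy version of this index computation might look like the following (mix stands in for a real hash function; ToyCuckooFilter and positions are made-up names; the table size is a power of two here, matching the original design):<br /> <pre class="prettyprint">#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Toy sketch: the filter stores one small signature per slot, never the key itself.
static uint64_t mix(uint64_t v) { v *= 0xff51afd7ed558ccdULL; return v ^ (v &gt;&gt; 33); }

struct ToyCuckooFilter {
   std::vector&lt;uint8_t&gt; slots; // one 8-bit signature per slot, size is a power of two
   explicit ToyCuckooFilter(uint64_t sizePow2) : slots(sizePow2) {}

   void positions(uint64_t x, uint64_t&amp; i1, uint64_t&amp; i2, uint8_t&amp; sig) const {
      uint64_t mask = slots.size() - 1;
      sig = static_cast&lt;uint8_t&gt;(mix(x) &gt;&gt; 56);
      if (!sig) sig = 1;           // keep the signature non-zero
      i1 = mix(x) &amp; mask;
      i2 = (i1 ^ mix(sig)) &amp; mask; // the secondary position
   }
};</pre> <br />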
Thus, when we encounter a value, we cannot know if it is at its position i1 or position i2. However, we can nevertheless alternate between positions because the following holds:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i1 = i2 xor hash(signature(x))</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of whether we are currently at i1 or i2. Which is a neat little trick. This allows us to switch between positions, which is used in the cuckoo filter conflict resolution logic.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> However, all this holds only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. Thus, cuckoo filter tables always had to be powers of two, even if that wasted a lot of memory.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> In&nbsp;<a href="http://www.vldb.org/pvldb/vol12/p502-lang.pdf">more recent work</a> Lang et al. proposed using cuckoo filters with size C, where C did not have to be a power of two, offering much better space utilization. They achieved this by using a different self-inverse function:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i1 = hash(x) mod C</div> <div style="text-align: justify;"> i2 = -(i1 + hash(signature(x))) mod C</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Note that the modulo computation can be made reasonably efficient by using <a href="https://www.hackersdelight.org/magic.htm">magic numbers</a>, which can be precomputed when allocating the filter.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> A slightly different way to formulate this is to introduce a switch function f, which switches between positions:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> f(i,sig,C) = -(i + hash(sig)) mod C</div> <div style="text-align: justify;"> i1 = hash(x) mod C</div> <div style="text-align: justify;"> i2 = f(i1, signature(x), C)</div> <div style="text-align: justify;"> i1 = f(i2, signature(x), C)</div> <div style="text-align: justify;"> </div> <div style="text-align: justify;"> </div> <div style="text-align: justify;"> All this works because f is <i>self-inverse</i>, i.e.,</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i = f(f(i, signature(x), C), signature(x), C)</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> for all C&gt;0, i between 0 and C-1, and signature(x)&gt;0.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> The only problem is: Is this true? In a purely mathematical sense it is, as you can convince yourself by expanding the formula, but the cuckoo filters are not executed on abstract machines but on real CPUs. And there, something unpleasant happens: We can get numerical overflows of our integer registers, which implicitly introduces a modulo 2^32 into our computation.
Which breaks the self-inverseness of f in some cases, except when C is power of two itself.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> <a href="https://db.in.tum.de/~kipf/">Andreas Kipf</a> noticed this problem when using the cuckoo filters with real world data. Which teaches us not to trust in formulas without additional extensive empirical validation... Fortunately we can repair the function f by using proper modular arithmetic. In pseudo-code this looks like this</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> f(i,sig,C)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; x=(C-1)-(hash(sig) mod C)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; if (x&gt;=i)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return (x-i);</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; // The natural formula would be C-(i-x), but we prefer this one...</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; return C+(x-i);</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> This computes the correct wrap-around module C, at the cost of one additional if. We can avoid the if by using predication, as shown below</div> <br /> <div style="text-align: justify;"> f(i,sig,C)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; x=(C-1)-(hash(sig) mod C)</div> &nbsp;&nbsp;&nbsp; m = (x&lt;i)*(~0) <br /> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; return (m&amp;C)+(x-i); </div> <br /> <br /> <div style="text-align: justify;"> which can be attractive for SSE implementations where the comparison produces a bit mask anyway.</div> <br /> <div style="text-align: justify;"> We have validated that this new f function is now self-inverse for all possible values of i, sig, and C. And we did this by not just looking at the formula, but by trying out all values programmatically. Which is a good way to get confidence in your approach; there is only a finite number of combinations, and we can test them all.</div> <div style="text-align: justify;"> With this small fix, we can now enjoy Cuckoo Filters with arbitrarily sized tables.<br /> <br /> <b>Edit:</b>&nbsp; The original post did not mirror the hash space correctly (using C-... instead of (C-1)-...), thanks to Andreas Kipf for pointing this out.</div> Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com2tag:blogger.com,1999:blog-8891884956165580080.post-8930996313719056732019-05-16T16:44:00.001+02:002019-05-16T16:44:45.397+02:00Why use learning when you can fit?We recently had a talk by <a href="http://people.csail.mit.edu/kraska/">Tim Kraska</a> in our group, and he spoke among other things about <a href="https://arxiv.org/abs/1712.01208">learned indexes</a>. As I had mentioned before, I am more in favor of using suitably implemented&nbsp;<a href="http://databasearchitects.blogspot.com/2017/12/the-case-for-b-tree-index-structures.html">b-trees</a>, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? 
A simple function fit should work just as well.<br /> <br /> Thus, I tried the following:<br /> 1) we sort all data and keep it in an array, just like with learned indexes<br /> 2) we build the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a><br /> 3) we fit a linear spline to the CDF minimizing the <a href="https://en.wikipedia.org/wiki/Chebyshev_distance">Chebyshev norm</a><br /> 4) we fit a polynomial function to the spline nodes<br /> 5) now we can look up a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.<br /> <br /> As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.<br /> <br /> Now admittedly there was some <a href="https://en.wikipedia.org/wiki/Weasel_word">weasel wording</a> in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find a good fit. But just the same is true for the learned indexes, of course.<br /> <br /> Now, how do we find a good fit? In theory we know how to construct the optimal fit in <a href="https://link.springer.com/article/10.1007/BF02570717">O(n log n)</a>, but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in <a href="https://link.springer.com/chapter/10.1007%2F978-3-540-70504-8_12">O(n log n)</a>. Using that algorithm, we can construct a linear spline that bounds the maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we can then perform an efficient local search on a small, known window of the data.<br /> <br /> The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the aforementioned time bound.<br /> <br /> How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from <a href="http://databasearchitects.blogspot.com/2017/12/the-case-for-b-tree-index-structures.html">here</a>, and the spline numbers are from these experiments.
I still do not really know what the averages mean for the learned indexes, but they are probably the errors averaged over all models.)<br /> <br /> <br /> <table> <tbody> <tr><td>Map data</td><td>size (MB)</td><td>avg error</td></tr> <tr><td>Learned Index (10,000)</td><td>0.15</td><td>8 ± 45</td></tr> <tr><td>Learned Index (100,000)</td><td>1.53</td><td>2 ± 36</td></tr> <tr><td>B-tree (10,000)</td><td>0.15</td><td>225</td></tr> <tr><td>B-tree (100,000)</td><td>1.53</td><td>22</td></tr> <tr><td>Spline (10,000)</td><td>0.15</td><td>193</td></tr> <tr><td>Spline (100,000)</td><td>1.53</td><td>22</td></tr> </tbody> </table> <br /> <table> <tbody> <tr><td>Log normal data</td><td>size (MB)</td><td>avg error</td></tr> <tr><td>Learned Index (10,000)</td><td>0.15</td><td>17,060 ± 61,072</td></tr> <tr><td>Learned Index (100,000)</td><td>1.53</td><td>17,005 ± 60,959</td></tr> <tr><td>B-tree (10,000)</td><td>0.15</td><td>1330</td></tr> <tr><td>B-tree (100,000)</td><td>1.53</td><td>3</td></tr> <tr><td>Spline (10,000)</td><td>0.15</td><td>153</td></tr> <tr><td>Spline (100,000)</td><td>1.53</td><td>1</td></tr> </tbody></table> <br /> <br /> Somewhat surprisingly, the accuracy of the spline is nearly identical to that of the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contain many millions of data points in both cases, thus the prediction accuracy is really high.<br /> <br /> For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.<br /> But if one wants to go the read-only/read-mostly route, the fitted functions could be an attractive alternative to machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com3tag:blogger.com,1999:blog-8891884956165580080.post-56205239612796451322019-02-01T09:47:00.002+01:002019-02-01T11:18:35.343+01:00Honest asymptotic complexity for search trees and tries<div style="text-align: justify;"> Fun fact that was pointed out to me by <a href="http://db.in.tum.de/~leis/index.shtml">Viktor</a>: All traditional books on algorithms and data structures that we could find gave the lookup costs of <a href="https://en.wikipedia.org/wiki/Red%E2%80%93black_tree">balanced search trees</a> as <i>O(log n)</i> (i.e., the depth of the search tree), and the lookup costs of <a href="https://en.wikipedia.org/wiki/Trie">tries</a> as <i>O(k)</i> (i.e., the length of the key). At first glance this is a logarithmic time lookup against a linear time lookup, which makes people nervous when thinking about long keys.<br /> <br /> But that argument is very unfair: Either we consider a key comparison an <i>O(1)</i> operation, then a tree lookup is indeed in <i>O(log n)</i>, but then a trie lookup is in <i>O(1)</i>! As the length of the key has to be bounded by a constant to get <i>O(1)</i> comparisons, the depth of the trie is bounded, too. Or the length of the key matters, then a trie lookup is indeed in <i>O(k)</i>, but then a tree lookup is in <i>O(k log n)</i>.
We have to compare with the key on every level, and if we are unlucky we have to look at the whole key, which gives the factor <i>k</i>.<br /> <br /> Which of course makes tries much more attractive asymptotic wise. Note that we ignore wall clock times here, which are much more difficult to predict, but in many if not most cases tries are indeed <a href="http://db.in.tum.de/~leis/papers/ART.pdf">much faster than search trees</a>.<br /> <br /> I believe the reason why text books get away with this unfair comparison is that they all present balanced search trees with integer keys:<br /> <br /> <svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" id="svg8" version="1.1" viewBox="0 0 163.52975 123.46429" height="123.46429mm" width="163.52975mm"> <defs id="defs2" /> <metadata id="metadata5"> <rdf:RDF> <cc:Work rdf:about=""> <dc:format>image/svg+xml</dc:format> <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> <dc:title /> </cc:Work> </rdf:RDF> </metadata> <g transform="translate(-1.0119057,-32.672617)" id="layer1"> <ellipse ry="12.851191" rx="18.142857" cy="46.023808" cx="102.80952" id="path834" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="92.89286" cx="52.916668" id="path834-2" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.78572" cx="19.654762" id="path834-9" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.02975" cx="89.958336" id="path834-96" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="91.380951" cx="145.8988" id="path834-0" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <path id="path869" d="M 90.238722,55.464545 64.236102,82.333922" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path871" d="m 116.04873,55.079323 23.9802,24.0765" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path873" d="m 42.37464,103.52124 -19.2612,26.29154" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path875" d="m 64.139796,103.23232 23.306053,26.09893" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <text id="text881" y="47.821632" x="101.23959" 
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="47.821632" x="101.23959" id="tspan879">8</tspan></text> <text id="text885" y="93.178772" x="142.64801" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="93.178772" x="142.64801" id="tspan883">10</tspan></text> <text id="text889" y="94.693092" x="51.363621" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="94.693092" x="51.363621" id="tspan887">4</tspan></text> <text id="text893" y="144.58595" x="18.040218" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="144.58595" x="18.040218" id="tspan891">1</tspan></text> <text id="text897" y="143.82758" x="88.370316" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="143.82758" x="88.370316" id="tspan895">6</tspan></text> </g> </svg> <br /> While tries are traditionally introduced with string examples. 
If they had used string keys for balanced search trees instead it would have been clear that the key length matters:<br /> <br /> <svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" id="svg8" version="1.1" viewBox="0 0 163.52975 123.46429" height="123.46429mm" width="163.52975mm"> <defs id="defs2" /> <metadata id="metadata5"> <rdf:RDF> <cc:Work rdf:about=""> <dc:format>image/svg+xml</dc:format> <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> <dc:title /> </cc:Work> </rdf:RDF> </metadata> <g transform="translate(-1.0119057,-32.672617)" id="layer1"> <ellipse ry="12.851191" rx="18.142857" cy="46.023808" cx="102.80952" id="path834" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="92.89286" cx="52.916668" id="path834-2" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.78572" cx="19.654762" id="path834-9" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.02975" cx="89.958336" id="path834-96" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="91.380951" cx="145.8988" id="path834-0" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <path id="path869" d="M 90.238722,55.464545 64.236102,82.333922" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path871" d="m 116.04873,55.079323 23.9802,24.0765" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path873" d="m 42.37464,103.52124 -19.2612,26.29154" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path875" d="m 64.139796,103.23232 23.306053,26.09893" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <text id="text881" y="47.821632" x="94.423294" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="47.821632" x="94.423294" id="tspan879">ABCD8</tspan></text> <text 
id="text885" y="93.178772" x="135.93541" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="93.178772" x="135.93541" id="tspan883">ABCD10</tspan></text> <text id="text889" y="94.693092" x="44.500301" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="94.693092" x="44.500301" id="tspan887">ABCD4</tspan></text> <text id="text893" y="144.58595" x="11.327622" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="144.58595" x="11.327622" id="tspan891">ABCD1</tspan></text> <text id="text897" y="143.82758" x="81.558846" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="143.82758" x="81.558846" id="tspan895">ABCD6</tspan></text> </g> </svg> <br /> The trie examines every key byte at most once, while the search tree can examine every key byte <i>log n</i> times. Thus, the asymptotic complexity of tries is actually better than that of balanced search tries. </div> Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-78766450294002209972018-06-08T09:06:00.000+02:002018-06-08T09:06:19.300+02:00Propagation of Mistakes in PapersWhile reading papers on cardinality estimation I noticed something odd: The seminal paper by Flajolet and Martin on <a href="http://algo.inria.fr/flajolet/Publications/FlMa85.pdf">probabilistic counting</a> gives a bias correction constant as 0.77351, while a <a href="https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b5f97dae6f940c8.pdf">more recent (and very useful) paper</a> by Scheuermann and Mauve gives the constant as 0.775351. Was this a mistake? 
Or did they correct a mistake in the original paper?<br /> <br /> I started searching, and there is a <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=8&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwjgvIrUssPbAhWFTcAKHTfJCiMQFjAHegQIARBW&amp;url=http%3A%2F%2Fwww.cs.bu.edu%2Fgroups%2Fdblab%2Fpub_pdfs%2FTODS-sketch.pdf&amp;usg=AOvVaw22RNkei-ZhF7aGs79j42KI">large</a> <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.571.2661&amp;rep=rep1&amp;type=pdf">number</a> <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.571.2661&amp;rep=rep1&amp;type=pdf">of</a> <a href="http://129.94.242.51/~lxue/paper/adc10.pdf">papers</a> that use the value 0.775351, but there is also a <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=19&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwiZmYuPu8PbAhWiNJoKHXaUCdY4ChAWMAh6BAgBEEw&amp;url=https%3A%2F%2Fwww.cs.upc.edu%2F~conrado%2Fresearch%2Ftalks%2Faofa2012.pdf&amp;usg=AOvVaw30yc0h3Ot7acCXpiD5WJMM">number</a> <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=24&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwi5pda3u8PbAhUDyaYKHcAuDdA4FBAWMAN6BAgBEDw&amp;url=http%3A%2F%2Fwww.vldb.org%2Fpvldb%2Fvol11%2Fp499-harmouch.pdf&amp;usg=AOvVaw08J-URRirbavlZnK7XleeV">of</a> <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=36&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwjwyebSu8PbAhXsDZoKHfHVDTQ4HhAWMAV6BAgBEEo&amp;url=https%3A%2F%2Fdomino.mpi-inf.mpg.de%2Fintranet%2Fag5%2Fag5publ.nsf%2F951b6516df8639d3c1256464004a33d0%2Fd37106f9c337bcc3c12571550046667a%2F%24file%2Fntarmostw06.pdf&amp;usg=AOvVaw1LFk16OkCGibsGHAp2o3mM">papers</a> that use the value 0.77351. Judging by the number of Google hits for "Flajolet 0.77351" vs. "Flajolet 0.775351", the 0.77351 group seems to be somewhat larger, but both camps have a significant number of publications. Interestingly, not a single paper mentions both constants, and thus no paper explains what the correct constant should be.<br /> <br /> In the end I repeated the constant computation as explained by Flajolet, and the correct value is 0.77351. We can even derive one digit more when using double arithmetic (i.e., 0.773516), but that makes no difference in practice. Thus, the original paper was correct.<br /> <br /> But why do so many papers use the incorrect value 0.775351 then? My guess is that at some point somebody made a typo while writing a paper, introducing the superfluous digit 5, and that all other authors copied the constant from that paper without re-checking its value. I am not 100% sure what the origin of the mistake is. The incorrect value seems to appear first in the year 2007, showing up in multiple publications from that year. Judging by publication date the source seems to be <a href="http://www.cse.unsw.edu.au/~yingz/papers/icde_2007_k-skyline.pdf">this paper</a> (at least it does not cite any other paper with the incorrect value, as far as I know). And everybody else just copied the constant from somewhere else, propagating it from paper to paper.<br /> <br /> If you find this web page because you are searching for the correct Flajolet/Martin bias correction constant, I can assure you that the original paper was correct, and that the value is 0.77351.
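<br /> <br /> For context, this is the constant that turns the observed bit pattern into a cardinality estimate: the position R of the lowest zero bit in the bitmap satisfies E[R] ≈ log<sub>2</sub>(0.77351 · n), so the number of distinct values is estimated as 2<sup>R</sup> / 0.77351. A toy single-bitmap sketch in C++ (my own illustration, not the linked fmconstant.cpp; a simple bit mixer stands in for the hash function, and the real algorithm averages over many bitmaps to reduce the variance) shows where the constant enters:<br /> <br /> <pre>#include &lt;cmath&gt;
#include &lt;cstdint&gt;
#include &lt;iostream&gt;

// splitmix64 finalizer, used here as a stand-in for a proper uniform hash
static uint64_t hash64(uint64_t x) {
   x += 0x9e3779b97f4a7c15ull;
   x = (x ^ (x &gt;&gt; 30)) * 0xbf58476d1ce4e5b9ull;
   x = (x ^ (x &gt;&gt; 27)) * 0x94d049bb133111ebull;
   return x ^ (x &gt;&gt; 31);
}

int main() {
   const double phi = 0.77351;  // the (correct) bias correction constant
   const uint64_t n = 1000000;  // number of distinct values we insert

   uint64_t bitmap = 0;
   for (uint64_t v = 1; v &lt;= n; ++v) {
      uint64_t h = hash64(v);
      unsigned r = 0;           // rank of the lowest set bit of the hash value
      while (r &lt; 63 &amp;&amp; !((h &gt;&gt; r) &amp; 1)) ++r;
      bitmap |= uint64_t(1) &lt;&lt; r;
   }

   unsigned R = 0;              // position of the lowest zero bit in the bitmap
   while ((bitmap &gt;&gt; R) &amp; 1) ++R;

   double estimate = std::pow(2.0, R) / phi;  // E[R] ~ log2(phi*n), hence n ~ 2^R/phi
   std::cout &lt;&lt; "exact: " &lt;&lt; n &lt;&lt; ", estimate: " &lt;&lt; estimate &lt;&lt; std::endl;
}</pre>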
But you do not have to take my word for the value: you can just <a href="http://db.in.tum.de/~neumann/fmconstant.cpp">repeat the computation</a> yourself.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com1