tag:blogger.com,1999:blog-88918849561655800802025-07-22T03:10:22.646+02:00Database ArchitectsA blog by and for database architects.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.comBlogger48125tag:blogger.com,1999:blog-8891884956165580080.post-73305066858596126522024-12-27T17:57:00.000+01:002024-12-27T17:57:37.600+01:00Advent of Code 2024 in pure SQL<p>&nbsp;On a whim I decided to do this year's <a href="https://adventofcode.com/2024">advent of code</a> in pure SQL. That was an interesting experience that I can recommend to everybody because it forces you to think differently about the problems. And I can report that <a href="https://github.com/neumannt/aoc24">it was possible to solve every problem in pure SQL</a>.</p><p>In many cases SQL was actually surprisingly pleasant to use. The full solution for day 10 (including the puzzle input) is shown below:</p><p><br /></p> <!--HTML generated using hilite.me--><div style="background: rgb(255, 255, 255); border-color: gray; border-image: none; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;"><table><tbody><tr><td><pre style="line-height: 125%; margin: 0px;"> 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30</pre></td><td><pre style="line-height: 125%; margin: 0px;"><span style="color: #008800; font-weight: bold;">with</span> <span style="color: #008800; font-weight: bold;">recursive</span> aoc10_input(i) <span style="color: #008800; font-weight: bold;">as</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="background-color: #fff0f0;">'</span> <span style="background-color: #fff0f0;">89010123</span> <span style="background-color: #fff0f0;">78121874</span> <span style="background-color: #fff0f0;">87430965</span> <span style="background-color: #fff0f0;">96549874</span> <span style="background-color: #fff0f0;">45678903</span> <span style="background-color: #fff0f0;">32019012</span> <span style="background-color: #fff0f0;">01329801</span> <span style="background-color: #fff0f0;">10456732</span> <span style="background-color: #fff0f0;">'</span>), lines(y,line) <span style="color: #008800; font-weight: bold;">as</span> ( <span style="color: #008800; font-weight: bold;">select</span> <span style="color: #0000dd; font-weight: bold;">0</span>, substr(i,<span style="color: #0000dd; font-weight: bold;">1</span>,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> i)<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>), substr(i,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> i)<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">from</span> aoc10_input <span style="color: #008800; font-weight: bold;">union</span> <span style="color: #008800; font-weight: bold;">all</span> <span style="color: #008800; font-weight: bold;">select</span> y<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>,substr(r,<span style="color: #0000dd; font-weight: bold;">1</span>,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: 
#fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> r)<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>), substr(r,<span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> r)<span style="color: #333333;">+</span><span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">from</span> lines l(y,l,r) <span style="color: #008800; font-weight: bold;">where</span> <span style="color: #008800; font-weight: bold;">position</span>(E<span style="background-color: #fff0f0;">'\n'</span> <span style="color: #008800; font-weight: bold;">in</span> r)<span style="color: #333333;">&gt;</span><span style="color: #0000dd; font-weight: bold;">0</span> ), field(x,y,v) <span style="color: #008800; font-weight: bold;">as</span> ( <span style="color: #008800; font-weight: bold;">select</span> x,y,ascii(substr(line,x::<span style="color: #007020;">integer</span>,<span style="color: #0000dd; font-weight: bold;">1</span>))<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">48</span> <span style="color: #008800; font-weight: bold;">from</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #333333;">*</span> <span style="color: #008800; font-weight: bold;">from</span> lines l <span style="color: #008800; font-weight: bold;">where</span> line<span style="color: #333333;">&lt;&gt;</span><span style="background-color: #fff0f0;">''</span>) s, <span style="color: #008800; font-weight: bold;">lateral</span> generate_series(<span style="color: #0000dd; font-weight: bold;">1</span>,<span style="color: #008800; font-weight: bold;">length</span>(line)) <span style="color: #008800; font-weight: bold;">g</span>(x) ), paths(x,y,v,sx,sy) <span style="color: #008800; font-weight: bold;">as</span> ( <span style="color: #008800; font-weight: bold;">select</span> x,y,<span style="color: #0000dd; font-weight: bold;">9</span>,x,y <span style="color: #008800; font-weight: bold;">from</span> field <span style="color: #008800; font-weight: bold;">where</span> v <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">9</span> <span style="color: #008800; font-weight: bold;">union</span> <span style="color: #008800; font-weight: bold;">all</span> <span style="color: #008800; font-weight: bold;">select</span> f.x,f.y,f.v,p.sx,p.sy <span style="color: #008800; font-weight: bold;">from</span> field f, paths p <span style="color: #008800; font-weight: bold;">where</span> f.v<span style="color: #333333;">=</span>p.v<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #008800; font-weight: bold;">and</span> ((f.x<span style="color: #333333;">=</span>p.x <span style="color: #008800; font-weight: bold;">and</span> <span style="color: #008800; font-weight: bold;">abs</span>(f.y<span style="color: #333333;">-</span>p.y)<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">or</span> (f.y<span style="color: #333333;">=</span>p.y <span style="color: #008800; font-weight: bold;">and</span> <span style="color: #008800; font-weight: bold;">abs</span>(f.x<span style="color: #333333;">-</span>p.x)<span style="color: #333333;">=</span><span style="color: #0000dd; 
font-weight: bold;">1</span>)) <span style="color: #008800; font-weight: bold;">and</span> p.v<span style="color: #333333;">&gt;</span><span style="color: #0000dd; font-weight: bold;">0</span>), results <span style="color: #008800; font-weight: bold;">as</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #333333;">*</span> <span style="color: #008800; font-weight: bold;">from</span> paths <span style="color: #008800; font-weight: bold;">where</span> v<span style="color: #333333;">=</span><span style="color: #0000dd; font-weight: bold;">0</span>), part1 <span style="color: #008800; font-weight: bold;">as</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #008800; font-weight: bold;">distinct</span> <span style="color: #333333;">*</span> <span style="color: #008800; font-weight: bold;">from</span> results) <span style="color: #008800; font-weight: bold;">select</span> (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #008800; font-weight: bold;">count</span>(<span style="color: #333333;">*</span>) <span style="color: #008800; font-weight: bold;">from</span> part1) <span style="color: #008800; font-weight: bold;">as</span> part1, (<span style="color: #008800; font-weight: bold;">select</span> <span style="color: #008800; font-weight: bold;">count</span>(<span style="color: #333333;">*</span>) <span style="color: #008800; font-weight: bold;">from</span> results) <span style="color: #008800; font-weight: bold;">as</span> part2 </pre></td></tr></tbody></table></div> <p>Parsing the input is a bit painful in SQL, but it is not too bad. Lines 1-10 are simply the puzzle input, lines 11-17 split the input into individual lines, and lines 18-21 construct a 2D array from the input. The algorithm itself is pretty short, lines 22-27 perform a recursive traversal of the field, and lines 28-39 extract the puzzle answer from the traversal results. For this kind of small scale traversals SQL works just fine.</p><p>Other days were more painful. <a href="https://github.com/neumannt/aoc24/blob/master/day16.sql">Day 16</a> for example does conceptually a very similar traversal of a field, and it computes the minimal traversal distance for each visited. Expressing that in SQL in easy, but evaluation is wasteful. When replacing the reference input with a real puzzle input the field is quite large, and the recursive query generates and preserves a lot of state, even though we only care about the last iteration of the recursive query. As a consequence you need a machine with over 200GB memory to execute that query, even though most of the computed tuples are irrelevant. We could fix that excessive memory consumption by using&nbsp;<a href="https://www.cidrdb.org/cidr2023/papers/p14-hirn.pdf">iteration semantic</a> during recursion, but that is not widely supported by DBMSes. Umbra could do it, but Postgres and DuckDB cannot, thus I have not used it in my solutions.</p><p>And sometimes the programming model of recursive SQL clashes with what we want to do. On <a href="https://github.com/neumannt/aoc24/blob/master/day23.sql">day 23</a> we had to find the maximum clique in sparse graph. This can be computed reasonably well with the <a href="https://en.wikipedia.org/wiki/Bron%E2%80%93Kerbosch_algorithm">Bron-Kerbosch algorithm</a>, but expressing that in recursive SQL is quite convoluted because the algorithm wants to maintain multiple sets, but recursive SQL only passes a single set along. 
It can be done, but the result <a href="https://github.com/neumannt/aoc24/blob/218e93d12b477e694ab88e2c25c6070c28a6fbf4/day23.sql#L52">does not look pretty</a>.</p><p>This experiment has shown two things to me: 1) it is possible to code quite complex algorithms in SQL, and often the SQL code is surprisingly pleasant, and 2) recursive SQL would be much more efficient and more pleasant to use if we had mechanisms to update state. There is <a href="https://www.cidrdb.org/cidr2025/papers.html">ongoing work</a> on supporting more complex control flow in recursion via a trampoline mechanism, which is very useful, too, but we should definitely look into more complex state manipulation mechanisms. With just a bit of extra functionality, SQL would be quite a solid choice for running complex algorithms directly inside a database.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com6tag:blogger.com,1999:blog-8891884956165580080.post-16529363134222518992024-12-13T08:37:00.002+01:002024-12-13T08:38:51.938+01:00What are important data systems problems, ignored by research?<p>In November, I had the pleasure of attending the <a href="https://cwida.github.io/dbdbd2024/program/">Dutch-Belgian DataBase Day</a>, where I moderated a panel on practical challenges often overlooked in database research. Our distinguished panelists included Allison Lee (founding engineer at Snowflake), Andy Pavlo (professor at CMU), and Hannes Mühleisen (co-creator of DuckDB and researcher at CWI), with attendees contributing to the discussion and sharing their perspectives. In this post, I'll attempt to summarize the discussion in the hope that it inspires young (and young-at-heart) researchers to tackle these challenges. Additionally, I'll link to some papers that can serve as motivation and starting points for research in these areas.</p> <p>One significant yet understudied problem raised by multiple panelists is the handling of variable-length strings. Any analysis of real-world analytical queries reveals that strings are <a href="https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/getreal.pdf">ubiquitous</a>. For instance, <a href="https://www.vldb.org/pvldb/vol17/p3694-saxena.pdf">Amazon Redshift recently reported</a> that around 50% of all columns are strings. Since strings are typically larger than numeric data, this implies that strings make up a substantial majority of real-world data. Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, surprisingly little research has been dedicated to efficient database-specific string compression. Given the importance of strings for real-world query performance and storage consumption, it is surprising how little research there is on the topic (there <a href="https://sigmodrecord.org/publications/sigmodRecord/2103/pdfs/16_och-gubner.pdf">are</a> <a href="https://15721.courses.cs.cmu.edu/spring2020/papers/09-compression/muller-edbt2014.pdf">some</a> <a href="https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf">exceptions</a>).</p> <p>Allison highlighted a related issue: standard benchmarks, like TPC-H, are overly simplistic, which may partly explain why string processing is understudied. TPC-H queries involve little complex string processing and don't use strings as join or aggregation keys. Moreover, TPC-H strings have static upper bounds, allowing them to be treated as fixed-size objects. 
This sidesteps the real challenges of variable-size strings and their complex operations. More broadly, standard benchmarks fall short of reflecting real-world workloads, as they lack advanced relational operators (e.g., window functions, CTEs) and complex expressions. To drive meaningful progress, we likely need new, more realistic benchmarks. While the participants agreed on most points, one particularly interesting topic of discussion was distributed query processing. Allison pointed out that many query processing papers overlook distributed processing, making them hard to adopt in industrial systems. Hannes, however, argued that most user workloads can be handled on a single machine, which should be the primary focus of publicly funded research. My personal view is that both single-node and distributed processing are important, and there is ample room to address both challenges.</p> <p>While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as <a href="https://www.vldb.org/pvldb/vol16/p3335-butrovich.pdf">network connection handling</a> (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though <a href="https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/LookingGlass2_CIDR25.pdf">communication bottlenecks</a> frequently constrain efficient OLTP systems. Multi-statement stored procedures, though a potential solution, are not widely adopted and fail to address this issue in practice. I believe there are significant research opportunities in exploring how to better structure the interface between applications and database systems.</p> <p>One striking fact about major database conferences, such as SIGMOD and VLDB, is how few papers address practical database system problems. From personal experience, I believe this presents a significant opportunity for researchers seeking both academic and real-world impact. Solutions to the problems discussed above (and many others) are likely to gain industry attention and be adopted by real database systems. Moreover, with the availability of open-source systems like DuckDB, DataFusion, LeanStore, and PostgreSQL, conducting systems research has become easier than ever.</p> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com2tag:blogger.com,1999:blog-8891884956165580080.post-38379956053648992812024-12-10T15:44:00.001+01:002024-12-10T15:44:52.487+01:00C++ exception performance three years later<p>About three years ago we noticed serious&nbsp;<a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2544r0.html">performance problems in C++ exception unwinding</a>. Due to contention on the unwinding path these became more and more severe the more cores a system had, and unwinding could slow down by orders of magnitude. Due to the constraints of backwards compatibility this contention was not easy to eliminate, and <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2022/p2544r0.html">P2544</a>&nbsp;discussed ways to fix this problem via language changes in C++.</p><p>But fortunately people found less invasive solutions. First, Florian Weimer <a href="https://sourceware.org/git/?p=glibc.git;a=commit;h=5d28a8962dcb6ec056b81d730e3c6fb57185a210">changed the glibc</a>&nbsp;to provide a lock-free mechanism to find the (static) unwind tables for a given shared object. 
Which eliminates the most serious contention for "simple" C++ programs. For example in a&nbsp;<a href="https://github.com/neumannt/exceptionperformance/blob/master/main.cpp">micro-benchmark</a>&nbsp;that calls a function with some computations (100 calls to sqrt per function invocation), and which throws with a certain probability, we previously had very poor scalability with increasing core count. With his patch we now see with gcc 14.2 on a dual-socket EPYC 7713 the following performance development (runtime in ms):</p><p><br /> </p><style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"><thead> <tr> <th class="tg-0lax"></th> <th class="tg-0lax">1</th> <th class="tg-0lax">2</th> <th class="tg-0lax">4</th> <th class="tg-0lax">8</th> <th class="tg-0lax">16</th> <th class="tg-0lax">32</th> <th class="tg-0lax">64</th> <th class="tg-0lax">128</th> <th class="tg-0lax">threads</th> </tr></thead> <tbody> <tr> <td class="tg-0lax">0% failure<br /></td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">42</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">0.1% failure</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">29</td> <td class="tg-0lax">32</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">1% failure</td> <td class="tg-0lax">29<br /></td> <td class="tg-0lax">30<br /></td> <td class="tg-0lax">30<br /></td> <td class="tg-0lax">30</td> <td class="tg-0lax">30</td> <td class="tg-0lax">30</td> <td class="tg-0lax">32</td> <td class="tg-0lax">34</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">10% failure</td> <td class="tg-0lax">36</td> <td class="tg-0lax">36</td> <td class="tg-0lax">37</td> <td class="tg-0lax">37</td> <td class="tg-0lax">37</td> <td class="tg-0lax">37<br /></td> <td class="tg-0lax">47<br /></td> <td class="tg-0lax">65</td> <td class="tg-0lax"></td> </tr> </tbody></table> <p></p><p>Which is more or less perfect. 128 threads are a bit slower, but that is to be expected as one EPYC only has 64 cores. With higher failure rates unwinding itself becomes slower but that is still acceptable here. Thus most C++ programs are just fine.</p><p>For&nbsp;<a href="http://umbra-db.com">our use case</a>&nbsp;that is not enough, though. We dynamically generate machine code at runtime, and we want to be able to pass exceptions through generated code. The _dl_find_object mechanism of glibc is not used for JITed code, instead libgcc maintains its own lookup structure. Historically this was a simple list with a global lock, which of course had terrible performance. 
But through&nbsp;<a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">a series of patches</a>&nbsp;we managed to change libgcc into using a <a href="https://github.com/gcc-mirror/gcc/blob/master/libgcc/unwind-dw2-btree.h">lock-free b-tree</a>&nbsp;for maintaining the dynamic unwinding frames. Using a similar experiment to the one above, but now with JIT-generated code (using LLVM 19), we get the following:</p><p><br /></p> <style type="text/css"> .tg {border-collapse:collapse;border-spacing:0;} .tg td{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; overflow:hidden;padding:10px 5px;word-break:normal;} .tg th{border-color:black;border-style:solid;border-width:1px;font-family:Arial, sans-serif;font-size:14px; font-weight:normal;overflow:hidden;padding:10px 5px;word-break:normal;} .tg .tg-0lax{text-align:left;vertical-align:top} </style> <table class="tg"><thead> <tr> <th class="tg-0lax"></th> <th class="tg-0lax">1</th> <th class="tg-0lax">2</th> <th class="tg-0lax">4</th> <th class="tg-0lax">8</th> <th class="tg-0lax">16</th> <th class="tg-0lax">32</th> <th class="tg-0lax">64</th> <th class="tg-0lax">128</th> <th class="tg-0lax">threads</th> </tr></thead> <tbody> <tr> <td class="tg-0lax">0% failure<br /></td> <td class="tg-0lax">32</td> <td class="tg-0lax">38</td> <td class="tg-0lax">48</td> <td class="tg-0lax">64</td> <td class="tg-0lax">48</td> <td class="tg-0lax">36</td> <td class="tg-0lax">59</td> <td class="tg-0lax">62</td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">0.1% failure</td> <td class="tg-0lax">32</td> <td class="tg-0lax">32</td> <td class="tg-0lax">32</td> <td class="tg-0lax">32<br /></td> <td class="tg-0lax">32<br /></td> <td class="tg-0lax">48<br /></td> <td class="tg-0lax">62</td> <td class="tg-0lax">68<br /></td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">1% failure</td> <td class="tg-0lax">41<br /></td> <td class="tg-0lax">40<br /></td> <td class="tg-0lax">40<br /></td> <td class="tg-0lax">40</td> <td class="tg-0lax">53<br /></td> <td class="tg-0lax">69<br /></td> <td class="tg-0lax">80<br /></td> <td class="tg-0lax">83<br /></td> <td class="tg-0lax"></td> </tr> <tr> <td class="tg-0lax">10% failure</td> <td class="tg-0lax">123</td> <td class="tg-0lax">113</td> <td class="tg-0lax">103<br /></td> <td class="tg-0lax">116<br /></td> <td class="tg-0lax">128</td> <td class="tg-0lax">127<br /></td> <td class="tg-0lax">131<br /></td> <td class="tg-0lax">214<br /></td> <td class="tg-0lax"></td> </tr> </tbody></table> <p>The numbers have more noise than for statically generated code, but overall observation is the same: Unwinding now scales with the number of cores, and we can safely use C++ exceptions even on machines with large core counts.</p><p>So is everything perfect now? No. First, only gcc has a fully scalable frame lookup mechanism. clang has its own implementation, and as far as I know it still does not scale properly due to a global lock in DwarfFDECache. Note that at least on many Linux distributions clang uses libgcc by default, thus the problem is not immediately obvious there, but a pure llvm/clang build will have scalability problems. And&nbsp; second unwinding through JIT-ed code is a quite a bit slower, which is unfortunate. But admittedly the problem is less severe than shown here, the benchmark with JITed code simply has a stack frame more to unwind due to the way static code and JITed code interact. 
And it makes sense to prioritize static unwinding over dynamic unwinding frames, as most people never JIT-generate code.</p><p>Overall we are now quite happy with the unwinding mechanism. The bottlenecks are gone, and performance is fine even with high core counts. It is still not appropriate for high failure rates, something like <a href="https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p0709r4.pdf">P709</a>&nbsp;would be better for that, but we can live with the status quo.</p><p><br /></p><p><br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com2tag:blogger.com,1999:blog-8891884956165580080.post-72721001662922311482024-06-06T15:59:00.000+02:002024-06-06T15:59:31.215+02:00B-trees Require Fewer Comparisons Than Balanced Binary Search Trees<p>Due to better access locality, B-trees are faster than binary search trees <i>in practice</i> -- but are they also better <i>in theory</i>? To answer this question, let's look at the number of comparisons required for a search operation. Assuming we store <i>n</i> elements in a binary search tree, the lower bound for the number of comparisons is <i>log<sub>2</sub> n</i> in the worst case. However, this is only achievable for a perfectly balanced tree. Maintaining such a tree's perfect balance during insert/delete operations requires <i>O(n)</i> time in the worst case.<br /><br />Balanced binary search trees, therefore, leave some slack in terms of how balanced they are and have slightly worse bounds. For example, it is well known that an AVL tree guarantees at most <i>1.44 log<sub>2</sub> n</i> comparisons, and a Red-Black tree guarantees <i>2 log<sub>2</sub> n</i> comparisons. In other words, AVL trees require at most 1.44 times the minimum number of comparisons, and Red-Black trees require up to twice the minimum.<br /><br />How many comparisons does a B-tree need? In B-trees with degree <i>k</i>, each node (except the root) has between <i>k</i> and <i>2k</i> children. For <i>k=2</i>, a B-tree is essentially the same data structure as a Red-Black tree and therefore provides the same guarantee of <i>2 log<sub>2</sub> n</i> comparisons. So how about larger, more realistic values of <i>k</i>?<br /><br />To analyze the general case, we start with a B-tree that has the highest possible height for <i>n</i> elements. The height is maximal when each node has only <i>k</i> children (for simplicity, this analysis ignores the special case of underfull root nodes). This implies that the worst-case height of a B-tree is <i>log<sub>k</sub> n</i>. During a lookup, one has to perform a binary search that takes <i>log<sub>2</sub> k</i> comparisons in each of the <i>log<sub>k</sub> n</i> nodes. So in total, we have <i>log<sub>2</sub> k * log<sub>k</sub> n = log<sub>2</sub> n</i> comparisons.<br /><br />This actually matches the best case, and to construct the worst case, we have to modify the tree somewhat. On one (and only one) arbitrary path from the root to a single leaf node, we increase the number of children from <i>k</i> to <i>2k</i>. In this situation, the tree height is still less than or equal to <i>log<sub>k</sub> n</i>, but we now have one worst-case path where we need <i>log<sub>2</sub> 2k</i> (instead of <i>log<sub>2</sub> k</i>) comparisons. 
On this worst-case path, we have <i>log<sub>2</sub> 2k * log<sub>k</sub> n = (log<sub>2</sub> 2k) / (log<sub>2</sub> k) * log<sub>2</sub> n</i> comparisons.<br /><br />Using this formula, we get the following bounds:<br />k=2: 2 log<sub>2</sub> n<br />k=4: 1.5 log<sub>2</sub> n<br />k=8: 1.33 log<sub>2</sub> n<br />k=16: 1.25 log<sub>2</sub> n<br />...<br />k=512: 1.11 log<sub>2</sub> n<br /><br />We see that as k grows, B-trees get closer to the lower bound. For <i>k&gt;=8</i>, B-trees are guaranteed to perform fewer comparisons than AVL trees in the worst case. As <i>k</i> increases, B-trees become more balanced. One intuition for this result is that for larger <i>k</i> values, B-trees become increasingly similar to sorted arrays which achieve the <i>log<sub>2</sub> n</i> lower bound. Practical B-trees often use fairly large values of <i>k</i> (e.g., 100) and therefore offer tight bounds -- in addition to being more cache-friendly than binary search trees.<br /><br />(Caveat: For simplicity, the analysis assumes that <i>log<sub>2</sub> n</i> and <i>log<sub>2</sub> 2k</i> are integers, and that the root has either <i>k</i> or <i>2k</i> entries. Nevertheless, the observation that larger <i>k</i> values lead to tighter bounds should hold in general.)<br /><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-14749265648749186172024-02-19T09:00:00.001+01:002024-02-21T11:20:30.816+01:00SSDs Have Become Ridiculously Fast, Except in the Cloud<p>In recent years, flash-based SSDs have largely replaced disks for most storage use cases. Internally, each SSD consists of many independent flash chips, each of which can be accessed in parallel. Assuming the SSD controller keeps up, the throughput of an SSD therefore primarily depends on the interface speed to the host. In the past six years, we have seen a rapid transition from SATA to PCIe 3.0 to PCIe 4.0 to PCIe 5.0. As a result, there was an explosion in SSD throughput: <img width="90%" src="https://db.in.tum.de/~leis/dbtrends/ssd-bandwidth.svg" /> </p> <p> At the same time, we saw not just better performance, but also more capacity per dollar: </p> <img width="90%" src="https://db.in.tum.de/~leis/dbtrends/ssd-capacity.svg" /> <p> The two plots illustrate the power of a commodity market. The combination of open standards (NVMe and PCIe), huge demand, and competing vendors led to great benefits for customers. Today, top PCIe 5.0 data center SSDs such as the <a href="https://europe.kioxia.com/en-europe/business/ssd/enterprise-ssd.html">Kioxia CM7-R</a> or <a href="https://semiconductor.samsung.com/emea/ssd/enterprise-ssd/pm1743/">Samsung PM1743</a> achieve up to 13 GB/s read throughput and 2.7M+ random read IOPS. Modern servers have around 100 PCIe lanes, making it possible to have a dozen of SSDs (each usually using 4 lanes) in a single server at full bandwidth. For example, in our lab we have a single-socket Zen 4 server with 8 Kioxia CM7-R SSDs, which achieves 100GB/s (!) I/O bandwidth:</p> <img width="90%" src="https://db.in.tum.de/~leis/dbtrends/iostat.png" /> <p> AWS EC2 was an early NVMe pioneer, launching the <a href="https://aws.amazon.com/blogs/aws/now-available-i3-instances-for-demanding-io-intensive-applications/">i3 instance</a> with 8 physically-attached NVMe SSDs in early 2017. At that time, NVMe SSDs were still expensive, and having 8 in a single server was quite remarkable. 
The per-SSD read (2 GB/s) and write (1 GB/s) performance was considered state of the art as well. Another step forward occurred in 2019 with the launch of <a href="https://aws.amazon.com/blogs/aws/new-the-next-generation-i3en-of-i-o-optimized-ec2-instances/">i3en instances</a>, which doubled storage capacity per dollar.</p> <p> Since then, several NVMe instance types, including i4i and im4gn, have been launched. Surprisingly, however, the performance has not increased; seven years after the i3 launch, we are still stuck with 2 GB/s per SSD. Indeed, the venerable i3 and i3en instances basically remain the best EC2 has to offer in terms of IO-bandwidth/$ and SSD-capacity/$, respectively. Personally, I find this very surprising given the SSD bandwidth explosion and cost reductions we have seen on the commodity market. At this point, the performance gap between state-of-the-art SSDs and those offered by major cloud vendors, especially in read throughput, write throughput, and IOPS, is nearing an order of magnitude. (Azure's top NVMe instances are only slightly faster than AWS's.)</p> <p> What makes this stagnation in the cloud even more surprising is that we have seen great advances in other areas. For example, during the same 2017 to 2023 time frame, EC2 network bandwidth exploded, increasing from 10 Gbit/s (c4) to 200 Gbit/s (c7gn). Now, I can only speculate why the cloud vendors have not caught up on the storage side: <ul> <li> One theory is that EC2 intentionally caps the write speed at 1 GB/s to avoid frequent device failure, given the total number of writes per SSD is limited. However, this does not explain why the read bandwidth is stuck at 2 GB/s.</li> <li> A second possibility is that there is no demand for faster storage because very few storage systems can actually exploit tens of GB/s of I/O bandwidth. See our <a href="https://www.vldb.org/pvldb/vol16/p2090-haas.pdf">recent VLDB paper</a>. On the other hand, as long as fast storage devices are not widely available, there is also little incentive to optimize existing systems.</li> <li> A third theory is that if EC2 were to launch fast and cheap NVMe instance storage, it would disrupt the cost structure of its other storage service (in particular EBS). This is, of course, the classic innovator's dilemma, but one would hope that one of the smaller cloud vendors would make this step to gain a competitive edge.</li> </ul> </p> <p> Overall, I'm not fully convinced by any of these three arguments. Actually, I hope that we'll soon see cloud instances with 10 GB/s SSDs, making this post obsolete. </p> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com21tag:blogger.com,1999:blog-8891884956165580080.post-27466172187739107792023-04-09T14:08:00.005+02:002023-11-03T13:09:44.917+01:00The Great CPU Stagnation<p> For at least five decades, <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore's law</a> consistently delivered increasing numbers of transistors. Equally significant, <a href="https://en.wikipedia.org/wiki/Dennard_scaling">Dennard scaling</a> led to each transistor using less energy, enabling higher clock frequencies. This was great, as higher clock frequencies enhanced existing software performance automatically, without necessitating any code rewrite. However, around 2005, Dennard scaling began to falter, and clock frequencies have largely <a href="https://github.com/karlrupp/microprocessor-trend-data">plateaued</a> since then. 
</p> <p> Despite this, Moore's law continued to advance, with the additional available transistors being channeled into creating more cores per chip. The following graph displays the number of cores for the largest available x86 CPU at the time: <img src="https://db.in.tum.de/~leis/dbtrends/multicore.svg" /> <br/> Notice the logarithmic scale: this represents the exponential trend we had become accustomed to, with core counts doubling roughly every three years. Regrettably, when considering cost per core, this impressive trend appears to have stalled, ushering in an era of CPU stagnation. </p> <p> To demonstrate this stagnation, I gathered data from wikichip.org on AMD's Epyc single-socket CPU lineup, introduced in 2017 and now in its fourth generation (<a href="https://en.wikichip.org/wiki/amd/cores/naples">Naples</a>, <a href="https://en.wikichip.org/wiki/amd/cores/rome">Rome</a>, <a href="https://en.wikichip.org/wiki/amd/cores/milan">Milan</a>, <a href="https://en.wikichip.org/wiki/amd/cores/genoa">Genoa</a>): <table cellspacing="0" border="0"> <colgroup span="3" width="109"></colgroup> <colgroup span="3" width="86"></colgroup> <colgroup width="109"></colgroup> <tr> <td height="27" align="center"><b>Model</b></td> <td align="center"><b>Gen</b></td> <td align="center"><b>Launch</b></td> <td align="center"><b>Cores</b></td> <td align="center"><b>GHz</b></td> <td align="center"><b>IPC</b></td> <td align="center"><b>Price</b></td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7351P</td> <td align="left">Naples</td> <td align="right">06/2017</td> <td align="right" sdval="16" sdnum="1033;">16</td> <td align="right" sdval="2.4" sdnum="1033;0;#,##0.0">2.4</td> <td align="right" sdval="1" sdnum="1033;0;0.00">1.00</td> <td align="right" sdval="750" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$750</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7401P</td> <td align="left">Naples</td> <td align="right">06/2017</td> <td align="right" sdval="24" sdnum="1033;">24</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1" sdnum="1033;0;0.00">1.00</td> <td align="right" sdval="1075" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$1,075</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7551P</td> <td align="left">Naples</td> <td align="right">06/2017</td> <td align="right" sdval="32" sdnum="1033;">32</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1" sdnum="1033;0;0.00">1.00</td> <td align="right" sdval="2100" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,100</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7302P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="16" sdnum="1033;">16</td> <td align="right" sdval="3" sdnum="1033;0;#,##0.0">3.0</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="825" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$825</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7402P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="24" sdnum="1033;">24</td> <td align="right" sdval="2.8" sdnum="1033;0;#,##0.0">2.8</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="1250" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$1,250</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7502P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="32" 
sdnum="1033;">32</td> <td align="right" sdval="2.5" sdnum="1033;0;#,##0.0">2.5</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="2300" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,300</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7702P</td> <td align="left">Rome</td> <td align="right">08/2019</td> <td align="right" sdval="64" sdnum="1033;">64</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1.15" sdnum="1033;0;0.00">1.15</td> <td align="right" sdval="4425" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$4,425</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7313P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="16" sdnum="1033;">16</td> <td align="right" sdval="3" sdnum="1033;0;#,##0.0">3.0</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="913" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$913</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7443P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="24" sdnum="1033;">24</td> <td align="right" sdval="2.85" sdnum="1033;0;#,##0.0">2.9</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="1337" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$1,337</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7543P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="32" sdnum="1033;">32</td> <td align="right" sdval="2.8" sdnum="1033;0;#,##0.0">2.8</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="2730" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,730</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">7713P</td> <td align="left">Milan</td> <td align="right">03/2021</td> <td align="right" sdval="64" sdnum="1033;">64</td> <td align="right" sdval="2" sdnum="1033;0;#,##0.0">2.0</td> <td align="right" sdval="1.37" sdnum="1033;0;0.00">1.37</td> <td align="right" sdval="5010" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$5,010</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9354P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="32" sdnum="1033;">32</td> <td align="right" sdval="3.25" sdnum="1033;0;#,##0.0">3.3</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="2730" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$2,730</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9454P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="48" sdnum="1033;">48</td> <td align="right" sdval="2.75" sdnum="1033;0;#,##0.0">2.8</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="4598" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$4,598</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9554P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="64" sdnum="1033;">64</td> <td align="right" sdval="3.1" sdnum="1033;0;#,##0.0">3.1</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="7104" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$7,104</td> </tr> <tr> <td height="28" align="left" sdnum="1033;0;@">9654P</td> <td align="left">Genoa</td> <td align="right">11/2022</td> <td align="right" sdval="96" sdnum="1033;">96</td> 
<td align="right" sdval="2.4" sdnum="1033;0;#,##0.0">2.4</td> <td align="right" sdval="1.57" sdnum="1033;0;0.00">1.57</td> <td align="right" sdval="10625" sdnum="1033;0;[$$-409]#,##0;[RED]-[$$-409]#,##0">$10,625</td> </tr> </table> </p> <p> Over these past six years, AMD has emerged as the x86 performance per dollar leader. Examining these numbers should provide insight into the state of server CPUs. Let's first observe CPU cores per dollar: <img src="https://db.in.tum.de/~leis/dbtrends/coresperdollar.svg" /> </p> <p> This deviates significantly from the expected exponential improvement graphs. In fact, CPU cores are becoming slightly more expensive over time! Admittedly, newer cores outperform their predecessors. When accounting for both clock frequency and higher IPC, we obtain the following image: </p> <img src="https://db.in.tum.de/~leis/dbtrends/instrperdollar.svg" /> <p> This isn't much better. The performance improvement over a 6-year period is underwhelming when normalized for cost. Similar results can also be observed for <a href="https://databasearchitects.blogspot.com/2021/07/aws-ec2-hardware-trends-2015-2021.html">Intel CPUs in EC2</a>. </p> <p> Lastly, let's examine transistor counts, only taking into account the logic transistors. Despite improved production nodes from 14nm (Naples) over 7nm (Rome/Milan) to 5nm (Genoa), cost-adjusted figures reveal stagnation: <img src="https://db.in.tum.de/~leis/dbtrends/transistorsperdollar.svg" /> </p> <p> In conclusion, the results are disheartening. Rapid and exponential improvements in CPU speed seem to be relics of the past. We now find ourselves in a markedly different landscape compared to the historical norm in computing. The implications could be far-reaching. For example, most software is extremely <a href="https://www.facebook.com/permalink.php?story_fbid=pfbid0iPixEvPJQGzNa6t2x6HUL5TYqfmKGqSgfkBg6QaTyHF5frXQi7eLGxC7uPQv5U5jl&amp;id=100006735798590">inefficient</a> when compared to what hardware can theoretically achieve, and maybe this needs to change. Furthermore, historically specialized chips enjoyed only limited success due to the rapid advancement of commodity CPUs. Perhaps, custom chips will have a much bigger role in the future. </p> <p> P.S. Due to popular demand, here's how the last graph looks like after adjusting for inflation: <img src="https://db.in.tum.de/~leis/dbtrends/transistorsperdollar17.svg" /> </p> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com7tag:blogger.com,1999:blog-8891884956165580080.post-87555303398007314712023-02-07T15:31:00.001+01:002023-02-10T07:47:09.162+01:00Five Decades of Database ResearchSince 1975, over 24 thousand articles have have been published in major database venues (SIGMOD, VLDB/PVLDB, ICDE, EDBT, CIDR, TODS, VLDB Journal, TKDE). The number of papers per year is rising: <img src="https://db.in.tum.de/~leis/dbtrends/papers.svg" /> <br /> <br /> Over time, the topics change. 
Looking at the percentage of keywords appearing in paper titles (in that particular year), we can see interesting trends: <img src="https://db.in.tum.de/~leis/dbtrends/models.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/top.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/ml.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/stream.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/cloud.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/mem.svg" /> <br /> <img src="https://db.in.tum.de/~leis/dbtrends/class.svg" /> <br /> Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-31718660745232360552023-01-23T13:13:00.001+01:002023-02-07T10:56:00.391+01:00For systems, research is development and development is research <p>The <a href="https://www.cidrdb.org/cidr2023/program.html">Conference on Innovative Data Systems Research (CIDR) 2023</a> is over, and as usual both the official program and the informal discussions have been great. CIDR encourages innovative, risky, and controversial ideas as well as honest exchanges. One intensely-discussed talk was the <a href="https://www.youtube.com/watch?v=dv4A2LIFG80">keynote by Hannes Mühleisen</a>, who together with Mark Raasveldt is the brain behind DuckDB. <br /> <br />In the keynote, Hannes lamented the incentives of systems researchers in academia (e.g., papers over running code). He also criticized the often obscure topics database systems researchers work on while neglecting many practical and pressing problems (e.g., top-k algorithms rather than practically-important issues like strings). <a href="https://www.youtube.com/watch?v=DJFKl_5JTnA">Michael Stonebraker has similar </a><a href="https://www.youtube.com/watch?v=DJFKl_5JTnA">thoughts</a> on the database systems community. I share many of these criticisms, but I'm more optimistic regarding what systems research in academia can do, and would therefore like to share my perspective. <br /> <br />Software is different: copying it is free, which has two implications: (1) Most systems are somewhat unique -- otherwise one could have used an existing one. (2) The cost of software is dominated by development effort. I argue that, together, these two observations mean that systems research and system development are two sides of the same coin. <br /> <br />Because developing complex systems is difficult, reinventing the wheel is not a good idea -- it's much better to stand on the proverbial shoulders of giants. Thus, developers should look at the existing literature to find out what others have done, and should experimentally compare existing approaches. Often there are no good solutions for some problems, requiring new inventions, which need to be written up to communicate them to others. Writing will not just allow communication, it will also improve conceptual clarity and understanding, leading to better software. Of course, all these activities (literature review, experiments, invention, writing) are indistinguishable from systems research. <br /> <br />On the other hand, doing systems research without integrating the new techniques into real systems can also lead to problems. Without being grounded by real systems, researchers risk wasting their time on intellectually-difficult, but practically-pointless problems. (And indeed much of what is published at the major database conferences falls into this trap.) 
Building real systems leads to a treasure trove of open problems. Publishing solutions to these often directly results in technological progress, better systems, and adoption by other systems. <br /> <br />To summarize: systems research is (or should be) indistinguishable from systems development. In principle, this methodology could work in both industry and academia. Both places have problematic incentives, but different ones. Industry often has a very short time horizon, which can lead to very incremental developments. Academic paper-counting incentives can lead to lots of papers without any impact on real systems. <br /> <br />Building systems in academia may not be the best strategy to publish the maximum number of papers or citations, but can lead to real-world impact, technological progress, and (in the long run even) academic accolades. The key is therefore to work with people who have shown how to overcome these systemic pathologies, and build systems over a long time horizon. There are many examples such academic projects (e.g., PostgreSQL, C-Store/Vertica, H-Store/VoltDB, ShoreMT, Proteus, Quickstep, Peloton, KÙZU, AsterixDB, MonetDB, Vectorwise, DuckDB, Hyper, LeanStore, and Umbra).</p><p><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com1tag:blogger.com,1999:blog-8891884956165580080.post-49715376636326761732022-06-26T11:00:00.001+02:002022-06-27T12:29:08.518+02:00Making unwinding through JIT-ed code scalable - b-tree operations<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>Now that we have all infrastructure in place, we look at the high-level algorithms. For inserts, we walk down the tree until we hit the leaf-node that should contain the new value. If that node is full, we split the leaf node, and insert a new separator into the parent node to distinguish the two nodes. To avoid propagating that split further up (as the inner node might be full, too, requiring an inner split), we eagerly split full inner nodes when walking down. This guarantees that the parent of a node is never full, which allows us to look at nodes purely from top-to-bottom, which greatly simplifies locking.</p><p>The splits themselves are relatively simple, we just copy the right half of each node into a new node, reduce the size of the original node, and insert a separator into the parent. However two problems require some care 1) we might have to split the root, which does not have a parent itself, and 2) the node split could mean that the value we try to insert could be either in the left or the right node. 
The split functions always update the node iterator to the correct node, and release the lock on the node that is not needed after the split.<br /></p><!--HTML generated using hilite.me--><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Insert a new separator after splitting</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_node_update_separator_after_split</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> old_separator, <span style="color: #333399; font-weight: bold;">uintptr_t</span> new_separator, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>new_right) { <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_inner_slot (n, old_separator); <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>entry_count; index <span style="color: #333333;">&gt;</span> slot; <span style="color: #333333;">--</span>index) n<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; n<span style="color: #333333;">-&gt;</span>content.children[slot].separator <span style="color: #333333;">=</span> new_separator; n<span style="color: #333333;">-&gt;</span>content.children[slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child <span style="color: #333333;">=</span> new_right; n<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>; } <span style="color: #888888;">// Check if we are splitting the root</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_handle_root_split</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>node, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>parent) { <span style="color: #888888;">// We want to keep the root pointer stable to allow for contention</span> <span style="color: #888888;">// free reads. 
Thus, we split the root by first moving the content</span> <span style="color: #888888;">// of the root node to a new node, and then split that new node</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!*</span>parent) { <span style="color: #888888;">// Allocate a new node, this guarantees us that we will have a parent</span> <span style="color: #888888;">// afterwards</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>new_node <span style="color: #333333;">=</span> btree_allocate_node (t, btree_node_is_inner (<span style="color: #333333;">*</span>node)); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>old_node <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>node; new_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> old_node<span style="color: #333333;">-&gt;</span>entry_count; new_node<span style="color: #333333;">-&gt;</span>content <span style="color: #333333;">=</span> old_node<span style="color: #333333;">-&gt;</span>content; old_node<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].separator <span style="color: #333333;">=</span> max_separator; old_node<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child <span style="color: #333333;">=</span> new_node; old_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>; old_node<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> btree_node_inner; <span style="color: #333333;">*</span>parent <span style="color: #333333;">=</span> old_node; <span style="color: #333333;">*</span>node <span style="color: #333333;">=</span> new_node; } } <span style="color: #888888;">// Split an inner node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_split_inner</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>inner, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>parent, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target) { <span style="color: #888888;">// Check for the root</span> btree_handle_root_split (t, inner, parent); <span style="color: #888888;">// Create two inner node</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> right_fence <span style="color: #333333;">=</span> btree_node_get_fence_key (<span style="color: #333333;">*</span>inner); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>left_inner <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>inner; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>right_inner <span style="color: #333333;">=</span> btree_allocate_node (t, <span style="color: #007020;">true</span>); <span style="color: #333399; font-weight: 
bold;">unsigned</span> split <span style="color: #333333;">=</span> left_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; right_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> left_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> split; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">&lt;</span> right_inner<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) right_inner<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> left_inner<span style="color: #333333;">-&gt;</span>content.children[split <span style="color: #333333;">+</span> index]; left_inner<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> split; <span style="color: #333399; font-weight: bold;">uintptr_t</span> left_fence <span style="color: #333333;">=</span> btree_node_get_fence_key (left_inner); btree_node_update_separator_after_split (<span style="color: #333333;">*</span>parent, right_fence, left_fence, right_inner); <span style="color: #008800; font-weight: bold;">if</span> (target <span style="color: #333333;">&lt;=</span> left_fence) { <span style="color: #333333;">*</span>inner <span style="color: #333333;">=</span> left_inner; btree_node_unlock_exclusive (right_inner); } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #333333;">*</span>inner <span style="color: #333333;">=</span> right_inner; btree_node_unlock_exclusive (left_inner); } } <span style="color: #888888;">// Split a leaf node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_split_leaf</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>leaf, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">**</span>parent, <span style="color: #333399; font-weight: bold;">uintptr_t</span> fence, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target) { <span style="color: #888888;">// Check for the root</span> btree_handle_root_split (t, leaf, parent); <span style="color: #888888;">// Create two leaf node</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> right_fence <span style="color: #333333;">=</span> fence; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>left_leaf <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>leaf; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>right_leaf <span style="color: #333333;">=</span> btree_allocate_node (t, <span style="color: #007020;">false</span>); <span style="color: #333399; font-weight: bold;">unsigned</span> split <span style="color: #333333;">=</span> left_leaf<span style="color: 
#333333;">-&gt;</span>entry_count <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; right_leaf<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> left_leaf<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> split; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_leaf<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) right_leaf<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> left_leaf<span style="color: #333333;">-&gt;</span>content.entries[split <span style="color: #333333;">+</span> index]; left_leaf<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> split; <span style="color: #333399; font-weight: bold;">uintptr_t</span> left_fence <span style="color: #333333;">=</span> right_leaf<span style="color: #333333;">-&gt;</span>content.entries[<span style="color: #0000dd; font-weight: bold;">0</span>].base <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>; btree_node_update_separator_after_split (<span style="color: #333333;">*</span>parent, right_fence, left_fence, right_leaf); <span style="color: #008800; font-weight: bold;">if</span> (target <span style="color: #333333;">&lt;=</span> left_fence) { <span style="color: #333333;">*</span>leaf <span style="color: #333333;">=</span> left_leaf; btree_node_unlock_exclusive (right_leaf); } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #333333;">*</span>leaf <span style="color: #333333;">=</span> right_leaf; btree_node_unlock_exclusive (left_leaf); } } <span style="color: #888888;">// Insert an entry</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_insert</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">uintptr_t</span> base, <span style="color: #333399; font-weight: bold;">uintptr_t</span> size, <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob) { <span style="color: #888888;">// Sanity check</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>size) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">false</span>; <span style="color: #888888;">// Access the root</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>iter, <span style="color: #333333;">*</span>parent <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; { version_lock_lock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); iter <span style="color: #333333;">=</span> t<span style="color: #333333;">-&gt;</span>root; <span style="color: #008800; font-weight: bold;">if</span> (iter) { btree_node_lock_exclusive (iter); } <span style="color: #008800; 
font-weight: bold;">else</span> { t<span style="color: #333333;">-&gt;</span>root <span style="color: #333333;">=</span> iter <span style="color: #333333;">=</span> btree_allocate_node (t, <span style="color: #007020;">false</span>); } version_lock_unlock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); } <span style="color: #888888;">// Walk down the btree with classic lock coupling and eager splits.</span> <span style="color: #888888;">// Strictly speaking this is not performance optimal, we could use</span> <span style="color: #888888;">// optimistic lock coupling until we hit a node that has to be modified.</span> <span style="color: #888888;">// But that is more difficult to implement and frame registration is</span> <span style="color: #888888;">// rare anyway, we use simple locking for now</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> fence <span style="color: #333333;">=</span> max_separator; <span style="color: #008800; font-weight: bold;">while</span> (btree_node_is_inner (iter)) { <span style="color: #888888;">// Use eager splits to avoid lock coupling up</span> <span style="color: #008800; font-weight: bold;">if</span> (iter<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">==</span> max_fanout_inner) btree_split_inner (t, <span style="color: #333333;">&amp;</span>iter, <span style="color: #333333;">&amp;</span>parent, base); <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_inner_slot (iter, base); <span style="color: #008800; font-weight: bold;">if</span> (parent) btree_node_unlock_exclusive (parent); parent <span style="color: #333333;">=</span> iter; fence <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.children[slot].separator; iter <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.children[slot].child; btree_node_lock_exclusive (iter); } <span style="color: #888888;">// Make sure we have space</span> <span style="color: #008800; font-weight: bold;">if</span> (iter<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">==</span> max_fanout_leaf) btree_split_leaf (t, <span style="color: #333333;">&amp;</span>iter, <span style="color: #333333;">&amp;</span>parent, fence, base); <span style="color: #008800; font-weight: bold;">if</span> (parent) btree_node_unlock_exclusive (parent); <span style="color: #888888;">// Insert in node</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_leaf_slot (iter, base); <span style="color: #008800; font-weight: bold;">if</span> ((slot <span style="color: #333333;">&lt;</span> iter<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">&amp;&amp;</span> (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base <span style="color: #333333;">==</span> base)) { <span style="color: #888888;">// duplicate entry, this should never happen</span> btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">false</span>; } <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>entry_count; index 
<span style="color: #333333;">&gt;</span> slot; <span style="color: #333333;">--</span>index) iter<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry <span style="color: #333333;">*</span>e <span style="color: #333333;">=</span> <span style="color: #333333;">&amp;</span>(iter<span style="color: #333333;">-&gt;</span>content.entries[slot]); e<span style="color: #333333;">-&gt;</span>base <span style="color: #333333;">=</span> base; e<span style="color: #333333;">-&gt;</span>size <span style="color: #333333;">=</span> size; e<span style="color: #333333;">-&gt;</span>ob <span style="color: #333333;">=</span> ob; iter<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>; btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">true</span>; } </pre></div> <p></p><p>Deletion is more complex, as there are more cases. We have to maintain the invariant that each node is at least half full. Just like insertion we have the problem that operations can trickle up, e.g., deleting in element in a node might make it less than half-full, merging that node with a half-full neighbor deletes an entry from the parent, which can make that node less than half-full, etc. We solve that problem by merging while going down: When traversing the tree during element-removal, we check if the current node is less than half full. If yes, we merge/balance it with a neighbor node. If the parent becomes less than half-full that will be fixed at the next traversal. Strictly speaking this means nodes can, at least temporarily, be less than half full, but that is fine for asymptotic complexity, as we are never more than one element below the threshold.</p><p>The merge logic examines that least-full neighbor of the current code. If both nodes together would fit in one node, they are merged and the separator for the left node is removed from the parent. Otherwise, elements are shifted from the less-full node to the other node, which makes both nodes at least half full. 
The separator of the left node is updated after the shift:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Merge (or balance) child nodes</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_merge_node</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">unsigned</span> child_slot, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>parent, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target) { <span style="color: #888888;">// Choose the emptiest neighbor and lock both. The target child is already</span> <span style="color: #888888;">// locked</span> <span style="color: #333399; font-weight: bold;">unsigned</span> left_slot; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>left_node, <span style="color: #333333;">*</span>right_node; <span style="color: #008800; font-weight: bold;">if</span> ((child_slot <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span>) <span style="color: #333333;">||</span> (((child_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #333333;">&lt;</span> parent<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">&amp;&amp;</span> (parent<span style="color: #333333;">-&gt;</span>content.children[child_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">&lt;</span> parent<span style="color: #333333;">-&gt;</span>content.children[child_slot <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child<span style="color: #333333;">-&gt;</span>entry_count))) { left_slot <span style="color: #333333;">=</span> child_slot; left_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].child; right_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].child; btree_node_lock_exclusive (right_node); } <span style="color: #008800; font-weight: bold;">else</span> { left_slot <span style="color: #333333;">=</span> child_slot <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>; left_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].child; right_node <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: 
bold;">1</span>].child; btree_node_lock_exclusive (left_node); } <span style="color: #888888;">// Can we merge both nodes into one node?</span> <span style="color: #333399; font-weight: bold;">unsigned</span> total_count <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333399; font-weight: bold;">unsigned</span> max_count <span style="color: #333333;">=</span> btree_node_is_inner (left_node) <span style="color: #333333;">?</span> max_fanout_inner <span style="color: #333333;">:</span> max_fanout_leaf; <span style="color: #008800; font-weight: bold;">if</span> (total_count <span style="color: #333333;">&lt;=</span> max_count) { <span style="color: #888888;">// Merge into the parent?</span> <span style="color: #008800; font-weight: bold;">if</span> (parent<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">2</span>) { <span style="color: #888888;">// Merge children into parent. This can only happen at the root</span> <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> left_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content.children[index]; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">+</span> left_node<span style="color: #333333;">-&gt;</span>entry_count] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index]; } <span style="color: #008800; font-weight: bold;">else</span> { parent<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> btree_node_leaf; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> left_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content.entries[index]; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: 
#333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">+</span> left_node<span style="color: #333333;">-&gt;</span>entry_count] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index]; } parent<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> total_count; btree_release_node (t, left_node); btree_release_node (t, right_node); <span style="color: #008800; font-weight: bold;">return</span> parent; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #888888;">// Regular merge</span> <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.children[left_node<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index]; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.entries[left_node<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">++</span>] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index]; } parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].separator <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>].separator; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> left_slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>; index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">&lt;</span> parent<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) parent<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> parent<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; parent<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">--</span>; btree_release_node (t, 
right_node); btree_node_unlock_exclusive (parent); <span style="color: #008800; font-weight: bold;">return</span> left_node; } } <span style="color: #888888;">// No merge possible, rebalance instead</span> <span style="color: #008800; font-weight: bold;">if</span> (left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">&gt;</span> right_node<span style="color: #333333;">-&gt;</span>entry_count) { <span style="color: #888888;">// Shift from left to right</span> <span style="color: #333399; font-weight: bold;">unsigned</span> to_shift <span style="color: #333333;">=</span> (left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> right_node<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) { <span style="color: #333399; font-weight: bold;">unsigned</span> pos <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">-</span> index; right_node<span style="color: #333333;">-&gt;</span>content.children[pos <span style="color: #333333;">+</span> to_shift] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[pos]; } <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content .children[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift <span style="color: #333333;">+</span> index]; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) { <span style="color: #333399; font-weight: bold;">unsigned</span> pos <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">-</span> index; right_node<span style="color: #333333;">-&gt;</span>content.entries[pos <span style="color: #333333;">+</span> to_shift] <span style="color: #333333;">=</span> right_node<span style="color: 
#333333;">-&gt;</span>content.entries[pos]; } <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> left_node<span style="color: #333333;">-&gt;</span>content .entries[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift <span style="color: #333333;">+</span> index]; } left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-=</span> to_shift; right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+=</span> to_shift; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #888888;">// Shift from right to left</span> <span style="color: #333399; font-weight: bold;">unsigned</span> to_shift <span style="color: #333333;">=</span> (right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> left_node<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>; <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (left_node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.children[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+</span> index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index]; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.children[index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.children[index <span style="color: #333333;">+</span> to_shift]; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> to_shift; <span style="color: #333333;">++</span>index) left_node<span style="color: #333333;">-&gt;</span>content.entries[left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+</span> index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index]; <span 
style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">!=</span> right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> to_shift; <span style="color: #333333;">++</span>index) right_node<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">+</span> to_shift]; } left_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">+=</span> to_shift; right_node<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">-=</span> to_shift; } <span style="color: #333399; font-weight: bold;">uintptr_t</span> left_fence; <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_leaf (left_node)) { left_fence <span style="color: #333333;">=</span> right_node<span style="color: #333333;">-&gt;</span>content.entries[<span style="color: #0000dd; font-weight: bold;">0</span>].base <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>; } <span style="color: #008800; font-weight: bold;">else</span> { left_fence <span style="color: #333333;">=</span> btree_node_get_fence_key (left_node); } parent<span style="color: #333333;">-&gt;</span>content.children[left_slot].separator <span style="color: #333333;">=</span> left_fence; btree_node_unlock_exclusive (parent); <span style="color: #008800; font-weight: bold;">if</span> (target <span style="color: #333333;">&lt;=</span> left_fence) { btree_node_unlock_exclusive (right_node); <span style="color: #008800; font-weight: bold;">return</span> left_node; } <span style="color: #008800; font-weight: bold;">else</span> { btree_node_unlock_exclusive (left_node); <span style="color: #008800; font-weight: bold;">return</span> right_node; } } <span style="color: #888888;">// Remove an entry</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_remove</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">uintptr_t</span> base) { <span style="color: #888888;">// Access the root</span> version_lock_lock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>iter <span style="color: #333333;">=</span> t<span style="color: #333333;">-&gt;</span>root; <span style="color: #008800; font-weight: bold;">if</span> (iter) btree_node_lock_exclusive (iter); version_lock_unlock_exclusive (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock)); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>iter) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #888888;">// Same strategy as with insert, walk down with lock coupling and</span> <span style="color: #888888;">// 
merge eagerly</span> <span style="color: #008800; font-weight: bold;">while</span> (btree_node_is_inner (iter)) { <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_inner_slot (iter, base); <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.children[slot].child; btree_node_lock_exclusive (next); <span style="color: #008800; font-weight: bold;">if</span> (btree_node_needs_merge (next)) { <span style="color: #888888;">// Use eager merges to avoid lock coupling up</span> iter <span style="color: #333333;">=</span> btree_merge_node (t, slot, iter, base); } <span style="color: #008800; font-weight: bold;">else</span> { btree_node_unlock_exclusive (iter); iter <span style="color: #333333;">=</span> next; } } <span style="color: #888888;">// Remove existing entry</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> btree_node_find_leaf_slot (iter, base); <span style="color: #008800; font-weight: bold;">if</span> ((slot <span style="color: #333333;">&gt;=</span> iter<span style="color: #333333;">-&gt;</span>entry_count) <span style="color: #333333;">||</span> (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base <span style="color: #333333;">!=</span> base)) { <span style="color: #888888;">// not found, this should never happen</span> btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; } <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.entries[slot].ob; <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> slot; index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span> <span style="color: #333333;">&lt;</span> iter<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) iter<span style="color: #333333;">-&gt;</span>content.entries[index] <span style="color: #333333;">=</span> iter<span style="color: #333333;">-&gt;</span>content.entries[index <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>]; iter<span style="color: #333333;">-&gt;</span>entry_count<span style="color: #333333;">--</span>; btree_node_unlock_exclusive (iter); <span style="color: #008800; font-weight: bold;">return</span> ob; } </pre></div> <p>Lookups are conceptually simple, we just walk down the b-tree. However we do the traversal using optimistic lock coupling, which means the data could change behind our back at any time. As a consequence, all reads have to be (relaxed) atomic reads, and we have to validate the current lock before acting upon a value that we have read. 
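</p><p>Every step of the lookup below follows the same read-then-validate pattern. As a minimal sketch (using the helper functions from the previous article; the local variable node and the restart label are assumed to come from the surrounding traversal loop, this is not the verbatim patch code):</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">// Sketch: optimistically read one field of a node
uintptr_t lock;
if (!btree_node_lock_optimistic (node, &amp;lock))
  goto restart;  // node is currently locked exclusively by a writer
// copy the value of interest into a local with a relaxed atomic load
unsigned count = __atomic_load_n (&amp;node-&gt;entry_count, __ATOMIC_RELAXED);
if (!btree_node_validate (node, lock))
  goto restart;  // the version changed, the copy may be inconsistent
// only now is it safe to act on the local copy in count
</pre></div><p>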
In case of failures (e.g., concurrent writes during reading), we simply restart the traversal.&nbsp; <br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Find the corresponding entry for the given address</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_lookup</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">uintptr_t</span> target_addr) { <span style="color: #888888;">// Within this function many loads are relaxed atomic loads.</span> <span style="color: #888888;">// Use a macro to keep the code reasonable</span> <span style="color: #557799;">#define RLOAD(x) __atomic_load_n (&amp;(x), __ATOMIC_RELAXED)</span> <span style="color: #888888;">// For targets where unwind info is usually not registered through these</span> <span style="color: #888888;">// APIs anymore, avoid any sequential consistent atomics.</span> <span style="color: #888888;">// Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #888888;">// loading/initialization happens-before using that library in other</span> <span style="color: #888888;">// threads (in particular unwinding with that library's functions</span> <span style="color: #888888;">// appearing in the backtraces). Calling that library's functions</span> <span style="color: #888888;">// without waiting for the library to initialize would be racy.</span> <span style="color: #008800; font-weight: bold;">if</span> (__builtin_expect (<span style="color: #333333;">!</span>RLOAD (t<span style="color: #333333;">-&gt;</span>root), <span style="color: #0000dd; font-weight: bold;">1</span>)) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #888888;">// The unwinding tables are mostly static, they only change when</span> <span style="color: #888888;">// frames are added or removed. This makes it extremely unlikely that they</span> <span style="color: #888888;">// change during a given unwinding sequence. Thus, we optimize for the</span> <span style="color: #888888;">// contention free case and use optimistic lock coupling. This does not</span> <span style="color: #888888;">// require any writes to shared state, instead we validate every read. It is</span> <span style="color: #888888;">// important that we do not trust any value that we have read until we call</span> <span style="color: #888888;">// validate again. Data can change at arbitrary points in time, thus we always</span> <span style="color: #888888;">// copy something into a local variable and validate again before acting on</span> <span style="color: #888888;">// the read. 
In the unlikely event that we encounter a concurrent change we</span> <span style="color: #888888;">// simply restart and try again.</span> <span style="color: #997700; font-weight: bold;">restart:</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>iter; <span style="color: #333399; font-weight: bold;">uintptr_t</span> lock; { <span style="color: #888888;">// Accessing the root node requires defending against concurrent pointer</span> <span style="color: #888888;">// changes Thus we couple rootLock -&gt; lock on root node -&gt; validate rootLock</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>version_lock_lock_optimistic (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock), <span style="color: #333333;">&amp;</span>lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; iter <span style="color: #333333;">=</span> RLOAD (t<span style="color: #333333;">-&gt;</span>root); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>version_lock_validate (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock), lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>iter) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #333399; font-weight: bold;">uintptr_t</span> child_lock; <span style="color: #008800; font-weight: bold;">if</span> ((<span style="color: #333333;">!</span>btree_node_lock_optimistic (iter, <span style="color: #333333;">&amp;</span>child_lock)) <span style="color: #333333;">||</span> (<span style="color: #333333;">!</span>version_lock_validate (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root_lock), lock))) <span style="color: #008800; font-weight: bold;">goto</span> restart; lock <span style="color: #333333;">=</span> child_lock; } <span style="color: #888888;">// Now we can walk down towards the right leaf node</span> <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #007020;">true</span>) { <span style="color: #008800; font-weight: bold;">enum</span> node_type type <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>type); <span style="color: #333399; font-weight: bold;">unsigned</span> entry_count <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>entry_count); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>entry_count) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #008800; font-weight: bold;">if</span> (type <span style="color: #333333;">==</span> btree_node_inner) { <span style="color: #888888;">// We cannot call find_inner_slot here because we need (relaxed)</span> <span style="color: #888888;">// atomic reads here</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> <span 
style="color: #0000dd; font-weight: bold;">0</span>; <span style="color: #008800; font-weight: bold;">while</span> ( ((slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #333333;">&lt;</span> entry_count) <span style="color: #333333;">&amp;&amp;</span> (RLOAD (iter<span style="color: #333333;">-&gt;</span>content.children[slot].separator) <span style="color: #333333;">&lt;</span> target_addr)) <span style="color: #333333;">++</span>slot; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>child <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.children[slot].child); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #888888;">// The node content can change at any point in time, thus we must</span> <span style="color: #888888;">// interleave parent and child checks</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> child_lock; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_lock_optimistic (child, <span style="color: #333333;">&amp;</span>child_lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #888888;">// make sure we still point to the correct node after</span> <span style="color: #888888;">// acquiring the optimistic lock</span> <span style="color: #888888;">// Go down</span> iter <span style="color: #333333;">=</span> child; lock <span style="color: #333333;">=</span> child_lock; } <span style="color: #008800; font-weight: bold;">else</span> { <span style="color: #888888;">// We cannot call find_leaf_slot here because we need (relaxed)</span> <span style="color: #888888;">// atomic reads here</span> <span style="color: #333399; font-weight: bold;">unsigned</span> slot <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; <span style="color: #008800; font-weight: bold;">while</span> (((slot <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #333333;">&lt;</span> entry_count) <span style="color: #333333;">&amp;&amp;</span> (RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base) <span style="color: #333333;">+</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].size) <span style="color: #333333;">&lt;=</span> target_addr)) <span style="color: #333333;">++</span>slot; <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry entry; entry.base <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].base); entry.size <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].size); entry.ob <span style="color: #333333;">=</span> RLOAD (iter<span style="color: #333333;">-&gt;</span>content.entries[slot].ob); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_validate (iter, 
lock)) <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #888888;">// Check if we have a hit</span> <span style="color: #008800; font-weight: bold;">if</span> ((entry.base <span style="color: #333333;">&lt;=</span> target_addr) <span style="color: #333333;">&amp;&amp;</span> (target_addr <span style="color: #333333;">&lt;</span> entry.base <span style="color: #333333;">+</span> entry.size)) { <span style="color: #008800; font-weight: bold;">return</span> entry.ob; } <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; } } <span style="color: #557799;">#undef RLOAD</span> } </pre></div> <p></p><p>This is the end of the article series discussing the gcc patch for lock-free unwinding. With that patch, we get scalable unwinding even on a machine with 256 hardware contexts. I hope the series helps with understanding the patch, and eventually allows it to be integrated into gcc.<br /></p><p><br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-52030477093284989802022-06-26T10:56:00.000+02:002022-06-26T10:56:00.701+02:00Making unwinding through JIT-ed code scalable - The b-tree<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>We use a&nbsp;<a href="https://en.wikipedia.org/wiki/B-tree">b-tree</a> because it offers fast lookup, good data locality, and a scalable implementation is reasonably easy when using optimistic lock coupling. Nevertheless, a b-tree is a non-trivial data structure. To avoid having one huge article that includes all details of the b-tree, we just discuss the data structures themselves and some helper functions here; the insert/remove/lookup operations will be discussed in the next article.</p><p>A b-tree partitions its elements by value. An inner node contains a sorted list of separator/child pairs, with the guarantee that the elements in the sub-tree rooted at the child pointer will be &lt;= the separator. The leaf nodes contain sorted lists of (base, size, object) entries, where the object is responsible for unwinding entries between base and base+size.&nbsp; A b-tree maintains the invariants that 1) all nodes except the root are at least half full, and 2) all leaf nodes have the same distance to the root. This guarantees us logarithmic lookup costs. 
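To get a feeling for the resulting depth: with the node sizes used by this implementation (an inner fanout of 15 and a leaf fanout of 10, defined below), a tree consisting of a root and one further level of inner nodes can already hold up to 15 * 15 * 10 = 2250 entries, so the tree stays very shallow in practice. 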
Note that we use fence-keys, i.e., the inner nodes have a separator for the right-most entries, too, which is not the case in all b-tree implementations:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// The largest possible separator value</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">const</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> max_separator <span style="color: #333333;">=</span> <span style="color: #333333;">~</span>((<span style="color: #333399; font-weight: bold;">uintptr_t</span>) (<span style="color: #0000dd; font-weight: bold;">0</span>)); <span style="color: #888888;">// Inner entry. The child tree contains all entries &lt;= separator</span> <span style="color: #008800; font-weight: bold;">struct</span> inner_entry { <span style="color: #333399; font-weight: bold;">uintptr_t</span> separator; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>child; }; <span style="color: #888888;">// Leaf entry. Stores an object entry</span> <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry { <span style="color: #333399; font-weight: bold;">uintptr_t</span> base, size; <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob; }; <span style="color: #888888;">// node types</span> <span style="color: #008800; font-weight: bold;">enum</span> node_type { btree_node_inner, btree_node_leaf, btree_node_free }; <span style="color: #888888;">// Node sizes. Chosen such that the result size is roughly 256 bytes</span> <span style="color: #557799;">#define max_fanout_inner 15</span> <span style="color: #557799;">#define max_fanout_leaf 10</span> <span style="color: #888888;">// A btree node</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node { <span style="color: #888888;">// The version lock used for optimistic lock coupling</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock version_lock; <span style="color: #888888;">// The number of entries</span> <span style="color: #333399; font-weight: bold;">unsigned</span> entry_count; <span style="color: #888888;">// The type</span> <span style="color: #008800; font-weight: bold;">enum</span> node_type type; <span style="color: #888888;">// The payload</span> <span style="color: #008800; font-weight: bold;">union</span> { <span style="color: #888888;">// The inner nodes have fence keys, i.e., the right-most entry includes a</span> <span style="color: #888888;">// separator</span> <span style="color: #008800; font-weight: bold;">struct</span> inner_entry children[max_fanout_inner]; <span style="color: #008800; font-weight: bold;">struct</span> leaf_entry entries[max_fanout_leaf]; } content; }; </pre></div> <p></p><p>To simplify the subsequent code we define a number of helper functions that are largely straight-forward, and that allow to distinguish leaf and inner node and provide searching within a node. 
The lock operations directly map to operations on the version lock:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Is an inner node?</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_is_inner</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">==</span> btree_node_inner; } <span style="color: #888888;">// Is a leaf node?</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_is_leaf</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">==</span> btree_node_leaf; } <span style="color: #888888;">// Should the node be merged?</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_needs_merge</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">&lt;</span> (btree_node_is_inner (n) <span style="color: #333333;">?</span> (max_fanout_inner <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>) <span style="color: #333333;">:</span> (max_fanout_leaf <span style="color: #333333;">/</span> <span style="color: #0000dd; font-weight: bold;">2</span>)); } <span style="color: #888888;">// Get the fence key for inner nodes</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> <span style="color: #0066bb; font-weight: bold;">btree_node_get_fence_key</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #888888;">// For inner nodes we just return our right-most entry</span> <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>content.children[n<span style="color: 
#333333;">-&gt;</span>entry_count <span style="color: #333333;">-</span> <span style="color: #0000dd; font-weight: bold;">1</span>].separator; } <span style="color: #888888;">// Find the position for a slot in an inner node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">unsigned</span> <span style="color: #0066bb; font-weight: bold;">btree_node_find_inner_slot</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> value) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>, ec <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>entry_count; index <span style="color: #333333;">!=</span> ec; <span style="color: #333333;">++</span>index) <span style="color: #008800; font-weight: bold;">if</span> (n<span style="color: #333333;">-&gt;</span>content.children[index].separator <span style="color: #333333;">&gt;=</span> value) <span style="color: #008800; font-weight: bold;">return</span> index; <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>entry_count; } <span style="color: #888888;">// Find the position for a slot in a leaf node</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">unsigned</span> <span style="color: #0066bb; font-weight: bold;">btree_node_find_leaf_slot</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> value) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>, ec <span style="color: #333333;">=</span> n<span style="color: #333333;">-&gt;</span>entry_count; index <span style="color: #333333;">!=</span> ec; <span style="color: #333333;">++</span>index) <span style="color: #008800; font-weight: bold;">if</span> (n<span style="color: #333333;">-&gt;</span>content.entries[index].base <span style="color: #333333;">+</span> n<span style="color: #333333;">-&gt;</span>content.entries[index].size <span style="color: #333333;">&gt;</span> value) <span style="color: #008800; font-weight: bold;">return</span> index; <span style="color: #008800; font-weight: bold;">return</span> n<span style="color: #333333;">-&gt;</span>entry_count; } <span style="color: #888888;">// Try to lock the node exclusive</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_try_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { <span style="color: #008800; font-weight: bold;">return</span> version_lock_try_lock_exclusive (<span style="color: 
#333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock)); } <span style="color: #888888;">// Lock the node exclusive, blocking as needed</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_node_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { version_lock_lock_exclusive (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock)); } <span style="color: #888888;">// Release a locked node and increase the version lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_node_unlock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n) { version_lock_unlock_exclusive (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock)); } <span style="color: #888888;">// Acquire an optimistic "lock". Note that this does not lock at all, it</span> <span style="color: #888888;">// only allows for validation later</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_lock_optimistic</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> <span style="color: #333333;">*</span>lock) { <span style="color: #008800; font-weight: bold;">return</span> version_lock_lock_optimistic (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock), lock); } <span style="color: #888888;">// Validate a previously acquire lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">btree_node_validate</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>n, <span style="color: #333399; font-weight: bold;">uintptr_t</span> lock) { <span style="color: #008800; font-weight: bold;">return</span> version_lock_validate (<span style="color: #333333;">&amp;</span>(n<span style="color: #333333;">-&gt;</span>version_lock), lock); } </pre></div> <p></p><p>With that we come to the b-tree itself, which consists of a pointer to the root node, a version lock to protect the root, and a free list:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 
0px;"><span style="color: #888888;">// A btree. Suitable for static initialization, all members are zero at the</span> <span style="color: #888888;">// beginning</span> <span style="color: #008800; font-weight: bold;">struct</span> btree { <span style="color: #888888;">// The root of the btree</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>root; <span style="color: #888888;">// The free list of released node</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>free_list; <span style="color: #888888;">// The version lock used to protect the root</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock root_lock; }; <span style="color: #888888;">// Initialize a btree. Not actually used, just for exposition</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_init</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t) { t<span style="color: #333333;">-&gt;</span>root <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; t<span style="color: #333333;">-&gt;</span>free_list <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; t<span style="color: #333333;">-&gt;</span>root_lock.version_lock <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; }; </pre></div> <p></p><p>We need that free list because readers operate without visible synchronization. If we would simply free() a node, we would risk that a concurrent reader is still looking at that node, even though no relevant data exist on that node. (But the reader does not know this until it did the read). To prevent that, we put freed nodes in the free list, which ensures that the memory location remains valid, and prefer using nodes from the free list when allocating new nodes:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Allocate a node. 
This node will be returned in locked exclusive state</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">btree_allocate_node</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #333399; font-weight: bold;">bool</span> inner) { <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #007020;">true</span>) { <span style="color: #888888;">// Try the free list first</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next_free <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (next_free) { <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>btree_node_try_lock_exclusive (next_free)) <span style="color: #008800; font-weight: bold;">continue</span>; <span style="color: #888888;">// The node might no longer be free, check that again after acquiring</span> <span style="color: #888888;">// the exclusive lock</span> <span style="color: #008800; font-weight: bold;">if</span> (next_free<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">==</span> btree_node_free) { <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>ex <span style="color: #333333;">=</span> next_free; <span style="color: #008800; font-weight: bold;">if</span> (__atomic_compare_exchange_n ( <span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), <span style="color: #333333;">&amp;</span>ex, next_free<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { next_free<span style="color: #333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; next_free<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> inner <span style="color: #333333;">?</span> btree_node_inner <span style="color: #333333;">:</span> btree_node_leaf; <span style="color: #008800; font-weight: bold;">return</span> next_free; } } btree_node_unlock_exclusive (next_free); <span style="color: #008800; font-weight: bold;">continue</span>; } <span style="color: #888888;">// No free node available, allocate a new one</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>new_node <span style="color: #333333;">=</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>) (malloc (<span style="color: #008800; font-weight: bold;">sizeof</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree_node))); version_lock_initialize_locked_exclusive ( <span style="color: #333333;">&amp;</span>(new_node<span style="color: #333333;">-&gt;</span>version_lock)); <span style="color: #888888;">// initialize the node in locked state</span> new_node<span style="color: 
#333333;">-&gt;</span>entry_count <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; new_node<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> inner <span style="color: #333333;">?</span> btree_node_inner <span style="color: #333333;">:</span> btree_node_leaf; <span style="color: #008800; font-weight: bold;">return</span> new_node; } } <span style="color: #888888;">// Release a node. This node must be currently locked exclusively and will</span> <span style="color: #888888;">// be placed in the free list</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_release_node</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>node) { <span style="color: #888888;">// We cannot release the memory immediately because there might still be</span> <span style="color: #888888;">// concurrent readers on that node. Put it in the free list instead</span> node<span style="color: #333333;">-&gt;</span>type <span style="color: #333333;">=</span> btree_node_free; <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next_free <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">do</span> { node<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child <span style="color: #333333;">=</span> next_free; } <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #333333;">!</span>__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>free_list), <span style="color: #333333;">&amp;</span>next_free, node, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)); btree_node_unlock_exclusive (node); } </pre></div> <p></p><p>The last remaining infrastructure code is destroying the b-tree. Here, we simply walk the tree recursively and release all nodes. The recursion is safe because the depth is bound logarithmic:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Recursively release a tree. 
The btree is by design very shallow, thus</span> <span style="color: #888888;">// we can risk recursion here</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_release_tree_recursively</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t, <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>node) { btree_node_lock_exclusive (node); <span style="color: #008800; font-weight: bold;">if</span> (btree_node_is_inner (node)) { <span style="color: #008800; font-weight: bold;">for</span> (<span style="color: #333399; font-weight: bold;">unsigned</span> index <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; index <span style="color: #333333;">&lt;</span> node<span style="color: #333333;">-&gt;</span>entry_count; <span style="color: #333333;">++</span>index) btree_release_tree_recursively (t, node<span style="color: #333333;">-&gt;</span>content.children[index].child); } btree_release_node (t, node); } <span style="color: #888888;">// Destroy a tree and release all nodes</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">btree_destroy</span> (<span style="color: #008800; font-weight: bold;">struct</span> btree <span style="color: #333333;">*</span>t) { <span style="color: #888888;">// Disable the mechanism before cleaning up</span> <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>old_root <span style="color: #333333;">=</span> __atomic_exchange_n (<span style="color: #333333;">&amp;</span>(t<span style="color: #333333;">-&gt;</span>root), <span style="color: #007020;">NULL</span>, __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (old_root) btree_release_tree_recursively (t, old_root); <span style="color: #888888;">// Release all free nodes</span> <span style="color: #008800; font-weight: bold;">while</span> (t<span style="color: #333333;">-&gt;</span>free_list) { <span style="color: #008800; font-weight: bold;">struct</span> btree_node <span style="color: #333333;">*</span>next <span style="color: #333333;">=</span> t<span style="color: #333333;">-&gt;</span>free_list<span style="color: #333333;">-&gt;</span>content.children[<span style="color: #0000dd; font-weight: bold;">0</span>].child; free (t<span style="color: #333333;">-&gt;</span>free_list); t<span style="color: #333333;">-&gt;</span>free_list <span style="color: #333333;">=</span> next; } } </pre></div> <p></p><p>This finishes the infrastructure part of the b-tree; the high-level insert/remove/lookup functions are covered <a href="https://databasearchitects.blogspot.com/2022/06/btreeoperations.html">in the next article</a>.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-35356182673644213692022-06-26T10:52:00.002+02:002022-06-26T10:52:43.297+02:00Making unwinding through JIT-ed code scalable - Optimistic Lock Coupling<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>When thinking 
about exception handling it is reasonable to assume that we will have far more unwinding requests than changes to the unwinding tables. In our setup, the tables only change when JITed code is added to or removed from the program.&nbsp; That is always expensive to begin with due the mprotect calls, TLB shootdowns, etc. Thus we can safely assume that we will have at most a few hundred updates per second even in extreme cases, probably far less. Lookups however can easily reach thousands or even millions per second, as we do one lookup per frame.</p><p>This motivates us to use a read-optimized data structure, a b-tree with optimistic lock coupling: Writers use traditional lock coupling (lock parent node exclusive, lock child node exclusive, release parent node, lock child of child, etc.), which works fine as long as there is not too much contention. Readers however have to do something else, as we expect thousands of them. One might be tempted to use a rw-lock for readers, but that does not help. Locking an rw-lock in shared mode causes an atomic write, which makes the threads fight over the cache line of the lock even if there is no (logical) contention.</p><p>Instead, we use version locks, where readers do no write at all:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Common logic for version locks</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock { <span style="color: #888888;">// The lock itself. The lowest bit indicates an exclusive lock,</span> <span style="color: #888888;">// the second bit indicates waiting threads. All other bits are</span> <span style="color: #888888;">// used as counter to recognize changes.</span> <span style="color: #888888;">// Overflows are okay here, we must only prevent overflow to the</span> <span style="color: #888888;">// same value within one lock_optimistic/validate</span> <span style="color: #888888;">// range. Even on 32 bit platforms that would require 1 billion</span> <span style="color: #888888;">// frame registrations within the time span of a few assembler</span> <span style="color: #888888;">// instructions.</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> version_lock; }; <span style="color: #557799;">#ifdef __GTHREAD_HAS_COND</span> <span style="color: #888888;">// We should never get contention within the tree as it rarely changes.</span> <span style="color: #888888;">// But if we ever do get contention we use these for waiting</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">__gthread_mutex_t</span> version_lock_mutex <span style="color: #333333;">=</span> __GTHREAD_MUTEX_INIT; <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">__gthread_cond_t</span> version_lock_cond <span style="color: #333333;">=</span> __GTHREAD_COND_INIT; <span style="color: #557799;">#endif</span> </pre></div> <br /><p></p><p>The version lock consists of a single number, where the lower two bits indicate the lock status. 
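</p><p>The following summary of that bit layout is for exposition only; the macro names below are made up for this article and do not appear in the actual code:</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">
// Exposition only: the bit layout of the version lock word
#define VL_EXCLUSIVE_BIT ((uintptr_t) 1)    /* lowest bit: locked exclusively      */
#define VL_WAITERS_BIT   ((uintptr_t) 2)    /* second bit: some thread is sleeping */
#define VL_COUNTER_MASK  (~((uintptr_t) 3)) /* remaining bits: version counter     */
// An exclusive unlock bumps the counter and clears both low bits, which
// is exactly the (state + 4) &amp; ~3 computation shown further below.
</pre></div><p>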
As we will see below, for exclusive locks we will use them just like a regular mutex, with the addition that the higher bits are incremented on every unlock. If we do get contention we use version_lock_mutex and version_lock_cond for sleeping, but that should be very rare. For readers we do not modify the lock at all but just remember its state. After the read is finished we check the state again. If it changed, we did a racy read, and try again. Note that such a locking mechanism is sometimes called a <a href="https://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf">sequence lock</a> in the literature. The great advantage is that readers can run fully parallel, and the performance is excellent as long as writes are uncommon.<br /></p><p>Initializing the lock and trying to acquire the lock in exclusive mode are straightforward:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Initialize in locked state</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">version_lock_initialize_locked_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { vl<span style="color: #333333;">-&gt;</span>version_lock <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">1</span>; } <span style="color: #888888;">// Try to lock the node exclusive</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">version_lock_try_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">false</span>; <span style="color: #008800; font-weight: bold;">return</span> __atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST); } </pre></div> <p></p><p>We simply set the lock to 1 to initialize a new lock in locked state. The try_lock tries to change the lowest bit, and fails if that is not possible.</p><p>For blocking lock_exclusive calls we first try the same as try_lock. 
If that fails, we acquire the mutex, try to lock again, and sleep if we did not get the lock:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Lock the node exclusive, blocking as needed</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">version_lock_lock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { <span style="color: #557799;">#ifndef __GTHREAD_HAS_COND</span> <span style="color: #997700; font-weight: bold;">restart:</span> <span style="color: #557799;">#endif</span> <span style="color: #888888;">// We should virtually never get contention here, as frame</span> <span style="color: #888888;">// changes are rare</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>)) { <span style="color: #008800; font-weight: bold;">if</span> (__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) <span style="color: #008800; font-weight: bold;">return</span>; } <span style="color: #888888;">// We did get contention, wait properly</span> <span style="color: #557799;">#ifdef __GTHREAD_HAS_COND</span> __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>version_lock_mutex); state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">while</span> (<span style="color: #007020;">true</span>) { <span style="color: #888888;">// Check if the lock is still held</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>)) { <span style="color: #008800; font-weight: bold;">if</span> (__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">1</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) { __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>version_lock_mutex); <span style="color: #008800; font-weight: bold;">return</span>; } <span style="color: #008800; 
font-weight: bold;">else</span> { <span style="color: #008800; font-weight: bold;">continue</span>; } } <span style="color: #888888;">// Register waiting thread</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">2</span>)) { <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>__atomic_compare_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), <span style="color: #333333;">&amp;</span>state, state <span style="color: #333333;">|</span> <span style="color: #0000dd; font-weight: bold;">2</span>, <span style="color: #007020;">false</span>, __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST)) <span style="color: #008800; font-weight: bold;">continue</span>; } <span style="color: #888888;">// And sleep</span> __gthread_cond_wait (<span style="color: #333333;">&amp;</span>version_lock_cond, <span style="color: #333333;">&amp;</span>version_lock_mutex); state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); } <span style="color: #557799;">#else</span> <span style="color: #888888;">// Spin if we do not have condition variables available</span> <span style="color: #888888;">// We expect no contention here, spinning should be okay</span> <span style="color: #008800; font-weight: bold;">goto</span> restart; <span style="color: #557799;">#endif</span> } </pre></div> <br /><p></p><p>When sleeping we set the second-lowest bit, too, to indicate waiting threads. The unlock function checks that bit, and wakes up the threads if needed:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Release a locked node and increase the version lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">version_lock_unlock_exclusive</span> (<span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl) { <span style="color: #888888;">// increase version, reset exclusive lock bits</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #333399; font-weight: bold;">uintptr_t</span> ns <span style="color: #333333;">=</span> (state <span style="color: #333333;">+</span> <span style="color: #0000dd; font-weight: bold;">4</span>) <span style="color: #333333;">&amp;</span> (<span style="color: #333333;">~</span>((<span style="color: #333399; font-weight: bold;">uintptr_t</span>) <span style="color: #0000dd; font-weight: bold;">3</span>)); state <span style="color: #333333;">=</span> __atomic_exchange_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), ns, __ATOMIC_SEQ_CST); <span 
style="color: #557799;">#ifdef __GTHREAD_HAS_COND</span> <span style="color: #008800; font-weight: bold;">if</span> (state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">2</span>) { <span style="color: #888888;">// Wake up waiting threads. This should be extremely rare.</span> __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>version_lock_mutex); __gthread_cond_broadcast (<span style="color: #333333;">&amp;</span>version_lock_cond); __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>version_lock_mutex); } <span style="color: #557799;">#endif</span> } </pre></div> <p></p><p>Readers do not modify the lock at all. When they "lock" in shared mode they store the current state, and check if the state is still the same in the validate function:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 120%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #888888;">// Acquire an optimistic "lock". Note that this does not lock at all, it</span> <span style="color: #888888;">// only allows for validation later</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">version_lock_lock_optimistic</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl, <span style="color: #333399; font-weight: bold;">uintptr_t</span> <span style="color: #333333;">*</span>lock) { <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #333333;">*</span>lock <span style="color: #333333;">=</span> state; <span style="color: #888888;">// Acquiring the lock fails when there is currently an exclusive lock</span> <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #333333;">!</span>(state <span style="color: #333333;">&amp;</span> <span style="color: #0000dd; font-weight: bold;">1</span>); } <span style="color: #888888;">// Validate a previously acquire lock</span> <span style="color: #008800; font-weight: bold;">static</span> <span style="color: #008800; font-weight: bold;">inline</span> <span style="color: #333399; font-weight: bold;">bool</span> <span style="color: #0066bb; font-weight: bold;">version_lock_validate</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #008800; font-weight: bold;">struct</span> version_lock <span style="color: #333333;">*</span>vl, <span style="color: #333399; font-weight: bold;">uintptr_t</span> lock) { <span style="color: #888888;">// Prevent the reordering of non-atomic loads behind the atomic load.</span> <span style="color: #888888;">// Hans Boehm, Can Seqlocks Get Along with Programming Language Memory</span> <span style="color: #888888;">// Models?, Section 4.</span> __atomic_thread_fence (__ATOMIC_ACQUIRE); <span style="color: #888888;">// Check that the node 
is still in the same state</span> <span style="color: #333399; font-weight: bold;">uintptr_t</span> state <span style="color: #333333;">=</span> __atomic_load_n (<span style="color: #333333;">&amp;</span>(vl<span style="color: #333333;">-&gt;</span>version_lock), __ATOMIC_SEQ_CST); <span style="color: #008800; font-weight: bold;">return</span> (state <span style="color: #333333;">==</span> lock); } </pre></div> <p></p><p>We fail early if the lock is currently locked exclusive.</p><p>Note that optimistic lock coupling conceptually does the same as classical lock coupling. The main difference is that we have to validate a lock before we can act upon a value that we have read. This means: 1) lock parent, 2) fetch child pointer, 3) validate parent, restart if validation fails, 4) lock child, etc.</p><p>In the next article we look at the <a href="https://databasearchitects.blogspot.com/2022/06/btree.html">b-tree data structure</a> that we use to store the frames.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-34427107927557544452022-06-26T10:48:00.001+02:002022-06-26T10:48:52.446+02:00Making unwinding through JIT-ed code scalable - Replacing the gcc hooks<p>This article is part of the series about scalable unwinding that starts <a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">here</a>.</p><p>As discussed in the previous article, the gcc mechanism does not scale because it uses a global lock to protect its list of unwinding frames. To solve that problem, we replace that list with a read-optimized b-tree that allows for concurrent reads and writes. In this article we just discuss the patches to gcc necessary to enable that mechanism; the b-tree itself is discussed in subsequent articles.</p><p>We start by replacing the old fast path mechanism with a b-tree root:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: navy; font-weight: bold;">index 8ee55be5675..d546b9e4c43 100644</span> <span style="color: #a00000;">--- a/libgcc/unwind-dw2-fde.c</span> <span style="color: #00a000;">+++ b/libgcc/unwind-dw2-fde.c</span> <span style="color: purple; font-weight: bold;">@@ -42,15 +42,34 @@ see the files COPYING3 and COPYING.RUNTIME respectively. If not, see</span> #endif #endif <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+#include "unwind-dw2-btree.h"</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+static struct btree registered_frames;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+release_registered_frames (void) __attribute__ ((destructor (110)));</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+release_registered_frames (void)</span> <span style="color: #00a000;">+{</span> <span style="color: #00a000;">+ /* Release the b-tree and all frames. 
Frame releases that happen later are</span> <span style="color: #00a000;">+ * silently ignored */</span> <span style="color: #00a000;">+ btree_destroy (&amp;registered_frames);</span> <span style="color: #00a000;">+}</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+get_pc_range (const struct object *ob, uintptr_t *range);</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+init_object (struct object *ob);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+#else</span> <span style="color: #00a000;">+</span> /* The unseen_objects list contains objects that have been registered but not yet categorized in any way. The seen_objects list has had its pc_begin and count fields initialized at minimum, and is sorted by decreasing value of pc_begin. */ static struct object *unseen_objects; static struct object *seen_objects; <span style="color: #a00000;">-#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #a00000;">-static int any_objects_registered;</span> <span style="color: #a00000;">-#endif</span> #ifdef __GTHREAD_MUTEX_INIT static __gthread_mutex_t object_mutex = __GTHREAD_MUTEX_INIT; <span style="color: purple; font-weight: bold;">@@ -78,6 +97,7 @@ init_object_mutex_once (void)</span> static __gthread_mutex_t object_mutex; #endif #endif <span style="color: #00a000;">+#endif</span> </pre></div> <br /><p></p><p></p><p>When the platform supports atomics (ATOMIC_FDE_FAST_PATH), we replace the whole mechanism with one b-tree, whose root is registered_frames. Neither the object lists nor the mutex exist on such platforms. To avoid leaking the memory for the b-tree we register release_registered_frames as a destructor with a very late priority on shutdown. Note that it is still safe to throw exceptions even when the b-tree is gone, as long as unwinding does not have to go through JITed code that was registered here. The destroyed b-tree behaves like an empty b-tree. 
get_pc_range and init_object are forward declared because we need them already when inserting into the b-tree.</p><p>&nbsp;When registering frames we no longer grab a mutex, we immediately store the frames in the b-tree: <br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -99,23 +119,23 @@ __register_frame_info_bases (const void *begin, struct object *ob,</span> ob-&gt;fde_end = NULL; #endif <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+ // Initialize eagerly to avoid locking later</span> <span style="color: #00a000;">+ init_object (ob);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ // And register the frame</span> <span style="color: #00a000;">+ uintptr_t range[2];</span> <span style="color: #00a000;">+ get_pc_range (ob, range);</span> <span style="color: #00a000;">+ btree_insert (&amp;registered_frames, range[0], range[1] - range[0], ob);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); ob-&gt;next = unseen_objects; unseen_objects = ob; <span style="color: #a00000;">-#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #a00000;">- /* Set flag that at least one library has registered FDEs.</span> <span style="color: #a00000;">- Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #a00000;">- loading/initialization happens-before using that library in other</span> <span style="color: #a00000;">- threads (in particular unwinding with that library's functions</span> <span style="color: #a00000;">- appearing in the backtraces). Calling that library's functions</span> <span style="color: #a00000;">- without waiting for the library to initialize would be racy. 
*/</span> <span style="color: #a00000;">- if (!any_objects_registered)</span> <span style="color: #a00000;">- __atomic_store_n (&amp;any_objects_registered, 1, __ATOMIC_RELAXED);</span> <span style="color: #a00000;">-#endif</span> __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> } void <span style="color: purple; font-weight: bold;">@@ -153,23 +173,23 @@ __register_frame_info_table_bases (void *begin, struct object *ob,</span> ob-&gt;s.b.from_array = 1; ob-&gt;s.b.encoding = DW_EH_PE_omit; <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+ // Initialize eagerly to avoid locking later</span> <span style="color: #00a000;">+ init_object (ob);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ // And register the frame</span> <span style="color: #00a000;">+ uintptr_t range[2];</span> <span style="color: #00a000;">+ get_pc_range (ob, range);</span> <span style="color: #00a000;">+ btree_insert (&amp;registered_frames, range[0], range[1] - range[0], ob);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); ob-&gt;next = unseen_objects; unseen_objects = ob; <span style="color: #a00000;">-#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #a00000;">- /* Set flag that at least one library has registered FDEs.</span> <span style="color: #a00000;">- Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #a00000;">- loading/initialization happens-before using that library in other</span> <span style="color: #a00000;">- threads (in particular unwinding with that library's functions</span> <span style="color: #a00000;">- appearing in the backtraces). Calling that library's functions</span> <span style="color: #a00000;">- without waiting for the library to initialize would be racy. */</span> <span style="color: #a00000;">- if (!any_objects_registered)</span> <span style="color: #a00000;">- __atomic_store_n (&amp;any_objects_registered, 1, __ATOMIC_RELAXED);</span> <span style="color: #a00000;">-#endif</span> __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> } </pre></div> <br /><p></p><p>Note that there is a subtle change of logic here: The old mechanism stores all incoming frames as they are in the unseen_objects chain without even looking at them. Then, during unwinding, it inspects the frames to find out the range of program counter values (pc_range), and sorts them in the seen_objects list. This fundamentally requires a lock, as the objects are modified during unwinding. To avoid that, the new code immediately computes the PC range and lets the b-tree do the sorting, which keeps the frames immutable during unwinding.</p><p>When searching for a frame during unwinding we delegate everything to the b-tree, which provides lock-free lookups. 
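</p><p>To make the keying explicit, here is a sketch of how the patched code uses the b-tree; the btree_* calls are the real ones from the diffs above and below, while the surrounding variables are made up for illustration:</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">
/* Sketch only: the PC range computed by get_pc_range acts as the key */
uintptr_t range[2];
get_pc_range (ob, range);

/* registration: store the object under the start of its PC range */
btree_insert (&amp;registered_frames, range[0], range[1] - range[0], ob);

/* unwinding: find the object whose range contains the program counter */
struct object *hit = btree_lookup (&amp;registered_frames, (uintptr_t) pc);

/* deregistration: remove the object by the start of its PC range */
struct object *removed = btree_remove (&amp;registered_frames, range[0]);
</pre></div><p>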
No mutex is acquired on platforms that support atomics:<br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -1033,17 +1161,12 @@ _Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)</span> const fde *f = NULL; #ifdef ATOMIC_FDE_FAST_PATH <span style="color: #a00000;">- /* For targets where unwind info is usually not registered through these</span> <span style="color: #a00000;">- APIs anymore, avoid taking a global lock.</span> <span style="color: #a00000;">- Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #a00000;">- loading/initialization happens-before using that library in other</span> <span style="color: #a00000;">- threads (in particular unwinding with that library's functions</span> <span style="color: #a00000;">- appearing in the backtraces). Calling that library's functions</span> <span style="color: #a00000;">- without waiting for the library to initialize would be racy. */</span> <span style="color: #a00000;">- if (__builtin_expect (!__atomic_load_n (&amp;any_objects_registered,</span> <span style="color: #a00000;">- __ATOMIC_RELAXED), 1))</span> <span style="color: #00a000;">+ ob = btree_lookup (&amp;registered_frames, (uintptr_t) pc);</span> <span style="color: #00a000;">+ if (!ob)</span> return NULL; <span style="color: #a00000;">-#endif</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ f = search_object (ob, pc);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); <span style="color: purple; font-weight: bold;">@@ -1081,6 +1204,7 @@ _Unwind_Find_FDE (void *pc, struct dwarf_eh_bases *bases)</span> fini: __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> if (f) { </pre></div> <br /><p></p><p>De-registering a frame works just the same as registering, we use get_pc_range to get the range and then remove it from the b-tree: <br /></p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -200,16 +220,33 @@ __register_frame_table (void *begin)</span> void * __deregister_frame_info_bases (const void *begin) { <span style="color: #a00000;">- struct object **p;</span> struct object *ob = 0; /* If .eh_frame is empty, we haven't registered. 
*/ if ((const uword *) begin == 0 || *(const uword *) begin == 0) return ob; <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+ // Find the corresponding PC range</span> <span style="color: #00a000;">+ struct object lookupob;</span> <span style="color: #00a000;">+ lookupob.tbase = 0;</span> <span style="color: #00a000;">+ lookupob.dbase = 0;</span> <span style="color: #00a000;">+ lookupob.u.single = begin;</span> <span style="color: #00a000;">+ lookupob.s.i = 0;</span> <span style="color: #00a000;">+ lookupob.s.b.encoding = DW_EH_PE_omit;</span> <span style="color: #00a000;">+#ifdef DWARF2_OBJECT_END_PTR_EXTENSION</span> <span style="color: #00a000;">+ lookupob.fde_end = NULL;</span> <span style="color: #00a000;">+#endif</span> <span style="color: #00a000;">+ uintptr_t range[2];</span> <span style="color: #00a000;">+ get_pc_range (&amp;lookupob, range);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ // And remove</span> <span style="color: #00a000;">+ ob = btree_remove (&amp;registered_frames, range[0]);</span> <span style="color: #00a000;">+#else</span> init_object_mutex_once (); __gthread_mutex_lock (&amp;object_mutex); <span style="color: #00a000;">+ struct object **p;</span> for (p = &amp;unseen_objects; *p ; p = &amp;(*p)-&gt;next) if ((*p)-&gt;u.single == begin) { <span style="color: purple; font-weight: bold;">@@ -241,6 +278,8 @@ __deregister_frame_info_bases (const void *begin)</span> out: __gthread_mutex_unlock (&amp;object_mutex); <span style="color: #00a000;">+#endif</span> <span style="color: #00a000;">+</span> gcc_assert (ob); return (void *) ob; } </pre></div> <p></p><p>We are nearly done with our changes to existing gcc code, we just need a few more helper functions for get_pc_range:</p><p> <!--HTML generated using hilite.me--></p><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: purple; font-weight: bold;">@@ -264,7 +303,7 @@ __deregister_frame (void *begin)</span> instead of an _Unwind_Context. */ static _Unwind_Ptr <span style="color: #a00000;">-base_from_object (unsigned char encoding, struct object *ob)</span> <span style="color: #00a000;">+base_from_object (unsigned char encoding, const struct object *ob)</span> { if (encoding == DW_EH_PE_omit) return 0; <span style="color: purple; font-weight: bold;">@@ -821,6 +860,91 @@ init_object (struct object* ob)</span> ob-&gt;s.b.sorted = 1; } <span style="color: #00a000;">+#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #00a000;">+/* Get the PC range from FDEs. The code is very similar to</span> <span style="color: #00a000;">+ classify_object_over_fdes and should be kept in sync with</span> <span style="color: #00a000;">+ that. 
The main difference is that classify_object_over_fdes</span> <span style="color: #00a000;">+ modifies the object, which we cannot do here */</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+get_pc_range_from_fdes (const struct object *ob, const fde *this_fde,</span> <span style="color: #00a000;">+ uintptr_t *range)</span> <span style="color: #00a000;">+{</span> <span style="color: #00a000;">+ const struct dwarf_cie *last_cie = 0;</span> <span style="color: #00a000;">+ int encoding = DW_EH_PE_absptr;</span> <span style="color: #00a000;">+ _Unwind_Ptr base = 0;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ for (; !last_fde (ob, this_fde); this_fde = next_fde (this_fde))</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ const struct dwarf_cie *this_cie;</span> <span style="color: #00a000;">+ _Unwind_Ptr mask, pc_begin, pc_range;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ /* Skip CIEs. */</span> <span style="color: #00a000;">+ if (this_fde-&gt;CIE_delta == 0)</span> <span style="color: #00a000;">+ continue;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ this_cie = get_cie (this_fde);</span> <span style="color: #00a000;">+ if (this_cie != last_cie)</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ last_cie = this_cie;</span> <span style="color: #00a000;">+ encoding = get_cie_encoding (this_cie);</span> <span style="color: #00a000;">+ base = base_from_object (encoding, ob);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ const unsigned char *p;</span> <span style="color: #00a000;">+ p = read_encoded_value_with_base (encoding, base, this_fde-&gt;pc_begin,</span> <span style="color: #00a000;">+ &amp;pc_begin);</span> <span style="color: #00a000;">+ read_encoded_value_with_base (encoding &amp; 0x0F, 0, p, &amp;pc_range);</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ /* Take care to ignore link-once functions that were removed.</span> <span style="color: #00a000;">+ In these cases, the function address will be NULL, but if</span> <span style="color: #00a000;">+ the encoding is smaller than a pointer a true NULL may not</span> <span style="color: #00a000;">+ be representable. Assume 0 in the representable bits is NULL. 
*/</span> <span style="color: #00a000;">+ mask = size_of_encoded_value (encoding);</span> <span style="color: #00a000;">+ if (mask &lt; sizeof (void *))</span> <span style="color: #00a000;">+ mask = (((_Unwind_Ptr) 1) &lt;&lt; (mask &lt;&lt; 3)) - 1;</span> <span style="color: #00a000;">+ else</span> <span style="color: #00a000;">+ mask = -1;</span> <span style="color: #00a000;">+ if ((pc_begin &amp; mask) == 0)</span> <span style="color: #00a000;">+ continue;</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+ _Unwind_Ptr pc_end = pc_begin + pc_range;</span> <span style="color: #00a000;">+ if ((!range[0]) &amp;&amp; (!range[1]))</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ range[0] = pc_begin;</span> <span style="color: #00a000;">+ range[1] = pc_end;</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ else</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ if (pc_begin &lt; range[0])</span> <span style="color: #00a000;">+ range[0] = pc_begin;</span> <span style="color: #00a000;">+ if (pc_end &gt; range[1])</span> <span style="color: #00a000;">+ range[1] = pc_end;</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+}</span> <span style="color: #00a000;">+</span> <span style="color: #00a000;">+/* Get the PC range for lookup */</span> <span style="color: #00a000;">+static void</span> <span style="color: #00a000;">+get_pc_range (const struct object *ob, uintptr_t *range)</span> <span style="color: #00a000;">+{</span> <span style="color: #00a000;">+ range[0] = range[1] = 0;</span> <span style="color: #00a000;">+ if (ob-&gt;s.b.sorted)</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ get_pc_range_from_fdes (ob, ob-&gt;u.sort-&gt;orig_data, range);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ else if (ob-&gt;s.b.from_array)</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ fde **p = ob-&gt;u.array;</span> <span style="color: #00a000;">+ for (; *p; ++p)</span> <span style="color: #00a000;">+ get_pc_range_from_fdes (ob, *p, range);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+ else</span> <span style="color: #00a000;">+ {</span> <span style="color: #00a000;">+ get_pc_range_from_fdes (ob, ob-&gt;u.single, range);</span> <span style="color: #00a000;">+ }</span> <span style="color: #00a000;">+}</span> <span style="color: #00a000;">+#endif</span> <span style="color: #00a000;">+</span> /* A linear search through a set of FDEs for the given PC. This is used when there was insufficient memory to allocate and sort an array. */ </pre></div> <p></p><p>The code finds out the range of possible pc values for the FDE. Conceptually it does nearly the same as the already existing function classify_object_over_fdes. 
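</p><p>A small worked example of what this range computation produces (the addresses are invented for illustration):</p><div style="background: rgb(255, 255, 255); border: solid gray; border-width: 0.1em 0.1em 0.1em 0.8em; overflow: auto; padding: 0.2em 0.6em;"><pre style="line-height: 125%; margin: 0px;">
/* Invented example: an object with three FDEs covering
     [0x1000, 0x1040), [0x1200, 0x1280) and [0x1100, 0x1180)
   ends up with
     range[0] = 0x1000   (smallest pc_begin seen)
     range[1] = 0x1280   (largest pc_end seen)
   i.e. one contiguous key range for the whole object, even if the
   individual functions leave gaps in between. */
</pre></div><p>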
Unfortunately we cannot reuse that code, as classify_object_over_fdes modifies the object in order to speed up subsequent searches, and we require our data to be immutable.</p><p>In the next article we look at <a href="https://databasearchitects.blogspot.com/2022/06/optimisticlockcoupling.html">optimistic lock coupling</a>, which is the mechanism that we use to allow for lock-free readers.</p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-6412425750200813002022-06-26T10:46:00.001+02:002022-06-26T11:16:10.760+02:00Making unwinding through JIT-ed code scalable<p>&nbsp;Exceptions are a very handy mechanism to propagate errors in C++ programs, but unfortunately <a href="https://isocpp.org/files/papers/P2544R0.html">they do not scale very well</a>. In all common C++ implementations the unwinding mechanism takes a global lock during unwinding, which has disastrous consequences when the number of threads is high. On a machine with 256 hardware contexts we see worse-than-single-threaded behavior even for relatively modest failure rates.</p><p>Fortunately, Florian Weimer fixed one contention point in gcc 12 on systems with glibc 2.35 or newer, which gives us scalable exceptions as long as no JIT-ed code has been registered. Unfortunately, our system does register JIT-ed code, which means exception unwinding in our code base is still single-threaded in practice.</p><p>But we can fix that by teaching gcc to store the unwinding information in a read-optimized b-tree, which allows for fully parallel unwinding without any atomic writes. There is a&nbsp;<a href="https://gcc.gnu.org/pipermail/gcc-patches/2022-June/597256.html">gcc patch</a> that does just that, but unfortunately it is quite involved and difficult to review. This article series thus explains all parts of the patch and shows how a read-optimized b-tree can be implemented lock-free.&nbsp;</p><p>In order to keep the article length somewhat reasonable, the discussion is broken into parts:</p><ol style="text-align: left;"><li><a href="https://databasearchitects.blogspot.com/2022/06/making-unwinding-through-jit-ed-code.html">The problem (this article)</a></li><li><a href="https://databasearchitects.blogspot.com/2022/06/replacinggcchooks.html">Replacing the gcc hooks</a>&nbsp;</li><li><a href="https://databasearchitects.blogspot.com/2022/06/optimisticlockcoupling.html">Optimistic Lock Coupling</a></li><li><a href="https://databasearchitects.blogspot.com/2022/06/btree.html">The b-tree</a>&nbsp;</li><li><a href="https://databasearchitects.blogspot.com/2022/06/btreeoperations.html">b-tree operations</a>&nbsp;&nbsp; <br /></li></ol><p>When unwinding exceptions, the unwinder has to find the corresponding unwinding information for every call frame on the stack between the throw and the catch. gcc uses two different mechanisms for that: For ahead-of-time compiled code it asks glibc to find the unwinding information using either dl_iterate_phdr (on older systems) or _dl_find_object (on systems with glibc 2.35 or newer). Note that this mapping is not static, as shared libraries could be added or removed at any time, potentially during a concurrent unwind. For that reason dl_iterate_phdr was protected by a global mutex, which clearly does not scale. _dl_find_object avoids that mutex by using a lock-free data structure.
But that only affects ahead-of-time compiled code, for JITed code libgcc uses <a href="https://github.com/gcc-mirror/gcc/blob/fc259b522c0f8b7bbca8e7adcd3da63330094a34/libgcc/unwind-dw2-fde.c">a different mechanism</a>, which we now discuss:</p><p>For JITed code the emitter has to explicitly register the unwinding information using <span class="pl-en">__register_frame_info_bases (and the very similar </span><span class="pl-en">__register_frame_info_table_bases):</span></p><!--HTML generated using hilite.me--><div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow-wrap: normal; overflow: scroll; padding: 0.2em 0.6em; width: 110%; word-wrap: normal;"><pre style="line-height: 125%; margin: 0px; overflow-wrap: normal; word-wrap: normal;"><span style="color: #333399; font-weight: bold;">void</span> <span style="color: #0066bb; font-weight: bold;">__register_frame_info_bases</span> (<span style="color: #008800; font-weight: bold;">const</span> <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>begin, <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob, <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>tbase, <span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>dbase) { <span style="color: #888888;">/* If .eh_frame is empty, don't register at all. */</span> <span style="color: #008800; font-weight: bold;">if</span> ((<span style="color: #008800; font-weight: bold;">const</span> uword <span style="color: #333333;">*</span>) begin <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span> <span style="color: #333333;">||</span> <span style="color: #333333;">*</span>(<span style="color: #008800; font-weight: bold;">const</span> uword <span style="color: #333333;">*</span>) begin <span style="color: #333333;">==</span> <span style="color: #0000dd; font-weight: bold;">0</span>) <span style="color: #008800; font-weight: bold;">return</span>; ob<span style="color: #333333;">-&gt;</span>pc_begin <span style="color: #333333;">=</span> (<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>)<span style="color: #333333;">-</span><span style="color: #0000dd; font-weight: bold;">1</span>; ob<span style="color: #333333;">-&gt;</span>tbase <span style="color: #333333;">=</span> tbase; ob<span style="color: #333333;">-&gt;</span>dbase <span style="color: #333333;">=</span> dbase; ob<span style="color: #333333;">-&gt;</span>u.single <span style="color: #333333;">=</span> begin; ob<span style="color: #333333;">-&gt;</span>s.i <span style="color: #333333;">=</span> <span style="color: #0000dd; font-weight: bold;">0</span>; ob<span style="color: #333333;">-&gt;</span>s.b.encoding <span style="color: #333333;">=</span> DW_EH_PE_omit; <span style="color: #557799;">#ifdef DWARF2_OBJECT_END_PTR_EXTENSION</span> ob<span style="color: #333333;">-&gt;</span>fde_end <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; <span style="color: #557799;">#endif</span> init_object_mutex_once (); __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>object_mutex); ob<span style="color: #333333;">-&gt;</span>next <span style="color: 
#333333;">=</span> unseen_objects; unseen_objects <span style="color: #333333;">=</span> ob; <span style="color: #557799;">#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #888888;">/* Set flag that at least one library has registered FDEs.</span> <span style="color: #888888;"> Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #888888;"> loading/initialization happens-before using that library in other</span> <span style="color: #888888;"> threads (in particular unwinding with that library's functions</span> <span style="color: #888888;"> appearing in the backtraces). Calling that library's functions</span> <span style="color: #888888;"> without waiting for the library to initialize would be racy. */</span> <span style="color: #008800; font-weight: bold;">if</span> (<span style="color: #333333;">!</span>any_objects_registered) __atomic_store_n (<span style="color: #333333;">&amp;</span>any_objects_registered, <span style="color: #0000dd; font-weight: bold;">1</span>, __ATOMIC_RELAXED); <span style="color: #557799;">#endif</span> __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>object_mutex); } </pre></div> <p><span class="pl-en">&nbsp;Which conceptually is a simple thing: It grabs a mutex, puts the object information into a list, and released the mutex. Note the code within the </span><span style="color: #557799;">ATOMIC_FDE_FAST_PATH </span><span class="pl-en">block: As we will see below, that unwinding mechanism is very slow. Thus, libgcc tries to avoid taking that path by remembering if any JITed code was registered at all. If not, it stops unwinding immediately. But that mechanism does not help if we do have JITed code.</span></p><p><span class="pl-en">During unwinding, the unwinder calls&nbsp; </span><span style="color: #0066bb; font-weight: bold;">_Unwind_Find_FDE, </span>which traverses the list in order to find the corresponding unwind table for the given address:<br /></p> <div style="background: rgb(255, 255, 255) none repeat scroll 0% 0%; border-color: gray; border-image: none 100% / 1 / 0 stretch; border-style: solid; border-width: 0.1em 0.1em 0.1em 0.8em; border: medium solid gray; overflow: auto; padding: 0.2em 0.6em; scroll-x: auto; width: 110%;"><pre style="line-height: 125%; margin: 0px;"><span style="color: #008800; font-weight: bold;">const</span> fde <span style="color: #333333;">*</span> <span style="color: #0066bb; font-weight: bold;">_Unwind_Find_FDE</span> (<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>pc, <span style="color: #008800; font-weight: bold;">struct</span> dwarf_eh_bases <span style="color: #333333;">*</span>bases) { <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">*</span>ob; <span style="color: #008800; font-weight: bold;">const</span> fde <span style="color: #333333;">*</span>f <span style="color: #333333;">=</span> <span style="color: #007020;">NULL</span>; <span style="color: #557799;">#ifdef ATOMIC_FDE_FAST_PATH</span> <span style="color: #888888;">/* For targets where unwind info is usually not registered through these</span> <span style="color: #888888;"> APIs anymore, avoid taking a global lock.</span> <span style="color: #888888;"> Use relaxed MO here, it is up to the app to ensure that the library</span> <span style="color: #888888;"> loading/initialization happens-before using that library in other</span> <span style="color: #888888;"> threads (in particular unwinding with 
that library's functions</span> <span style="color: #888888;"> appearing in the backtraces). Calling that library's functions</span> <span style="color: #888888;"> without waiting for the library to initialize would be racy. */</span> <span style="color: #008800; font-weight: bold;">if</span> (__builtin_expect (<span style="color: #333333;">!</span>__atomic_load_n (<span style="color: #333333;">&amp;</span>any_objects_registered, __ATOMIC_RELAXED), <span style="color: #0000dd; font-weight: bold;">1</span>)) <span style="color: #008800; font-weight: bold;">return</span> <span style="color: #007020;">NULL</span>; <span style="color: #557799;">#endif</span> init_object_mutex_once (); __gthread_mutex_lock (<span style="color: #333333;">&amp;</span>object_mutex); <span style="color: #888888;">/* Linear search through the classified objects, to find the one</span> <span style="color: #888888;"> containing the pc. Note that pc_begin is sorted descending, and</span> <span style="color: #888888;"> we expect objects to be non-overlapping. */</span> <span style="color: #008800; font-weight: bold;">for</span> (ob <span style="color: #333333;">=</span> seen_objects; ob; ob <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>next) <span style="color: #008800; font-weight: bold;">if</span> (pc <span style="color: #333333;">&gt;=</span> ob<span style="color: #333333;">-&gt;</span>pc_begin) { f <span style="color: #333333;">=</span> search_object (ob, pc); <span style="color: #008800; font-weight: bold;">if</span> (f) <span style="color: #008800; font-weight: bold;">goto</span> fini; <span style="color: #008800; font-weight: bold;">break</span>; } <span style="color: #888888;">/* Classify and search the objects we've not yet processed. */</span> <span style="color: #008800; font-weight: bold;">while</span> ((ob <span style="color: #333333;">=</span> unseen_objects)) { <span style="color: #008800; font-weight: bold;">struct</span> object <span style="color: #333333;">**</span>p; unseen_objects <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>next; f <span style="color: #333333;">=</span> search_object (ob, pc); <span style="color: #888888;">/* Insert the object into the classified list. 
*/</span> <span style="color: #008800; font-weight: bold;">for</span> (p <span style="color: #333333;">=</span> <span style="color: #333333;">&amp;</span>seen_objects; <span style="color: #333333;">*</span>p ; p <span style="color: #333333;">=</span> <span style="color: #333333;">&amp;</span>(<span style="color: #333333;">*</span>p)<span style="color: #333333;">-&gt;</span>next) <span style="color: #008800; font-weight: bold;">if</span> ((<span style="color: #333333;">*</span>p)<span style="color: #333333;">-&gt;</span>pc_begin <span style="color: #333333;">&lt;</span> ob<span style="color: #333333;">-&gt;</span>pc_begin) <span style="color: #008800; font-weight: bold;">break</span>; ob<span style="color: #333333;">-&gt;</span>next <span style="color: #333333;">=</span> <span style="color: #333333;">*</span>p; <span style="color: #333333;">*</span>p <span style="color: #333333;">=</span> ob; <span style="color: #008800; font-weight: bold;">if</span> (f) <span style="color: #008800; font-weight: bold;">goto</span> fini; } <span style="color: #997700; font-weight: bold;">fini:</span> __gthread_mutex_unlock (<span style="color: #333333;">&amp;</span>object_mutex); <span style="color: #008800; font-weight: bold;">if</span> (f) { <span style="color: #333399; font-weight: bold;">int</span> encoding; _Unwind_Ptr func; bases<span style="color: #333333;">-&gt;</span>tbase <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>tbase; bases<span style="color: #333333;">-&gt;</span>dbase <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>dbase; encoding <span style="color: #333333;">=</span> ob<span style="color: #333333;">-&gt;</span>s.b.encoding; <span style="color: #008800; font-weight: bold;">if</span> (ob<span style="color: #333333;">-&gt;</span>s.b.mixed_encoding) encoding <span style="color: #333333;">=</span> get_fde_encoding (f); read_encoded_value_with_base (encoding, base_from_object (encoding, ob), f<span style="color: #333333;">-&gt;</span>pc_begin, <span style="color: #333333;">&amp;</span>func); bases<span style="color: #333333;">-&gt;</span>func <span style="color: #333333;">=</span> (<span style="color: #333399; font-weight: bold;">void</span> <span style="color: #333333;">*</span>) func; } <span style="color: #008800; font-weight: bold;">return</span> f; } </pre></div> <p>In the fast path it tries to stop early if no unwinding information had been registered. But if any unwinding frame has been registered by JITed code, it grabs the global object_mutex and does everything effectively single-threaded. This does not scale. Thus, we discuss how to switch gcc to a lock-free lookup structure in the <a href="https://databasearchitects.blogspot.com/2022/06/replacinggcchooks.html">next article</a>.</p><p><br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-39280898467884404382022-04-03T09:59:00.009+02:002022-06-01T07:19:41.690+02:00Cloud Network Traffic Within the Same Region Can Be Very Expensive<p>Everyone knows that the major cloud vendors try to make it easy to get data in, and hard to get it out. What is less known is that high egress cost also applies to outbound traffic within the same region.<br /><br />Let's look at AWS specifically. In AWS, EC2 outbound traffic is only free within the same availability zone (AZ). 
Moving data from one AZ to another in the same region is actually quite expensive:</p><blockquote><i>"IPv4: Data transferred “in“ to and “out“ from Amazon EC2 [...] across Availability Zones in the same AWS Region is charged at $0.01/GB in each direction."</i> <a href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region">source</a><br /></blockquote>This means that transferring 1TB costs $0.01/GB * 1000GB * 2 = $20. For comparison: most inter-region transfers cost $0.02 per GB, but only for outgoing traffic. Thus, remarkably, transferring 1TB from Ohio to Tokyo will cost the same as transferring it within Ohio from <i>us-east-2a</i> to <i>us-east-2b</i>. Two <i>c5n.18xlarge</i> instances communicating with each other at full 100 Gbit speed can theoretically incur network costs of $1,800 per hour (or $1,296,000 per month).<br /><br />Interestingly, S3 can be used to bypass the high traffic cost when moving data between different AZs in the same region because<br /><blockquote><i>"Data transferred directly between Amazon S3 [...] in the same AWS Region is free."</i> <a href="https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region">source</a><br /></blockquote>Let's see if we can exploit this. Consider again our example where we want to transfer 1TB from <i>us-east-2a</i> to <i>us-east-2b</i>. Instead of two EC2 instances talking directly, we could use an S3 Standard bucket in <i>us-east-2</i>. We first PUT the data into it from <i>us-east-2a</i>, then GET the data from that bucket using <i>us-east-2b</i> instances, and finally delete all objects. If we split our data into 1,000 1GB chunks, we need to pay for only 1,000 PUT and 1,000 GET S3 requests, which would be less than $0.01. Storage cost is also low: assuming a transfer rate of 1GB/s, the data would have to be stored in S3 for less than an hour, which costs about $0.03 (and could be reduced via pipelining). Thus, in total we can transfer 1TB through S3 for less than $0.05 instead of $20.<br /><br />To summarize, inter-AZ traffic within the same region is unreasonably expensive. It seems unlikely that these prices have any resemblance to internal AWS cost -- as is reflected by the fact that AWS services such as S3 allow bypassing this cost.<br /><br /><p></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-83540453226657587442022-01-16T15:07:00.003+01:002022-01-17T10:06:17.281+01:00Are you sure you want to use MMAP in your database management system?<p>Many database management systems carefully manage disk I/O operations and explicitly cache pages in main memory. Operating systems implement a page cache to speed up recurring disk accesses as well, and even allow transparent access to disk files through the mmap system call. Why do most database systems then even implement I/O handling and a caching component if the OS provides these features through mmap? Andrew Pavlo, Andrew Crotty, and I tried to answer this question in a <a href="https://db.cs.cmu.edu/mmap-cidr2022/">CIDR 2022 paper</a>. This is quite a contentious question, as the <a href="https://news.ycombinator.com/item?id=29936104">Hacker News discussion of the paper</a> shows.<br /><br />The paper argues that using <a href="https://man7.org/linux/man-pages/man2/mmap.2.html">mmap</a> in database systems is almost always a bad idea. 
To implement transactions and crash recovery with mmap, the DBMS has to write any change out-of-place because there is no way to prevent the write-back of a particular page. This makes it impossible to implement classical ARIES-style transactions. Furthermore, data access through mmap can take a handful of nanoseconds (if the data is in the CPU cache) or milliseconds (if it's on disk). If a page is not cached, it will be read through a synchronous page fault and there is no interface for asynchronous I/O. I/O errors, on the other hand, are communicated through signals rather than through a local error code. These problems are caused by mmap's interface, which is too high-level and does not give the database system enough control.<br /><br />In addition to discussing these interface problems, the paper also shows that Linux' page cache and mmap implementation cannot achieve the bandwidth of modern storage devices. One <a href="https://www.samsung.com/semiconductor/ssd/enterprise-ssd/MZWLJ3T8HBLS/">PCIe 4.0 NVMe SSD</a> can read over 6 GB/s and <a href="https://news.samsung.com/global/samsung-develops-high-performance-pcie-5-0-ssd-for-enterprise-servers">upcoming PCIe 5.0 SSDs</a> will almost double this number. To achieve this performance, one needs to schedule hundreds or even thousands (if one has multiple SSDs) of concurrent I/O requests. Doing this in a synchronous fashion by starting hundreds of threads will not work well. Other kernel-level performance issues are single-threaded page eviction and TLB shootdowns. Overall, this is an example of OS evolution lagging behind hardware evolution.<br /><br />The OS has one big advantage over the DBMS though: it has control over the page table. Once a page is mapped, accessing it becomes transparent and as fast as ordinary memory. Any manually-implemented buffer manager, in contrast, will have some form of indirection, which causes some overhead. Pointer swizzling as implemented in <a href="https://leanstore.io/">LeanStore</a> and <a href="https://umbra-db.com/">Umbra</a> is a fast alternative but is also more difficult to implement than a traditional buffer manager and only supports tree-like data structures. Therefore, an interesting question is whether it would be possible to have an mmap-like interface, but with more control and better performance. Generally, I believe this kind of research across different areas should be more common.<br /><br />&nbsp;</p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-47241630820162770592021-07-04T16:23:00.006+02:002023-02-07T13:39:55.748+01:00AWS EC2 Hardware Trends: 2015-2021<p>&nbsp;</p><p>Over the past decade, AWS EC2 has introduced many new instance types with different hardware configurations and prices. This hardware zoo can make it hard to keep track of what is available. In this post we will look at how the EC2 hardware landscape has changed since 2015. This will hopefully help pick the best option for a given task.<br /><br />In the cloud, one can trade money for hardware resources. It therefore makes sense to take an economic perspective and normalize the hardware resources by the instance price. For example, instead of looking at absolute network bandwidth, we will use network bandwidth per dollar. Such metrics also allow us to ignore virtualized slices, reducing the number of instances relevant for the analysis from hundreds to dozens. 
For example, c5n.9xlarge is a virtualized slice of c5n.18xlarge with half the network bandwidth and half the cost.</p><h2 style="text-align: left;">Data Set</h2><p>We use historical data from <a href="https://instances.vantage.sh/">https://instances.vantage.sh/</a> and only consider current-generation Intel machines without GPUs. All prices are for us-east-1 Linux instances. Using these constraints, in July 2021 we can pick from the following instances:<br /> <style type="text/css">.tg-sort-header::-moz-selection{background:0 0}.tg-sort-header::selection{background:0 0}.tg-sort-header{cursor:pointer}.tg-sort-header:after{content:'';float:right;margin-top:7px;border-width:0 5px 5px;border-style:solid;border-color:#404040 transparent;visibility:hidden}.tg-sort-header:hover:after{visibility:visible}.tg-sort-asc:after,.tg-sort-asc:hover:after,.tg-sort-desc:after{visibility:visible;opacity:.4}.tg-sort-desc:after{border-bottom:none;border-width:5px 5px 0}@media screen and (max-width: 767px) {.tg {width: auto !important;}.tg col {width: auto !important;}.tg-wrap {overflow-x: auto;-webkit-overflow-scrolling: touch;}}</style></p><div class="tg-wrap"><table class="tg" id="tg-T8L1r" style="border-collapse: collapse; border-spacing: 0px;"><thead><tr><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">name</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">vCPU</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">memory [GB]</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">network [Gbit/s]</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">storage</th><th style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; font-weight: bold; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">price [$/h]</th></tr></thead><tbody><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m4.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">256</td><td style="border-color: black; border-style: solid; border-width: 1px; 
font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3.20</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">h1.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">256</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8x2TB disk</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3.74</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">c5n.18x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">72</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3.89</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: 
center; vertical-align: top; word-break: normal;">d3.8x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">32</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">256</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">24x2TB disk</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.00</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">c5.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.08</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r4.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">488</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; 
border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.26</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.61</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">c5d.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4.61</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">i3.16x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 
10px 5px; text-align: right; vertical-align: top; word-break: normal;">64</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">488</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8x1.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.00</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5d.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.42</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">d2.8x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">36</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">244</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">10</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">24x2TB disk</td><td 
style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.52</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5n.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">5.71</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">6.05</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">d3en.12x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">48</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; 
font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">192</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">75</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">24x14TB disk</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">6.31</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">m5dn.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">384</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">6.52</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5d.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; 
word-break: normal;">6.91</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5n.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">7.15</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5b.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;"><br /></td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">7.15</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">r5dn.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; 
font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">4x0.9TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8.02</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">i3en.24x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">96</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">768</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">100</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">8x7.5TB NVMe</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">10.85</td></tr><tr><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: center; vertical-align: top; word-break: normal;">x1e.32x</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">128</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">3904</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">25</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">2x1.9TB SATA</td><td style="border-color: black; border-style: solid; border-width: 1px; font-family: Arial, sans-serif; font-size: 14px; overflow: hidden; padding: 10px 5px; text-align: right; vertical-align: top; word-break: normal;">26.69</td></tr></tbody></table></div><p><script charset="utf-8">var TGSort=window.TGSort||function(n){"use strict";function r(n){return n?n.length:0}function 
t(n,t,e,o=0){for(e=r(n);o<e;++o)t(n[o],o)}function e(n){return n.split("").reverse().join("")}function o(n){var e=n[0];return t(n,function(n){for(;!n.startsWith(e);)e=e.substring(0,r(e)-1)}),r(e)}function u(n,r,e=[]){return t(n,function(n){r(n)&&e.push(n)}),e}var a=parseFloat;function i(n,r){return function(t){var e="";return t.replace(n,function(n,t,o){return e=t.replace(r,"")+"."+(o||"").substring(1)}),a(e)}}var s=i(/^(?:\s*)([+-]?(?:\d+)(?:,\d{3})*)(\.\d*)?$/g,/,/g),c=i(/^(?:\s*)([+-]?(?:\d+)(?:\.\d{3})*)(,\d*)?$/g,/\./g);function f(n){var t=a(n);return!isNaN(t)&&r(""+t)+1>=r(n)?t:NaN}function d(n){var e=[],o=n;return t([f,s,c],function(u){var a=[],i=[];t(n,function(n,r){r=u(n),a.push(r),r||i.push(n)}),r(i)<r(o)&&(o=i,e=a)}),r(u(o,function(n){return n==o[0]}))==r(o)?e:[]}function v(n){if("TABLE"==n.nodeName){for(var a=function(r){var e,o,u=[],a=[];return function n(r,e){e(r),t(r.childNodes,function(r){n(r,e)})}(n,function(n){"TR"==(o=n.nodeName)?(e=[],u.push(e),a.push(n)):"TD"!=o&&"TH"!=o||e.push(n)}),[u,a]}(),i=a[0],s=a[1],c=r(i),f=c>1&&r(i[0])<r(i[1])?1:0,v=f+1,p=i[f],h=r(p),l=[],g=[],N=[],m=v;m<c;++m){for(var T=0;T<h;++T){r(g)<h&&g.push([]);var C=i[m][T],L=C.textContent||C.innerText||"";g[T].push(L.trim())}N.push(m-v)}t(p,function(n,t){l[t]=0;var a=n.classList;a.add("tg-sort-header"),n.addEventListener("click",function(){var n=l[t];!function(){for(var n=0;n<h;++n){var r=p[n].classList;r.remove("tg-sort-asc"),r.remove("tg-sort-desc"),l[n]=0}}(),(n=1==n?-1:+!n)&&a.add(n>0?"tg-sort-asc":"tg-sort-desc"),l[t]=n;var i,f=g[t],m=function(r,t){return n*f[r].localeCompare(f[t])||n*(r-t)},T=function(n){var t=d(n);if(!r(t)){var u=o(n),a=o(n.map(e));t=d(n.map(function(n){return n.substring(u,r(n)-a)}))}return t}(f);(r(T)||r(T=r(u(i=f.map(Date.parse),isNaN))?[]:i))&&(m=function(r,t){var e=T[r],o=T[t],u=isNaN(e),a=isNaN(o);return u&&a?0:u?-n:a?n:e>o?n:e<o?-n:n*(r-t)});var C,L=N.slice();L.sort(m);for(var E=v;E<c;++E)(C=s[E].parentNode).removeChild(s[E]);for(E=v;E<c;++E)C.appendChild(s[v+L[E-v]])})})}}n.addEventListener("DOMContentLoaded",function(){for(var t=n.getElementsByClassName("tg"),e=0;e<r(t);++e)try{v(t[e])}catch(n){}})}(document)</script> </p><p></p><h2 style="text-align: left;">Compute</h2><p>Using our six-year data set, let's first look at the cost of compute:<br /></p> <img src="https://db.in.tum.de/~leis/cloud/cpuevol.svg" /> <br /> <p>It is quite remarkable that from 2015 to 2021, the cost of compute barely changed. During that six-year time frame, the number of server CPU cores has been growing significantly, which may imply that Intel compute power is currently overpriced in EC2. In the last couple of years EC2 has introduced cheaper AMD and ARM instances, but it's still surprising that AWS chose to keep Intel CPU prices fixed.<br /></p><h2 style="text-align: left;">DRAM Capacity</h2><p>For DRAM, the picture is also quite stagnant:<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/memevol.svg" /> <br /><br />The introduction of the x1e instances improved the situation a bit, but there's been a stagnation since 2018. However, this is less surprising than the CPU situation because DRAM commodity prices in general did not move much.<br /></p><h2 style="text-align: left;">Instance Storage</h2><p>Let's next look at instance storage. EC2 offers instances with disks (about 0.2GB/s bandwidth), SATA SSDs (about 0.5GB/s bandwidth), and NVMe SSDs (about 2GB/s bandwidth). 
The introduction of instances with up to 8 NVMe SSDs in 2017 clearly disrupted IO bandwidth speed (the y-axis unit may look weird for bandwidth but is correct once we normalize by hourly cost):<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/ioevol.svg" /> <br /><br />In terms of capacity per dollar, disk is still king and the d3en instance (<a href="https://aws.amazon.com/about-aws/whats-new/2020/12/introducing-amazon-ec2-d3-and-d3en-the-next-generation-of-dense-hdd-storage-instances/">introduced in December 2020</a>) totally changed the game:<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/ioevol2.svg" /> <br /></p><h2 style="text-align: left;">Network Bandwidth</h2><p>For network bandwidth, we see another major disruption, this time the introduction of 100GBit network instances:<br /><br /> <img src="https://db.in.tum.de/~leis/cloud/networkevol.svg" /> <br /><br />The c5n instance, in particular, is clearly a game changer. It is only marginally more expensive than c5, but its network speed is 4 times faster.<br /></p><h2 style="text-align: left;">Conclusions</h2><p>These results show that the hardware landscape is very fluid and regularly we see major changes like the introduction of NVMe SSDs or 100 GBit networking. Truisms like "in distributed systems network bandwidth is the bottleneck" can become false! (Network latency is of course a different beast.) High-performance systems must therefore take hardware trends into account and adapt to the ever-evolving hardware landscape.<br /><br /><br /> </p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com4tag:blogger.com,1999:blog-8891884956165580080.post-66123729635533362632021-06-18T13:52:00.004+02:002021-06-23T11:26:06.848+02:00What Every Programmer Should Know About SSDs<p>Solid-State Drives (SSDs) based on flash have largely replaced magnetic disks as the standard storage medium. From the perspective of a programmer, SSDs and disks look very similar: both are persistent, enable page-based (e.g., 4KB) access through file systems and system calls, and have large capacities.<br /><br />However, there are also important differences, which become important if one wants to achieve optimal SSD performance. As we will see, SSDs are more complicated and their performance behavior can appear quite mysterious if one simply thinks of them as fast disks. The goal of this post is to provide an understanding of why SSDs behave the way they do, which can help creating software that is capable of exploiting them. (Note that I discuss NAND flash, not Intel Optane memory, which has different characteristics.)<br /><br /></p> <h3 style="text-align: left;">Drives not Disks</h3><p><br />SSDs are often referred to as disks, but this is misleading as they store data on semiconductors instead of a mechanical disk. To read or write from a random block, a disk has to mechanically move its head to the right location, which takes on the order of 10ms. A random read from an SSD, in contrast, takes about 100us &ndash; 100 times faster. This low read latency is the reason why booting from an SSD is so much faster than booting from a disk.<br /><br /></p><h3 style="text-align: left;">Parallelism<br /></h3><p><br />Another important difference between disks and SSDs is that disks have one disk head and perform well only for sequential accesses. 
SSDs, in contrast, consist of dozens or even hundreds of flash chips ("parallel units"), which can be accessed concurrently.<br /><br />SSDs transparently stripe larger files across the flash chips at page granularity, and a hardware prefetcher ensures that sequential scans exploit all available flash chips. However, at the flash level there is not much difference between sequential and random reads. Indeed, for most SSDs it is possible to achieve almost the full bandwidth with random page reads as well. To do this, one has to schedule hundreds of random IO requests concurrently in order to keep all flash chips busy. This can be done by starting lots of threads or using asynchronous IO interfaces such as libaio or io_uring.<br /><br /></p> <h3 style="text-align: left;">Writing</h3><p><br />Things get even more interesting with writes. For example, if one looks at write latency, one may measure results as low as 10us &ndash; 10 times faster than a read. However, latency only appears so low because SSDs are caching writes on volatile RAM. The actual write latency of NAND flash is about 1ms &ndash; 10 times slower than a read. On consumer SSDs, this can be measured by issuing a sync/flush command after the write to ensure that the data is persistent on flash. On most data center/server SSDs, write latency cannot be measured directly: the sync/flush will complete immediately because a battery guarantees persistence of the write cache even in the case of power loss. <br /><br />To achieve high write bandwidth despite the relatively high write latency, writes use the same trick as reads: they access multiple flash chips concurrently. Because the write cache can asynchronously write pages, it is not even necessary to schedule that many writes simultaneously to get good write performance. However, the write latency cannot always be hidden completely: for example, because a write occupies a flash chip 10 times longer than a read, writes cause significant tail latencies for reads to the same flash chip.<br /><br /></p> <h3 style="text-align: left;">Out-Of-Place Writes</h3><p><br />Our understanding is missing one important fact: NAND flash pages cannot be overwritten. Page writes can only be performed sequentially within blocks that have been erased beforehand. These erase blocks have a size of multiple MB and therefore consist of hundreds of pages. On a new SSD, all blocks are erased, and one can directly start appending new data.<br /><br />Updating pages, however, is not so easy. It would be too expensive to erase the entire block just to overwrite a single page in-place. Therefore, SSDs perform page updates by writing the new version of the page to a new location. This means that the logical and physical page addresses are decoupled. A mapping table, which is stored on the SSD, translates logical (software) addresses to physical (flash) locations. This component is also called Flash Translation Layer (FTL).<br /><br />For example, let's assume we have a (toy) SSD with 3 erase blocks, each with 4 pages. 
A sequence of writes to pages P1, P2, P0, P3, P5, P1 may result in the following physical SSD state: <table cellspacing="0" border="1"> <colgroup span="5" width="70"></colgroup> <tr> <td height="27" align="left">Block 0</td> <td align="left">P1 (old)</td> <td align="left">P2</td> <td align="left">P0</td> <td align="left">P3</td> </tr> <tr> <td height="27" align="left">Block 1</td> <td align="left">P5</td> <td align="left">P1</td> <td align="left">&#8594;</td> <td align="left"><br></td> </tr> <tr> <td height="27" align="left">Block 2</td> <td align="left"><br></td> <td align="left"><br></td> <td align="left"><br></td> <td align="left"><br></td> </tr> </table> <br /></p><p></p><p><br /></p> <h3 style="text-align: left;">Garbage Collection</h3><p><br />Using the mapping table and out-of-place write, everything is good until the SSD runs out of free blocks. The old version of overwritten pages must eventually be reclaimed. If we continue our example from above by writing to pages P3, P4, P7, P1, P6, P2, we get the following situation: <table cellspacing="0" border="1"> <colgroup span="5" width="70"></colgroup> <tr> <td height="27" align="left">Block 0</td> <td align="left">P1 (old)</td> <td align="left">P2 (old)</td> <td align="left">P0</td> <td align="left">P3 (old)</td> </tr> <tr> <td height="27" align="left">Block 1</td> <td align="left">P5</td> <td align="left">P1 (old)</td> <td align="left">P3</td> <td align="left">P4</td> </tr> <tr> <td height="27" align="left">Block 2</td> <td align="left">P7</td> <td align="left">P1</td> <td align="left">P6</td> <td align="left">P2</td> </tr> </table> <br /><br />At this point we have no more free erase blocks (even though logically there should still be space). Before one can write another page, the SSD first has to erase a block. In the example, it might be best for the garbage collector to erase block 0, because only one of its pages is still in use. After erasing block 0, we make space for 3 writes and our SSD looks like this: <table cellspacing="0" border="1"> <colgroup span="5" width="70"></colgroup> <tr> <td height="27" align="left">Block 0</td> <td align="left">P0</td> <td align="left">&#8594;<br></td> <td align="left"><br></td> <td align="left"><br></td> </tr> <tr> <td height="27" align="left">Block 1</td> <td align="left">P5</td> <td align="left">P1 (old)</td> <td align="left">P3</td> <td align="left">P4</td> </tr> <tr> <td height="27" align="left">Block 2</td> <td align="left">P7</td> <td align="left">P1</td> <td align="left">P6</td> <td align="left">P2</td> </tr> </table> <br /><br /></p> <h3 style="text-align: left;">Write Amplification and Overprovisioning</h3><p><br />To garbage collect block 0, we had to physically move page P0, even though logically nothing happened with that page. In other words, with flash SSDs the number of physical (flash) writes is generally higher than the number of logical (software) writes. The ratio between the two is called write amplification. In our example, to make space for 3 new pages in block 0, we had to move 1 page. Thus we have 4 physical writes for 3 logical writes, i.e., a write amplification of 1.33.<br /><br />High write amplification decreases performance and reduces flash lifetime. How large write amplification is depends on the access pattern and how full the SSD is. Large sequential writes have low write amplification, while random writes are the worst case.<br /><br />Let's assume our SSD is filled to 50% and we perform random writes. 
In steady state, wherever we erase a block, about half the pages of that block are still in use and have to be copied on average. Thus, write amplification for a fill factor of 50% is 2. In general, worst-case write amplification for a fill factor f is 1/(1-f): <table cellspacing="0" border="1"> <colgroup span="12" width="50"></colgroup> <tr> <td height="27" align="left">f</td> <td align="right" >0.1</td> <td align="right" >0.2</td> <td align="right" >0.3</td> <td align="right" >0.4</td> <td align="right" >0.5</td> <td align="right" >0.6</td> <td align="right" >0.7</td> <td align="right" >0.8</td> <td align="right" >0.9</td> <td align="right" >0.95</td> <td align="right" >0.99</td> </tr> <tr> <td height="27" align="left">WA</td> <td align="right">1.11</td> <td align="right">1.25</td> <td align="right">1.43</td> <td align="right">1.67</td> <td align="right">2.00</td> <td align="right">2.50</td> <td align="right">3.33</td> <td align="right">5</td> <td align="right">10</td> <td align="right">20</td> <td align="right">100</td> </tr> </table> <br /><br />Because write amplification becomes unreasonably high for fill factors close to 1, most SSDs have hidden spare capacity. This overprovisioning is typically 10-20% of the total capacity. Of course, it is also easy to add more overprovisioning by creating an empty partition and never write to it.<br /><br /></p> <h3 style="text-align: left;">Summary and Further Reading</h3><p><br />SSDs have become quite cheap and they have very high performance. For example, a Samsung PM1733 server SSD costs about 200 EUR per TB and promises close to 7 GB/s read and 4 GB/s write bandwidth. Actually achieving such high performance requires knowing how SSDs work and this post described the most important behind-the-scenes mechanisms of flash SSDs.<br /><br />I tried to keep this post short, which meant that I had to simplify things. To learn more, a good starting point is <a href="https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/">this tutorial</a>, which also references some insightful papers. Finally, because SSDs have become so fast, the operating system I/O stack is often the performance bottleneck. Experimental results for Linux can be found in our <a href="http://cidrdb.org/cidr2020/papers/p16-haas-cidr20.pdf">CIDR 2020 paper</a>.<br /><br /><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com7tag:blogger.com,1999:blog-8891884956165580080.post-65711794579513932462020-11-22T17:50:00.000+01:002020-11-22T17:50:41.678+01:00Taming Deep Recursion<p>&nbsp;When operating on hierarchical data structures, it is often convenient to formulate that using pairwise recursive functions. For example, our semantic analysis walks that parse tree recursively and transforms it into an expression tree. This corresponding code looks roughly like this:</p><p></p> <pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; overflow-wrap: normal; word-wrap: normal;"> unique_ptr&lt;Expression&gt; analyzeExpression(AST* astNode) { switch (astNode-&gt;getType()) { case AST::BinaryExpression: return analyzeBinaryExpression(astNode-&gt;as&lt;BinaryExpAST&gt;()); case AST::CaseExpression: return analyzeCaseExpression(astNode-&gt;as&lt;CaseExpAST&gt;()); ... 
} } unique_ptr&lt;Expression&gt; analyzeBinaryExpression(BinaryExpAST* astNode) { auto left = analyzeExpression(astNode-&gt;left); auto right = analyzeExpression(astNode-&gt;right); auto type = inferBinaryType(astNode-&gt;getOp(), left, right); return make_unique&lt;BinaryExpression&gt;(astNode-&gt;getOp(), move(left), move(right), type); } </code></pre> <p>It recursively walks the tree, collects input expressions, infers types, and constructs new expressions. This works beautifully until you encounter a (generated) query with 300,000 expressions, which we did. At that point our program crashed due to stack overflow. Oops. </p><p>Our first mitigation was using <a href="https://gcc.gnu.org/onlinedocs/gcc/Return-Address.html">__builtin_frame_address(0)</a> at the beginning of <i>analyzeExpression</i> to detect excessive stack usage, and to throw an exception if that happens. This prevented the crash, but is not very satisfying. First, it means we refuse a perfectly valid SQL query "just" because it uses 300,000 terms in one expression. And second, we cannot be sure that this is enough. There are several places in the code that recursively walk the algebra tree, and it is hard to predict their stack usage. Even worse, the depth of the tree can change due to optimizations. For example, when a query has 100,000 entries in the from clause, the initial tree is extremely wide but flat. Later, after we have stopped checking for stack overflows, the optimizer might transform that into a tree with 100,000 levels, again leading to stack overflow. Basically, all recursive operations on the algebra tree are dangerous.</p><p>Now common wisdom is to avoid recursion if we cannot bound the maximum depth, and use iteration with an explicit stack instead. We spend quite some time thinking about that approach, but the main problem with that is that it makes the code extremely ugly. The code snippet above is greatly simplified, but even there an explicit stack would be unwieldy and ugly if we have to cover both <i>binaryExpression</i> and <i>caseExpression</i> using one stack. And the code gets cut into tiny pieces due to the control flow inversion required for manual stacks. And all that to defend against something that nearly never happens. We were unhappy with that solution, we wanted something that is minimal invasive and created overhead only in the unlikely case that a user gives us an extremely deep input.</p><p>One mechanism that promises to solve this problem is <a href="https://gcc.gnu.org/wiki/SplitStacks">-fsplit-stack</a>. There, the compiler checks for stack overflows in the function prolog and creates a new stack segment if needed. Great, exactly what we wanted! We can handle deep trees, no code change, and we only create a new stack if we indeed encounter deep recursion. Except that it is not really usable in practice. First, -fsplit-stack is quite slow. We measured 20% overhead in our optimizer when enabling split stacks, and that in cases where we did not create any new stacks at all. When -fsplit-stack does create new stacks it is even worse. This is most likely a deficit of the implementation, one could implement -fsplit-stack much more efficiently, but the current implementation is not encouraging. Even worse, clang produces an&nbsp;<a href="https://bugs.llvm.org/show_bug.cgi?id=19197">internal compiler error</a> when compiling some functions with -fsplit-stack. 
Nobody seems to use this mechanism in production, and after disappointing results we stopped considering -fsplit-stack.<br /></p><p>But the idea of split stacks is clearly good. When encountering deep recursion we will have to switch stacks at some point. After contemplating this problem for some time we realized that&nbsp;<a href="https://www.boost.org/doc/libs/1_74_0/libs/context/doc/html/index.html">boost.context</a> offers the perfect mechanism for switching stacks: It can start a fiber with a given stack, and switching between fibers costs just 19 cycles. By caching additional stacks and their fibers in thread-local data structures we can provide our own implementation of split stacks that is fast and supports arbitrary code. Without compiler support the split stack mechanism is visible, of course, but that is fine in our code. We have only a few entry points like <i>analyzeExpression</i> that will be called over and over again during recursion, and checking there is enough. Code wise the mechanism is not too ugly, it needs two lines of code per recursion head and looks like</p><pre style="background: rgb(240, 240, 240) none repeat scroll 0% 0%; border: 1px dashed rgb(204, 204, 204); color: black; font-family: arial; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; overflow-wrap: normal; word-wrap: normal;"> unique_ptr&lt;Expression&gt; analyzeExpression(AST* astNode) { if (StackGuard::needsNewStack()) return StackGuard::newStack([=]() { return analyzeExpression(astNode); }); ... // unchanged } </code></pre><p>Note that the lambda argument for <i>newStack</i> will be called from within the new stack, avoiding the stack overflow. When the lambda returns the mechanism will use boost.context to switch back to the previous stack. The performance impact of that mechanism is negligible, as 1) we do not check for overflows all the time but only in the few places that we know are central to the recursive invocations, like <i>analyzeExpression</i> here, 2) stack overflows are extremely rare in practice, and we only pay with one <i>if</i> per invocation is no overflow happens, and 3) even if they do happen, the mechanism is reasonably cheap. We cache the child stacks, and switching to a cached stack costs something like 30 cycles. And we never recurse in a hot loop.<br /></p><p>It took us a while to get there, but now we can handle really large queries without crashing. Just for fun we tried running a query with 100,000 relations in the from clause. Fortunately our optimizer <a href="https://dl.acm.org/doi/10.1145/3183713.3183733">could already handle that</a>, and now the rest of the system can handle it, too. And that with nice, intuitive, recursive code, at the small price of two lines of code per recursion head.<br /></p>Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-3081388142334773452020-10-30T17:17:00.002+01:002020-10-30T17:19:05.782+01:00C++ Concurrency Model on x86 for Dummies<p>Since C++11, multi-threaded C++ code has been governed by a rigorous memory model. The model allows implementing concurrent code such as low-level synchronization primitives or lock-free data structures in a portable fashion. To use the memory model, programmers need to do two things: First, they have to use the <i>std::atomic</i> type for concurrently-accessed memory locations. 
Second, each atomic operation requires a memory order argument with six options determining the concurrency semantics in terms of which re-orderings are allowed. (Some operations even allow specifying two memory orders!)<br /><br />While there are a number of attempts to describe the model, I always found the full semantics very hard to understand and consequently concurrent code hard to write and reason about. And since we are talking about low-level concurrent code here, making a mistake (like picking the wrong memory order) can lead to disastrous consequences.<br /><br />Luckily, at least on x86, a small subset of the full C++11 memory model is sufficient. In this post, I'll present such a subset that is sufficient to write high-performance concurrent code on x86. This simplification has the advantage that the resulting code is much more likely to be correct, without leaving any performance on the table. (On non-x86 platforms like ARM, code written based on this simplified model will still be correct, but might potentially be slightly slower than necessary.)&nbsp;</p><p>There are only six things one needs to know to write high-performance concurrent code on x86.<br /></p><p style="text-align: left;"></p><h2 style="text-align: left;">1. Data races are undefined<br /></h2><p style="text-align: left;">If a data race occurs in C++, the behavior of the program is undefined. Let's unpack that statement. A data race can be defined as two or more threads accessing the same memory location with at least one of the accesses being a write. By default (i.e., without using <i>std::atomic</i>), the compiler may assume that no other thread is concurrently modifying memory. This allows the compiler to optimize the code, for example by reordering or optimizing away memory accesses. Consider the following example:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;"><b>void</b> wait(bool* flag) {<br />&nbsp;&nbsp;&nbsp; <b>while</b> (*flag);<br />}</span></span><br /><br />Because data races are undefined, the compiler may assume that the value behind the pointer <i>flag</i> is not concurrently modified by another thread. Using this assumption, <a href="https://godbolt.org/z/8Mxfs5">gcc translates</a> the code to a return statement while <a href="https://godbolt.org/z/34384c">clang translates</a> it to an infinite loop if <i>flag</i> is initially true. Both translations are likely not what the code intends. To avoid undefined code, it is necessary use <i>std::atomic</i> for variables where race conditions may happen:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;"><b>void</b> wait(std::atomic&lt;bool&gt;* flag) {<br />&nbsp;&nbsp;&nbsp; <b>while</b> (*flag); // same as while(flag-&gt;load(std::memory_order_seq_cst));<br />}</span></span><br /><br /><i>*flag</i> is equivalent to <i>flag.load(std::memory_order_seq_cst)</i>, i.e., the default memory order is sequential consistency. Sequential consistency is the strongest memory order guaranteeing that atomic operations are executed in program order. The compiler is not allowed to reorder memory operations or optimize them away.<br /><br /></p><h2 style="text-align: left;">2. Sequentially-consistent loads are fast</h2><p style="text-align: left;"><br />Making the flag atomic may seem expensive, but luckily atomic loads are cheap on x86. 
Indeed, our wait function is <a href="https://godbolt.org/z/ssnGqa">translated</a> to a simple loop with a simple MOV instruction, without any barrier/fence instruction. This is great as it means that on x86 an atomic, sequentially-consistent load can be just as fast as a normal load. It also means that on x86 there is no performance benefit of using any weaker memory order for atomic loads. For loads all memory orders are simply translated to MOV.<br /><br /></p><h2 style="text-align: left;">3. Sequentially-consistent stores are slow</h2><p style="text-align: left;"><br />While sequentially-consistent atomic loads are as cheap as normal loads, this is not the case for stores, as can be observed from the following example:<br /><br /><span style="font-family: courier;"><span style="font-size: x-small;"><b>void</b> unwait(std::atomic&lt;bool&gt;* flag) {<br />&nbsp;&nbsp;&nbsp; *flag = false; // same as flag-&gt;store(false, std::memory_order_seq_cst);<br />}</span></span><br /><br />As with atomic loads, atomic stores are sequentially consistent if no explicit memory order is specified. In clang and gcc 10, the store <a href="https://godbolt.org/z/fn7bjY">translates</a> to an XCHG instruction rather than a MOV instruction (older gcc versions <a href="https://godbolt.org/z/oW7PME">translate</a> it to a MOV plus MFENCE.). XCHG and MFENCE are fairly expensive instructions but are required for sequentially consistent stores on x86. (The CPU's store buffer must be flushed to L1 cache to make the write visible to other threads through cache coherency.)<br /><br /></p><h2 style="text-align: left;">4. Stores that can be delayed can benefit from the <i>release</i> memory order</h2><p style="text-align: left;"><br />Because sequentially-consistent stores are fairly expensive, there are situations where a weaker memory order can improve performance. A common case is when the effect of a store can be delayed. The classical example is unlocking a mutex. The unlocking thread does not have to synchronously wait for the unlocking to become visible, but can continue executing other instructions. Another way of saying this is that it is correct to move instructions into the critical section, but not out of it. In C++, this weaker form of store consistency is available through the release memory order.<br /><br />In our unwait example, the store can indeed be delayed, which is why we can use the release memory order for the store:<br /><br /><span style="font-size: x-small;"><span style="font-family: courier;"><b>void</b> unwait(std::atomic&lt;bool&gt;* flag) {<br />&nbsp;&nbsp;&nbsp; flag-&gt;store(false, std::memory_order_release);<br />}</span></span><br /><br />This code is <a href="https://godbolt.org/z/xsdnhj">translated</a> to a simple MOV instruction, which means it can be as efficient as a non-atomic store.<br /><br /></p><h2 style="text-align: left;">5. Some atomic operations are always sequentially consistent</h2><p style="text-align: left;"><br />Besides loads and stores, <i>std::atomic</i> also offers the high-level atomic operations <i>compare_exchange</i>, <i>exchange</i>, <i>add</i>, <i>sub</i>, <i>and</i>, <i>or</i>, <i>xor</i>. On x86, these are always directly translated to sequentially-consistent CPU instructions. This means that there is no performance benefit from specifying weaker memory orders on any of these operations. (An atomic increment with release semantics would be useful in many situations, but alas is not available.)<br /><br /></p><h2 style="text-align: left;">6. 
Scalable code should avoid cache invalidations</h2><p style="text-align: left;"><br />I mentioned above that sequentially-consistent loads and release stores may be as cheap as non-atomic loads/store. Unfortunately, this is not always the case. Because CPUs have per-core caches, the performance of concurrent programs depends on the dynamic access pattern. Every store has to invalidate any cached copies of that cache line on other cores. This can cause parallel implementations to be slower than single-threaded code. Therefore, to write scalable code, it is important to minimize the number of writes to shared memory locations. A positive way of saying this is that as long as the program does not frequently write to memory locations that are frequently being read or written, the program will scale very well. (<a href="http://sites.computer.org/debull/A19mar/p73.pdf">Optimistic lock coupling</a>, for example, is a general-purpose concurrency scheme for synchronizing data structures that exploits this observation.)<br /><br /></p><h2 style="text-align: left;">Summary</h2><p style="text-align: left;"><br />The full C++ memory model is notoriously hard to understand. x86, on the other hand, has a fairly strong memory model (<a href="https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf">x86-TSO</a>) that is quite intuitive: basically everything is in-order, except for writes, which are delayed by the write buffer. Exploiting x86's memory model, I presented a simplified subset of the C++ memory model that is sufficient to write scalable, high-performance concurrent programs on x86.<br /><br /></p>Viktor Leishttp://www.blogger.com/profile/09732217689829100056noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-76844031575004020082020-04-28T16:42:00.000+02:002020-04-28T16:42:09.804+02:00Linear Time Liveness AnalysisStandard compiler are usually used with hand-written programs. These programs tend to have reasonably small functions, and can be processed in (nearly) linear time. Generated programs however can be quite large, and compilers sometimes struggle to compile them at all. This can be seen with the following (silly) demonstration script:<br /> <br /> <pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: &quot;arial&quot;; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;"> import subprocess from timeit import default_timer as timer def doTest(size): with open("foo.cpp", "w") as out: print("int foo(int x) {", file=out) for s in range(size): p="x" if s==0 else f'l{s-1}' print (f'int l{s}; if (__builtin_sadd_overflow({p},1,&amp;l{s})) goto error;', file=out) print(f'return l{size-1};error: throw;}}', file=out); start = timer() subprocess.run(["gcc", "-c", "foo.cpp"]) stop = timer() print(size, ": ", (stop-start)) for size in [10,100,1000,10000,100000]: doTest(size) </code></pre> <br /> It generates one function with n statements of the form "<span style="background-color: #f0f0f0; font-family: &quot;arial&quot;; font-size: 12px;">int lX; if (__builtin_sadd_overflow(lY,1,&amp;lX)) goto error;</span>"&nbsp;which are basically just n additions with overflow checks, and then measures the compile time. The generated code is conceptually a very simple, but it contains a lot of basic blocks due to the large number of ifs. 
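For illustration, with n=3 the script writes a file equivalent to this (foo and the lX locals are exactly the names the script itself generates):<br /> <br /> <pre style="background: #f0f0f0; border: 1px dashed #cccccc; color: black; font-family: &quot;arial&quot;; font-size: 12px; height: auto; line-height: 20px; overflow: auto; padding: 0px; text-align: left; width: 99%;"><code style="color: black; word-wrap: normal;">int foo(int x) {
int l0; if (__builtin_sadd_overflow(x,1,&amp;l0)) goto error;
int l1; if (__builtin_sadd_overflow(l0,1,&amp;l1)) goto error;
int l2; if (__builtin_sadd_overflow(l1,1,&amp;l2)) goto error;
return l2;error: throw;}
</code></pre> <br />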
When compiling with gcc we get the following compile times:<br /> <br /> <table> <tbody> <tr><td>n</td><td>10</td><td>100</td><td>1,000</td><td>10,000</td><td>100,000</td></tr> <tr><td>compilation [s]</td><td>0.02</td><td>0.04</td><td>0.19</td><td>34.99</td><td>&gt; 1h</td></tr> </tbody></table> <br /> The compile time is dramatically super linear, gcc is basically unable to compile the function if it contains 10,000 ifs or more. In this simple example clang fares better when using -O0, but with -O1 it shows super-linear compile times, too. This is disastrous when processing generated code, where we cannot easily limit the size of individual functions. In <a href="https://umbra-db.com/">our own system</a>&nbsp;we use neither gcc nor clang for query compilation, but we have same problem, namely compiling large generated code. And super-linear runtime quickly becomes an issue when the input is large.<br /> <br /> One particular important problem in this context is <a href="https://en.wikipedia.org/wiki/Live_variable_analysis">liveness analysis</a>, i.e, figuring out which value is alive at which part of the program. The&nbsp;<a href="https://www.isi.edu/~pedro/Teaching/CSCI565-Spring16/Lectures/DataFlowAnalysis.part3.pdf">textbook solution</a> for that problem involves propagating liveness information for each variable across the blocks, but that is clearly super linear and does not scale to large program sizes. We therefore developed <a href="https://doi.org/10.1109/ICDE.2018.00027">a different approach</a>&nbsp;that we recently refined even more and the I want to present here:<br /> <br /> Instead of propagating liveness sets or bitmasks, we study the control flow graph of the program. For a simple queries with a for loop and an if within the loop it might look like this:<br /> <br /> <div class="separator" style="clear: both; text-align: center;"> <a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoGFukGNPxVlLXqhs3jIhqCFYEDm85WvUBcSAWfK3PsO_DhCuYdZxKTBRxo3_zRBhxERNvc0FwA0ofGyb4RYy5scB_GBJ3dG4klUwPrKd0-DEUdGDQoHkwrKR3_SjREAMAZiL_T5b_4qM/s1600/graphviz.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" data-original-height="443" data-original-width="256" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjoGFukGNPxVlLXqhs3jIhqCFYEDm85WvUBcSAWfK3PsO_DhCuYdZxKTBRxo3_zRBhxERNvc0FwA0ofGyb4RYy5scB_GBJ3dG4klUwPrKd0-DEUdGDQoHkwrKR3_SjREAMAZiL_T5b_4qM/s320/graphviz.png" width="184" /></a></div> <br /> <br /> Now if we define a variable x in block 0, and use it in block 2, the variable has to be alive on every path between definition and usage. Obviously that includes the blocks 0-2, but that is not enough. We see that there is a loop involving all loops between 1 and 4, and we can take that before coming to 2. Thus, the lifetime has to be extended to include the full loop, and is there range 0-4. If, however, we define a variable in 1 and use it in 2, the range is indeed 1-2, as we do not have to wrap around.<br /> <br /> Algorithmically, we identify all loops in the program, which we can do in <a href="https://dl.acm.org/doi/10.1145/316686.316687">almost linear time</a>, and remember how the loops are nested within each other.&nbsp;Then, we examine each occurrence of a variable (in SSA form). In principle the lifetime is the span from the first to the last occurrences. 
If, however, two occurrences are in different loops, we walk "up" from the lower loop level until we occur in the same loop (or top level), extending the lifetime to cover the full loop while doing so. In this example x in block 0 is top level, while x in block 2 is in loop level 1. Thus, we leave the loop, expand 2 into 1-4, and find the lifetime to be 0-4.<br /> <br /> This assumes, of course, that the block numbers are meaningful. We can guarantee that by first labeling all blocks in&nbsp;<a href="https://en.wikipedia.org/wiki/Depth-first_search#Vertex_orderings">reverse postorder</a>. This guarantees that all dominating blocks will have a lower number than their successors. We can further improve the labeling by re-arranging the blocks such that all blocks within a certain loop are next to each other, keeping the original reverse post order within the loop. This leads to nice, tight liveness intervals. Using just an interval instead of a bitmap is of course less precise, but the block reordering makes sure that the intervals primarily contain the blocks that are indeed involved in the execution.<br /> <br /> Asymptotically such an interval based approach is far superior to a classical liveness propagation. All algorithms involved are linear or almost linear, and we only to have to store two numbers per variable. When handling large, generated code such an approach is mandatory. And even for classical, smaller programs it is quite attractive. I looked around a bit at lectures about compiler construction, and I am somewhat surprised that nobody seems to teach similar techniques to handle large programs. When you cannot control the input size, super linear runtime is not an option.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com3tag:blogger.com,1999:blog-8891884956165580080.post-3953824914729907392020-01-30T14:19:00.002+01:002020-01-30T14:19:56.766+01:00All hash table sizes you will ever need<script src="https://cdn.jsdelivr.net/gh/google/code-prettify@master/loader/run_prettify.js"></script> <br /> <div style="text-align: justify;"> When picking a hash table size we usually have two choices: Either, we pick a prime number or a power of 2. Powers of 2 are easy to use, as a modulo by a power of 2 is just a bit-wise and, but 1) they waste quite a bit of space, as we have to round up to the next power of 2, and 2) they require "good" hash functions, where looking at just a subset of bits is ok.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Prime numbers are more forgiving concerning the hash function, and we have more choices concerning the size, which leads to less overhead. But using a prime number requires a modulo computation, which is expensive. And we have to find a suitable prime number at runtime, which is not that simple either.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Fortunately we can solve both problems simultaneously, which is what this blog post is about. We can tackle the problem of finding prime numbers by pre-computing suitable numbers with a given maximum distance. For example when when only considering prime numbers that are at least 5% away from each other we can cover the whole space from 0 to 2^64 with just 841 prime numbers. 
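A minimal sketch of this pre-computation could look as follows (purely illustrative: it uses naive trial division and is only meant for small limits; isPrime and buildPrimeList are made-up helper names):<br /> <pre class="prettyprint">#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Toy sketch: collect primes such that consecutive entries are roughly 5% apart.
static bool isPrime(uint64_t n) {
   if (n &lt; 2) return false;
   for (uint64_t d = 2; d * d &lt;= n; ++d)
      if ((n % d) == 0) return false;
   return true;
}

static std::vector&lt;uint64_t&gt; buildPrimeList(uint64_t limit) {
   std::vector&lt;uint64_t&gt; primes;
   uint64_t candidate = 2;
   while (candidate &lt;= limit) {
      uint64_t p = candidate;
      while (!isPrime(p)) ++p;    // next prime at or above the candidate
      primes.push_back(p);
      candidate = p + p / 20 + 1; // the next entry must be at least ~5% larger
   }
   return primes;
}</pre> <br />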
We can solve the performance problem by pre-computing the <a href="https://web.archive.org/web/20190703172151/http://www.hackersdelight.org/magic.htm">magic numbers</a> from <a href="https://en.wikipedia.org/wiki/Hacker%27s_Delight">Hacker's Delight</a> for each prime number in our list, which allows us to use multiplications instead of expensive modulo computations. And we can skip prime numbers with unpleasant magic numbers (i.e., the ones that require an additional add fixup), preferring the next cheap prime number instead.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> The resulting code can be found <a href="http://db.in.tum.de/~neumann/primes.hpp">here</a>. It contains every prime number you will ever need for hash tables, covering the whole 64-bit address space. Usage is very simple: we just ask for a prime number and then perform modulo operations as needed:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> <br /></div> <pre class="prettyprint">class HashTable {
   primes::Prime prime;
   vector&lt;Entry&gt; table;

   public:
   HashTable(uint64_t size) {
      prime = primes::Prime::pick(size);
      table.resize(prime.get());
   }
   ...
   Entry* getEntry(uint64_t hash) { return &amp;table[prime.mod(hash)]; }
   ...
};</pre> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> The performance is quite good. On an AMD 1950X, computing the modulo for 10M values (and computing the sum of the results) takes about 4.7ns per value when using a plain (x%p), but only 0.63ns per value when using p.mod(x).<br /> <br /> Getting this into <a href="https://github.com/gcc-mirror/gcc/blob/master/libstdc%2B%2B-v3/include/std/unordered_map">unordered_map</a> would be useful; it would probably improve the performance quite significantly when we have few cache misses. </div> Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com5tag:blogger.com,1999:blog-8891884956165580080.post-46553705072844277342019-07-24T15:40:00.002+02:002019-10-30T09:40:18.888+01:00Cuckoo Filters with arbitrarily sized tables<div style="text-align: justify;"> <a href="https://www.eecs.harvard.edu/~michaelm/postscripts/cuckoo-conext2014.pdf">Cuckoo Filters</a> are an interesting alternative to <a href="https://en.wikipedia.org/wiki/Bloom_filter">Bloom filters</a>. Instead of maintaining a filter bitmap, they maintain a small (cuckoo-)hash table of key signatures, which has several good properties. For example, it stores just the signature of a key instead of the key itself, but it is nevertheless able to move an element to a different position in the case of conflicts.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> This conflict resolution mechanism is quite interesting: Just like in regular cuckoo hash tables, each element has two potential positions where it can be placed, a primary position i1 and a secondary position i2. These can be computed as follows:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i1 = hash(x)</div> <div style="text-align: justify;"> i2 = i1 xor hash(signature(x))</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Remember that the cuckoo filter stores only the (small) signature(x), not x itself.
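For concreteness, a toy version of this index computation might look like the following (mix stands in for a real hash function; ToyCuckooFilter and positions are made-up names; the table size is a power of two here, matching the original design):<br /> <pre class="prettyprint">#include &lt;cstdint&gt;
#include &lt;vector&gt;

// Toy sketch: the filter stores one small signature per slot, never the key itself.
static uint64_t mix(uint64_t v) { v *= 0xff51afd7ed558ccdULL; return v ^ (v &gt;&gt; 33); }

struct ToyCuckooFilter {
   std::vector&lt;uint8_t&gt; slots; // one 8-bit signature per slot, size is a power of two
   explicit ToyCuckooFilter(uint64_t sizePow2) : slots(sizePow2) {}

   void positions(uint64_t x, uint64_t&amp; i1, uint64_t&amp; i2, uint8_t&amp; sig) const {
      uint64_t mask = slots.size() - 1;
      sig = static_cast&lt;uint8_t&gt;(mix(x) &gt;&gt; 56);
      if (!sig) sig = 1;           // keep the signature non-zero
      i1 = mix(x) &amp; mask;
      i2 = (i1 ^ mix(sig)) &amp; mask; // the secondary position
   }
};</pre> <br />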
Thus, when we encounter a value, we cannot know if it is at its position i1 or position i2. However, we can nevertheless alternate between positions because the following holds:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i1 = i2 xor hash(signature(x))</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> and we have the signature stored in the table. Thus, we can just use the self-inverse xor hash(signature(x)) to switch between i1 and i2, regardless of whether we are currently at i1 or i2. Which is a neat little trick. This allows us to switch between positions, which is used in the cuckoo filter conflict resolution logic.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> However, all this holds only because the original cuckoo filters use power-of-two hash tables. If our hash table size is not a power of 2, the xor can place the alternative position beyond the size of the hash table, which breaks the filter. Thus, cuckoo filter tables always had to be powers of two, even if that wasted a lot of memory.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> In&nbsp;<a href="http://www.vldb.org/pvldb/vol12/p502-lang.pdf">more recent work</a> Lang et al. proposed using cuckoo filters with size C, where C did not have to be a power of two, offering much better space utilization. They achieved this by using a different self-inverse function:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i1 = hash(x) mod C</div> <div style="text-align: justify;"> i2 = -(i1 + hash(signature(x))) mod C</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> Note that the modulo computation can be made reasonably efficient by using <a href="https://www.hackersdelight.org/magic.htm">magic numbers</a>, which can be precomputed when allocating the filter.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> A slightly different way to formulate this is to introduce a switch function f, which switches between positions:</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> f(i,sig,C) = -(i + hash(sig)) mod C</div> <div style="text-align: justify;"> i1 = hash(x) mod C</div> <div style="text-align: justify;"> i2 = f(i1, signature(x), C)</div> <div style="text-align: justify;"> i1 = f(i2, signature(x), C)</div> <div style="text-align: justify;"> </div> <div style="text-align: justify;"> </div> <div style="text-align: justify;"> All this works because f is <i>self-inverse</i>, i.e.,</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> i = f(f(i, signature(x), C), signature(x), C)</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> for all C&gt;0, i between 0 and C-1, and signature(x)&gt;0.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> The only problem is: Is this true? In a purely mathematical sense it is, as you can convince yourself by expanding the formula, but the cuckoo filters are not executed on abstract machines but on real CPUs. And there, something unpleasant happens: We can get numerical overflows of our integer registers, which implicitly introduces a modulo 2^32 into our computation.
Which breaks the self-inverseness of f in some cases, except when C is power of two itself.</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> <a href="https://db.in.tum.de/~kipf/">Andreas Kipf</a> noticed this problem when using the cuckoo filters with real world data. Which teaches us not to trust in formulas without additional extensive empirical validation... Fortunately we can repair the function f by using proper modular arithmetic. In pseudo-code this looks like this</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> f(i,sig,C)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; x=(C-1)-(hash(sig) mod C)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; if (x&gt;=i)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; return (x-i);</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; // The natural formula would be C-(i-x), but we prefer this one...</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; return C+(x-i);</div> <div style="text-align: justify;"> <br /></div> <div style="text-align: justify;"> This computes the correct wrap-around module C, at the cost of one additional if. We can avoid the if by using predication, as shown below</div> <br /> <div style="text-align: justify;"> f(i,sig,C)</div> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; x=(C-1)-(hash(sig) mod C)</div> &nbsp;&nbsp;&nbsp; m = (x&lt;i)*(~0) <br /> <div style="text-align: justify;"> &nbsp;&nbsp;&nbsp; return (m&amp;C)+(x-i); </div> <br /> <br /> <div style="text-align: justify;"> which can be attractive for SSE implementations where the comparison produces a bit mask anyway.</div> <br /> <div style="text-align: justify;"> We have validated that this new f function is now self-inverse for all possible values of i, sig, and C. And we did this by not just looking at the formula, but by trying out all values programmatically. Which is a good way to get confidence in your approach; there is only a finite number of combinations, and we can test them all.</div> <div style="text-align: justify;"> With this small fix, we can now enjoy Cuckoo Filters with arbitrarily sized tables.<br /> <br /> <b>Edit:</b>&nbsp; The original post did not mirror the hash space correctly (using C-... instead of (C-1)-...), thanks to Andreas Kipf for pointing this out.</div> Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com2tag:blogger.com,1999:blog-8891884956165580080.post-8930996313719056732019-05-16T16:44:00.001+02:002019-05-16T16:44:45.397+02:00Why use learning when you can fit?We recently had a talk by <a href="http://people.csail.mit.edu/kraska/">Tim Kraska</a> in our group, and he spoke among other things about <a href="https://arxiv.org/abs/1712.01208">learned indexes</a>. As I had mentioned before, I am more in favor of using suitably implemented&nbsp;<a href="http://databasearchitects.blogspot.com/2017/12/the-case-for-b-tree-index-structures.html">b-trees</a>, for reasons like update friendliness and distribution independence. But nevertheless, the talk made me curious: The model they are learning is in the end very primitive. It is a two-level linear model, i.e., they are using a linear function to select another linear function. But if that is enough, why do we need machine learning? 
A simple function fit should work just as well.<br /> <br /> Thus, I tried the following:<br /> 1) we sort all data and keep it in an array, just like with learned indexes<br /> 2) we build the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">CDF</a><br /> 3) we fit a linear spline to the CDF minimizing the <a href="https://en.wikipedia.org/wiki/Chebyshev_distance">Chebyshev norm</a><br /> 4) we fit a polynomial function to the spline nodes<br /> 5) now we can look up a value by evaluating first the polynomial function, then the spline, and then retrieving the values from the array. The previous step is always the seed to a local search in the next step.<br /> <br /> As we bound the Chebyshev norm in each step, the lookup is in O(1), without any need for machine learning or other magic boxes.<br /> <br /> Now admittedly there was some <a href="https://en.wikipedia.org/wiki/Weasel_word">weasel wording</a> in the previous paragraph: The lookup is in O(1), but the "constant" here is the Chebyshev norm of the fit, which means this only works well if we can find a good fit. But just the same is true for the learned indexes, of course.<br /> <br /> Now, how do we find a good fit? In theory we know how to construct the optimal fit in <a href="https://link.springer.com/article/10.1007/BF02570717">O(n log n)</a>, but that paper is beyond me. I am not aware of any implementation, and the paper is much too vague to allow for one. But constructing a good fit is much easier, and can also be done in <a href="https://link.springer.com/chapter/10.1007%2F978-3-540-70504-8_12">O(n log n)</a>. Using that algorithm, we can construct a linear spline that bounds the maximum error efficiently, and we know what the maximum is over the whole domain. Thus, we can probe the spline to get an estimate for the real value position, and we can then perform an efficient local search on a small, known window of the data.<br /> <br /> The only problem is evaluating the spline itself. Evaluating a linear spline is pretty cheap, but we have to find the appropriate knot points to evaluate. Traditionally, we find these with binary search again. Note that the spline is much smaller than the original data, but still we want to avoid the binary search. Thus, we construct a polynomial function to predict the spline knot, again minimizing the Chebyshev norm, which allows us to consider only a small subset of spline nodes, leading to the aforementioned time bound.<br /> <br /> How well does this work in practice? On the map data set from the learned indexes paper and a log normal data set we get the following. (The learned indexes numbers are from the paper, the b-tree numbers are from <a href="http://databasearchitects.blogspot.com/2017/12/the-case-for-b-tree-index-structures.html">here</a>, and the spline numbers are from these experiments.
I still do not really know what the averages mean for the learned indexes, but they are probably the errors averaged over all models.)<br /> <br /> <br /> <table> <tbody> <tr><td>Map data</td><td>size (MB)</td><td>avg error</td></tr> <tr><td>Learned Index (10,000)</td><td>0.15</td><td>8 ± 45</td></tr> <tr><td>Learned Index (100,000)</td><td>1.53</td><td>2 ± 36</td></tr> <tr><td>B-tree (10,000)</td><td>0.15</td><td>225</td></tr> <tr><td>B-tree (100,000)</td><td>1.53</td><td>22</td></tr> <tr><td>Spline (10,000)</td><td>0.15</td><td>193</td></tr> <tr><td>Spline (100,000)</td><td>1.53</td><td>22</td></tr> </tbody> </table> <br /> <table> <tbody> <tr><td>Log normal data</td><td>size (MB)</td><td>avg error</td></tr> <tr><td>Learned Index (10,000)</td><td>0.15</td><td>17,060 ± 61,072</td></tr> <tr><td>Learned Index (100,000)</td><td>1.53</td><td>17,005 ± 60,959</td></tr> <tr><td>B-tree (10,000)</td><td>0.15</td><td>1330</td></tr> <tr><td>B-tree (100,000)</td><td>1.53</td><td>3</td></tr> <tr><td>Spline (10,000)</td><td>0.15</td><td>153</td></tr> <tr><td>Spline (100,000)</td><td>1.53</td><td>1</td></tr> </tbody></table> <br /> <br /> Somewhat surprisingly, the accuracy of the spline is nearly identical to that of the interpolating b-tree for the real-world map data, which suggests that the separators span the domain reasonably well there. For the log normal data the spline is significantly better, and leads to nearly perfect predictions. Note that the original data sets contain many millions of data points in both cases, thus the prediction accuracy is really high.<br /> <br /> For practical applications I still recommend the B-tree, of course, even though the polynomial+spline solution is in "O(1)" while the B-tree is in O(log n). I can update a B-tree just fine, including concurrency, recovery, etc., while I do not know how to do that with the polynomial+spline solution.<br /> But if one wants to go the read-only/read-mostly route, the fitted functions could be an attractive alternative to machine learning. The advantage of using fits is that the algorithms are reasonably fast, we understand how they work, and they give strong guarantees for the results.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com3tag:blogger.com,1999:blog-8891884956165580080.post-56205239612796451322019-02-01T09:47:00.002+01:002019-02-01T11:18:35.343+01:00Honest asymptotic complexity for search trees and tries<div style="text-align: justify;"> Fun fact that was pointed out to me by <a href="http://db.in.tum.de/~leis/index.shtml">Viktor</a>: All traditional books on algorithms and data structures that we could find gave the lookup costs of <a href="https://en.wikipedia.org/wiki/Red%E2%80%93black_tree">balanced search trees</a> as <i>O(log n)</i> (i.e., the depth of the search tree), and the lookup costs of <a href="https://en.wikipedia.org/wiki/Trie">tries</a> as <i>O(k)</i> (i.e., the length of the key). At first glance this is a logarithmic time lookup against a linear time lookup, which makes people nervous when thinking about long keys.<br /> <br /> But that argument is very unfair: Either we consider a key comparison an <i>O(1)</i> operation, then a tree lookup is indeed in <i>O(log n)</i>, but then a trie lookup is in <i>O(1)</i>! As the length of the key has to be bounded by a constant to get <i>O(1)</i> comparisons, the depth of the trie is bounded, too. Or the length of the key matters, then a trie lookup is indeed in <i>O(k)</i>, but then a tree lookup is in <i>O(k log n)</i>.
We have to compare with the key on every level, and if we are unlucky we have to look at the whole key, which gives the factor <i>k</i>.<br /> <br /> Which of course makes tries much more attractive asymptotic wise. Note that we ignore wall clock times here, which are much more difficult to predict, but in many if not most cases tries are indeed <a href="http://db.in.tum.de/~leis/papers/ART.pdf">much faster than search trees</a>.<br /> <br /> I believe the reason why text books get away with this unfair comparison is that they all present balanced search trees with integer keys:<br /> <br /> <svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" id="svg8" version="1.1" viewBox="0 0 163.52975 123.46429" height="123.46429mm" width="163.52975mm"> <defs id="defs2" /> <metadata id="metadata5"> <rdf:RDF> <cc:Work rdf:about=""> <dc:format>image/svg+xml</dc:format> <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> <dc:title /> </cc:Work> </rdf:RDF> </metadata> <g transform="translate(-1.0119057,-32.672617)" id="layer1"> <ellipse ry="12.851191" rx="18.142857" cy="46.023808" cx="102.80952" id="path834" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="92.89286" cx="52.916668" id="path834-2" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.78572" cx="19.654762" id="path834-9" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.02975" cx="89.958336" id="path834-96" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="91.380951" cx="145.8988" id="path834-0" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <path id="path869" d="M 90.238722,55.464545 64.236102,82.333922" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path871" d="m 116.04873,55.079323 23.9802,24.0765" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path873" d="m 42.37464,103.52124 -19.2612,26.29154" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path875" d="m 64.139796,103.23232 23.306053,26.09893" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <text id="text881" y="47.821632" x="101.23959" 
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="47.821632" x="101.23959" id="tspan879">8</tspan></text> <text id="text885" y="93.178772" x="142.64801" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="93.178772" x="142.64801" id="tspan883">10</tspan></text> <text id="text889" y="94.693092" x="51.363621" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="94.693092" x="51.363621" id="tspan887">4</tspan></text> <text id="text893" y="144.58595" x="18.040218" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="144.58595" x="18.040218" id="tspan891">1</tspan></text> <text id="text897" y="143.82758" x="88.370316" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="143.82758" x="88.370316" id="tspan895">6</tspan></text> </g> </svg> <br /> While tries are traditionally introduced with string examples. 
If they had used string keys for balanced search trees instead it would have been clear that the key length matters:<br /> <br /> <svg xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:cc="http://creativecommons.org/ns#" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:svg="http://www.w3.org/2000/svg" xmlns="http://www.w3.org/2000/svg" id="svg8" version="1.1" viewBox="0 0 163.52975 123.46429" height="123.46429mm" width="163.52975mm"> <defs id="defs2" /> <metadata id="metadata5"> <rdf:RDF> <cc:Work rdf:about=""> <dc:format>image/svg+xml</dc:format> <dc:type rdf:resource="http://purl.org/dc/dcmitype/StillImage" /> <dc:title /> </cc:Work> </rdf:RDF> </metadata> <g transform="translate(-1.0119057,-32.672617)" id="layer1"> <ellipse ry="12.851191" rx="18.142857" cy="46.023808" cx="102.80952" id="path834" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="92.89286" cx="52.916668" id="path834-2" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.78572" cx="19.654762" id="path834-9" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="142.02975" cx="89.958336" id="path834-96" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <ellipse ry="12.851191" rx="18.142857" cy="91.380951" cx="145.8988" id="path834-0" style="opacity:1;fill:none;fill-opacity:1;stroke:#000000;stroke-width:1;stroke-linecap:butt;stroke-linejoin:round;stroke-miterlimit:4;stroke-dasharray:none;stroke-dashoffset:0;stroke-opacity:1" /> <path id="path869" d="M 90.238722,55.464545 64.236102,82.333922" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path871" d="m 116.04873,55.079323 23.9802,24.0765" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path873" d="m 42.37464,103.52124 -19.2612,26.29154" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <path id="path875" d="m 64.139796,103.23232 23.306053,26.09893" style="fill:none;fill-rule:evenodd;stroke:#000000;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" /> <text id="text881" y="47.821632" x="94.423294" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="47.821632" x="94.423294" id="tspan879">ABCD8</tspan></text> <text 
id="text885" y="93.178772" x="135.93541" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="93.178772" x="135.93541" id="tspan883">ABCD10</tspan></text> <text id="text889" y="94.693092" x="44.500301" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="94.693092" x="44.500301" id="tspan887">ABCD4</tspan></text> <text id="text893" y="144.58595" x="11.327622" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="144.58595" x="11.327622" id="tspan891">ABCD1</tspan></text> <text id="text897" y="143.82758" x="81.558846" style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:4.93888855px;line-height:137.00000048%;font-family:'DejaVu Sans';-inkscape-font-specification:'DejaVu Sans, Normal';text-align:start;letter-spacing:0px;word-spacing:0px;writing-mode:lr-tb;text-anchor:start;fill:#000000;fill-opacity:1;stroke:none;stroke-width:0.26458332px;stroke-linecap:butt;stroke-linejoin:miter;stroke-opacity:1" xml:space="preserve"><tspan style="stroke-width:0.26458332px" y="143.82758" x="81.558846" id="tspan895">ABCD6</tspan></text> </g> </svg> <br /> The trie examines every key byte at most once, while the search tree can examine every key byte <i>log n</i> times. Thus, the asymptotic complexity of tries is actually better than that of balanced search tries. </div> Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com0tag:blogger.com,1999:blog-8891884956165580080.post-78766450294002209972018-06-08T09:06:00.000+02:002018-06-08T09:06:19.300+02:00Propagation of Mistakes in PapersWhile reading papers on cardinality estimation I noticed something odd: The seminal paper by Flajolet and Martin on <a href="http://algo.inria.fr/flajolet/Publications/FlMa85.pdf">probabilistic counting</a> gives a bias correction constant as 0.77351, while a <a href="https://pdfs.semanticscholar.org/6558/d618556f812328f969b60b5f97dae6f940c8.pdf">more recent (and very useful) paper</a> by Scheuermann and Mauve gives the constant as 0.775351. Was this a mistake? 
Or did they correct a mistake in the original paper?<br /> <br /> I started searching, and there is a <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=8&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwjgvIrUssPbAhWFTcAKHTfJCiMQFjAHegQIARBW&amp;url=http%3A%2F%2Fwww.cs.bu.edu%2Fgroups%2Fdblab%2Fpub_pdfs%2FTODS-sketch.pdf&amp;usg=AOvVaw22RNkei-ZhF7aGs79j42KI">large</a> <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.571.2661&amp;rep=rep1&amp;type=pdf">number</a> <a href="http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.571.2661&amp;rep=rep1&amp;type=pdf">of</a> <a href="http://129.94.242.51/~lxue/paper/adc10.pdf">papers</a> that use the value 0.775351, but there is also a <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=19&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwiZmYuPu8PbAhWiNJoKHXaUCdY4ChAWMAh6BAgBEEw&amp;url=https%3A%2F%2Fwww.cs.upc.edu%2F~conrado%2Fresearch%2Ftalks%2Faofa2012.pdf&amp;usg=AOvVaw30yc0h3Ot7acCXpiD5WJMM">number</a> <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=24&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwi5pda3u8PbAhUDyaYKHcAuDdA4FBAWMAN6BAgBEDw&amp;url=http%3A%2F%2Fwww.vldb.org%2Fpvldb%2Fvol11%2Fp499-harmouch.pdf&amp;usg=AOvVaw08J-URRirbavlZnK7XleeV">of</a> <a href="https://www.google.com/url?sa=t&amp;rct=j&amp;q=&amp;esrc=s&amp;source=web&amp;cd=36&amp;cad=rja&amp;uact=8&amp;ved=2ahUKEwjwyebSu8PbAhXsDZoKHfHVDTQ4HhAWMAV6BAgBEEo&amp;url=https%3A%2F%2Fdomino.mpi-inf.mpg.de%2Fintranet%2Fag5%2Fag5publ.nsf%2F951b6516df8639d3c1256464004a33d0%2Fd37106f9c337bcc3c12571550046667a%2F%24file%2Fntarmostw06.pdf&amp;usg=AOvVaw1LFk16OkCGibsGHAp2o3mM">papers</a> that use the value 0.77351. Judging by the number of Google hits for "Flajolet 0.77351" vs. "Flajolet 0.775351", the 0.77351 group seems to be somewhat larger, but both camps have a significant number of publications. Interestingly, not a single paper mentions both constants, and thus no paper explains what the correct constant should be.<br /> <br /> In the end I repeated the constant computation as explained by Flajolet, and the correct value is 0.77351. We can even derive one digit more when using double arithmetic (i.e., 0.773516), but that makes no difference in practice. Thus, the original paper was correct.<br /> <br /> But why do so many papers use the incorrect value 0.775351 then? My guess is that at some point somebody made a typo while writing a paper, introducing the superfluous digit 5, and that all other authors copied the constant from that paper without re-checking its value. I am not 100% sure what the origin of the mistake is. The incorrect value seems to appear first in the year 2007, showing up in multiple publications from that year. Judging by publication date the source seems to be <a href="http://www.cse.unsw.edu.au/~yingz/papers/icde_2007_k-skyline.pdf">this paper</a> (at least it does not cite any other paper with the incorrect value, as far as I know). And everybody else just copied the constant from somewhere else, propagating it from paper to paper.<br /> <br /> If you find this web page because you are searching for the correct Flajolet/Martin bias correction constant, I can assure you that the original paper was correct, and that the value is 0.77351.
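<br /> <br /> For context, this is the constant that turns the observed bit pattern into a cardinality estimate: the position R of the lowest zero bit in the bitmap satisfies E[R] ≈ log<sub>2</sub>(0.77351 · n), so the number of distinct values is estimated as 2<sup>R</sup> / 0.77351. A toy single-bitmap sketch in C++ (my own illustration, not the linked fmconstant.cpp; a simple bit mixer stands in for the hash function, and the real algorithm averages over many bitmaps to reduce the variance) shows where the constant enters:<br /> <br /> <pre>#include &lt;cmath&gt;
#include &lt;cstdint&gt;
#include &lt;iostream&gt;

// splitmix64 finalizer, used here as a stand-in for a proper uniform hash
static uint64_t hash64(uint64_t x) {
   x += 0x9e3779b97f4a7c15ull;
   x = (x ^ (x &gt;&gt; 30)) * 0xbf58476d1ce4e5b9ull;
   x = (x ^ (x &gt;&gt; 27)) * 0x94d049bb133111ebull;
   return x ^ (x &gt;&gt; 31);
}

int main() {
   const double phi = 0.77351;  // the (correct) bias correction constant
   const uint64_t n = 1000000;  // number of distinct values we insert

   uint64_t bitmap = 0;
   for (uint64_t v = 1; v &lt;= n; ++v) {
      uint64_t h = hash64(v);
      unsigned r = 0;           // rank of the lowest set bit of the hash value
      while (r &lt; 63 &amp;&amp; !((h &gt;&gt; r) &amp; 1)) ++r;
      bitmap |= uint64_t(1) &lt;&lt; r;
   }

   unsigned R = 0;              // position of the lowest zero bit in the bitmap
   while ((bitmap &gt;&gt; R) &amp; 1) ++R;

   double estimate = std::pow(2.0, R) / phi;  // E[R] ~ log2(phi*n), hence n ~ 2^R/phi
   std::cout &lt;&lt; "exact: " &lt;&lt; n &lt;&lt; ", estimate: " &lt;&lt; estimate &lt;&lt; std::endl;
}</pre>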
But you do not have to take my word for the value: you can just <a href="http://db.in.tum.de/~neumann/fmconstant.cpp">repeat the computation</a> yourself.Thomas Neumannhttp://www.blogger.com/profile/15209393663505917383noreply@blogger.com1