<aside> 🔥 Original blogpost on our website

</aside>

<aside> 🔥 Synnada helps you build interactive, collaborative, intelligent, real-time products for your mission-critical systems, within minutes and with ease.

</aside>

As the Lambda architecture that separates batch and stream processing systems gives way to the Kappa architecture which argues for a unified approach, a significant amount of attention is again focusing on stream joins in the data space. At Synnada, we strongly support the ideas and motivations behind the Kappa architecture, and we have been working hard on readying Apache Datafusion for this transformation.

Today, we are excited to announce the addition of Symmetric Hash Join to Apache Datafusion. This new feature will greatly enhance the capabilities of our query processing engine and give users even more powerful tools for querying, analyzing, and managing large datasets in real time.

Symmetric Hash Join will integrate into DataFusion's existing API seamlessly, fully compatible with Datafusion's existing static data sources and streams. Whether you have static data with cardinality constraints or streaming data, Symmetric Hash Join will help you process large datasets faster and more efficiently, enabling you to generate low-latency, data-driven results.

Our implementation of this feature includes an advanced optimization technique called build-side pruning, which allows the algorithm to detect early on when certain rows no longer matter to the join result and remove them from memory. This results in a significant performance boost and more efficient use of computational resources.

In this blog post, we'll take a closer look at the Symmetric Hash Join algorithm. We will explain how it works, its advantages, and the benefits of using it. We'll also show you how to use it in your Datafusion pipeline and provide tips and tricks for getting the most out of it.

We hope that this new addition will help you to achieve your big data goals more easily and with greater speed, and we welcome your feedback and questions.

Symmetric Hash Join

A brief explanation

To explain the Symmetric Hash Join (SHJ) algorithm, let’s first quickly remind ourselves how the ordinary Hash Join algorithm works. At a high-level, the latter works in a two-step fashion: It first consumes one of its inputs (the build side) to build a hash table, and then consults this hash table while consuming its other input (the probe side) to determine matches. The SHJ algorithm eliminates this build/probe side asymmetry by maintaining a hash table for each input, and consults these tables continuously as it consumes both its inputs.

When is it appropriate?

The SHJ algorithm is typically useful when quick response times are needed, or when inputs are unbounded. It can be used in various types of data processing tasks, such as:

Real-time streaming: The SHJ algorithm can be used in real-time streaming scenarios to join data from multiple streams with low latency. For example, it can be used to join data from IoT devices with data from other sources, such as weather data.
Large datasets with low cardinality: The SHJ algorithm can be used to join large datasets with low cardinality by exploiting sliding window join properties. When the low cardinality columns are used as keys in sliding window joins, a lot of unnecessary computational work is done when we first check the equality, making the join operation inefficient. This can lead to long processing times and high resource usage, making it difficult to meet low-latency processing requirements.

How does it work?

Context: Query processing and streams

In a query processing engine, data flows in between different operators or stages within the engine, and each operator or stage performs a specific operation on the data. The query planning step is responsible for determining the optimal execution plan for the query, which simply means selecting the applicable operators, configuring them and connecting them in some optimal sense.

However, when processing streams, this is not as straightforward as it sounds, since not all operators are stream-friendly. Stream-unfriendly operators are often called pipeline-breaking operators, and we need to get the job done by utilizing only stream-friendly operators that allow our data streams to flow smoothly through the entire pipeline.

This means that none of our operators should block processing by waiting for a large amount of input data without producing any output, or by writing intermediate results to disk (if possible). While some operations (such as sorting) are inherently not compatible with this concept (and thus are pipeline-breaking), many use cases allow solutions involving only stream-friendly operators.