The ever-growing volume of data and the power to leverage it in real-time are well-recognized by today’s most successful businesses. However, keeping data pipelines cost-efficient, accessible, and flexible at scale remains a challenge.
This article delves into how leading enterprises bypass the hurdle of intricate preprocessing pipelines. It examines Airbnb’s innovative strategy for ditching denormalization to achieve efficient, scalable, on-the-fly JOINs using open-source database technologies.
Denormalization: The Status Quo of Real-Time Preprocessing Pipelines
Query latency is a critical component of real-time analytics, but keeping latency under control can be difficult due to the resource-intensive nature of JOIN operations. Not all real-time OLAP databases are optimized to perform them on the fly. To overcome this, many data practitioners turn to denormalization: pre-joining multiple tables into one during data preprocessing. This “locks” the data from its natural snowflake or star schema into a giant flat table and allows engineers to bypass the need for on-the-fly JOINs.
But this approach isn’t without costs, and taken together, these costs add up. A helpful example of these costs is best illustrated in the evolution of Airbnb’s Minerva metric management platform. The initial generation of the platform relied on Apache Druid as its backend storage and query engine, which required denormalization and presented the following challenges:
Because of Druid’s limited JOIN performance, everything needed to be denormalized. This created a problem because adding a metric (a schema change) required reconfiguring all of the associated denormalization pipelines and backfilling the data, which could take hours or days. For a metric platform that was home to 30,000+ metrics and supported all of Airbnb’s data-driven decision-making, it was simply unacceptable. Apache Druid isn’t unique in this limitation; any solution that demands denormalization suffers from this.
While Airbnb had to contend with limited flexibility, they also faced issues with complexity. Stream processing tools, while powerful for real-time processing like denormalization, introduce significant hoops to jump through. These can range from intricate setup and maintenance and data checkpointing, to state management and fault tolerance.
Marrying the intricacies of these tools with the expensive stateful tasks of denormalization only amplifies the challenges, especially when you need to ensure consistent data across streams without latency or discrepancies. On top of this, stream processing tools are infra-level software in your technology stack, which is the user’s responsibility to maintain and adds another layer of complexity.
The inflexibility demonstrated by platforms like Airbnb’s Minerva 1.0, combined with the inherent complexities of stream processing tools, all converge toward a singular outcome: elevated cost. Adjusting denormalization pipelines and maintaining intricate streaming infrastructure not only devours your time but also significantly inflates operational costs in the form of labour and technology resources.
Embracing JOINs: Airbnb’s Fraud Detection System
Even if you are willing to accept the costs associated with denormalization, you may still find yourself having to tackle JOIN operations. This was certainly the case for Airbnb, who not only struggled with their initial version of Minerva mentioned above but also faced similar challenges with JOINs when it came to fraud detection.
With users spanning across the globe and millions of available listings, it’s imperative for Airbnb to detect and prevent fraudulent activities instantly. Through real-time data analysis, Airbnb can quickly spot suspicious patterns, flag questionable bookings, and detect fake reviews to avert potential losses and maintain user trust.
The nature of fraud detection inevitably generates complex queries with multiple JOINs that require intricate analyses. Unfortunately, due to the wide range and uncertainty of violations, there is no unique profile for each query. Therefore, even if Airbnb data engineers were willing to conduct denormalization, it would be nearly impossible to do so.
Previously, Airbnb relied on an in-house Druid cluster to support real-time analytics. As mentioned above, because Airbnb couldn’t substitute JOINs with denormalization in this scenario, they began to struggle with their fraud detection operations. Airbnb’s initial solution to this was to turn to its Presto cluster to run fraud detection in batch. However, this meant allowing hours or even days for fraudulent actions to persist before they were detected. This presented a huge risk in terms of compromised user trust and financial losses.
To address these challenges, Airbnb began to search for open-source solutions capable of handling JOINs in real-time. They eventually settled on StarRocks, an OLAP database focused on handling complex JOINs in real-time.
This new architecture enables:
- Data to be ingested (with data upserts) in real-time
- Users to run ad hoc complex JOIN queries on the fly and get results returned in seconds
Adopting an approach that embraces JOINs instead of solutions that lean on denormalization has shortened the time required for trust analytics from days to seconds. Now, Airbnb is able to respond rapidly to potential threats, preventing costly losses and providing a safer and more trustworthy experience for both hosts and guests.
What This All Adds Up To
Denormalization may seem like a smart solution to manage your data pipelines, but it comes with costs that you need to be aware of to make the right choice for your analytics needs. If you’re working with real-time scenarios, it’s a good time to look toward the latest generation of analytics solutions that can tackle high-volume, complex JOINs. The savings are worth the switch.
CelerData enables enterprises to quickly and easily grow their business with a real-time analytical engine that provides three times the performance and cost of any other solution on the market. Powered by StarRocks, CelerData is the only platform uniquely designed for the next generation real-time enterprise, unleashing the power of business intelligence to help accelerate enterprise digital transformation. Used worldwide by market-leading brands including Airbnb, Lenovo, and Trip.com, CelerData generates critical new insights for data-driven companies.