Adam Selipsky, CEO of AWS, announced a number of new products and capabilities in his keynote at AWS re:Invent 2022 in Las Vegas. One of those announcements is the move to zero-ETL (Extract, Transform, Load). It’s a bold claim and one that, if adopted, will have significant benefits for a lot of organisations.
Data is still siloed by application. For users and applications to get the most out of that data, it has to be combined, often into new data stores. This is not a new task. It has been happening for decades as we extract data for reporting, analysis, marketing and other uses. The problem is that the process involved, ETL, can be long and arduous, and consume a lot of compute cycles and other resources. It also needs to be done again and again and again.
Selipsky quoted from an email he received from a customer discussing ETL. He said they used the phrase, “thankless, unsustainable black hole.” It’s a phrase that will resonate with anyone who has worked with multiple datasets for any length of time.
What is AWS doing about this?
According to Selipsky, it is a problem AWS has been trying to solve for several years. He said, “We’ve been working for a few years now, building integrations between our services to make it easier to do analytics and machine learning without having to deal with the ETL muck.
“For instance, we have federated query capabilities in both Redshift and Athena, so that you can run queries across a wide range of databases and data stores, and even third-party applications and other clouds without moving any data. To make it easy for customers to enrich all their data.
“AWS Data Exchange seamlessly integrates with Redshift and enables you to access and combine third-party datasets with your own data in Redshift, no ETL required. We’ve integrated SageMaker with Redshift and Aurora to enable anyone with SQL skills to operate machine-learning models and make predictions also without having to move data around. These integrations eliminate the need to move data around for some important use cases.”
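For readers unfamiliar with federated queries, the sketch below shows roughly what this looks like in practice. It is a minimal Python example using the Redshift Data API; the cluster name, secret ARN, schemas and table names are hypothetical, and the external schema pointing at Aurora is assumed to have been created beforehand.

```python
import boto3

# Redshift Data API client - runs SQL against a cluster without managing connections.
client = boto3.client("redshift-data")

# Hypothetical federated query: join a local Redshift table with a table that
# stays in an Aurora PostgreSQL database, exposed through an external schema
# (created beforehand with CREATE EXTERNAL SCHEMA ... FROM POSTGRES ...).
sql = """
SELECT o.order_id, o.order_total, c.segment
FROM sales.orders AS o              -- local Redshift table
JOIN aurora_pg.customers AS c       -- federated Aurora PostgreSQL table
  ON o.customer_id = c.customer_id
WHERE o.order_date >= '2022-01-01';
"""

response = client.execute_statement(
    ClusterIdentifier="analytics-cluster",  # hypothetical cluster
    Database="dev",
    SecretArn="arn:aws:secretsmanager:eu-west-1:123456789012:secret:redshift-creds",  # hypothetical
    Sql=sql,
)

# The call is asynchronous; check its status (and fetch rows with
# get_statement_result once it has finished).
print(client.describe_statement(Id=response["Id"])["Status"])
```

The point is that the Aurora table is queried where it lives rather than being copied into Redshift first, which is what Selipsky means by enriching data without moving it.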
The next step is zero-ETL
Integrating and moving data without ETL is nothing new. Several companies have done it with their own products over the years. The problem is that, in most cases, it tends to be more an alignment of their products than a wider solution to the ETL problem.
Selipsky wants to go further. He asked, “What if we could eliminate ETL entirely?”
His solution is what AWS is calling a zero-ETL future. The goal is to remove the need to manually build an ETL pipeline. Instead, it will be replaced by an automated tool in which you define the data and tables you want to bring together, and the integration is done for you.
The first product is a preview of a new, fully managed zero-ETL integration between Aurora and Redshift. Selipsky says it will “eliminate all of the work of building and managing custom data pipelines between Aurora and Redshift.” Importantly, it will also continuously propagate the data as it detects changes. It won’t be real time, but Selipsky says it will be near real time.
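At announcement the integration is a preview set up through the console. Purely as a sketch of what that configuration amounts to, and assuming a CreateIntegration-style call in the RDS API (the call name, parameters and ARNs below are illustrative assumptions, not documented preview behaviour), it might look something like this in Python:

```python
import boto3

rds = boto3.client("rds")

# Assumed shape of the call: point the integration at an Aurora cluster (source)
# and a Redshift data warehouse (target); AWS then handles the replication and
# near real-time change propagation described in the keynote.
integration = rds.create_integration(
    SourceArn="arn:aws:rds:eu-west-1:123456789012:cluster:orders-aurora",        # hypothetical
    TargetArn="arn:aws:redshift:eu-west-1:123456789012:namespace:analytics-ns",  # hypothetical
    IntegrationName="orders-to-redshift",
)

print(integration["Status"])  # once active, changes flow continuously without a pipeline
```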
But is this really zero-ETL?
It’s a good question. What Selipsky has described is effectively a simple automated process to get data from Aurora into Redshift. But what if you want to bring in data from multiple sources where you need to normalise it? How will a point-and-click process build the pipeline to do that normalisation?
At the AWS booth on the show floor, that question went unanswered. Looking deeper, it seems some manual processes will still be required when the data being brought in cannot simply be replicated into Redshift to be useful.
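To make that gap concrete, here is a purely illustrative Python fragment showing the kind of normalisation that straight replication does not cover. The sources, field names and rules are invented for the example.

```python
from datetime import datetime

# Two invented sources describe the same customer in different formats.
crm_record = {"customer_id": "C-00042", "signed_up": "03/11/2022"}
billing_record = {"cust": 42, "country_code": "GB"}

def normalise(crm: dict, billing: dict) -> dict:
    """Reconcile formats before the combined record is useful in the warehouse."""
    return {
        "customer_id": int(crm["customer_id"].removeprefix("C-")),
        "signed_up": datetime.strptime(crm["signed_up"], "%d/%m/%Y").date().isoformat(),
        "country": billing["country_code"],  # prefer ISO codes from billing
    }

print(normalise(crm_record, billing_record))
# {'customer_id': 42, 'signed_up': '2022-11-03', 'country': 'GB'}
```

Somebody still has to decide and encode those rules, which is exactly the work zero-ETL promises to remove.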
That manual work is disappointing but not unexpected. The positive side is that anything that reduces the amount of time spent building ETL is to be welcomed. If the automated pipeline can then do continuous data import based on changes in the source data, something Selipsky says it will, that is also a bonus.
The question now is how AWS will deliver properly on its zero-ETL promise. How will it enable data teams to eliminate complex ETL processes? At present, we just don’t know, and AWS needs to clarify this.
Enterprise Times: What does this mean?
There is a huge promise here in removing the pain that ETL causes. It is one that will get a lot of attention as people look to see exactly what it delivers. Those with simple data replication and import needs will be happy with the automation and continuous replication now available. However, those with more complex requirements will want to know how they can really benefit from this feature.
According to Selipsky, “We’re going to keep on innovating here, finding new ways to make it easier for you to access and analyse data across all of your data stores.”
We look forward to those innovations and to seeing how much more they can offer in reducing the burden of ETL.