Amazon is making it easier to analyze transactional data held in its Aurora PostgreSQL and DynamoDB databases by removing the need for ETL routines when moving that data into its Redshift data warehouse.
Aurora is Amazon's relational database service, built for its public cloud with full MySQL and PostgreSQL compatibility: Aurora MySQL and Aurora PostgreSQL are drop-in replacements for those two open-source relational database management systems (RDBMSes), both of which are designed for transactional data. As Amazon puts it: "They store data in tables that are interrelated to each other via common column values."
DynamoDB is Amazon's fully managed, proprietary NoSQL database for non-relational workloads. Such databases store data within a single data structure, such as a JSON document, instead of the tabular structure of a relational database.
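To make the contrast concrete, here is an illustrative sketch (with hypothetical data) of the same order held relationally, in tables related via a common column value, versus as a single nested document of the kind a NoSQL store like DynamoDB holds:

```python
# Relational: two tables, related via the shared customer_id column.
customers = [{"customer_id": 42, "name": "Ada"}]
orders = [{"order_id": 1, "customer_id": 42, "total": 19.99}]

# A query joins the tables on the common column value.
joined = [
    {**o, "name": c["name"]}
    for o in orders
    for c in customers
    if o["customer_id"] == c["customer_id"]
]

# Document: everything about the order lives in one nested structure.
order_doc = {
    "order_id": 1,
    "total": 19.99,
    "customer": {"customer_id": 42, "name": "Ada"},
}
```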
Redshift is Amazon's cloud data warehouse for data analytics. It needs to be fed transactional data before its analytic processes can work on that data. Normally, ETL (Extract, Transform, and Load) routines are used to select datasets from a source, transform them, and move them into a target data warehouse. The zero-ETL concept does away with such routines by building the necessary integration functions into the source databases.
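As a minimal sketch of what zero-ETL removes, a classic ETL routine has three steps (the table shapes and the transform below are illustrative, not AWS code):

```python
def extract(source_rows):
    """Extract: select the dataset of interest from the transactional source."""
    return [r for r in source_rows if r["status"] == "complete"]

def transform(rows):
    """Transform: reshape rows into the warehouse schema."""
    return [{"order_id": r["id"], "total_cents": round(r["total"] * 100)} for r in rows]

def load(rows, warehouse):
    """Load: append the transformed rows to the target warehouse table."""
    warehouse.extend(rows)
    return warehouse

warehouse = []
source_orders = [
    {"id": 1, "status": "complete", "total": 19.99},
    {"id": 2, "status": "pending", "total": 5.00},  # filtered out at extract
]
load(transform(extract(source_orders)), warehouse)
```

With a zero-ETL integration, none of this pipeline code exists: the source database replicates data to the warehouse itself.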
There are existing, generally available zero-ETL integrations for Aurora MySQL and Amazon RDS for MySQL with Redshift, which enable customers to combine data from multiple relational and non-relational databases in Redshift for analysis. Amazon RDS (Relational Database Service) is a managed SQL database service that supports the Aurora, MySQL, PostgreSQL, MariaDB, Microsoft SQL Server, and Oracle database engines.
Amazon claims zero-ETL integrations mean that IT staff no longer have to build, manage, and operate ETL pipelines. AWS Senior Solutions Architect and blogger Esra Kayabali says zero-ETL: "automates the replication of source data to Amazon Redshift, simultaneously updating source data for you to use in Amazon Redshift for analytics and machine learning (ML)."
You still need an ETL function, but these AWS products now set up, operate, and manage the whole thing. Kayabali blogs: "To create a zero-ETL integration, you specify a source and Amazon Redshift as the target. The integration replicates data from the source to the target data warehouse, making it available in Amazon Redshift seamlessly, and monitors the pipeline's health."
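In code terms, specifying a source and a Redshift target looks roughly like the following boto3 sketch. The ARNs, names, and region are hypothetical placeholders; RDS exposes a `create_integration` call for Aurora zero-ETL integrations, but check the AWS documentation for your source type before relying on this shape:

```python
def integration_request(source_arn: str, target_arn: str, name: str) -> dict:
    """Build the parameters for a zero-ETL integration: a source database
    ARN, a target Redshift namespace ARN, and a name for the integration."""
    return {
        "SourceArn": source_arn,        # e.g. an Aurora PostgreSQL cluster ARN
        "TargetArn": target_arn,        # the Redshift namespace ARN
        "IntegrationName": name,
    }

def create_integration(params: dict, region: str = "us-east-1") -> dict:
    """Submit the request; requires AWS credentials and permissions to run."""
    import boto3  # third-party AWS SDK
    rds = boto3.client("rds", region_name=region)
    return rds.create_integration(**params)
```

Once created, AWS provisions and monitors the replication pipeline itself; there is no transform step to write.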
Her blog tells readers how to "create zero-ETL integrations to replicate data from different source databases (Aurora PostgreSQL and DynamoDB) to the same Amazon Redshift cluster. You will also learn how to select multiple tables or databases from Aurora PostgreSQL source databases to replicate data to the same Amazon Redshift cluster. You will observe how zero-ETL integrations provide flexibility without the operational burden of building and managing multiple ETL pipelines."
Here, Amazon is integrating its own databases with its own data warehouse. When the target data warehouse is owned and operated by one supplier and the source databases by others, partnerships with third parties are needed to remove the burden of developing and operating ETL pipelines.
Bryteflow tells us: "The zero-ETL process assumes native integration between sources and data warehouses (native integration means there is an API to directly connect the two) or data virtualization mechanisms (they provide a unified view of data from multiple sources without needing to move or copy the data). Since the process is much more simplified and the data does not need transformation, real-time querying is easily possible, reducing latency and operational costs."
Snowflake says it has zero-ETL data sharing capabilities across clouds and regions. It is partnering with Astro so that customers can orchestrate ETL operations in Snowflake with Astro using the Snowflake provider for Airflow.
CData says its Sync offering "provides a straightforward way to continuously pipeline your Snowflake data to any Database, Data Lake, or Data Warehouse, making it easily available to Analytics, Reporting, AI, and Machine Learning."