German startup UltiHash has made its newly developed object storage generally available as a data store for AI-focused lakehouses.
This new object storage enters a market widely considered mature by analysts, populated by suppliers such as AWS (S3), Azure, DataCore (Caringo Swarm), Dell (ECS), Cloudian, GCP, Hitachi Vantara, IBM (COS), NetApp (StorageGRID), and Scality.
UltiHash has invented a byte-level deduplication algorithm that provides fine-grained sub-object deduplication across an entire storage pool. The company says the algorithm reduces data volume by up to 60 percent. The software is deployed as a Kubernetes-native, S3-compatible object storage cluster that can run on-premises or in AWS. The clusters have a head node and data nodes, and built-in erasure coding. They can scale horizontally with variable-sized data nodes capable of supporting petabyte-scale volumes. UltiHash's storage also reads data fast, achieving a "250 percent performance improvement in read speed compared to AWS S3," the startup claims.
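UltiHash does not describe the internals of its algorithm here, so the following is only a rough illustration of the general idea of sub-object deduplication across a pool, not the company's method. It swaps in plain fixed-size chunking with SHA-256 fingerprints, and the class name `DedupPool`, its methods, and the tiny chunk size are all hypothetical choices made for the sketch.

```python
import hashlib


class DedupPool:
    """Toy pool-wide chunk store (hypothetical, not UltiHash's design).

    Objects are split into fixed-size chunks. Each unique chunk is kept
    once for the whole pool, and every object is stored as an ordered
    list of chunk fingerprints (its "recipe").
    """

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> chunk bytes, stored once
        self.recipes = {}  # object key -> ordered list of fingerprints

    def put(self, key, data):
        """Write path: fingerprint every chunk and store only new ones."""
        recipe = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.recipes[key] = recipe

    def get(self, key):
        """Read path: look up chunks and concatenate, no decoding step."""
        return b"".join(self.chunks[d] for d in self.recipes[key])

    def logical_bytes(self):
        """Total size of all objects as the client sees them."""
        return sum(len(self.chunks[d])
                   for recipe in self.recipes.values() for d in recipe)

    def stored_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(chunk) for chunk in self.chunks.values())

    def reduction(self):
        """Fraction of logical data volume saved by deduplication."""
        return 1 - self.stored_bytes() / self.logical_bytes()


# Two objects that share their first eight bytes.
pool = DedupPool(chunk_size=4)
pool.put("a", b"AAAABBBBCCCC")
pool.put("b", b"AAAABBBBDDDD")
print(pool.stored_bytes(), pool.logical_bytes())  # 16 24
```

In this toy case the shared chunks are stored once, so 24 logical bytes occupy 16 stored bytes, a reduction of one third. All of the hashing work sits in `put`, while `get` is a lookup and a concatenation, which is the general shape of a design that pays for deduplication on writes rather than reads.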
Tom Lüdersdorf, co-founder and CEO of UltiHash, stated: "The AI revolution is generating data at an unprecedented rate, and traditional storage solutions are struggling to keep pace. The future of storage will make it possible to avoid ballooning data costs without compromising on speed."
Data storage, UltiHash says, "serves as the critical link between AI models and their data -- like a gas tank connecting a car's engine to its fuel. Currently, these 'gas tanks' are inefficient, leading to high costs and unnecessary environmental impact, just like inefficient fuel systems do in vehicles."
UltiHash supports processing engines (Flink, PySpark), ETL tools (Airflow), open table formats (Delta Lake, Hudi, Iceberg) and querying engines (Presto, Hive).
The UltiHash object storage layer fits below data warehouses and lakehouses built using open table formats like Delta Lake (Databricks), Hudi (Uber) and Iceberg (Netflix). It is, UltiHash says, the storage backbone for lakehouses. The company thinks that cloud-based lakehouse "users are suffering from a lack of price predictability due to expensive storage solutions and sky high I/O and data egress fees." Enter UltiHash's alternative.
The company supports a hybrid approach, saying that enterprises often keep a lakehouse in the cloud to enable use cases across the organisation while keeping older data or training sets on-premises to bring cost-efficiency to the MLOps lifecycle. UltiHash has adopted an Infrastructure as Code (IaC) approach, enabling users to set up UltiHash anywhere within their infrastructure.
UltiHash claims its platform, which can store various types and formats of data (text, PDFs, images, audio), provides "a unified storage layer that combines the scalability of data lakes with the querying power of data warehouses." It "offers fast, efficient data management, enabling generative AI and other data-intensive applications, such as advanced analytics, to scale sustainably."
We asked Tom Lüdersdorf questions about UltiHash's technology and approach.
Blocks & Files: How does UltiHash's dedupe efficiency compare to VAST Data's similarity-based dedupe?
Tom Lüdersdorf: Our deduplication is specifically engineered to optimize speed with minimal overhead only on write. Unfortunately, we cannot provide a benchmark with VAST's technology. Their similarity-based dedupe imposes significant read performance overhead as stated in their documentation, which is why we consider it to target different use cases [than] we do.
In our benchmarking, UltiHash's deduplication demonstrated a 15 percent overhead on writes compared to the fastest object storage solution we could find in the market (without compression enabled in this solution), with no performance degradation on reads.
Unlike other providers who often caution about throughput impacts with deduplication and compression, UltiHash's approach ensures optimal performance without these penalties. Building our microservice architecture from the ground up, optimized for read operations, allows us to limit the computing tradeoff to write operations, which occur less frequently than reads.
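To put the quoted figures in perspective, here is a back-of-the-envelope sketch of how a write-only penalty dilutes across a mixed workload. It takes the 15 percent write overhead and zero read overhead from the answer above and assumes, purely for illustration, that a read and a write cost the same at baseline. The function name `blended_overhead` and the 10 percent write share are hypothetical.

```python
def blended_overhead(write_fraction, write_overhead=0.15, read_overhead=0.0):
    """Average overhead across a workload, weighted by operation mix.

    Assumes reads and writes have equal baseline cost, which is a
    simplification made for this sketch and not a claim from UltiHash.
    """
    if not 0.0 <= write_fraction <= 1.0:
        raise ValueError("write_fraction must be between 0 and 1")
    read_fraction = 1.0 - write_fraction
    return write_fraction * write_overhead + read_fraction * read_overhead


# Hypothetical read-heavy workload: 10 percent writes, 90 percent reads.
print(f"{blended_overhead(0.10):.1%}")  # 1.5%
```

Under those assumptions, a workload with one write for every nine reads would see an average overhead of about 1.5 percent, while a write-only workload would see the full 15 percent.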
Blocks & Files: Does UltiHash improve data access speed? If so, how?
Tom Lüdersdorf: Yes, UltiHash enhances data access speed by eliminating the need for data processing during rehydration, which is commonly required by storage solutions that use, for example, compression algorithms that decode on read. This approach enables UltiHash to deliver data access speeds comparable to high-performance storage, without the penalties associated with other space-saving techniques.
Blocks & Files: Where is the UltiHash data stored when claiming 250 percent faster read speeds than S3?
Tom Lüdersdorf: In our benchmark comparing to S3, UltiHash is deployed on AWS in a Virtual Private Cloud through Kubernetes (AWS EKS) leveraging EBS volumes with SSD/NVMe storage for best throughput. For customers in need of high-performance data lakes traditionally built with EBS, UltiHash provides a scalable object storage solution that reaches petabyte levels while reducing EBS costs thanks to its built-in deduplication. This approach ensures high performance and cost-efficiency without the typical scaling challenges of EBS.
Blocks & Files: How does UltiHash provide the querying power of data warehouses to data lakes? What makes UltiHash special in terms of integration with open source table formats like Delta Lake, Apache Iceberg, or Apache Hudi?
Tom Lüdersdorf: UltiHash's querying power is further enhanced by efficient metadata handling. We avoid deduplication of object metadata within our storage architecture to maintain rapid access. Data and metadata tables from open table formats are stored within UltiHash and get deduplicated. As our deduplication does not impose a performance overhead on read, queries on metadata are not slowed down in UltiHash. UltiHash helps customers eliminate the need for expensive proprietary data catalogs often associated with legacy storage providers, enabling them to leverage the latest innovations in data management without vendor lock-in, and making the disaggregation of storage and compute easier.
For AI-driven workloads that demand quick access, UltiHash's resource-efficient approach ensures fast data retrieval and optimal performance especially on read. By offering a Kubernetes-native object storage solution that ensures resource efficiency without compromising read throughput, we provide an ideal storage choice for modern data lakehouse architectures across private, colocation, and public clouds.
***
UltiHash says it "plans to engage with clients in AI, telecom, manufacturing, automotive, research institutions, and more to further refine its offering and drive change in the data storage space."