BigQuery Omni: Distributed Query Engine Comes To Google Cloud

BigQuery’s distributed query engine extends support for multi-cloud data lakes

Google Cloud released BigQuery Omni, a service that provides a federated query engine that can query the contents of data lakes on AWS and Azure.

Unlike query engine offerings from AWS or Azure, the BigQuery Omni service fully embraces multi-cloud with a serverless SQL query engine that can execute queries across different cloud data lakes. If you are asking yourself “What is a data lake?” we cover the architecture, value, and best practices on our site.

The first release focuses on Amazon Web Services, with Azure support to follow in a subsequent release.

BigQuery Omni, Anthos & Dremel

Rather than run the full stack of BigQuery resources in Google Cloud, Google “localizes” its Dremel engine compute resources to the AWS or Azure cloud where the data lives. Doing so solves a few technical hurdles and also avoids some significant cost implications.

For example, suppose Google required data resident in AWS S3 to be moved temporarily to Google Cloud Storage so that BigQuery could execute queries within GCP. Your data would then be leaving AWS, which charges for outbound network traffic. The first 10 TB per month costs $0.15 per gigabyte.

Let’s assume you have 1 TB of CSV files on AWS S3. Under that model, a Google BigQuery Omni query would need to copy 1 TB of data from AWS to Google. The transfer costs for this query would be close to $150. Ouch!
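The arithmetic behind that figure is easy to sketch. The snippet below is a rough estimate only: the rate is the per-gigabyte figure assumed in this post rather than a quoted AWS price, it uses decimal terabytes (1 TB = 1,000 GB), and the function names are our own.

```python
# Back-of-the-envelope egress estimate using the figures from this post.
# RATE_PER_GB is the rate assumed in the text, not a quoted AWS price.
RATE_PER_GB = 0.15   # USD per gigabyte (assumption from the text)
GB_PER_TB = 1000     # decimal terabytes, matching the ~$150 figure


def egress_cost_usd(terabytes: float, rate_per_gb: float = RATE_PER_GB) -> float:
    """Estimate the cost of moving `terabytes` of data out of a cloud."""
    return terabytes * GB_PER_TB * rate_per_gb


def repeated_egress_cost_usd(terabytes_per_query: float, queries: int) -> float:
    """Estimate the cost if every query re-copies the same data set."""
    return egress_cost_usd(terabytes_per_query) * queries


if __name__ == "__main__":
    print(f"one query over 1 TB:    ${egress_cost_usd(1):,.2f}")
    print(f"five queries over 1 TB: ${repeated_egress_cost_usd(1, 5):,.2f}")
```

The point is less the exact dollar amount than the shape of the cost: if data has to cross clouds on every query, the bill grows with both data size and query count.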

By running its engine in containers inside a Google-managed AWS or Azure environment, BigQuery Omni avoids these data transfer costs.

BigQuery Omni vs

The move to distributed query engines is gaining steam. Amazon Athena, Amazon Redshift, PrestoDB, and others support this model. Google BigQuery supported this pattern as well before Omni, except that it could only query data resident within GCP.

However, what is novel about Google's approach is not the distribution of the query but the distribution of the compute resources that execute those queries. Assuming the Omni service works as advertised, this will extend the reach of BigQuery for current customers that operate in AWS and Azure.

One caveat: even though Google supports a federated compute model, it does not change the essential need to have your data lake contents optimized for Omni. If you are running Omni against unoptimized data lakes, the performance and cost implications are significant.

BigQuery Omni Opportunities & Challenges

Based on Google Anthos and Dremel, BigQuery Omni supports Avro, CSV, JSON, ORC, and Parquet. Google says there is no need to format or transform your data on AWS or Azure, but this is marketing, not technical advice.
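To make the "no transformation needed" claim concrete, here is a minimal sketch of what pointing Omni at existing S3 objects might look like. It assumes the statement follows the shape of BigQuery's external table DDL (`CREATE EXTERNAL TABLE ... WITH CONNECTION ... OPTIONS`); the helper name and the table, connection, and bucket names are hypothetical placeholders, and the exact syntax should be checked against Google's current documentation.

```python
# Formats this post lists as supported by BigQuery Omni.
SUPPORTED_FORMATS = {"AVRO", "CSV", "JSON", "ORC", "PARQUET"}


def external_table_ddl(table: str, connection: str, fmt: str, uri: str) -> str:
    """Render a CREATE EXTERNAL TABLE statement for an S3-backed table.

    Every argument is a caller-supplied placeholder; nothing here talks
    to BigQuery. The statement shape is an assumption (see lead-in).
    """
    fmt = fmt.upper()
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt}")
    if not uri.startswith("s3://"):
        raise ValueError("this sketch only handles s3:// URIs")
    return (
        f"CREATE EXTERNAL TABLE `{table}`\n"
        f"WITH CONNECTION `{connection}`\n"
        f"OPTIONS (format = '{fmt}', uris = ['{uri}']);"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(external_table_ddl(
        table="my_dataset.orders",
        connection="aws-us-east-1.my-s3-connection",
        fmt="csv",
        uri="s3://my-bucket/orders/*",
    ))
```

A statement like this will happily register a bucket full of raw CSV files. Nothing in it checks whether the data is partitioned, compressed, or stored in a columnar format, which is where the performance and cost problems start.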

It is not uncommon for Google, AWS, and Azure to promote the conceptual ease of use model while downplaying the realities associated with it.

If you are attempting to use Omni to query data objects in an unoptimized AWS or Azure environment, performance and cost will become a significant concern.

Vendor posts, guides, documentation, and tweets describe setting up a data lake for query engines as creating an AWS S3 bucket, adding some paths, and then dropping files in. Voilà, you have a data lake ready for the query service to get to work! Not really.

Most Omni users will learn the same hard lessons Athena, Spectrum, and Presto users learned: distributed query engines are only as good as the data lakes they query.

BigQuery Omni Best Practices

Optimizing and automating the configuration, processing, and loading of data to your private Azure or Amazon data lake is critical for BigQuery Omni to operate efficiently.

Here are key considerations for BigQuery Omni optimization when using your AWS or Azure data lake:

Automatic partitioning of data: With data partitioning, you restrict the amount of data scanned by each Omni query, which improves performance and reduces the cost of querying data stored in Azure or AWS S3
Automatic conversion to Apache Parquet: Convert data into an efficient, optimized, open-source columnar format, Apache Parquet. This lowers costs when you execute queries, as Parquet's columnar format is highly optimized for interactive query services like BigQuery Omni
Automatic data compression: With data in Apache Parquet, compression is performed column by column using Google's Snappy codec. This not only supports query optimizations, it also reduces the size of the data stored in your Azure Data Lake Storage or Amazon S3 bucket, which reduces costs
Automated data catalogs, database, table, and view creation: As upstream data changes, a data catalog can ensure that changes in your data lake automatically version tables and views within BigQuery Omni. Data is analyzed and the system is “trained” to infer schemas, automating the creation of databases, views, and tables in BigQuery Omni
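
As a small illustration of the partitioning point above, the sketch below lays out objects under Hive-style `dt=YYYY-MM-DD` prefixes and then selects only the partitions a date-bounded query would need. The layout, table name, file name, and helper functions are illustrative assumptions; this is not how Omni itself plans a query, only a picture of why a partitioned layout lets an engine skip most of the lake.

```python
from datetime import date, timedelta
from typing import List


def partition_key(table: str, day: date, filename: str) -> str:
    """Build a Hive-style object key: table/dt=YYYY-MM-DD/filename."""
    return f"{table}/dt={day.isoformat()}/{filename}"


def prune(keys: List[str], start: date, end: date) -> List[str]:
    """Keep only keys whose dt= partition falls within [start, end]."""
    kept = []
    for key in keys:
        for part in key.split("/"):
            if part.startswith("dt="):
                day = date.fromisoformat(part[len("dt="):])
                if start <= day <= end:
                    kept.append(key)
                break  # only the first dt= segment matters
    return kept


if __name__ == "__main__":
    # One hypothetical object per day for a 31-day month.
    first = date(2020, 7, 1)
    keys = [
        partition_key("orders", first + timedelta(days=i), "part-0000.snappy.parquet")
        for i in range(31)
    ]
    week = prune(keys, date(2020, 7, 8), date(2020, 7, 14))
    print(f"{len(week)} of {len(keys)} objects need to be read")
```

Without the `dt=` prefixes, a query for one week of data has no way to rule out the other objects short of opening them, so it ends up scanning (and paying for) the whole month.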

Getting Started With BigQuery Omni & Data Lakes

Does a data lake have to be complex to set up for BigQuery Omni? No! With a data lake formation process, you can get up and running more quickly.

It has never been easier to take advantage of an “analytics-ready” data lake with a serverless query service like Google BigQuery Omni.

The Openbridge data lake service automates the configuration, processing, and loading of data to Google BigQuery, so users can return query results quickly and cost-effectively.

With our zero-administration data lake service, you push data from supported data sources, and our service automatically loads it into BigQuery Omni.

Want to get started with Omni and data lakes? Sign up for a 14-day no cost trial!

References

BigQuery Omni: Distributed Query Engine Comes To Google Cloud was originally published in Openbridge on Medium.

source https://blog.openbridge.com/bigquery-omni-distributed-query-engine-comes-to-google-cloud-f23b34c87362?source=rss----4c5221789b3---4
