Posts

Showing posts from August, 2020

Data Lake Definition: Velocity, Agility, and Openness By Design

Defining data lakes in terms of velocity, agility, and openness delivers successful business outcomes. Data lake definitions can take many shapes, largely because different vendors promote definitions that align with their product offerings. With so many competing definitions, confusion is common when people ask “what is a data lake?” Building a baseline definition helps establish a common vocabulary for conversations that can be overly technical, abstract, and vendor-driven. Define “Data Lake”: Rather than rely on an AWS, Google, or Azure data lake definition, here are a few essentials to set some baselines. Pentaho co-founder and CTO James Dixon framed it this way: This situation is similar to the way that old school business intelligence and analytic applications were built. End users listed out the questions they want to ask of the data, the attributes necessary to answer those questions were skimmed from the data stream, and bulk loaded into a d

Amazon Redshift Federated Queries: Rise Of Query Engines

Here come the SQL query engines! AWS added query services to Redshift with Spectrum, which enabled users to query an S3 data lake. However, with the latest federated query updates, AWS is bringing Amazon Redshift in line with competitive query service offerings from not only Google and Microsoft, but other AWS services too. What are federated queries? Facebook's PrestoDB popularized the concept of distributed SQL query engines when the project was open-sourced back in 2013. Over the past couple of years, AWS, Google, Microsoft, and many others in the industry have accelerated the adoption of a distributed query engine model within their products. For example, AWS developed Amazon Athena on top of the Presto code base. Here is how PrestoDB describes what it allows users to do: Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analyti