What is Hive?

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. In other words, it is a query engine for big data analytics. It is used for querying and managing large datasets that reside in distributed storage.

Hadoop is an open-source framework for storing and processing Big Data in a distributed environment. Two of its core modules are:

  • MapReduce: It is a parallel programming model for processing large amounts of structured, semi-structured, and unstructured data on large clusters of commodity hardware.
  • HDFS: The Hadoop Distributed File System is the part of the Hadoop framework used to store the datasets. It provides a fault-tolerant file system that runs on commodity hardware.
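
To make the MapReduce model more concrete, here is a minimal single-process sketch in Python of its three steps (map, shuffle, reduce) applied to a word count. It is only an illustration of the programming model, not Hadoop's API: the function names map_phase, shuffle, reduce_phase, and word_count are hypothetical, and a real cluster would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key between the two phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reducer: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence (illustrative, not distributed)."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input lines, standing in for file blocks stored in HDFS.
counts = word_count(["hive runs on hadoop", "hadoop stores data in hdfs"])
```

Because each mapper only sees its own line and each reducer only sees one key, the work can be split across machines without the pieces needing to coordinate, which is what lets the model scale on commodity hardware.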

Building on top of Hadoop, Hive provides a mechanism to access data stored in HDFS and to query that data using a SQL-like language called HiveQL (HQL). Hive translates HiveQL queries into MapReduce operations.
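
As a rough illustration of that translation, the Python sketch below pairs a HiveQL-style aggregation with the map, sort, and reduce steps it conceptually corresponds to. The table name employees, the column dept, the sample rows, and the helper group_by_count are all assumptions made up for this example; Hive's real query planner is considerably more involved.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical HiveQL query; shown for comparison only, it is not executed here.
HIVEQL = "SELECT dept, COUNT(*) FROM employees GROUP BY dept"

# Hypothetical rows standing in for a table stored in HDFS.
employees = [
    {"name": "ana", "dept": "eng"},
    {"name": "bo", "dept": "ops"},
    {"name": "cy", "dept": "eng"},
]


def group_by_count(rows, column):
    """Emulate GROUP BY column with COUNT(*) as map, sort, and reduce steps."""
    # Map: emit (group key, 1) for every row.
    mapped = [(row[column], 1) for row in rows]
    # Shuffle/sort: bring identical keys next to each other.
    mapped.sort(key=itemgetter(0))
    # Reduce: sum the ones for each key.
    return {
        key: sum(count for _, count in pairs)
        for key, pairs in groupby(mapped, key=itemgetter(0))
    }


result = group_by_count(employees, "dept")
```

The point of the comparison is that the user writes only the one-line query, while the engine is responsible for producing and scheduling the equivalent map and reduce work.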

Hive’s architecture is optimized for batch processing of large ETL jobs and batch SQL queries on very large data sets.

Hive vs. Snowflake

Hive is a query engine for data warehouses, while Snowflake is a data warehouse with its own natively designed query engine. The two therefore compete for the same analytics workloads.

Snowflake is a newer solution that has been gaining momentum in recent years. It is a cloud data warehouse (available on AWS, Azure, and Google Cloud), mainly used for analytics. Compared to Hive and to other data warehouses such as Amazon Redshift, it is generally considered to have the following advantages:

  • It is faster. It has its own SQL query engine, designed natively for the cloud, with performance optimizations at several layers.
  • It is cheaper. It separates compute from storage, so each can be scaled elastically and independently.
  • It supports both structured and semi-structured data (e.g., JSON, Parquet).
  • It is easy to maintain (a big bonus for data engineers!).

Hive and Qubole

Difference between data lakes and data warehouses:

A data lake is a centralized place that allows you to store all your raw structured and unstructured data at any scale and to run different types of analytics on it, ranging from dashboards and visualizations to big data processing, real-time analytics, and machine learning, to guide better decisions. A data lake has two components: storage and compute. Common storage options include Amazon S3 (cloud), Azure Blob Storage (cloud), and Hadoop HDFS (on premises). For compute, you can run various types of analytics or ML on query engines such as Presto, BigQuery, Spark, or Hive.

A data warehouse is a repository for structured, filtered data that has already been processed for a specific purpose. Most business users access data stored within data warehouses through connected Business Intelligence (BI) tools like Tableau and Looker.

Most organizations that have a data lake also have a data warehouse. Data warehouse vendors are expanding into the data lake space, and vice versa. For example, BigQuery (a data warehouse) now lets organizations query data on Amazon S3 (storage for a data lake). Similarly, data lake platforms like Databricks and Qubole are moving towards data warehousing use cases. For example, the Databricks Lakehouse Platform combines the data management and performance typically found in data warehouses with the low-cost, flexible object stores offered by data lakes.

Qubole is a simple, open, and secure cloud-based data lake platform for machine learning, streaming, and ad-hoc analytics. Qubole supports engines such as Hive, Spark, Pig, Presto, and Hadoop, and cloud storage such as Amazon S3 and Azure Blob Storage.
