Although Apache™ Pig™ can be quite a powerful and simple language to use, the downside is that it's something new to learn and master. Apache Hive™ was developed as a runtime support structure on Apache Hadoop™ that allows fluent SQL users (which is commonplace among relational database developers) to leverage the Hadoop platform. Hive allows SQL developers to write Hive Query Language (HiveQL) statements that are similar to standard SQL ones. HiveQL is limited in the commands it understands, but it is still useful.
As with any database management system (DBMS), you can run your Hive queries from a command-line interface (known as the Hive shell), or from a Java Database Connectivity (JDBC) or Open Database Connectivity (ODBC) application using the Hive JDBC/ODBC drivers. You can also use the Hive Thrift Client within applications written in C++, Java, PHP, Python, or Ruby (much as you can use these client-side languages with embedded SQL to access a database such as Db2 or Informix).
Hive code looks very much like traditional database code with SQL access. However, because Hive is based on Apache Hadoop and MapReduce operations, there are several key differences.
The first is that Hadoop is intended for long sequential scans, and because Hive is based on Hadoop, you can expect queries to have very high latency (many minutes). This means that Hive would not be appropriate for applications that need very fast response times, as you would expect with a database such as Db2. The second is that Hive is read-based and therefore not appropriate for transaction processing that typically involves a high percentage of write operations.
If you're interested in SQL on Hadoop, in addition to Hive, IBM offers Db2 Big SQL, which makes accessing datasets faster and more secure. Check out our videos, below, for a quick overview of Hive and Db2 Big SQL.
Posted by: Margaret Rouse
Contributor(s): Delaney Rebernik
Apache Hive is an open-source data warehouse system for querying and analyzing large datasets stored in Hadoop files. Hadoop is a framework for handling large datasets in a distributed computing environment.
Hive has three main functions: data summarization, query and analysis. It supports queries expressed in a language called HiveQL, which automatically translates SQL-like queries into MapReduce jobs executed on Hadoop. In addition, HiveQL allows custom MapReduce scripts to be plugged into queries. Hive also supports data serialization/deserialization and increases flexibility in schema design by including a system catalog called the Hive metastore.
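As a brief illustration of the SQL-like syntax described above, a HiveQL session might look like the following sketch; the table and column names here are hypothetical, not from any particular deployment:

```sql
-- Define a hypothetical table over tab-delimited web-log files in Hadoop
CREATE TABLE web_logs (
  ip           STRING,
  request_time STRING,
  url          STRING,
  status       INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

-- A SQL-like query; Hive compiles this into one or more MapReduce jobs
SELECT url, COUNT(*) AS hits
FROM web_logs
WHERE status = 200
GROUP BY url
ORDER BY hits DESC
LIMIT 10;
```

Note that even a small query like this runs as a batch MapReduce job, which is why response times are measured in minutes rather than milliseconds.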
According to the Apache Hive wiki, “Hive is not designed for OLTP workloads and does not offer real-time queries or row-level updates. It is best used for batch jobs over large sets of append-only data (like web logs).”
Hive supports text files (also called flat files), SequenceFiles (flat files consisting of binary key/value pairs) and RCFiles (Record Columnar Files, which store the columns of a table in a column-oriented way).
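The file format is chosen per table with Hive's `STORED AS` clause. As a minimal sketch (the table names are hypothetical):

```sql
-- Plain text flat files
CREATE TABLE logs_text (line STRING) STORED AS TEXTFILE;

-- Flat files of binary key/value pairs
CREATE TABLE logs_seq (line STRING) STORED AS SEQUENCEFILE;

-- Record Columnar Files: rows are grouped, then stored column by column
CREATE TABLE logs_rc (line STRING) STORED AS RCFILE;
```

The columnar layout of RCFiles can reduce I/O for queries that read only a few columns of a wide table.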