Hive is a data warehousing package built on top of Hadoop, used for data analysis on structured data. It is targeted at users comfortable with SQL: its query language, HiveQL, is similar to SQL, so no Java is required. Hive abstracts away the complexity of Hadoop. It was originally developed by Facebook.
Features of Hive: How is it Different from SQL? The major difference is that a Hive query executes on a Hadoop infrastructure rather than on a traditional database. This allows Hive to handle huge data sets - data sets so large that high-end, expensive, traditional databases would fail. Internally, a Hive query is executed as a series of automatically generated MapReduce jobs.
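You can inspect this translation with Hive's EXPLAIN command, which prints the plan of stages a query compiles to, typically one or more MapReduce stages (the table name below is hypothetical):

```sql
-- Show the execution plan for a query; Hive prints the stage graph
-- of the generated MapReduce jobs (table name is hypothetical).
EXPLAIN SELECT dept, COUNT(*) FROM employees GROUP BY dept;
```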
Hive Modes To start the Hive shell, type hive and press Enter. Hive in local mode: no HDFS is required; all files run on the local file system. hive> SET mapred.job.tracker=local; Hive in MapReduce (Hadoop) mode: hive> SET mapred.job.tracker=master:9001;
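A minimal local-mode session might look like the following sketch (the table name is hypothetical):

```sql
-- Run Hive against the local file system, without a JobTracker or HDFS.
SET mapred.job.tracker=local;

-- Queries now execute as local jobs (table name is hypothetical).
SELECT * FROM dummy LIMIT 10;
```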
Hive Architecture
Components Thrift Client: it is possible to interact with Hive from any programming language that can use the Thrift service, e.g. Python or Ruby. JDBC Driver: Hive provides a pure Java JDBC driver for Java applications to connect to Hive, defined in the class org.apache.hadoop.hive.jdbc.HiveDriver. ODBC Driver: an ODBC driver allows applications that support the ODBC protocol to connect to Hive.
Hive Program Structure The Hive Shell The shell is the primary way that we will interact with Hive, by issuing commands in HiveQL. HiveQL is heavily influenced by MySQL, so if you are familiar with MySQL, you should feel at home using Hive. A command must be terminated with a semicolon to tell Hive to execute it. HiveQL is generally case insensitive. The Tab key will autocomplete Hive keywords and functions. Hive can also run in non-interactive mode. Use the -f option to run the commands in a specified file: hive -f script.hql For short scripts, you can use the -e option to specify the commands inline, in which case the final semicolon is not required: hive -e 'SELECT * FROM dummy'
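For example, a script file for non-interactive use, run with hive -f script.hql, might contain (the table name is hypothetical):

```sql
-- script.hql: a short HiveQL script for non-interactive execution.
-- Each command ends with a semicolon (table name is hypothetical).
SHOW TABLES;
SELECT * FROM dummy;
```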
Hive Tables A Hive table is logically made up of the data stored in HDFS and the associated metadata describing the layout of that data, which is kept in the metastore (typically a relational database such as MySQL). Managed Table When you create a table in Hive and load data into it, the data is moved into Hive’s warehouse directory. CREATE TABLE managed_table (dummy STRING); LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE managed_table; External Table Alternatively, you may create an external table, which tells Hive to refer to data at an existing location outside the warehouse directory. The location of the external data is specified at table creation time: CREATE EXTERNAL TABLE external_table (dummy STRING) LOCATION '/user/tom/external_table'; LOAD DATA INPATH '/user/tom/data.txt' INTO TABLE external_table; When you drop an external table, Hive leaves the data untouched and deletes only the metadata. Hive does not do any transformation while loading data into tables; load operations are currently pure copy/move operations that move data files into locations corresponding to Hive tables.
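The difference between the two table types shows up when they are dropped; a sketch, assuming the tables above exist:

```sql
-- Dropping a managed table deletes both the metastore entry and the
-- data files under Hive's warehouse directory.
DROP TABLE managed_table;

-- Dropping an external table deletes only the metastore entry; the
-- files at /user/tom/external_table are left untouched.
DROP TABLE external_table;
```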