Hive stores the schema in a database and processed data in HDFS. It provides an SQL-like language for querying called HiveQL or HQL. It is familiar, fast, scalable, and extensible.

Slide Content

HIVE

Hive is a data warehouse infrastructure tool to process structured data in Hadoop (it is used for structured and semi-structured data analysis and processing). It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy. Hive was initially developed by Facebook; later the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies; for example, Amazon uses it in Amazon Elastic MapReduce. Hive is not a relational database.

Features of Hive: It stores the schema in a database and processed data in HDFS. It provides an SQL-like language for querying called HiveQL or HQL. It is familiar, fast, scalable, and extensible.

Hadoop Hive Architecture and its Components

The following diagram describes the architecture of Hive and its components. It also describes the flow in which a query is submitted to Hive and finally processed using the MapReduce framework:

The diagram shows the major components of Apache Hive:
Hive Clients – Apache Hive supports applications written in languages like C++, Java, Python etc. using the JDBC, Thrift, and ODBC drivers. Thus, one can easily write a Hive client application in the language of their choice.
Hive Services – Hive provides various services, like the web interface and the CLI, to perform queries.
Processing framework and Resource Management – Hive internally uses the Hadoop MapReduce framework to execute queries.
Distributed Storage – As Hive is built on top of Hadoop, it uses the underlying HDFS for distributed storage.

Hive Clients
Hive supports different types of client applications for performing queries. These clients are categorized into three types:
Thrift Clients – As the Apache Hive server is based on Thrift, it can serve requests from all languages that support Thrift. (Thrift is basically a protocol that defines how connections are made between clients and servers.) Apache Hive uses Thrift to allow remote users to connect to HiveServer2 (the Thrift server) and submit queries.
JDBC Clients – Apache Hive allows Java applications to connect to it using the JDBC driver. It is defined in the class org.apache.hadoop.hive.jdbc.HiveDriver.
ODBC Clients – The ODBC driver allows applications that support the ODBC protocol to connect to Hive. Like the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.

Hive Services
a) CLI (Command Line Interface) – This is the default shell that Hive provides, in which you can execute Hive queries and commands directly.

b) Web Interface – Hive also provides a web-based GUI for executing Hive queries and commands.
c) Hive Server – It is built on Apache Thrift and is therefore also called the Thrift server. It allows different clients to submit requests to Hive and retrieve the final result.
d) Hive Driver – The driver is responsible for receiving the queries submitted through the Thrift, JDBC, ODBC, CLI, or Web UI interfaces by a Hive client.
Compiler – The Hive driver then passes the query to the compiler, where parsing, type checking, and semantic analysis take place with the help of the schema present in the metastore.
Optimizer – It generates the optimized logical plan in the form of a DAG (Directed Acyclic Graph) of MapReduce and HDFS tasks.
Executor – Once compilation and optimization are complete, the execution engine executes these tasks in the order of their dependencies using Hadoop.

Metastore – The metastore is the central repository of Apache Hive metadata in the Hive architecture. It stores metadata for Hive tables (such as their schema and location) and partitions in a relational database, and it provides clients access to this information through the metastore service API. The Hive metastore consists of two fundamental units: a service that provides metastore access to the other Apache Hive services, and disk storage for the Hive metadata, which is separate from HDFS storage.
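For illustration, one way to see the kind of information the metastore holds for a table is the DESCRIBE FORMATTED statement (a minimal sketch; employee is the example table used later in these slides):
hive> DESCRIBE FORMATTED employee;
The output lists the columns and their types, the owner, the HDFS location, the input/output formats, and the SerDe, all of which are read from the metastore.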

How to process data with Apache Hive?
The user interface (UI) calls the execute interface of the Driver. The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan. The compiler needs the metadata, so it sends a getMetaData request to the metastore and receives the sendMetaData response. The compiler uses this metadata to type-check the expressions in the query. It then generates the plan, which is a DAG of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS; for map/reduce stages the plan contains map operator trees and a reduce operator tree. The execution engine submits these stages to the appropriate components. In each task, the deserializer associated with the table or the intermediate outputs is used to read rows from HDFS files, and these rows are passed through the associated operator tree. Once the output is generated, it is written to a temporary HDFS file through the serializer. The temporary file provides input to the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location. For queries, the execution engine reads the contents of the temporary file directly from HDFS as part of the fetch call from the Driver.
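A quick way to see the plan the compiler produces is the EXPLAIN statement (a minimal sketch; employee and Dept are the table and column used in the later examples):
hive> EXPLAIN SELECT Dept, count(*) FROM employee GROUP BY Dept;
The output shows the stage dependencies and the map and reduce operator trees described above.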

Ability to create and manage tables and partitions (create, drop and alter).
Ability to support various relational, arithmetic and logical operators.
Ability to do various joins between two tables.
Ability to evaluate functions like aggregations on multiple "group by" columns in a table.
Ability to store the results of a query into another table.
Ability to download the contents of a table to a local directory.
Ability to create an external table that points to a specified location within HDFS.
Ability to store the results of a query in an HDFS directory.
Ability to plug in custom scripts using the language of choice for custom map/reduce jobs.
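As a brief sketch of two of these abilities, the following statements store the result of a query in another table and in an HDFS directory (the table name high_paid and the output path are illustrative):
hive> CREATE TABLE high_paid AS SELECT * FROM employee WHERE salary > 30000;
hive> INSERT OVERWRITE DIRECTORY '/user/hive/output/high_paid'
> SELECT * FROM employee WHERE salary > 30000;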

A file format is a way in which information is stored or encoded in a computer file. In Hive it refers to how records are stored inside the file. As we are dealing with structured data, each record has its own structure, and how records are encoded in a file defines the file format. File formats mainly vary in data encoding, compression rate, usage of space, and disk I/O. Hive does not verify whether the data you are loading matches the schema for the table; however, it does verify whether the file format matches the table definition.

Generally, HQL syntax is similar to the SQL syntax that most data analysts are familiar with. Hive supports four file formats: TEXTFILE, SEQUENCEFILE, ORC, and RCFILE (Record Columnar File).

SEQUENCEFILE File Format
Hadoop performs best when we work with a small number of large files rather than a large number of small files. If the size of a file is smaller than the typical block size in Hadoop, we consider it a small file. Many small files increase the amount of metadata, which becomes an overhead for the NameNode. Sequence files were introduced in Hadoop to solve this problem; they act as containers for small files. Sequence files are flat files consisting of binary key-value pairs. When Hive converts queries to MapReduce jobs, it decides on the appropriate key-value pairs to be used for a given record. Sequence files are in a binary format that can be split, and their main use is to combine two or more smaller files into one sequence file. The SequenceFile class provides Writer, Reader, and Sorter classes for writing, reading, and sorting respectively.

There are three different SequenceFile formats:
Uncompressed key/value records.
Record-compressed key/value records – only the values are compressed.
Block-compressed key/value records – both keys and values are collected in blocks separately and compressed; the size of the block is configurable.
The recommended way is to use the SequenceFile.createWriter methods to construct the preferred writer implementation.
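A minimal sketch of declaring a Hive table stored as a sequence file and populating it from an existing text-format table (employee_seq is an illustrative name; employee is the example table defined later in these slides):
hive> CREATE TABLE employee_seq (eid int, name String, salary String, destination String)
> STORED AS SEQUENCEFILE;
hive> INSERT OVERWRITE TABLE employee_seq SELECT * FROM employee;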

The Optimized Row Columnar (ORC) file format stores data in a more optimized way than the other file formats. ORC can reduce the size of the original data by up to 75% (e.g. a 100 GB file becomes 25 GB), and as a result the speed of data processing also increases. ORC shows better performance than the Text, Sequence, and RC file formats. An ORC file contains row data in groups called stripes, along with a file footer. The ORC format improves performance when Hive is processing data. Compared with the RCFile format, for example, the ORC file format has many advantages, such as: a single file as the output of each task, which reduces the NameNode's load; Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); and light-weight indexes stored within the file.

An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer. At the end of the file, a postscript holds compression parameters and the size of the compressed footer. The default stripe size is 250 MB; large stripe sizes enable large, efficient reads from HDFS. The file footer contains a list of the stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates: count, min, max, and sum. This diagram illustrates the ORC file structure:

Stripe Structure
As shown in the diagram, each stripe in an ORC file holds index data, row data, and a stripe footer. The stripe footer contains a directory of stream locations. Row data is used in table scans. Index data includes min and max values for each column and the row positions within each column.
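A minimal sketch of an ORC-backed table; the table name is illustrative and the TBLPROPERTIES compression setting is optional:
hive> CREATE TABLE employee_orc (eid int, name String, salary String, destination String)
> STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
hive> INSERT OVERWRITE TABLE employee_orc SELECT * FROM employee;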

RC File
RCFile stores table data in a flat file consisting of binary key/value pairs. It first partitions rows horizontally into row splits, and then it vertically partitions each row split in a columnar way. RCFile stores the metadata of a row split as the key part of a record, and all the data of the row split as the value part.

a) If your data is delimited by some separator, you can use the TEXTFILE format.
b) If your data is in small files whose size is less than the block size, you can use the SEQUENCEFILE format.
c) If you want to store your data in an optimized way that reduces storage and increases performance, you can use the ORCFILE format.

HIVE Data Types
Numeric Types:
TINYINT (1-byte signed integer, from -128 to 127)
SMALLINT (2-byte signed integer, from -32,768 to 32,767)
INT (4-byte signed integer, from -2,147,483,648 to 2,147,483,647)
BIGINT (8-byte signed integer, from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807)
FLOAT (4-byte single-precision floating point number)
DOUBLE (8-byte double-precision floating point number)
DECIMAL (Hive 0.13.0 introduced user-definable precision and scale)

String Types: STRING, VARCHAR, CHAR
Misc Types: BOOLEAN, BINARY
Complex Types:
arrays: ARRAY<data_type>
maps: MAP<primitive_type, data_type>
structs: STRUCT<col_name : data_type [COMMENT col_comment], ...>
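A short sketch of a table declaration that exercises several of these types (the table and column names are purely illustrative; similar complex-type examples appear later in these slides):
hive> CREATE TABLE staff (id INT, name STRING, salary DECIMAL(10,2), active BOOLEAN,
> skills ARRAY<STRING>, deductions MAP<STRING,INT>,
> addr STRUCT<city:STRING, state:STRING, pin:BIGINT>);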

Create Database Statement
hive> CREATE DATABASE [IF NOT EXISTS] userdb;
hive> SHOW DATABASES;
default
userdb
hive> USE userdb;

Drop Database Statement
hive> DROP DATABASE IF EXISTS userdb;
The following query drops the database using CASCADE, which means dropping the respective tables before dropping the database:
hive> DROP DATABASE IF EXISTS userdb CASCADE;

Internal table:
hive> CREATE TABLE IF NOT EXISTS employee (eid int, name String, salary String, destination String)
> COMMENT 'Employee details'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> STORED AS TEXTFILE;
External table:
hive> CREATE EXTERNAL TABLE IF NOT EXISTS Cars (
> Name STRING, Miles_per_Gallon INT, Cylinders INT, Displacement INT, Horsepower INT,
> Weight_in_lbs INT, Acceleration DECIMAL, Year DATE, Origin CHAR(1))
> COMMENT 'Data about cars from a public database'
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> LOCATION '/user/<username>/visdata';

hive> LOAD DATA LOCAL INPATH '/home/user/sample.txt' OVERWRITE INTO TABLE employee;

Alter Table Statement
hive> ALTER TABLE employee RENAME TO emp;
The following queries rename a column and change a column's data type:
hive> ALTER TABLE employee CHANGE name ename String;
hive> ALTER TABLE employee CHANGE salary salary Double;

The following query adds a column named dept to the employee table:
hive> ALTER TABLE employee ADD COLUMNS (dept STRING COMMENT 'Department name');

Drop Table Statement
hive> DROP TABLE IF EXISTS employee;
hive> SHOW TABLES;
emp
OK
Time taken: 2.1 seconds
hive>

hive> SELECT * FROM employee WHERE salary>30000;

Order by
hive> SELECT Id, Name, Dept FROM employee ORDER BY Dept;

Built-in aggregate functions (return type – function – description):
BIGINT – count(*), count(expr) – returns the total number of retrieved rows.
DOUBLE – sum(col), sum(DISTINCT col) – returns the sum of the elements in the group or the sum of the distinct values of the column in the group.
DOUBLE – avg(col), avg(DISTINCT col) – returns the average of the elements in the group or the average of the distinct values of the column in the group.
DOUBLE – min(col) – returns the minimum value of the column in the group.
DOUBLE – max(col) – returns the maximum value of the column in the group.
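A short illustrative query combining several of these aggregates on the employee table used in the other examples (it assumes salary holds numeric values):
hive> SELECT Dept, count(*), min(salary), max(salary), avg(salary)
> FROM employee GROUP BY Dept;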

GROUP BY
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[HAVING having_condition]
[ORDER BY col_list]
[LIMIT number]

hive> SELECT Dept,count(*) FROM employee GROUP BY DEPT;

arrays: ARRAY<data_type>
Sample data (arrayfile):
1,abc,40000,a$b$c,hyd
2,def,3000,d$f,bang
hive> create table tab7(id int, name string, sal bigint, sub array<string>, city string)
> row format delimited
> fields terminated by ','
> collection items terminated by '$';

hive> select sub[2] from tab7 where id=1;
hive> select sub[0] from tab7;

Map file
Sample data (mapfile):
1,abc,40000,a$b$c,pf#500$epf#200,hyd
2,def,3000,d$f,pf#500,bang
hive> create table tab10(id int, name string, sal bigint, sub array<string>, emp map<string,int>, city string)
> row format delimited
> fields terminated by ','
> collection items terminated by '$'
> map keys terminated by '#';

hive> load data local inpath '/home/training/mapfile' overwrite into table tab10;
hive> select emp["pf"] from tab10;
hive> select emp["pf"], emp["epf"] from tab10;

Struct
Sample data (structfile):
1,abc,40000,pf#500$epf#200,hyd$ap$500001
2,def,3000,pf#500,bang$kar$600038
hive> create table tab11(id int, name string, sal bigint, dud map<string,int>, addr struct<city:string,state:string,pin:bigint>)
> row format delimited
> fields terminated by ','
> collection items terminated by '$'
> map keys terminated by '#';
hive> load data local inpath '/home/training/structfile' into table tab11;
hive> select addr.city from tab11;

Partitions
To increase performance, Hive has the capability to partition data:
– The values of the partitioned column divide a table into segments.
– Entire partitions can be ignored at query time.

Partitions have to be properly created by users: when inserting data, you must specify the partition. At query time, whenever appropriate, Hive will automatically filter out partitions. Updates are not possible in Hive.

Creating Partitioned Table
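A minimal sketch of such a statement, modeled on the posts table whose partitions are listed below (the non-partition columns are assumed for illustration):
hive> CREATE TABLE posts (user STRING, post STRING)
> PARTITIONED BY (country STRING)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';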

Load Data Into Partitioned Table
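A minimal sketch of loading a local file into one partition of that table (the file path is illustrative; the partition value must be given explicitly):
hive> LOAD DATA LOCAL INPATH '/home/user/posts_us.txt'
> INTO TABLE posts PARTITION (country='US');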

Partitioned Table
Partitions are physically stored under separate directories.
hive> show partitions posts;
OK
country=AUSTRALIA
country=US
Time taken: 0.095 seconds
hive> exit;

Adding a Partition
We can add partitions to a table by altering the table. Let us assume we have a table called employee with fields such as Id, Name, Salary, Designation, Dept, and yoj.
hive> ALTER TABLE employee
> ADD PARTITION (country='Paris');

Renaming a Partition
hive> ALTER TABLE employee PARTITION (year='1203')
> RENAME TO PARTITION (year='1213');

Dropping a Partition
hive> ALTER TABLE employee DROP [IF EXISTS]
> PARTITION (year='1203');

Single Table Insertion
Table SALES:
Sid   sname     location
101   arsh      Noida
102   anamika   chandigarh
103   raman     mohali
Step 1:
hive> create table target(id int, name string)
> row format delimited
> fields terminated by '\t';

Step 2:
hive> insert overwrite table target select Sid, sname from sales;
Step 3:
hive> select * from target;
Output:
101   arsh
102   anamika
103   raman

MultiTable Insertion
First table: create table target1 with fields Id, name, salary, location, deptId.
Second table: create table target2 with fields location, deptId, dname.
Third table: create table target3 with fields dname, phone, email, country.

Diagram: the source table emp has the columns Id, Name, salary, location, DeptId, DepName, phone, email, and country; each target table receives its own subset of these columns.

Query:
hive> FROM emp
> INSERT OVERWRITE TABLE target1 SELECT Id, name, salary, location, deptId
> INSERT OVERWRITE TABLE target2 SELECT location, deptId, dname
> INSERT OVERWRITE TABLE target3 SELECT dname, phone, email, country;

Static Partition in Hive
Inserting input data files individually into a partitioned table is static partitioning.
Static partitions are usually preferred when loading big files into Hive tables.
Static partitioning saves time in loading data compared to dynamic partitioning.
You 'statically' add a partition to the table and move the file into that partition of the table.
We can alter the partition in a static partition.
If you want to use static partitioning in Hive, you should set the property hive.mapred.mode = strict (this property is set in hive-site.xml).
Static partitioning operates in strict mode.
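A minimal sketch of a static-partition insert, reusing the posts table sketched earlier; posts_staging is a hypothetical unpartitioned staging table with a country column:
hive> INSERT OVERWRITE TABLE posts PARTITION (country='US')
> SELECT user, post FROM posts_staging WHERE country='US';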

Dynamic Partition in Hive
A single insert into a partitioned table, where the partition values are taken from the data, is known as dynamic partitioning.
Dynamic partitioning usually loads the data from a non-partitioned table.
Dynamic partitioning takes more time in loading data compared to static partitioning.
Dynamic partitioning is suitable when you have a large amount of data stored in a table.
It is also suitable if you want to partition on a column but do not know in advance how many distinct values it has.
We cannot perform an alter on a dynamic partition.
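A minimal sketch of a dynamic-partition insert under the same assumptions; the partition value for each row is taken from the last selected column:
hive> set hive.exec.dynamic.partition=true;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> INSERT OVERWRITE TABLE posts PARTITION (country)
> SELECT user, post, country FROM posts_staging;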

Pros and Cons of Hive Partitioning
Pros: It distributes the execution load horizontally. Queries on partitions holding a low volume of data execute faster, and there is no need to search the entire table for a single record.
Cons: There is the possibility of creating too many small partitions, i.e. too many directories. Partitioning is effective for low-volume data, but some queries, such as a group by on a high volume of data, still take a long time to execute; for example, grouping the population of China will take much longer than grouping the population of Vatican City.

Bucketing
A mechanism to query and examine random samples of data:
– Break data into a set of buckets based on a hash function of a 'bucket column'.
– Capability to execute queries on a subset of random data.

Basically, this concept is based on a hashing function on the bucketed column, along with mod by the total number of buckets.
i. The hash_function depends on the type of the bucketing column.
ii. Records with the same bucketed column value will always be stored in the same bucket.
iii. We use the CLUSTERED BY clause to divide the table into buckets.
iv. In the table directory, each bucket is just a file, and bucket numbering is 1-based.
v. Bucketing can be done along with partitioning on Hive tables, and even without partitioning.
vi. Bucketed tables create almost equally distributed data file parts.

Declare a table with 5 buckets on the user column:
hive> CREATE TABLE post_count (user STRING, count INT)
> CLUSTERED BY (user) INTO 5 BUCKETS;

set hive.enforce.bucketing = true;
hive> insert overwrite table post_count
> select user, count(post) from posts group by user;
The number of reducers will be set to 5.

Random Sample of a Bucketed Table
hive> select * from post_count TABLESAMPLE(BUCKET 1 OUT OF 2);
OK
user5 1
user1 2
Time taken: 11.758 seconds
hive>

Pros and Cons of Hive Bucketing
Pros: It provides faster query response, like partitioning. Because each bucket holds an equal volume of data, map-side joins on bucketed tables are quicker.
Cons: We can define the number of buckets during table creation, but loading an equal volume of data into the buckets has to be managed by the programmer.

Joins
Joins in Hive are trivial.
– Supports outer joins: left, right, and full joins.
– Can join multiple tables.

The default join is the inner join:
– Rows are joined where the keys match.
– Rows that do not have matches are not included in the result.

Points to observe in joins:
Only equality joins are allowed.
More than two tables can be joined in the same query.
LEFT, RIGHT, and FULL OUTER joins exist in order to provide more control over rows for which there is no match in the ON clause.

Simple Inner Join
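A minimal sketch of an inner join in HiveQL; the posts and users tables and their columns are assumed for illustration:
hive> SELECT p.user, p.post, u.age
> FROM posts p JOIN users u
> ON (p.user = u.user);
Only rows whose user value appears in both tables are returned.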

Outer Join

Outer Join Examples
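Minimal sketches of the three outer-join variants, under the same assumed posts and users tables:
hive> SELECT p.user, p.post, u.age
> FROM posts p LEFT OUTER JOIN users u ON (p.user = u.user);
hive> SELECT p.user, p.post, u.age
> FROM posts p RIGHT OUTER JOIN users u ON (p.user = u.user);
hive> SELECT p.user, p.post, u.age
> FROM posts p FULL OUTER JOIN users u ON (p.user = u.user);
Rows without a match on the preserved side are returned with NULLs for the other table's columns.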

Hive Architecture

Steps
The above diagram shows how a typical query flows through the system.
Step 1: The UI calls the execute interface of the Driver.
Step 2: The Driver creates a session handle for the query and sends the query to the compiler to generate an execution plan.
Steps 3 & 4: The compiler needs the metadata, so it sends a getMetaData request to the MetaStore and receives the sendMetaData response.

Step 5: This metadata is used to type-check the expressions in the query tree. The plan generated by the compiler consists of stages, with each stage being either a map/reduce job, a metadata operation, or an operation on HDFS. For map/reduce stages, the plan contains map operator trees (operator trees that are executed on the mappers) and a reduce operator tree (for operations that need reducers).
Step 6: The execution engine submits these stages to the appropriate components (steps 6, 6.1, 6.2, and 6.3). Once the output is generated, it is written to a temporary HDFS file through the serializer. The temporary files are used to provide input to the subsequent map/reduce stages of the plan. For DML operations, the final temporary file is moved to the table's location.
Steps 7, 8 & 9: For queries, the contents of the temporary file are read by the execution engine directly from HDFS as part of the fetch call from the Driver.

Major Components of Hive
UI: The user interface for users to submit queries and other operations to the system.
Driver: The component that receives the queries from the UI and creates a session handle for each query.
Compiler: The component that parses the query, does semantic analysis on the different query blocks and query expressions, and eventually generates an execution plan with the help of the table and partition metadata looked up from the metastore.
MetaStore: The component that stores all the structure information of the various tables and partitions in the warehouse, including column and column-type information, the serializers and deserializers necessary to read and write data, and the corresponding HDFS files where the data is stored.
Execution Engine: The component that executes the execution plan created by the compiler. The execution engine manages the dependencies between the different stages of the plan and executes these stages on the appropriate system components. This is the core of the Hadoop Hive architecture.