Hive: Introduction, Integration and Workflow, Partitioning and Bucketing
Hive Integration and Workflow
[Figure: Flow of log analysis. Hourly log file → Hadoop (HDFS) → log compression → Hive Table 1 / Hive Table 2]
Hourly log data can be stored directly into HDFS, and then data cleaning is performed on the log file. Finally, Hive tables can be created to query the log data.
Hive Data Units
1. Databases: the namespace for tables.
2. Tables: sets of records that share a similar schema.
3. Partitions: logical separations of the data, based on classifying the given information by specific attributes. Once Hive has partitioned the data on a specified key, it assembles the records into specific folders as the records are inserted.
4. Buckets (or clusters): similar to partitions, but a hash function is used to segregate the data and determine the bucket into which a record should be placed.
Introduction
You might have seen an encyclopedia in your school or college library. It is a set of books that will give you information about almost anything. Do you know the best thing about an encyclopedia? Yes, you guessed it correctly: the words are arranged alphabetically. For example, if you have the word "Pyramids" in mind, you go directly to the book titled "P"; you don't have to search the other books. Can you imagine how tough the task would be to find a single entry if the entries were stored without any order?
Here, storing the words alphabetically represents indexing, while using a different location for the words that start with the same character is akin to bucketing. Similar storage techniques, partitioning and bucketing, exist in Apache Hive so that we can get faster results for search queries. In this article, we will see what partitioning and bucketing are, and when to use each.
Example
"XYZ Corp" has a customer base spread across 190+ countries, with 5 million records available. If we need to fetch the entities pertaining to a particular country and there is no partitioning, there is no choice but to scan all 5 million entities, even though the query will eventually return only a few thousand entities for that country. However, creating partitions based on country greatly alleviates this performance issue, since only the data belonging to the partition for the country in question is checked.
Partitioning a table changes how Hive structures the data storage: Hive creates subdirectories reflecting the partition structure, such as …/customers/country=ABC. Partitioning enhances performance and is recommended, but having too many partitions may prove detrimental for some queries.
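A minimal sketch of a table definition that produces the directory layout described above (the table and column names here are illustrative, not from the original slides):
CREATE TABLE customers (
  customer_id BIGINT,
  name        STRING
)
PARTITIONED BY (country STRING);
-- Each loaded country value gets its own subdirectory, e.g.
-- .../customers/country=ABC and .../customers/country=XYZ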
Bucketing
Bucketing is another technique for managing large datasets. If we partitioned the dataset on customer_id, we would end up with far too many partitions. Instead, if we bucket the customer table and use customer_id as the bucketing column, the value of this column is hashed into a user-defined number of buckets. Records with the same customer_id are always placed in the same bucket. Assuming we have more customer_ids than buckets, each bucket will house many customer_ids. While creating the table, you specify the number of buckets to distribute the data into using the syntax "CLUSTERED BY (customer_id) INTO XX BUCKETS", where XX is the number of buckets, as in the sketch below.
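A minimal sketch of such a bucketed table (the column names and the bucket count of 32 are illustrative assumptions):
CREATE TABLE customers_bucketed (
  customer_id BIGINT,
  name        STRING,
  country     STRING
)
CLUSTERED BY (customer_id) INTO 32 BUCKETS;
-- Rows with the same customer_id always hash to the same bucket file.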
PARTITIONING/BUCKETING
[Figure: Data units as arranged in a Hive database: database → tables → partitions → buckets → columns]
Semblance of the Hive Structure with a Database
The figure describes the semblance of the Hive structure with a database. A database contains several tables, and each table consists of rows and columns. In Hive, a table is stored as a folder, the partitions of a table are stored as subdirectories, and buckets are stored as files.
What is Partitioning?
Apache Hive allows us to organize a table into multiple partitions, where the same kind of data is grouped together. It is used for distributing the load horizontally. Let's understand it with an example. Suppose we have to create a table in Hive that contains the product details for a fashion e-commerce company, with the following columns: product_id, brand, size, discount, price, gender, category, and color.
Now, the first filter most customers use is gender; then they select a category such as shirts, along with its size and color. Let's see how to create the partitions for this example.
CREATE TABLE products (
  product_id STRING,
  brand      STRING,
  size       STRING,
  discount   FLOAT,
  price      FLOAT
)
PARTITIONED BY (gender STRING, category STRING, color STRING);
Hive will then store the data in a directory structure like:
/user/hive/warehouse/mytable/gender=male/category=shoes/color=black
Partitioning the data gives us performance benefits and also helps us organize the data. Now, let's see when to use partitioning in Hive.
When to use Partitioning?
Use partitioning when a heavily queried column has low cardinality. For example, if you partition by country name, a maximum of about 195 partitions will be created, and that number of directories is manageable for Hive. On the other hand, do not create partitions on columns with very high cardinality, such as product IDs, timestamps, or prices: they would create millions of directories, which would be impossible for Hive to manage. Partitioning is effective when the data volume in each partition is not very high. For example, if you have airline data and want to calculate the total number of flights in a day, the result will take longer to compute over the partition "Dubai", which has one of the busiest airports in the world, whereas a partition like "Albania" will return results more quickly.
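To illustrate the benefit, a query that filters on the partition columns lets Hive prune directories and read only the matching subtree (a sketch using the products table defined above):
SELECT product_id, price
FROM products
WHERE gender = 'male' AND category = 'shoes' AND color = 'black';
-- Hive reads only .../gender=male/category=shoes/color=black,
-- skipping every other partition directory.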
What is Bucketing?
In the above example, we know we cannot create a partition over the price column: its data type is FLOAT, and a practically unlimited number of unique prices is possible. Hive would have to generate a separate directory for each unique price, which would be very difficult to manage. Instead, we can manually define the number of buckets we want for such columns.
In bucketing, the partitions can be subdivided into buckets based on the hash of a column. This gives extra structure to the data, which can be used for more efficient queries.
CREATE TABLE products (
  product_id STRING,
  brand      STRING,
  size       STRING,
  discount   FLOAT,
  price      FLOAT
)
PARTITIONED BY (gender STRING, category STRING, color STRING)
CLUSTERED BY (price) INTO 50 BUCKETS;
Now only 50 buckets will be created, no matter how many unique values the price column has. For example, all the products with a price in [0 – 500] may go into the first bucket, products with a price in [500 – 1000] into the next, and so on (the assignment is actually determined by the hash of the price, not by a contiguous range).
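One consequence of this structure: Hive can sample a bucketed table efficiently with TABLESAMPLE, reading only the requested bucket rather than the full table (a sketch against the table above):
SELECT product_id, price
FROM products
TABLESAMPLE (BUCKET 1 OUT OF 50 ON price);
-- Reads roughly 1/50th of the data: only the rows whose
-- hashed price placed them in the first bucket.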
When to Use Partitioning/Bucketing?
Bucketing works well when the field has high cardinality and the data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high. Partitioning can be done on multiple fields with an order, whereas bucketing is typically done on a single field. The figure above shows how these data units are arranged in a Hive cluster.
When to use Bucketing?
We cannot partition on a column with very high cardinality: too many partitions result in a multitude of Hadoop files, which increases the load on the NameNode, since it has to carry the metadata for each of those files. If map-side joins are involved in your queries, bucketed tables are a good option. A map-side join is a process where two tables are joined using only the map phase, without any reduce phase, as sketched below.
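A sketch of how a bucketed map-side join might be enabled (the orders table and its bucketing are illustrative assumptions; both tables would need to be bucketed on the join key into compatible bucket counts):
SET hive.optimize.bucketmapjoin = true;

SELECT /*+ MAPJOIN(o) */ c.name, o.total
FROM customers_bucketed c
JOIN orders o
  ON c.customer_id = o.customer_id;
-- With both tables bucketed on customer_id, each mapper can join
-- one bucket of customers against the matching bucket of orders.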
Hive Architecture
The various parts of Hive are as follows:
Hive Command-Line Interface (Hive CLI): the most commonly used interface to interact with Hive.
Hive Web Interface: a simple graphical user interface to interact with Hive and execute queries.
Hive Server: an optional server that can be used to submit Hive jobs from a remote client.
JDBC/ODBC: jobs can be submitted from a JDBC client; one can write Java code to connect to Hive and submit jobs to it.
Driver: Hive queries are sent to the driver for compilation, optimization, and execution.
Metastore: Hive table definitions and mappings to the data are stored in a metastore. A metastore consists of the metastore service, which offers the interface to Hive, and the database, which stores the data definitions, mappings to the data, and more.
The metadata stored in the metastore includes the IDs of databases, tables, and indexes, the time of creation of a table, the input format used for a table, the output format used for a table, and so on. The metastore is updated whenever a table is created or deleted from Hive. There are three kinds of metastore: embedded, local, and remote.
Embedded Metastore
Mainly used for unit tests; only one process can connect to the metastore at a time. It is the default metastore of Hive and uses the Apache Derby database. Both the database and the metastore service run embedded in the main Hive server process.
Local Metastore
Metadata can be stored in an RDBMS such as MySQL. A local metastore allows multiple connections at a time. The Hive metastore service runs in the main Hive server process, but the metastore database runs in a separate process and can be on a separate host.
Remote Metastore
The Hive driver and the metastore interface run in different JVMs. This way the database can be firewalled from the Hive users, and the database credentials are completely isolated from them.
Database
CREATE DATABASE IF NOT EXISTS STUDENTS
COMMENT 'Student details'
WITH DBPROPERTIES ('creator'='Shubha');
IF NOT EXISTS: an optional clause. It creates the database only if it does not already exist; if a database with the same name already exists, the user is notified and no error is raised.
COMMENT: provides a short description of the database.
WITH DBPROPERTIES: an optional clause used to specify properties of the database as (key, value) pairs.
SHOW DATABASES;
DESCRIBE DATABASE STUDENTS;
DESCRIBE DATABASE EXTENDED STUDENTS; (also shows the DB properties)
ALTER DATABASE STUDENTS SET DBPROPERTIES ('edited-by'='VSR');
To make the database the working database: USE STUDENTS;
To drop the database: DROP DATABASE STUDENTS;
Suppose the database STUDENTS has 10 tables within it. How do we delete the complete database along with the tables contained therein? Use the command DROP DATABASE STUDENTS CASCADE; By default the mode is RESTRICT, which means the database will not be dropped if it contains tables.
Tables
Hive provides two kinds of tables: internal (managed) tables and external tables.
Managed table: Hive stores managed tables under its warehouse folder, and the complete life cycle of the table and its data is managed by Hive. When an internal table is dropped, both the data and the metadata are dropped. When you create a table in Hive, it is internal (managed) by default; to create an external table, you must use the keyword EXTERNAL.
Example
CREATE TABLE IF NOT EXISTS STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
DESCRIBE STUDENT;
DESCRIBE FORMATTED STUDENT;
The second command displays the complete metadata of the table. You will see one row called Table Type, which shows either managed table or external table.
EXTERNAL or SELF-MANAGED TABLE
1. When the table is dropped, the data is retained in the underlying location.
2. The EXTERNAL keyword is used to create an external table.
3. A location needs to be specified to store the dataset in that particular location.
Example
CREATE EXTERNAL TABLE IF NOT EXISTS EXT_STUDENT (rollno INT, name STRING, gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/STUDENT_INFO';
Loading Data into a Table from a File
LOAD DATA LOCAL INPATH '/root/hivedemos/students.csv' OVERWRITE INTO TABLE EXT_STUDENT;
Collection Data Types
To work with collection data types, consider this input:
1001,John,Smith:jones,Mark1!45:Mark2!46:Mark3!43
1002,Jack,Smith:jones,Mark1!46:Mark2!47:Mark3!42
CREATE TABLE STUDENT_INFO (rollno INT, name STRING, sub ARRAY<STRING>, marks MAP<STRING,INT>)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY ':'
MAP KEYS TERMINATED BY '!';
LOAD DATA LOCAL INPATH '/root/hivedemos/studentinfo.csv' INTO TABLE STUDENT_INFO;
Querying a Table
SELECT * FROM EXT_STUDENT;
Querying Collection Data Types
SELECT * FROM STUDENT_INFO;
SELECT name, sub FROM STUDENT_INFO;
SELECT name, marks['Mark1'] FROM STUDENT_INFO;
SELECT name, sub[0] FROM STUDENT_INFO;
Partitions
In Hive, a query reads the entire dataset even when a WHERE-clause filter is specified on a particular column. This becomes a bottleneck in most MapReduce jobs, as it involves a huge amount of I/O, so it is necessary to reduce the I/O required by the MapReduce job to improve query performance. A very common method of reducing I/O is data partitioning, which splits the larger dataset into more meaningful chunks.
In the example, we will refrain from adding the partitioned column alongside the other columns of the dataset and trust Hive to manage this automatically. Partitions are of two types:
Static partition: it is up to the user to specify the partition (segregation unit) into which the data from the file is to be loaded.
Dynamic partition: the user simply states the column on which the partitioning will take place, and Hive then creates partitions based on the unique values in that column.
Static Partition
Inserting input data files individually into a partition table is static partitioning. Static partitions are usually preferred when loading big files into Hive tables, and static partitioning saves time in loading data compared to dynamic partitioning. You "statically" add a partition to the table and move the file into that partition. Partitions can be altered in static partitioning. You can get the partition column value from the filename, the day of the date, etc., without reading the whole big file. To use static partitioning in Hive, set the property hive.mapred.mode=strict (this property is set in hive-site.xml). Static partitioning operates in strict mode: queries against a partitioned table must include a partition filter in the WHERE clause to limit the data scanned. Static partitioning can be performed on both Hive managed tables and external tables.
Hive Dynamic Partitioning
A single insert into a partitioned table, with the partitions derived from the data, is known as dynamic partitioning. Usually, dynamic partitioning loads the data from a non-partitioned table, and it takes more time in loading data compared to static partitioning. Dynamic partitioning is suitable when you have large data stored in a table, or when you want to partition on columns whose distinct values are not known in advance. With dynamic partitioning, no WHERE-clause partition filter is required; however, the partitions cannot be altered. Dynamic partitioning can be performed on both Hive external and managed tables. To use dynamic partitioning in Hive, the mode must be non-strict.
By default, dynamic partitioning is enabled in Hive, but it runs in strict mode, which means that at least one level of static partitioning is required before Hive will perform dynamic partitioning inside that static segregation unit.
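A sketch of what this mixed mode looks like (the table and column names are illustrative): the first partition key is fixed statically, and Hive derives the second dynamically from the data.
INSERT OVERWRITE TABLE sales PARTITION (country = 'US', state)
SELECT id, amount, state
FROM staged_sales
WHERE country = 'US';
-- country is static; Hive creates one state=... subdirectory
-- per distinct state value found in the query result.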
Static partitions comprise columns whose values are known at compile time.
CREATE TABLE IF NOT EXISTS STATIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa=4.0)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa=4.0;
To add one or more static partitions based on the gpa column, use the ALTER statement:
ALTER TABLE STATIC_PART_STUDENT ADD PARTITION (gpa=3.5);
INSERT OVERWRITE TABLE STATIC_PART_STUDENT PARTITION (gpa=3.5)
SELECT rollno, name FROM EXT_STUDENT WHERE gpa=3.5;
Dynamic Partitions
These have columns whose values are known only at execution time.
CREATE TABLE IF NOT EXISTS DYNAMIC_PART_STUDENT (rollno INT, name STRING)
PARTITIONED BY (gpa FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
To load data into the dynamic-partition table from another table:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE DYNAMIC_PART_STUDENT PARTITION (gpa)
SELECT rollno, name, gpa FROM EXT_STUDENT;
Bucketing
The concept of bucketing is based on the hashing technique. The hash of the current column value modulo the number of required buckets is computed (say, hash(x) % 3), and based on the resulting value, the record is stored in the corresponding bucket.
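A small worked example, under the assumption that for integer columns the hash is the value itself (Hive's default behavior for ints): with 3 buckets and Id as the bucketing column, Id = 6 → 6 % 3 = 0 → bucket 0; Id = 7 → 7 % 3 = 1 → bucket 1; Id = 8 → 8 % 3 = 2 → bucket 2.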
Example of Bucketing in Hive
First, select the database in which we want to create the table.
hive> use showbucket;
hive> create table emp_demo (Id int, Name string, Salary float)
      row format delimited fields terminated by ',';
Now, load the data into the table.
hive> load data local inpath '/home/codegyani/hive/emp_details' into table emp_demo;
Enable bucketing by using the following command:
hive> set hive.enforce.bucketing = true;
Create a bucketed table by using the following command:
hive> create table emp_bucket (Id int, Name string, Salary float)
      clustered by (Id) into 3 buckets
      row format delimited fields terminated by ',';
Now, insert the data of the dummy table into the bucketed table.
hive> insert overwrite table emp_bucket select * from emp_demo;
Here, we can see that the data is divided into three buckets.
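One way to verify this (the warehouse path is an assumption; it depends on hive.metastore.warehouse.dir and the database name) is to list the table's directory, which typically contains one file per bucket:
hive> dfs -ls /user/hive/warehouse/showbucket.db/emp_bucket;
(three bucket files are expected, e.g. 000000_0, 000001_0, 000002_0)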
What is a Hive View?
An Apache Hive view is similar to a Hive table; it is generated from a query, on the basis of requirements.
hive> CREATE VIEW emp_35000 AS SELECT * FROM employee WHERE salary > 35000;
hive> SELECT * FROM emp_35000;
The following query drops the view named emp_35000:
hive> DROP VIEW emp_35000;
Subquery
Write a subquery to count the occurrences of each word in a file:
CREATE TABLE word_count AS
SELECT word, count(1) AS count
FROM (SELECT EXPLODE(SPLIT(line, ' ')) AS word FROM docs) w
GROUP BY word
ORDER BY word;
SELECT * FROM word_count;
The EXPLODE function takes an array as input and outputs the elements of the array as separate rows.
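The query above assumes a table named docs with a single string column; a sketch of how it might be created and loaded (the input path is illustrative):
CREATE TABLE docs (line STRING);
LOAD DATA LOCAL INPATH '/root/hivedemos/docs.txt' INTO TABLE docs;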
Joins
Aggregation
SELECT AVG(gpa) FROM STUDENT;
SELECT COUNT(*) FROM STUDENT;
Group By and Having
SELECT rollno, name, gpa FROM STUDENT GROUP BY rollno, name, gpa HAVING gpa > 4.0;
RCFILE IMPLEMENTATION
RCFile (Record Columnar File) is a data placement structure that determines how to store relational tables on computer clusters.
create table employee_rc (name string, salary int, deptno int, DOJ date)
row format delimited fields terminated by ','
stored as RCFILE;
insert into table employee_rc select * from employee;
Alternatively, with an explicit storage location:
create table employee_rc (name string, salary int, deptno int, DOJ date)
row format delimited fields terminated by ','
stored as RCFILE
location '/data/in/employee_rc';
SERDE
SerDe is an acronym for Serializer/Deserializer. A Hive SerDe handles both serialization and deserialization, and interprets the results of deserialization as individual fields for processing. A SerDe allows Hive to read data in from a table and write it back out to HDFS in any custom format, and anyone can write their own SerDe for their own data formats.
HDFS files –> InputFileFormat –> <key, value> –> Deserializer –> Row object
Row object –> Serializer –> <key, value> –> OutputFileFormat –> HDFS files
It is very important to note that the "key" part is ignored when reading and is always a constant when writing; the row object is stored in the "value".
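A sketch of how a SerDe is attached to a table in DDL, here using the built-in OpenCSVSerde as an example (the table itself is illustrative):
CREATE TABLE csv_students (rollno STRING, name STRING, gpa STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
STORED AS TEXTFILE;
-- OpenCSVSerde parses each line as CSV; note that it exposes
-- all columns as strings regardless of the declared types.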
USER DEFINED FUNCTION
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A simple UDF that prefixes its input with "Hello ".
public class SimpleUDFExample extends UDF {
  public Text evaluate(Text input) {
    return new Text("Hello " + input.toString());
  }
}
import org.apache.hadoop.io.Text;
import org.junit.Assert;
import org.junit.Test;

// Unit test verifying the UDF's output for a sample input.
public class SimpleUDFExampleTest {
  @Test
  public void testUDF() {
    SimpleUDFExample example = new SimpleUDFExample();
    Assert.assertEquals("Hello world", example.evaluate(new Text("world")).toString());
  }
}
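Once compiled and packaged, the UDF might be registered and used from Hive as follows (the jar path and function name are illustrative assumptions):
hive> ADD JAR /root/hivedemos/simple-udf.jar;
hive> CREATE TEMPORARY FUNCTION hello AS 'SimpleUDFExample';
hive> SELECT hello(name) FROM STUDENT;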