BDA: Introduction to HIVE, PIG and HBASE

About This Presentation

Introduction to HIVE, PIG, and HBASE


Slide Content

Presented By: Ms. Neeharika Tripathi, Assistant Professor, Department of Computer Science and Engineering, AJAY KUMAR GARG ENGINEERING COLLEGE, GHAZIABAD
Big Data and Analytics (KDS-601)
Introduction to HIVE, HBASE, PIG

HBase - Introduction
HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. Its data model is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS). HBase is a part of the Hadoop ecosystem that provides random, real-time read/write access to data in the Hadoop File System. One can store data in HDFS either directly or through HBase. Data consumers read/access the data in HDFS randomly using HBase, which sits on top of the Hadoop File System and provides read and write access.

HBase Architecture
HBase has three major components: the client library, a master server, and region servers. Region servers can be added or removed as per requirement.

Master Server
The master server assigns regions to the region servers, taking the help of Apache ZooKeeper for this task. It handles load balancing of the regions across region servers: it unloads the busy servers and shifts the regions to less occupied servers. It maintains the state of the cluster by negotiating the load balancing. It is also responsible for schema changes and other metadata operations such as the creation of tables and column families.
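
As a rough sketch of the kind of schema operation the master server coordinates, the following Java snippet uses the HBase 2.x client's Admin API to create a table with one column family. The table name "users" and the column family "personal" are hypothetical, and the connection settings are assumed to come from an hbase-site.xml on the classpath.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        // Connection settings are read from hbase-site.xml (assumed to be on the classpath).
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Define a table "users" with a single column family "personal" (hypothetical names).
            TableDescriptorBuilder table = TableDescriptorBuilder.newBuilder(TableName.valueOf("users"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("personal"));
            // The master performs the schema change and assigns the new regions to region servers.
            admin.createTable(table.build());
        }
    }
}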

Region Servers
Region servers communicate with the client and handle data-related operations. They handle read and write requests for all the regions under them and decide the size of the regions by following the region size thresholds.

ZooKeeper
ZooKeeper is a distributed coordination service that helps to manage a large set of hosts. Managing and coordinating a service in a distributed environment is a complicated process; ZooKeeper solves this problem with its simple architecture and API, allowing developers to focus on core application logic. For instance, Apache HBase uses ZooKeeper to track the status of distributed data, and ZooKeeper can easily support a large Hadoop cluster. To retrieve information, each client machine communicates with one of the ZooKeeper servers, and ZooKeeper keeps an eye on synchronization and coordination across the cluster. Some of the best Apache ZooKeeper features are:
Simplicity: It coordinates with the help of a shared hierarchical namespace.
Reliability: The system keeps performing even if more than one node fails.
Speed: In workloads where reads are more common than writes, it performs well, at roughly a 10:1 read-to-write ratio.
Scalability: Performance can be enhanced by deploying more machines.
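
As a minimal sketch of how a client coordinates through ZooKeeper's hierarchical namespace (the ensemble address localhost:2181, the znode path /cluster-status, and its payload are assumptions for illustration), a Java client can publish a piece of shared state that any other client in the cluster can read:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperStatusExample {
    public static void main(String[] args) throws Exception {
        // Block until the session to one of the ZooKeeper servers is established.
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) connected.countDown();
        });
        connected.await();

        // Publish a piece of shared state as a znode in the hierarchical namespace.
        zk.create("/cluster-status", "healthy".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);

        // Any other client machine can read (and watch) the same znode for coordination.
        byte[] data = zk.getData("/cluster-status", false, null);
        System.out.println(new String(data));
        zk.close();
    }
}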

HBase Data Model
The data model in HBase is designed to accommodate semi-structured data that could vary in field size, data type, and columns. Additionally, the layout of the data model makes it easier to partition the data and distribute it across the cluster.

Components of the Data Model
Tables – HBase tables are logical collections of rows stored in separate partitions called Regions, and every Region is served by exactly one Region Server.
Rows – A row is one instance of data in a table and is identified by a rowkey. Rowkeys are unique in a Table and are always treated as a byte[].
Column Families – Data in a row are grouped together as Column Families. Each Column Family has one or more Columns, and the Columns in a family are stored together in a low-level storage file known as an HFile. Column Families form the basic unit of physical storage to which certain HBase features like compression are applied, so proper care should be taken when designing the Column Families in a table.

Columns – A Column Family is made of one or more columns. A Column is identified by a Column Qualifier that consists of the Column Family name concatenated with the Column name using a colon, for example columnfamily:columnname. There can be multiple Columns within a Column Family, and Rows within a table can have a varied number of Columns.
Cell – A Cell stores data and is essentially a unique combination of rowkey, Column Family, and Column (Column Qualifier). The data stored in a Cell is called its value, and the data type is always treated as byte[].
Version – The data stored in a Cell is versioned, and versions of data are identified by their timestamps. The number of versions of data retained in a Column Family is configurable; by default, this value is 3.
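
To make the data model concrete, the sketch below uses the HBase Java client to write and read a single cell, addressed by rowkey, Column Family, and Column Qualifier. The table "users", row "row1", family "personal", and qualifier "name" are hypothetical, and, as noted above, every value is handled as a byte[].

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {
            // A cell is addressed by rowkey + column family + column qualifier; values are byte[].
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
            table.put(put);

            // Random read of the same cell by rowkey.
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] value = result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));   // prints "Alice"
        }
    }
}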

Hive - Introduction
Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarize Big Data, and it makes querying and analyzing easy. Hive was initially developed by Facebook; later, the Apache Software Foundation took it up and developed it further as open source under the name Apache Hive. It is used by different companies. For example, Amazon uses it in Amazon Elastic MapReduce.
Hive is not:
A relational database
A design for OnLine Transaction Processing (OLTP)
A language for real-time queries and row-level updates
Features of Hive:
It stores schema in a database and processed data in HDFS.
It provides a SQL-type language for querying called HiveQL or HQL.
It is familiar, fast, scalable, and extensible.

Architecture of Hive
Metastore – It stores metadata for each of the tables, such as their schema and location. Hive also includes the partition metadata. This helps the driver track the progress of the various data sets distributed over the cluster. The metadata is stored in a traditional RDBMS format. Hive metadata helps the driver keep track of the data and is highly crucial, so a backup server regularly replicates the metadata so that it can be retrieved in case of data loss.

User Interface – Hive is data warehouse infrastructure software that creates interaction between the user and HDFS. The user interfaces that Hive supports are the Hive Web UI, the Hive command line, and Hive HDInsight (on Windows Server).
Driver – It acts like a controller that receives the HiveQL statements. The driver starts the execution of a statement by creating sessions, and it monitors the life cycle and progress of the execution. The driver stores the necessary metadata generated during the execution of a HiveQL statement. It also acts as a collection point of the data or query result obtained after the Reduce operation.
Compiler – It performs the compilation of the HiveQL query, converting the query to an execution plan. The plan contains the tasks and the steps needed to be performed by MapReduce to get the output as translated by the query. The compiler in Hive converts the query to an Abstract Syntax Tree (AST); it first checks for compatibility and compile-time errors, then converts the AST to a Directed Acyclic Graph (DAG).

HiveQL Process Engine – HiveQL is similar to SQL for querying on schema information in the Metastore. It is one of the replacements for the traditional approach of writing MapReduce programs: instead of writing a MapReduce program in Java, we can write a query for the MapReduce job and process it.
Execution Engine – The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates the same results as MapReduce; it uses the flavor of MapReduce.
HDFS or HBase – The Hadoop Distributed File System or HBase are the data storage techniques used to store data in the file system.

Working of Hive
1. Execute Query: The Hive interface, such as the Command Line or Web UI, sends the query to the Driver (any database driver such as JDBC, ODBC, etc.) to execute.
2. Get Plan: The driver takes the help of the query compiler, which parses the query to check the syntax and the query plan, or the requirement of the query.
3. Get Metadata: The compiler sends a metadata request to the Metastore (any database).
4. Send Metadata: The Metastore sends the metadata as a response to the compiler.
5. Send Plan: The compiler checks the requirement and resends the plan to the driver. Up to here, the parsing and compiling of the query is complete.

Components of Apache Hive
The major components of Apache Hive are:
1. Hive Client
2. Hive Services
3. Processing and Resource Management
4. Distributed Storage

Hive Client
Hive supports applications written in any language, like Python, Java, C++, Ruby, etc., using JDBC, ODBC, and Thrift drivers for performing queries on Hive. Hence, one can easily write a Hive client application in the language of one's choice. Hive clients are categorized into three types:
1. Thrift Clients: The Hive server is based on Apache Thrift, so it can serve requests from a Thrift client.
2. JDBC Client: Hive allows Java applications to connect to it using the JDBC driver. The JDBC driver uses Thrift to communicate with the Hive server.
3. ODBC Client: The Hive ODBC driver allows applications based on the ODBC protocol to connect to Hive. Similar to the JDBC driver, the ODBC driver uses Thrift to communicate with the Hive server.
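
As a minimal sketch of the JDBC client path (the HiveServer2 URL, credentials, and the employees table below are assumptions, not part of the slides), a Java application can submit HiveQL through the standard java.sql API, with the driver talking to HiveServer2 over Thrift:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // Load the Hive JDBC driver and connect to HiveServer2 (hypothetical host/port/credentials).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String url = "jdbc:hive2://localhost:10000/default";
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {
            // A HiveQL query; Hive compiles it into MapReduce (or Tez/Spark) jobs behind the scenes.
            ResultSet rs = stmt.executeQuery("SELECT dept, COUNT(*) FROM employees GROUP BY dept");
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}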

Hive Services
To perform all queries, Hive provides various services like HiveServer2, Beeline, etc. The services offered by Hive are:
1. Beeline
2. HiveServer2
3. Hive Driver
4. Hive Compiler
5. Optimizer
6. Execution Engine
7. Metastore
8. HCatalog
9. WebHCat

Processing and Resource Management: Hive internally uses a MapReduce framework as the actual engine for executing the queries. MapReduce is a software framework for writing applications that process massive amounts of data in parallel on large clusters of commodity hardware. A MapReduce job works by splitting the data into chunks, which are processed by map and reduce tasks.
Distributed Storage: Hive is built on top of Hadoop, so it uses the underlying Hadoop Distributed File System for distributed storage.
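
For a rough idea of the kind of job Hive generates under the hood, the classic word-count sketch below shows a map task that processes one chunk (input split) at a time and a reduce task that aggregates the intermediate pairs. This is an illustrative example, not Hive's actual generated code; the input and output paths are supplied as command-line arguments.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map task: runs on one input split (chunk) and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) ctx.write(new Text(token), ONE);
            }
        }
    }

    // Reduce task: receives all counts for one word and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));       // input directory in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));     // output directory in HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}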

Hive Shell
The Hive shell is the primary way to interact with Hive. It is the default service in Hive and is also called the CLI (command line interface). The Hive shell is similar to the MySQL shell, and Hive users can run HQL queries in it. In the Hive shell, the up and down arrow keys are used to scroll through previous commands. HiveQL is case-insensitive (except for string comparisons), and the tab key will autocomplete (suggest completions as you type) Hive keywords and functions.

The Hive shell can run in two modes:
Non-interactive mode: In non-interactive mode, the shell runs the HiveQL statements in a script file without user interaction, using the -f option. Example: $ hive -f script.q, where script.q is a file of HiveQL statements.
Interactive mode: Hive can work in interactive mode by directly typing the command "hive" in the terminal. Example:
$ hive
hive> show databases;

Apache Pig - Introduction
Apache Pig is a high-level data flow platform for executing MapReduce programs on Hadoop. The language used for Pig is Pig Latin. Pig scripts get internally converted to MapReduce jobs and are executed on data stored in HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark. Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task that can be achieved using Pig can also be achieved using Java in MapReduce. Pig is easy to learn, especially if you are familiar with SQL, and its multi-query approach reduces the number of times data is scanned. Compared to writing raw MapReduce, this means roughly 1/20th the lines of code and 1/16th the development time, while the performance of Pig is on par with raw MapReduce. Pig provides data operations like filters, joins, and ordering, and nested data types like tuples, bags, and maps, which are missing from MapReduce. Pig Latin is easy to write and read.

Apache Pig Run Modes
Apache Pig executes in two modes: Local Mode and MapReduce Mode.

Local Mode
It executes in a single JVM and is used for development, experimenting, and prototyping. Here, files are installed and run using the localhost, and the local mode works on the local file system; the input and output data are stored in the local file system. The command for the local-mode Grunt shell: $ pig -x local
MapReduce Mode
The MapReduce mode is also known as Hadoop mode and is the default mode. In this mode, Pig renders Pig Latin into MapReduce jobs and executes them on the cluster. It can be executed against a semi-distributed or fully distributed Hadoop installation. Here, the input and output data are present on HDFS. The command for MapReduce mode: $ pig or $ pig -x mapreduce

Ways to Execute a Pig Program
The following are the ways of executing a Pig program in local and MapReduce mode:
Interactive Mode – In this mode, Pig is executed in the Grunt shell. To invoke the Grunt shell, run the pig command. Once the Grunt shell starts, we can provide Pig Latin statements and commands interactively at the command line.
Batch Mode – In this mode, we can run a script file having a .pig extension. These files contain Pig Latin commands.
Embedded Mode – In this mode, we can define our own functions, called UDFs (User Defined Functions), using programming languages like Java and Python, and drive Pig from a host program (see the sketch below).
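
As a minimal sketch of embedded mode (the input file access_log.txt, its schema, and the output directory totals_out are hypothetical), a Java host program can drive Pig through the PigServer API, registering Pig Latin statements programmatically:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class EmbeddedPigExample {
    public static void main(String[] args) throws Exception {
        // Local mode for experimenting; use ExecType.MAPREDUCE to run against a Hadoop cluster.
        PigServer pig = new PigServer(ExecType.LOCAL);
        // Register Pig Latin statements; the data file and schema here are hypothetical.
        pig.registerQuery("logs = LOAD 'access_log.txt' AS (user:chararray, bytes:long);");
        pig.registerQuery("grouped = GROUP logs BY user;");
        pig.registerQuery("totals = FOREACH grouped GENERATE group, SUM(logs.bytes);");
        // Store the result; this triggers execution of the generated job(s).
        pig.store("totals", "totals_out");
        pig.shutdown();
    }
}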