HBase

SatyaHadoop 319 views 24 slides Aug 01, 2018

About This Presentation

In this session you will learn:
HBase Introduction
Row & Column storage
Characteristics of a huge DB
What is HBase?
HBase Data-Model
HBase vs RDBMS
HBase architecture
HBase in operation
Loading Data into HBase
HBase shell commands
HBase operations through Java
HBase operations through MR
To kno...


Slide Content

Big Data and Hadoop Training: HBase

Agenda
HBase Introduction
Row & Column storage
Characteristics of a huge DB
What is HBase?
HBase Data Model
HBase vs RDBMS
HBase architecture
HBase in operation
Loading Data into HBase
HBase shell commands
HBase operations through Java
HBase operations through MR

What is HBase?
Open-source project built on top of Apache Hadoop
NoSQL database
Distributed, scalable store
Column-family datastore

How do you pick SQL or NoSQL?
What does your data look like?
Is your data model likely to change?
Is your data growing exponentially?
Will you be doing real-time analytics on operational data?

Inspiration for HBase
Google's BigTable is the inspiration for HBase. It is designed to run on a cluster of computers.
Characteristics of BigTable:
Data is 'sparse'
Data is stored as a 'sorted map'
'Distributed'
'Multi-dimensional'
'Consistent'
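The BigTable characteristics above can be illustrated with a toy sketch (not the real implementation): a sparse, sorted map whose keys are multi-dimensional (row, column, timestamp) tuples. All names here are made up for illustration.

```python
# Toy sketch of BigTable/HBase's core abstraction:
# a sparse, sorted, multi-dimensional map.

table = {}  # sparse: only populated cells take up space

def put(row, column, value, ts):
    # Keys are multi-dimensional: (row key, column, timestamp).
    table[(row, column, ts)] = value

put("row2", "cf:city", "Pune", 100)
put("row1", "cf:name", "Asha", 100)
put("row1", "cf:name", "Asha K", 200)  # a newer version of the same cell

# "Sorted map": iterating in key order groups each row's cells together,
# which is what makes range scans over row keys cheap.
for key in sorted(table):
    print(key, "->", table[key])
```

Because absent columns are simply missing keys, a table with thousands of possible columns costs nothing for rows that use only a few of them.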

HBase vs RDBMS

HBase                                             | RDBMS
Data that is accessed together is stored together | Data is normalized
Column-oriented                                   | Row-oriented (mostly)
Flexible schema; columns can be added on the fly  | Fixed schema
Good with sparse tables                           | Not optimized for sparse tables
No joins                                          | Optimized for joins
Horizontal scalability                            | Hard to shard and scale
Good for structured and semi-structured data      | Good for structured data
Row-based transactions                            | Distributed transactions

Row & Column Storage
Column-oriented store: for specific queries, not all values of a table are needed (analytical databases).
Advantages of column-oriented storage:
Reduced I/O
Values of a column across logical rows are similar, so they are better suited for compression
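The "reduced I/O" point can be made concrete with a toy sketch (in-memory lists standing in for storage; the field names are made up): answering a query over one column touches far fewer bytes in a column layout than in a row layout.

```python
# Toy illustration of why a column store reduces I/O for
# column-specific queries: "sum of all ages" touches only one column.

rows = [{"id": i, "name": f"user{i}", "age": 20 + i % 50} for i in range(1000)]

# Row-oriented layout: one serialized record per row.
row_store = [f'{r["id"]},{r["name"]},{r["age"]}'.encode() for r in rows]

# Column-oriented layout: all values of one column stored together.
col_store = {
    "id": [r["id"] for r in rows],
    "name": [r["name"] for r in rows],
    "age": [r["age"] for r in rows],
}

# A row store must read every full record to extract one field.
row_bytes_read = sum(len(rec) for rec in row_store)

# A column store reads only the "age" column.
col_bytes_read = sum(len(str(v).encode()) for v in col_store["age"])

total_age = sum(col_store["age"])
print(row_bytes_read, col_bytes_read)  # the column read is far smaller
```

The same grouping is what helps compression: a column of similar values (all small integers, all country codes) compresses better than interleaved heterogeneous fields.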

HBase Data Model

Component        | Description
Table            | Data is organized into tables; a table comprises rows
Row key          | Data is stored in rows; rows are identified by row keys (the primary key); rows are sorted by this value
Column family    | Columns are grouped into families
Column qualifier | Identifies the column within a family
Cell             | The combination of row key, column family, column qualifier, and timestamp; contains the value
Version          | Values within a cell are versioned by timestamp
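The cell and version rows of the table above can be sketched as follows (a toy model, not HBase code; the row and column names are invented): a value lives at (row key, family, qualifier, timestamp), and a cell can hold several timestamped versions.

```python
# Toy sketch of HBase cell addressing and versioning.
from collections import defaultdict

# (row key, column family, column qualifier) -> [(timestamp, value), ...]
cells = defaultdict(list)

def put(row, family, qualifier, ts, value):
    cells[(row, family, qualifier)].append((ts, value))

def get_latest(row, family, qualifier):
    # By default a read returns the version with the newest timestamp.
    versions = cells[(row, family, qualifier)]
    return max(versions)[1] if versions else None

put("user1", "info", "email", 100, "a@old.example")
put("user1", "info", "email", 200, "a@new.example")

print(get_latest("user1", "info", "email"))
```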

HBase Data Model

HBase Data Model: Regions
Regions are horizontal partitions of an HBase table. A region is denoted by the table it belongs to, its first row (inclusive), and its last row (exclusive).
Regions are the units that get distributed over the cluster. Initially a table comprises a single region, but as the region grows it eventually crosses a configurable size threshold, at which point it splits at a row boundary into two new regions of approximately equal size.
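The split behaviour above can be sketched in a few lines (a toy model with a tiny threshold; real HBase splits by on-disk size, not row count, and the row keys here are made up):

```python
# Toy sketch of region splitting: a region covers [start_row, end_row)
# and splits at a middle row key once it grows past a threshold.

SPLIT_THRESHOLD = 4  # rows per region; tiny, for illustration only

def split_region(region):
    """Split one region's sorted row keys into two daughter regions."""
    rows = region["rows"]
    mid = len(rows) // 2
    left = {"start": region["start"], "end": rows[mid], "rows": rows[:mid]}
    right = {"start": rows[mid], "end": region["end"], "rows": rows[mid:]}
    return left, right

# "" as start/end marks the first row / last row of the whole table.
region = {"start": "", "end": "", "rows": sorted(["r1", "r2", "r3", "r4", "r5"])}
if len(region["rows"]) > SPLIT_THRESHOLD:
    left, right = split_region(region)
    print(left["start"], "->", left["end"], "|", right["start"], "->", right["end"])
```

Note how the left daughter's exclusive end row equals the right daughter's inclusive start row, matching the [first row, last row) convention above.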

HBase Architecture

HBase Architecture
HBase Master: the master node. Regionservers: the slave nodes.
The HBase Master bootstraps a virgin install, assigns regions to registered regionservers, and recovers from regionserver failures.
Regionservers carry zero or more regions, take client read/write requests, and manage region splits, informing the master about the new daughter regions.
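The master's two bookkeeping duties mentioned above, assignment and failure recovery, can be sketched as follows (a deliberately naive round-robin model; real HBase uses its own balancer and ZooKeeper-mediated assignment, and all names here are invented):

```python
# Toy sketch of master bookkeeping: assign regions to regionservers,
# then reassign a crashed server's regions to the survivors.

def assign(regions, servers):
    """Round-robin assignment of regions to servers."""
    assignment = {s: [] for s in servers}
    for i, region in enumerate(regions):
        assignment[servers[i % len(servers)]].append(region)
    return assignment

regions = ["r1", "r2", "r3", "r4"]
servers = ["rs1", "rs2"]
assignment = assign(regions, servers)

# rs2 crashes: the master moves its regions to the surviving servers.
failed = assignment.pop("rs2")
survivors = list(assignment)
for i, region in enumerate(failed):
    assignment[survivors[i % len(survivors)]].append(region)

print(assignment)  # all four regions now live on rs1
```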

HBase Architecture
ZooKeeper is the authority on cluster state; for HBase this means the location of the catalog table and of the cluster master.
Assignment of regions is mediated via ZooKeeper in case servers crash mid-assignment.
An HBase client must know the location of the ZooKeeper ensemble. Thereafter, the client navigates the ZooKeeper hierarchy to learn cluster attributes such as server locations.

HBase in Operation
hbase:meta holds the list, state, and locations of all regions on the cluster. Entries in hbase:meta are keyed by region name.
A region name consists of the table name, the region's start row, the time of creation, and an MD5 hash of all of these, e.g.: TestTable,xyz,1279729913622.1b6e176fb8d8aa88fd4ab6bc80247ece.
Because row keys are sorted, finding the region that hosts a particular key is easy.
Whenever regions are split, enabled, disabled, deleted, etc., the catalog table is updated.
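A toy version of the naming and lookup scheme above can be sketched like this (illustrative only: the exact bytes HBase feeds into the MD5 differ, and the start rows are made up; the "greatest start row not greater than the key" rule is what sorted keys buy you):

```python
# Toy sketch of region naming and key-to-region routing.
import bisect
import hashlib

def region_name(table, start_row, created_ts):
    # Real HBase appends an MD5 suffix delimited by dots; the precise
    # hashed bytes here are a simplification.
    base = f"{table},{start_row},{created_ts}"
    return f"{base}.{hashlib.md5(base.encode()).hexdigest()}."

# Sorted start rows of a table's regions (the first region starts at "").
start_rows = ["", "g", "p"]

def hosting_region_index(row_key):
    # The hosting region is the one with the greatest start row <= key.
    return bisect.bisect_right(start_rows, row_key) - 1

print(region_name("TestTable", "xyz", 1279729913622))
print(hosting_region_index("hello"))  # falls in the region starting at "g"
```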

HBase in Operation
Fresh clients connect to the ZooKeeper cluster to get the location of hbase:meta, in order to figure out which regions host their user-space keys and where those regions live. Then clients interact directly with regionservers.
Clients cache the results of previous lookups, which works fine until there is a fault. If a fault happens, clients consult hbase:meta again; if that has also moved, they go back to ZooKeeper.
Writes arriving at a regionserver are first appended to a commit log and then added to an in-memory memstore. When a memstore fills, its content is flushed to the filesystem.
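The write path in the last sentence can be sketched as a toy model (lists and dicts standing in for the commit log, memstore, and flush files; the tiny memstore limit is for illustration only):

```python
# Toy sketch of the regionserver write path:
# commit log first, then memstore, flush when full.

MEMSTORE_LIMIT = 2  # entries; tiny, for illustration

wal = []          # commit log (write-ahead log): survives a crash
memstore = {}     # in-memory buffer of recent writes
flush_files = []  # immutable flushed files, newest last

def write(row, value):
    wal.append((row, value))   # 1. durability first
    memstore[row] = value      # 2. then the in-memory store
    if len(memstore) >= MEMSTORE_LIMIT:
        flush_files.append(dict(memstore))  # 3. flush to the filesystem
        memstore.clear()

for i in range(5):
    write(f"row{i}", i)

print(len(wal), len(memstore), len(flush_files))  # 5 1 2
```

The commit log is what makes the in-memory memstore safe: after a crash, writes that never reached a flush file can be replayed from the log.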

HBase in Operation
When reading, the region's memstore is consulted first. If sufficient versions are found in the memstore alone, the query completes there. Otherwise, flush files are consulted in order, from newest to oldest, either until enough versions are found to satisfy the query or until we run out of flush files.
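The read order described above, memstore first, then flush files newest to oldest, can be sketched as follows (a toy single-version model; the rows and values are made up):

```python
# Toy sketch of the regionserver read path.

memstore = {"row3": "in-memory"}
flush_files = [  # stored oldest first; reads walk them newest first
    {"row1": "old", "row2": "v1"},
    {"row2": "v2"},
]

def read(row):
    # 1. The memstore holds the most recent writes, so it wins outright.
    if row in memstore:
        return memstore[row]
    # 2. Otherwise, consult flush files from newest to oldest and stop
    #    at the first hit (newer flushes shadow older ones).
    for hfile in reversed(flush_files):
        if row in hfile:
            return hfile[row]
    return None  # ran out of flush files: the row does not exist

print(read("row3"), read("row2"), read("row1"))
```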

Loading Data into HBase
Using the HBase shell
Using the client APIs
Using Pig
Using Sqoop

HBase Shell Commands
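The shell-command slides in the original deck were images; as a stand-in, a typical session in the HBase shell looks like the following (standard HBase shell syntax; the table, column-family, and row names are made up):

```
create 'employees', 'personal', 'professional'    # table with two column families
list                                              # show all tables
put 'employees', 'row1', 'personal:name', 'Asha'  # write one cell
get 'employees', 'row1'                           # read one row
scan 'employees'                                  # read the whole table
disable 'employees'                               # required before altering/dropping
drop 'employees'                                  # delete the table
```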

Connect to HBase from Clients

HBase Use Cases
Capturing incremental data (time-series data): high-volume, high-velocity writes, e.g. sensor readings, system metrics, events, stock prices, server logs, rainfall data
Information exchange: high-volume, high-velocity writes and reads, e.g. email, chat
Content serving and web application backends: high-volume, high-velocity reads, e.g. eBay, Groupon

Thank You