Apache Phoenix

By Osama Hussein (OsamaHussein27), July 2021. 44 slides.

About This Presentation

A brief description of Apache Phoenix.


Slide Content

Slide 1. Apache Phoenix: Put the SQL back in NoSQL. Osama Hussein, March 2021.

Slide 2. Agenda: History, Overview, Architecture, Capabilities, Code, Scenarios.

Slide 3. Section 1 (History): From open-source repo to top-level Apache project.

Slide 4. Overview (Apache Phoenix): Began as an internal project at Salesforce.com. Originally open-sourced on GitHub in January 2014; became a top-level Apache project in May 2014.

Slide 5. Section 2 (Overview): UDFs, Transactions, and Schema.

Slide 6. Overview (Apache Phoenix):
- Support for late-bound, schema-on-read.
- SQL and JDBC API support.
- Access to data stored and produced in other components such as Apache Spark and Apache Hive.
- HBase was developed as part of Apache Hadoop and runs on top of the Hadoop Distributed File System (HDFS).
- HBase scales linearly and shards automatically.

Slide 7. Overview (Apache Phoenix):
- Apache Phoenix is an add-on for Apache HBase that provides a programmatic ANSI SQL interface.
- It implements best-practice optimizations to enable software engineers to develop next-generation, data-driven applications based on HBase.
- Create and interact with tables through typical DDL/DML statements using the standard JDBC API.

Slide 8. Overview (Apache Phoenix):
- Written in Java and SQL.
- Provides Atomicity, Consistency, Isolation, and Durability (ACID) guarantees.
- Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce.

Slide 9. Overview (Apache Phoenix):
- Included in Cloudera Data Platform 7.0 and above, and in the Hortonworks distribution for HDP 2.1 and above.
- Available as part of Cloudera Labs.
- Part of the Hadoop ecosystem.

Slide 10. Overview (SQL Support):
- Compiles SQL into native HBase scans and orchestrates their execution.
- Produces standard JDBC result sets.
- All standard SQL query constructs are supported.

Slide 11. Overview (SQL Support):
- Uses the HBase API directly, along with coprocessors and custom filters.
- Performance: milliseconds for small queries, seconds for tens of millions of rows.

Slide 12. Overview (Bulk Loading):
- MapReduce-based: CSV and JSON, via the Phoenix MapReduce library.
- Single-threaded: CSV, via Phoenix's PSQL tool (psql.py), with HBase on the local machine.
(Example invocations below.)
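
As a rough illustration (not from the slides), the two loading paths are typically invoked as follows; the table name, file paths, and client jar name are hypothetical and vary by Phoenix version:

    # MapReduce-based bulk load of a CSV file (hypothetical table and paths)
    hadoop jar phoenix-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool \
        --table EXAMPLE --input /data/example.csv

    # Single-threaded CSV load via psql.py against HBase on the local machine
    bin/psql.py -t EXAMPLE localhost /data/example.csv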

Slide 13. Overview (User-Defined Functions):
- Temporary UDFs exist for the current session only.
- Permanent UDFs are stored in the SYSTEM.FUNCTION table.
- UDFs can be used in SQL statements and in indexes.
- Tenant-specific UDF usage is supported.
- Updating a UDF jar requires a cluster bounce.
(Registration sketch below.)
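
A minimal sketch of registering a permanent UDF over JDBC; the function name, implementing class, and jar location are invented placeholders, not from the slides:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RegisterUdfSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical Phoenix JDBC URL pointing at a local ZooKeeper quorum.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                // Register a permanent UDF backed by a class in a jar on HDFS;
                // the class and jar path are illustrative only.
                stmt.execute("CREATE FUNCTION my_reverse(VARCHAR) RETURNS VARCHAR "
                        + "AS 'com.example.udf.MyReverseFunction' "
                        + "USING JAR 'hdfs://namenode:8020/phoenix/udf/my-udf.jar'");
                // Once registered, my_reverse() can appear in queries and indexes.
            }
        }
    }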

Slide 14. Overview (Transactions):
- Uses Apache Tephra for cross-row, cross-table ACID support.
- Create tables with the flag TRANSACTIONAL=true.
- Enable transactions, set the snapshot directory, and set the timeout value in hbase-site.xml.
- A transaction starts with the first statement against a transactional table and ends with a commit or a rollback (see the sketch below).
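
A minimal sketch of that lifecycle over JDBC, assuming transactions have already been enabled in hbase-site.xml; the table, column names, and URL are invented for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class TransactionSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
                conn.setAutoCommit(false); // end the transaction explicitly below
                try (Statement stmt = conn.createStatement()) {
                    // Table created with the transactional flag from the slide.
                    stmt.execute("CREATE TABLE IF NOT EXISTS my_table "
                            + "(id BIGINT NOT NULL PRIMARY KEY, val VARCHAR) "
                            + "TRANSACTIONAL=true");
                    // The transaction begins with the first statement against the table...
                    stmt.executeUpdate("UPSERT INTO my_table VALUES (1, 'hello')");
                    conn.commit(); // ...and ends with a commit...
                } catch (Exception e) {
                    conn.rollback(); // ...or a rollback.
                    throw e;
                }
            }
        }
    }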

Slide 15. Overview (Transactions):
- Applications usually let HBase manage timestamps.
- If the application needs to control the timestamp, the CurrentSCN property must be specified at connection time.
- CurrentSCN controls the timestamp for any DDL, DML, or query (see the sketch below).
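
For example, a connection pinned to an explicit timestamp might look like this; the URL and timestamp value are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.util.Properties;

    public class CurrentScnSketch {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Pin all DDL, DML, and queries on this connection to timestamp 1000.
            props.setProperty("CurrentSCN", Long.toString(1000L));
            try (Connection conn =
                    DriverManager.getConnection("jdbc:phoenix:localhost", props)) {
                // Statements executed here read and write as of that timestamp.
            }
        }
    }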

Slide 16. Overview (Schema):
- Table metadata is stored in a versioned HBase table (up to 1,000 versions).
- UPDATE_CACHE_FREQUENCY lets the user declare how often the server is checked for metadata updates.
- Values: ALWAYS, NEVER, or a millisecond value (see the sketch below).
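
As an illustration, the property is set as a table option at creation time; the table definition is made up:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CacheFrequencySketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                // Check the server for metadata updates at most every 15 minutes.
                stmt.execute("CREATE TABLE IF NOT EXISTS metrics "
                        + "(id BIGINT NOT NULL PRIMARY KEY, val VARCHAR) "
                        + "UPDATE_CACHE_FREQUENCY=900000");
            }
        }
    }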

Slide 17. Overview (Schema): A Phoenix table can be:
- Built from scratch.
- Mapped to an existing HBase table, either as a read-write table or as a read-only view.

Slide 18. Overview (Schema): Read-write table:
- Column families are created automatically if they do not already exist.
- An empty key value is added to the first column family of each existing row to minimize the size of the projection for queries.

Slide 19. Overview (Schema): Read-only view:
- All column families must already exist.
- The only change made to the HBase table is the addition of the Phoenix coprocessors used for query processing.
(Both mappings are sketched below.)
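
To make the two mappings concrete, here is a hedged sketch; the HBase table names and columns are hypothetical, and the quoting preserves HBase's case-sensitive names:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class MappingSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                // Read-write mapping: CREATE TABLE over the existing HBase table "t1"
                // (column family "cf"); missing column families would be created.
                stmt.execute("CREATE TABLE \"t1\" "
                        + "(pk VARCHAR PRIMARY KEY, \"cf\".\"col1\" VARCHAR)");
                // Read-only mapping: CREATE VIEW over the existing HBase table "t2";
                // all column families must already exist.
                stmt.execute("CREATE VIEW \"t2\" "
                        + "(pk VARCHAR PRIMARY KEY, \"cf\".\"col1\" VARCHAR)");
            }
        }
    }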

Slide 20. Section 3 (Architecture): Architecture, Phoenix Data Model, Query Execution, and Environment.

Slide 21. Architecture (diagram).

Slide 22. Architecture (diagram).

Slide 23. Architecture (Phoenix Data Model) (diagram).

Slide 24. Architecture (Server Metrics Example) (diagram).

Slide 25. Architecture (Server Metrics Example) (diagram).

Slide 26. Architecture (Query Execution):
1. Identify row key ranges from the query.
2. Overlay row key ranges with regions.
3. Execute parallel scans.
4. Filter using skip scan.
5. Intercept the scan in a coprocessor.
6. Perform the final merge sort.

Slide 27. Architecture (Environment):
- Data warehouse.
- Extract, Transform, Load (ETL).
- BI and visualization.

Slide 28. Section 4 (Code): Commands and sample code.

Slide 29. Code (Commands):
- DML commands: UPSERT VALUES, UPSERT SELECT, DELETE.
- DDL commands: CREATE TABLE, CREATE VIEW, DROP TABLE, DROP VIEW.
(Usage sketch below.)
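
A brief sketch of these commands over JDBC, with an invented pair of tables:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CommandsSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                // DDL: create two tables (names and columns are invented).
                stmt.execute("CREATE TABLE IF NOT EXISTS src (id BIGINT NOT NULL PRIMARY KEY, v VARCHAR)");
                stmt.execute("CREATE TABLE IF NOT EXISTS dst (id BIGINT NOT NULL PRIMARY KEY, v VARCHAR)");
                // DML: UPSERT VALUES inserts or updates a single row.
                stmt.executeUpdate("UPSERT INTO src VALUES (1, 'a')");
                // DML: UPSERT SELECT copies rows from one table into another.
                stmt.executeUpdate("UPSERT INTO dst SELECT id, v FROM src");
                // DML: DELETE removes matching rows.
                stmt.executeUpdate("DELETE FROM src WHERE id = 1");
                conn.commit(); // Phoenix connections do not auto-commit by default.
                // DDL: drop the tables when done.
                stmt.execute("DROP TABLE src");
                stmt.execute("DROP TABLE dst");
            }
        }
    }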

Slide 30. Connection: long-running vs. short-running.

    Connection conn = DriverManager.getConnection("jdbc:phoenix:my_server:longRunning", longRunningProps);
    Connection conn = DriverManager.getConnection("jdbc:phoenix:my_server:shortRunning", shortRunningProps);

Slide 31. Sample Code: Create Table.

    @Test
    public void createTable() throws Exception {
        String tableName = generateUniqueName();
        long numSaltBuckets = 6;
        String ddl = "CREATE TABLE " + tableName
                + " (K VARCHAR NOT NULL PRIMARY KEY, V VARCHAR)"
                + " SALT_BUCKETS = " + numSaltBuckets;
        Connection conn = DriverManager.getConnection(getUrl());
        conn.createStatement().execute(ddl);
    }

Slide 32. Sample Code: Read Table.

    @Test
    public void readTable() throws Exception {
        String tableName = generateUniqueName();
        long numSaltBuckets = 6;
        long numRows = 1000;
        long numExpectedTasks = numSaltBuckets;
        insertRowsInTable(tableName, numRows);
        String query = "SELECT * FROM " + tableName;
        // conn is assumed to be an open Phoenix connection (a test-fixture field).
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(query);
        PhoenixResultSet resultSetBeingTested = rs.unwrap(PhoenixResultSet.class);
        changeInternalStateForTesting(resultSetBeingTested);
        while (resultSetBeingTested.next()) {} // drain the result set
        resultSetBeingTested.close();
        Set<String> expectedTableNames = Sets.newHashSet(tableName);
        assertReadMetricValuesForSelectSql(Lists.newArrayList(numRows),
                Lists.newArrayList(numExpectedTasks),
                resultSetBeingTested, expectedTableNames);
    }

Slide 33. Sample Code: Row Count.

    @Override
    public void getRowCount(ResultSet resultSet) throws SQLException {
        Tuple row = resultSet.unwrap(PhoenixResultSet.class).getCurrentRow();
        Cell kv = row.getValue(0);
        ImmutableBytesWritable tmpPtr = new ImmutableBytesWritable(
                kv.getValueArray(), kv.getValueOffset(), kv.getValueLength());
        // A single Cell will be returned with the count(*) - we decode that here
        rowCount = PLong.INSTANCE.getCodec().decodeLong(tmpPtr, SortOrder.getDefault());
    }

Slide 34. Sample Code: Internal State.

    private void changeInternalStateForTesting(PhoenixResultSet rs) {
        // Get and set the internal state for testing purposes.
        ReadMetricQueue testMetricsQueue = new TestReadMetricsQueue(LogLevel.OFF, true);
        StatementContext ctx = (StatementContext) Whitebox.getInternalState(rs, "context");
        Whitebox.setInternalState(ctx, "readMetricsQueue", testMetricsQueue);
        Whitebox.setInternalState(rs, "readMetricsQueue", testMetricsQueue);
    }

Slide 35. Section 5 (Capabilities): Features and capabilities.

Slide 36. Capabilities:
- Overlays on top of the HBase data model.
- Keeps a versioned schema repository.
- Query processor.

Slide 37. Capabilities:
- Cost-based query optimizer.
- Enhances existing statistics collection.
- Generates histograms to drive query optimization decisions and join ordering.

Slide 38. Capabilities: Secondary indexes:
- Boost query speed without relying on specific row-key designs.
- Enable users to use star schemas.
- Leverage SQL tools and Online Analytical Processing (OLAP) tools.
(Index sketch below.)
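
A minimal sketch of creating a covered secondary index, with an invented table and columns:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class IndexSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                stmt.execute("CREATE TABLE IF NOT EXISTS events "
                        + "(id BIGINT NOT NULL PRIMARY KEY, host VARCHAR, latency BIGINT)");
                // Secondary index on host, covering latency, so queries filtering
                // on host avoid a scan ordered only by the row key.
                stmt.execute("CREATE INDEX IF NOT EXISTS events_host_idx "
                        + "ON events (host) INCLUDE (latency)");
            }
        }
    }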

Slide 39. Capabilities: Row timestamp column:
- Sets the minimum and maximum time range for scans.
- Improves performance, especially when querying the tail end of the data.
(Declaration sketch below.)
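
For illustration, a table declaring a ROW_TIMESTAMP primary-key column might look like this; the table and column names are hypothetical:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RowTimestampSketch {
        public static void main(String[] args) throws Exception {
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
                 Statement stmt = conn.createStatement()) {
                // Map the leading primary-key column to the HBase cell timestamp,
                // letting Phoenix bound scans by time range automatically.
                stmt.execute("CREATE TABLE IF NOT EXISTS server_metrics "
                        + "(created DATE NOT NULL, host VARCHAR NOT NULL, gc_time BIGINT "
                        + "CONSTRAINT pk PRIMARY KEY (created ROW_TIMESTAMP, host))");
            }
        }
    }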

Slide 40. Section 6 (Scenarios): Use cases.

Slide 41. Scenarios (Server Metrics Example) (diagram).

Slide 42. Scenarios (Chart Response Time Per Cluster):

    SELECT SUBSTR(host, 1, 3), TRUNC(date, 'DAY'), AVG(response_time)
    FROM server_metrics
    WHERE date > CURRENT_DATE() - 7
      AND SUBSTR(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
    GROUP BY SUBSTR(host, 1, 3), TRUNC(date, 'DAY')

Slide 43. Scenarios (Find 5 Longest GC Times):

    SELECT host, date, gc_time
    FROM server_metrics
    WHERE date > CURRENT_DATE() - 7
      AND SUBSTR(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
    ORDER BY gc_time DESC
    LIMIT 5

Slide 44. Thanks! Any questions? You can find me at:
- GitHub: @sxaxmz
- LinkedIn: linkedin.com/in/husseinosama