Apache Phoenix: Put the SQL back in NoSQL
Osama Hussein, March 2021
Agenda: History, Overview, Architecture, Code, Capabilities, Scenarios
1. History: From open-source repo to Top-Level Apache Project
Overview (Apache Phoenix)
Began as an internal project at Salesforce.com.
JAN 2014: originally open-sourced on GitHub.
MAY 2014: became a Top-Level Apache Project.
2. Overview: UDFs, Transactions and Schema
Overview (Apache Phoenix)
Support for late-bound, schema-on-read.
SQL and JDBC API support.
Access to data stored and produced in other components such as Apache Spark and Apache Hive.
HBase was developed as part of Apache Hadoop and runs on top of the Hadoop Distributed File System (HDFS); it scales linearly and shards automatically.
Overview (Apache Phoenix)
Apache Phoenix is an add-on for Apache HBase that provides a programmatic ANSI SQL interface.
It implements best-practice optimizations to enable software engineers to develop next-generation data-driven applications based on HBase.
Users create and interact with tables in the form of typical DDL/DML statements using the standard JDBC API.
Overview (Apache Phoenix)
Written in Java and SQL.
Provides Atomicity, Consistency, Isolation and Durability (ACID) guarantees.
Fully integrated with other Hadoop products such as Spark, Hive, Pig, Flume, and MapReduce.
Overview (Apache Phoenix)
Included in Cloudera Data Platform 7.0 and above and in the Hortonworks HDP distribution 2.1 and above.
Also available as part of Cloudera Labs.
Part of the Hadoop ecosystem.
Overview (SQL Support)
Compiles SQL into one or more HBase scans and orchestrates their execution, producing standard JDBC result sets.
All standard SQL query constructs are supported.
Overview (SQL Support)
Makes direct use of the HBase API, along with coprocessors and custom filters.
Performance: milliseconds for small queries, seconds for tens of millions of rows.
Overview (Bulk Loading)
MapReduce-based: CSV and JSON, via the Phoenix MapReduce library.
Single-threaded: CSV, via the PostgreSQL-style psql command-line tool, with HBase on the local machine.
Overview (User-Defined Functions)
Temporary UDFs exist for the current session only.
Permanent UDFs are stored in the SYSTEM.FUNCTION table.
UDFs can be used in SQL statements and indexes.
Tenant-specific UDF usage is supported.
Updating a UDF jar requires a cluster bounce.
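As a sketch, a permanent UDF might be registered like this (the function name, class name and jar path are illustrative assumptions, not from the deck):

```sql
-- Hypothetical UDF: implementing class and jar location are placeholders.
CREATE FUNCTION my_reverse(VARCHAR) RETURNS VARCHAR
    AS 'com.example.udf.MyReverseFunction'
    USING JAR 'hdfs://namenode:8020/phoenix/udfs/my-udfs.jar';

-- Once registered, the UDF can be used like any built-in function:
SELECT my_reverse(host) FROM server_metrics;
```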
Overview (Transactions)
Uses Apache Tephra for cross-row/cross-table ACID support.
Create tables with the flag TRANSACTIONAL=true.
Enable transactions and the snapshot directory, and set the timeout value, in hbase-site.xml.
A transaction starts with the first statement against a transactional table and ends with a commit or rollback.
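A minimal sketch of a transactional table (table and column names are illustrative, and this assumes transactions have already been enabled in hbase-site.xml):

```sql
-- Declaring the table transactional; requires Tephra support
-- to be configured on the cluster.
CREATE TABLE accounts (
    id      BIGINT NOT NULL PRIMARY KEY,
    balance BIGINT
) TRANSACTIONAL=true;

-- The first statement against the table starts a transaction;
-- it ends when the connection issues a commit or rollback.
UPSERT INTO accounts VALUES (1, 100);
```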
Overview (Transactions)
By default, applications let HBase manage timestamps.
If the application needs to control the timestamp, the CurrentSCN property must be specified at connection time.
CurrentSCN then controls the timestamp for any DDL, DML, or query on that connection.
Overview (Schema)
Table metadata is stored in a versioned HBase table (up to 1,000 versions).
UPDATE_CACHE_FREQUENCY lets the user declare how often the server is checked for metadata updates.
Values: ALWAYS, NEVER, or a millisecond value.
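For example, a table could cache metadata and re-check the server at most once every 15 minutes (the table and column names here are illustrative):

```sql
-- 900000 ms = 15 minutes between server-side metadata checks.
CREATE TABLE metadata_demo (
    id      BIGINT NOT NULL PRIMARY KEY,
    payload VARCHAR
) UPDATE_CACHE_FREQUENCY=900000;
```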
Overview (Schema)
A Phoenix table can be:
Built from scratch.
Mapped to an existing HBase table, either as a read-write table or as a read-only view.
Overview (Schema) 18 Read-Write Table: column families will be created automatically if they don’t already exist. An empty key value will be added to the first column family of each existing row to minimize the size of the projection for queries.
Overview (Schema)
Read-Only View: all column families must already exist.
The only change made to the HBase table is the addition of the Phoenix coprocessors used for query processing.
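A sketch of mapping a read-only view onto an existing HBase table (the table name "t1", column family "cf1" and qualifier "val" are assumptions; quoted identifiers preserve the case of the HBase names):

```sql
-- The row key maps to the PRIMARY KEY column; other columns
-- reference existing column family / qualifier pairs.
CREATE VIEW "t1" (
    pk VARCHAR PRIMARY KEY,
    "cf1"."val" VARCHAR
);
```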
3. Architecture: Architecture, Phoenix Data Model, Query Execution and Environment
Architecture
Architecture (Phoenix Data Model)
Architecture (Server Metrics Example)
Architecture (Query Execution)
1. Identify row key ranges from the query.
2. Overlay row key ranges with region boundaries.
3. Execute parallel scans.
4. Filter using skip scan.
5. Intercept the scan in a coprocessor.
6. Perform a final merge sort.
Architecture (Environment)
Data Warehouse; Extract, Transform, Load (ETL); BI and Visualization
4. Code: Commands and Sample Code
Code (Commands)
DML commands: UPSERT VALUES, UPSERT SELECT, DELETE.
DDL commands: CREATE TABLE, CREATE VIEW, DROP TABLE, DROP VIEW.
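The DML commands above can be sketched against the deck's server-metrics example (table and column names follow the later scenario slides; the archive table is an illustrative assumption):

```sql
-- Insert or update a single row (Phoenix has no separate INSERT/UPDATE).
UPSERT INTO server_metrics (host, date, response_time)
    VALUES ('sf1-node1', CURRENT_DATE(), 120);

-- Copy older rows into another table with UPSERT SELECT.
UPSERT INTO metrics_archive
    SELECT * FROM server_metrics WHERE date < CURRENT_DATE() - 30;

-- Remove the rows that were archived.
DELETE FROM server_metrics WHERE date < CURRENT_DATE() - 30;
```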
Transactions: Row Count

@Override
public void getRowCount(ResultSet resultSet) throws SQLException {
    Tuple row = resultSet.unwrap(PhoenixResultSet.class).getCurrentRow();
    Cell kv = row.getValue(0);
    ImmutableBytesWritable tmpPtr = new ImmutableBytesWritable(
        kv.getValueArray(), kv.getValueOffset(), kv.getValueLength());
    // A single Cell will be returned with the count(*) - we decode that here
    rowCount = PLong.INSTANCE.getCodec().decodeLong(tmpPtr, SortOrder.getDefault());
}
Transactions: Internal State

private void changeInternalStateForTesting(PhoenixResultSet rs) {
    // Get and set the internal state for testing purposes.
    ReadMetricQueue testMetricsQueue = new TestReadMetricsQueue(LogLevel.OFF, true);
    StatementContext ctx = (StatementContext) Whitebox.getInternalState(rs, "context");
    Whitebox.setInternalState(ctx, "readMetricsQueue", testMetricsQueue);
    Whitebox.setInternalState(rs, "readMetricsQueue", testMetricsQueue);
}
5. Capabilities: Features and Capabilities
Capabilities
Overlays on top of the HBase data model.
Keeps a versioned schema repository.
Query processor.
Capabilities
Cost-based query optimizer.
Enhances existing statistics collection.
Generates histograms to drive query-optimization decisions and join ordering.
Capabilities
Secondary indexes:
Boost the speed of queries without relying on specific row-key designs.
Enable users to use star schemas.
Leverage SQL tools and Online Analytical Processing (OLAP) tools.
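A secondary index on the server-metrics example might look like this (index and column names are illustrative):

```sql
-- Queries filtering on response_time can use the index
-- instead of depending on the table's row-key design;
-- INCLUDE adds host as a covered column.
CREATE INDEX response_time_idx
    ON server_metrics (response_time)
    INCLUDE (host);
```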
Capabilities
Row timestamp column.
Sets minimum and maximum time ranges for scans.
Improves performance, especially when querying the tail end of the data.
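A sketch of declaring a row timestamp column (table and column names are assumptions): mapping a date column to the HBase cell timestamp lets time-range scans be pruned on the server side.

```sql
-- created_date maps to the HBase cell timestamp via ROW_TIMESTAMP;
-- queries with a time range on created_date scan only matching cells.
CREATE TABLE event_log (
    created_date DATE NOT NULL,
    event_id     BIGINT NOT NULL,
    payload      VARCHAR,
    CONSTRAINT pk PRIMARY KEY (created_date ROW_TIMESTAMP, event_id)
);
```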
6. Scenarios: Use Cases
Scenarios (Server Metrics Example)
Scenarios (Chart Response Time Per Cluster)

SELECT substr(host, 1, 3), trunc(date, 'DAY'), avg(response_time)
FROM server_metrics
WHERE date > CURRENT_DATE() - 7
  AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
GROUP BY substr(host, 1, 3), trunc(date, 'DAY')
Scenarios (Find 5 Longest GC Times)

SELECT host, date, gc_time
FROM server_metrics
WHERE date > CURRENT_DATE() - 7
  AND substr(host, 1, 3) IN ('sf1', 'sf3', 'sf7')
ORDER BY gc_time DESC
LIMIT 5
Thanks! Any questions?
You can find me at:
GitHub: @sxaxmz
LinkedIn: linkedin.com/in/husseinosama