Understanding apache-druid

Suman Banerjee · Jul 14, 2017

About This Presentation

Talks about what Druid is and why
Architecture
Use cases in real-time analytics
Query patterns
Industry adoption
and more


Slide Content

Interactive Analytics at Scale. By: Suman Banerjee

Druid Concepts – What is it?
Druid is an open source, fast, distributed, column-oriented data store. It is designed for low-latency ingestion and very fast ad hoc, aggregation-based analytics.
Pros:
- Fast response for aggregation operations (often sub-second).
- Real-time streaming ingestion, integrating with many popular systems such as Kafka, Samza, and Spark.
- Traditional batch ingestion (Hadoop based).
Cons / limitations:
- Joins are not mature enough.
- Limited query options compared to other SQL-like solutions.

Brief History of Druid
Druid was started in 2011 to power the analytics at Metamarkets. The project was open-sourced and moved to an Apache License in February 2015.

Industries running Druid in production
- Metamarkets: Druid is the primary data store for Metamarkets' full-stack visual analytics service for the RTB (real-time bidding) space. Ingesting over 30 billion events per day, Metamarkets provides insight to its customers through complex ad hoc queries, with query times of around 1 second roughly 95% of the time.
- Airbnb: Druid powers slice-and-dice analytics on both historical and real-time metrics. It significantly reduces the latency of analytic queries and helps people get insights more interactively.
- Alibaba: The Alibaba Search Group uses Druid for real-time analytics of users' interactions with its popular e-commerce site.
- Cisco: Cisco uses Druid to power a real-time analytics platform for network flow data.
- eBay: eBay uses Druid to aggregate multiple data streams for real-time user behavior analytics, ingesting at a very high rate (over 100,000 events/sec), with the ability to query or aggregate data by any combination of dimensions and to support over 100 concurrent queries without impacting ingest rate or query latency.

Industries …

Druid in Production – Metamarkets
- 3M+ events/sec through Druid's real-time ingestion.
- 100+ PB of data.
- Applications serving thousands of concurrent queries per second.
- Scales horizontally across thousands of cores.
References:
https://metamarkets.com/2016/impact-on-query-speed-from-forced-processing-ordering-in-druid/
https://metamarkets.com/2016/distributing-data-in-druid-at-petabyte-scale/

A real example of Druid in action
Reference: https://whynosql.com/2015/11/06/lambda-architecture-with-druid-at-gumgum/

Ideal requirements for Druid? You need:
- Fast aggregation and arbitrary data exploration at low latency over huge data sets.
- Fast response on near-real-time event data (ingested data is immediately available for querying).
- No single point of failure (SPoF).
- The ability to handle petabytes of data with many dimensions.
- Sub-second, time-oriented summarization of the incoming data stream.
NOTE: before moving on to the architecture, the next slide shows a typical use case to illustrate what has been said so far.

Druid Concepts – An example
The data:
timestamp            | publisher         | advertiser | gender | country | click | price
2011-01-01T01:01:35Z | bieberfever.com   | google.com | Male   | USA     | 0     | 0.65
2011-01-01T01:03:63Z | bieberfever.com   | google.com | Male   | USA     | 0     | 0.62
2011-01-01T01:04:51Z | bieberfever.com   | google.com | Male   | USA     | 1     | 0.45
2011-01-01T01:00:00Z | ultratrimfast.com | google.com | Female | UK      | 0     | 0.87
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Female | UK      | 0     | 0.99
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Female | UK      | 1     | 1.53

GROUP BY timestamp, publisher, advertiser, gender, country :: impressions = COUNT(1), clicks = SUM(click), revenue = SUM(price)

timestamp            | publisher         | advertiser | gender | country | impressions | clicks | revenue
2011-01-01T01:00:00Z | ultratrimfast.com | google.com | Male   | USA     | 1800        | 25     | 15.70
2011-01-01T01:00:00Z | bieberfever.com   | google.com | Male   | USA     | 2912        | 42     | 29.18
2011-01-01T02:00:00Z | ultratrimfast.com | google.com | Male   | UK      | 1953        | 17     | 17.31
2011-01-01T02:00:00Z | bieberfever.com   | google.com | Male   | UK      | 3194        | 170    | 34.01
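This roll-up is what Druid performs at ingestion time. As a minimal sketch (not taken from the slides), the snippet below shows how the same aggregation could be declared in the dataSchema of an ingestion spec; the data source name ad_events, the granularities, and the file name are illustrative assumptions.

# Hypothetical sketch: declaring the roll-up above as ingestion-time metrics.
# The data source name, granularities, and file name are assumptions, not from the slides.
cat > rollup-dataschema.json <<'EOF'
{
  "dataSource": "ad_events",
  "granularitySpec": {
    "segmentGranularity": "DAY",
    "queryGranularity": "HOUR"
  },
  "metricsSpec": [
    { "type": "count",     "name": "impressions" },
    { "type": "longSum",   "name": "clicks",  "fieldName": "click" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "price" }
  ]
}
EOF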

Druid – Architecture
[Diagram: clients send queries to the Broker node, which fans out to Real-Time and Historical nodes. Streaming data is ingested by Real-Time nodes; static data is indexed via tasks managed by the Overlord node. Segments are persisted to HDFS deep storage and served by Historical nodes.]

Druid – Architecture (cluster management dependencies)
[Diagram: the same data and query flow, with the Coordinator node, ZooKeeper, and the metadata store managing segment assignment to Historical nodes, and HDFS acting as deep storage.]

Druid – Components
Broker node, Real-Time node, Overlord node, Middle-Manager node, Historical node, Coordinator node.
Aside from these nodes, there are three external dependencies:
- A running ZooKeeper cluster for cluster service discovery and maintenance of the current data topology.
- A metadata storage instance for maintaining metadata about the data segments that should be served by the system.
- A "deep storage" system to hold the stored segments.

Druid – Data Storage Layer
Segments and data storage: Druid stores its index in segment files, which are partitioned by time. Segments are columnar: the data for each column is laid out in separate data structures.

Druid – Query Patterns
Query types: Timeseries, TopN, GroupBy and aggregations, Time Boundary, Search.
A query typically specifies: a) queryType, b) granularity, c) filter, d) aggregations, e) post-aggregations. A minimal example is sketched below.
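As a rough illustration of these building blocks (not taken from the slides), the sketch below shows a minimal timeseries query; the data source ad_events, the interval, the filter, and the metric names are assumptions carried over from the earlier roll-up example.

# Hypothetical minimal sketch of a native timeseries query showing queryType,
# granularity, filter, aggregations, and postAggregations.
# Data source, interval, and field names are illustrative assumptions.
cat > example-timeseries.json <<'EOF'
{
  "queryType": "timeseries",
  "dataSource": "ad_events",
  "granularity": "hour",
  "intervals": ["2011-01-01/2011-01-02"],
  "filter": { "type": "selector", "dimension": "country", "value": "USA" },
  "aggregations": [
    { "type": "longSum",   "name": "clicks",  "fieldName": "clicks" },
    { "type": "doubleSum", "name": "revenue", "fieldName": "revenue" }
  ],
  "postAggregations": [
    { "type": "arithmetic", "name": "revenuePerClick", "fn": "/",
      "fields": [
        { "type": "fieldAccess", "name": "revenue", "fieldName": "revenue" },
        { "type": "fieldAccess", "name": "clicks",  "fieldName": "clicks" }
      ]
    }
  ]
}
EOF
# Posted to the broker the same way as the demo queries:
# curl -X POST -H 'Content-Type: application/json' -d @example-timeseries.json http://localhost:8082/druid/v2/?pretty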

Demo

Task Submit Commands
1. Clear the HDFS storage location:
   hdfs dfs -rm -r /user/root/segments
2. Make sure the data source exists in the local FS (/root/labtest/druid_hadoop/druid-0.10.0/quickstart/Test/pageviewsLatforCountExmaple.json) and upload it to HDFS:
   hdfs dfs -put -f pageviewsLat.json /user/root/quickstart/Test
3. Create an index task on Druid (a hedged sketch of what such a task file might contain follows below):
   curl -X 'POST' -H 'Content-Type: application/json' -d @quickstart/Test/pageviewsLat-index-forCountExample.json localhost:8090/druid/indexer/v1/task
4. Task information can be seen at <overlord_host>:8090/console.html.
5. Verify that the segments were created under /user/root/segments:
   hdfs dfs -ls /user/root/segments
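The demo task file itself is not included in the slides. As a rough, assumption-laden sketch, a Hadoop batch index task for this kind of pageview data could look like the following; the data source name, field names (time, url, user, latencyMs), interval, and HDFS path are all guesses made for illustration, not the actual contents of pageviewsLat-index-forCountExample.json.

# Hypothetical sketch of a Hadoop batch index task; all names, fields, and paths
# below are illustrative assumptions, not the real demo file.
cat > pageviews-index-task-sketch.json <<'EOF'
{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviewsLat",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "time", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["url", "user"] }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "views" },
        { "type": "doubleSum", "name": "latencyMs", "fieldName": "latencyMs" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2015-09-01/2015-09-02"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": { "type": "static", "paths": "/user/root/quickstart/Test/pageviewsLat.json" }
    },
    "tuningConfig": { "type": "hadoop" }
  }
}
EOF
# It would then be submitted to the Overlord as in step 3 above:
# curl -X 'POST' -H 'Content-Type: application/json' -d @pageviews-index-task-sketch.json localhost:8090/druid/indexer/v1/task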

Query Commands
TopN: returns the top N pages by latency in descending order (a hedged sketch of such a query file follows this list).
   curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-top-latency-pages.json http://localhost:8082/druid/v2/?pretty
Timeseries: returns total latency, filtered by user = "alice", with "granularity": "day" (or "all").
   curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeseries-pages.json http://localhost:8082/druid/v2/?pretty
groupBy:
A) Returns aggregated latency, grouped by user + url.
   curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-aggregateLatencyGrpByURLUser.json http://localhost:8082/druid/v2/?pretty
B) Returns aggregated page count (i.e. number of URLs accessed), grouped by user.
   curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-countURLAccessedGrpByUser.json http://localhost:8082/druid/v2/?pretty
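The demo query files are not included in the slides. As a minimal sketch, a topN query of the kind the first command posts could look roughly like this; the data source pageviewsLat, the interval, and the field names url and latencyMs are assumptions inferred from the demo, not the actual file contents.

# Hypothetical sketch of a topN query: top pages by total latency, descending.
# The data source, interval, and field names are assumptions, not the real demo file.
cat > topn-latency-pages-sketch.json <<'EOF'
{
  "queryType": "topN",
  "dataSource": "pageviewsLat",
  "granularity": "all",
  "intervals": ["2015-09-01/2015-09-02"],
  "dimension": "url",
  "metric": "totalLatency",
  "threshold": 10,
  "aggregations": [
    { "type": "doubleSum", "name": "totalLatency", "fieldName": "latencyMs" }
  ]
}
EOF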

Query Commands (continued)
Time Boundary: time boundary queries return the earliest and latest data points of a data set.
   curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-timeBoundary-pages.json http://localhost:8082/druid/v2/?pretty
Search: a search query returns dimension values that match the search specification, e.g. here, values of the url dimension containing the text "facebook".
   curl -L -H 'Content-Type: application/json' -XPOST --data-binary @quickstart/Test/query/pageviewsLatforCount-search-URL-pages.json http://localhost:8082/druid/v2/?pretty
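Along the same lines, hedged sketches of these last two query types are shown below; as before, the data source, interval, and dimension names are assumptions made for illustration, not the actual demo files.

# Hypothetical sketches of timeBoundary and search queries; data source, interval,
# and dimension names are assumptions, not the real demo files.
cat > timeboundary-pages-sketch.json <<'EOF'
{
  "queryType": "timeBoundary",
  "dataSource": "pageviewsLat"
}
EOF
cat > search-url-pages-sketch.json <<'EOF'
{
  "queryType": "search",
  "dataSource": "pageviewsLat",
  "granularity": "all",
  "intervals": ["2015-09-01/2015-09-02"],
  "searchDimensions": ["url"],
  "query": { "type": "insensitive_contains", "value": "facebook" }
}
EOF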

Thank You