Apache Samza: Past, Present and Future

EdYakabosky · 787 views · 33 slides · Nov 04, 2016

Slide Content

Apache Samza: Past, Present and Future. Kartik Paramasivam, Director of Engineering, Streams Infra @ LinkedIn.

Agenda: Stream Processing State of the Union; Apache Samza: Key Differentiators; Apache Samza Futures.

Stream Processing: processing events as soon as they happen. Stateless processing: transformations etc., looking up adjunct data (lookup databases / call services), producing results for every event. Stateful processing: triggering/producing results periodically (time windows), maintaining intermediate state, e.g. joining across multiple streams of events. Common issues: scale!! scale!! scale!! reliability!! and everything else (upgrades, debugging, diagnostics, security, ...).
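
To make the stateless case concrete, here is a minimal sketch of a Samza StreamTask (the low-level API) that produces a result for every input event; the class name and the "transformed-events" output topic are illustrative assumptions, not part of the talk.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class UppercaseTask implements StreamTask {
  // Hypothetical output topic on the "kafka" system.
  private static final SystemStream OUTPUT = new SystemStream("kafka", "transformed-events");

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Stateless transformation: one output per input event, no state kept between events.
    String event = (String) envelope.getMessage();
    collector.send(new OutgoingMessageEnvelope(OUTPUT, event.toUpperCase()));
  }
}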

Stream Processing: State of the Union. MillWheel, Storm, Heron, Spark Streaming, S4, Dempsey, Samza, Flink, Beam, Dataflow, Azure Stream Analytics, AWS Kinesis Analytics, GearPump, Kafka Streams, Orleans. Not meant to be an accurate timeline. Yes, it is CROWDED!!

Apache Samza: top-level Apache project since Dec 2014. 5 big releases (0.7, 0.8, 0.9, 0.10, 0.11). 62 contributors, 14 committers. Companies using it: LinkedIn, Uber, MetaMarkets, Netflix, Intuit, TripAdvisor, MobileAware, Optimizely, ... (https://cwiki.apache.org/confluence/display/SAMZA/Powered+By). Applications at LinkedIn: from ~20 to ~200 in 2 years.

Key Differentiators for Apache Samza: Performance!! Stability. Support for a variety of input sources. Stream processing as a service AND as an embedded library.

Performance: Accessing Adjunct Data. Diagram contrasting remote data access (the Samza processor remote-reads a database for each input event) with local data access (database changes are captured via Databus/Brooklin into RocksDB, which the processor reads locally).

Performance: Maintaining Temporary State. Diagram contrasting remote data access (the Samza processor reads and writes a remote database) with local data access (reads and writes served from an in-memory store or RocksDB, with changes backed up to a log-compacted Kafka changelog).
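
As a concrete sketch of the local-state path, the task below keeps a per-key count in a local key-value store (RocksDB-backed when so configured, with a Kafka changelog for durability); the store name "counts" and the message types are assumptions for illustration.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class CountingTask implements StreamTask, InitableTask {
  private KeyValueStore<String, Long> counts;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Reads and writes hit local RocksDB; every write also goes to the
    // log-compacted Kafka changelog so the store can be restored on failure.
    counts = (KeyValueStore<String, Long>) context.getStore("counts");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    String key = (String) envelope.getKey();
    Long current = counts.get(key);
    counts.put(key, current == null ? 1L : current + 1);
  }
}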

Performance: let us talk numbers! 100x difference between using local state and a remote NoSQL store. Local state details: 1.1 million TPS on a single processing machine (SSD), using a 3-node Kafka cluster for storing the durable changelog. Remote state details: 8,500 TPS when the Samza job was changed to access a remote NoSQL store, which was also on a 3-node (SSD) cluster.

Remote State: Asynchronous Event Processing. The event loop (a single thread) calls processAsync; asynchronous I/O calls to remote DBs/services are made using Java NIO, Netty, etc., responses are delivered back to the main thread via a callback, and the event loop is woken up to process the next message. Set task.max.concurrency > 1 to enable pipelining. Available with Samza 0.11.
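
A minimal sketch of this mode using the AsyncStreamTask interface introduced in Samza 0.11; the RemoteDbClient below is a hypothetical async client (anything returning a CompletableFuture, e.g. built on Netty or Java NIO, fits the same pattern).

import java.util.concurrent.CompletableFuture;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.AsyncStreamTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.TaskCallback;
import org.apache.samza.task.TaskCoordinator;

public class AsyncLookupTask implements AsyncStreamTask {
  // Hypothetical stand-in for a non-blocking client to a remote DB or service.
  static class RemoteDbClient {
    CompletableFuture<String> queryAsync(String key) {
      return CompletableFuture.supplyAsync(() -> "value-for-" + key);
    }
  }

  private final RemoteDbClient client = new RemoteDbClient();

  @Override
  public void processAsync(IncomingMessageEnvelope envelope, MessageCollector collector,
                           TaskCoordinator coordinator, TaskCallback callback) {
    String key = (String) envelope.getKey();
    // Issue the remote call without blocking the single-threaded event loop.
    client.queryAsync(key).whenComplete((result, error) -> {
      if (error != null) {
        callback.failure(error);   // mark this message as failed
      } else {
        // Emit or store the result here, then signal completion so the event
        // loop can move on; with task.max.concurrency > 1 several messages
        // per task can be in flight at once.
        callback.complete();
      }
    });
  }
}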

Remote State: Synchronous Processing on Multiple Threads. The event loop (a single thread) schedules process() onto a built-in thread pool; the worker threads make blocking I/O calls to remote DBs/services and wake up the event loop when done. Configure job.container.thread.pool.size = N. Available with Samza 0.11.
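
For comparison, a sketch of the same lookup written as an ordinary synchronous StreamTask: process() simply blocks on the remote call, and job.container.thread.pool.size = N lets the container run such blocking tasks on a pool of worker threads. The SyncDbClient here is again a hypothetical blocking client.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class BlockingLookupTask implements StreamTask {
  // Hypothetical stand-in for a blocking client to a remote DB or service.
  static class SyncDbClient {
    String query(String key) {
      return "value-for-" + key; // a real client would block on network I/O here
    }
  }

  private final SyncDbClient client = new SyncDbClient();

  @Override
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                      TaskCoordinator coordinator) {
    // Blocking I/O call; the built-in thread pool (job.container.thread.pool.size = N)
    // keeps other tasks in the container making progress while this one waits.
    String value = client.query((String) envelope.getKey());
    // ... use the value, e.g. emit an enriched event downstream
  }
}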

Incremental Checkpointing: MVP for Stateful Apps. Some applications have ~2 TB of state in production; stateful apps don't really work without incremental checkpointing. Diagram: a Samza task on Host 1 consumes an input stream (e.g. Kafka), keeps local state, and writes a changelog and checkpoint.

Key Differentiators for Apache Samza: Performance!! Stability. Support for a variety of input sources. Stream processing as a service AND as an embedded library.

Speed thrills... but can kill. Local state considerations: state should NOT be reseeded under normal operations (e.g. upgrades, application restarts), and minimal state should be reseeded if a container dies or is removed, or if a container is added.

How does Samza keep local state 'stable'? Diagram: a Samza job whose tasks Task-0 through Task-3 consume input-stream partitions P0 through P3 and write per-task change-logs; the containers are placed on Host-E, Host-B, and Host-C. The coordinator stream records the task-container-host mapping (Container-0 -> Host-E, Container-1 -> Host-B, Container-2 -> Host-C), and on restart the AM/JobCoordinator asks the YARN RM for the same hosts (Ask: Host-E, Allocate: Host-E). Continuous scheduling is enabled.

Backpressure in a Pipeline. Diagram: Stream A and Stream B feed Job 1, which produces Stream C for Job 2. Kafka or other durable intermediate queues are leveraged to avoid backpressure issues in a pipeline, allowing each stage to be independent of the next.

Key Differentiators for Apache Samza: Performance!! Stability. Support for a variety of input sources. Stream processing as a service AND as an embedded library.

Pluggable System Consumers. The Samza processor can consume from Kafka, Databus/Brooklin, MongoDB, DynamoDB Streams, Kinesis, ZeroMQ, and more; other candidates include Azure EventHub, Azure DocumentDB, Google Pub/Sub, Oracle, and Espresso.
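
Pluggability comes from the system-consumer plug-in point that each of these connectors implements. Below is a bare skeleton against the 0.10/0.11-era SystemConsumer interface; the class name and the trivial in-memory behavior are assumptions, not a real connector.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.SystemConsumer;
import org.apache.samza.system.SystemStreamPartition;

public class MyQueueSystemConsumer implements SystemConsumer {
  private final Map<SystemStreamPartition, String> startingOffsets = new HashMap<>();

  @Override
  public void start() { /* open connections to the external system */ }

  @Override
  public void stop() { /* close connections */ }

  @Override
  public void register(SystemStreamPartition ssp, String offset) {
    // Samza tells the consumer which stream partitions to read and from what offset.
    startingOffsets.put(ssp, offset);
  }

  @Override
  public Map<SystemStreamPartition, List<IncomingMessageEnvelope>> poll(
      Set<SystemStreamPartition> ssps, long timeout) throws InterruptedException {
    // Wait up to `timeout` ms and return any messages fetched for the requested partitions.
    Map<SystemStreamPartition, List<IncomingMessageEnvelope>> result = new HashMap<>();
    for (SystemStreamPartition ssp : ssps) {
      result.put(ssp, new ArrayList<>()); // empty in this skeleton
    }
    return result;
  }
}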

Batch Processing in Samza!! (NEW): an HDFS system consumer for Samza. The same Samza processor can be used for processing events from Kafka and from HDFS with no code changes. Scenarios: experimentation and testing, re-processing of large datasets, and datasets that are readily available on HDFS (company specific).

Samza HDFS Support (new; available since Samza 0.10). Diagram: Samza processors running on YARN can read HDFS input and write HDFS output, or mix HDFS and Kafka as input and output. The batch job auto-terminates when the input is fully processed.

Reprocessing an Entire Dataset Online. Diagram: updates from the database (Espresso) are captured by Brooklin; one Samza processor handles nearline processing via Kafka, while a second processor replays the Brooklin stream from offset=0 to produce the bootstrap output.

Reprocessing 'Large' Datasets "Offline". Diagram: nearline processing continues as usual, with one Samza processor consuming database (Espresso) updates via Databus and Kafka; for offline processing, a second Samza processor reads the database backup on HDFS and produces the backup output.

Samza Batch Pipelines. Diagram: multiple Samza processors running on YARN are chained into a pipeline, reading HDFS input and writing HDFS output, with Kafka and HDFS serving as the hand-off between stages.

Samza HDFS: Early Performance Results!! Benchmark: count the number of records grouped by <Field>. Data size: 250 GB; number of files: 487. Chart: time (seconds) vs. number of containers for Samza, Map/Reduce, and Spark.

Key Differentiators for Apache Samza: Performance!! Stability. Support for a variety of input sources (batch and streaming). Stream processing as a service AND (coming soon) as an embedded library.

Stream Processing as a Service. Based on YARN: YARN RM high availability, work-preserving RM restart, and support for heterogeneous hardware with node labels (NEW). Easy upgrade of the Samza framework: use the Samza version deployed on the machine instead of packaging it with the application. Disk quotas for local state (e.g. RocksDB state). Samza Management Service (Samza REST): see the next slide.

Samza Management Service (Samza REST) (NEW). Diagram: Samza REST processes run alongside the YARN processes (RMs and NMs) on every node in the YARN cluster, next to the Samza containers, each exposing a /v1/jobs resource. Samza REST exposes the /jobs resource to start, stop, and get the status of jobs, and cleans up stores from dead jobs.

Agenda: Stream Processing State of the Union; Apache Samza: Key Differentiators; Apache Samza Futures.

Coming Soon: Samza as a Library. Diagram: several instances, each embedding the stream processor code and a job coordinator, with one instance elected leader. No YARN dependency; ZK will be used for leader election. Embed stream processing into your bigger application:

StreamProcessor processor = new StreamProcessor(config, "job-name", "job-id");
processor.start();
processor.awaitStart();
...
processor.stop();

Coming Soon: High-Level API and Event Time (SAMZA-914/915). Example: count the number of PageViews by region, every 30 minutes.

@Override
public void init(Collection<SystemMessageStream> sources) {
  sources.forEach(source -> {
    Function<PageView, String> keyExtractor = view -> view.getRegion();
    source.map(msg -> new PageViewMessage(msg))
          .window(Windows.<PageViewMessage, String>intoSessionCounter(
              keyExtractor, WindowType.Tumbling, 30 * 60));
  });
}

Coming Soon: First-Class Support for Pipelines (SAMZA-1041). Example: a shuffle stage feeding a join stage, from input to output.

public class MyPipeline implements PipelineFactory {
  public Pipeline create(Config config) {
    Processor myShuffler = getShuffle(config);
    Processor myJoiner = getJoin(config);
    Stream inStream1 = getStream(config, "inStream1");
    // ... other streams omitted for brevity

    PipelineBuilder builder = new PipelineBuilder();
    return builder.addInputStreams(myShuffler, inStream1)
        .addOutputStreams(myShuffler, intermediateOutStream)
        .addInputStreams(myJoiner, intermediateOutStream, inStream2)
        .addOutputStreams(myJoiner, finalOutStream)
        .build();
  }
}

Future: Miscellaneous. Exactly-once processing; easier auto-scaling even with local state (on-demand standby containers); turnkey disaster recovery for stateful applications (easy restore of the changelog and checkpoints from another datacenter); improved support for batch jobs; SQL over streams; a default dashboard :)

Questions?