Apache Kafka
Persistent, reliable, distributed message queue
Shiny new logo!
At LinkedIn:
- 10+ billion writes per day
- 172k messages per second (average)
- 55+ billion messages per day to real-time consumers
Quick aside…
Kafka: first among (pluggable) equals
At LinkedIn: Espresso and Databus
Coming soon? HDFS, ActiveMQ, Amazon SQS
Kafka in four bullet points
- Producers send messages to brokers
- Messages are key-value pairs
- Brokers store messages in topics for consumers
- Consumers pull messages from brokers
A Kafka Topic
Topic: StatusUpdateEvent
Key: user ID of the user who updated the status
Value: timestamp, new status, geolocation, etc.
Example messages (key → value): 534 → “Very sleepy”, 755 → “Car nicked!”, 234 → “The ref’s blind!”, 534 → “Nicked a car!”
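To make the producer side of this model concrete, here is a minimal sketch using Kafka's Java producer client, assuming plain string serialization for key and value; the topic name and key/value layout follow the slide, while the broker address and the example timestamp format are illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class StatusUpdateProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Key = user ID of the updating user, value = timestamp + new status (geolocation etc. omitted)
      producer.send(new ProducerRecord<>("StatusUpdateEvent", "534", "2013-06-12T10:01:00|Very sleepy"));
    }
  }
}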
A Samza job
Input topics: StatusUpdateEvent, NewConnectionEvent, LikeUpdateEvent
Some code: MyStreamTask implements StreamTask { … } (sketched below)
Output topics: NewsUpdatePost, UpdatesPerHourMetric
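The slide elides the task body; a minimal skeleton against Samza's low-level Java StreamTask API might look like this. The topic names come from the slide; the "kafka" system name and the pass-through send are assumptions.

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

public class MyStreamTask implements StreamTask {
  private static final SystemStream NEWS_UPDATE_POST = new SystemStream("kafka", "NewsUpdatePost");

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    Object key = envelope.getKey();
    Object message = envelope.getMessage();
    // ... react to StatusUpdateEvent / NewConnectionEvent / LikeUpdateEvent here ...
    collector.send(new OutgoingMessageEnvelope(NEWS_UPDATE_POST, key, message));
  }
}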
Execution engine: YARN
What we use YARN for
- Distributing our tasks across multiple machines
- Letting us know when a task has died
- Distributing a replacement
- Isolating our tasks from each other
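As a rough idea of how a job gets handed to YARN, here is a hedged sketch of a classic Samza .properties config; the key names follow Samza's documented config format, while the job name, task class package, package path, and container counts are made-up values.

# Run the job on YARN rather than locally
job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
job.name=newsfeed
# The StreamTask class and the topics it consumes
task.class=samza.examples.NewsfeedTask
task.inputs=kafka.StatusUpdateEvent,kafka.NewConnectionEvent
systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
# Where YARN fetches the job package from, and how many containers to spread tasks over
yarn.package.path=hdfs://namenode/path/to/newsfeed-job.tar.gz
yarn.container.count=4
yarn.container.memory.mb=1024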
Co-partitioning of topics
Diagram: a Samza TaskRunner hosts MyStreamTask.process() for Partition 0, consuming Partition 0 of StatusUpdateEvent and Partition 0 of NewConnectionEvent and writing to NewsUpdatePost
An instance of StreamTask is responsible for a specific partition
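Co-partitioning works because the producer picks a partition deterministically from the message key, so equal keys land in the same partition number across topics (given equal partition counts). A toy illustration of the principle, not Kafka's actual partitioner:

public final class PartitioningDemo {
  // Any deterministic hash-mod-N scheme sends equal keys to the same partition number.
  static int partitionFor(String key, int numPartitions) {
    return (key.hashCode() & 0x7fffffff) % numPartitions;
  }

  public static void main(String[] args) {
    int partitions = 8;
    String userId = "534";
    // Same user ID, same partition number, whether the message goes to
    // StatusUpdateEvent or NewConnectionEvent:
    System.out.println(partitionFor(userId, partitions));
    System.out.println(partitionFor(userId, partitions));
  }
}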
Awesome feature: State
- Generic data store interface
- Key-value store out of the box
- More soon? Bloom filter, Lucene, etc.
- Restored by Samza upon task crash
Diagram: Samza TaskRunner (Partition 0) — MyStreamTask.process() reads and writes a local state store
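Getting at the store from a task is a one-liner; a hedged sketch using Samza's InitableTask hook and the out-of-box key-value store (the store name "connections" is an assumption and must match the job config):

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.TaskContext;

public class MyStatefulTask implements InitableTask {
  private KeyValueStore<String, String> store;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    // Samza hands back the store it manages (and restores after a crash) for this partition.
    store = (KeyValueStore<String, String>) context.getStore("connections");
  }
}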
(Pseudo)code snippet: Newsfeed (fleshed out in the Java sketch below)
- Consume StatusUpdateEvent
- Send those updates to all your connections via the NewsUpdatePost topic
- Consume NewConnectionEvent
- Maintain state of connections to know who to send to
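Putting the pieces together, a hedged end-to-end sketch of this pseudocode against Samza's low-level API: the topic names and overall logic are from the slide, while the "kafka" system name, the "connections" store name, the string key/value types, and the comma-separated encoding of a user's connection list are illustrative assumptions.

import org.apache.samza.config.Config;
import org.apache.samza.storage.kv.KeyValueStore;
import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.InitableTask;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskContext;
import org.apache.samza.task.TaskCoordinator;

public class NewsfeedTask implements StreamTask, InitableTask {
  private static final SystemStream NEWS_UPDATE_POST = new SystemStream("kafka", "NewsUpdatePost");

  // userId -> comma-separated IDs of that user's connections (assumed encoding)
  private KeyValueStore<String, String> connections;

  @Override
  @SuppressWarnings("unchecked")
  public void init(Config config, TaskContext context) {
    connections = (KeyValueStore<String, String>) context.getStore("connections");
  }

  @Override
  public void process(IncomingMessageEnvelope envelope,
                      MessageCollector collector,
                      TaskCoordinator coordinator) {
    String stream = envelope.getSystemStreamPartition().getStream();
    String userId = (String) envelope.getKey();
    String message = (String) envelope.getMessage();

    if ("NewConnectionEvent".equals(stream)) {
      // Maintain state: remember the new connection for this user.
      String existing = connections.get(userId);
      connections.put(userId, existing == null ? message : existing + "," + message);
    } else if ("StatusUpdateEvent".equals(stream)) {
      // Fan the status update out to every connection of the updating user.
      String list = connections.get(userId);
      if (list != null) {
        for (String connectionId : list.split(",")) {
          collector.send(new OutgoingMessageEnvelope(NEWS_UPDATE_POST, connectionId, message));
        }
      }
    }
  }
}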