AWS & Confluent GameDay Ahmed Zamzam Senior Partner Solution Engineer
Agenda 10:30 - 10:35 Welcome & Introductions 10:35 - 10:50 Data Analytics on AWS 10:50 - 11:30 Unlock Value with Confluent on AWS 11:30 - 12:15 Lunch & Networking 12:15 - 14:30 GameDay Workshop 14:30 - 15:00 Wrap Up
AWS & Confluent Team Devadyuti Das Sr Partner SA – AWS Bans Sago Sr SE - Confluent Ahmed Zamzam Sr Partner SA - Confluent
Logistics Team Hash Code Wifi Access: SSID: Guest / Pswd: BrokenWires@@2019 Dietary Requirements for Lunch
Feedback - QR Code Survey Link https://amazonmr.au1.qualtrics.com/jfe/form/SV_2ca9vpakNF46Iui Alternatively use the QR Code
AWS for Real-time Analytics and Data Streaming Devadyuti Das Partner Sales Solutions Architect Amazon Web Services (AWS) #AWS #Confluent #GameDay
Confluent integrations with AWS
Out-of-box integration with popular services, a unified operations and billing experience, certified and validated by AWS. Confluent is a top-5 global ISV for S3 data volume.
Unified billing experience: transact directly through AWS Marketplace with flexible billing schedules to remove friction and burn down existing AWS commits.
Unified security & management: enable identical, secure network setups with other AWS workloads via AWS PrivateLink, and extend existing CloudWatch monitoring to Confluent using our fully managed connector.
Validated service designations: Amazon RDS Ready, AWS Lambda Ready, Amazon Redshift Ready, AWS PrivateLink Ready, AWS Outposts Ready.
AWS analytics pillars: scalable data lakes; purpose-built for performance and cost; serverless and easy to use; unified data access, security, and governance; built-in machine learning.
Modern data strategy on AWS: start anywhere, then Modernize, Unify, and Innovate with services such as Amazon S3, Amazon Redshift, Amazon Aurora, Amazon DynamoDB, Amazon EMR, Amazon OpenSearch Service, and Amazon SageMaker.
The benefits of data lakes: store all your data in open formats; decouple storage from compute; cost-effectively scale storage to exabytes; catalog your data; process data in place; choice of analytical and ML engines.
Amazon Redshift: break through data silos and analyze data across AWS services.
Amazon S3 data lake: keep up to exabytes of data in Amazon S3; Redshift Spectrum queries analyze open, standards-based data formats in place, with no ETL and no data movement; data lake export writes results back to S3.
Federated query: query live data in operational databases and maintain materialized views.
Data sharing: secure and consistent data sharing with no data duplication; data marketplaces offer 1,300+ third-party datasets from 350+ data providers.
Amazon Redshift ML: ML in SQL, with billions of predictions right within your data warehouse; see the sketch below.
BI and analytics apps: connect your favorite BI tool to analyze and visualize your data.
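To make "ML in SQL" concrete, here is a minimal Redshift ML sketch; the table, columns, function name, role ARN, and bucket are hypothetical, not from this deck:

-- Train a model from a SQL query; Redshift ML hands training off to SageMaker
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, churned FROM customer_history)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- Once trained, score rows with plain SQL
SELECT customer_id, predict_churn(age, tenure_months) FROM customers;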
Confluent Cloud on AWS – reference architecture
Unlock value of your data in real-time with Confluent and AWS Ahmed Zamzam Senior Partner Solutions Architect
Agenda
Data streaming with Confluent: rearchitected Kafka, together with the features you need to rapidly deploy production use cases
Introduction to ksqlDB: ksqlDB 101
Real-time analytics with Confluent and AWS: building event-streaming applications and real-time analytics pipelines using Confluent and AWS, including using Lambda and ksqlDB together
Serverless streaming ETL demo
Q/A
Confluent makes real-time data streams a top priority
"We need to shift our thinking from everything at rest to everything in motion."
Real-time event streams (a sale, a shipment, a trade, a customer experience) power rich frontend customer experiences and data-driven operations.
Why real-time data? Data loses value quickly over time. Real-time data drives time-critical, preventive/predictive decisions; within seconds to minutes it is still actionable; within hours to days it is merely reactive, the domain of traditional "batch" business intelligence; after months it is historical. (Chart: value of data to decision-making against its information half-life. Source: "Perishable Insights," Mike Gualtieri, Forrester)
Typical real-time data pipeline
Source: data continuously generated at high velocity from different sources such as IoT devices, application logs, online transactions, etc.
Event streaming: data captured and stored in the order it was received for a set duration of time, and replayable indefinitely.
Stream processing: process, analyse, and act on the data as soon as it is generated, in the order it was received.
Presentation: sink the data to different destinations, most commonly data lakes and/or different databases.
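As a minimal sketch of the event-streaming and stream-processing stages in ksqlDB SQL (topic, stream, and column names are assumptions for illustration):

-- Event streaming: a stream over the raw Kafka topic, replayable from retention
CREATE STREAM clicks_raw (user_id VARCHAR, url VARCHAR, ts BIGINT)
  WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');

-- Stream processing: clean and filter events as they arrive, in order
CREATE STREAM clicks_clean AS
  SELECT user_id, url
  FROM clicks_raw
  WHERE url IS NOT NULL
  EMIT CHANGES;

For the presentation stage, a fully managed sink connector (for example the S3 sink) would then land the cleaned topic in a data lake or database.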
The Confluent product advantage
Cloud-native: a re-imagined Kafka experience for the cloud.
Complete: enable developers to reliably & securely build next-gen apps faster.
Everywhere: be everywhere our customers want to be.
10x Kafka: Confluent Cloud offers a truly fully managed, cloud-native data streaming platform for Apache Kafka, with 10x faster scaling, infinitely more storage, and built-in resilience.
Resiliency: leave Kafka reliability worries behind with a 99.99% uptime SLA and 10x built-in durability.
Storage: never worry about Kafka storage limits again with Infinite Storage that's 10x more scalable and performant.
Elasticity: scale and shrink to handle 0 to GBps+ workloads and peak customer demands, 10x faster and easier.
Available on Confluent: Everywhere
Confluent Platform, the enterprise distribution of Apache Kafka: self-managed software, deployable on any platform, on premises or in the cloud.
Confluent Cloud, Apache Kafka re-engineered for the cloud: a fully managed service.
Everywhere: Cluster Linking, a global central nervous system
Federated streaming across hybrid and multi-cloud: data syndication and replication across and between clouds and on-premises, with self-service APIs, data governance, and visual tooling.
Reliable, real-time data streams between all customer sites, so you can run always-on streaming analytics on the data of the entire enterprise, despite regional or cloud-provider outages.
Operationalizing Kafka on your own is difficult
Kafka is hard in experimentation, and it only gets harder (and riskier) as investment grows across the adoption curve: (1) experimentation/early interest, (2) identify a project, (3) mission-critical, disparate LOBs, (4) mission-critical, connected LOBs, (5) central nervous system.
"We are in the business of selling and renting clothes. We are not in the business of managing an event streaming platform… If we had to manage everything ourselves, I would've had to hire at least 10 more people to keep the systems up and running."
Self-managing Kafka means owning: architecture planning; cluster sizing and provisioning; broker settings; ZooKeeper management; partition placement and data durability; source/sink connector development and maintenance; monitoring and reporting tools setup; software patches and upgrades; security controls and integrations; failover design and planning; mirroring and geo-replication; streaming data governance; load rebalancing and monitoring; expansion planning and execution; utilization optimization and visibility; cluster migrations; infrastructure and performance upgrades and enhancements.
Key challenges: operational burden and resources (manage and scale the platform to support ever-growing demand); security and governance (ensure streaming data is as safe and secure as data-at-rest as Kafka usage scales); real-time connectivity and processing (leverage valuable legacy data to power modern, cloud-based applications and experiences); global availability (maintain high availability across environments with minimal downtime).
Discover, understand, and trust your data streams: where did data come from? Where is it going? Where, when, and how was it transformed? What's the common taxonomy? What is the current state of the stream?
Stream Catalog: increase collaboration and productivity with self-service data discovery.
Stream Lineage: understand complex data relationships and uncover more insights.
Stream Quality: deliver trusted, high-quality event streams to the business.
"Confluent's Stream Governance suite will play a major role in our expanded use of data in motion and creation of a central nervous system for the enterprise. With the self-service capabilities in stream catalog and stream lineage, we'll be able to greatly simplify and accelerate the onboarding of new teams working with our most valuable data."
Instantly connect popular data sources & sinks 120+ prebuilt connectors 100+ Confluent supported 20+ partner supported, Confluent verified
Better together: Confluent and AWS
Confluent has deep AWS service integrations
Together, Confluent and AWS empower endless use cases across many industries:
Retail: Inventory Management; Personalized Promotions; Product Development & Introduction; Sentiment Analysis; Streaming Enterprise Messaging; Systems of Scale for High-Traffic Periods.
Healthcare: Connected Health Records; Data Confidentiality & Accessibility; Dynamic Staff Allocation Optimization; Integrated Treatment; Proactive Patient Care; Real-Time Monitoring.
Finance & Banking: Early-On Fraud Detection; Capital Management; Market Risk Recognition & Investigation; Preventive Regulatory Scanning; Real-Time What-If Analysis; Trade Flow Monitoring.
Transportation: Advanced Navigation; Environmental Factor Processing; Fleet Management; Predictive Maintenance; Threat Detection & Real-Time Response; Traffic Distribution Optimization.
Common in all industries: Data Pipelines; Hybrid Cloud Integration; Microservices; Security and Fraud; Customer 360; Streaming ETL.
Stream Processing on AWS and Confluent
Two types of stream processing: stateless and stateful
ksqlDB at a glance
What is it? ksqlDB is an event-streaming database for working with streams and tables of data. It has all the key features of a modern streaming solution: aggregations, joins, windowing, event-time processing, dual query support (push and pull), exactly-once semantics, out-of-order handling, and user-defined functions.
How does it work? It separates compute from storage and scales elastically in a fault-tolerant manner. It remains highly available during disruption, even in the face of failure of a quorum of its servers.

CREATE TABLE activePromotions AS
  SELECT rideId,
         qualifyPromotion(distanceToDst) AS promotion
  FROM locations
  GROUP BY rideId
  EMIT CHANGES;
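Windowing is listed among the key features but not shown above; here is a minimal sketch of a windowed aggregate, reusing the (hypothetical) locations stream from the example:

-- Count location events per ride in one-minute tumbling windows
CREATE TABLE rides_per_minute AS
  SELECT rideId,
         COUNT(*) AS events
  FROM locations
  WINDOW TUMBLING (SIZE 1 MINUTE)
  GROUP BY rideId
  EMIT CHANGES;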
Pull and push queries in ksqlDB
Push query: tells you all value changes; never exits. Push queries are applicable to all stream and table objects and are indicated with an EMIT CHANGES clause.
Pull query: tells you the point-in-time value; exits immediately. Pull queries currently have some limitations: they are only available against tables, require the object to be materialised (i.e., a TABLE built using an aggregate), and are currently only available for single-key lookup with an optional window range.
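To make the contrast concrete, a minimal sketch against a hypothetical materialized table customer_totals (built from an aggregate; the name and columns are illustrative):

-- Push query: subscribes to every change; runs until cancelled
SELECT customer, total FROM customer_totals EMIT CHANGES;

-- Pull query: single-key lookup of the current value; returns immediately
SELECT customer, total FROM customer_totals WHERE customer = 'jay';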
Stateful aggregations in ksqlDB
Events in a Kafka topic:
{"event_ts": "2020-02-17T15:22:00Z", "person": "robin", "location": "Leeds"}
{"event_ts": "2020-02-17T17:23:00Z", "person": "robin", "location": "London"}
{"event_ts": "2020-02-17T22:23:00Z", "person": "robin", "location": "Wakefield"}
{"event_ts": "2020-02-18T09:00:00Z", "person": "robin", "location": "Leeds"}

CREATE TABLE PERSON_MOVEMENTS AS
  SELECT PERSON,
         COUNT_DISTINCT(LOCATION) AS UNIQUE_LOCATIONS,
         COUNT(*) AS LOCATION_CHANGES
  FROM MOVEMENTS
  GROUP BY PERSON;

The PERSON_MOVEMENTS table is maintained in an internal ksqlDB state store.
Filters

CREATE STREAM high_readings AS
  SELECT sensor, reading
  FROM readings
  WHERE reading > 41
  EMIT CHANGES;

** The above "Create Stream As Select" (CSAS) statement is a persistent query that runs indefinitely and produces the resulting stream backed by a new topic, "high_readings".
Joins

CREATE STREAM enriched_readings AS
  SELECT reading, area, brand_name
  FROM readings
  INNER JOIN brands b ON b.sensor = readings.sensor
  EMIT CHANGES;
Aggregate

CREATE TABLE avg_readings AS
  SELECT sensor,
         AVG(reading) AS avg_reading
  FROM readings
  GROUP BY sensor
  EMIT CHANGES;

** The above "Create Table As Select" (CTAS) statement is a persistent query that runs indefinitely and produces the resulting table backed by a new topic, "avg_readings".
3 modalities of stream processing with Confluent, from maximum flexibility to maximum simplicity:

Kafka clients (most flexible, most code to own):

// Plain consumer loop; stateStore, StateStoreException, RetryUtils, and MAX_RETRIES
// stand in for the state management you would have to build and operate yourself
ConsumerRecords<String, Integer> records = consumer.poll(Duration.ofMillis(100));
Map<String, Integer> counts = new HashMap<>();
for (ConsumerRecord<String, Integer> record : records) {
    String key = record.key();
    int c = counts.getOrDefault(key, 0);
    c += record.value();
    counts.put(key, c);
}
for (Map.Entry<String, Integer> entry : counts.entrySet()) {
    int stateCount;
    int attempts = 0;
    while (attempts++ < MAX_RETRIES) {
        try {
            stateCount = stateStore.getValue(entry.getKey());
            stateStore.setValue(entry.getKey(), entry.getValue() + stateCount);
            break;
        } catch (StateStoreException e) {
            RetryUtils.backoff(attempts);
        }
    }
}

Kafka Streams (state handled by the library):

builder.stream("input-stream", Consumed.with(Serdes.String(), Serdes.String()))
       .groupBy((key, value) -> value)
       .count()
       .toStream()
       .to("counts", Produced.with(Serdes.String(), Serdes.Long()));

ksqlDB (simplest):

SELECT x, COUNT(*) FROM stream GROUP BY x EMIT CHANGES;
2. Stateless stream processing with AWS Lambda
Two integration options: the Confluent Kafka sink connector, or a Lambda event source mapping.
Confluent Kafka sink connector: the sink connector polls the Kafka partitions and invokes your function; Lambda can be invoked synchronously or asynchronously; at-least-once semantics; provides a dead-letter queue (DLQ) for any failed invocations; scales up to a soft maximum of 10 connectors. See the configuration sketch below.
Lambda event source mapping: the Lambda service polls the Kafka partitions and invokes your Lambda function synchronously; starts with one concurrent poller and customer function; the Lambda service checks every 3 minutes whether scaling is needed, scaling from 1 poller up to at most one per partition; records are batched based on BatchSize or BatchWindow.
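As a sketch, the sink-connector option can be configured straight from ksqlDB; the connector class and property names below are assumptions to be checked against the AWS Lambda Sink Connector documentation:

CREATE SINK CONNECTOR lambda_sink WITH (
  'connector.class'            = 'LambdaSink',        -- assumed fully managed connector class
  'topics'                     = 'high_readings',     -- e.g., the stream from the earlier CSAS
  'input.data.format'          = 'JSON',
  'aws.lambda.function.name'   = 'score-readings',    -- hypothetical function name
  'aws.lambda.invocation.type' = 'async'              -- sync or async, as noted above
);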
Enrich transaction events for fraud scoring
A customer transaction arrives (Customer: Jay, Transaction: $10). ksqlDB enriches it with stateful features: the customer's 7-day average spend ($8.50) and the number of transactions in the last 10 minutes (1). The enriched event is then passed through AWS Lambda to Amazon SageMaker for fraud scoring, as sketched below.
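A minimal sketch of how ksqlDB could compute such features, assuming a hypothetical transactions stream (topic, stream, and column names are illustrative):

CREATE STREAM transactions (customer VARCHAR KEY, amount DOUBLE)
  WITH (KAFKA_TOPIC='transactions', VALUE_FORMAT='JSON');

-- Feature 1: average spend per customer over 7-day windows
CREATE TABLE avg_spend_7d AS
  SELECT customer, AVG(amount) AS avg_amount
  FROM transactions
  WINDOW TUMBLING (SIZE 7 DAYS)
  GROUP BY customer;

-- Feature 2: number of transactions per customer in 10-minute windows
CREATE TABLE txn_count_10m AS
  SELECT customer, COUNT(*) AS num_txns
  FROM transactions
  WINDOW TUMBLING (SIZE 10 MINUTES)
  GROUP BY customer;

A Lambda function could then look these features up (for example via pull queries) and call the SageMaker endpoint to score each transaction.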
3. Kinesis Data Analytics
Integrating Kinesis Data Analytics with Confluent unlocks many use cases with AWS AI/ML services.
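Kinesis Data Analytics runs Apache Flink, so a Confluent topic can be declared as a Flink SQL table and queried directly. A minimal sketch (topic, columns, and the bootstrap endpoint are assumptions; Confluent Cloud would additionally require SASL/SSL client properties):

CREATE TABLE orders (
  order_id STRING,
  amount   DOUBLE
) WITH (
  'connector'                    = 'kafka',
  'topic'                        = 'orders',
  'properties.bootstrap.servers' = '<confluent-bootstrap-endpoint>:9092',
  'format'                       = 'json',
  'scan.startup.mode'            = 'earliest-offset'
);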
When to use which?
                 ksqlDB                  Kafka Streams           Kinesis Data Analytics  Lambda
Fully managed    ✅                      —                       ✅                      ✅
Type             Stateful and stateless  Stateful and stateless  Stateful and stateless  Stateless
Fault tolerance  Exactly-once            Exactly-once            Exactly-once            At-least-once
UDF support      ✅ (self-managed)       ✅ (self-managed)       ✅                      ✅
Latency          Fast                    Very fast               Very fast               Fast
Demo: Streaming ETL (Data cleaning) with ksqlDB and Lambda
Unicorn.Rentals New Hire Orientation CTO, Unicorn.Rentals
LARM: Legendary Animal Rental Market (fantasy vs. reality)
Key success drivers (independent analysis by BoozKinsey GartForest, Ltd.)
Leadership (60%): there is nothing more important than the hands at the wheel of the ship.
Partners/3rd-party products (33%): if you don't know how to do it, let someone else do it.
Luck (13%): could have happened to anyone; just a roll of the dice.
Software (6%): the actual software that was developed; not that important.
Current architecture (on-prem): customers and unicorns feed millions of records into an RDBMS, which Marketing, Finance, and Legal query directly: "Who's spending less than $20?" "How many rentals by a customer?" "How old are our customers?"
Current problems: with millions of records in the on-prem RDBMS, those same Marketing, Finance, and Legal queries now hit time-out errors and wrong-format data.
New architecture: customers and unicorns still write millions of records to the on-prem RDBMS, but data streaming and transformation now moves that data into the AWS Cloud, where Marketing, Finance, and Legal consume it.