Big Data Analytics_basic introduction of Kafka.pptx

khareamit369 · 32 slides · Jun 26, 2024

About This Presentation

This presentation contains basic knowledge of Kafka.


Slide Content

BDA-Unit-4, Stream Memory: Kafka (16 August 2022)

What Does Big Data Streaming Mean? Big data streaming is a process in which big data is processed quickly in order to extract real-time insights from it. The processing is done on data in motion. Big data streaming is a speed-focused approach in which a continuous stream of data is processed.

The Best Open-Source Data Streaming Software and Tools

Apache Flink: Flink is a distributed processing engine and a scalable data analytics framework. You can use Flink to process data streams at a large scale and to deliver real-time analytical insights about your processed data with your streaming application.

Companies Currently Using Apache Flink:
- Apple (apple.com): General Interconnection Products & Services
- Capital One (capitalone.com): Banking
- Wells Fargo (wellsfargo.com): General Financial Services & Insights
- Walmart (walmart.com): Department Stores & Superstores

Where is Apache Flink used? Alibaba, the world's largest retailer, uses a fork of Flink called Blink to optimize search rankings in real-time. Amazon Kinesis Data Analytics, a fully managed cloud service for stream processing, uses Apache Flink in part to power its Java application capability.

Apache Kafka: Kafka is a distributed streaming platform that:
- Implements a publish-subscribe messaging system. A messaging system lets you send messages between processes, applications, and servers.
- Stores streams of records in a fault-tolerant, durable way.
- Processes streams of records as they occur.
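The publish-subscribe pattern described above can be illustrated with a minimal in-memory sketch. This is plain Python, not the Kafka client API; the `MiniBroker` class and its method names are invented for illustration only:

```python
from collections import defaultdict

class MiniBroker:
    """A toy in-memory stand-in for a message broker (hypothetical, not Kafka)."""
    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of subscriber callbacks
        self.log = defaultdict(list)          # topic -> stored records (durability stand-in)

    def subscribe(self, topic, callback):
        # Register a subscriber for a topic
        self.subscribers[topic].append(callback)

    def publish(self, topic, record):
        # Store the record, then deliver it to every subscriber of the topic
        self.log[topic].append(record)
        for cb in self.subscribers[topic]:
            cb(record)

broker = MiniBroker()
received = []
broker.subscribe("meter-readings", received.append)
broker.publish("meter-readings", {"meter": 42, "kwh": 1.5})
```

Note how the broker both stores the record (the "fault-tolerant, durable" role, trivially simulated here) and forwards it to subscribers as it arrives.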

Kafka is used for building real-time data pipelines and streaming apps. It is horizontally scalable, fault-tolerant, fast, and runs in production in thousands of companies. Originally developed at LinkedIn, it was later open-sourced through Apache in 2011.

Smart Meter Data Processing and Customer Billing: Every business should be interested in improving customer service to boost customer satisfaction, and energy companies are no exception to this fundamental axiom. Energy providers are striving to improve payment and billing operations, provide quality service, and get rid of delays.

New smart meters are being introduced into the industry. They allow customers to closely monitor their energy usage and the cost associated with it, and they enable energy firms to automate the billing process. Apache Kafka and the Streams API help perform real-time aggregations over different types of data streams, which allows companies to deal effectively with an increased volume of meter readings.
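The kind of per-meter aggregation described above can be sketched in plain Python. This is only an illustration of the logic, not the Kafka Streams API, and the (meter_id, kwh) reading format is an assumption:

```python
from collections import defaultdict

def aggregate_readings(readings):
    """Sum energy usage per meter from a stream of (meter_id, kwh) readings."""
    totals = defaultdict(float)
    for meter_id, kwh in readings:
        totals[meter_id] += kwh
    return dict(totals)

# A small batch standing in for a stream of meter readings
stream = [("m1", 0.5), ("m2", 1.25), ("m1", 0.25)]
totals = aggregate_readings(stream)
```

In a real deployment this aggregation would run continuously over the incoming stream (for example in a windowed Kafka Streams topology), feeding the billing process with up-to-date per-meter totals.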

This means that with Apache Kafka, transactions can be tracked in real-time and immediate action can be taken with regard to communication services, prepaid and postpaid services, and billing.

Apache Spark: As per Apache, "Apache Spark is a unified analytics engine for large-scale data processing". Apache Spark was developed by a team at UC Berkeley in 2009. Since then, Apache Spark has seen a very high adoption rate among top-notch technology companies like Google, Facebook, Apple, and Netflix.

The demand has been increasing day by day. According to a marketanalysis.com survey, the Apache Spark market worldwide will grow at a CAGR of 67% between 2019 and 2022. Spark market revenue is growing fast and may reach $4.2 billion by 2022, with a cumulative market valued at $9.2 billion (2019-2022).

Apache Storm: Apache Storm is a distributed real-time big data-processing system. Storm is designed to process vast amounts of data in a fault-tolerant and horizontally scalable manner. It is a streaming data framework capable of very high ingestion rates.

Though Storm is stateless, it manages the distributed environment and cluster state via Apache ZooKeeper. It is simple, and you can execute all kinds of manipulations on real-time data in parallel.

Use-Cases of Apache Storm

Twitter - Twitter uses Apache Storm for its range of "Publisher Analytics" products, which process every tweet and click on the Twitter platform. Apache Storm is deeply integrated with Twitter infrastructure.

NaviSite - NaviSite uses Storm for its event log monitoring/auditing system. Every log generated in the system goes through Storm, which checks each message against a configured set of regular expressions; if there is a match, that message is saved to the database.
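The NaviSite-style flow (match each log message against configured regular expressions and keep only the matches) can be sketched in plain Python. This is an illustration of the filtering logic, not Apache Storm topology code:

```python
import re

def filter_logs(messages, patterns):
    """Return the messages that match any of the configured regular expressions."""
    compiled = [re.compile(p) for p in patterns]
    return [m for m in messages if any(rx.search(m) for rx in compiled)]

logs = ["INFO user login ok", "ERROR disk full", "WARN retry", "ERROR timeout"]
# Keep messages that start with ERROR or mention a timeout
matched = filter_logs(logs, [r"^ERROR", r"timeout"])
```

In the real system, each matched message would then be persisted to the database by a downstream component (a bolt, in Storm terms).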

Wego - Wego is a travel metasearch engine located in Singapore. Travel-related data comes from many sources all over the world at different times. Storm helps Wego search real-time data, resolve concurrency issues, and find the best match for the end user.

Apache Samza: Samza allows you to build stateful applications that process data in real-time from multiple sources, including Apache Kafka. It supports flexible deployment options and can run on YARN or as a standalone library.

Apache Kafka

Apache Kafka is a highly scalable, distributed, open-source platform for creating and processing event streams in real-time and for building event-driven applications. It is used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications; more than 80% of all Fortune 100 companies trust and use Kafka.

How Does It Work?

Publisher: a message producer.
Broker: we install a Kafka server to act as a message broker. It is responsible for receiving messages from producers and storing them in local storage, sitting in the center as a middleman between producer and consumer.
Subscriber: a client application that reads messages from the broker and processes them. We create consumer applications to process the data stream.

Note: Kafka works as an enterprise messaging system; architecturally, it works as a publish-subscribe messaging system.

How Kafka Evolved

Kafka initially started with two things:
- Server software: the Broker
- Client API (a Java library): a) Producer API, b) Consumer API

Producer API: enables an application to publish a stream of records to one or more Kafka topics.
Consumer API: enables an application to subscribe to one or more Kafka topics. It also makes it possible for the application to process the streams of records produced to those topics.

Later, Kafka was inspired to become a full-fledged real-time streaming platform, and to achieve this objective Kafka was augmented with three more components:
- Kafka Connect
- Kafka Streams
- KSQL
The first two are open-source components available in Apache; KSQL is available under a separate license.

From 2011 to 2019, Kafka evolved into a set of five components:
- Kafka Broker: the central server.
- Kafka Client: the producer and consumer API library.
- Kafka Connect: addresses the data integration problem for which Kafka was initially designed.
- Kafka Streams: another library, for creating real-time stream processing applications.
- KSQL: positions Kafka as a real-time database, capturing some market share from databases.

Kafka in the Enterprise Application Ecosystem

Kafka works as the circulatory system of your data ecosystem: just as the circulatory system carries blood, Kafka carries data to the various parts of the ecosystem. Kafka occupies the central place in a real-time data integration system. Data producers can send messages to the Kafka broker as quickly as business events occur, and data consumers can consume messages from the broker as soon as the data arrives.

Producers and consumers are completely decoupled; they do not need tight coupling or a direct connection. They always interact with the Kafka broker through a consistent interface, and the producer is not concerned with who is using the data.

Topics: A topic refers to a particular stream of data, similar to a table in a database (without all the constraints). Any number of topics is possible as long as they have different names, because a topic is identified by its name.
Brokers: A Kafka cluster is composed of multiple servers, called brokers, which serve and receive data. Each broker is assigned an integer ID. Each broker contains certain topic partitions, and connecting to any broker connects you to the entire cluster.
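How a record ends up in one particular topic partition can be illustrated with the common hash-mod scheme. This is a simplified sketch: Kafka's default partitioner actually hashes the key with murmur2, so the function below only demonstrates the idea of deterministic key-to-partition mapping, not Kafka's exact behavior:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition deterministically (simplified; not Kafka's murmur2)."""
    digest = hashlib.md5(key.encode()).digest()
    # Interpret the first 4 bytes of the hash as an integer, then take it modulo
    # the partition count so the result is a valid partition index
    return int.from_bytes(digest[:4], "big") % num_partitions

p = partition_for("meter-42", 6)
same = partition_for("meter-42", 6)  # same key always lands on the same partition
```

The practical consequence is ordering: because a given key always maps to the same partition, all records for that key are consumed in the order they were produced.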

Producers and Consumers: Producers are client applications that publish (write) events to Kafka; consumers are those that subscribe to (read and process) these events.