Real-time data processing is a subtopic in the big data domain.
ArasuVishnu
19 slides
Aug 10, 2024
About This Presentation
Big data analysis concepts
Slide Content
BIG DATA ANALYSIS
REAL TIME BIG DATA PROCESSING
BY: PONNARASU A 112225 CSE B
INTRODUCTION

Real-time big data processing involves analyzing and acting upon data as it is generated or received. This approach allows for immediate insights and responses, which is crucial in applications such as fraud detection, personalized recommendations, and monitoring systems. By implementing real-time data processing systems, businesses can achieve higher efficiency, better customer experiences, and a stronger competitive edge.
TERMINOLOGIES USED

Latency: The delay between the generation of data and the processing of, or action on, that data.
Throughput: The amount of data processed in a given period of time.
Event Stream: A continuous flow of data events generated by various sources, such as sensors, user interactions, or transactions.
Stream Processing: The real-time processing of data streams to extract insights and trigger actions.
Data Ingestion: The process of collecting and importing data for immediate use or storage.
Scalability: The capability of a system to handle growing amounts of work or data by adding resources.
Fault Tolerance: The ability of a system to continue operating without interruption when one or more of its components fail.
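As a rough illustration of the first two terms, the sketch below computes latency and throughput from a handful of timestamped events. The `ProcessedEvent` record and the sample timestamps are hypothetical, and throughput is measured here in the simplest possible way (events divided by the span of processing times); real systems usually report it over sliding intervals.

```python
from dataclasses import dataclass


@dataclass
class ProcessedEvent:
    """Hypothetical record: when an event was generated and when it was processed."""
    generated_at: float  # seconds
    processed_at: float  # seconds


def average_latency(events):
    """Mean delay between generation and processing, in seconds."""
    return sum(e.processed_at - e.generated_at for e in events) / len(events)


def throughput(events):
    """Events processed per second over the observed processing window."""
    start = min(e.processed_at for e in events)
    end = max(e.processed_at for e in events)
    if end == start:
        raise ValueError("need events processed at different times")
    return len(events) / (end - start)


# Made-up sample: each event is processed half a second after it is generated.
events = [
    ProcessedEvent(generated_at=0.0, processed_at=0.5),
    ProcessedEvent(generated_at=1.0, processed_at=1.5),
    ProcessedEvent(generated_at=2.0, processed_at=2.5),
]
print(average_latency(events))  # 0.5
print(throughput(events))       # 1.5
```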
REAL TIME VS BATCH PROCESSING

Real-Time Processing
Real-time processing involves the continuous input, processing, and output of data. Data is processed as soon as it is generated or received, enabling immediate insights and actions.

Characteristics:
Latency: Very low, often in milliseconds or seconds.
Data Handling: Continuous flow of data, processed in real time.
Use Cases: Fraud detection, real-time recommendations, live monitoring, financial trading.
Technologies: Apache Kafka, Apache Storm, Apache Flink, Spark Streaming.
Batch Processing
Batch processing involves collecting data over a period and processing it in bulk. Data is accumulated, then processed at scheduled intervals, allowing for comprehensive analysis of large data sets.

Characteristics:
Latency: Higher, ranging from minutes to hours or even days.
Data Handling: Processes data in large volumes at specific intervals.
Use Cases: End-of-day reporting, data warehousing, historical data analysis.
Technologies: Hadoop MapReduce, Apache Spark, Apache Hive, Apache Pig.
COMPARISON

Aspect         Real-Time Processing                        Batch Processing
Latency        Milliseconds to seconds                     Minutes to hours or days
Data Handling  Continuous, as data arrives                 Bulk, at scheduled intervals
Use Cases      Immediate insights, live monitoring         Comprehensive analysis, historical data
Advantages     Immediate actions, up-to-date information   Efficient for large volumes, cost-effective
Challenges     Complexity, scalability, fault tolerance    Complexity, scalability, fault tolerance
Technologies   Kafka, Storm, Flink, Spark Streaming        Hadoop MapReduce, Spark, Hive, Pig
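The difference in data handling can be shown with a small sketch. The function names and the sample readings are hypothetical. The batch version waits for the whole data set before computing anything, while the streaming version emits an updated result after each arriving value.

```python
def batch_average(readings):
    """Batch style: wait for the full data set, then compute once."""
    return sum(readings) / len(readings)


def streaming_average(readings):
    """Real-time style: yield an updated average after every reading."""
    count = 0
    total = 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


readings = [10.0, 20.0, 30.0, 40.0]
print(batch_average(readings))            # 25.0
print(list(streaming_average(readings)))  # [10.0, 15.0, 20.0, 25.0]
```

Both approaches reach the same final value. The streaming version has an answer available after the first reading, at the cost of keeping running state between events.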
BASIC TECHNOLOGIES
DATA SOURCES

Sensors and IoT Devices: Devices that collect and transmit data about their environment. Example: IoT devices, environmental sensors.
Social Media: Platforms where users generate a continuous stream of data through posts, comments, likes, and shares. Example: Twitter, Facebook feeds.
Financial Transactions: Data from payment systems, stock exchanges, and financial institutions. Example: payment system records, stock exchange feeds.
Log Files: Continuous records of events or activities in software applications and systems. Example: server logs, application logs.
KEY TECHNOLOGIES

Apache Kafka
A distributed streaming platform that handles real-time data feeds.
Features: High throughput for publishing and subscribing to data streams. Durable storage of streams.

Apache Storm
A distributed real-time computation system for processing data streams.
Features: Fast and reliable processing. Supports various programming languages.
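The publish/subscribe idea with durable storage can be sketched in plain Python. This is not the Kafka API: `TopicLog`, `publish`, and `poll` are hypothetical names for an in-memory stand-in, assuming only that records are kept in an append-only log and that each consumer tracks its own read position (offset).

```python
class TopicLog:
    """Hypothetical in-memory stand-in for a durable, append-only topic."""

    def __init__(self):
        self._records = []
        self._offsets = {}  # consumer name -> next offset to read

    def publish(self, record):
        """Append a record and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer, max_records=10):
        """Return unread records for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch


log = TopicLog()
for payment in ("pay-1", "pay-2", "pay-3"):
    log.publish(payment)

print(log.poll("fraud-check", max_records=2))  # ['pay-1', 'pay-2']
print(log.poll("fraud-check"))                 # ['pay-3']
print(log.poll("reporting"))                   # ['pay-1', 'pay-2', 'pay-3']
```

Because records stay in the log after being read, a second consumer can process the same stream independently from the beginning.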
Apache Flink
A stream processing framework with powerful event-time processing capabilities.
Features: Stateful computations over data streams. Exactly-once processing guarantees.

Spark Streaming (Apache Spark)
A scalable and fault-tolerant stream processing system built on Apache Spark.
Features: Micro-batch processing model. Integration with Spark's batch and machine learning libraries.
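Event-time processing means grouping events by the timestamp they carry rather than by when they arrive. The sketch below is a plain-Python illustration of fixed (tumbling) windows, not Flink or Spark code. The function name and sample events are hypothetical, and it leaves out what real engines add on top, such as watermarks for deciding when a window is complete.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per fixed window, keyed by (window start, key).

    `events` is an iterable of (event_time_seconds, key) pairs. Grouping uses
    the event's own timestamp, so out-of-order arrival does not change the result.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Arrival order differs from event-time order (the 3-second click arrives last).
events = [(1, "click"), (7, "click"), (12, "view"), (3, "click")]
print(tumbling_window_counts(events, window_seconds=5))
# {(0, 'click'): 2, (5, 'click'): 1, (10, 'view'): 1}
```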
Amazon Kinesis
A platform for real-time data streaming and analytics by AWS.
Features: Easily collect, process, and analyze real-time data. Scalable and fully managed.
NoSQL Storage Technologies in Real-Time Data Processing

Apache Cassandra
A highly scalable, distributed NoSQL database designed to handle large amounts of data across many commodity servers.
Features: High availability with no single point of failure. Linear scalability.

MongoDB
A document-oriented NoSQL database that stores data in a JSON-like format.
Features: Flexible schema design. Powerful querying and indexing.
Redis
An in-memory key-value store known for its high performance and support for various data structures.
Features: Extremely low latency. Supports complex data structures (lists, sets, sorted sets).

Amazon DynamoDB
A fully managed NoSQL database service by AWS that provides fast and predictable performance with seamless scalability.
Features: Single-digit millisecond response times. Fully managed and serverless.
Data Processing in Real-Time Big Data Systems

Steps in Real-Time Data Processing:

Data Ingestion: The process of collecting and importing data in real time from various sources.
Data Stream Processing: Continuous processing of data streams to derive insights and trigger actions.
Data Transformation: Converting raw data into a structured format or enriching it with additional information.
Data Storage: Storing processed data in databases or data lakes for further analysis and querying.
Data Analysis and Querying: Analyzing processed data to extract insights and generate reports or dashboards.
Data Visualization: Presenting data insights through interactive dashboards and visualizations.
Event Handling and Alerting: Responding to specific events or conditions detected in the data.
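Several of these steps can be chained in a minimal sketch using Python generators, so each record flows through ingestion, transformation, storage, and alerting one at a time. Everything here is an assumption made for illustration: the JSON record shape, the `is_large` field, the 1000 threshold, and the plain lists standing in for a database and an alerting channel.

```python
import json


def ingest(raw_lines):
    """Ingestion: parse incoming JSON lines one at a time."""
    for line in raw_lines:
        yield json.loads(line)


def transform(records):
    """Transformation: enrich each record with a derived field."""
    for record in records:
        # Hypothetical rule: amounts of 1000 or more count as large.
        yield {**record, "is_large": record["amount"] >= 1000}


def run_pipeline(raw_lines, store, alerts):
    """Storage and alerting: keep every record, flag the large ones."""
    for record in transform(ingest(raw_lines)):
        store.append(record)
        if record["is_large"]:
            alerts.append(f"large transaction from {record['user']}")


raw = ['{"user": "a", "amount": 50}', '{"user": "b", "amount": 2500}']
store, alerts = [], []
run_pipeline(raw, store, alerts)
print(len(store))  # 2
print(alerts)      # ['large transaction from b']
```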
FUTURE TRENDS

Edge Computing: Performing data processing tasks closer to the data source to minimize latency and reduce data transmission costs.
Enhanced Real-Time Analytics with AI and Machine Learning: Integrating artificial intelligence (AI) and machine learning (ML) with real-time data processing to enhance predictive analytics and decision-making.
Quantum Computing: Exploring quantum computing for solving complex problems in real-time data processing and analytics.
Privacy-Preserving Data Processing: Ensuring data privacy with federated learning and advanced encryption.
Serverless Architectures: Implementing serverless computing to manage real-time data processing tasks without managing infrastructure.
CONCLUSION

Real-time big data processing is essential for deriving immediate insights and actions in various industries. Understanding the key concepts, technologies, and best practices helps in designing efficient and effective real-time data systems.