Parallel and distributed systems in software engineering
Language: en
Added: Jun 16, 2024
Slides: 27 pages
Slide Content
Department of Computer Engineering HCE/HHE/HSE311 Parallel & Distributed Computing
Parallel and Distributed Data Processing:
- MapReduce frameworks
- Distributed stream processing
- Characteristics of distributed stream processing
- Benefits of distributed stream processing systems
- Distributed stream processing frameworks
- Parallel databases
- Features of parallel databases
- Examples of parallel databases
- Types of parallel databases
- Parallel data warehouses
- Characteristics of parallel data warehouses
- Examples of parallel data warehouses
- Data partitioning strategies
MapReduce frameworks: MapReduce frameworks combine a programming model with a software framework that simplifies the processing of large-scale data sets in a distributed computing environment. They give developers a high-level abstraction for writing parallelizable, fault-tolerant data processing applications.
MapReduce frameworks are software tools that process large amounts of data by breaking the work into smaller tasks that run in parallel across multiple computers. A simplified explanation:
*Map*:
- The input dataset is split into smaller chunks.
- Each chunk is processed independently, transforming its records as needed.
- The output is a set of intermediate results, usually key-value pairs.
*Reduce*:
- Takes the output of the Map phase, grouped by key, and combines it.
- Aggregates the data, performing calculations or operations as needed.
- Produces the final output that answers a specific question or solves a problem.
Think of it like a census:
- *Map* is like sending census takers to each city to count the population.
- *Reduce* is like collecting the results from each city and calculating the total population.
MapReduce frameworks such as Hadoop MapReduce, and engines such as Spark that build on the same model, make it easier to write programs that handle massive datasets by automatically distributing the work across many computers. This allows faster processing and analysis of big data.
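The census analogy can be sketched in plain Python. This is only an illustration of the Map, shuffle, and Reduce steps on one machine, using a thread pool in place of a cluster; it is not how Hadoop or Spark are implemented, and the function names, city names, and `chunk_size` parameter are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_chunk(chunk):
    """Map: emit one (city, people) pair per record in this chunk."""
    return [(city, people) for city, people in chunk]


def shuffle(mapped_chunks):
    """Group intermediate values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_group(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def run_job(records, chunk_size=2):
    # Split the input into chunks; each chunk stands in for one node's share.
    chunks = [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_chunk, chunks))  # map phase runs in parallel
    groups = shuffle(mapped)
    return dict(reduce_group(key, values) for key, values in groups.items())


if __name__ == "__main__":
    # Hypothetical census returns: (city, household size)
    households = [("Avon", 4), ("Birch", 3), ("Avon", 5),
                  ("Cedar", 2), ("Birch", 6), ("Avon", 1)]
    print(run_job(households))  # {'Avon': 10, 'Birch': 9, 'Cedar': 2}
```

A real framework would also handle data locality, retries of failed tasks, and spilling intermediate data to disk, all of which this sketch leaves out.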
MapReduce frameworks:
Apache Hadoop MapReduce: Apache Hadoop is a popular open-source framework that provides distributed storage and processing for big data.
Apache Spark: Apache Spark is an open-source distributed computing system that offers an enhanced version of the MapReduce programming model.
MapReduce frameworks:
Apache Flink: Apache Flink is another open-source distributed processing framework that supports the MapReduce model along with stream processing capabilities.
Microsoft Azure HDInsight: HDInsight is a cloud-based managed service provided by Microsoft Azure.
Amazon Elastic MapReduce (EMR): EMR is a cloud-based big data processing service provided by Amazon Web Services (AWS).
Distributed stream processing: A computing paradigm in which data streams are processed and analyzed in real time across multiple machines or nodes in a distributed system. Each node independently processes a portion of the data stream, and the results are combined to produce the final output.
Distributed stream processing: Features of distributed stream processing
Data Streams: A data stream is an unbounded sequence of data records generated continuously over time.
Stream Processing: Stream processing involves performing computations on data streams in real time or near real time.
Distributed stream processing: Features of distributed stream processing
Event Time vs. Processing Time: Distributed stream processing systems typically operate in either event time (when an event actually occurred, as recorded in the event itself) or processing time (when the system handles the event).
Fault Tolerance and Scalability: Distributed stream processing systems are designed to be fault-tolerant and scalable, so they can handle high-volume data streams and accommodate varying workloads.
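To make the event-time idea concrete, here is a minimal sketch that counts events in fixed (tumbling) windows keyed by each event's own timestamp, so the result does not depend on arrival order. The function name, the `(event_time, key)` tuple layout, and the sample events are assumptions for illustration; real systems also need mechanisms such as watermarks to decide when a window is complete, which this sketch omits.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, key) using event time, not arrival time.

    events: iterable of (event_time_in_seconds, key) tuples, in any order.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # Align the timestamp to the start of its fixed-size window.
        window_start = (event_time // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical events that arrive out of order (timestamps 12, then 3, then 9).
    arrivals = [(1, "a"), (12, "a"), (3, "b"), (9, "a"), (11, "b")]
    print(tumbling_window_counts(arrivals, window_seconds=10))
```

Bucketing by processing time instead would mean stamping each event with the clock reading at the moment it is handled, so late arrivals would land in later windows.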
Parallel databases: Parallel databases are designed to handle large-scale data processing by distributing the workload across multiple nodes in a cluster, often using a shared-nothing architecture.
Parallel databases: Features of parallel databases
Data Distribution: Parallel databases distribute data across multiple nodes in a cluster, allowing for parallel processing and improved performance.
Parallel Query Execution: Parallel databases execute queries in parallel across multiple nodes, allowing for faster processing of large datasets.
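As a rough sketch of parallel query execution, consider an average over a table whose rows are spread across nodes. Each node computes a partial result over its own rows, and a coordinator merges the partials. Threads stand in for nodes here, and the function names and sample data are hypothetical; this is not any particular database's execution engine.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_aggregate(rows):
    """Runs on one 'node': return (sum, count) for its local rows."""
    return sum(rows), len(rows)


def parallel_average(partitions):
    """Coordinator: fan the work out to every partition, then merge the partials."""
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        partials = list(pool.map(partial_aggregate, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    # An average cannot be merged from per-node averages directly,
    # so each node ships (sum, count) instead.
    return total / count if count else None


if __name__ == "__main__":
    # Hypothetical sales amounts, already distributed over three nodes.
    partitions = [[10, 20], [30], [40, 50, 60]]
    print(parallel_average(partitions))  # 35.0
```

The design point is that each node returns a small mergeable summary rather than its raw rows, which keeps data movement between nodes low.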
Parallel databases: Features of parallel databases
Shared-Nothing Architecture: Parallel databases often adopt a shared-nothing architecture, where each node in the cluster has its own CPU, memory, and storage resources.
Query Optimization: Parallel databases employ query optimization techniques to determine the most efficient execution plans for parallel query processing.
Parallel databases: Features of parallel databases
Data Replication and Fault Tolerance: To ensure fault tolerance, parallel databases may replicate data across multiple nodes.
Concurrency Control: Parallel databases employ concurrency control mechanisms to handle concurrent access to shared data.
Parallel databases: Types of parallel databases
Shared-Nothing Parallel Databases: Shared-nothing parallel databases distribute data across multiple nodes, and each node has its own CPU, memory, and storage.
Examples: Teradata, Greenplum, Amazon Redshift.
Parallel databases: Types of parallel databases
Shared-Disk Parallel Databases: Shared-disk parallel databases have a shared storage system accessible by all nodes in the cluster.
Examples: Oracle Real Application Clusters (RAC), IBM Db2 pureScale.
Parallel databases: Types of parallel databases
Massively Parallel Processing (MPP) Databases: MPP databases distribute data and processing across a large number of nodes in a cluster.
Examples: Netezza, Vertica, Google BigQuery.
Parallel data warehouses: Parallel data warehouses are database systems designed specifically for handling large-scale data warehousing workloads.
Parallel data warehouses: Characteristics of parallel data warehouses
- Massively Parallel Processing (MPP) architecture
- Data distribution and partitioning
- Shared-nothing architecture
- Query optimization and parallel execution
- Columnar storage
- Integration with data integration and analytics tools
Data partitioning strategies: Data partitioning strategies are techniques used to divide a dataset into smaller, more manageable subsets called partitions. They help distribute data across multiple nodes or storage systems.
Data partitioning strategies:
Range Partitioning: In range partitioning, data is divided based on ranges of values of a chosen attribute or key. For example, a dataset of sales transactions could be partitioned by date range or by a numeric range such as sales amount.
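A minimal sketch of range partitioning, assuming the partitions are described by a sorted list of boundary values. The function name and the example boundaries are hypothetical; the same lookup works for numeric amounts and for dates because both are ordered.

```python
from bisect import bisect_right
from datetime import date


def range_partition(value, boundaries):
    """Return the index of the partition whose range contains value.

    boundaries must be sorted. Partition 0 holds value < boundaries[0],
    partition i holds boundaries[i-1] <= value < boundaries[i], and the
    last partition holds value >= boundaries[-1].
    """
    return bisect_right(boundaries, value)


if __name__ == "__main__":
    # Hypothetical sales-amount boundaries: <100, 100 to 499, >=500
    amount_bounds = [100, 500]
    for amount in (50, 100, 499, 500):
        print(amount, "-> partition", range_partition(amount, amount_bounds))

    # Hypothetical quarterly boundaries for a date-partitioned table.
    quarter_bounds = [date(2024, 4, 1), date(2024, 7, 1), date(2024, 10, 1)]
    print(range_partition(date(2024, 5, 15), quarter_bounds))  # 1 (second quarter)
```

Range partitioning keeps neighbouring values together, which helps range scans, but it can produce uneven partitions when the values are skewed.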
Data partitioning strategies:
Hash Partitioning: Hash partitioning applies a hash function to a data attribute or key to determine the partition assignment.
Example: Partitioning a customer database table based on customer ID.
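A small sketch of hash partitioning on a customer ID. It uses SHA-256 from the standard library as the hash function purely so that the assignment is stable across runs (Python's built-in `hash()` is randomized per process for strings); real databases use their own hash functions. The function name and the sample IDs are hypothetical.

```python
import hashlib
from collections import Counter


def hash_partition(key, num_partitions):
    """Map a key to a partition index in [0, num_partitions) via a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then take it modulo the partition count.
    return int.from_bytes(digest[:8], "big") % num_partitions


if __name__ == "__main__":
    # Hypothetical customer IDs spread over 4 partitions.
    customer_ids = [f"CUST-{n:05d}" for n in range(1000)]
    spread = Counter(hash_partition(cid, 4) for cid in customer_ids)
    print(sorted(spread.items()))
```

The same key always lands in the same partition, which makes point lookups by customer ID cheap, while range queries on the key typically have to touch every partition.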
Data partitioning strategies:
List Partitioning: List partitioning explicitly specifies a list of values that determines the partition assignment for each data item.
Example: Partitioning a product inventory table based on product categories.
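List partitioning can be sketched as an explicit mapping from values to partitions. The partition names, category values, and the `default` fallback below are hypothetical, chosen only to mirror the product-category example.

```python
def build_lookup(partition_lists):
    """Invert {partition: [values]} into {value: partition}, rejecting duplicates."""
    lookup = {}
    for partition, values in partition_lists.items():
        for value in values:
            if value in lookup:
                raise ValueError(f"{value!r} is listed in more than one partition")
            lookup[value] = partition
    return lookup


def list_partition(value, lookup, default=None):
    """Return the partition that explicitly lists value, or default if none does."""
    return lookup.get(value, default)


if __name__ == "__main__":
    # Hypothetical product categories assigned to named partitions.
    partition_lists = {
        "p_electronics": ["phones", "laptops"],
        "p_home": ["furniture", "kitchen"],
    }
    lookup = build_lookup(partition_lists)
    print(list_partition("laptops", lookup))                   # p_electronics
    print(list_partition("toys", lookup, default="p_other"))   # p_other
```

Because every value must be listed, a catch-all partition such as the hypothetical `p_other` is a common way to handle values nobody anticipated.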
Data partitioning strategies:
Round-Robin Partitioning: Round-robin partitioning distributes data evenly across partitions by assigning data items to each partition in turn, cycling back to the first partition after the last.
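A minimal sketch of round-robin assignment, assuming items arrive as a simple sequence. The function name and the sample data are hypothetical.

```python
def round_robin_partition(items, num_partitions):
    """Assign items to partitions cyclically: item i goes to partition i mod n."""
    partitions = [[] for _ in range(num_partitions)]
    for index, item in enumerate(items):
        partitions[index % num_partitions].append(item)
    return partitions


if __name__ == "__main__":
    print(round_robin_partition(list(range(7)), 3))
    # [[0, 3, 6], [1, 4], [2, 5]]
```

Partition sizes differ by at most one item, but since placement ignores the data values, finding a specific item generally means searching every partition.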