Unit 6 - Compression and Serialization in Hadoop.pptx


About This Presentation

Unit 6 - Compression and Serialization in Hadoop.


Slide Content

MCS7101 - Big Data Analytics Unit Six – Compression and Serialization in Hadoop

Instructor: Tamale Micheal, Assistant Lecturer in Computer Science (PhD Student), Department of Computer Science, Faculty of Computing, Library and Information Sciences, Kabale University

Introduction In Hadoop, compression is an essential technique used in I/O (Input/Output) operations to reduce the size of data, which improves storage efficiency and speeds up data transfer across the network. Since Hadoop deals with very large datasets, compressing data reduces the space required to store it and shortens processing time by lowering disk I/O and network traffic.

Why Compression Is Important in Hadoop: reduced storage costs, improved data transfer speed, faster processing, and optimized resource usage.

Types of Compression in Hadoop: Gzip (GNU zip), Bzip2, Snappy, and LZO (Lempel-Ziv-Oberhumer).

Splittable vs. Non-Splittable Compression Formats Splittability refers to the ability to split a compressed file into chunks for parallel processing, which is a key aspect of how Hadoop processes large datasets in a distributed manner. Splittable compression formats (e.g., Bzip2, LZO with indexing) allow large files to be processed in parallel, making Hadoop's MapReduce framework more efficient. Non-splittable compression formats (e.g., Gzip, Snappy) don't allow splitting, which can be a bottleneck for large files: the entire compressed file has to be handled by a single mapper, reducing parallelism.

Hadoop Compression Codecs In Hadoop, a codec is a compression/decompression algorithm. Hadoop provides several built-in compression codecs. Some popular codecs include: GzipCodec: handles Gzip compression and decompression. BZip2Codec: handles Bzip2 compression and decompression. SnappyCodec: handles Snappy compression and decompression. Lz4Codec: provides a balance between compression speed and ratio, often used in high-performance environments. LzoCodec: used for LZO compression; requires the installation of additional libraries for Hadoop to support it.
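To illustrate how these codecs are used programmatically, here is a minimal sketch (not from the slides) that uses Hadoop's CompressionCodecFactory to resolve a codec from a file extension and write a compressed file to HDFS; the path /data/output.gz and the sample text are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.io.compress.CompressionOutputStream;

public class CodecWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The factory maps file extensions to codecs (.gz -> GzipCodec, .bz2 -> BZip2Codec, ...)
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);
        Path outPath = new Path("/data/output.gz");          // hypothetical HDFS path
        CompressionCodec codec = factory.getCodec(outPath);  // resolves GzipCodec from ".gz"

        FileSystem fs = FileSystem.get(conf);
        // Wrap the raw HDFS output stream with the codec's compressing stream
        try (CompressionOutputStream out = codec.createOutputStream(fs.create(outPath))) {
            out.write("hello hadoop".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```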

Compression in Hadoop MapReduce In Hadoop MapReduce, compression can be applied at different stages: input compression, intermediate (shuffle) compression, and output compression.
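As a rough sketch (assuming a MapReduce v2 job and the standard property names), the snippet below enables intermediate (shuffle) compression with Snappy and final output compression with Gzip; the job name is a placeholder and the mapper/reducer setup is omitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionJobConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Intermediate (shuffle) compression: compress map output before it crosses the network
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-wordcount"); // hypothetical job name

        // Output compression: compress the final job output written to HDFS
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        // ... mapper, reducer, and input/output paths would be configured here ...
    }
}
```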

Trade-offs in Compression 1. Compression Ratio vs. Speed High compression ratio algorithms (e.g., Bzip2) result in smaller files but tend to be slower. They are suitable for scenarios where storage space is a bigger concern than processing speed. Fast compression algorithms (e.g., Snappy, LZO) focus on speed but may not compress as well. They are more suitable for real-time data processing or situations where speed is a priority.

Cont... 2. CPU Overhead Compression and decompression require CPU resources. For large clusters, the CPU overhead may be offset by the benefits of reduced I/O and network usage. However, in smaller environments, the CPU cost may become a bottleneck if compression algorithms are too slow.

Serialization Serialization in Hadoop I/O refers to the process of converting data objects (like records, values, or structures) into a stream of bytes that can be efficiently stored or transmitted over a network. In Hadoop, serialization is critical because it allows data to be written to and read from the Hadoop Distributed File System (HDFS) and enables communication between nodes during distributed processing tasks like MapReduce.

Why Is Serialization Important? Efficient storage, efficient transmission, and interoperability.

Hadoop's Default Serialization Frameworks Hadoop provides several serialization mechanisms, each designed for specific use cases. The most commonly used serialization frameworks in Hadoop are: 1. Writable Interface (Hadoop's Native Serialization) Writable is the default serialization mechanism in Hadoop. Any object in Hadoop that needs to be serialized must implement the Writable interface. It's highly optimized for Hadoop's I/O operations and is lightweight compared to other serialization mechanisms.

Cont... Key Characteristics: Compact: data is serialized in a binary format, resulting in minimal storage overhead. Fast: Writable is designed for high-performance I/O. Customizable: users can implement custom Writable objects for their specific data types. Writable Example: IntWritable, LongWritable, Text, and DoubleWritable are examples of built-in Writable types that correspond to primitive data types.
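To illustrate the "customizable" point, here is a minimal sketch of a hypothetical custom Writable (PageViewWritable is an invented name) that serializes two fields in a fixed order:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical custom Writable holding a timestamp and a page-view count
public class PageViewWritable implements Writable {
    private long timestamp;
    private int views;

    public PageViewWritable() {}  // Writables need a no-arg constructor for deserialization

    public PageViewWritable(long timestamp, int views) {
        this.timestamp = timestamp;
        this.views = views;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);  // serialize fields in a fixed order
        out.writeInt(views);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong(); // deserialize in exactly the same order
        views = in.readInt();
    }
}
```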

Cont... 2. WritableComparable Interface This is an extension of Writable that adds comparison functionality, often used when key objects need to be compared (such as in sorting tasks). Any object that is used as a key in Hadoop MapReduce must implement WritableComparable.
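As a sketch of that comparison contract, the hypothetical composite key below (YearMonthKey is an invented example) implements WritableComparable so it sorts by year and then by month during the shuffle; hashCode/equals are also overridden so partitioning stays consistent.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite MapReduce key: sorts by year, then by month
public class YearMonthKey implements WritableComparable<YearMonthKey> {
    private int year;
    private int month;

    public YearMonthKey() {}

    public YearMonthKey(int year, int month) {
        this.year = year;
        this.month = month;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year);
        out.writeInt(month);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt();
        month = in.readInt();
    }

    @Override
    public int compareTo(YearMonthKey other) {
        int cmp = Integer.compare(year, other.year);   // primary sort on year
        return cmp != 0 ? cmp : Integer.compare(month, other.month); // then month
    }

    @Override
    public int hashCode() {
        return year * 31 + month;  // used by the default hash partitioner
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof YearMonthKey)) return false;
        YearMonthKey k = (YearMonthKey) o;
        return year == k.year && month == k.month;
    }
}
```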

Cont... 3. Text (Writable for String data) In Hadoop, the Text class is a specialized Writable for handling UTF-8 encoded strings. It is more efficient and optimized for Hadoop’s internal data processing compared to Java’s String.
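A few lines illustrating the Text API (a minimal sketch; the string values are arbitrary):

```java
import org.apache.hadoop.io.Text;

public class TextExample {
    public static void main(String[] args) {
        Text greeting = new Text("habari");     // stored internally as UTF-8 bytes
        int byteLength = greeting.getLength();  // length in bytes, not Java chars
        byte[] raw = greeting.getBytes();       // backing buffer, valid up to getLength()

        greeting.set("mwebare");                // Text objects are mutable and reusable
        String asString = greeting.toString();  // decode back to a Java String

        System.out.println(asString + " (" + byteLength + " bytes for the first value)");
    }
}
```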

Third-Party Serialization Frameworks in Hadoop In addition to Hadoop's built-in Writable system, Hadoop can integrate with third-party serialization frameworks that are more flexible or efficient for specific use cases, especially when interoperability with other systems is required. 1. Apache Avro Avro is a popular serialization framework used in Hadoop for working with complex data types. It stores data in a compact binary format and also includes a schema with the data, which makes it self-describing.

Cont... Key Characteristics Schema-based: Avro uses a schema to describe the structure of the data, enabling both serialization and deserialization to be flexible across different languages. Interoperability: Avro is language-neutral, meaning data serialized with Avro can be deserialized in any language that has an Avro library (e.g., Java, Python, C++). Efficient: Avro's binary encoding is compact, often more so than Hadoop's Writable, and its embedded schema makes it suitable for scenarios requiring schema evolution and cross-language communication.
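As an illustrative sketch (the "User" record and its fields are invented for this example), the snippet below parses an Avro schema and serializes a GenericRecord to Avro's compact binary form:

```java
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;

public class AvroSerializeExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical "User" schema embedded as a JSON string
        String schemaJson = "{\"type\":\"record\",\"name\":\"User\","
                + "\"fields\":[{\"name\":\"name\",\"type\":\"string\"},"
                + "{\"name\":\"age\",\"type\":\"int\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        // Build a record that conforms to the schema
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Alice");
        user.put("age", 30);

        // Serialize the record to a compact binary byte array
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        writer.write(user, encoder);
        encoder.flush();
        byte[] avroBytes = out.toByteArray();

        System.out.println("Serialized " + avroBytes.length + " bytes");
    }
}
```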

Cont... 2. Protocol Buffers (Protobuf) Google Protocol Buffers is another serialization framework used in Hadoop for structured data. Like Avro, Protobuf is schema-based and language-neutral. Key Characteristics Compact binary format: data serialized with Protobuf is extremely compact. Schema-based: like Avro, Protobuf uses a schema to describe the structure of the data. Language-neutral: supports multiple programming languages (e.g., Java, C++, Python).

Cont... 3. Thrift Apache Thrift is a serialization and RPC (Remote Procedure Call) framework developed by Facebook. It allows efficient data serialization and is used in Hadoop when cross-language data exchange and high-performance network communication are needed.

Cont... Key Characteristics Schema-based: Thrift, like Avro and Protobuf, relies on schema definitions. RPC support: in addition to serialization, Thrift supports RPC, making it more suitable for distributed applications that need both data serialization and service communication.

Key Considerations for Choosing a Serialization Framework: efficiency, schema evolution, interoperability, and speed vs. size.

Serialization in Hadoop I/O Workflow Serialization plays a critical role throughout the entire Hadoop I/O workflow. Input to MapReduce: when data is read from HDFS or other sources, it is deserialized from a byte stream into objects that MapReduce jobs can process. Intermediate Data (Shuffle Phase): during the shuffle phase in MapReduce, the key-value pairs generated by mappers are serialized before being sent to reducers over the network. Output to HDFS: once the data is processed, it is serialized again before being written back to HDFS for storage.
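The round trip below is a minimal sketch of this serialize-then-deserialize cycle, writing Writable key-value pairs into a SequenceFile and reading them back; the path and the sample data are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/tmp/pageviews.seq");  // hypothetical HDFS path

        // Output side: key-value pairs are serialized (via Writable) as they are appended
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(IntWritable.class))) {
            writer.append(new Text("home"), new IntWritable(42));
            writer.append(new Text("about"), new IntWritable(7));
        }

        // Input side: the byte stream is deserialized back into reusable Writable objects
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();
            IntWritable value = new IntWritable();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value.get());
            }
        }
    }
}
```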

Thank you | Asante | Mwebare Questions?