Introduction to Data Storage and Cloud Computing


About This Presentation

Cloud Computing


Slide Content

Unit II Data Storage and Cloud Computing

Introduction to enterprise data storage: There are two types of digital information, input data and output data. Users provide the input data and computers produce the output data, but a computer's CPU cannot compute anything or produce output without the user's input. Users can enter input data directly into a computer. With data storage, users can save data onto a device, and the data is preserved even if the device is powered down. Instead of manually entering data into a computer, you can also configure the computer to pull data from storage devices. Computers can read input data from storage as needed and can then create and save the output to the same or other storage locations. Organizations and users require data storage to meet today's high-level computational needs such as big data projects, AI, ML and IoT. Large amounts of storage are also required to protect against data loss due to disaster, failure or fraud; to avoid data loss, data storage can also be employed as a backup solution.

Direct Attached Storage (DAS): DAS is usually in the immediate area and directly connected to the computing machine that accesses it, and often only one machine connects to it. For example, the memory card in your phone or a hard disk attached to your laptop is DAS. DAS can provide decent local backup services too, but sharing is limited. DAS devices include floppy disks, optical discs such as compact discs (CDs) and digital video discs (DVDs), hard disk drives (HDDs), flash drives and solid-state drives (SSDs).

Network Based Storage: Network based storage allows more than one computer to access storage over a network, making it better for data sharing and collaboration. Its off-site storage capability also makes it better suited for backups and data protection. The storage can reside anywhere, while the machines accessing it can be somewhere else. For example, when you store your data on Google Drive, it is stored on storage owned and operated by Google; you have no control over the storage itself, you can only use the storage quota that you are eligible for, and you need a network connection to access it. Two common network based storage types are network attached storage (NAS) and the storage area network (SAN).

NAS (Network Attached Storage): NAS devices are storage devices that connect to a network. A NAS is often a single device made up of redundant storage containers or a redundant array of independent disks (RAID). NAS typically has the following characteristics: a single storage device, a file storage system, a TCP/IP Ethernet network, a limited number of users, limited speed, limited expansion options, and lower cost with easy setup. NAS systems are a type of file service device. A NAS is connected to the LAN just like a file server. Rather than containing a full-blown OS, it typically uses a slim microkernel specialized for handling only I/O requests such as NFS (Unix), CIFS/SMB (Windows 2000/NT) and NCP (NetWare). Adding or removing a NAS system is like adding or removing any other network node.

Storage Area Network (SAN): A SAN is a computer network that provides access to consolidated, block-level data storage. SAN storage is a network of multiple devices of various types, including SSD and flash storage, hybrid storage, hybrid cloud storage, backup software and appliances, and cloud storage. A SAN typically has the following characteristics: a network of multiple devices, a block storage system, a Fibre Channel network, optimization for multiple users, faster performance, high expandability, and higher cost with a more complex setup. In a SAN, data is presented from the storage devices to a machine such that the storage looks like it is locally attached; this is achieved through various data virtualization techniques. SAN storage provides high-speed network storage. In some cases SANs can be so large that they span multiple sites, as well as internal data centers and the cloud.

Data storage management: Data storage management refers to the software and processes that improve the performance of data storage resources. It may include network virtualization, replication, mirroring, security, compression, deduplication, traffic analysis, process automation, storage provisioning and memory management. These processes help businesses store more data on existing hardware, speed up data retrieval, prevent data loss, meet data retention requirements and reduce IT expenses. Storage management makes it possible to reassign storage capacity quickly as business needs change. Storage management techniques can be applied to primary, backup or archived storage. Primary storage holds actively or frequently accessed data; backup storage holds copies of primary storage data for use in disaster recovery; and archive storage holds outdated or seldom-used data that must be retained for compliance or business continuity. Storage provisioning is a management technique that assigns storage capacity to servers, computers, virtual machines and other devices; it may use automation to allocate storage space in a network environment. Intelligent storage management uses software policies and algorithms to automate the provisioning and deprovisioning of storage resources, continuously monitoring data utilization and rebalancing data placement without human intervention.

Cloud file system: A file system is an approach to managing and operating files and data on a storage system. There are various file systems, such as NTFS, FAT32 and EXT4, that are commonly used in operating systems today. File systems typically provide mechanisms for reading, writing, modifying, deleting and organizing files in folders and directories. Cloud file systems are specifically designed to be distributed and operated in a cloud-based environment. Files are typically stored in chunks on various storage servers (devices); this distributed design makes the file system fault tolerant and also high performance, thanks to the possible parallelism in file operations. Architectures for cloud file systems fall into two categories: 1) client-server architecture and 2) cluster-based architecture.

Client-server architecture: In a client-server architecture, the file server hosts the file system, which can be mounted (attached) by the clients. One file server can host multiple file shares, and each file share can be mounted and operated on by multiple clients. All file operations are synchronized back to the file server, so that the other clients that have mounted the same file share receive the updates as well. One example of such a file system is the Network File System (NFS). A client-server based file system architecture can be limited by its dependency on the availability of the file server and by the need to synchronize file operations periodically; a small client-side sketch follows the diagram below.

[Diagram: Client-Server Architecture — four clients mount Share 1 and Share 2, both hosted on a single File Server]
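To make the "mount, then use like local files" idea concrete, here is a minimal sketch in Java. It assumes an NFS share has already been mounted by the operating system at a hypothetical path `/mnt/share1` (the mount point and file name are illustrative, not from the slides):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class NfsShareDemo {
    public static void main(String[] args) throws IOException {
        // Hypothetical mount point: the export was mounted beforehand, e.g.
        // with `mount -t nfs fileserver:/share1 /mnt/share1`.
        Path share = Paths.get("/mnt/share1");

        // Once mounted, the remote share is used exactly like a local directory.
        Path report = share.resolve("report.txt");
        Files.writeString(report, "quarterly numbers\n", StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);

        // Another client that has mounted the same share sees this file once
        // the NFS client and server have synchronized the write.
        System.out.println(Files.readString(report, StandardCharsets.UTF_8));
    }
}
```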

Cluster-based architecture: In a cluster-based architecture, a file is broken into smaller parts called chunks, and each chunk is stored on a storage server or device. The chunks are redundantly stored on several servers to withstand faults and provide high availability. This architecture does not depend on a single server to host the file system. The file system is distributed and provides parallelism that significantly improves scale and performance. This architecture is commonly used today in cloud environments, for example in the Google File System and Amazon S3; a toy sketch of chunking and placement follows the diagram below.

[Diagram: Cluster-Based Architecture — a file is split into Chunk 1, Chunk 2 and Chunk 3, each redundantly stored across Storage 1, Storage 2 and Storage 3]
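The following is a toy illustration, not a real cluster file system: it splits a file's bytes into fixed-size chunks and assigns each chunk to several storage nodes, so that losing any single node does not lose data. The chunk size, replica count and node names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkPlacementSketch {

    static final int CHUNK_SIZE = 4;   // tiny, for demonstration only
    static final int REPLICAS = 3;     // each chunk kept on 3 nodes
    static final String[] NODES = {"storage-1", "storage-2", "storage-3", "storage-4"};

    public static void main(String[] args) {
        byte[] file = "hello cluster file systems".getBytes();

        List<byte[]> chunks = split(file);
        for (int i = 0; i < chunks.size(); i++) {
            // Simple round-robin placement; real systems also weigh load,
            // rack awareness and free capacity when placing replicas.
            System.out.printf("chunk %d -> %s%n", i, replicasFor(i));
        }
    }

    // Split the byte array into fixed-size chunks.
    static List<byte[]> split(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        for (int off = 0; off < data.length; off += CHUNK_SIZE) {
            int len = Math.min(CHUNK_SIZE, data.length - off);
            byte[] chunk = new byte[len];
            System.arraycopy(data, off, chunk, 0, len);
            chunks.add(chunk);
        }
        return chunks;
    }

    // Choose REPLICAS distinct nodes for a chunk.
    static List<String> replicasFor(int chunkIndex) {
        List<String> targets = new ArrayList<>();
        for (int r = 0; r < REPLICAS; r++) {
            targets.add(NODES[(chunkIndex + r) % NODES.length]);
        }
        return targets;
    }
}
```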

Google File System: The Google File System (GFS) is a distributed file system (DFS) for data-centric applications, built for robustness, scalability and reliability. GFS can be deployed on commodity servers to support large-scale file applications with high performance and high reliability.

Characteristics and features of GFS: 1. Fault tolerant: if a few disks are corrupted, the data stored on them can still be restored and used. 2. Big data size: the file system can manage several petabytes of data without crashing. 3. High availability: the data is highly available (copied to several disks) and is present across various clusters of disks. 4. Performance: the file system provides very high performance for reads and writes from the disks. 5. Resource sharing: the file system allows sharing disk resources across users. 6. Google cloud services: quite a few Google cloud services, such as Bigtable, are built on GFS, and other Google applications such as Gmail and Maps use it as well.

[Diagram: GFS cluster-based architecture — an application obtains the chunk mapping for a file from the Master Server, then accesses Chunk 1, Chunk 2 and Chunk 3 directly on Chunk Servers 1–3, where the chunks are redundantly stored]

Hadoop Distributed File System (HDFS): Hadoop is an open-source, Java-based framework that manages the storage and processing of large amounts of data for applications. Hadoop comes with a distributed file system called HDFS. In HDFS, data is distributed over several machines and replicated to ensure durability against failure and high availability to parallel applications. It is cost effective because it uses commodity hardware. It involves the concepts of blocks, data nodes and the name node. Where to use HDFS? Very large files: files should be hundreds of megabytes, gigabytes or more. Streaming data access: the time to read the whole data set matters more than the latency in reading the first record; HDFS is built on a write-once, read-many-times pattern. Commodity hardware: it works on low-cost hardware.
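As a rough sketch of the write-once, read-many pattern, the snippet below uses the Hadoop FileSystem Java API to write a file into HDFS and stream it back. The NameNode URI and file path are assumptions for the example; adjust them to your own cluster:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // NameNode address is a placeholder; point it at your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("hdfs://namenode:9000"), conf);

        Path file = new Path("/user/demo/notes.txt");

        // Write once: the data is stored as replicated blocks on data nodes.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("stored in replicated blocks across data nodes");
        }

        // Read many times, typically as a streaming scan of the whole file.
        try (FSDataInputStream in = fs.open(file)) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
        fs.close();
    }
}
```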

HDFS Architecture

Features of HDFS: Highly scalable - HDFS is highly scalable, as it can scale to hundreds of nodes in a single cluster. Replication - due to unfavorable conditions, the node containing the data may be lost; to overcome such problems, HDFS always maintains a copy of the data on a different machine. Fault tolerance - HDFS is highly fault tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active. Distributed data storage - this is one of the most important features of HDFS and makes Hadoop very powerful; data is divided into multiple blocks and stored across nodes. Portable - HDFS is designed so that it can easily be ported from one platform to another.

Goals of HDFS: Handling hardware failure - HDFS contains multiple server machines; if any machine fails, the goal of HDFS is to recover from the failure quickly. Streaming data access - unlike applications built on general-purpose file systems, HDFS applications require streaming access to their data sets. Coherence model - applications that run on HDFS follow the write-once, read-many approach, so a file once created need not be changed; however, it can be appended to and truncated.

Bigtable

Bigtable Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, enabling you to  store terabytes or even petabytes of data . A single value in each row is indexed; this value is known as the row key. Bigtable is a fully managed wide-column and key-value NoSQL database service for large analytical and operational workloads as part of the Google Cloud portfolio.

High-level Architecture of Bigtable

High-level architecture of Bigtable: A Bigtable implementation has three major components. 1. One master server - the master server is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS; in addition, it handles schema changes such as table and column-family creation. 2. Many tablet servers - each tablet server manages a set of tablets (typically between 10 and 1,000 tablets per tablet server); tablet servers can be dynamically added or removed from a cluster to accommodate changes in workload; a tablet server handles read and write requests to the tablets it has loaded and also splits tablets that have grown too large. 3. Chubby - a highly available and persistent distributed lock service that manages leases for resources and stores configuration information; the service runs with five replicas, one of which is elected master to serve requests.

Features and characteristics of Bigtable: Massive scale - Bigtable is designed to store and process massive (petabyte and larger) volumes of data. High performance - Bigtable is designed to provide very high performance, with latencies under 10 milliseconds. Runs on commodity hardware - Bigtable is distributed in nature, which allows it to run in parallel on commodity hardware; you do not require any specialized hardware to run Bigtable. Flexibility - Bigtable schema parameters let users dynamically control whether to serve data out of memory or from disk; data is indexed using row and column names that can be arbitrary strings.
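The slides describe Bigtable conceptually; as a hedged illustration of the row-key/column-family model, here is a small sketch using the Cloud Bigtable Java client. The project, instance, table and column names are placeholders, and the sketch assumes the `metrics` table and `readings` column family already exist:

```java
import com.google.cloud.bigtable.data.v2.BigtableDataClient;
import com.google.cloud.bigtable.data.v2.models.Row;
import com.google.cloud.bigtable.data.v2.models.RowCell;
import com.google.cloud.bigtable.data.v2.models.RowMutation;

public class BigtableRowKeyDemo {
    public static void main(String[] args) throws Exception {
        // Project and instance IDs are placeholders for this sketch.
        try (BigtableDataClient client = BigtableDataClient.create("my-project", "my-instance")) {

            // Rows are addressed by a single indexed row key; values live in
            // columns grouped under column families.
            RowMutation mutation = RowMutation.create("metrics", "sensor#42#2024-04-28")
                    .setCell("readings", "temperature", "21.5")
                    .setCell("readings", "humidity", "40");
            client.mutateRow(mutation);

            // Point lookup by row key.
            Row row = client.readRow("metrics", "sensor#42#2024-04-28");
            for (RowCell cell : row.getCells()) {
                System.out.println(cell.getQualifier().toStringUtf8()
                        + " = " + cell.getValue().toStringUtf8());
            }
        }
    }
}
```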

HBase HBase is an open-source non-relational distributed database modeled after Google's Bigtable and written in Java. It is developed as part of Apache Software Foundation's Apache Hadoop project and runs on top of HDFS. Apache HBase  is known as the Hadoop database. It is a column oriented, distributed and scalable big data store. It is also known as a type of  NoSQL  database that is not a  relational database management system .

HBase applications are also written in Java, built on top of Hadoop, and run on HDFS. HBase is used when you need real-time read/write and random access to big data. HBase is modeled on Google's Bigtable concepts. HBase is a column-oriented, non-relational database management system that runs on top of the Hadoop Distributed File System (HDFS). HBase provides a fault-tolerant way of storing sparse data sets, which are common in many big data use cases.

Characteristics and features of HBase: 1. Highly scalable - HBase is highly scalable and is designed to handle petabytes of data; it can run on thousands of servers in parallel. 2. High performance - HBase provides low-latency reads and writes, allowing fast processing of massive datasets. 3. NoSQL database - HBase is not a traditional relational database; it is a NoSQL database that stores arbitrary key-value pairs. 4. Fault tolerant - HBase splits the data stored in tables across multiple machines in the cluster and is built to withstand individual machine failures. 5. API support - HBase provides Java APIs with which you can perform operations on the data stored in it.

High-level Architecture of HBase

1. HDFS - all HBase data is stored on HDFS. 2. Regions - tables in HBase are divided horizontally by row-key range into regions; a region contains all rows in the table between the region's start key and end key. Regions are assigned to nodes in the cluster called region servers, and these serve data for reads and writes to the clients; a region server can serve around 1,000 regions. 3. Master server (HMaster) - the master server coordinates the cluster and performs administrative operations such as assigning regions to the region servers and balancing the load; it also performs other administrative operations such as creating and deleting tables. 4. Region servers (HRegion) - the region servers perform the data processing; each region server stores a subset of the data of each table, and clients talk to region servers to access the data in HBase. 5. ZooKeeper - ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization and providing group services. ZooKeeper maintains which region servers (HRegion) are alive and available and provides server-failure notification to the master server (HMaster); it supports tasks such as region assignment, establishing communication across the Hadoop cluster, maintaining configuration information, tracking region server and HMaster failures, and maintaining region server information.
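To show the real-time read/write path from the client's point of view, here is a minimal sketch using the standard HBase Java client API. It assumes a table named `users` with a column family `info` already exists; the row key and values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGet {
    public static void main(String[] args) throws Exception {
        // Cluster settings are read from hbase-site.xml on the classpath.
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("users"))) {

            // Real-time write: the client talks to the region server that
            // owns the region containing this row key.
            Put put = new Put(Bytes.toBytes("user#1001"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Asha"));
            table.put(put);

            // Random read by row key.
            Result result = table.get(new Get(Bytes.toBytes("user#1001")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println("name = " + Bytes.toString(name));
        }
    }
}
```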

DynamoDB: DynamoDB is a fully managed NoSQL database service that lets you create database tables that can store and retrieve any amount of data. It automatically manages the data and traffic for tables over multiple servers and maintains performance. It also relieves customers of the burden of operating and scaling a distributed database: hardware provisioning, setup, configuration, replication, software patching, cluster scaling and so on are managed by Amazon. With DynamoDB, you can create database tables that can store and retrieve any amount of data and serve any level of request traffic. It is one of the main components of Amazon.com, one of the biggest e-commerce stores in the world.

Characteristics and features of DynamoDB: Scalable - Amazon DynamoDB is designed to scale; there is no need to worry about predefined limits on the amount of data each table can store. Any amount of data can be stored and retrieved, and DynamoDB spreads the data automatically as the table grows. Fast - Amazon DynamoDB provides high throughput at very low latency; as datasets grow, latencies remain stable thanks to the distributed nature of DynamoDB's data placement and request routing algorithms. Durable and highly available - Amazon DynamoDB replicates data across at least three different data centers; the system operates and serves data even under various failure conditions. Flexible - Amazon DynamoDB allows creation of dynamic tables, i.e. a table can have any number of attributes, including multi-valued attributes. Cost-effective - payment is only for what you use, with no minimum charges; the pricing structure is simple and easy to calculate.
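As a hedged illustration of the "store and retrieve any amount of data by key" model, the sketch below uses the AWS SDK for Java v2. It assumes an existing table named `Orders` with a string partition key `orderId`, and credentials and region coming from the default provider chain; all names are placeholders:

```java
import java.util.Map;
import software.amazon.awssdk.services.dynamodb.DynamoDbClient;
import software.amazon.awssdk.services.dynamodb.model.AttributeValue;
import software.amazon.awssdk.services.dynamodb.model.GetItemRequest;
import software.amazon.awssdk.services.dynamodb.model.GetItemResponse;
import software.amazon.awssdk.services.dynamodb.model.PutItemRequest;

public class DynamoDbItemDemo {
    public static void main(String[] args) {
        // Region and credentials are resolved from the environment.
        try (DynamoDbClient dynamo = DynamoDbClient.create()) {

            // Write an item; beyond the key, attributes are schemaless.
            dynamo.putItem(PutItemRequest.builder()
                    .tableName("Orders")
                    .item(Map.of(
                            "orderId", AttributeValue.builder().s("o-1001").build(),
                            "total", AttributeValue.builder().n("59.90").build()))
                    .build());

            // Read the same item back by its key.
            GetItemResponse response = dynamo.getItem(GetItemRequest.builder()
                    .tableName("Orders")
                    .key(Map.of("orderId", AttributeValue.builder().s("o-1001").build()))
                    .build());
            System.out.println(response.item());
        }
    }
}
```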

[Diagram: Architecture of Dynamo — clients call a client interface; each Dynamo node (Node 1, Node 2, Node 3) contains three components: request coordination, membership and failure detection, and a local persistence engine]

In Dynamo, each storage node has three main software components, all implemented in Java. 1. Request coordination - the coordinator executes read and write requests on behalf of clients by collecting data from one or more nodes (for reads) or storing data at one or more nodes (for writes). Each client request results in the creation of a state machine on the node that received the request. The state machine contains all the logic for identifying the nodes responsible for a key, sending the requests, waiting for the responses, potentially doing retries, processing the replies and packaging the response to the client; each state machine instance handles exactly one client request. 2. Membership and failure detection - failure detection in Dynamo is used to avoid attempts to communicate with unreachable peer nodes, and a purely local failure detection mechanism is used for this purpose. For example, node A may consider node B failed if node B does not respond to node A's messages. Node A quickly discovers that node B is unresponsive when B fails to respond to A's messages; node A then uses alternate nodes to serve requests that map to B's partitions and periodically retries node B to check for its recovery. Decentralized membership uses a simple gossip-style protocol that enables each node in the system to learn about the arrival and departure of other nodes. A toy sketch of this local failure detection appears below.
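The following is only a toy sketch of the purely local, timeout-based failure detection described above, not Dynamo's actual implementation: a node marks a peer as suspected when the peer stops answering, routes around it, and keeps an up-to-date view as responses arrive. The timeout value and node names are invented for the example:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class LocalFailureDetector {

    private static final long TIMEOUT_MS = 2_000;

    // Last time each peer answered one of our messages.
    private final Map<String, Long> lastResponse = new ConcurrentHashMap<>();

    /** Record that a peer answered a request. */
    public void onResponse(String peer) {
        lastResponse.put(peer, System.currentTimeMillis());
    }

    /** A peer is suspected failed if it has not answered recently. */
    public boolean isSuspected(String peer) {
        Long last = lastResponse.get(peer);
        return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
    }

    /** Pick a replica to contact, skipping peers that look unreachable. */
    public String chooseReplica(String preferred, String fallback) {
        return isSuspected(preferred) ? fallback : preferred;
    }

    public static void main(String[] args) {
        LocalFailureDetector detector = new LocalFailureDetector();
        detector.onResponse("node-B");
        // node-B answered recently, so requests for its partitions go to it;
        // if it stops responding, later requests fall through to node-C while
        // node-B continues to be retried periodically.
        System.out.println(detector.chooseReplica("node-B", "node-C"));
    }
}
```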

3. A local persistence engine - Dynamo provides the flexibility to choose the underlying persistent storage based on application requirements. The main reason for designing a pluggable persistence component is to be able to choose the storage engine best suited to an application's access patterns. For instance, some engines handle objects typically on the order of tens of kilobytes, whereas others handle objects of larger sizes. Applications choose Dynamo's local persistence engine based on their object-size distribution.

Google cloud data store: Cloud storage is a cloud computing model that stores data on the internet through a cloud computing provider who manages and operates data storage as a service. In this fast-moving world it has become necessary to store data on cloud storage. The biggest advantage of cloud storage is that we can store any type of data in digital form in the cloud. Another advantage is that we can access the data from anywhere, at any time, on any device. There are many cloud storage providers, such as Google Drive, Dropbox, OneDrive and iCloud. They provide a free service for limited storage, but if you want to store beyond the limit, you have to pay.

Using grids for data storage (grid-oriented storage)

Cloud Storage: Cloud storage is a data deposit model in which digital information such as documents, photos, videos and other forms of media is stored on virtual or cloud servers hosted by third parties. It allows you to transfer data to an offsite storage system and access it whenever needed. Cloud storage is a cloud computing model that allows users to save important data or media files on remote, third-party servers; users can access these servers at any time over the internet. Also known as utility storage, cloud storage is maintained and operated by a cloud-based service provider.
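To make the upload/download flow concrete, here is a hedged sketch using the Google Cloud Storage Java client (one of the object-storage services in the Google portfolio mentioned above). The bucket and object names are placeholders, and credentials are assumed to come from the environment (for example GOOGLE_APPLICATION_CREDENTIALS):

```java
import java.nio.charset.StandardCharsets;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;

public class CloudObjectStorageDemo {
    public static void main(String[] args) {
        // Credentials and project are resolved from the environment.
        Storage storage = StorageOptions.getDefaultInstance().getService();

        BlobId blobId = BlobId.of("my-backup-bucket", "reports/2024/april.txt");
        BlobInfo blobInfo = BlobInfo.newBuilder(blobId).setContentType("text/plain").build();

        // Upload: the provider stores and replicates the object off-site.
        storage.create(blobInfo, "offsite copy of the report".getBytes(StandardCharsets.UTF_8));

        // Download from anywhere with network access and valid credentials.
        byte[] content = storage.readAllBytes(blobId);
        System.out.println(new String(content, StandardCharsets.UTF_8));
    }
}
```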

Data Management in Cloud Storage: Cloud data management is the practice of storing a company's data at an offsite data center that is typically owned and overseen by a vendor who specializes in public cloud infrastructure, such as AWS or Microsoft Azure. Managing data in the cloud provides an automated backup strategy, professional support, and ease of access from any location.

Cloud Provisioning Cloud provisioning means allocating a cloud service provider’s resources to a customer. It is a key feature of cloud computing. It refers to how a client gets cloud services and resources from a provider. The cloud services that customers can subscribe to include infrastructure-as-a-service (IaaS), software-as-a-service (SaaS), and platform-as-a-service (PaaS) in public or private environments .

Types of Cloud Provisioning: Network provisioning - in the telecom industry, network provisioning refers to providing telecommunications services to a client. Server provisioning - setting up a data center's physical infrastructure, installing and configuring the software, and connecting it to middleware, networks and storage. User provisioning - a method of identity management that helps keep a check on access and authorization privileges; provisioning is characterized by artifacts such as equipment, suppliers, etc. Service provisioning - setting up a service and handling its related data.

Data Intensive Technology in Cloud Computing: Data-intensive computing is a class of parallel computing that uses data parallelism to process large volumes of data, typically terabytes or petabytes in size. This large amount of data is generated each day and is referred to as Big Data. Data-intensive computing has some characteristics that differ from other forms of computing: to achieve high performance, it is necessary to minimize the movement of data, which reduces system overhead and increases performance by allowing algorithms to execute on the node where the data resides. A data-intensive computing system uses a machine-independent approach in which the runtime system controls scheduling, execution, load balancing, communications and the movement of programs.

Data Intensive Technology in Cloud Computing: Data-intensive computing focuses heavily on the reliability and availability of data. Traditional large-scale systems may be susceptible to hardware failures, communication errors and software bugs, and data-intensive computing is designed to overcome these challenges. Data-intensive computing is designed for scalability, so it can accommodate any amount of data and meet time-critical requirements. Scalability of the hardware as well as the software architecture is one of its biggest advantages.

Cloud Storage from LANs to WANs - Characteristics: 1. Compute power is elastic when it can perform parallel operations. In general, applications designed to run on top of a shared-nothing architecture are well matched to such an environment. Some cloud computing products, for example Google's App Engine, supply not only a cloud computing infrastructure but also an entire software stack with a constrained API, so that developers are compelled to write programs that can run in a shared-nothing environment and therefore support elastic scaling.

Cloud Storage from LANs to WANs - Characteristics: 2. Data is retained at an unknown host server. In general, letting go of data raises many security issues, so suitable precautions should be taken. The term 'cloud computing' implies that the computing and storage resources are operated from somewhere 'up in the clouds', but the data is in fact physically stored in a specific host country and is subject to local laws and regulations. Since most cloud computing vendors give their clientele little control over where data is stored, customers can do little but accept that, unless the data is encrypted with a key unavailable to the host, it may be accessed by a third party without the customer's knowledge.

Cloud Storage from LANs to WANs - Characteristics: 3. Data is often duplicated over distant locations. Data accessibility and durability are paramount for cloud storage providers, as data tampering can be damaging both to the business and to the organization's reputation. Data accessibility and durability are normally accomplished through hidden replication. Large cloud computing providers with data hubs dispersed throughout the world have the ability to provide high levels of fault tolerance by duplicating data at distant locations across continents. Amazon's S3 cloud storage service replicates data over 'regions' and 'availability zones' so that data and applications can survive even when a whole location fails.

Cloud Storage from LANs to WANs - Distributed Data Storage: Distributed storage systems are evolving from existing data storage practices for the new generation of WWW applications, driven by organizations like Google, Amazon and Yahoo. There are several reasons for distributed storage to be favoured over traditional relational database systems, including scalability, accessibility and performance. The new generation of applications requires processing data on the scale of terabytes and even petabytes; this is accomplished by distributed services, and distributed services mean distributed data.

CouchDB: CouchDB is a document-oriented database server. Couch is an acronym for 'Cluster Of Unreliable Commodity Hardware', emphasizing the distributed environment of the database. CouchDB is designed for document-oriented applications, for example forums, bug tracking, wikis, Internet notes, etc. CouchDB is ad hoc and schema-free with a flat address space. CouchDB aims to satisfy the four pillars of data management by these means: 1. Save - ACID (Atomicity, Consistency, Isolation and Durability) compliant, saving data efficiently. 2. See - easy retrieval, straightforward reporting procedures, full-text search. 3. Secure - strong compartmentalization, ACLs, connections over SSL. 4. Share - distributed operation. A client sees a snapshot of the data and works with it even if it is altered at the same time by a different client. CouchDB has no separate authentication scheme; authentication is built in. Replication is distributed: a server can update the others once it comes back online and its data has changed. If there are conflicts, CouchDB will choose a winning revision and hold that as the latest; users can manually override this winning revision later. Importantly, the conflict resolution yields identical results on all replicas, consistently reconciling the offline revisions.
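Since CouchDB exposes its document store over plain HTTP/JSON, a small sketch with Java's built-in HTTP client is enough to show the save/see workflow. The local server URL, admin credentials, database name and document fields below are assumptions for illustration only:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class CouchDbDocumentDemo {
    public static void main(String[] args) throws Exception {
        // Local CouchDB and admin credentials are placeholders for this sketch.
        String base = "http://localhost:5984";
        String auth = "Basic " + Base64.getEncoder().encodeToString("admin:secret".getBytes());
        HttpClient http = HttpClient.newHttpClient();

        // Create a database: everything in CouchDB is a plain HTTP call.
        http.send(HttpRequest.newBuilder(URI.create(base + "/bugs"))
                        .header("Authorization", auth)
                        .PUT(HttpRequest.BodyPublishers.noBody()).build(),
                HttpResponse.BodyHandlers.ofString());

        // Save: store a schema-free JSON document under an ID of our choosing.
        String doc = "{\"title\":\"login fails\",\"status\":\"open\"}";
        HttpResponse<String> put = http.send(HttpRequest.newBuilder(URI.create(base + "/bugs/bug-1"))
                        .header("Authorization", auth)
                        .header("Content-Type", "application/json")
                        .PUT(HttpRequest.BodyPublishers.ofString(doc)).build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(put.body());

        // See: read the document back; the _rev field in the response is what
        // CouchDB uses for conflict detection during replication.
        HttpResponse<String> get = http.send(HttpRequest.newBuilder(URI.create(base + "/bugs/bug-1"))
                        .header("Authorization", auth).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(get.body());
    }
}
```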