CLOUD COMPUTING (TCS 074), B.Tech 7th Sem
Unit 5: Case Studies and Advancements
Mr. Saurabh Gupta, Department of CSE
19-01-2024
Unit Content
– Cloud Hadoop
– MapReduce
– VirtualBox
– Google App Engine and its programming environment
– OpenStack
– Federation in the Cloud
– Four Levels of Federation
– Federated Services and Applications
– Future of Federation
Cloud Hadoop
• The most well-known technology used for Big Data is Hadoop.
• It is a large-scale batch data processing system built on a simple programming model.
• The Hadoop software library is a framework released by the Apache Software Foundation in 2011 and written in Java.
Hadoop
Hadoop is open-source software providing:
• a framework
• massive storage
• processing power
Why Hadoop
• Distributed cluster system
• Platform for massively scalable applications
• Enables parallel data processing
Big Data
• Big data is a term for the very large amounts of unstructured and semi-structured data a company creates, often petabytes or exabytes of data.
• That much data would take too much time and cost to load into a relational database for analysis.
• Facebook, for example, has almost 10 billion photos taking up about 1 petabyte of storage.
So what is the problem?
Processing that much data in a relational database is very difficult: it would take too much time and cost too much.
Problems in distributed computing
• The chance of hardware failure is always there.
• The data has to be combined after analysis: results from all the disks must be merged, which is a mess.
Hadoop's parts
Hadoop came to solve all these problems. It has two main parts:
– Hadoop Distributed File System (HDFS)
– Data processing framework: MapReduce
Hadoop Distributed File System
• It ties many small, reasonably priced machines together into a single cost-effective compute cluster.
• Data and application processing are protected against hardware failure: if a node goes down, jobs are automatically redirected to other nodes so the distributed computation does not fail, and HDFS automatically stores multiple copies of all data.
• It provides a simplified programming model that lets users quickly read and write to the distributed file system.
MapReduce
• MapReduce is a programming model for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
• A MAP function processes a key/value pair to generate a set of intermediate key/value pairs.
• A REDUCE function merges all intermediate values associated with the same intermediate key; the reduce operation is typically associative, so partial results can be merged in any order.
MapReduce
• MapReduce is a programming model developed at Google.
• Objective: implement large-scale search and text processing on massively scalable web data stored using BigTable and the GFS distributed file system.
• Designed for processing and generating large volumes of data via massively parallel computations, utilizing tens of thousands of processors at a time.
Hadoop Advantages
• Computing power
• Flexibility
• Fault tolerance
• Low cost
• Scalability
Hadoop Disadvantages
• Integration with existing systems: Hadoop is not optimized for ease of use. Installing it and integrating it with existing databases can be difficult, especially since no commercial software support is provided.
• Administration and ease of use: Hadoop requires knowledge of MapReduce, while most data practitioners use SQL, so significant training is required to operate Hadoop clusters.
• Security: Hadoop lacks a full level of security functionality.
MapReduce
Two phases of MapReduce:
– Map operation
– Reduce operation
Map phase:
– Each mapper reads approximately 1/M of the input from the global file system, using locations given by the master.
– The map operation transforms one set of key-value pairs to another.
– Each mapper writes its computation results in one file per reducer.
– Files are sorted by key and stored on the local file system.
– The master keeps track of the location of these files.
MapReduce
Reduce phase:
– The master informs the reducers where the partial computations are stored in the local files of the respective mappers.
– Reducers make remote procedure call requests to the mappers to fetch the files.
– Each reducer groups the results of the map step by key and performs a function f on the list of values that correspond to each key.
– Final results are written back to the GFS file system.
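The two phases above can be sketched in plain Python with the classic word-count example. This is an illustrative single-machine sketch of the model, not Hadoop's actual Java API; the names map_fn, shuffle, and reduce_fn are our own.

```python
from collections import defaultdict

# Map phase: emit an intermediate (word, 1) pair for every word in a line.
def map_fn(line):
    for word in line.lower().split():
        yield (word, 1)

# Shuffle step: group intermediate values by key (done by the framework,
# between the map and reduce phases).
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

# Reduce phase: merge all values that share the same intermediate key.
def reduce_fn(key, values):
    return (key, sum(values))

lines = ["big data needs hadoop", "hadoop processes big data"]
intermediate = [pair for line in lines for pair in map_fn(line)]
result = dict(reduce_fn(k, v) for k, v in shuffle(intermediate).items())
print(result)  # {'big': 2, 'data': 2, 'needs': 1, 'hadoop': 2, 'processes': 1}
```

In real Hadoop the mappers and reducers run on different cluster nodes, and the shuffle moves files over the network; only the programming model is the same.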
Daily Quiz
1. _______ maps input key/value pairs to a set of intermediate key/value pairs.
a) Mapper b) Reducer c) Both Mapper and Reducer d) None of the mentioned
2. Although the Hadoop framework is implemented in Java, MapReduce applications need not be written in ____________
a) Java b) C c) C#
3. Running a ___________ program involves running mapping tasks on many or all of the nodes in our cluster.
a) MapReduce b) Map c) Reducer d) All of the mentioned
VirtualBox
• VirtualBox is open-source software for virtualizing the x86 computing architecture.
• It acts as a hypervisor, creating a VM (virtual machine) in which the user can run another OS (operating system).
• The operating system in which VirtualBox runs is called the "host" OS. VirtualBox supports Windows, Linux, or macOS as its host OS.
VirtualBox
• VirtualBox was originally developed by Innotek GmbH and released on January 17, 2007 as an open-source software package.
• The company was later purchased by Sun Microsystems.
• On January 27, 2010, Oracle Corporation acquired Sun and took over development of VirtualBox.
Google App Engine
What is App Engine?
• Google's platform to build web applications on the cloud
• Dynamic web server with full support for common web technologies
• Automatic scaling and load balancing
• Transactional datastore model, with SQL and NoSQL options
• Integration with Google Accounts through APIs
Why Google App Engine
• Auto scaling: no need to over-provision
• Affordable scaling: prices better than AWS
• Static files: served from Google's CDN
• No config: no need to configure the OS or servers
• Easy logs: view logs in the web console
• Easy security: Google patches the OS and servers
• Easy deployment: literally 1-click deploy
• Free quota: 99% of apps will pay nothing
Language support
• Python 2.5, 2.7
• Java 5, Java 6
• Go
Google App Engine
Advantages
• Infrastructure for security
• Scalability
• Performance and reliability
• Cost savings
• Platform independence
Disadvantages
• You are at Google's mercy
• Violation of policies
• Forget porting
• It isn't free
OpenStack
OpenStack is a cloud operating system that controls large pools of compute, storage, and networking resources throughout a datacenter, all managed through a dashboard that gives administrators control while empowering their users to provision resources through a web interface.
OpenStack capabilities
▪ Software as a Service (SaaS): browser or thin-client access
▪ Platform as a Service (PaaS): on top of IaaS, e.g. Cloud Foundry
▪ Infrastructure as a Service (IaaS): provision compute, network, and storage; virtual machines (VMs) on demand
▪ Provisioning
▪ Snapshotting
▪ Networking
▪ Storage for VMs and arbitrary files
▪ Multi-tenancy: a user can be associated with multiple projects
OpenStack components
Service: Compute ▪ Project: Nova
Service: Networking ▪ Project: Neutron
Service: Object storage ▪ Project: Swift
Service: Block storage ▪ Project: Cinder
Service: Telemetry ▪ Project: Ceilometer
Service: Dashboard ▪ Project: Horizon
Federation
• Cloud federation is the practice of interconnecting the cloud computing environments of two or more service providers for the purpose of load balancing traffic and accommodating spikes in demand.
• Cloud federation requires one provider to wholesale or rent computing resources to another cloud provider.
Four Levels of Federation
• The conceptual level addresses the challenges in presenting a cloud federation as a favorable solution compared with using services leased from single cloud providers.
• The logical and operational level of a federated cloud identifies and addresses the challenges in devising a framework that enables the aggregation of providers belonging to different administrative domains within the context of a single overlay infrastructure, which is the cloud federation.
• The infrastructural level addresses the technical challenges involved in enabling heterogeneous cloud computing systems to interoperate seamlessly.
Federated Services and Applications
• The federation of cloud resources allows clients to optimize enterprise IT service delivery.
• Federation across different cloud resource pools allows applications to run in the most appropriate infrastructure environments.
Benefits
• The federation of cloud resources allows clients to optimize enterprise IT service delivery.
• It allows a client to choose the best cloud service provider, in terms of flexibility, cost, and availability of services, to meet a particular business or technological need within their organization.
Benefits
• Federation across different cloud resource pools allows applications to run in the most appropriate infrastructure environments.
• It also allows an enterprise to distribute workloads around the globe, move data between disparate networks, and implement innovative security models for user access to cloud resources.
Future of Federation
The federated cloud model is a force for real democratization in the cloud market. It is how businesses will be able to use local cloud providers to connect with customers, partners, and employees anywhere in the world.
Future of Federation It’s how end users will finally get to realize the promise of the cloud. And, it’s how data center operators and other service providers will finally be able to compete with, and beat, today’s so-called global cloud providers 19-01-2024 Mr Saurabh Gupta Unit -5 34
Cloud Data Life Cycle
There are seven phases of the data life cycle:
1. Generation
2. Use
3. Transfer
4. Transformation
5. Storage
6. Archival
7. Destruction
Cloud Data Life Cycle: Generation of the Information
• Ownership: Who in the organization owns the user's data, and how is ownership of the data maintained within the organization?
• Classification: How and when is personally identifiable information classified? Are there any limitations on using cloud computing for specific data cases?
• Governance: Governance must ensure that personally identifiable information is managed and protected throughout its life cycle.
Cloud Data Life Cycle: Use of the Information
• Internal vs. external: Is personally identifiable information used only inside the organization, or is it also used outside it?
• Third party: Is the personally identifiable information shared with third parties (organizations besides the parent company that hold the data)?
• Appropriateness: Is users' personally identifiable information being used only for the purpose for which it is intended?
• Discovery/subpoena: Will the information stored in the cloud enable the organization to comply with legal requirements in legal proceedings?
Cloud Data Life Cycle: Transfer of the Data
• Public vs. private network: Are public networks secure (protected) enough while personally identifiable information is transferred to the cloud?
• Encryption requirements: Is the personally identifiable information encrypted while transmitted over a public network?
• Access control: Appropriate access control measures should be applied to personally identifiable information when it is in the cloud.
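As a small illustration of the encryption requirement, a client can enforce encrypted, certificate-verified transfers using Python's standard ssl module. This is a generic client-side sketch, not tied to any particular cloud provider; the connection snippet in the comment assumes a hypothetical host variable.

```python
import ssl

# Build a client-side TLS context with secure defaults:
# certificate validation and hostname checking are both enabled.
context = ssl.create_default_context()
context.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse legacy protocols

# A socket wrapped with this context encrypts all traffic in transit:
#   with socket.create_connection((host, 443)) as sock:
#       with context.wrap_socket(sock, server_hostname=host) as tls:
#           tls.sendall(data)
print(context.verify_mode == ssl.CERT_REQUIRED)  # True: certificates verified
print(context.check_hostname)                    # True: hostnames checked
```

The point of the sketch is that encryption in transit is a client-side obligation too: the sender must refuse plaintext and unverified endpoints, not just trust the network.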
Cloud Data Life Cycle: Transformation of Data
• Derivation: While data is being transformed in the cloud, it should be protected and usage limitations should be imposed on it.
• Aggregation: Data should be aggregated so that it no longer identifies any individual person.
• Integrity: Is the integrity of personally identifiable information maintained while it is in the cloud?
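A common way to check the integrity point above is to keep a cryptographic digest of the data before it enters the cloud and recompute it after retrieval. A minimal sketch using Python's standard hashlib; the record contents here are made up:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 digest used as a tamper-evidence check."""
    return hashlib.sha256(data).hexdigest()

record = b"name=alice;city=delhi"   # hypothetical PII record
stored = fingerprint(record)        # computed before upload, kept locally

# After the data comes back from the cloud, recompute and compare.
print(fingerprint(b"name=alice;city=delhi") == stored)    # True: intact
print(fingerprint(b"name=mallory;city=delhi") == stored)  # False: tampered
```

Any change to the stored bytes, accidental or malicious, produces a different digest, so the organization can detect that integrity was lost even without seeing how.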
Cloud Data Life Cycle: Storage of Data
• Access control: Appropriate access controls should be applied to personally identifiable information while it is stored in the cloud, so that only individuals with a need to know can access it.
• Structured vs. unstructured: How will the stored data enable the organization to access and manage it in the future?
• Integrity/availability/confidentiality: How are data integrity, availability, and confidentiality maintained in the cloud?
• Encryption: Personally identifiable information should be encrypted while it is in the cloud.
Cloud Data Life Cycle: Archival
• Legal and compliance: Personally identifiable information should have specific requirements that dictate how long the data must be stored and archived.
• Off-site considerations: Does the cloud service provider have the ability to do long-term off-site storage, and does it support the archival requirements?
• Media concerns: Who controls the media, and what is the organization's ability to recover when the media is lost?
• Retention: For how long should the data be retained in the cloud by the cloud service provider?
Cloud Data Life Cycle: Destruction of the Data
• Secure: Does the cloud service provider destroy the personally identifiable information obtained from customers, to avoid a breach of information?
• Complete: Is the personally identifiable information completely destroyed, i.e., erased so that it cannot be recovered?
Authentication and Authorization
1. Authentication credentials can be changed in part as and when required by the user. Authorization permissions cannot be changed by the user; they are granted by the owner of the system, and only the owner can change them.
2. User authentication is visible at the user end; user authorization is not visible at the user end.
3. User authentication is identified with username, password, face recognition, retina scan, fingerprints, etc.; user authorization is carried out through access rights to resources, using pre-defined roles.
4. In the authentication process, users or persons are verified; in the authorization process, users or persons are validated.
5. Authentication is done before the authorization process; authorization is done after authentication.
6. Authentication usually needs the user's login details; authorization needs the user's privilege or security levels.
7. Authentication determines whether the person is a valid user; authorization determines what permissions the user has.
Authentication and Authorization
• Authentication generally transmits information through an ID Token; authorization generally transmits information through an Access Token.
• The OpenID Connect (OIDC) protocol is the authentication protocol generally in charge of the user authentication process; the OAuth 2.0 protocol governs the user authorization process.
Popular authentication techniques:
• Password-based authentication
• Passwordless authentication
• 2FA/MFA (two-factor / multi-factor authentication)
• Single sign-on (SSO)
• Social authentication
Popular authorization techniques:
• Role-based access control (RBAC)
• JSON Web Token authorization
• SAML authorization
• OpenID authorization
• OAuth 2.0 authorization
Example: Employees in a company are required to authenticate through the network before accessing their company email. After an employee successfully authenticates, the system determines what information the employee is allowed to access.
Multi-Factor Authentication
• Multi-factor authentication (MFA) is an account login process that requires multiple methods of authentication from independent categories of credentials to verify a user's identity for a login or other transaction.
• MFA combines two or more independent credentials: what the user knows, such as a password; what the user has, such as a security token; and what the user is, verified using biometric methods.
• MFA is a core component of an identity and access management (IAM) framework.
• The goal of MFA is to create a layered defense that makes it more difficult for an unauthorized person to access a target, such as a physical location, computing device, network, or database. If one factor is compromised or broken, the attacker still has at least one more barrier to breach before successfully breaking into the target.
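The "what the user has" factor is often a one-time code from an authenticator app. A compact sketch of the standard TOTP algorithm (RFC 6238) using only Python's standard library; the shared secret below is a made-up example:

```python
import base64
import hashlib
import hmac
import struct
import time

def totp(secret_b32: str, at: int, step: int = 30, digits: int = 6) -> str:
    """RFC 6238 time-based one-time password (the 'something you have' factor)."""
    key = base64.b32decode(secret_b32)
    counter = struct.pack(">Q", at // step)      # moving factor: 30-second window
    mac = hmac.new(key, counter, hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                      # dynamic truncation (RFC 4226)
    code = (struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF) % 10 ** digits
    return str(code).zfill(digits)

# Server and authenticator app share the secret; both derive the same code,
# so a stolen password alone is not enough to log in.
secret = "JBSWY3DPEHPK3PXP"     # example base32 secret
print(totp(secret, int(time.time())))
```

Because the code depends on the current 30-second window, an intercepted code expires almost immediately, which is what makes it a useful second barrier.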
Security Policy Management
• Security policy management is the process of identifying, implementing, and managing the rules and procedures that all individuals must follow when accessing and using an organization's IT assets and resources.
• The goal of these network security policies is to address security threats, implement strategies to mitigate IT security vulnerabilities, and define how to recover when a system compromise or network intrusion occurs.
• Furthermore, the policies provide guidelines to employees on what to do and what not to do. They also define who gets access to which assets and resources, and what the consequences are for not following the rules.
• It is therefore important for every organization to have documented IT security policies and security policy management to help protect the organization's data and other valuable assets.
Role-Based Access Control
• Role-based access control (RBAC), also known as role-based security, is a mechanism that restricts system access. It involves setting permissions and privileges to enable access for authorized users.
• Most large organizations use RBAC to provide their employees with varying levels of access based on their roles and responsibilities. This protects sensitive data and ensures employees can only access information and perform actions they need to do their jobs.
• For example, an organization may let some individuals create or modify files while giving others viewing permission only.
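The idea can be sketched in a few lines of Python: permissions attach to roles, and users acquire permissions only through the roles assigned to them. The role names, user names, and permission strings here are illustrative, not from any particular product.

```python
# Permissions attach to roles, never directly to users.
ROLE_PERMISSIONS = {
    "viewer": {"file:read"},
    "editor": {"file:read", "file:write"},
    "admin":  {"file:read", "file:write", "user:manage"},
}

# Users are assigned one or more roles.
USER_ROLES = {
    "asha": ["viewer"],
    "ravi": ["editor", "admin"],
}

def is_allowed(user: str, permission: str) -> bool:
    """A user may act only if some assigned role grants the permission."""
    return any(permission in ROLE_PERMISSIONS[role]
               for role in USER_ROLES.get(user, []))

print(is_allowed("asha", "file:read"))    # True: viewers can read
print(is_allowed("asha", "file:write"))   # False: viewers cannot modify
print(is_allowed("ravi", "user:manage"))  # True: via the admin role
```

Changing what a job function may do then means editing one role, not hunting down every user who holds it, which is why RBAC scales to large organizations.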
Monitoring and Auditing
• Monitoring ensures that policies and procedures are in place and are being followed.
• Auditing determines whether the monitoring program is operating as it should, and whether the policies, procedures, and controls adopted are adequate and validated as effective in reducing errors and risks.
• Auditing is focused on compliance. Monitoring measures compliance and success and, when necessary, offers a roadmap for improvement.
Weekly Assignment Questions
Q.1 Illustrate the use of Hadoop.
Q.2 OpenStack is used to deploy IaaS. Elaborate.
Q.3 Describe VirtualBox and its working.
Q.4 Where is MapReduce required, and how does it work?
Q.5 List out the four levels of federation.
MCQs
1. What was Hadoop written in?
a) Java (software platform) b) Perl c) Java (programming language) d) Lua (programming language)
2. Which of the following platforms does Hadoop run on?
a) Bare metal b) Debian c) Cross-platform d) Unix-like
3. Above the file systems comes the ________ engine, which consists of one JobTracker, to which client applications submit MapReduce jobs.
a) MapReduce b) Google c) Functional programming d) Facebook
MCQs
1. A ________ serves as the master, and there is only one NameNode per cluster.
a) DataNode b) NameNode c) Data block d) Replication
2. HDFS works in a __________ fashion.
a) master-worker b) master-slave c) worker/slave d) all of the mentioned
3. Point out the correct statement.
a) Hadoop needs specialized hardware to process the data
b) Hadoop 2.0 allows live stream processing of real-time data
c) In the Hadoop programming framework, output files are divided into lines or records
d) None of the mentioned
MCQs
1. __________ can best be described as a programming model used to develop Hadoop-based applications that can process massive amounts of data.
a) MapReduce b) Mahout c) Oozie d) All of the mentioned
2. Facebook tackles Big Data with _______, based on Hadoop.
a) 'Project Prism' b) 'Prism' c) 'Project Big' d) 'Project Data'
3. __________ has the world's largest Hadoop cluster.
a) Apple b) Datamatics c) Facebook d) None of the mentioned
MCQs
1. Which component serves as a dashboard for users to manage OpenStack compute, storage, and networking services?
a) Designate b) Horizon c) Glance d) Searchlight
2. Swift is OpenStack's object storage system, while Cinder deals with block storage.
a) True b) False
3. What is Google App Engine for?
A. Google App Engine is for detecting malicious apps.
B. Google App Engine is for running web applications on Google's infrastructure.
C. Google App Engine replaces the modern computer.
D. Google App Engine is a system to develop hardware interfaces.
Assignment Questions
1. Define Hadoop technology and the importance of Hadoop.
2. Identify the use of MapReduce and the phases of MapReduce.
3. Describe security policy management.
4. What is role-based access control?
5. Describe virtual machine security.
6. What is multi-factor authentication?
7. Write a short note on security governance and identity and access management.
8. Summarize Software as a Service security and the policies of SaaS.
9. Describe the cloud data life cycle and explain its phases.
10. Differentiate between authorization and authentication.