Topics Covered
An Overview of Data Science
Data and Information
Data Types and Representation
Data Processing Cycle
Data Value Chain (Acquisition, Analysis, Curation, Storage, Usage)
Basic Concepts of Big Data
2.1 Overview of Data Science
What is Data Science?
A multi-disciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Alternatively, it is the field of study that combines programming skills with knowledge of mathematics and statistics to extract meaningful insights from data.
Cont.
Data science is much more than simply analyzing data: it offers a range of roles and requires a range of skills, combining mathematics, statistics, computer programming, and more.
Examples of data:
Your notebook
Prices of items in a supermarket
Files on a computer
Cont.
Data science continues to evolve as one of the most promising and in-demand career paths for skilled professionals.
Being a successful data professional in today's market requires advancing beyond the traditional skills of analyzing large amounts of data, data mining, and programming.
Data Science Experts/Scientists?
Data scientists are analytical experts who utilize their skills in both technology and social science to find trends and manage data.
They use industry knowledge, contextual understanding, and skepticism of existing assumptions to uncover solutions to business challenges.
Skills needed by a data scientist include statistics and linear algebra as well as programming knowledge.
A data scientist must master the full spectrum of the data science life cycle and possess a level of flexibility and understanding to maximize returns.
2.2 Data and Information
Data?
A representation of raw or unprocessed facts, figures, concepts, or instructions in a formalized manner, suitable for communication, interpretation, or processing by humans or electronic machines.
Data by itself is not used for decision making and has no inherent pattern.
Data can be represented with the help of:
Alphabets (A-Z, a-z)
Digits (0-9)
Special characters (+, -, *, /, >, <, =, etc.)
Information?
Interpreted data, created from organized, structured, and processed data, which has some meaningful value for the receiver.
It is organized, processed, structured, and analyzed data.
It is used for decision-making purposes.
Principles of information: processed data must satisfy the following qualities:
Timeliness - information should be available when required.
Accuracy - information should be accurate.
Completeness - information should be complete.
Summary: Data vs. Information
Data                                                          | Information
Described as unprocessed or raw facts and figures             | Described as processed data
Cannot help in decision making                                | Can help in decision making
Raw material that can be organized, structured, and interpreted to create useful information | Interpreted data, created from organized, structured, and processed data in a particular context
Example: a student's test score is data                       | The average score of a class is information derived from the given data
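The test-score example above can be sketched in a few lines of Python (the score values are hypothetical):

```python
# Raw data: individual test scores (hypothetical values).
scores = [72, 85, 90, 64, 78]

# Information: the class average, derived by processing the raw data.
average = sum(scores) / len(scores)
print(average)  # 77.8
```

The list of scores on its own is just data; only after processing (averaging) does it become information usable for a decision.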
2.3 Data Processing Cycle
The data processing cycle is the restructuring or reordering of data by people or machines to increase its usefulness: the set of operations used to transform data into useful information.
Input: data is prepared in some convenient form for processing (e.g. entered into an electronic computer).
Processing: data is changed into a more useful form (e.g. calculating a CGPA).
Output: the results of processing are collected.
Storage: the produced information needs to be stored for future use.
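The cycle above can be illustrated with the CGPA example; this is a minimal sketch, and the course grades and credit hours are hypothetical:

```python
# A minimal sketch of the input -> processing -> output cycle,
# using a CGPA (credit-weighted grade average) calculation.

# Input: (grade point, credit hours) pairs for each course (hypothetical).
courses = [(4.0, 3), (3.5, 4), (3.0, 2)]

# Processing: weight each grade point by its credit hours.
total_points = sum(gp * cr for gp, cr in courses)
total_credits = sum(cr for _, cr in courses)
cgpa = total_points / total_credits

# Output: the computed CGPA (which would then be stored for future use).
print(round(cgpa, 2))  # 3.56
```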
2.4 Data Types and Their Perspective
Common data types include:
Integer (int): used to store whole numbers, mathematically known as integers.
Boolean (bool): stores one of two values, true or false (high or low).
Character (char): used to store a single character (numeric, alphabetic, or a symbol).
Floating-point number (float): used to store real numbers.
Alphanumeric string (string): used to store a combination of characters and numbers.
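The data types listed above can be illustrated in Python (the variable names and values are hypothetical; note that Python has no separate char type, so a one-character string stands in for it):

```python
# Common data types from the list above, illustrated in Python.
count = 42             # int: a whole number
is_valid = True        # bool: one of two values, True or False
grade = "A"            # "char": a single character (a 1-character string in Python)
price = 19.99          # float: a real number
student_id = "ID1234"  # string: a combination of characters and digits

print(type(count).__name__, type(is_valid).__name__,
      type(price).__name__, type(student_id).__name__)
```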
Data Types from a Data Analytics Perspective
Structured data:
Has a pre-defined data model and is straightforward to analyze.
Takes a tabular format, with relationships between different rows and columns.
E.g. Excel files or SQL databases.
Semi-structured data:
Does not conform to the formal structure of a data model, but contains tags or other markers that separate semantic elements and enforce hierarchies of records and fields within the data.
Known as a self-describing structure.
For example: JSON and XML.
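A short example of semi-structured data: the JSON record below is self-describing, with keys acting as the tags that separate semantic elements (the field names and values are hypothetical):

```python
import json

# A semi-structured JSON record: keys ("name", "scores", "dept") are the
# markers that separate and name the semantic elements.
record = '{"name": "Abebe", "scores": [85, 90], "dept": "CS"}'
data = json.loads(record)

print(data["name"])         # Abebe
print(sum(data["scores"]))  # 175
```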
Cont.
Unstructured data:
Does not have a predefined data model and is not organized in a pre-defined manner.
Examples: audio and video files, or NoSQL databases.
Metadata:
Data about data; provides additional information about a specific set of data.
It is one of the most important elements for big data analysis and big data solutions.
E.g. a photograph's metadata can describe when and where the photo was taken.
2.5 Data Value Chain
The data value chain describes the information flow within a big data system as a series of steps needed to generate useful insights from data.
The data value chain includes:
1. Data acquisition: the process of gathering, filtering, and cleaning data before any data analysis can be carried out.
2. Data analysis: making raw data amenable to use in decision making; it involves exploring, transforming, and modeling data and extracting useful information.
Cont.
3. Data curation: the active management of data over its life cycle to ensure it meets the necessary data quality requirements for effective usage. It includes content creation, selection, classification, transformation, validation, and preservation. Data curation is performed by expert curators, who are responsible for improving the accessibility and quality of data.
4. Data storage: storing the processed data.
5. Data usage: using the processed data to make decisions.
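The five steps of the data value chain can be sketched as a tiny pipeline; all names and the sample records below are hypothetical illustrations, not a real implementation:

```python
# A minimal sketch of the data value chain as a sequence of steps.
raw = ["  85 ", "90", "", "abc", "70"]  # hypothetical raw records

# 1. Acquisition: gather, filter, and clean the raw records.
cleaned = [r.strip() for r in raw if r.strip().isdigit()]

# 2. Analysis: transform and model the data to extract information.
scores = [int(r) for r in cleaned]
average = sum(scores) / len(scores)

# 3. Curation: validate and annotate the data for quality and reuse.
curated = {"scores": scores, "validated": True}

# 4. Storage: persist the processed data (an in-memory store here).
store = {"class_a": curated}

# 5. Usage: apply the extracted information to a decision.
decision = "pass" if average >= 50 else "fail"
print(round(average, 1), decision)
```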
Use Cases of Data Science
Application Domains of Data Science
2.6 Basic Concepts of Big Data
Big data is a term for a collection of data sets so large and complex that it becomes difficult to process them using on-hand database management tools or traditional data processing applications.
Characteristics of Big Data
Big data can be characterized by:
1. Volume: a large amount of data; massive datasets.
2. Velocity: data is live-streaming or in motion (rapidity); the speed at which data moves through the system.
3. Variety: data comes in many different forms from diverse sources (structured, unstructured, text).
4. Veracity: can we trust the data? How accurate is it? Uncertainty due to data inconsistency, incompleteness, etc.
Characteristics of Big Data (cont.)
Velocity: the speed at which data are generated; data is live-streaming or in motion (real time).
Veracity: data trustworthiness (the degree to which big data can be trusted); data accuracy (how accurate is it?).
Volume: the amount of data from myriad sources; large amounts of data, up to zettabytes (massive datasets).
Variety: the types of data; data comes in many different forms from diverse sources.
Value: the way in which big data can be used and formatted; to whom the data are accessible; the business value of the data collected; the uses and purposes of the data.
The 4 Vs of Big Data
Five Major Use Cases of Big Data
Big data exploration or investigation
Enhanced customer view
Security/intelligence extension
Operations analysis
Data warehouse augmentation
Clustered Computing
Individual computers are often inadequate for handling big data at most stages.
Clustered computing uses a group of computers connected through a LAN (local area network) that work together and behave like a single system: a computer made up of computers.
It is used to better address the high storage and computational needs of big data.
Cont.
The nodes are connected through software to share loads, and they perform like a single unit.
Clustering is important for maximizing processing power, which improves speed when analyzing big data.
We can search, extract, or allocate data from all nodes by accessing only one node, because the nodes maintain relationships with each other.
Each node holds backups and duplicates of data.
Benefits of Clustered Computing
The benefits of combining the resources of many smaller machines include:
1. Resource pooling: combining available storage space, CPU, or memory for higher-speed operations and transactions.
2. High availability: clusters provide varying levels of fault tolerance and availability guarantees. If one machine fails, the data can be retrieved from another machine, so no data is lost or wasted.
3. Easy scalability: the cluster can be scaled by adding additional machines.
Examples of Scaling Clustered Computing
2.7 Hadoop and Its Ecosystem
Hadoop is an open-source framework intended to make interaction with big data easier. It allows clustering multiple computers to analyze massive data sets in parallel more quickly.
The four key characteristics of Hadoop:
Economical: ordinary computers can be used for data processing.
Reliable: it stores copies of the data on different machines and is resistant to hardware failure.
Scalable: it is easily scalable, both horizontally and vertically.
Flexible: you can store as much structured and unstructured data as you need and decide to use it later.
The Hadoop Ecosystem
The four core components of Hadoop include data management, data access, data processing, and data storage.
Cont.
The Hadoop ecosystem comprises the following components:
HDFS: Hadoop Distributed File System
YARN: Yet Another Resource Negotiator
MapReduce: programming-based data processing
Spark: in-memory data processing
Pig, Hive: query-based data processing services
HBase: NoSQL database
etc.
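To give a feel for the MapReduce component listed above, here is a minimal single-machine sketch of the map-then-reduce idea in plain Python; it is not the real Hadoop API, and the sample lines are hypothetical:

```python
from collections import defaultdict

def map_phase(line):
    # Map: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # Reduce: sum the counts for each key (word).
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big insights", "data value chain"]
pairs = [p for line in lines for p in map_phase(line)]
print(reduce_phase(pairs))
```

In a real cluster, the map calls run in parallel on many nodes and the framework groups the pairs by key before the reduce step; this sketch only shows the programming model.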
The Big Data Life Cycle with Hadoop
Stage 1 - Ingesting: data is ingested or transferred into Hadoop from various sources such as relational databases, other systems, or local files.
Stage 2 - Processing: the data is stored and processed.
Stage 3 - Computing and analyzing: the data is analyzed and processed using open-source frameworks such as Pig, Hive, and Impala.
Stage 4 - Visualizing the results: the analyzed data can be accessed by users.
Review Questions
Discuss the difference between big data and data science.
Briefly discuss the big data life cycle.
List and explain big data application domains with examples.
What is clustered computing? Explain its advantages.