Module 5.pptx: Data processing involves transforming raw data into useful information
About This Presentation
Data processing involves transforming raw data into useful information
Stages of data processing include collection, filtering, sorting, and analysis
Data processing relies on various tools and techniques to ensure accurate, valuable output
Slide Content
Module 5 Dr. K V N LAKSHMI
What is data processing? Data processing involves transforming raw data into useful information. Stages of data processing include collection, filtering, sorting, and analysis. Data processing relies on various tools and techniques to ensure accurate, valuable output.
Data collection The first stage of data collection involves gathering and discovering raw data from various sources, such as sensors, databases, or customer surveys. It is essential to ensure the collected data is accurate, complete, and relevant to the analysis or processing goals. Care must be taken to avoid selection bias, where the method of collecting data inadvertently favors certain outcomes or groups, potentially skewing results and leading to inaccurate conclusions.
Data preparation Once the data is collected, it moves to the data preparation stage. Here, the raw data is cleaned up, organized, and often enriched for further processing. This stage involves checking for errors, removing any bad data (redundant, incomplete, or incorrect), and enhancing the dataset with additional relevant information from external sources, a process known as data enrichment. Data preparation aims to create high-quality, reliable, and comprehensive data for subsequent processing steps.
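The following is a minimal sketch of this preparation step, assuming pandas is available; the column names, the implausible-age rule, and the region reference table are illustrative inventions, not part of the original slides.

```python
import pandas as pd

# Raw data with a duplicate row, a missing age, and an implausible age value.
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, 28, 28, None, 260],
    "region": ["N", "S", "S", "E", "W"],
})

prepared = (
    raw.drop_duplicates()              # remove redundant records
       .dropna(subset=["age"])         # drop incomplete records
       .query("0 < age < 120")         # drop obviously incorrect values
)

# Data enrichment: add attributes from an external reference table.
regions = pd.DataFrame({"region": ["N", "S", "E", "W"],
                        "region_name": ["North", "South", "East", "West"]})
prepared = prepared.merge(regions, on="region", how="left")
print(prepared)
```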
Data input The next stage is data input. In this stage, the clean and prepped data is fed into a processing system, which could be software or an algorithm designed for specific data types or analysis goals. Various methods, such as manual entry, data import from external sources, or automatic data capture, can be used to input data into the processing system.
Data processing In the data processing stage, the input data is transformed, analyzed, and organized to produce relevant information. Several data processing techniques, like filtering, sorting, aggregation, or classification, may be employed to process the data. The choice of methods depends on the desired outcome or insights from the data.
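As an illustration of these techniques, here is a hedged sketch using pandas; the sales table, the filtering threshold, and the demand bins are made up.

```python
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "B", "A", "C", "B", "A"],
    "units":   [10, 4, 7, 12, 3, 9],
})

filtered   = sales[sales["units"] >= 5]                      # filtering
ordered    = filtered.sort_values("units", ascending=False)  # sorting
aggregated = ordered.groupby("product")["units"].sum()       # aggregation

# Simple rule-based classification of demand levels.
sales["demand"] = pd.cut(sales["units"], bins=[0, 5, 10, 100],
                         labels=["low", "medium", "high"])
print(aggregated)
print(sales)
```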
Data output and interpretation The data output and interpretation stage deals with presenting the processed data in an easily digestible format. This could involve generating reports, graphs, or visualizations that simplify complex data patterns and help with decision-making. Furthermore, the output data should be interpreted and analyzed to extract valuable insights and knowledge.
Data storage Finally, in the data storage stage, the processed information is securely stored in databases or data warehouses for future retrieval, analysis, or use. Proper storage ensures data longevity, availability, and accessibility while maintaining data privacy and security.
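A small sketch of this storage step, using Python's built-in sqlite3 together with pandas (both assumed); the database file and table name are hypothetical.

```python
import sqlite3
import pandas as pd

summary = pd.DataFrame({"month": ["Jan", "Feb"], "total_sales": [1200, 950]})

with sqlite3.connect("warehouse.db") as conn:
    # Persist the processed information for future retrieval or analysis.
    summary.to_sql("monthly_summary", conn, if_exists="replace", index=False)
    restored = pd.read_sql("SELECT * FROM monthly_summary", conn)

print(restored)
```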
Batch processing Batch processing involves handling large volumes of data collectively at predetermined times, making it ideal for non-time-sensitive tasks. This method allows organizations to efficiently manage data by aggregating it and processing it during off-peak hours to minimize the impact on daily operations. Example: Financial institutions batch process checks and transactions overnight, updating account balances in one comprehensive sweep to ensure accuracy and efficiency.
Real-time processing Real-time processing is essential for tasks that require immediate handling of data upon receipt, providing instant processing and feedback. This type of processing is crucial for applications where delays cannot be tolerated, ensuring timely decisions and responses. Example: GPS navigation systems rely on real-time processing to offer turn-by-turn directions, adjusting routes based on live traffic and road conditions to ensure the fastest path.
Multiprocessing (parallel processing) Multiprocessing, or parallel processing, involves utilizing multiple processing units or CPUs to handle various tasks simultaneously. This approach allows for more efficient data processing, particularly for complex computations that can be broken down into smaller, concurrent tasks, thereby speeding up overall processing time. Example: Movie production often utilizes multiprocessing for rendering complex 3D animations. By distributing the rendering across multiple computers, the overall project's completion time is significantly reduced, leading to faster production cycles and improved visual quality.
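A minimal parallel-processing sketch with Python's standard multiprocessing module; the sum-of-squares task is only a stand-in for any computation that can be split into independent chunks.

```python
from multiprocessing import Pool

def sum_of_squares(block):
    # CPU-bound work performed independently by each worker process.
    return sum(n * n for n in block)

if __name__ == "__main__":
    # Split the full range into four independent blocks.
    blocks = [range(i, i + 250_000) for i in range(0, 1_000_000, 250_000)]
    with Pool(processes=4) as pool:
        partials = pool.map(sum_of_squares, blocks)   # run blocks in parallel
    print(sum(partials))                              # combine the partial results
```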
Online processing Online processing facilitates the interactive processing of data over a network, with continuous input and output for instant responses. It enables systems to handle user requests immediately, making it an essential component of e-commerce and online services. Example: Online banking systems utilize online processing for real-time financial transactions, allowing users to transfer funds, pay bills, and check account balances with immediate updates.
Manual data processing Manual data processing requires human intervention for the input, processing, and output of data, typically without the aid of electronic devices. This labor-intensive method is prone to errors but was common before the advent of computerized systems. Example: Before the widespread use of computers, libraries cataloged books manually, requiring librarians to carefully record each book's details by hand for inventory and retrieval purposes.
Mechanical data processing Mechanical data processing uses machines or equipment to manage and process data tasks, a prevalent method before the digital era. This approach involved using tangible, mechanical devices to input, process, and output data. Example: Voting in the early 20th century often involved mechanical lever machines, where votes were tallied by pulling levers for each choice, simplifying vote counting and reducing the potential for errors.
Electronic data processing Electronic data processing employs computers and digital technology to process, store, and communicate data with efficiency and accuracy. This modern approach to data handling allows for rapid processing speeds, vast storage capabilities, and easy data retrieval. Example: Retailers use electronic data processing at checkouts, where barcode scans instantly update inventory systems and process sales, enhancing checkout speed and inventory management.
Distributed processing Distributed processing involves spreading computational tasks across multiple computers or devices to improve processing speed and reliability. This method leverages the collective power of various systems to handle large-scale processing tasks more efficiently than could be achieved with a single computer. Example: Video streaming services use distributed processing to deliver content efficiently. By storing videos on multiple servers, they ensure smooth playback and quick access for users worldwide.
Cloud computing Cloud computing offers computing resources, such as servers, storage, and databases, over the internet, providing flexibility and scalability. This model enables users to access and utilize computing resources as needed, without the burden of maintaining physical infrastructure. Example: Small businesses leverage cloud computing for data storage and software services, avoiding the need for significant upfront hardware investments and allowing easy scaling as the business grows.
Automatic data processing Automatic data processing uses software to automate routine tasks, reducing the need for manual input and increasing operational efficiency. This method streamlines repetitive processes, minimizes human error, and frees up personnel for more strategic tasks. Example: Automated billing systems in telecommunications automatically calculate and send out monthly charges to customers, streamlining billing operations and reducing errors.
Data preparation Data preparation is the process of cleaning and transforming raw data prior to processing and analysis. It is an important step prior to processing and often involves reformatting data, making corrections to data, and combining datasets to enrich data. Data preparation is often a lengthy undertaking for data engineers or business users, but it is essential as a prerequisite to put data in context in order to turn it into insights and eliminate bias resulting from poor data quality.
Benefits of data preparation in the cloud
Fix errors quickly: Data preparation helps catch errors before processing. After data has been removed from its original source, these errors become more difficult to understand and correct.
Produce top-quality data: Cleaning and reformatting datasets ensures that all data used in analysis will be of high quality.
Make better business decisions: Higher-quality data that can be processed and analyzed more quickly and efficiently leads to more timely, efficient, better-quality business decisions.
Additionally, as data and data processes move to the cloud, data preparation moves with it for even greater benefits, such as:
Superior scalability: Cloud data preparation can grow at the pace of the business. Enterprises do not have to worry about the underlying infrastructure or anticipate how it will evolve.
Future proof: Cloud data preparation upgrades automatically, so new capabilities and problem fixes are available as soon as they are released. This allows organizations to stay ahead of the innovation curve without delays and added costs.
Accelerated data usage and collaboration: Doing data prep in the cloud means it is always on, requires no technical installation, and lets teams collaborate on the work for faster results.
Data preparation steps
Questionnaire checking: The data preparation process begins with finding the right data, which can come from an existing data catalog or from sources added ad hoc. Check whether each questionnaire is acceptable and complete.
Data editing: Data editing is the application of checks to detect missing, invalid, or inconsistent entries, or to flag data records that are potentially in error. No matter what type of data you are working with, certain edits are performed at different stages of data collection and processing to detect errors and omissions.
Data preparation steps Data coding: Converting data into codes, i.e., the process of assigning numerical values to responses originally captured in formats such as numbers, text, audio, or video. The main objective is to facilitate the automatic treatment of data for analytical purposes. Coded data can be analyzed using statistical software tools.
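A small coding sketch with pandas (assumed); the Likert-style responses and the numeric codebook are illustrative.

```python
import pandas as pd

responses = pd.Series(["agree", "disagree", "neutral", "agree", "strongly agree"])

# Codebook assigning a numerical value to each original text response.
codebook = {"strongly disagree": 1, "disagree": 2, "neutral": 3,
            "agree": 4, "strongly agree": 5}

codes = responses.map(codebook)   # coded data, ready for statistical software
print(codes.tolist())             # [4, 2, 3, 4, 5]
```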
Data preparation steps Data classification: Data classification is the practice of organizing and categorizing data elements according to pre-defined criteria. Classification makes data easier to locate and retrieve, and it is instrumental in promoting risk management, security, and regulatory compliance.
Steps for effective data classification:
Understand the current setup: Taking a detailed look at the location of current data and all regulations that pertain to your organization is the best starting point for effectively classifying data. You must know what data you have before you can classify it.
Create a data classification policy: Staying compliant with data protection principles in an organization is nearly impossible without a proper policy, so creating one should be your top priority.
Prioritize and organize data: With a policy and a picture of your current data in place, classify the data by deciding on the best way to tag it based on its sensitivity and privacy.
Data preparation steps: Classification is of two types.
According to attribute, e.g., literacy rate, honesty, beauty, weight, height.
According to class interval, e.g., income, production, age, and sometimes weight and height.
Data preparation steps Tabulation: Tabulation is a method of presenting numeric data in rows and columns in a logical and systematic manner to aid comparison and statistical analysis. It allows for easier comparison by putting relevant data closer together, and it aids in statistical analysis and interpretation.
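A minimal tabulation sketch using pandas.crosstab (assumed); the survey responses are invented for illustration.

```python
import pandas as pd

survey = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F", "M"],
    "usage":  ["high", "low", "high", "low", "high", "high"],
})

# Rows, columns, and marginal totals make comparison across dimensions easy.
table = pd.crosstab(survey["gender"], survey["usage"], margins=True)
print(table)
```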
IMPORTANCE OF TABULATION Information or statistics presented in a table should be divided into its different dimensions, and the grand totals and subtotals should be clearly shown for each dimension, so that the associations between the different dimensions of the tabulated data are easy to understand. The statistics should be arranged systematically, with headings and proper numbering, which helps readers recognize what is relevant to the research. Tabulation condenses the data into a concise form, which helps the reader understand it easily; the same data can also be presented as graphs, charts, flow charts, or diagrams. Data in tabular form presents the numerical figures in an attention-grabbing way, turns difficult data into a simpler form, and as a result makes it easy to categorize the data.
IMPORTANCE OF TABULATION Tabular arrangement is helpful for spotting mistakes. Tables help condense the information and make it easy to examine the contents. Tabulation is an economical way of presenting the data; it minimizes time and in turn lets the researcher carry out the work effectively. Today, building tables with software easily summarizes large, scattered data in a systematic form.
Tabulation
Data preparation steps Graphical representation refers to the use of charts and graphs to visually display, analyze, clarify, and interpret numerical data, functions, and other qualitative structures.
Stem and leaf plot A stem and leaf plot is used to organize data as they are collected. A stem and leaf plot looks something like a bar graph. Each number in the data is broken down into a stem and a leaf, thus the name. Ex: 15, 27, 8, 17, 13, 17, 22, 24, 25, 14, 13, 36, 22, 22, 32, 32, 28, 7. Ex: 72, 85, 89, 93, 88, 109, 115, 97, 102, 113. Ex: 1.2, 2.3, 1.5, 1.6, 1.8, 2.7, 3.2, 3.6, 4.5, 7.8, 7.1, 10.6, 11.5.
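A plain-Python sketch of building the plot for the first example data set above (two-digit values, stem = tens digit, leaf = units digit).

```python
from collections import defaultdict

data = [15, 27, 8, 17, 13, 17, 22, 24, 25, 14, 13, 36, 22, 22, 32, 32, 28, 7]

plot = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)   # split each number into stem and leaf
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(f"{stem} | {' '.join(str(leaf) for leaf in plot[stem])}")
# 0 | 7 8
# 1 | 3 3 4 5 7 7
# 2 | 2 2 2 4 5 7 8
# 3 | 2 2 6
```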
Data preparation steps Data cleaning: the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
Deduplication Deduplication refers to a method of eliminating a dataset's redundant data. In a secure data deduplication process, a deduplication assessment tool identifies extra copies of data and deletes them, so a single instance can then be stored. Data deduplication software analyzes data to identify duplicate byte patterns.
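Real deduplication tools work at the byte or block level as described above; the following is only a simplified row-level analogue with pandas (assumed).

```python
import pandas as pd

records = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com", "c@x.com"],
    "name":  ["Ann", "Bob", "Ann", "Cid"],
})

unique = records.drop_duplicates()   # keep a single instance of each record
print(f"Removed {len(records) - len(unique)} duplicate row(s)")
```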
What is ANOVA? ANOVA, or Analysis of Variance, is a test used to determine differences between research results from three or more unrelated samples or groups. The key word in 'Analysis of Variance' is the last one. 'Variance' represents the degree to which numerical values of a particular variable deviate from its overall mean. You could think of the dispersion of those values plotted on a graph, with the average at the center of that graph. The variance provides a measure of how scattered the data points are from this central value. H0: There is no difference between the group means.
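A hedged one-way ANOVA sketch with scipy.stats.f_oneway (assumed available); the three groups of scores are invented.

```python
from scipy import stats

group_low    = [6.1, 5.8, 6.4, 5.9]
group_medium = [5.2, 5.5, 5.0, 5.4]
group_high   = [4.1, 4.4, 3.9, 4.2]

f_stat, p_value = stats.f_oneway(group_low, group_medium, group_high)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# A small p-value (e.g. below 0.05) rejects H0 that all group means are equal.
```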
The Chi-squared tests What is goodness-of-fit: The term goodness-of-fit refers to a statistical test that determines how well sample data fits a hypothesized distribution for the population. Put simply, it tests whether a sample is skewed or represents the data you would expect to find in the actual population. H0: The observed frequencies follow the expected (hypothesized) distribution.
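A goodness-of-fit sketch with scipy.stats.chisquare (assumed); 120 observed die rolls are tested against a uniform expectation of 20 per face.

```python
from scipy import stats

observed = [18, 22, 19, 25, 16, 20]   # observed counts (sum to 120)
expected = [20, 20, 20, 20, 20, 20]   # expected counts under H0

chi2, p_value = stats.chisquare(f_obs=observed, f_exp=expected)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
# A large p-value means the sample is consistent with the expected distribution.
```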
T-test A t-test is a statistical tool that compares the means of two groups, or the mean of a group to a standard value. It is also known as Student's t-test; the test statistic follows a t-distribution. H0: There is no difference between the two group means.
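An independent two-sample t-test sketch with scipy.stats.ttest_ind (assumed); the two groups of scores are made up.

```python
from scipy import stats

group_a = [72, 75, 78, 71, 74]
group_b = [68, 70, 65, 69, 66]

t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# H0: the two group means are equal; a small p-value rejects it.
```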
One-Sample Proportion Test The One-Sample Proportion Test is used to assess whether a population proportion (P1) is significantly different from a hypothesized value (P0); the alternative hypothesis of inequality is P1 ≠ P0. H0: The population proportion equals the hypothesized value (P1 = P0).
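A one-sample proportion test sketch, assuming the statsmodels package is installed; 58 successes out of 100 trials are tested against a hypothesized P0 of 0.5.

```python
from statsmodels.stats.proportion import proportions_ztest

count, nobs, p0 = 58, 100, 0.5
z_stat, p_value = proportions_ztest(count, nobs, value=p0)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# H0: the population proportion equals P0; a small p-value rejects it.
```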
Correlational test A correlational test, also known as correlation analysis, is a statistical method that measures the strength and direction of the relationship between two or more variables. The results of a correlational test are summarized as a correlation coefficient, which is a number between -1 and +1. The value of the coefficient indicates the strength of the relationship, and the sign indicates the direction
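A correlation sketch with scipy.stats.pearsonr (assumed); daily hours of social media use versus an invented well-being score.

```python
from scipy import stats

hours     = [1, 2, 3, 4, 5, 6]
wellbeing = [8, 7, 7, 5, 4, 3]

r, p_value = stats.pearsonr(hours, wellbeing)
print(f"r = {r:.2f}, p = {p_value:.4f}")
# The magnitude of r gives the strength of the relationship; its sign gives the direction.
```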
Hypothesis: The effect of social media on mental well-being does not significantly vary based on the frequency of its usage.
ANOVA: Average Daily Social Media Usage
                  Sum of Squares    df    Mean Square        F     Sig.
Between Groups            52.737     4         13.184    14.688    .000
Within Groups             82.582    92           .898
Total                    135.320    96
Hypothesis H0: The level of social media addiction does not differ significantly between genders.
Chi-Square Tests
                                 Value    df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                                  (2-sided)     (2-sided)     (1-sided)
Pearson Chi-Square               .029 a    1          .865
Continuity Correction b          .000      1         1.000
Likelihood Ratio                 .029      1          .865
Fisher's Exact Test                                                  1.000          .516
Linear-by-Linear Association     .029      1          .866
N of Valid Cases                   97
a. 0 cells (0.0%) have an expected count less than 5. The minimum expected count is 16.60.
b. Computed only for a 2x2 table
Hypothesis: There is no relation between the type of accommodation and spending more time on social media than intended.