Data Analytics with Hadoop in the project.pptx

ssuser71a2461 7 views 73 slides Oct 17, 2025

About This Presentation

Data Analytics with Hadoop in the project


Slide Content

Data Analytics and Cognitive Ability: What makes them an important part of the accounting profession? Matthew J. Sargent, Clinical Assistant Professor, Department of Accounting, The University of Texas at Arlington

Heartbeats DataNodes send a heartbeat to the NameNode periodically, once every 3 seconds. The NameNode uses heartbeats to detect DataNode failure.
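The detection logic can be sketched in a few lines of Python. The 3-second interval is from the slide; the dead-node timeout and node names here are made-up values for illustration, not HDFS defaults:

```python
HEARTBEAT_INTERVAL = 3   # seconds between heartbeats, per the slide
DEAD_AFTER = 600         # hypothetical timeout; real HDFS uses its own default

def dead_datanodes(last_heartbeat, now):
    """NameNode side: nodes whose last heartbeat is older than the timeout."""
    return sorted(node for node, ts in last_heartbeat.items()
                  if now - ts > DEAD_AFTER)

# The NameNode's view: DataNode -> timestamp of the last heartbeat received
heartbeats = {"dn1": 995.0, "dn2": 300.0, "dn3": 999.0}
dead = dead_datanodes(heartbeats, 1000.0)   # dn2 was last seen 700 s ago
```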

NameNode as a Replication Engine The NameNode detects DataNode failures, chooses new DataNodes for new replicas, balances disk usage, and balances communication traffic to DataNodes.

Data Correctness Use checksums (CRC32) to validate data. File creation: the client computes a checksum per 512 bytes, and the DataNode stores the checksum. File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas.
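The per-512-byte CRC32 scheme can be sketched with Python's standard zlib module. The function names are illustrative, not the HDFS API:

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, as on the slide

def checksums(data: bytes):
    """File creation: the client computes one CRC32 per 512-byte chunk."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def valid(data: bytes, stored):
    """File access: recompute and compare; on failure, try another replica."""
    return checksums(data) == stored

payload = bytes(range(256)) * 5                       # 1280 bytes -> 3 chunks
stored = checksums(payload)                           # the DataNode stores these
corrupted = payload[:100] + b"\x00" + payload[101:]   # corrupt one byte
```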

Data Pipelining (i) The client retrieves a list of DataNodes on which to place replicas of a block and writes the block to the first DataNode; the first DataNode forwards the data to the next node in the pipeline. When all replicas are written, the client moves on to write the next block in the file.
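The write path above can be replayed as a toy simulation (node and block names are made up):

```python
def forward(block, pipeline, storage):
    """A DataNode stores the block, then forwards it to the next node."""
    if pipeline:
        head, rest = pipeline[0], pipeline[1:]
        storage.setdefault(head, []).append(block)
        forward(block, rest, storage)

def write_file(blocks, replicas, storage):
    """Client: write each block through its pipeline; move to the next
    block only once all replicas of the current block are written."""
    for block in blocks:
        forward(block, replicas, storage)

storage = {}
write_file(["blk0", "blk1"], ["dn1", "dn2", "dn3"], storage)
```

After the run, every node in the pipeline holds a replica of both blocks, in write order.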

Data Pipelining (ii)

Rebalancer Goal: the percentage of disk used should be similar across DataNodes. Usually run when new DataNodes are added; the cluster stays online while the Rebalancer is active. The Rebalancer is throttled to avoid network congestion. It is a command line tool.

User Interface Commands for the HDFS user: hadoop dfs -mkdir /foodir, hadoop dfs -cat /foodir/myfile.txt, hadoop dfs -rm /foodir/myfile.txt. Commands for the HDFS administrator: hadoop dfsadmin -report, hadoop dfsadmin -decommission datanodename. Web interface: http://host:port/dfshealth.jsp

INTRODUCTION TO MAPREDUCE

MapReduce - What? MapReduce is a programming model for efficient distributed computing. It works like a Unix pipeline: cat input | grep | sort | uniq -c | cat > output corresponds to Input | Map | Shuffle & Sort | Reduce | Output. Efficiency comes from streaming through data (reducing seeks) and from pipelining. It is a good fit for many applications, such as log processing and web index building.
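The Unix-pipeline analogy can be replayed in plain Python: a grep-like filter in the map step, a sort, and a uniq -c-style reduce. The sample lines are made up:

```python
import itertools

lines = ["hadoop stores data", "mapreduce processes data", "hadoop scales"]

# Map (cat | grep): keep words containing "a", emit (key, value) pairs
mapped = [(w, 1) for line in lines for w in line.split() if "a" in w]

# Shuffle & Sort (sort): bring equal keys together
mapped.sort(key=lambda kv: kv[0])

# Reduce (uniq -c): sum the counts per key
counts = {key: sum(v for _, v in group)
          for key, group in itertools.groupby(mapped, key=lambda kv: kv[0])}
```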

MapReduce - Dataflow

MapReduce - Features Fine-grained Map and Reduce tasks give improved load balancing and faster recovery from failed tasks. Automatic re-execution on failure: in a large cluster, some nodes are always slow or flaky, and the framework re-executes failed tasks. Locality optimizations: with large data, bandwidth to data is a problem; MapReduce + HDFS is a very effective solution because MapReduce queries HDFS for the locations of input data, and Map tasks are scheduled close to the inputs when possible.

Word Count Example Mapper input: value = a line of the input text; output: key = word, value = 1. Reducer input: key = word, value = set of counts; output: key = word, value = sum. The launching program defines the job and submits it to the cluster.
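The mapper/reducer contracts on this slide, sketched in Python with an in-memory stand-in for the framework (not Hadoop's actual API):

```python
from collections import defaultdict

def mapper(value):
    """Input: value = a line of text. Output: (word, 1) pairs."""
    for word in value.split():
        yield word, 1

def reducer(key, counts):
    """Input: key = word, value = set of counts. Output: (word, sum)."""
    return key, sum(counts)

def run_job(lines):
    """Launching-program stand-in: runs map, shuffle/sort, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for word, one in mapper(line):
            groups[word].append(one)        # shuffle: group values by key
    return dict(reducer(k, v) for k, v in sorted(groups.items()))

result = run_job(["I am a tiger", "you are also a tiger"])
```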

Hadoop-MapReduce Workflow (diagram: input from HDFS is divided into splits 0-4, each processed by a map task; the sorted map output is copied and merged by the reduce tasks, which write part0 and part1 back to HDFS)

MapReduce Dataflow

Example For the input "I am a tiger, you are also a tiger", the JobTracker generates three TaskTrackers for map tasks and two for reduce tasks. The map tasks emit (word, 1) pairs, Hadoop sorts the intermediate data, and the reduce tasks sum the counts into part0 and part1: a,2 also,1 am,1 are,1 I,1 tiger,2 you,1.

Input and Output Formats A Map/Reduce job may specify how its input is to be read by specifying an InputFormat, and how its output is to be written by specifying an OutputFormat. These default to TextInputFormat and TextOutputFormat, which process line-based text data. Another common choice is SequenceFileInputFormat and SequenceFileOutputFormat for binary data. These are file-based, but they are not required to be.

How many Maps and Reduces Maps: usually as many as the number of HDFS blocks being processed; this is the default. Otherwise the number of maps can be specified as a hint, or controlled by specifying the minimum split size. The actual size of each map input is computed as max(min(block_size, data/#maps), min_split_size). Reduces: unless the amount of data being processed is small, use 0.95 * num_nodes * mapred.tasktracker.tasks.maximum.
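The two sizing rules above as small helper functions (the function names and the numbers in the example are made up; sizes are in MB):

```python
def map_input_size(block_size, data_size, num_maps, min_split_size):
    """Per the slide: max(min(block_size, data / #maps), min_split_size)."""
    return max(min(block_size, data_size // num_maps), min_split_size)

def default_num_reduces(num_nodes, tasks_per_tracker):
    """Rule of thumb: 0.95 * num_nodes * mapred.tasktracker.tasks.maximum."""
    return int(0.95 * num_nodes * tasks_per_tracker)

# e.g. 1024 MB of data, 64 MB blocks, a hint of 4 maps, 32 MB minimum split:
# min(64, 1024/4) = 64, max(64, 32) = 64 -> each map reads one full block
split = map_input_size(64, 1024, 4, 32)
```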

Some handy tools Partitioners Combiners Compression Counters Speculation Zero Reduces Distributed File Cache Tool

Partitioners Partitioners are application code that defines how keys are assigned to reduces. Default partitioning spreads keys evenly, but randomly, using key.hashCode() % num_reduces. Custom partitioning is often required, for example, to produce a total order in the output. A partitioner should implement the Partitioner interface and is set by calling conf.setPartitionerClass(MyPart.class). To get a total order, sample the map output keys, pick values that divide the keys into roughly equal buckets, and use those in your partitioner.
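Both ideas sketched in Python: the default hash partitioner, and a total-order range partitioner built from sampled split points (class, method names, and split points are illustrative):

```python
def default_partition(key, num_reduces):
    """Default scheme: key.hashCode() % num_reduces.
    Python's hash() stands in for Java's hashCode here."""
    return hash(key) % num_reduces

class RangePartitioner:
    """Total order: sampled split points divide keys into ordered buckets,
    so concatenating the sorted reduce outputs is globally sorted."""
    def __init__(self, split_points):
        self.split_points = sorted(split_points)

    def get_partition(self, key):
        for i, point in enumerate(self.split_points):
            if key < point:
                return i
        return len(self.split_points)

part = RangePartitioner(["g", "p"])   # 3 buckets: < "g", "g".."p", >= "p"
```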

Combiners When maps produce many repeated keys, it is often useful to do a local aggregation following the map. This is done by specifying a Combiner; the goal is to decrease the size of the transient data. Combiners have the same interface as Reducers, and are often the same class. Combiners must not have side effects, because they run an indeterminate number of times. In WordCount: conf.setCombinerClass(Reduce.class);
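What a combiner buys you, sketched in Python: locally aggregating one map's output shrinks the data shuffled to the reducers, and because it only sums values per key it is safe to apply any number of times (the sample pairs are made up):

```python
from collections import Counter

map_output = [("tiger", 1), ("a", 1), ("tiger", 1), ("a", 1), ("a", 1)]

def combine(pairs):
    """Same shape of interface as the reducer: sum the values per key.
    No side effects, so the framework may apply it zero or more times."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

combined = combine(map_output)   # fewer pairs cross the network
```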

Compression Compressing the outputs and intermediate data will often yield huge performance gains. It can be specified via a configuration file or set programmatically: set mapred.output.compress to true to compress job output, and mapred.compress.map.output to true to compress map outputs. Compression types (mapred(.map)?.output.compression.type): "block" - groups of keys and values are compressed together; "record" - each value is compressed individually. Block compression is almost always best. Compression codecs (mapred(.map)?.output.compression.codec): default (zlib) - slower, but more compression; LZO - faster, but less compression.
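Why block compression usually wins, shown with zlib: compressing a group of similar values together exploits redundancy across records that per-record compression cannot see (the sample data is made up):

```python
import zlib

values = [b"the quick brown fox jumps over the lazy dog"] * 50

# "record": each value is compressed individually
record_bytes = sum(len(zlib.compress(v)) for v in values)

# "block": the whole group is compressed together
block_bytes = len(zlib.compress(b"".join(values)))
```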

Counters Often Map/Reduce applications have countable events. For example, the framework counts records into and out of the Mapper and Reducer. To define user counters: static enum Counter {EVENT1, EVENT2}; reporter.incrCounter(Counter.EVENT1, 1); Define nice names in a MyClass_Counter.properties file: CounterGroupName=MyCounters EVENT1.name=Event 1 EVENT2.name=Event 2
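The user-counter pattern mimicked in Python; Reporter here is a stand-in for the object the framework hands to each task, and the sample records are made up:

```python
from collections import Counter
from enum import Enum

class MyCounters(Enum):    # mirrors: static enum Counter {EVENT1, EVENT2};
    EVENT1 = "Event 1"     # nice names would live in a .properties file
    EVENT2 = "Event 2"

class Reporter:
    """Stand-in for the framework's reporter object."""
    def __init__(self):
        self.counters = Counter()

    def incr_counter(self, counter, amount):
        self.counters[counter] += amount

reporter = Reporter()
for record in ["ok", "ok", "bad", "ok"]:
    event = MyCounters.EVENT2 if record == "bad" else MyCounters.EVENT1
    reporter.incr_counter(event, 1)
```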

Speculative execution The framework can run multiple instances of slow tasks; output from the instance that finishes first is used. Controlled by the configuration variable mapred.speculative.execution. Can dramatically bring in long tails on jobs.

Zero Reduces Frequently, we only need to run a filter on the input data, with no sorting or shuffling required by the job. Set the number of reduces to 0; output from the maps will go directly to the OutputFormat and disk.

Distributed File Cache Sometimes tasks need read-only copies of data on the local computer, and downloading 1 GB of data for each Mapper is expensive. Define the list of files you need to download in the JobConf; files are downloaded once per computer. Add to the launching program: DistributedCache.addCacheFile(new URI("hdfs://nn:8020/foo"), conf); Add to the task: Path[] files = DistributedCache.getLocalCacheFiles(conf);
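The download-once-per-computer behavior as a toy model; the class, the local path scheme, and the task count are made up, and the real calls are the DistributedCache ones on this slide:

```python
class NodeCache:
    """Toy model: one cache per computer; a file is fetched only once,
    then every task on that computer reuses the local copy."""
    def __init__(self):
        self.local = {}
        self.downloads = 0

    def get_local_file(self, uri):
        if uri not in self.local:
            self.downloads += 1                          # the expensive fetch
            self.local[uri] = "/local/cache/" + uri.rsplit("/", 1)[-1]
        return self.local[uri]

cache = NodeCache()
# Three mappers on the same computer ask for the same cached file
paths = [cache.get_local_file("hdfs://nn:8020/foo") for _ in range(3)]
```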

INTRODUCTION TO YARN AND MAPREDUCE INTERACTION

In the MapReduce paradigm, an application consists of Map tasks and Reduce tasks. Map tasks and Reduce tasks align very cleanly with YARN tasks. MapReduce on YARN

Putting it Together: MapReduce and YARN In a MapReduce application there are multiple map/reduce tasks; each task runs in a container on a worker host in the cluster. On the YARN side, the ResourceManager, NodeManager, and ApplicationMaster work together to manage the cluster's resources.

A cluster scheduler essentially has to address: Multi-tenancy: on a cluster, many users launch many different applications on behalf of multiple organizations, and a cluster scheduler allows varying workloads to run simultaneously. Scalability: a cluster scheduler needs to scale to large clusters running many applications. YARN uses queues to share resources among multiple tenants. The ApplicationMaster (AM) tracks each task's resource requirements and coordinates container requests, so the RM/scheduler doesn't need to track all containers running on the cluster. Scheduling in YARN

The ResourceManager (RM) tracks resources on a cluster and assigns them to applications that need them. The scheduler is the part of the RM that does this matching while honoring organizational policies on sharing resources. Scheduling in YARN

Queues are the organizing structure for YARN schedulers, allowing multiple tenants to share the cluster. As applications are submitted to YARN, they are assigned to a queue by the scheduler. YARN queues are hierarchical: the root queue is the parent of all queues, and every other queue is a child of the root queue or of another queue. YARN Queues for Scheduling

Example of Queues The marketing queue has a weight of 3.0, the sales queue has a weight of 4.0, and the datascience queue has a weight of 13.0. So the allocation from the root will be 15% to marketing, 20% to sales, and 65% to datascience.
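The slide's arithmetic checks out: each queue's share of the root is its weight divided by the sum of all weights.

```python
def fair_shares(weights):
    """Share of the root for each queue = weight / total weight."""
    total = sum(weights.values())
    return {queue: w / total for queue, w in weights.items()}

# Weights from the slide: 3 + 4 + 13 = 20, so shares are 15%, 20%, 65%
shares = fair_shares({"marketing": 3.0, "sales": 4.0, "datascience": 13.0})
```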

What are they? Understand the basics of data analytics and big data

What is Big Data Big Data is defined as datasets that are too large and complex for businesses’ existing systems to handle using their traditional capabilities to capture, store, manage, and analyze them. In this respect Big Data is no different from traditional data: if it can’t be analyzed to provide insight and help with decision making, its value is limited.

What is Big Data The four V’s—volume, velocity, veracity and variety—are often used to represent the defining features of Big Data. Volume refers to the massive amount of data involved. Velocity refers to the fact that the data comes in at quick speeds or in real time, such as streaming videos and news feeds. Variety refers to unstructured and unprocessed data, such as comments in social media, emails, global positioning system (GPS) measurements, etc. Veracity refers to the quality of the data, including its cleanliness (freedom from errors and data integrity issues), reliability, and representational faithfulness.

What is Data Analytics Data Analytics is defined as the science of examining raw data, removing excess noise, and organizing the data with the purpose of drawing conclusions for decision making. Data analytics often involves the technologies, systems, practices, methodologies, databases, and applications used to analyze diverse business data to help organizations make sound and timely business decisions.

What is Data Analytics The intent of Data Analytics is to transform raw data into valuable information. Data analytics is used in today’s business world by examining the data to generate models for predictions of patterns and trends. When used effectively, data analytics gives us the ability to search through large and unstructured data to identify unknown patterns or relationships, which when organized, is used to provide useful information.

What is the value of Big Data and Data Analytics 85% of CEOs put a high value on Data Analytics. 80% of CEOs place data mining and analysis as the second-most important strategic technology. Business analytics tops CEOs’ list of priorities. Data Analytics could generate up to $3 trillion in value per year.

With a wealth of data on their hands, companies are empowered by using data analytics to discover various patterns, investigate anomalies, forecast future behavior, and so forth. Patterns discovered from historical data enable businesses to identify future opportunities and risks. In addition to producing more value externally, studies show that data analytics affects internal processes, improving productivity, utilization, and growth. The Power of Data Analytics

Reformatting, cleansing, and consolidating large volumes of data from multiple sources and platforms can be especially time consuming. Data analytics professionals estimate that they spend between 50 percent and 90 percent of their time cleaning data for analysis. The cost to scrub the data includes the salaries of the data analytics scientists and the cost of the technology to prepare and analyze the data. As with other information, there is a cost to produce these data. Benefits and Costs of Data Analytics

Many companies address the likely possibility that the data their organizations hold influences their market value. Facebook, for example, has a large amount of its market value driven by the number of users on the platform and the amount of data those users contribute, which is sold to third parties. Data analytics often also involves data management and business intelligence with knowledge of business functional areas. Today there is an increasing number of investments in data analytics and increasing demand for data analytics–related tasks. The impact of Data Analytics

The real value of data comes from the use of data analytics. Companies are getting much smarter about using data analytics to discover various patterns, investigate anomalies, forecast future behavior, and so forth.  For example, companies can use their data to do more directed marketing campaigns based on patterns observed in their data. That can give them a competitive advantage and it can also be used on historical data to enable businesses to identify future opportunities and risks. The impact of Data Analytics

Why does it matter to accountants? Understand how data analytics has been traditionally used within the accounting profession

The Impact of Data Analytics on Accounting We refer to financial reporting as the responsibility internal to the firm of issuing financial statements and financial reports. Financial reporting includes a number of estimates and valuations that can be evaluated through use of data analytics. Many financial statement accounts are just estimates and accountants can use data analytics to evaluate those estimates.

The Impact of Data Analytics on Accounting In financial accounting, data analytics may be used to scan the environment—that is, by scanning social media to identify potential risks and opportunities to the firm. Data analytics plays a very critical role in the future of audit. By using data analytics, auditors can spend less time looking for evidence, which will allow more time for presenting their findings and making judgments. Data analytics also expands auditors’ capabilities in services such as testing for fraudulent transactions and automating compliance-monitoring activities (for example, filing financial reports with the SEC or IRS).

The Impact of Data Analytics on Accounting How much of the accounts receivable balance will be collected? (which will impact the Allowance for Doubtful Accounts and Net A/R) Is any of our inventory obsolete? Are customers still interested in it? If not, write it off. Should our inventory be valued at market or cost (applying the lower-of-cost-or-market rule)? Is our goodwill correctly valued, or has it been impaired? Due to conservatism, do we need to write it down or write it off? Is our property, plant, and equipment overvalued in the current real estate market?

The Impact of Data Analytics on Accounting Another way that businesses are controlling risk is through scanning the environment (e.g., social media) to identify potential risks and opportunities to the firm. For example, using data analytics effectively can allow a business to monitor its areas of business and better understand opportunities and threats around them. You can discover things like: Are competitors, customers, or suppliers facing financial difficulty that might affect their interactions with the firm?

The Impact of Data Analytics on Accounting Data analytics may also allow an accountant or auditor to assess the probability of a goodwill write-down, warranty claims, or the collectability of bad debts based on what customers, investors, and other stakeholders are saying about the company in blogs and in social media. This information might help the firm determine both its optimal response to the situation and appropriate adjustment to its financial reporting.

How does data analytics affect auditing? Data analytics will enhance audit quality. Data analytics enables enhanced audits, expanded services, and added value to clients. External auditors will stay engaged beyond the audit.

Impact on Auditing Data analytics can help auditors in ways that impact both the effectiveness and efficiency of the audit. 1) Spend less time looking for evidence, which will allow more time for presenting their findings and making judgments. 2) Allow auditors to vastly expand sampling beyond current traditional sample sizes and, in many cases, may be able to test the full population of transactions.

Impact on Auditing Greater insight into the client’s operations gives auditors a better understanding of a client’s business risk during the planning processes. This understanding helps drive the auditor’s assessment of Inherent Risk (which is an estimate) and more accurately assess the RMM (Risk of Material Misstatement) which helps them better manage audit risk.   Increases in automation can result in both higher quality and consistency of the audit and will help auditors identify issues earlier. This gives them the ability to engage management earlier and resolve issues before it’s too late to fix them within the fiscal year. 

Impact on Auditing IT advisory/consulting and tax services would also benefit from the greater insight provided by data analytics from the audit team because the analyses could provide a better picture of multiple functional areas. Firms that adapt early to these changes will have a significant advantage over slow movers, as the harnessing of data analytics will provide notable benefits in the upcoming years.

How does one help the other? Understand the relationship between cognitive ability and the ability to perform data analytics functions

Cognitive ability is considered our general intelligence level and provides us the ability to think abstractly, comprehend complex situations, problem solve, and gather and retain information from our life experiences (Plomin, 1999). Cognitive ability includes components such as mechanical reasoning, spatial awareness, numerical reasoning, critical thinking, and general intelligence (Davies, 2017). What is cognitive ability

This ability drives the processes of how we choose to engage in learning and how we learn, remember, and solve problems of various levels of complexity, and the best learners do not just memorize random pieces of information (Nordin & Dakwah, 2015). When our cognitive abilities are developed at an advanced level, the process of learning is more straightforward for us. In contrast, when cognitive abilities are not as developed, the learning process is more challenging (Bhat, 2016). What is cognitive ability

The cycle of growth for cognitive development starts when an individual is around two years old and moves through a series of ups and downs in performance levels, which continue into early adulthood (Fischer & Bidell, 2006). Brain and human behaviors research indicate that the capacity to engage in reflective judgment through advanced levels of abstract thinking does not emerge until early adulthood. What is cognitive ability

Fischer’s (1980) model of cognitive skill theory (Skill Theory) added a background to how complex reasoning is developed. Skill Theory outlines the professional maturation of individuals and how an individual’s environment contributes to the development of skills. The developmental levels of Skill Theory occur as the brain conducts re-organizations of behavior. What is cognitive ability

This reorganization of behavior facilitates the use of new higher-order cognitive ability levels built upon combinations of previously constructed lower-order cognitive abilities. As individuals develop complex reasoning, their developmental range will fluctuate between functional and optimal skill levels based on their environment. An individual must solidify those less complex cognitive skills, and then they will be able to develop more complex cognitive skills. What is cognitive ability

Reflective judgment requires an individual to coordinate multiple views, so this skill cannot develop until adults can engage in abstract thought (Fischer & Pruyne, 2003). Individuals move through periods when their skills grow at a faster pace because they are in an environment that supports optimal performance. During functional performance, they are not pushing the limits of their cognitive ability (optimal performance), and they grow slowly or do not progress at all (Fischer, 2008). What is cognitive ability

During data analytics assignments, students are presented with open and ill-structured problems, which students will face in real-world situations (Chin & Chia, 2006). Because of this, students will need to engage in higher levels of cognitive ability, such as applying (using the information in a new way), analyzing (identifying relationships), and evaluating (using the information to make judgments) The connection to data analytics

The use of advanced data analytics during audits or financial statement reviews can help accounting firms provide better strategic guidance to their clients. In the realm of auditing, which is a core function of accounting, auditors have been evaluating data for decades to help them understand their client’s structure and financial transactions The connection to data analytics

Research has found an association between cognitive ability and performance on data analytics assignments. This research helps to highlight the importance of focusing on classroom activities that promote the growth of students’ cognitive abilities. Allowing accounting students to engage in data analytics in the classroom is helping to build a pathway for success in the accounting world. Building future accountants with strong cognitive ability will positively impact an area critical to the future success of the accounting profession. The connection to data analytics

What can be done to promote these skillsets? Understand why increased cognitive ability and more robust use of data analytics are critical to the future of the accounting profession

Why does it matter? Strong cognitive ability is deemed a critical skill in the modern workforce and an important commodity in the accounting profession (Terblanche & De Clercq, 2021). The need for strong cognitive ability and data analytics among Certified Public Accountants (CPAs) is driving significant changes in the higher education curriculum. To help test these skills, a new Uniform CPA Exam will be released in 2024 (Bakarich et al., 2021).

Accountants need to be able to: Articulate business problems. Communicate with data scientists. Draw appropriate conclusions. Present results in an accessible manner. Develop an analytics mindset.

As well as be comfortable with: Data scrubbing and data preparation Data quality Descriptive data analysis Data analysis through data manipulation Define and address problems through statistical analysis Data visualization and data reporting

How does higher education help A primary goal of education is to promote the thinking skills of students. Universities recognize the importance of enabling cognitive development in their students and the impact on a student’s level of cognitive ability. For students within accounting programs, this is especially important because the accounting profession requires more creativity and innovative thinking to stay competitive within the market (Thompson & Washington, 2015).

How does higher education help Cognitive ability goes beyond just basic memorization or imitation; it gives a person the capacity to comprehend situations and determine how to assess and resolve them (Plomin & Von Stumm, 2018). Cognitive ability is closely associated with higher achievement in education, occupations, and better health outcomes.

It’s a shared responsibility Business students, especially those in accounting, finance, and audit positions, are expected to have higher levels of cognitive ability (Reding & Newman, 2017). The focus of today’s accounting and business professionals is to provide value-added services, and higher education serves as a pipeline for businesses to obtain new talent.

It’s a shared responsibility With the increasing reliance on big data to help within the decision-making process, higher levels of cognitive ability are critical in today's business world, and especially important for graduating students. Students entering the business world with lower cognitive ability will have a direct impact on the profitability and productivity of organizations.

It’s a shared responsibility Cognitive development occurs when individuals allow their growth and experiences to enhance their cognitive ability, maintain an adequate level of engagement until they reach their optimal level of cognitive ability, and work to maintain that optimal level. Business leaders and their organizations have a vested interest in continuing to support and foster the growth of cognitive ability within their employees. Without that ongoing support, those employees will revert to their “functional” level and will become static in their growth.

Questions?