slideshare is annoying as fsck duh aaaaaa
izhar84
6 views
59 slides
May 07, 2024
About This Presentation
old deck
Size: 3.86 MB
Language: en
Added: May 07, 2024
Slides: 59 pages
Slide Content
Slide 1
An Overview of Open Source Big Data Ecosystem
Slide 2
© ABYRES SDN BHD
Slide 3
We Do Open Source
Slide 4
We Do Open Source With Enterprise-Grade Support and Services
Slide 5
We Do Open Source With Enterprise-Grade Support and Services
- Private Cloud
- Big Data
- Enterprise Mobility
- Proprietary to Open Source
Slide 6
In Partnership With Well-Known Enterprise Open Source Vendors To Provide End-to-End Solutions
Slide 7
Big Data Challenges
Slide 8
Traditional Data Architecture Under Pressure
Slide 9
Slide 10
Slide 11
Slide 12
Open Source Drives Big Data Innovation
Slide 13
Slide 14
Slide 15
"Empowerment of individuals is a key part of what makes open source work, since in the end, innovations tend to come from small groups, not from large, structured efforts." - Tim O'Reilly

"Open source is a development method of software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost and an end to predatory vendor lock-in" - Jim Whitehurst, President and CEO of Red Hat
Slide 16
Slide 17
Bird's-Eye View of the Big Data Ecosystem
Slide 18
Slide 19
Data Collection
Slide 20
Data Collection
- The key component before any Big Data initiative
- Big Data means lots of data, and to have lots of data, you need to collect lots of data
- Big Data infrastructure without lots of data => low business value
- Lots of data without infrastructure to handle it => high business value, just slow processing
"You can't have Big Data until you have lots of little data" - Herb Caudill, CTO of DevResults
Slide 21
Strategies of Data Collection
- Sensors, listeners & logs
  - Hardware and software that observe and collect data whenever they see it
  - JavaScript code to collect browser information and user usage patterns
  - Mobile apps that aggressively collect information from their sensors (GPS, network, apps, etc.)
  - Crowdsourcing applications: get many people to contribute data
  - Weather sensors, light sensors, CCTV, microphones, cameras, etc.
  - Server logs, processor logs, RAM logs, error logs, CDR records
  - Transactional data
- Web crawlers
  - Collect data from websites and online sources
  - Careful: web data tends to be dirty and of poor quality; accuracy is low, and extracting information takes high effort
  - Data can provide indicative measurements if there are no other sources of quality data
  - Expect low accuracy from analysis
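As a rough illustration of the "logs" strategy above, the sketch below turns raw web server log lines into structured records that later stages could load and analyze. The log layout (Common Log Format), the sample lines, and the `parse_line` helper are assumptions made for this example and are not taken from the deck.

```python
import re
from collections import Counter

# Assumed layout: Common Log Format, e.g.
# 10.0.0.1 - - [07/May/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# Hypothetical sample lines; the last one is malformed on purpose.
SAMPLE_LOG = [
    '10.0.0.1 - - [07/May/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [07/May/2024:10:00:01 +0000] "GET /missing HTTP/1.1" 404 -',
    'not a log line',
]

# Keep only the lines that parsed, then count responses per status code.
records = [r for r in map(parse_line, SAMPLE_LOG) if r is not None]
status_counts = Counter(r["status"] for r in records)
print(status_counts)
```

Dropping unparseable lines silently is a simplification; a real collector would more likely count or quarantine them so that data quality can be tracked.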
Slide 22
Already have systems that are collecting or generating a lot of data, but have yet to tap its potential? You are probably ready for Big Data. If not, start planning and implementing strategies to collect data NOW!
Slide 23
Data Loading
Slide 24
Data Loading
- Data generated in the data collection stage needs to be loaded into a data store
- High-velocity data requires a special strategy for loading into data store systems to enable Big Data and real-time analytics
  - Parallel I/O
  - Parallel execution
  - Scalability to 100s of servers
- Data might need to be queued for real-time analytics
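The queueing idea above can be sketched with an in-process buffer: a producer emits events while a separate loader drains them and writes to storage in batches. This is only a toy model using Python's standard `queue` and `threading` modules; the event shape, batch size, and the `stored_batches` list standing in for a data store are all assumptions, and it does not use the API of any real message queue system.

```python
import queue
import threading

events = queue.Queue(maxsize=100)   # bounded buffer between collection and storage
SENTINEL = None                     # marks the end of the stream
stored_batches = []                 # hypothetical stand-in for the data store

def producer(n_events):
    """Simulate a collector emitting events as they are created."""
    for i in range(n_events):
        events.put({"id": i, "value": i * 2})
    events.put(SENTINEL)

def loader(batch_size):
    """Drain the queue and write events to the store in batches."""
    batch = []
    while True:
        event = events.get()
        if event is SENTINEL:
            break
        batch.append(event)
        if len(batch) == batch_size:
            stored_batches.append(batch)
            batch = []
    if batch:                        # flush the final partial batch
        stored_batches.append(batch)

worker = threading.Thread(target=loader, args=(4,))
worker.start()
producer(10)
worker.join()
print([len(b) for b in stored_batches])
```

The bounded queue makes the producer block when the loader falls behind, which is one simple way to apply back-pressure; dedicated queue systems handle this durably and across many servers.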
Slide 25
Software Ecosystem
- Apache Flume
  - Stream loading system: loads streaming transactional data as it gets created
- Apache Kafka
  - Message queue system: holds streaming data in a queue before it is written to disk or passed to a real-time analytics engine
- Apache Sqoop
  - Batch loading of data into storage
- Logstash
  - Listens to log files and stores them into storage
- And many more alternatives doing more or less the same thing
Slide 26
Data Storage
Slide 27
Data Storage
- A very large amount of data requires a low-cost storage strategy
- SAN storage is way too expensive
- You are looking at a storage system that utilizes low-cost normal disks, is built on low-cost commodity hardware, and can be scaled up easily by adding more disks and servers
Slide 28
Software Ecosystem
- Apache Hadoop Distributed File System (HDFS)
  - Default distributed storage system for Hadoop
- GlusterFS / Red Hat Storage
  - Distributed software-defined storage by Red Hat
  - POSIX compliant: can be used for other purposes beyond storing data for Big Data processing
  - HDFS compatible: can replace HDFS in a Hadoop implementation
- Others
  - Tachyon
    - Memory-centric distributed storage system enabling reliable, high-speed data sharing across a cluster
Slide 29
Slide 30
Data Access and Processing
Slide 31
Data Access and Processing
- Processing lots of data requires a scalable data processing strategy
- Reading data from storage, then massaging and processing it, requires a flexible framework that can cater to a wide variety of data processing activities
  - SQL for common analytical queries
  - MapReduce & scripting languages for complex analysis on unstructured or messy data
  - In-memory data access for interactive exploration
  - Stream processing for analyzing streaming data as it arrives
  - Distributed processing to handle the scale
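To make the MapReduce item above more concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases applied to a word count. The function names and the two sample documents are invented for illustration; a real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

# Hypothetical input documents.
documents = ["big data needs lots of data", "lots of little data"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
word_counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(pairs).items())
print(word_counts["data"])  # → 3
```

Because each call to `map_phase` touches only one document and each call to `reduce_phase` touches only one key, both phases can be distributed without coordination, which is what lets the model scale.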
Slide 32
Software Ecosystem
- Apache Spark
  - General-purpose in-memory analytics engine that supports SQL, scripting, stream processing and machine learning
- Apache Pig
  - Scripting language that simplifies MapReduce programming
- Apache Hive
  - SQL data warehouse on Hadoop
  - Allows traditional SQL developers to query and analyze data using familiar SQL queries while leveraging the power of Hadoop distributed processing
- Apache Storm
  - Stream processing engine for real-time analytics
- Apache NiFi
  - Flow programming for stream processing and real-time analytics
- Hadoop Streaming
  - Process data on Hadoop using any programming language you like through MapReduce programming
- Apache HBase, Accumulo, Phoenix
  - NoSQL databases on Hadoop
  - Applications can write data directly into Hadoop rather than going through an initial database
- And many more!!!
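Hadoop Streaming, listed above, passes data to mapper and reducer programs as lines of text, with keys and values separated by a tab and the mapper output sorted by key before it reaches the reducer. The sketch below imitates that contract locally in Python; the `sensor,reading` input format and the sample values are assumptions made for this example.

```python
from itertools import groupby

def mapper(lines):
    """Turn raw 'sensor,reading' lines into tab-separated key/value lines."""
    for line in lines:
        sensor, reading = line.strip().split(",")
        yield f"{sensor}\t{reading}"

def reducer(sorted_lines):
    """Consume key-sorted lines and emit the maximum reading per sensor."""
    parsed = (line.strip().split("\t") for line in sorted_lines)
    for sensor, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{sensor}\t{max(float(value) for _, value in group)}"

# Local simulation of the map -> sort -> reduce flow on hypothetical data.
raw = ["s1,20.5", "s2,18.0", "s1,23.1", "s2,17.2"]
result = list(reducer(sorted(mapper(raw))))
print(result)
```

In an actual streaming job the two functions would live in separate scripts that read from standard input and print to standard output, and the sort between them would be performed by the framework rather than by `sorted`.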
Slide 33
Choosing the right data access tool
- There are plenty of different tools for data access, each with its own strengths and weaknesses
- Different business goals require different approaches to data access and processing
- Understand what you need in the end and what type of analysis is needed, then identify the tool that can cater to that analysis
- Understand the capabilities of your team
- Understand the business problem and goals
Slide 34
Data Analysis
Slide 35
Data Analysis
- This is mainly a human activity, with tools helping to speed some development up
  - But a lot of effort is still needed to ensure the correct analysis is created
  - Close involvement from business users is highly necessary to ensure a good understanding of the business case and of what analysis can provide value
- Types of analysis
  - Descriptive: analysis of past historical data
  - Predictive: estimating a value or forecasting a future value
  - Prescriptive: suggesting actions to users
- 2 approaches
  - Data-driven analytics
  - Business-driven analytics
- This process analyzes the large amount of data, then aggregates, transforms and applies machine learning algorithms to it to generate a smaller dataset of analysis results for user or application consumption
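The difference between descriptive, predictive and prescriptive analysis can be shown on a tiny dataset. In the sketch below, the sales figures, the `STOCK_ON_HAND` threshold and the reorder rule are all hypothetical, and a straight-line least-squares fit stands in for whatever model a real project would choose.

```python
from statistics import mean

# Hypothetical monthly sales figures (units sold), oldest first.
sales = [100, 110, 120, 130, 140]
months = list(range(len(sales)))

# Descriptive: summarize what already happened.
average_sales = mean(sales)

# Predictive: fit a least-squares line and extrapolate one month ahead.
x_bar, y_bar = mean(months), mean(sales)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(months, sales)) / sum(
    (x - x_bar) ** 2 for x in months
)
intercept = y_bar - slope * x_bar
forecast = slope * len(sales) + intercept

# Prescriptive: turn the forecast into a suggested action (assumed rule).
STOCK_ON_HAND = 125
action = "reorder stock" if forecast > STOCK_ON_HAND else "no action needed"

print(average_sales, forecast, action)
```

Here the descriptive step yields an average of 120 units, the predictive step forecasts 150 units for the next month, and the prescriptive step suggests reordering because the forecast exceeds the assumed stock level.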
Slide 36
Data-driven analytics
Let the data speak to you
Pros:
- Identify hidden gems in data
- With the right team, can get really high-value results
- This is how the BigGuys(tm) handle it: exploring and analyzing data until they find something highly valuable, then utilizing the discovery to do better things
Cons:
- Requires an internal analytics / data science team that continuously explores data to identify value
- 0 scope: very difficult (if not impossible) to scope for a vendor project, so this method is not recommended for tendering to 3rd-party implementers
- Analysis might stray to topics not related to the business
Slide 37
Business-driven analytics
Set an analytic goal based on what is important to the business
Pros:
- Clear scope enables better design of the data pipeline from collection to consumption
- Clear resource planning and identification of the skillsets needed in the project team
- Clear goal to focus on
Cons:
- Rigid and less flexible in catering to new needs
- Requires the organization to clearly know what it wants and what is important to its business
Recommendation: Want to start a Big Data project? Start by identifying the business questions you want answered through data analysis.
Slide 38
Key people involved and their roles
- Data engineer
  - Prepares the data pipeline, cleans data, builds OLAP cubes, and preprocesses data for data analysis
- Data scientist
  - Develops machine learning models for classification of data, feature extraction, predictive analytics through learning from historical data, and prescriptive analytics through capturing domain knowledge from the domain expert and building an automated expert system
  - Identifies previously unknown patterns in data
- Domain expert
  - Advises on the business case and domain knowledge, and acts as the key person representing the business interest, providing information on which analysis results the business deems important
- Big Data infrastructure engineer
  - Ensures that the whole infrastructure is capable of handling the workload needed for the project
Slide 39
Data Serving
Slide 40
Data Serving
- Store analysis results in an easy-to-consume data store
- Usually optimized for speed and presentation
- Usually common operational databases, but can also be special-purpose databases with use-case-specific optimizations
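A small sketch of the serving step described above: precomputed analysis results are written into a relational table keyed for the lookups a consuming application is expected to make. It uses Python's built-in `sqlite3` module with an in-memory database purely as a stand-in for an operational database; the table name, columns and sample rows are assumptions for illustration.

```python
import sqlite3

# Hypothetical analysis output: daily page-view totals computed upstream.
analysis_results = [
    ("2024-05-01", "/home", 1200),
    ("2024-05-01", "/pricing", 300),
    ("2024-05-02", "/home", 1350),
]

conn = sqlite3.connect(":memory:")  # in-memory stand-in for a serving database
conn.execute(
    """
    CREATE TABLE daily_page_views (
        day   TEXT NOT NULL,
        page  TEXT NOT NULL,
        views INTEGER NOT NULL,
        PRIMARY KEY (day, page)  -- keyed for lookups by the consuming app
    )
    """
)
conn.executemany("INSERT INTO daily_page_views VALUES (?, ?, ?)", analysis_results)
conn.commit()

def views_for(page):
    """Return (day, views) rows for one page, as a dashboard might request."""
    cursor = conn.execute(
        "SELECT day, views FROM daily_page_views WHERE page = ? ORDER BY day",
        (page,),
    )
    return cursor.fetchall()

home_views = views_for("/home")
print(home_views)
```

The point of the design is that the expensive aggregation happens once in the analysis stage, so the serving store only answers small, cheap queries.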
Slide 41
Software Ecosystem
- Common relational databases
  - MariaDB, PostgreSQL, MySQL
- Document store databases
  - MongoDB, CouchDB
- Graph databases
  - Optimized for graph queries
  - Neo4j, ArangoDB
- Columnar store databases
  - High-speed analytic queries
  - MonetDB, LucidDB
- NewSQL databases
  - VoltDB
- Real-time/streaming databases
  - Can process and serve streaming data
  - RethinkDB, PipelineDB
- And many more being developed
Slide 42
Data Consumption
Slide 43
Data Consumption
- It is not just visualization dashboards
- Any application that utilizes the analysis results to improve its functionality and user experience
  - Mobile apps
  - Intelligent systems
  - Decision support systems
  - Specialized applications
  - Notification systems
  - Automated decision systems
  - Etc.
- Data consumption applications may also collect data and feed it back into the analytics system for continuous improvement of analysis accuracy and quality
Slide 44
Software Components
- Business intelligence systems
  - SpagoBI, Pentaho, etc.
- Visualization libraries
  - D3.js, DC.js, Dimple.js, NVD3.js, heatmap.js
  - Matplotlib, Bokeh, Pygal and many more
- Any software development framework and platform
Slide 45
Development Tools
Slide 46
Development Tools
- Tools and platforms to assist in managing the development of analytic algorithms
- Help connect all stages of data processing, from loading data to sending it for serving
Slide 47
Software Ecosystem
- Hadoop user interface
  - Ambari
  - Hue
- Data science platform
  - pydatalab
  - Jupyter
  - Apache Zeppelin
- ETL/ELT IDE
  - Talend
  - Pentaho
- Data mining tool
  - RapidMiner
- Programming libraries
  - Pandas
  - NumPy
  - SciPy
  - SQLAlchemy
  - Scikit-learn
  - TensorFlow
  - Gensim
  - NLTK
  - + hundreds more
Slide 48
Other Components
Slide 49
Other components in a Big Data infrastructure
- Cluster management
  - Apache Ambari
- Cluster security
  - Apache Knox
  - Apache Ranger
- Workflow and scheduling
  - Apache Oozie
  - Apache Falcon
Slide 50
Introduction To Hortonworks Hadoop
Slide 51
The Big Data Ecosystem is Big!
- So which product should I buy exactly?
- It all depends on your use case
- Hortonworks knows this challenge, and created Hortonworks Data Platform, a Hadoop distribution which provides pretty much all you need to get started with Big Data
Slide 52
Hortonworks Data Platform 2.3
Slide 53
Full Component Version List of HDP 2.3
Slide 54
Ambari: Cluster Manager
Slide 55
Ambari: HDFS Browser
Slide 56
Ambari: Hive UI
Slide 57
Hortonworks Support
Slide 58
For more information
Slide 59
Reference Links
- http://hortonworks.com/download - Hortonworks Data Platform download link
- https://github.com/onurakpolat/awesome-bigdata - curated list of Big Data frameworks and resources
- https://github.com/youngwookim/awesome-hadoop - curated list of Hadoop and Hadoop ecosystem resources
- https://github.com/okulbilisim/awesome-datascience - curated list of data science resources
- https://github.com/koslab/ansible-pydatalab - pydatalab, a Python data science platform; download the 1.0-preview release at: https://goo.gl/QyWvq9