slideshare is annoying as fsck duh aaaaaa

izhar84 6 views 59 slides May 07, 2024

About This Presentation

old deck


Slide Content

An Overview of Open Source Big Data Ecosystem

© ABYRES SDN BHD

We Do Open Source With Enterprise Grade Support and Services
- Private Cloud
- Big Data
- Enterprise Mobility
- Proprietary to Open Source

In Partnership With Well Known Enterprise Open Source Vendors To Provide End-to-End Solutions

Big Data Challenges

Traditional Data Architecture Under Pressure

Open Source Drives Big Data Innovation

"Empowerment of individuals is a key part of what makes open source work, since in the end, innovations tend to come from small groups, not from large, structured efforts." - Tim O'Reilly

"Open source is a development method of software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost and an end to predatory vendor lock-in." - Jim Whitehurst, President and CEO of Red Hat

Bird's-Eye View of the Big Data Ecosystem

Data Collection

- The key component before any Big Data initiative
- Big Data means lots of data, and to have lots of data, you need to collect lots of data
- Big Data infrastructure without lots of data => low business value
- Lots of data without infrastructure to handle it => high business value, just slow processing
- "You can't have Big Data until you have lots of little data" - Herb Caudill, CTO of DevResults

Strategies of Data Collection
- Sensors, listeners & logs
  - Hardware and software that observe and collect data whenever they see it
  - JavaScript code to collect browser information and user usage patterns
  - Mobile apps that aggressively collect information from their sensors (GPS, network, apps, etc.)
  - Crowdsourcing applications: get many people to contribute data
  - Weather sensors, light sensors, CCTV, microphones, cameras, etc.
  - Server logs, processor logs, RAM logs, error logs, CDR records
  - Transactional data
- Web crawlers
  - Collect data from websites and online sources
  - Careful: web data tends to be dirty and of poor quality; accuracy is low, and extracting information takes high effort
  - Data can provide indicative measurements if there are no other sources of quality data
  - Expect low accuracy from analysis
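As a rough illustration of the "server logs" strategy, the sketch below turns raw log lines into structured records and counts the lines it cannot parse. The log format, the field names and the sample lines are assumptions made for this example (loosely modelled on the Common Log Format); a real collector would match whatever its servers actually emit.

```python
import re
from collections import Counter

# Hypothetical access-log layout, assumed for this sketch only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" size means no body was sent; store it as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def collect(lines):
    """Parse every line, keeping good records and counting the rejects."""
    records, rejected = [], 0
    for line in lines:
        record = parse_line(line)
        if record is None:
            rejected += 1
        else:
            records.append(record)
    return records, rejected


# Made-up sample input, including one dirty line.
SAMPLE_LINES = [
    '10.0.0.1 - - [07/May/2024:10:00:00 +0800] "GET /index.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [07/May/2024:10:00:01 +0800] "GET /missing HTTP/1.1" 404 -',
    'this line is garbage',
]

if __name__ == "__main__":
    records, rejected = collect(SAMPLE_LINES)
    print(Counter(r["status"] for r in records), "rejected:", rejected)
```

Counting rejected lines instead of silently dropping them gives an early signal of how dirty a source is, which matters most for crawled or loosely formatted data.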

Already have systems that are collecting or generating a lot of data, but have yet to tap their potential? You are probably ready for Big Data. If not, start planning and implementing strategies to collect data NOW!

Data Loading

- Data generated in the data collection stage needs to be loaded into a data store
- High-velocity data requires a special strategy for loading into data store systems to enable Big Data and real-time analytics
  - Parallel I/O
  - Parallel execution
  - Scalability to 100s of servers
- Data might need to be queued for real-time analytics
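The queueing and parallel-execution ideas can be sketched on a single machine with Python's standard library. This is only an illustration: the in-process `queue.Queue` stands in for a real message queue, the list `store` stands in for a real data store, and `run_loaders` is a name invented for this example.

```python
import queue
import threading


def run_loaders(events, num_workers=4):
    """Push events through a bounded queue into a shared in-memory 'store'."""
    # Bounded buffer: the producer blocks when the loaders fall behind.
    buffer = queue.Queue(maxsize=100)
    store = []                      # stand-in for the real data store
    store_lock = threading.Lock()   # protects concurrent writes to the store

    def loader():
        while True:
            event = buffer.get()
            if event is None:       # sentinel: no more work for this worker
                return
            with store_lock:
                store.append(event)

    workers = [threading.Thread(target=loader) for _ in range(num_workers)]
    for worker in workers:
        worker.start()

    for event in events:
        buffer.put(event)
    for _ in workers:
        buffer.put(None)            # one sentinel per worker
    for worker in workers:
        worker.join()
    return store


if __name__ == "__main__":
    loaded = run_loaders(range(1000))
    print(len(loaded), "events loaded")
```

The bounded queue is the design choice worth noting: it absorbs bursts of incoming data while applying back-pressure instead of growing without limit. Arrival order in the store is not guaranteed once several loaders run in parallel.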

Software Ecosystem
- Apache Flume
  - Stream loading system: loads streaming transactional data as it gets created
- Apache Kafka
  - Message queue system: holds streaming data in a queue before it is written to disk or passed to a real-time analytics engine
- Apache Sqoop
  - Batch loading of data into storage
- Logstash
  - Listens to log files and stores them into storage
- And many more alternatives doing more or less the same thing

Data Storage

- A very large amount of data requires a low-cost storage strategy
- SAN storage is way too expensive
- You are looking at a storage system that utilizes low-cost normal disks, is built on low-cost commodity hardware, and can be scaled up easily by adding more disks and servers

Software Ecosystem
- Apache Hadoop Distributed File System (HDFS)
  - Default distributed storage system for Hadoop
- GlusterFS / Red Hat Storage
  - Distributed software-defined storage by Red Hat
  - POSIX compliant: can be used for other purposes beyond storing data for Big Data processing
  - HDFS compatible: can replace HDFS in a Hadoop implementation
- Others
  - Tachyon
    - Memory-centric distributed storage system enabling reliable, high-speed data sharing across a cluster
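To give a feel for how a distributed store can stay reliable on cheap hardware, here is a toy sketch that splits a file into blocks and keeps several copies of each block on different nodes. The round-robin placement, the node names and the function names are all assumptions made for illustration; this is not the placement policy of HDFS, GlusterFS or any other real system.

```python
def split_into_blocks(data, block_size):
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks, nodes, replicas=3):
    """Toy round-robin placement: block i lives on `replicas` consecutive nodes."""
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replicas)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one copy on a surviving node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )


if __name__ == "__main__":
    nodes = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical servers
    blocks = split_into_blocks(b"x" * 1000, block_size=256)
    placement = place_blocks(len(blocks), nodes, replicas=3)
    print(placement)
    print(readable_after_failure(placement, "node-b"))
```

Adding a server only means adding an entry to `nodes`, which mirrors the idea of growing capacity by adding commodity machines rather than buying a larger storage array.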

Data Access and Processing

- Processing lots of data requires a scalable data processing strategy
- Reading data from storage, massaging it and processing it requires a flexible framework that can cater to a wide variety of data processing activities
  - SQL for common analytical queries
  - MapReduce & scripting languages for complex analysis on unstructured or messy data
  - In-memory data access for interactive exploration
  - Stream processing for analyzing streaming data as it arrives
  - Distributed processing to handle the scale
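The MapReduce model mentioned above can be illustrated with a single-process word count. This is a sketch of the programming model only: the function names are invented here, and on a real cluster the map and reduce phases would run in parallel across many machines, with the framework handling the shuffle.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of text documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big ideas"]))
```

Because each call to `map_phase` looks at one document and each call to `reduce_phase` looks at one key, both phases can be spread across servers without the pieces needing to talk to each other.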

Software Ecosystem
- Apache Spark
  - General-purpose in-memory analytic engine that supports SQL, scripting, stream processing and machine learning
- Apache Pig
  - Scripting language that simplifies MapReduce programming
- Apache Hive
  - SQL data warehouse on Hadoop
  - Allows traditional SQL developers to query and analyze data using familiar SQL queries while leveraging the power of Hadoop distributed processing
- Apache Storm
  - Stream processing engine for real-time analytics
- Apache NiFi
  - Flow programming for stream processing and real-time analytics
- Hadoop Streaming
  - Process data on Hadoop using any programming language you like through MapReduce programming
- Apache HBase, Accumulo, Phoenix
  - NoSQL databases on Hadoop
  - Applications can write data directly into Hadoop rather than going through an initial database
- And many more!!!
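Stream processing engines are often used to aggregate events over short time windows as they arrive. The sketch below shows that idea in plain Python with a tumbling (non-overlapping) window. It does not use the API of Storm, Spark or NiFi; the function name, the event shape `(timestamp, key)` and the assumption that events arrive in time order are all simplifications made for this example.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) per window.

    `events` is an iterable of (timestamp_in_seconds, key) pairs that is
    assumed to arrive in time order. Windows with no events are skipped.
    """
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # The event belongs to a later window: emit the finished one.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        yield current_start, counts   # emit the last, still-open window


if __name__ == "__main__":
    stream = [(0, "login"), (3, "click"), (9, "click"), (10, "login"), (25, "click")]
    for start, counts in tumbling_window_counts(stream, window_seconds=10):
        print(start, dict(counts))
```

The generator holds only one window in memory at a time, so it can run over an unbounded stream. Real engines add what this sketch leaves out, such as late or out-of-order events and distribution across machines.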

Choosing the right data access tool
- There are plenty of different tools for data access, each with its own strengths and weaknesses
- Different business goals require different approaches to data access and processing
- Understand what you need in the end and what type of analysis is needed, and identify the tool that can cater to that analysis
- Understand the capabilities of your team
- Understand the business problem and goals

Data Analysis

- This is mainly a human activity, with tools helping to speed up some of the development
- A lot of effort is still needed to ensure the correct analysis is created
- Close involvement from business users is highly necessary to ensure a good understanding of the business case and of what analysis can provide value
- Descriptive: analysis of past historical data
- Predictive: estimating a value or forecasting a future value
- Prescriptive: suggesting actions to users
- 2 approaches
  - Data-driven analytics
  - Business-driven analytics
- This process analyzes the large amount of data, aggregates and transforms it, and applies machine learning algorithms to it to generate a smaller dataset of analysis results for user or application consumption
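The difference between descriptive, predictive and prescriptive analysis can be shown on a tiny made-up series. In this sketch the history values, the capacity threshold, the function names and the choice of a straight-line trend are all assumptions for illustration; real predictive work would pick and validate a model suited to the data.

```python
from statistics import mean


def describe(history):
    """Descriptive: summarize what already happened."""
    return {"mean": mean(history), "min": min(history), "max": max(history)}


def forecast_next(history):
    """Predictive: fit a least-squares line through (0, y0), (1, y1), ...
    and evaluate it at the next index."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    xs = range(n)
    x_mean, y_mean = mean(xs), mean(history)
    slope = (
        sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, history))
        / sum((x - x_mean) ** 2 for x in xs)
    )
    intercept = y_mean - slope * x_mean
    return intercept + slope * n


def recommend(predicted, capacity):
    """Prescriptive: turn the forecast into a suggested action."""
    return "add capacity" if predicted > capacity else "no action"


if __name__ == "__main__":
    monthly_orders = [100, 120, 140, 160]   # hypothetical history
    print(describe(monthly_orders))
    prediction = forecast_next(monthly_orders)
    print(prediction, recommend(prediction, capacity=170))
```

Each step consumes the output of the one before it, and each produces a much smaller result than the data it started from, which is what gets handed to users or applications.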

Data-driven analytics
- Let the data speak to you
- Pros:
  - Identify hidden gems in data
  - With the right team, can get really high-value results
  - This is how the BigGuys(tm) handle it: exploring and analyzing data until they find something highly valuable, then utilizing the discovery to do better things
- Cons:
  - Requires an internal analytics / data science team that continuously explores data to identify value
  - Zero scope: very difficult (if not impossible) to scope for a vendor project, so this method is not recommended for tendering to third-party implementers
  - Analysis might stray to topics not related to the business

Business-driven analytics
- Set an analytic goal based on what is important to the business
- Pros:
  - Clear scope enables better design of the data pipeline from collection to consumption
  - Clear resource planning and identification of skillsets needed in the project team
  - Clear goal to focus on
- Cons:
  - Rigid and less flexible in catering to new needs
  - Requires the organization to clearly know what it wants and what is important to its business
- Recommendation: Want to start a Big Data project? Start by identifying the business questions you want answered through data analysis.

Key people involved and their roles
- Data engineer
  - Prepares the data pipeline, cleans data, builds OLAP cubes, preprocesses data for data analysis
- Data scientist
  - Develops machine learning models for classification of data, feature extraction, predictive analytics through learning from historical data, and prescriptive analytics through capturing domain knowledge from the domain expert and building an automated expert system
  - Identifies previously unknown patterns in data
- Domain expert
  - Advises on the business case and domain knowledge, and is the key person representing the business interest, providing information on which analysis results the business deems important
- Big Data infrastructure engineer
  - Ensures that the whole infrastructure is capable of handling the workload needed for the project

Data Serving

- Store analysis results in an easy-to-consume data store
- Usually optimized for speed and presentation
- Usually common operational databases, but can also be special-purpose databases with use-case-specific optimizations
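A minimal sketch of the serving idea: precomputed analysis results are written into a small relational table keyed for the lookups an application will make. SQLite from Python's standard library stands in for whichever serving database is actually chosen; the table name, columns and sample rows are invented for this example.

```python
import sqlite3


def build_serving_store(results):
    """Load precomputed (region, day, total) rows into an in-memory database."""
    conn = sqlite3.connect(":memory:")
    # The primary key doubles as an index for per-region lookups.
    conn.execute(
        "CREATE TABLE daily_sales ("
        "region TEXT, day TEXT, total REAL, PRIMARY KEY (region, day))"
    )
    conn.executemany("INSERT INTO daily_sales VALUES (?, ?, ?)", results)
    conn.commit()
    return conn


def lookup(conn, region):
    """Return the (day, total) rows for one region, oldest first."""
    cursor = conn.execute(
        "SELECT day, total FROM daily_sales WHERE region = ? ORDER BY day",
        (region,),
    )
    return cursor.fetchall()


if __name__ == "__main__":
    # Hypothetical output of an upstream analysis job.
    analysis_results = [
        ("north", "2024-05-01", 120.0),
        ("north", "2024-05-02", 90.5),
        ("south", "2024-05-01", 75.0),
    ]
    conn = build_serving_store(analysis_results)
    print(lookup(conn, "north"))
```

The heavy computation has already happened upstream, so the serving layer only answers small, indexed queries. That is why an ordinary operational database is often enough here.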

Software Ecosystem
- Common relational databases
  - MariaDB, PostgreSQL, MySQL
- Document store databases
  - MongoDB, CouchDB
- Graph databases
  - Optimized for graph queries
  - Neo4j, ArangoDB
- Columnar store databases
  - High-speed analytic queries
  - MonetDB, LucidDB
- NewSQL databases
  - VoltDB
- Real-time/streaming databases
  - Can process and serve streaming data
  - RethinkDB, PipelineDB
- And many more being developed

Data Consumption

- It is not just visualization dashboards
- Any application that utilizes the analysis results to improve its functionality and user experience
  - Mobile apps
  - Intelligent systems
  - Decision support systems
  - Specialized applications
  - Notification systems
  - Automated decision systems
  - Etc.
- Data consumption applications may also collect data and feed it back into the analytic system for continuous improvement of analysis accuracy and quality

Software Components
- Business intelligence systems
  - SpagoBI, Pentaho, etc.
- Visualization libraries
  - D3.js, DC.js, Dimple.js, NVD3.js, heatmap.js
  - Matplotlib, Bokeh, Pygal and many more
- Any software development framework and platform

Development Tools

- Tools and platforms to assist in managing the development of analytic algorithms
- Help connect all stages of data processing, from loading data to sending it for serving

Software Ecosystem
- Hadoop user interface
  - Ambari
  - Hue
- Data science platform
  - pydatalab
  - Jupyter
  - Apache Zeppelin
- ETL/ELT IDE
  - Talend
  - Pentaho
- Data mining tool
  - RapidMiner
- Programming libraries
  - Pandas
  - NumPy
  - SciPy
  - SQLAlchemy
  - Scikit-learn
  - TensorFlow
  - Gensim
  - NLTK
  - +hundreds more

Other Components

Other components in a Big Data infrastructure
- Cluster management
  - Apache Ambari
- Cluster security
  - Apache Knox
  - Apache Ranger
- Workflow and scheduling
  - Apache Oozie
  - Apache Falcon

Introduction To Hortonworks Hadoop

- Big Data Ecosystem is Big!
- So which product should I buy exactly?
  - It all depends on your use case
- Hortonworks knows this challenge, and created Hortonworks Data Platform, a Hadoop distribution which provides pretty much all you need to get started with Big Data

Hortonworks Data Platform 2.3

Full Component Version List of HDP 2.3

Ambari: Cluster Manager

Ambari: HDFS Browser

Ambari: Hive UI

Hortonworks Support

For more information

Reference Links
- http://hortonworks.com/download - Hortonworks Data Platform download link
- https://github.com/onurakpolat/awesome-bigdata - curated list of Big Data frameworks and resources
- https://github.com/youngwookim/awesome-hadoop - curated list of Hadoop and Hadoop ecosystem resources
- https://github.com/okulbilisim/awesome-datascience - curated list of data science resources
- https://github.com/koslab/ansible-pydatalab - pydatalab, a Python data science platform; download the 1.0-preview release at: https://goo.gl/QyWvq9