
An Overview of Open Source Big Data Ecosystem

Big Data Challenges

Open Source Drives Big Data Innovation

© ABYRES SDN BHD

"Empowerment of individuals is a key part of what makes open source work, since in the end, innovations tend to come from small groups, not from large, structured efforts." - Tim O'Reilly

"Open source is a development method of software that harnesses the power of distributed peer review and transparency of process. The promise of open source is better quality, higher reliability, more flexibility, lower cost and an end to predatory vendor lock-in." - Jim Whitehurst, President and CEO of Red Hat

Bird's-Eye View of the Big Data Ecosystem

Data Collection

- The key component to have in place before any Big Data initiative
- Big Data means lots of data, and to have lots of data, you need to collect lots of data
- Big Data infrastructure without lots of data => low business value
- Lots of data without infrastructure to handle it => high business value, just slow processing
- "You can't have Big Data until you have lots of little data" - Herb Caudill, CTO of DevResults

Strategies of Data Collection
- Sensors, listeners & logs
  - Hardware and software that observe and collect data whenever they see it
  - JavaScript code to collect browser information and user usage patterns
  - Mobile apps that aggressively collect information from their sensors (GPS, network, apps, etc.)
  - Crowdsourcing applications: get many people to contribute data
  - Weather sensors, light sensors, CCTV, microphones, cameras, etc.
  - Server logs, processor logs, RAM logs, error logs, call detail records (CDRs)
  - Transactional data
- Web crawlers
  - Collect data from websites and online sources
  - Careful: web data tends to be dirty and of poor quality; accuracy is low, and extracting information takes a lot of effort
  - Data can provide indicative measurements if there are no other sources of quality data
  - Expect low accuracy from analysis
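To make the "server logs" strategy concrete, here is a minimal sketch of turning raw log lines into structured records while counting the dirty lines that cannot be parsed. The log layout, field names and sample lines are assumptions made up for illustration; real logs will need their own pattern.

```python
import re
from collections import Counter

# Hypothetical log layout: "<ip> <timestamp> <METHOD> <path> <status>"
LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) (?P<timestamp>\S+) (?P<method>[A-Z]+) (?P<path>\S+) (?P<status>\d{3})"
)


def parse_log_line(line):
    """Return a dict of fields, or None when the line does not match the layout."""
    match = LOG_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


def collect(lines):
    """Split raw lines into parsed records and a count of rejected lines."""
    records, rejected = [], 0
    for line in lines:
        record = parse_log_line(line)
        if record is None:
            rejected += 1
        else:
            records.append(record)
    return records, rejected


# Made-up sample input, including one dirty line.
sample_lines = [
    "10.0.0.1 2016-05-01T10:00:00 GET /index.html 200",
    "10.0.0.2 2016-05-01T10:00:03 POST /login 302",
    "this line is dirty and cannot be parsed",
    "10.0.0.1 2016-05-01T10:00:07 GET /missing 404",
]

records, rejected = collect(sample_lines)
status_counts = Counter(record["status"] for record in records)
```

Keeping a count of rejected lines is one simple way to see how dirty a source is before trusting any analysis built on it.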

Already have systems that are collecting or generating a lot of data, but have not yet tapped its potential? You are probably ready for Big Data. If not, start planning and implementing strategies to collect data NOW!

Data Loading

- Data generated in the data collection stage needs to be loaded into a data store
- High-velocity data requires a special strategy for loading into data store systems to enable Big Data and real-time analytics
  - Parallel I/O
  - Parallel execution
  - Scalability to 100s of servers
- Data might need to be queued for real-time analytics
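The queueing and parallel-loading ideas can be sketched on a single machine with Python's standard library. This is only an illustration of the pattern, not how any particular loading tool works; the worker count, batch size and the in-memory list standing in for a data store are all assumptions.

```python
import queue
import threading

NUM_WORKERS = 4   # hypothetical degree of parallelism
BATCH_SIZE = 10   # hypothetical number of records written per batch
STOP = object()   # sentinel telling a worker to finish


def loader(events, store, lock):
    """Drain the queue and append records to the store in batches."""
    batch = []
    while True:
        item = events.get()
        if item is STOP:
            break
        batch.append(item)
        if len(batch) >= BATCH_SIZE:
            with lock:
                store.extend(batch)
            batch = []
    if batch:  # flush whatever is left over
        with lock:
            store.extend(batch)


def load_all(records):
    """Queue incoming records and load them with several workers in parallel."""
    events = queue.Queue(maxsize=100)  # bounded, so a fast producer has to wait
    store, lock = [], threading.Lock()
    workers = [
        threading.Thread(target=loader, args=(events, store, lock))
        for _ in range(NUM_WORKERS)
    ]
    for worker in workers:
        worker.start()
    for record in records:
        events.put(record)
    for _ in workers:  # one stop signal per worker
        events.put(STOP)
    for worker in workers:
        worker.join()
    return store


loaded = load_all(range(1000))
```

The bounded queue is the design choice worth noting: when the store cannot keep up, the producer blocks instead of exhausting memory.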

Software Ecosystem
- Apache Flume
  - Stream loading system: loads streaming transactional data as it gets created
- Apache Kafka
  - Message queue system: holds streaming data in a queue before it is written to disk or passed to a real-time analytics engine
- Apache Sqoop
  - Batch loading of data into storage
- Logstash
  - Listens to log files and stores them into storage
- And many more alternatives doing more or less the same thing

Data Storage

- Very large amounts of data require a low-cost storage strategy
- SAN storage is way too expensive
- You are looking at a storage system that utilizes low-cost normal disks, is built on low-cost commodity hardware, and can be scaled up easily by adding more disks and servers

Software Ecosystem
- Apache Hadoop Distributed File System (HDFS)
  - Default distributed storage system for Hadoop
- GlusterFS / Red Hat Storage
  - Distributed software-defined storage by Red Hat
  - POSIX compliant: can be used for purposes beyond storing data for Big Data processing
  - HDFS compatible: can replace HDFS in a Hadoop implementation
- Others
  - Tachyon: memory-centric distributed storage system enabling reliable, high-speed data sharing across a cluster

Data Access and Processing

- Processing lots of data requires a scalable data processing strategy
- Reading data from storage, massaging and processing it requires a flexible framework that can cater to a wide variety of data processing activities
  - SQL for common analytical queries
  - MapReduce & scripting languages for complex analysis on unstructured or messy data
  - In-memory data access for interactive exploration
  - Stream processing for analyzing streaming data as it arrives
  - Distributed processing to handle the scale
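For readers new to MapReduce, the programming model can be sketched in a few lines of plain Python. This runs in a single process and only illustrates the map, shuffle and reduce steps; a real framework distributes each step across many servers. The function names and sample documents are made up for illustration.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for document in documents for pair in map_phase(document))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up sample input.
documents = ["big data needs lots of data", "lots of little data"]
counts = word_count(documents)
```

Because each document is mapped independently and each key is reduced independently, both phases can be spread over many machines, which is what makes the model scale.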

Software Ecosystem
- Apache Spark
  - General-purpose in-memory analytics engine that supports SQL, scripting, stream processing and machine learning
- Apache Pig
  - Scripting language that simplifies MapReduce programming
- Apache Hive
  - SQL data warehouse on Hadoop
  - Allows traditional SQL developers to query and analyze data using familiar SQL queries while leveraging the power of Hadoop distributed processing
- Apache Storm
  - Stream processing engine for real-time analytics
- Apache NiFi
  - Flow programming for stream processing and real-time analytics
- Hadoop Streaming
  - Process data on Hadoop using any programming language you like through MapReduce programming
- Apache HBase, Accumulo, Phoenix
  - NoSQL databases on Hadoop
  - Applications can write data directly into Hadoop rather than going through an initial database
- And many more!!!
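The appeal of SQL-based tools is that a familiar query expresses the whole analysis. The sketch below uses Python's built-in sqlite3 module purely as a local stand-in so the example can run anywhere; it is not Hive or Spark, and the table, columns and rows are hypothetical.

```python
import sqlite3

# In-memory database used only as a stand-in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("u1", "/home", 12),
        ("u2", "/home", 30),
        ("u1", "/pricing", 45),
        ("u3", "/pricing", 15),
        ("u3", "/home", 6),
    ],
)

# A typical analytical query: views and average time spent per page.
rows = conn.execute(
    """
    SELECT page, COUNT(*) AS views, AVG(seconds) AS avg_seconds
    FROM page_views
    GROUP BY page
    ORDER BY views DESC
    """
).fetchall()
conn.close()
```

The same GROUP BY style of query is what a SQL developer would write against a distributed engine; only the engine executing it changes.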

Choosing the right data access tool
- There are plenty of different tools for data access, each with its own strengths and weaknesses
- Different business goals require different approaches to data access and processing
- Understand what you need in the end and what type of analysis is needed, and identify the tool that can cater to that analysis
- Understand the capabilities of your team
- Understand the business problem and goals

Data Analysis

- This is mainly a human activity, with tools helping to speed some development up
- But a lot of effort is still needed to ensure the analysis is correct
- Close involvement from business users is highly necessary to ensure a good understanding of the business case and of which analyses can provide value
  - Descriptive: analysis of past historical data
  - Predictive: estimating a value or forecasting a future value
  - Prescriptive: suggesting actions to users
- 2 approaches
  - Data-driven analytics
  - Business-driven analytics
- This process analyzes large amounts of data, aggregates and transforms them, and applies machine learning algorithms to generate a smaller dataset of analysis results for user or application consumption
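The difference between descriptive, predictive and prescriptive analysis can be shown on a tiny made-up series. This sketch assumes a simple straight-line trend fitted by ordinary least squares; the sales figures, the target and the suggested actions are hypothetical, and real projects would pick models to suit their data.

```python
from statistics import mean

# Made-up monthly sales figures, oldest first.
monthly_sales = [100.0, 110.0, 120.0, 130.0, 140.0]


def describe(values):
    """Descriptive: summarize what already happened."""
    return {
        "count": len(values),
        "mean": mean(values),
        "min": min(values),
        "max": max(values),
    }


def fit_line(values):
    """Least-squares straight line through the points (0, v0), (1, v1), ..."""
    xs = range(len(values))
    x_bar, y_bar = mean(xs), mean(values)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, values)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar


def forecast(values, steps_ahead=1):
    """Predictive: extend the fitted trend past the last observation."""
    slope, intercept = fit_line(values)
    return intercept + slope * (len(values) - 1 + steps_ahead)


def suggest(values, target):
    """Prescriptive: turn the forecast into a suggested action (toy rule)."""
    if forecast(values) < target:
        return "increase promotion"
    return "keep current plan"


summary = describe(monthly_sales)
next_month = forecast(monthly_sales)
action = suggest(monthly_sales, target=160.0)
```

Even in this toy form, the prescriptive step depends on a business-supplied target, which is why close involvement from business users matters.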

Data-driven analytics
- Let the data speak to you
- Pros:
  - Identify hidden gems in data
  - With the right team, can get really high-value results
  - This is how the BigGuys(tm) handle it: exploring and analyzing data until they find something highly valuable, and using the discovery to do better things
- Cons:
  - Requires an internal analytics / data science team that continuously explores data to identify value
  - 0 scope: very difficult (if not impossible) to scope for a vendor project, so this method is not recommended for tendering to 3rd-party implementers
  - Analysis might stray to topics not related to the business

Business-driven analytics
- Set an analytic goal based on what is important to the business
- Pros:
  - Clear scope enables better design of the data pipeline from collection to consumption
  - Clear resource planning and identification of the skillsets needed in the project team
  - Clear goal to focus on
- Cons:
  - Rigid and less flexible in catering to new needs
  - Requires the organization to clearly know what it wants and what is important to its business
- Recommendation: Want to start a Big Data project? Start by identifying the business questions you want answered through data analysis.

Key people involved and their roles
- Data engineer
  - Prepares the data pipeline, cleans data, builds OLAP cubes, preprocesses data for data analysis
- Data scientist
  - Develops machine learning models for classification of data, feature extraction, predictive analytics through learning from historical data, and prescriptive analytics through capturing domain knowledge from domain experts and building an automated expert system
  - Identifies previously unknown patterns in data
- Domain expert
  - Advises on the business case and domain knowledge, and acts as the key person representing the business interest, providing information on which analysis results the business deems important
- Big Data infrastructure engineer
  - Ensures that the whole infrastructure is capable of handling the workload needed for the project

Data Serving

- Store analysis results in an easy-to-consume data store
- Usually optimized for speed and presentation
- Usually common operational databases, but can also be special-purpose databases with use-case-specific optimizations
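A serving store typically holds small, precomputed results keyed for fast lookup. As a rough sketch, the example below uses Python's built-in sqlite3 module as a stand-in for an operational database; the table name, metrics and values are hypothetical.

```python
import sqlite3


def open_serving_store():
    """Create an in-memory table keyed by (day, metric) for fast lookups."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE daily_summary ("
        "day TEXT, metric TEXT, value REAL, PRIMARY KEY (day, metric))"
    )
    return conn


def publish(conn, results):
    """Write analysis results; re-publishing a key replaces the old value."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR REPLACE INTO daily_summary VALUES (?, ?, ?)", results
        )


def lookup(conn, day, metric):
    """Return the stored value for one key, or None when it is missing."""
    row = conn.execute(
        "SELECT value FROM daily_summary WHERE day = ? AND metric = ?",
        (day, metric),
    ).fetchone()
    return None if row is None else row[0]


store = open_serving_store()
publish(store, [("2016-05-01", "visits", 1200.0), ("2016-05-01", "signups", 35.0)])
publish(store, [("2016-05-01", "visits", 1250.0)])  # a re-run corrects the figure
visits = lookup(store, "2016-05-01", "visits")
missing = lookup(store, "2016-05-02", "visits")
```

Making publication idempotent means an analysis job can be re-run safely without leaving duplicate rows for consumers to trip over.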

Software Ecosystem
- Common relational databases
  - MariaDB, PostgreSQL, MySQL
- Document store databases
  - MongoDB, CouchDB
- Graph databases
  - Optimized for graph queries
  - Neo4j, ArangoDB
- Columnar store databases
  - High-speed analytic queries
  - MonetDB, LucidDB
- NewSQL databases
  - VoltDB
- Real-time/streaming databases
  - Can process and serve streaming data
  - RethinkDB, PipelineDB
- And many more being developed

Data Consumption

- It is not just visualization dashboards
- Any application that utilizes the analysis results to improve its functionality and user experience
  - Mobile apps
  - Intelligent systems
  - Decision support systems
  - Specialized applications
  - Notification systems
  - Automated decision systems
  - Etc.
- Data consumption applications may also collect data and feed it back into the analytics system for continuous improvement of analysis accuracy and quality
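As one illustration of a consumer that is not a dashboard, here is a minimal sketch of a notification application that reads scored analysis results and records user feedback to send back to the analytics side. The class, field names, threshold and sample results are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationApp:
    """Toy consumer: alerts on high scores and collects feedback on them."""

    threshold: float
    feedback: list = field(default_factory=list)

    def consume(self, results):
        """Return alert messages for results scoring above the threshold."""
        return [
            f"ALERT {result['id']}: score {result['score']:.2f}"
            for result in results
            if result["score"] > self.threshold
        ]

    def record_feedback(self, result_id, was_useful):
        """Keep the user's verdict so it can be fed back into the analysis."""
        self.feedback.append({"id": result_id, "useful": was_useful})


# Made-up analysis results, e.g. anomaly scores per transaction.
app = NotificationApp(threshold=0.8)
alerts = app.consume(
    [{"id": "txn-1", "score": 0.95}, {"id": "txn-2", "score": 0.40}]
)
app.record_feedback("txn-1", was_useful=False)
```

The recorded feedback is exactly the kind of newly collected data that can flow back into the pipeline to improve accuracy over time.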

Software Components
- Business intelligence systems
  - SpagoBI, Pentaho, etc.
- Visualization libraries
  - D3.js, DC.js, Dimple.js, NVD3.js, heatmap.js
  - Matplotlib, Bokeh, Pygal and many more
- Any software development framework and platform

Development Tools

Software Ecosystem
- Hadoop user interface
  - Ambari
  - Hue
- Data science platform
  - pydatalab
  - Jupyter
  - Apache Zeppelin
- ETL/ELT IDE
  - Talend
  - Pentaho
- Data mining tool
  - RapidMiner
- Programming libraries
  - Pandas, NumPy, SciPy, SQLAlchemy, scikit-learn, TensorFlow, Gensim, NLTK
  - + hundreds more

Other Components

Introduction To Hortonworks Hadoop

Big Data Ecosystem is Big!
- So which product should I buy exactly?
- It all depends on your use case
- Hortonworks knows this challenge, and created Hortonworks Data Platform, a Hadoop distribution which provides pretty much all you need to get started with Big Data

For more information

Reference Links
- http://hortonworks.com/download - Hortonworks Data Platform download link
- https://github.com/onurakpolat/awesome-bigdata - curated list on Big Data frameworks and resources
- https://github.com/youngwookim/awesome-hadoop - curated list on Hadoop and Hadoop ecosystem resources
- https://github.com/okulbilisim/awesome-datascience - curated list on data science resources
- https://github.com/koslab/ansible-pydatalab - pydatalab, a Python data science platform; download the 1.0-preview release at: https://goo.gl/QyWvq9
