Big Data Open Source Technologies

1,189 views 11 slides Oct 17, 2020

About This Presentation

Presentation on open source big data technologies for data analytics


Slide Content

Presentation on Big Data Open Source Technologies
Presented by: Neeraj Rathore

What is Big Data?
Big Data refers to the large volumes of data that pour in from many different sources and in different formats (structured, semi-structured, and unstructured). Because this data is so varied, traditional relational database systems are poorly suited to handling it.

What are Big Data Technologies and Why Are They Needed?
Big data technologies are software utilities designed to analyse, process, and extract information from extremely large and complex data sets that traditional data processing software cannot handle. They are needed to analyse huge volumes of real-time data and to draw conclusions and make predictions that help reduce future risk.

Top Big Data Technologies
Top big data technologies are divided into four fields based on their usage:
Data Storage: Big data storage is a storage infrastructure designed specifically to store, manage, and retrieve massive amounts of data. It enables quick processing and retrieval of large quantities of data.
Data Analytics: Data analytics is the process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information, informing conclusions, and supporting decision making.
Data Mining: Data mining involves exploring and analysing large amounts of data to find patterns. Its goal is usually either classification or prediction.
Data Visualisation: Data visualisation is the practice of translating information into a visual context, such as a map or graph, to make data easier for the human brain to understand.

Open Source Big Data Technologies for Storage & Management
Apache Hadoop: The Apache Hadoop software library is a big data framework that allows distributed processing of large data sets across clusters of computers. Its file system, HDFS (Hadoop Distributed File System), is used for storing data.
Developed by: Apache Software Foundation; first released in 2006, with version 1.0 following in December 2011.
Written in: Java
Companies using it: Microsoft, IBM, Intel, MapR, Cloudera, Hortonworks, etc.
Cassandra: The Apache Cassandra database provides effective management of large amounts of data. It supports replication of data across multiple data centers for scalability and offers very good fault tolerance and low latency.
Developed by: Originally developed at Facebook and released as open source in July 2008; now an Apache Software Foundation project.
Written in: Java
Companies using it: Netflix, Walmart, Uber, McDonald's, etc.

MongoDB: MongoDB is an open-source, cross-platform NoSQL database with many built-in features.
Developed by: MongoDB Inc., released on 11 February 2009.
Written in: C++, Go, JavaScript, Python
Apache HBase: Apache HBase is a popular and highly efficient column-oriented NoSQL database built on top of HDFS. It supports real-time read/write operations on large datasets, with data stored as key/value pairs.
Developed by: Apache Software Foundation, released on 28 March 2008.
Written in: Java
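The key/value, column-oriented idea behind HBase can be illustrated with a small in-memory sketch. This is a toy model only, not the HBase client API: the class name `ToyColumnStore`, its methods, and the `family:qualifier` column strings are all hypothetical names chosen for illustration, and real HBase adds versioning, regions, and persistence on HDFS.

```python
from collections import defaultdict


class ToyColumnStore:
    """Toy model: row key -> "family:qualifier" column -> value.

    Hypothetical illustration only; this is NOT the HBase API.
    """

    def __init__(self):
        # Rows are created lazily the first time a cell is written.
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write one cell, addressed by row key and column name."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or a copy of the whole row if no column is given."""
        row = self._rows.get(row_key, {})
        if column is None:
            return dict(row)
        return row.get(column)

    def scan_column(self, column):
        """Return {row_key: value} for every row that has this column."""
        return {
            key: cols[column]
            for key, cols in sorted(self._rows.items())
            if column in cols
        }


store = ToyColumnStore()
store.put("user1", "info:name", "Asha")
store.put("user1", "stats:visits", 3)
store.put("user2", "info:name", "Ravi")

print(store.get("user1", "info:name"))
print(store.scan_column("info:name"))
```

Rows do not need to share the same columns, which is one reason this style of store suits sparse, varied data.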

Open Source Big Data Technologies for Data Analytics
Apache Spark: An open-source big data tool that fills gaps in Apache Hadoop's data processing. Spark can handle both batch data and real-time data. Because Spark processes data in memory, it is much faster than traditional disk-based processing.
Developed by: Apache Software Foundation
Written in: Scala, with APIs for Java, Python, and R
Apache Hive: Hive lets programmers analyse large data sets on Hadoop. It helps with querying and managing large datasets quickly, using an SQL-like query language (HiveQL).
Developed by: Apache Software Foundation, released on 1 October 2010.
Written in: Java

Hadoop MapReduce: A programming model, or pattern, used to process big data stored in the Hadoop Distributed File System (HDFS). It facilitates processing by splitting petabytes of data into smaller chunks. The logic is executed on the server where the data already resides, which makes processing quicker.
Apache Kafka: A distributed streaming platform that aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.
Developed by: Apache Software Foundation in 2011.
Written in: Scala, Java
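The MapReduce pattern described above (split the data into chunks, process each chunk, then combine the results) can be sketched as a word count in plain Python. This is a single-process illustration under stated assumptions, not Hadoop's API: the function names `map_phase`, `shuffle_phase`, `reduce_phase`, and `word_count` are hypothetical, and in a real cluster each chunk would be mapped on the node that already stores it.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle step: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce the results."""
    mapped = []
    for chunk in chunks:  # each chunk stands in for one block of a large file
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle_phase(mapped))


chunks = ["big data big tools", "data tools for big data"]
counts = word_count(chunks)
print(counts)
```

Because the map step looks at one chunk at a time and the reduce step looks at one key at a time, both steps can be spread across many machines without changing the logic.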

Open Source Big Data Technologies for Data Mining
Presto: An open-source distributed SQL query engine for running analytic queries against data sources of all sizes, ranging from gigabytes to petabytes.
Developed by: Facebook, released as open source in 2013.
Written in: Java
Elasticsearch: Based on the Lucene library. It provides a distributed, multitenant-capable, full-text search engine with an HTTP web interface and schema-free JSON documents.
Developed by: Elastic NV (company founded in 2012; Elasticsearch itself was first released in 2010).
Written in: Java
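Full-text search engines of the kind described above are typically built around an inverted index, which maps each term to the documents that contain it. The following is a minimal sketch of that idea in plain Python. It is not the Elasticsearch or Lucene API: `build_index`, `search`, and the sample documents are hypothetical, and real engines add text analysis, relevance scoring, and distribution across nodes.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase token in a document's "body" to the ids containing it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for token in doc["body"].lower().split():
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Schema-free documents: each one may carry different fields.
docs = {
    1: {"title": "Intro", "body": "Hadoop stores big data"},
    2: {"body": "Spark processes big data in memory", "author": "example"},
    3: {"body": "Charts make data easy to read"},
}

index = build_index(docs)
print(search(index, "big data"))
```

Looking up a term is a single dictionary access regardless of how many documents exist, which is what makes this structure attractive for search over large collections.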

Open Source Technologies for Data Visualisation
Candela: Candela is a data visualisation package made available through the Resonant platform. It distinguishes itself from other tools by providing a full suite of data visualisation tools.
Charted: An open-source tool that automatically visualises data. Charted is perhaps one of the easiest data visualisation tools around: it simply requires a link to a .csv file or a Google Sheets location. Hit Go, and Charted creates a visual display using a bar or line chart.
Developed by: Medium's Product Science team in 2013.

Thank you