Data Intensive Computing- Slide in Presentation

AnandNayyar1 36 views 48 slides May 10, 2024

About This Presentation

Slides introducing the concept of data-intensive computing


Slide Content

Data-intensive computing Inf-2202 Concurrent and Data-intensive Programming University of Tromsø, Fall 2015 Lars Ailo Bongo ([email protected])

Outline
  Today: Introduction to data-intensive computing
    Data-intensive computing platforms
    Google File System, MapReduce
  15/10: Guest lecture (Inge Alexander Raknes)
    Scala
    Spark
    AWS
  22/10: Spark ecosystem
    GraphX, Shark, MLlib
  3/11: Hadoop ecosystem
    HBase, Impala? Storm?

Data-intensive Computing
  Big data
  + Machine learning/statistics: FYS-3012 Pattern recognition (Linear algebra & statistics)
  + Distributed systems: INF-3200, INF-3203, INF-3201, and more
  (= Data analytics)

Big Data Sources
  Human-produced content: videos, photos, audio…
  Human activity: online activity, GPS traces, tax records…
  Scientific instruments: CERN LHC, Sloan Digital Sky Survey, DNA sequencers…
  Sensor data

Big data players
  Industry: Google, Facebook, Twitter, Amazon, Netflix, Visa, …
    Use data to provide services
    Use data to make money
    Have developed (most of the) technology for managing and processing petascale datasets
  Government: NSA, Skatteetaten, Kartverket, e-resept, …
    Use data to make (hopefully) informed decisions
    Make data available for public and commercial services
  Science: biology, physics, medicine, social sciences, …
    Use data for novel scientific insights
    Data should be open access, indexed, reusable, …

Big Data How big?

Dataset Size < 4GB < 512GB TBs PBs

Statistical Analysis (N x M)
  Billions of samples & few dimensions, or
  Billions of samples & thousands of dimensions, or
  Thousands of samples & thousands of dimensions
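
To relate the three N x M cases to dataset sizes, a rough memory-footprint estimate can help. The sketch below assumes a dense matrix of 8-byte floating-point values and illustrative counts (10^9 for "billions", 10^3 for "thousands", 10 for "few"); these figures and the function names are assumptions for illustration, not taken from the slides.

```python
def matrix_bytes(n_samples: int, n_dims: int, bytes_per_value: int = 8) -> int:
    """Size in bytes of a dense n_samples x n_dims matrix."""
    return n_samples * n_dims * bytes_per_value


def human(nbytes: float) -> str:
    """Format a byte count using decimal (powers of 1000) units."""
    value = float(nbytes)
    for unit in ["B", "kB", "MB", "GB", "TB", "PB"]:
        if value < 1000:
            return f"{value:.1f} {unit}"
        value /= 1000
    return f"{value:.1f} EB"


if __name__ == "__main__":
    # Illustrative (assumed) shapes for the three cases on the slide.
    cases = {
        "billions x few": (10**9, 10),
        "billions x thousands": (10**9, 10**3),
        "thousands x thousands": (10**3, 10**3),
    }
    for name, (n, m) in cases.items():
        print(f"{name}: {human(matrix_bytes(n, m))}")
```

Under these assumptions the three cases come to roughly 80 GB, 8 TB, and 8 MB, so the same kind of analysis can fit on a laptop or need a cluster depending on the shape of the data.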

Data Analysis Tool

Computation Time <100ms seconds minutes hours weeks

Optimizations
  R or Matlab implementation
  Algorithm parameter tuning
  C++/Java/… implementation
  Data structure optimization
  Multi-threaded parallelization (single machine)
  Distributed parallelization (multiple machines)

Outline History of Big Data + Biology My research Interactive data analytics Elixir infrastructure Other interesting stuff Google File System MapReduce

Jim Gray's Talk

“Data, data everywhere” Source: The Economist [http://www.economist.com/node/15557443?story_id=15557443]

Scientific Storage Systems Source: http://www.usenix.org/events/lisa10/tech/slides/cass.pdf

Data growth in the life sciences PB

Increase in bioinformaticians? @ UiT

My Lab: Biological Data Processing Systems Lab
  3 + 1 PhD students: Edvard Pedersen, Einar Holsbø, Bjørn Fjukstad + Espen Mikal Robertsen
  2 engineers: Inge Alexander Raknes, Giacomo Tartari
  3 master students: Kenneth Knudsen, Morten Grønnesby, Jarl Fagerli
  http://bdps.cs.uit.no

Research Goal

Norwegian Women and Cancer (NOWAC)
  Large and unique biobank of blood samples
  Understand development of cancer (and how to avoid it)
  Develop diagnosis approaches
  Develop or improve treatment
  http://site.uit.no/nowac/

Center for Bioinformatics (SfB)
  Interdisciplinary research and services
    Computer science
    Biotechnology
    Bioinformatics
  Special focus on marine metagenomics
  Commercial exploitation of marine resources
  http://sfb.cs.uit.no

Interactive Data Exploration

Interactive Data Exploration Components
  Human experts for data analysis
  Interactive user interface
  Analysis methods and models
  Data management and backend processing
  Compute and storage resources

Data-intensive computing platforms

Outline (part 2)
  Hardware platforms
  Infrastructure systems
    Google File System
    MapReduce
  Ecosystems

Hardware Requirements
  Process 1 TB of data?
  Process 1 PB of data?

Single Computer

Supercomputer
  Disadvantages:
    Centralized storage has limited bandwidth
    High cost of interconnect
  Diagram labels: Infiniband 56 Gbit/s, 164 Gbit/s

Commodity Component Distributed System
  Diagram label: SATA 6 Gbit/s
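
The 1 TB and 1 PB questions can be made concrete with back-of-the-envelope arithmetic. The sketch below is an illustration under stated assumptions: it treats the SATA 6 Gbit/s link rate from the slide as the per-machine read rate (an optimistic upper bound, since encoding overhead and real disk throughput lower it), uses decimal units, and assumes the data is spread evenly so that all machines read in parallel. The function name and the machine count of 1000 are hypothetical.

```python
def scan_seconds(data_bytes: int, link_gbit_per_s: float, n_machines: int = 1) -> float:
    """Time to read data_bytes once, split evenly over n_machines local links."""
    total_bits = data_bytes * 8
    aggregate_bits_per_s = link_gbit_per_s * 1e9 * n_machines
    return total_bits / aggregate_bits_per_s


TB = 10**12  # decimal terabyte
PB = 10**15  # decimal petabyte

if __name__ == "__main__":
    # One machine reading from one SATA link.
    print(f"1 TB, 1 machine:     {scan_seconds(TB, 6) / 60:.1f} minutes")
    print(f"1 PB, 1 machine:     {scan_seconds(PB, 6) / 86400:.1f} days")
    # Assumed cluster of 1000 commodity machines, each reading its local share.
    print(f"1 PB, 1000 machines: {scan_seconds(PB, 6, 1000) / 60:.1f} minutes")
```

Under these assumptions a single machine needs about 22 minutes for 1 TB and about 15 days for 1 PB, while 1000 machines reading local disks in parallel bring 1 PB back to about 22 minutes. This is the usual argument for moving computation to where the data is stored.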

Hadoop

Google File System (GFS)
  https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/gfs.pdf
  Hadoop Distributed File System implements GFS design
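
The core storage idea in the GFS design can be sketched in a few lines: a file is split into fixed-size chunks, and each chunk is replicated on several chunkservers. The 64 MB chunk size and three replicas follow the defaults described in the GFS paper. The round-robin placement, the server names, and the function names below are simplifying assumptions for illustration; the real master also weighs disk utilization and rack layout.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, the default in the GFS paper
REPLICAS = 3                   # default replication factor in the GFS paper


def split_into_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> list[tuple[int, int]]:
    """Return (offset, length) pairs covering a file of file_size bytes."""
    return [
        (offset, min(chunk_size, file_size - offset))
        for offset in range(0, file_size, chunk_size)
    ]


def place_replicas(n_chunks: int, servers: list[str], replicas: int = REPLICAS) -> dict[int, list[str]]:
    """Assign each chunk to `replicas` distinct servers (hypothetical round-robin policy)."""
    if replicas > len(servers):
        raise ValueError("need at least as many servers as replicas")
    return {
        idx: [servers[(idx + r) % len(servers)] for r in range(replicas)]
        for idx in range(n_chunks)
    }


if __name__ == "__main__":
    chunks = split_into_chunks(200 * 1024 * 1024)  # a 200 MB file
    servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]  # made-up chunkserver names
    for idx, holders in place_replicas(len(chunks), servers).items():
        print(idx, chunks[idx], holders)
```

Because every chunk lives on three different servers, losing one commodity machine does not lose data, and readers can choose any replica.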

MapReduce
  http://research.google.com/archive/mapreduce-osdi04-slides/index.html
  Hadoop MapReduce implements Google MapReduce design
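
The MapReduce programming model can be illustrated with the classic word-count example. The sketch below runs in a single process and only mimics the three logical phases (map, shuffle, reduce). It is not the Hadoop or Google API, and all function names are hypothetical. In a real system the map and reduce calls run on many machines and the shuffle moves data over the network.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(document: str) -> Iterator[tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all emitted values by key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine all values for one key into a single count."""
    return (key, sum(values))


def map_reduce(documents: list[str]) -> dict[str, int]:
    """Run the three phases sequentially over a list of documents."""
    emitted = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(emitted)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(map_reduce(["big data", "big compute"]))
```

The user writes only the map and reduce functions. Partitioning, scheduling, and fault tolerance are left to the framework, which is what makes the model practical on large clusters of commodity machines.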

Spark Ecosystem