Data-Intensive Computing: Slide Presentation

AnandNayyar1 36 views 48 slides May 10, 2024

About This Presentation

Slides introducing the concept of data-intensive computing

Slide Content

Data-intensive computing
INF-2202 Concurrent and Data-intensive Programming
University of Tromsø, Fall 2015
Lars Ailo Bongo ([email protected])

Outline
- Today:
  - Introduction to data-intensive computing
  - Data-intensive computing platforms
  - Google File System, MapReduce
- 15/10: Guest lecture (Inge Alexander Raknes)
  - Scala
  - Spark
  - AWS
- 22/10: Spark ecosystem
  - GraphX, Shark, MLlib
- 3/11: Hadoop ecosystem
  - HBase, Impala? Storm?

Data-intensive Computing
- Big data
- + Machine learning/statistics
  - FYS-3012 Pattern recognition (linear algebra & statistics)
- + Distributed systems
  - INF-3200, INF-3203, INF-3201, and more
- (= Data analytics)

Big Data Sources
- Human-produced content
  - Videos, photos, audio…
- Human activity
  - Online activity, GPS traces, tax records…
- Scientific instruments
  - CERN LHC, Sloan Digital Sky Survey, DNA sequencers…
- Sensor data

Big data players
- Industry: Google, Facebook, Twitter, Amazon, Netflix, Visa, …
  - Use data to provide services
  - Use data to make money
  - Have developed (most of the) technology for managing and processing peta-scale datasets
- Government: NSA, Skatteetaten, Kartverket, e-resept, …
  - Use data to make (hopefully) informed decisions
  - Make data available for public and commercial services
- Science: biology, physics, medicine, social sciences, …
  - Use data for novel scientific insights
  - Data should be open access, indexed, reusable, …

Big Data
How big?

Dataset Size
< 4 GB, < 512 GB, TBs, PBs

Statistical Analysis (N x M)
- Billions of samples & few dimensions, or
- Billions of samples & thousands of dimensions, or
- Thousands of samples & thousands of dimensions
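To get a feel for the three N x M regimes above, the following is a rough sketch of the memory needed to hold the data as a dense matrix. The concrete numbers are assumptions made only for illustration: "billions" is taken as 10^9, "thousands" as 10^3, "few" as 10, and each value is assumed to be an 8-byte float. The helper names `matrix_bytes` and `human` are hypothetical.

```python
def matrix_bytes(n_samples: int, n_dims: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense N x M matrix (8 bytes = one 64-bit float)."""
    return n_samples * n_dims * bytes_per_value


def human(n_bytes: int) -> str:
    """Format a byte count using decimal (power-of-1000) units."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.0f} {unit}"
        value /= 1000


# Illustrative sizes only; these are assumptions, not figures from the slides.
regimes = {
    "billions x few": (10**9, 10),
    "billions x thousands": (10**9, 10**3),
    "thousands x thousands": (10**3, 10**3),
}

for name, (n, m) in regimes.items():
    print(f"{name}: {human(matrix_bytes(n, m))}")
```

Under these assumptions the first regime still fits on one large machine, the second does not, and the third is small in bytes even though it can be expensive to analyze.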

Data Analysis Tool

Computation Time
< 100 ms, seconds, minutes, hours, weeks

Optimizations
- R or Matlab implementation
- Algorithm parameter tuning
- C++/Java/… implementation
- Data structure optimization
- Multi-threaded parallelization (single machine)
- Distributed parallelization (multiple machines)
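The last two steps on this list share one pattern: split the data into chunks, process each chunk independently, then combine the partial results. Below is a minimal single-machine sketch of that pattern using Python's standard `concurrent.futures` module. The function names and the sum-of-squares workload are hypothetical, chosen only for illustration. Note that in standard CPython builds, threads do not speed up CPU-bound pure-Python code because of the global interpreter lock, so the sketch shows the structure rather than a performance gain.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(data, n_chunks):
    """Split data into at most n_chunks contiguous slices."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """Per-chunk work; each call is independent of the others."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, n_workers=4):
    """Process chunks on a thread pool, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(sum_of_squares, chunks(data, n_workers))
        return sum(partials)


print(parallel_sum_of_squares(list(range(1000))))
```

Moving from threads to multiple machines keeps the same split/process/combine shape, but the chunks then live on different nodes and the combine step crosses the network.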

Outline
- History of Big Data + Biology
- My research
- Interactive data analytics
- Elixir infrastructure
- Other interesting stuff
- Google File System
- MapReduce

Jim Gray's Talk

“Data, data everywhere”
Source: The Economist [http://www.economist.com/node/15557443?story_id=15557443]

Scientific Storage Systems
Source: http://www.usenix.org/events/lisa10/tech/slides/cass.pdf

Data growth in the life sciences
(Chart; axis unit: PB)

Increase in bioinformaticians? @ UiT

My Lab
- Biological Data Processing Systems Lab
- 3 + 1 PhD students
  - Edvard Pedersen, Einar Holsbø, Bjørn Fjukstad + Espen Mikal Robertsen
- 2 engineers
  - Inge Alexander Raknes, Giacomo Tartari
- 3 master students
  - Kenneth Knudsen, Morten Grønnesby, Jarl Fagerli
- http://bdps.cs.uit.no

Research Goal

Norwegian Women and Cancer (NOWAC)
- Large and unique biobank of blood samples
- Understand development of cancer (and how to avoid it)
- Develop diagnosis approaches
- Develop or improve treatment
- http://site.uit.no/nowac/

Center for Bioinformatics (SfB)
- Interdisciplinary research and services
  - Computer science
  - Biotechnology
  - Bioinformatics
- Special focus on marine metagenomics
- Commercial exploitation of marine resources
- http://sfb.cs.uit.no

Interactive Data Exploration

Interactive Data Exploration Components
- Human experts for data analysis
- Interactive user interface
- Analysis methods and models
- Data management and backend processing
- Compute and storage resources

Data-intensive computing platforms

Outline (part 2)
- Hardware platforms
- Infrastructure systems
  - Google File System
  - MapReduce
- Ecosystems

Hardware Requirements
- Process 1 TB of data?
- Process 1 PB of data?
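A back-of-the-envelope way to approach these two questions is to ask how long it takes just to read the data once. The sketch below assumes a device bandwidth of 6 Gbit/s, the SATA interface rate quoted on a later slide; real disks usually sustain less, so treat the results as optimistic lower bounds. The function `scan_seconds` and the device count of 1000 are assumptions for illustration.

```python
def scan_seconds(data_bytes: float, bandwidth_bits_per_s: float,
                 n_devices: int = 1) -> float:
    """Time to read data_bytes once, striped evenly over n_devices."""
    return data_bytes * 8 / (bandwidth_bits_per_s * n_devices)


TB = 10**12   # decimal terabyte, in bytes
PB = 10**15   # decimal petabyte, in bytes
SATA = 6e9    # assumed peak interface rate, in bits per second

print(f"1 TB, 1 device:     {scan_seconds(TB, SATA) / 60:.1f} minutes")
print(f"1 PB, 1 device:     {scan_seconds(PB, SATA) / 86400:.1f} days")
print(f"1 PB, 1000 devices: {scan_seconds(PB, SATA, 1000) / 60:.1f} minutes")
```

Under these assumptions one device reads 1 TB in about 22 minutes but needs about 15 days for 1 PB, while 1000 devices reading in parallel bring 1 PB back to about 22 minutes. That is the basic argument for spreading both storage and computation over many machines.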

Single Computer

Supercomputer
- Disadvantages:
  - Centralized storage has limited bandwidth
  - High cost of interconnect
(Diagram labels: InfiniBand 56 Gbit/s, 164 Gbit/s)

Commodity Component Distributed System
(Diagram label: SATA 6 Gbit/s)

Hadoop

Google File System (GFS)
- https://courses.cs.washington.edu/courses/cse490h/11wi/CSE490H_files/gfs.pdf
- Hadoop Distributed File System implements the GFS design

MapReduce
- http://research.google.com/archive/mapreduce-osdi04-slides/index.html
- Hadoop MapReduce implements the Google MapReduce design
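The MapReduce programming model named on this slide can be sketched in a few lines. The example below is a single-process illustration of the three phases (map, shuffle, reduce) applied to word counting. It is not the Hadoop or Google API; all function names are hypothetical, and a real system would run the map and reduce calls on many machines and move the grouped data over the network.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every input record, yielding (key, value) pairs."""
    for doc in documents:
        yield from mapper(doc)


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key to that key's list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(text):
    """Emit (word, 1) for every whitespace-separated word."""
    for word in text.lower().split():
        yield word, 1


def count_reducer(word, counts):
    """Sum the 1s emitted for a word."""
    return sum(counts)


def word_count(documents):
    pairs = map_phase(documents, word_mapper)
    return reduce_phase(shuffle(pairs), count_reducer)


docs = ["big data big compute", "data intensive computing"]
print(word_count(docs))
```

The user writes only `word_mapper` and `count_reducer`; everything else stands in for what the framework provides, which is why the same two functions can scale from one machine to a cluster.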
