Vu Pham
Introduction to Big Data
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna [email protected]
Big Data Computing Introduction to Big Data
Preface
Content of this Lecture:
In this lecture, we will discuss a brief introduction to
Big Data: why Big Data and where it came from,
the challenges and applications of Big Data, and the characteristics
of Big Data, i.e. Volume, Velocity, Variety and more V's.
What's Big Data?
Big data is the term for a collection of data sets so large and complex that it
becomes difficult to process using on-hand database management tools or
traditional data processing applications.
The challenges include capture, curation, storage, search, sharing, transfer,
analysis, and visualization.
The trend to larger data sets is due to the additional information derivable
from analysis of a single large set of related data, as compared to separate
smaller sets with the same total amount of data, allowing correlations to be
found to "spot business trends, determine quality of research, prevent
diseases, link legal citations, combat crime, and determine real-time
roadway traffic conditions."
Facts and Figures
Walmart handles 1 million customer transactions/hour.
Facebook handles 40 billion photos from its user base!
Facebook inserts 500 terabytes of new data every day.
Facebook stores, accesses, and analyzes 30+ petabytes of
user-generated data.
A flight generates 240 terabytes of flight data in 6-8 hours of flight.
More than 5 billion people are calling, texting, tweeting and
browsing on mobile phones worldwide.
Decoding the human genome originally took 10 years to process;
now it can be achieved in one week.
The largest AT&T database boasts titles including the largest volume
of data in one unique database (312 terabytes) and the second
largest number of rows in a unique database (1.9 trillion), which
comprises AT&T's extensive calling records.
An Insight
Byte (10^0): One grain of rice
KB (10^3): One cup of rice
MB (10^6): 8 bags of rice (Desktop)
GB (10^9): 3 semi trucks of rice
TB (10^12): 2 container ships of rice (Internet)
PB (10^15): Blankets 1/2 of Jaipur
Exabyte (10^18): Blankets the West coast, or 1/4th of India (Big Data)
Zettabyte (10^21): Fills the Pacific Ocean (Future)
Yottabyte (10^24): An earth-sized rice bowl
Brontobyte (10^27): Astronomical size
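The scale above can be made concrete with a small conversion helper. This is a minimal sketch, assuming the bracketed numbers are powers of ten; the function name `humanize` and the `UNITS` table are illustrative, not part of the lecture.

```python
# Decimal byte units, using the exponents listed on the slide.
# "Brontobyte" is the slide's informal name, not an official SI prefix.
UNITS = [
    ("Byte", 0), ("KB", 3), ("MB", 6), ("GB", 9), ("TB", 12),
    ("PB", 15), ("Exabyte", 18), ("Zettabyte", 21),
    ("Yottabyte", 24), ("Brontobyte", 27),
]


def humanize(num_bytes: int) -> str:
    """Express num_bytes in the largest unit that keeps the value >= 1."""
    name, exp = UNITS[0]
    for unit_name, unit_exp in UNITS:
        if num_bytes >= 10 ** unit_exp:
            name, exp = unit_name, unit_exp
    return f"{num_bytes / 10 ** exp:g} {name}"


# Figures quoted elsewhere in this lecture.
print(humanize(500 * 10**12))  # daily Facebook inserts
print(humanize(35 * 10**21))   # projected data volume for 2020
```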
What's making so much data?
Sources: people, machines, organizations: ubiquitous
computing
More people carrying data-generating devices
(mobile phones with Facebook, GPS, cameras, etc.)
Data on the Internet:
Internet live stats
http://www.internetlivestats.com/
Source of Data Generation
2+ billion people on the Web by end 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day
Crowdsourcing
An Example of Big Data at Work
Where is the problem?
Traditional RDBMS queries aren't sufficient to get useful
information out of the huge volume of data.
Searching it with traditional tools to find out whether a
particular topic was trending would take so long that
the result would be meaningless by the time it was
computed.
Big Data offers a solution: store this data in
novel ways to make it more accessible, and
develop methods of performing analysis
on it.
Challenges
Capturing
Storing
Searching
Sharing
Analysing
Visualization
IBM considers Big Data (3V’s):
The 3V’s: Volume, Velocity and Variety.
Volume (Scale)
Volume: Enterprises are awash with ever-growing
data of all types, easily amassing terabytes, even
petabytes, of information.
Turn 12 terabytes of Tweets created each day into
improved product sentiment analysis
Convert 350 billion annual meter readings to
better predict power consumption
Volume (Scale)
Data Volume
44x increase from 2009 to 2020
From 0.8 zettabytes to 35 zettabytes
Data volume is increasing exponentially
Exponential increase in collected/generated data
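The two endpoints quoted above imply a yearly growth rate. The sketch below derives it, assuming smooth compound growth between 2009 and 2020; that assumption is only for illustration and is not stated in the lecture.

```python
# Endpoints quoted on the slide.
start_zb, end_zb = 0.8, 35.0
years = 2020 - 2009

growth_factor = end_zb / start_zb                # roughly the "44x" on the slide
annual_rate = growth_factor ** (1 / years) - 1   # compound annual growth rate

print(f"growth factor: {growth_factor:.2f}x over {years} years")
print(f"implied annual growth: {annual_rate:.1%}")
```

Under that assumption the data volume grows by about 41% each year, which means it doubles roughly every two years.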
Example 1: CERN's Large Hadron Collider (LHC)
CERN's Large Hadron Collider (LHC) generates 15 PB a year
Example 2: The Earthscope
• The Earthscope is the world's largest
science project. Designed to track
North America's geological evolution,
this observatory records data over
3.8 million square miles, amassing
67 terabytes of data. It analyzes
seismic slips in the San Andreas fault,
sure, but also the plume of magma
underneath Yellowstone and much,
much more.
(http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)
Velocity (Speed)
Velocity: Sometimes 2 minutes is too late. For time-sensitive
processes such as catching fraud, big data
must be used as it streams into your enterprise in
order to maximize its value.
Scrutinize 5 million trade events created each day
to identify potential fraud
Analyze 500 million daily call detail records in real time
to predict customer churn faster
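To see what these daily volumes mean for a streaming system, they can be converted to average per-second rates. This is a rough sketch that assumes events arrive evenly across the day; real traffic has peaks, so the required capacity would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(events_per_day: float) -> float:
    """Average events per second, assuming a uniform arrival rate."""
    return events_per_day / SECONDS_PER_DAY


trade_rate = per_second(5_000_000)     # trade events from the slide
cdr_rate = per_second(500_000_000)     # call detail records from the slide

print(f"trade events: about {trade_rate:.0f} per second")
print(f"call detail records: about {cdr_rate:.0f} per second")
```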
Examples: Velocity (Speed)
Data is being generated fast and needs to be
processed fast
Online Data Analytics
Late decisions ➔ missing opportunities
Examples
E-Promotions: Based on your current location, your purchase history,
what you like ➔ send promotions right now for the store next to you
Healthcare monitoring: sensors monitoring your activities and body ➔
any abnormal measurements require immediate reaction
Real-time/Fast Data
Social media and networks
(all of us are generating data)
Scientific instruments
(collecting all sorts of data)
Mobile devices
(tracking all objects all the time)
Sensor technology and networks
(measuring all kinds of data)
Progress and innovation are no longer hindered by the ability to collect data,
but by the ability to manage, analyze, summarize, visualize, and discover
knowledge from the collected data in a timely manner and in a scalable fashion
Real-Time Analytics/Decision Requirement
Customer: Influence Behavior
Product recommendations that are relevant & compelling
Friend invitations to join a game or activity that expands business
Preventing fraud as it is occurring & preventing more proactively
Learning why customers switch to competitors and their offers, in time to counter
Improving the marketing effectiveness of a promotion while it is still in play
Variety (Complexity)
Variety: Big data is any type of data:
Structured data (example: tabular data)
Unstructured: text, sensor data, audio, video
Semi-structured: web data, log files
Examples: Variety (Complexity)
Relational Data (Tables/Transaction/Legacy
Data)
Text Data (Web)
Semi-structured Data (XML)
Graph Data
Social Network, Semantic Web (RDF), …
Streaming Data
You can only scan the data once
A single application can be
generating/collecting many types of data
Big Public Data (online, weather, finance, etc)
To extract knowledge ➔ all these types of data need to be
linked together
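The note that streaming data can only be scanned once has a practical consequence: statistics must be updated incrementally instead of being recomputed over stored data. The sketch below uses Welford's one-pass algorithm for mean and variance; the algorithm choice and the `sensor_readings` generator are illustrative assumptions, not something the lecture prescribes.

```python
from typing import Iterable, Iterator, Tuple


def running_mean_var(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Welford's one-pass algorithm: (count, mean, sample variance)."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # uses the already-updated mean
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


def sensor_readings() -> Iterator[float]:
    """Hypothetical stream; each value is seen once and never stored."""
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        yield value


count, mean, variance = running_mean_var(sensor_readings())
print(count, mean, variance)
```

Memory use stays constant no matter how long the stream runs, which is the property a single-scan setting needs.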
The 3 Big V’s (+1)
Big 3V’s
Volume
Velocity
Variety
Plus 1
Value
The 3 Big V’s (+1) (+ N more)
Plus many more
Veracity
Validity
Variability
Viscosity & Volatility
Viability,
Venue,
Vocabulary, Vagueness,
…
Value
Integrating Data
Reducing data complexity
Increase data availability
Unify your data systems
All 3 above will lead to increased data collaboration
-> add value to your big data
Veracity
Veracity refers to the biases, noise and
abnormality in data, and the trustworthiness of data.
1 in 3 business leaders don’t trust the information
they use to make decisions.
How can you act upon information if you don’t
trust it?
Establishing trust in big data presents a huge
challenge as the variety and number of sources
grow.
Valence
Valence refers to the connectedness of big data,
such as in the form of graph networks.
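The lecture gives no formula for valence. One plausible way to quantify connectedness, assumed here purely for illustration, is graph density: the fraction of possible pairwise connections that actually exist. The function name and the sample edges are hypothetical.

```python
from typing import Hashable, Iterable, Tuple

Edge = Tuple[Hashable, Hashable]


def valence(num_items: int, edges: Iterable[Edge]) -> float:
    """Share of possible undirected connections that are present (0.0 to 1.0)."""
    possible = num_items * (num_items - 1) // 2
    if possible == 0:
        return 0.0
    # Ignore self-loops and count each undirected pair once.
    unique = {frozenset(edge) for edge in edges if edge[0] != edge[1]}
    return len(unique) / possible


# Four data items, three of which are linked to each other.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "b")]
print(valence(4, links))
```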
Validity
Accuracy and correctness of the data relative to a
particular use
Example: Gauging storm intensity
satellite imagery vs social media posts
prediction quality vs human impact
Variability
How the meaning of the data changes over time
Language evolution
Data availability
Sampling processes
Changes in characteristics of the data source
Viscosity & Volatility
Both related to velocity
Viscosity: data velocity relative to timescale of
event being studied
Volatility: rate of data loss and stable lifetime
of data
Scientific data often has practically unlimited
lifespan, but social / business data may evaporate
in finite time
More V’s
Viability
Which data has meaningful relations to questions of
interest?
Venue
Where does the data live and how do you get it?
Vocabulary
Metadata describing structure, content, & provenance
Schemas, semantics, ontologies, taxonomies, vocabularies
Vagueness
Confusion about what “Big Data” means
Dealing with Volume
Distill big data down to small information
Parallel and automated analysis
Automation requires standardization
Standardize by reducing Variety:
Format
Standards
Structure
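Distilling big data into small information through parallel analysis can be sketched as follows: split the data into chunks, summarize each chunk independently, then merge the small summaries. This is a minimal illustration on a single machine; the `Summary` type and function names are assumptions, and a thread pool stands in for the cluster a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, NamedTuple


class Summary(NamedTuple):
    count: int
    total: float
    minimum: float
    maximum: float


def summarize(chunk: List[float]) -> Summary:
    """Reduce one chunk of raw values to a four-number summary."""
    return Summary(len(chunk), sum(chunk), min(chunk), max(chunk))


def merge(a: Summary, b: Summary) -> Summary:
    """Combine two partial summaries without revisiting the raw data."""
    return Summary(a.count + b.count, a.total + b.total,
                   min(a.minimum, b.minimum), max(a.maximum, b.maximum))


def parallel_summary(values: List[float], chunk_size: int) -> Summary:
    """Summarize a non-empty list by processing fixed-size chunks in parallel."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(summarize, chunks))
    result = partials[0]
    for partial in partials[1:]:
        result = merge(result, partial)
    return result


readings = [float(i) for i in range(1, 1001)]
overall = parallel_summary(readings, chunk_size=100)
print(overall, overall.total / overall.count)
```

The scheme works because every chunk is summarized the same way, which is the standardization the slide says automation requires.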
Harnessing Big Data
OLTP: Online Transaction Processing (DBMSs)
OLAP: Online Analytical Processing (Data Warehousing)
RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)
The Model Has Changed…
The Model of Generating/Consuming Data has Changed
Old Model: Few companies are generating data, all others are consuming data
New Model: all of us are generating data, and all of us are consuming data
What’s driving Big Data
-Ad-hoc querying and reporting
-Data mining techniques
-Structured data, typical sources
-Small to mid-size datasets
-Optimizations and predictive analytics
-Complex statistical analysis
-All types of data, and many sources
-Very large datasets
-More of a real-time
Big Data Analytics
Big data is more real-time in
nature than traditional
data warehouse (DW)
applications
Traditional DW architectures
(e.g. Exadata, Teradata) are
not well-suited for big data
apps
Shared-nothing, massively
parallel processing, scale-out
architectures are well-suited
for big data apps
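In a shared-nothing design each node owns its slice of the data and works on it without touching any other node's memory or disk. The sketch below imitates that idea in one process using hash partitioning; the node count, record layout, and function names are assumptions made for illustration, not a description of any particular product.

```python
import zlib
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]


def node_for(key: str, num_nodes: int) -> int:
    """Stable hash so the same key always lands on the same node."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(records: Iterable[Record], num_nodes: int) -> Dict[int, List[Record]]:
    """Route each record to the node that owns its key."""
    nodes: Dict[int, List[Record]] = defaultdict(list)
    for key, value in records:
        nodes[node_for(key, num_nodes)].append((key, value))
    return nodes


def local_totals(rows: Iterable[Record]) -> Counter:
    """Work a single node does using only its own rows."""
    totals: Counter = Counter()
    for key, value in rows:
        totals[key] += value
    return totals


records = [("alice", 3), ("bob", 5), ("alice", 2), ("carol", 7), ("bob", 1)]
nodes = partition(records, num_nodes=3)

# Each key lives on exactly one node, so per-node results never conflict.
combined: Counter = Counter()
for rows in nodes.values():
    combined.update(local_totals(rows))
print(dict(combined))
```

Scaling out then means raising `num_nodes`, although a production system would also have to move existing data when the node count changes.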
Big Data Technology
Conclusion
In this lecture, we have defined Big Data and discussed
the challenges and applications of Big Data.
We have also described characteristics of Big Data i.e.
Volume, Velocity, Variety and more V’s, Big Data Analytics,
Big Data Landscape and Big Data Technology.