5 Data Modeling for NoSQL 1/2

fabiofumarola1 32,235 views 54 slides Mar 25, 2015

About This Presentation

Information technology has led us into an era where the production, sharing and use of information are part of everyday life, and in which we are often almost unaware actors: it is now almost inevitable to leave a digital trail of many of the actions we perform every day; for example, by digital...


Slide Content

Aggregate Data Models
Dr. Fabio Fumarola

Agenda
• Data Model Evolution
• Relational Model vs Aggregate Model
• Consequences of Aggregate Models
• Aggregates and Transactions
• Aggregate Models on NoSQL
– Key-value and Document
– Column-Family Stores
• Summarizing Aggregate-Oriented Databases

Data Model
• A data model is a representation that we use to perceive and manipulate our data.
• It allows us to:
– Represent the data elements under analysis, and
– Describe how these are related to each other
• This representation depends on our perception.

Data Model: Database View
• In the database field, it describes how we interact with the data in the database.
• This is distinct from the storage model:
– The storage model describes how the database stores and manipulates the data internally.
• In an ideal world:
– We should be ignorant of the storage model, but
– In practice we need at least some insight into it to achieve decent performance

Data Models: Example
• A data model can also mean the model of the specific data in an application
• A developer might point to an entity-relationship diagram and refer to it as the data model containing
– customers,
– orders, and
– products

Data Model: Definition

In this course we will use “data model” to refer to the model by which the database organizes data.
More formally, it can be defined as a meta-model.

Data Model of the Last Decades
• The dominant data model of the last decades was the relational data model.
1. It can be represented as a set of tables.
2. Each table has rows, with each row representing some entity of interest.
3. We describe entities through columns (???)
4. A column may refer to another row in the same or a different table (relationship).
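
The four points above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names (customers, orders), the column names and the sample values are assumptions chosen to match the e-commerce example used later in the deck.

```python
import sqlite3

# In-memory database, used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# 1. The model is a set of tables; 3. entities are described through columns.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    # 4. A column that refers to a row in another table (relationship).
    " customer_id INTEGER NOT NULL REFERENCES customers(id))"
)

# 2. Each row represents one entity of interest.
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Fabio')")
conn.execute("INSERT INTO orders (id, customer_id) VALUES (99, 1)")

# Following the relationship requires a join between the two tables.
rows = conn.execute(
    "SELECT o.id, c.name FROM orders AS o "
    "JOIN customers AS c ON c.id = o.customer_id"
).fetchall()
```

Both the input and the result of the query are flat tuples, which is the property the following slides contrast with aggregates.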

NoSQL Data Model
• It moves away from the relational data model
• Each NoSQL database has a different model:
– Key-value,
– Document,
– Column-family,
– Graph, and
– Sparse (index based)
• Of these, the first three share a common characteristic (aggregate orientation).

RELATIONAL MODEL VS AGGREGATE MODEL

Relational Model
• The relational model takes the information that we want to store and divides it into tuples (rows).
• However, a tuple is a limited data structure.
• It captures a set of values.
• So we can’t nest one tuple within another to get nested records.
• Nor can we put a list of values or tuples within another.

Relational Model
• This simplicity characterizes the relational model
• It allows us to think of data manipulation as operations that:
– Take tuples as input, and
– Return tuples
• Aggregate orientation takes a different approach.

Aggregate Model
• It recognizes that you often want to operate on data in units that have a more complex structure than a set of tuples.
• We can think in terms of a complex record that allows:
– Lists,
– Maps,
– And other data structures to be nested inside it
• Key-value, document, and column-family databases use this complex structure.
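
To make the contrast concrete, here is a minimal Python sketch of the same order expressed first as flat tuples and then as one nested record. The class names (OrderItem, Order), the field names and the second item (product 31, price 12) are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class OrderItem:
    product_id: int
    price: int


@dataclass
class Order:
    id: int
    items: List[OrderItem] = field(default_factory=list)            # nested list
    shipping_address: Dict[str, str] = field(default_factory=dict)  # nested map


# Relational view: flat tuples spread over two "tables".
order_rows = [(99, "Bari")]                      # (order_id, shipping_city)
order_item_rows = [(99, 27, 34), (99, 31, 12)]   # (order_id, product_id, price)

# Aggregate view: one unit that carries its nested parts with it.
order = Order(
    id=99,
    items=[OrderItem(27, 34), OrderItem(31, 12)],
    shipping_address={"city": "Bari"},
)

# Operating on the aggregate needs no join: the items are already inside it.
total = sum(item.price for item in order.items)
```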

Aggregate Model
• Aggregate is a term coming from Domain-Driven Design [Evans03]
– An aggregate is a collection of related objects that we wish to treat as a unit for data manipulation and management of consistency.
• We like to update aggregates with atomic operations
• We like to communicate with our data storage in terms of aggregates
[Evans03] http://pbdmng.datatoknowledge.it/readingMaterial/Evans03.pdf

Aggregate Models
• This definition matches really well with how key-value, document, and column-family databases work.
• Aggregates make it easier to work on a cluster, since they are the unit for replication and sharding.
• Aggregates are also easier for application programmers to work with, since they solve the impedance mismatch problem of relational databases.

Example of Relational Model
• Assume we are building an e-commerce website;
• We have to store information about: users, products, orders, shipping addresses, billing addresses, and payment data.

Example of Relational Model
• As good relational soldiers:
– Everything is normalized
– No data is repeated in multiple tables.
– We have referential integrity

Example of Aggregate Model
• We have two aggregates:
– Customers and
– Orders
• We use the black-diamond composition marker to show how data fits into the aggregate structure

Example of Aggregate Model
• The customer contains a list of billing addresses;
• The order contains:
– a list of order items,
– a shipping address,
– and payments
• The payment itself contains a billing address for that payment

Example of Aggregate Model
• A single address appears 3 times, but instead of using an id it is copied each time
• This fits a domain where we don’t want shipping, payment and billing addresses to change
• What is the difference w.r.t. a relational representation?

Example of Aggregate Model
• The link between the customer and the order is a relationship between aggregates
// Customer
{
  "id": 1,
  "name": "Fabio",
  "billingAddress": [
    {
      "city": "Bari"
    }
  ]
}
// Orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 34,
      "productName": "Scala in Action"
    }
  ],
  "shippingAddress": [ {"city": "Bari"} ],
  "orderPayment": [
    {
      "ccinfo": "100-432423-545-134",
      "txnId": "afdfsdfsd",
      "billingAddress": [ {"city": "Bari"} ]
    }
  ]
}
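
The relationship between the two aggregates is only the customerId value stored in the order; nothing in the data itself enforces it. A minimal Python sketch of how an application might follow that link, assuming two hypothetical in-memory collections (customers, orders) keyed by aggregate id and a trimmed copy of the documents above:

```python
import json

customer_json = '{"id": 1, "name": "Fabio", "billingAddress": [{"city": "Bari"}]}'
order_json = (
    '{"id": 99, "customerId": 1, "orderItems": '
    '[{"productId": 27, "price": 34, "productName": "Scala in Action"}]}'
)

# Hypothetical collections: one dictionary per aggregate type, keyed by id.
customers = {}
orders = {}

customer = json.loads(customer_json)
customers[customer["id"]] = customer
order = json.loads(order_json)
orders[order["id"]] = order


def customer_for(order_id):
    """Follow the customerId reference from an order to its customer."""
    found_order = orders[order_id]                 # first lookup: the order aggregate
    return customers[found_order["customerId"]]    # second lookup: the customer aggregate


owner = customer_for(99)
```

Crossing the aggregate boundary costs a second lookup, whereas the order items come for free with the order itself.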

Orders Details

• There is the customer id
• The product name is part of the order items
• The product id is part of the order items
• The address is stored several times
// Orders
{
  "id": 99,
  "customerId": 1,
  "orderItems": [
    {
      "productId": 27,
      "price": 34,
      "productName": "Scala in Action"
    }
  ],
  "shippingAddress": [ {"city": "Bari"} ],
  "orderPayment": [
    {
      "ccinfo": "100-432423-545-134",
      "txnId": "afdfsdfsd",
      "billingAddress": [ {"city": "Bari"} ]
    }
  ]
}

Example: Rationale
• We aggregate to minimize the number of aggregates we access during data interaction
• The important thing to notice is that:
1. We have to think about how the data will be accessed
2. We make this part of our thinking when developing the application data model
• We could draw our aggregates differently, but it really depends on the “data access models”

Example: Rationale
• Like most things in modeling, there is no universal answer on how to draw aggregate boundaries.
• It depends on how we have to manipulate our data
1. If we tend to access a customer with all of its orders at once, then we should prefer a single aggregate.
2. If we tend to access a single order at a time, then we should prefer separate aggregates for each order.
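
The two choices can be compared with a small Python sketch. Everything here is an assumption made for illustration: plain dictionaries stand in for the database, and the second order (id 100) and the "total" field are invented sample data.

```python
# Design 1: a single aggregate; orders are embedded in the customer.
customers_embedded = {
    1: {"id": 1, "name": "Fabio",
        "orders": [{"id": 99, "total": 34}, {"id": 100, "total": 20}]},
}

# Design 2: separate aggregates; each order references its customer.
customers_separate = {1: {"id": 1, "name": "Fabio"}}
orders_separate = {
    99: {"id": 99, "customerId": 1, "total": 34},
    100: {"id": 100, "customerId": 1, "total": 20},
}


def customer_with_orders_embedded(customer_id):
    # One lookup returns the customer together with all of its orders.
    return customers_embedded[customer_id]


def single_order_embedded(customer_id, order_id):
    # The whole customer aggregate is loaded just to reach one order.
    customer = customers_embedded[customer_id]
    return next(o for o in customer["orders"] if o["id"] == order_id)


def single_order_separate(order_id):
    # One lookup returns exactly the order we asked for.
    return orders_separate[order_id]


def customer_with_orders_separate(customer_id):
    # The application has to gather the orders itself.
    customer = dict(customers_separate[customer_id])
    customer["orders"] = [
        o for o in orders_separate.values() if o["customerId"] == customer_id
    ]
    return customer
```

Each design makes one access pattern a single lookup and the other one more expensive, which is why the boundary follows the way the data is manipulated.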

CONSEQUENCES OF AGGREGATE MODELS

No Distributable Storage
• Relational mapping can capture data elements and their relationships well.
• It does not need any notion of an aggregate entity, because it uses foreign key relationships.
• But we cannot distinguish relationships that represent aggregations from those that don’t.
• As a result, we cannot take advantage of that knowledge to store and distribute our data.

Marking Aggregate Tools
• Many data modeling techniques provide ways to mark aggregate structures in relational models
• However, they do not provide semantics that help distinguish relationships
• When working with aggregate-oriented databases, we have a clear view of the semantics of the data.
• We can focus on the unit of interaction with the data storage.

Aggregate Ignorant
• Relational databases are aggregate-ignorant, since they don’t have the concept of an aggregate
• Graph databases are also aggregate-ignorant.
• This is not always bad.
• In domains where it is difficult to draw aggregate boundaries, aggregate-ignorant databases are useful.

Aggregate and Operations
• An order is a good aggregate when:
– A customer is making and reviewing an order, and
– The retailer is processing orders
• However, when the retailer wants to analyze its product sales over the last months, aggregates are trouble.
• We need to dig into each aggregate to extract the sales history.
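
A short Python sketch of why this is costly: to total the sales per product, every order aggregate has to be read and unpacked. The list of orders is invented sample data and revenue_by_product is a hypothetical helper, not part of any database API.

```python
from collections import defaultdict

# Sample order aggregates (only the fields needed here).
orders = [
    {"id": 99, "orderItems": [{"productId": 27, "price": 34}]},
    {"id": 100, "orderItems": [{"productId": 27, "price": 34},
                               {"productId": 31, "price": 12}]},
]


def revenue_by_product(order_aggregates):
    """Sum item prices per product by scanning every order aggregate."""
    totals = defaultdict(int)
    for order in order_aggregates:        # every aggregate must be read...
        for item in order["orderItems"]:  # ...and opened to reach its items
            totals[item["productId"]] += item["price"]
    return dict(totals)


revenue = revenue_by_product(orders)
```

The order aggregate is organized around the order, so a question organized around the product cuts across all of them.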

Aggregate and Operations
• Aggregates may help with some operations and not with others.
• In cases where there is no clear view of how the data will be used, aggregate-ignorant databases are the best option.
• But remember the point that drove us to aggregate models (cluster distribution).
• Running databases on a cluster is needed when dealing with huge quantities of data.

Running on a Cluster
• It gives several advantages in computation power and data distribution
• However, it requires minimizing the number of nodes to query when gathering data
• By explicitly including aggregates, we give the database an important hint about which information should be stored together
• But we still have the problem of querying historical data. Do we have any solution?

AGGREGATES AND TRANSACTIONS

ACID transactions
• Relational databases allow us to manipulate any combination of rows from any table in a single transaction.
• ACID transactions:
– Atomic,
– Consistent,
– Isolated, and
– Durable
have their main point in atomicity.

Atomicity & RDBMS
• Many rows spanning many tables are updated in a single atomic operation
• It either succeeds or fails entirely
• Concurrent operations are isolated and we cannot see partial updates
• However, relational databases can still fail.

Atomicity & NoSQL
• Aggregate-oriented NoSQL databases typically don’t support atomicity that spans multiple aggregates.
• This means that if we need to update multiple aggregates atomically, we have to manage that in the application code.
• Thus atomicity is one of the considerations when deciding how to divide our data into aggregates
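
What "manage that in the application code" can look like is sketched below. This is a toy under stated assumptions: AggregateStore is a hypothetical in-memory store whose put replaces one aggregate as a whole, and place_order, the key format and the orderIds field are invented for the example. The compensation step is one possible strategy, not the only one.

```python
import copy


class AggregateStore:
    """Toy store: each put replaces exactly one aggregate, all or nothing."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return copy.deepcopy(self._data[key])   # raises KeyError if missing

    def put(self, key, aggregate):
        self._data[key] = copy.deepcopy(aggregate)

    def delete(self, key):
        self._data.pop(key, None)

    def __contains__(self, key):
        return key in self._data


def place_order(store, customer_key, order_key, order):
    """Update two aggregates; undo the first write if the second step fails."""
    store.put(order_key, order)                  # write 1: atomic for the order only
    try:
        customer = store.get(customer_key)
        customer.setdefault("orderIds", []).append(order["id"])
        store.put(customer_key, customer)        # write 2: atomic for the customer only
    except Exception:
        store.delete(order_key)                  # compensating action for write 1
        raise


store = AggregateStore()
store.put("customer:1", {"id": 1, "name": "Fabio"})
place_order(store, "customer:1", "order:99", {"id": 99, "customerId": 1})
```

Between the two writes another reader can observe the order without the updated customer, which is exactly the partial state a multi-row ACID transaction would hide. Putting both pieces of data in one aggregate removes the problem.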

AGGREGATE MODELS ON NOSQL
Key-value and Document

Key-Value and Document
• Key-value and document databases are strongly aggregate-oriented.
• Both types of database consist of lots of aggregates, each with a key used to get at the data.
• The two types of database differ in that:
– In a key-value store the aggregate is opaque (a blob)
– In a document database we can see a structure in the aggregate.

Key-Value and Document
• The advantage of key-value is that we can store any type of object
• The database may impose some size limit, but otherwise we have freedom
• A document store imposes limits on what we can place in it, defining a structure on the data.
– In return we get a language to query documents.

Key-Value and Document
• With a key-value store we can only access an aggregate by its key
• With a document store:
– We can submit queries based on fields,
– We can retrieve part of the aggregate, and
– The database can create indexes based on the fields of the aggregate.
• But in practice the two are not always used this way
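
The difference in access paths can be sketched with two toy classes. KeyValueStore and DocumentStore are hypothetical in-memory stand-ins, not the API of any real product; the find method does a linear scan where a real document database could use an index.

```python
import json


class KeyValueStore:
    """Values are opaque bytes; the key is the only access path."""

    def __init__(self):
        self._blobs = {}

    def put(self, key, blob):
        self._blobs[key] = blob

    def get(self, key):
        return self._blobs[key]


class DocumentStore:
    """Values are structured documents, so their fields are visible."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get(self, key, fields=None):
        doc = self._docs[key]
        if fields is None:
            return doc
        # Retrieve only part of the aggregate.
        return {name: doc[name] for name in fields if name in doc}

    def find(self, field_name, value):
        # Query on a field of the aggregate (linear scan in this sketch).
        return [d for d in self._docs.values() if d.get(field_name) == value]


order = {"id": 99, "customerId": 1,
         "orderItems": [{"productId": 27, "price": 34}]}

kv = KeyValueStore()
kv.put("order:99", json.dumps(order).encode("utf-8"))
decoded = json.loads(kv.get("order:99"))   # the application interprets the blob

docs = DocumentStore()
docs.put("order:99", order)
by_customer = docs.find("customerId", 1)
partial = docs.get("order:99", fields=["id", "customerId"])
```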

Key-Value and Document
• People use document databases as key-value stores
• Riak (key-value) allows you to add metadata to aggregates for indexing
• Redis allows you to break aggregates into lists, sets or maps.
• You can support queries by integrating search tools like Solr (Riak includes Solr for searching data stored as XML or JSON).

Key-Value and Document
Despite this, the general distinction still holds.
• With key-value databases we expect to look up aggregates using a key
• With document databases, we mostly expect to submit some form of query on the internal structure of the documents.

AGGREGATE MODELS ON NOSQL
Column-Family Stores

Column-Family Stores
• One of the most influential NoSQL databases was Google’s BigTable [Chang et al.]
• Its name derives from its structure, composed of sparse columns and no schema.
• We shouldn’t think of this structure as a table, but as a two-level map.
• The BigTable model influenced the design of the open source HBase and Cassandra.

Column-Family Stores
• These BigTable-style data models are referred to as column stores.
• Pre-NoSQL column stores like C-Store used SQL and the relational model.
• What makes column stores different is how they physically store data.
• Most databases have rows as the unit of storage, which helps write performance

Column-Family Stores
• However, there are many scenarios where:
– Writes are rare, but
– You need to read a few columns of many rows at once
• In these situations, it’s better to store groups of columns for all rows as the basic storage unit.
• These kinds of databases are called column stores or column-family databases
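
The two layouts can be pictured with plain Python structures. This is a conceptual sketch only: the sample customers (Anna, Luca) and the total_spent field are invented, and real systems lay the data out on disk rather than in lists and dictionaries.

```python
# Row-oriented layout: all values of one record are stored together.
rows = [
    {"id": 1, "name": "Fabio", "city": "Bari", "total_spent": 34},
    {"id": 2, "name": "Anna", "city": "Roma", "total_spent": 50},
    {"id": 3, "name": "Luca", "city": "Bari", "total_spent": 16},
]

# Column-oriented layout: all values of one column are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Reading one column of many rows:
total_from_rows = sum(row["total_spent"] for row in rows)  # visits every whole record
total_from_columns = sum(columns["total_spent"])           # visits a single column

# Writing one record:
rows.append({"id": 4, "name": "Sara", "city": "Lecce", "total_spent": 0})  # one place
```

Appending a record touches one place in the row layout but would touch every column list in the column layout, which matches the read-heavy, write-rare scenario described above.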

Column-Family Stores
• Column-family databases have a two-level aggregate structure.
• As with key-value stores, the first key is the row identifier.
• The difference is that retrieving a key returns a map of more detailed values.
• These second-level values are referred to as columns.
• Given a row, we can access all of its column families or a particular element.
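
A minimal Python sketch of this structure as nested maps: row key, then column family, then column key. The row key format, the family names (profile, orders) and the helper functions are assumptions for illustration and do not correspond to the API of HBase or Cassandra.

```python
# row key -> column family -> column key -> value
table = {
    "customer:456": {
        "profile": {"name": "Fabio", "billingAddress": "Bari"},
        "orders": {"order:99": {"total": 34}, "order:100": {"total": 20}},
    },
}


def get_row(row_key):
    """First level: the row identifier returns the whole aggregate."""
    return table[row_key]


def get_family(row_key, family):
    """Given a row, one column family returns a map of columns."""
    return table[row_key][family]


def get_column(row_key, family, column):
    """Second level: a particular element inside a column family."""
    return table[row_key][family][column]


families = sorted(get_row("customer:456"))
name = get_column("customer:456", "profile", "name")
order_keys = sorted(get_family("customer:456", "orders"))
```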

Example of Column Model

Column-Family Stores
• They organize their columns into families.
• Each column is part of a family, and the column family acts as the unit of access.
• The data for a particular column family are then accessed together.

Column-Family Stores: How to structure data
• Row-oriented view:
– each row is an aggregate (for example, the customer with id 456),
– with column families representing useful chunks of data (profile, order history) within that aggregate
• Column-oriented view:
– each column family defines a record type (e.g. customer profiles) with rows for each of the records.
– You can think of a row as the join of records in all column families

Column Family: Storage Insights
• Since the database knows about column grouping, it uses this information for storage and access behavior.
• In a document store, even if a document has an array of elements, the document is a single unit of storage.
• Column families give a two-level dimension to the stored data.

Modeling Strategies on Column DB
• We can model list-based elements as a column family.
• Thus we can have column families with many elements and others with few elements.
• This may lead to high fragmentation of the stored data.
• In Cassandra this problem is addressed by distinguishing skinny and wide rows.

Modeling Strategies on Column DB
• Another characteristic is that the elements in column families are sorted by their keys.
• For the orders, this would be useful if we made a key out of a concatenation of date and id.
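
Why such a key is useful can be shown with a small Python sketch that uses the standard bisect module on a sorted list of column keys. The key format ("<ISO date>:<zero-padded order id>"), the sample orders and the orders_between helper are all assumptions for illustration.

```python
from bisect import bisect_left, bisect_right

# Columns of a hypothetical "orders" column family for one customer row.
order_columns = {
    "2015-03-02:0101": {"total": 20},
    "2015-01-15:0099": {"total": 34},
    "2015-02-20:0100": {"total": 12},
}

# A column-family store keeps column keys sorted; here we sort them ourselves.
sorted_keys = sorted(order_columns)


def orders_between(start_date, end_date):
    """Return the column keys whose date part lies in [start_date, end_date]."""
    low = bisect_left(sorted_keys, start_date)
    # "\uffff" sorts after any order id, so the whole end date is included.
    high = bisect_right(sorted_keys, end_date + ":\uffff")
    return sorted_keys[low:high]


recent = orders_between("2015-02-01", "2015-03-31")
```

Because ISO dates sort lexicographically in chronological order, a date range becomes one contiguous slice of the sorted keys instead of a scan over all orders.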

SUMMARIZING AGGREGATE-ORIENTED DATABASES

Key Points
• All share the notion of an aggregate indexed by a key.
• The key can be used for lookup
• The aggregate is central to running on a cluster.
• The aggregate acts as the atomic unit for updates, providing transactions at the aggregate level.
• Key-value treats the values as blobs

Key Points
• The document model makes the aggregate transparent for queries, but the document is treated as a single unit of storage.
• Column-family models divide aggregates into column families, allowing the database to treat them as units of data within the aggregate.
• This imposes some structure but allows the database to take advantage of the structure to improve its accessibility.