Blazing-Fast Serverless MapReduce Indexer for Apache Solr

SeaseLtd 180 views 39 slides Jun 19, 2024

About This Presentation

Indexing data from databases into Apache Solr has always been an open problem: for a while, the Data Import Handler was used even though it was not recommended for production environments. Traditional indexing processes often run into scalability problems, especially with large datasets.
In this talk, ...


Slide Content

Blazing-Fast Serverless MapReduce Indexer for Apache Solr
Speaker: Daniele Antuzi, R&D Software Engineer @ Sease
BERLIN BUZZWORDS 2024 - 11/06/2024

WHO I AM
DANIELE ANTUZI
R&D Search Software Engineer
Master in Computer Science at University of Pisa
Passionate about algorithms and data structures
Food (and sometimes sport) lover

SEArch SErvices
Headquartered in London/distributed
Open-source Enthusiasts
Apache Lucene/Solr experts
Elasticsearch/OpenSearch experts
Community Contributors
Active Researchers
Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
www.sease.io

AGENDA Combining DB records Introduction to Map reduce Combining DB records with Map reduce Serverless implementation The Apache Solr indexer

The problem of combining DB records
DB schema: Songs, Composers, Albums, Tags
Solr document:
{ id: 235, title: "House Of The Rising Sun", albumName: "The Best of The Animals", composers: ["The Animals"], tags: ["1964", "folk", "en"] },
{ id: 594, title: "That's All Right", albumName: "Rock 'n' Roll", composers: ["Elvis Presley"], tags: ["1946", "us"] },

The problem of combining DB records
SELECT * FROM Songs JOIN Composers ON … JOIN Tags ON … JOIN Albums ON … …
Simple solution
Not scalable
High number of records
Too much work for DB server

The problem of combining DB records
songs = getDBRecords("SELECT * FROM Songs")
foreach song in songs:
    getDBRecords("SELECT * FROM Composers c WHERE c.songId = " + song.id)
    getDBRecords("SELECT * FROM Tags t WHERE t.songId = " + song.id)
    getDBRecords("SELECT * FROM Albums a WHERE a.songId = " + song.id)
    . . .
More scalable
Simple solution
High database workload
Too slow

AGENDA Combining DB records Introduction to Map reduce Combining DB records with Map reduce Serverless implementation The Apache Solr indexer

Map Reduce
Programming pattern to access big data from a distributed FS
Paper "MapReduce: Simplified Data Processing on Large Clusters" in 2004
The user only defines the functions Map and Reduce
Implemented by Apache Hadoop or Apache Spark

Map Reduce - Word Count the: 2469493 quick: 34904 brown: 45865 fox: 3547 jumps: 57843 over: 29044 lazy: 346975 dog: 239685

Map Reduce - Word Count - Split Node 2 Node 3 Node 1

Map Reduce - Word Count - Map
Node 2 Node 3 Node 1
[ <Berlin, 5>, <Buzzwords, 3> ]
[ <Berlin, 1>, <AI, 7> ]
[ <AI, 5>, <Opensource, 4> ]

Map Reduce - Word Count - Shuffle
Node 2 Node 3 Node 1
<Berlin, 5>, <Buzzwords, 3>
<Berlin, 1>, <AI, 7>
<AI, 5>, <Opensource, 4>
Node 2 Node 3 Node 1
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>

Map Reduce - Word Count - Reduce
Node 2 Node 3 Node 1
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>
AI: 12
Berlin: 6
Buzzwords: 3
Opensource: 4
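
The word-count slides above can be followed step by step in a small single-process sketch. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`) are illustrative and not taken from the talk, and the three "nodes" are simply three lists; a real framework would distribute each phase.

```python
from collections import defaultdict


def map_phase(text):
    """Map: count the words of one input split, emitting <word, count> pairs."""
    counts = defaultdict(int)
    for word in text.split():
        counts[word] += 1
    return list(counts.items())


def shuffle_phase(mapped_outputs):
    """Shuffle: group the counts emitted by all nodes under their word."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_phase(grouped):
    """Reduce: sum the partial counts of each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Map outputs of the three nodes, copied from the Map slide.
mapped = [
    [("Berlin", 5), ("Buzzwords", 3)],
    [("Berlin", 1), ("AI", 7)],
    [("AI", 5), ("Opensource", 4)],
]
grouped = shuffle_phase(mapped)
totals = reduce_phase(grouped)
```

With the slide's inputs, `grouped` holds the lists shown on the Shuffle slide and `totals` the sums shown on the Reduce slide.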

AGENDA Combining DB records Introduction to Map reduce Combining DB records with Map reduce Serverless implementation The Apache Solr indexer

Combining DB records with Map reduce

Combining DB records with Map reduce
Songs:
{ songID: 235, title: "House Of The Rising Sun", . . . },
{ songID: 345, title: "The Magic Flute", . . . },
{ songID: 594, title: "The Marriage of Figaro", . . . }
Albums + Song ID:
{ songID: 235, albumName: "The Best of The Animals", . . . },
{ songID: 345, albumName: "The Best of Mozart", . . . },
{ songID: 594, albumName: "The Best of Mozart", . . . }
Composers + Song ID:
{ songID: 235, composerName: "The Animals", . . . },
{ songID: 345, composerName: "Mozart", . . . },
{ songID: 594, composerName: "Mozart", . . . }

Combining DB records with Map reduce
{songID: 235, title: "House Of The Rising Sun", ... }
{songID: 594, title: "The Marriage of Figaro", . . . }
{songID: 235, albumName: "The Best of The Animals"}
{songID: 594, albumName: "The Best of Mozart"}
{songID: 235, composerName: "The Animals", ... }
{songID: 594, composerName: "Mozart", ... }
<235, title: "House Of The Rising Sun">, <594, title: "The Marriage of Figaro">
<235, composerName: "The Animals">, <594, composerName: "Mozart">
<235, albumName: "The Best of The Animals">, <594, albumName: "The Best of Mozart">

Combining DB records with Map reduce
<235, title: "House Of The Rising Sun">, <594, title: "The Marriage of Figaro">
<235, composerName: "The Animals">, <594, composerName: "Mozart">
<235, albumName: "The Best of The Animals">, <594, albumName: "The Best of Mozart">
Node x, Node y
235: title: "House Of The Rising Sun", albumName: "The Best of The Animals", composerName: "The Animals"
594: title: "The Marriage of Figaro", albumName: "The Best of Mozart", composerName: "Mozart"

Combining DB records with Map reduce
235: title: "House Of The Rising Sun", albumName: "The Best of The Animals", composerName: "The Animals" . . .
594: title: "The Marriage of Figaro", albumName: "The Best of Mozart", composerName: "Mozart" . . .
{ id: 235, title: "House Of The Rising Sun", albumName: "The Best of The Animals", composers: ["The Animals"], tags: ["1964", "folk", "en"] },
{ id: 594, title: "The Marriage of Figaro", albumName: "The Best of Mozart", composers: ["Mozart"], tags: ["1786", "classic"] },
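
As a rough illustration of the slides above, the same join can be written as a map step that keys every record by its song ID, a shuffle that groups the pairs, and a merge per song. This is a single-process sketch: the helper names are hypothetical, the sample rows are the ones on the slides, and tags are left out because no Tags rows are shown.

```python
from collections import defaultdict

# Rows as they might come out of the three per-table queries (slide data).
songs = [
    {"songID": 235, "title": "House Of The Rising Sun"},
    {"songID": 594, "title": "The Marriage of Figaro"},
]
albums = [
    {"songID": 235, "albumName": "The Best of The Animals"},
    {"songID": 594, "albumName": "The Best of Mozart"},
]
composers = [
    {"songID": 235, "composerName": "The Animals"},
    {"songID": 594, "composerName": "Mozart"},
]


def map_records(records):
    """Map: emit <songID, other fields> for every input record."""
    for record in records:
        fields = {k: v for k, v in record.items() if k != "songID"}
        yield record["songID"], fields


def shuffle(pairs):
    """Shuffle: collect all field dicts that share a song ID."""
    grouped = defaultdict(list)
    for song_id, fields in pairs:
        grouped[song_id].append(fields)
    return grouped


def merge_song(song_id, field_dicts):
    """Merge the grouped fields into one record per song."""
    merged = {"id": song_id}
    for fields in field_dicts:
        merged.update(fields)
    return merged


pairs = [p for table in (songs, albums, composers) for p in map_records(table)]
merged_songs = [merge_song(sid, fds) for sid, fds in sorted(shuffle(pairs).items())]
```

No table is joined in the database here: each table is read once and the combination happens on the key, which is the point the slides make.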

AGENDA Combining DB records Introduction to Map reduce Combining DB records with Map reduce Serverless implementation The Apache Solr indexer

Serverless Implementation - Ingredients
SQL Database
AWS Lambda function
AWS S3 bucket
AWS Simple Queue Service (SQS)
AWS DynamoDB
AWS Step function - distributed map
Apache Solr (Elasticsearch, OpenSearch)

Serverless Implementation - Pipeline

Serverless Implementation - Fetch
SELECT SongId, Title, Description, … FROM Songs …. → Songs_part_000 . . . Songs_part_749
SELECT SongId, ComposerName, … FROM Composers JOIN … …. → Composers_part_000 . . . Composers_part_194
SELECT SongId, AlbumName, … FROM Albums JOIN … …. → Albums_part_000 . . . Albums_part_033
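
One way to picture the fetch step is a helper that splits the rows of each query into fixed-size chunks and names them like the part files on the slide. This is only a sketch: `partition_rows` and the part size are hypothetical, and where and how the talk's pipeline actually writes the parts (presumably the S3 bucket from the ingredients list) is not shown here.

```python
from itertools import islice


def partition_rows(rows, table_name, part_size):
    """Yield (part_name, chunk) pairs such as ("Songs_part_000", [...])."""
    iterator = iter(rows)
    part_index = 0
    while True:
        chunk = list(islice(iterator, part_size))
        if not chunk:
            break
        # Zero-padded index, matching the naming on the slide.
        yield f"{table_name}_part_{part_index:03d}", chunk
        part_index += 1


# Five fake rows split into parts of two rows each.
rows = [{"SongId": i, "Title": f"song {i}"} for i in range(5)]
parts = list(partition_rows(rows, "Songs", part_size=2))
```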

Serverless Implementation - Map & Shuffle
Albums_part_063: {songId: 13, albumName: "A"}, {songId: 13, albumName: "B"}, {songId: 15, albumName: "B"},
Tags_part_394: {songId: 15, tagName: "X"}, {songId: 15, tagName: "Y"}, {songId: 65, tagName: "Z"},
Song_13 Song_15 Song_65
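
The routing on this slide can be sketched as follows: every record of a part file is appended to the bucket of its song. A dictionary stands in for the per-song files, and `map_and_shuffle` is a hypothetical name; in the real pipeline several workers would presumably write to the same song file concurrently, which is where the distributed lock on the next slide comes in.

```python
from collections import defaultdict


def map_and_shuffle(part_records, song_files):
    """Append each record to the bucket named after its song ID."""
    for record in part_records:
        song_files[f"Song_{record['songId']}"].append(record)


# The two part files shown on the slide.
albums_part_063 = [
    {"songId": 13, "albumName": "A"},
    {"songId": 13, "albumName": "B"},
    {"songId": 15, "albumName": "B"},
]
tags_part_394 = [
    {"songId": 15, "tagName": "X"},
    {"songId": 15, "tagName": "Y"},
    {"songId": 65, "tagName": "Z"},
]

song_files = defaultdict(list)  # stand-in for the Song_<id> files
map_and_shuffle(albums_part_063, song_files)
map_and_shuffle(tags_part_394, song_files)
```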

Serverless Implementation - Distributed lock
ResourceId  LockId      ExpireAt
13          2049668395  1716812546
245         6739294643  1716812536
Atomic put:
ddb_table.put_item(
    Item={'ResourceId': resource_id, 'ExpireAt': now_ms + timeout_ms, 'LockId': lock_id},
    ConditionExpression='attribute_not_exists(#ResourceId) OR ExpireAt <= :now',
    ExpressionAttributeNames={"#ResourceId": "ResourceId"},
    ExpressionAttributeValues={":now": now_ms})
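
To make the condition expression easier to reason about, here is a small in-memory model of its semantics: the put succeeds only if no lock row exists for the resource or the existing row has expired. The class and method names are hypothetical, the `release` method is an assumption (the slide shows only the acquire side), and unlike DynamoDB's conditional write this model is not atomic across threads or processes.

```python
class InMemoryLockTable:
    """Toy stand-in for the lock table; models only the put condition."""

    def __init__(self):
        self._items = {}

    def put_if_free(self, resource_id, lock_id, now_ms, timeout_ms):
        """Mirror 'attribute_not_exists(ResourceId) OR ExpireAt <= :now'."""
        existing = self._items.get(resource_id)
        if existing is not None and existing["ExpireAt"] > now_ms:
            return False  # lock is held and has not expired yet
        self._items[resource_id] = {
            "ResourceId": resource_id,
            "LockId": lock_id,
            "ExpireAt": now_ms + timeout_ms,
        }
        return True

    def release(self, resource_id, lock_id):
        """Assumed release step: delete the row only if we still own it."""
        existing = self._items.get(resource_id)
        if existing is not None and existing["LockId"] == lock_id:
            del self._items[resource_id]
            return True
        return False


table = InMemoryLockTable()
first = table.put_if_free(13, "lock-a", now_ms=1000, timeout_ms=500)
second = table.put_if_free(13, "lock-b", now_ms=1200, timeout_ms=500)
third = table.put_if_free(13, "lock-b", now_ms=1500, timeout_ms=500)
```

The expiry time is what keeps a crashed worker from holding a song forever: once `ExpireAt` has passed, another worker's put satisfies the condition and takes over the lock.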

Serverless Implementation - Reduce
Song_34
{songId: 235, albumName: "The Animals"},
{songId: 235, tagName: "year: 1964"},
{songId: 235, composerName: "The Animals"},
{songId: 235, title: "House Of The Rising Sun"},
{songId: 235, albumName: "The Best of The Animals"},
{ songId: 235, title: "House Of The Rising Sun", composer: "The Animals", albumNames: ["The Animals", "The Best of The Animals"], tags: ["year: 1964"] }
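
The reduce step shown above can be approximated by folding all records of one song file into a single document. The field mapping below (which fields are single-valued and which are collected into lists) is inferred from the slide's output and is an assumption, as is the name `reduce_song`.

```python
# Assumed mapping from record fields to document fields.
SINGLE_VALUED = {"title": "title", "composerName": "composer"}
MULTI_VALUED = {"albumName": "albumNames", "tagName": "tags"}


def reduce_song(records):
    """Fold the records of one song into a single document."""
    doc = {}
    for record in records:
        for field, value in record.items():
            if field == "songId":
                doc["songId"] = value
            elif field in SINGLE_VALUED:
                doc[SINGLE_VALUED[field]] = value
            elif field in MULTI_VALUED:
                values = doc.setdefault(MULTI_VALUED[field], [])
                if value not in values:  # skip exact duplicates
                    values.append(value)
    return doc


# The records of the song file shown on the slide.
records = [
    {"songId": 235, "albumName": "The Animals"},
    {"songId": 235, "tagName": "year: 1964"},
    {"songId": 235, "composerName": "The Animals"},
    {"songId": 235, "title": "House Of The Rising Sun"},
    {"songId": 235, "albumName": "The Best of The Animals"},
]
document = reduce_song(records)
```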

Serverless Implementation - Batch & Push
Batch size = 2
Batch window = 60 seconds
Maximum concurrency = 2
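
The three settings above read like the options of a queue trigger for a Lambda function; that reading is an assumption. The sketch below only simulates how a batch size and a batch window interact on a stream of timestamped messages: a batch is flushed when it is full or when the window since its first message has elapsed. Maximum concurrency is enforced by the service and is not modelled, and `batch_messages` is a hypothetical helper.

```python
def batch_messages(timed_messages, batch_size=2, window_s=60):
    """Group (arrival_seconds, message) pairs into batches.

    A batch is emitted when it reaches batch_size, or when a message
    arrives window_s or more after the first message of the open batch.
    """
    batch, window_start = [], None
    for arrival_s, message in timed_messages:
        if batch and arrival_s - window_start >= window_s:
            yield batch  # window expired before the batch filled up
            batch = []
        if not batch:
            window_start = arrival_s
        batch.append(message)
        if len(batch) == batch_size:
            yield batch  # batch is full
            batch = []
    if batch:
        yield batch  # flush whatever is left at the end


arrivals = [(0, "a"), (10, "b"), (20, "c"), (100, "d"), (110, "e")]
batched = list(batch_messages(arrivals))
```

Small batches with a time window bound how long a reduced document waits before it is pushed, while the concurrency cap limits how many pushes hit the search engine at once.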

AGENDA Combining DB records Introduction to Map reduce Combining DB records with Map reduce Serverless implementation The Apache Solr indexer

The Apache Serverless Solr indexer
Solution in production since mid-April
Indexing time from about 150 hours to 5 hours (30 times faster)
Cost reduced by 90%