Blazing-Fast Serverless MapReduce Indexer for Apache Solr

Speaker: Daniele Antuzi, R&D Software Engineer @ Sease
BERLIN BUZZWORDS 2024 - 11/06/2024

WHO I AM
DANIELE ANTUZI
R&D Search Software Engineer
Master in Computer Science at University of Pisa
Passionate about algorithms and data structures
Food (and sometimes sport) lover

SEArch SErvices (www.sease.io)
Headquarters in London/distributed
Open-source Enthusiasts
Apache Lucene/Solr experts
Elasticsearch/OpenSearch experts
Community Contributors
Active Researchers
Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning

AGENDA
Combining DB records
Introduction to MapReduce
Combining DB records with MapReduce
Serverless implementation
The Apache Solr indexer

The problem of combining DB records
DB schema: Songs, Composers, Albums, Tags
Solr documents:
{ id: 235, title: "House Of The Rising Sun", albumName: "The Best of The Animals", composers: ["The Animals"], tags: ["1964", "folk", "en"] },
{ id: 594, title: "That's All Right", albumName: "Rock 'n' Roll", composers: ["Elvis Presley"], tags: ["1946", "us"] },


The problem of combining DB records
Simple solution: one query that joins all the tables
SELECT * FROM Songs JOIN Composers ON … JOIN Tags ON … JOIN Albums ON … …
Not scalable:
High number of records
Too much work for DB server

The problem of combining DB records
More scalable: one query per song and per table
songs = getDBRecords("SELECT * FROM Songs")
foreach song in songs:
    getDBRecords("SELECT * FROM Composers c WHERE c.songId = " + song.id)
    getDBRecords("SELECT * FROM Tags t WHERE t.songId = " + song.id)
    getDBRecords("SELECT * FROM Albums a WHERE a.songId = " + song.id)
    . . .

The problem of combining DB records
The per-song query loop above is a simple solution, but:
High database workload
Too slow


MapReduce
Programming model for processing big data stored on a distributed file system
Introduced in the 2004 paper "MapReduce: Simplified Data Processing on Large Clusters"
The user only defines the Map and Reduce functions
Implemented by Apache Hadoop or Apache Spark

MapReduce - Word Count
the: 2469493
quick: 34904
brown: 45865
fox: 3547
jumps: 57843
over: 29044
lazy: 346975
dog: 239685

MapReduce - Word Count - Split
[Figure: the input is split across Node 1, Node 2 and Node 3]

MapReduce - Word Count - Map
Each node (Node 1, Node 2, Node 3) emits <word, count> pairs for its own split:
[ <Berlin, 5>, <Buzzwords, 3> ]
[ <Berlin, 1>, <AI, 7> ]
[ <AI, 5>, <Opensource, 4> ]

MapReduce - Word Count - Shuffle
Before the shuffle, one list of pairs per node (Node 1, Node 2, Node 3):
<Berlin, 5>, <Buzzwords, 3>
<Berlin, 1>, <AI, 7>
<AI, 5>, <Opensource, 4>
After the shuffle, all the counts of a word sit on the same node:
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>

MapReduce - Word Count - Reduce
Input, grouped by word across Node 1, Node 2 and Node 3:
<Berlin, [5, 1]>, <AI, [7, 5]>
<Buzzwords, [3]>
<Opensource, [4]>
Output, the sum of each list:
AI: 12
Berlin: 6
Buzzwords: 3
Opensource: 4
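
The word-count walkthrough can be reproduced in a few lines of single-process Python. This is only a sketch of the three phases, not how Hadoop or Spark implement them; the function names `map_split`, `shuffle` and `reduce_counts` are made up for illustration, and the pairs are the ones shown on the slides.

```python
from collections import defaultdict


def map_split(split_text):
    """Map: count the words of one split and emit (word, count) pairs."""
    counts = defaultdict(int)
    for word in split_text.split():
        counts[word] += 1
    return list(counts.items())


def shuffle(mapped):
    """Shuffle: gather the counts emitted by every node under their word."""
    grouped = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            grouped[word].append(count)
    return dict(grouped)


def reduce_counts(grouped):
    """Reduce: sum the list of counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Output of the Map phase as shown on the slides, one list per node.
mapped = [
    [("Berlin", 5), ("Buzzwords", 3)],
    [("Berlin", 1), ("AI", 7)],
    [("AI", 5), ("Opensource", 4)],
]
grouped = shuffle(mapped)
totals = reduce_counts(grouped)
print(totals)
```

In a real cluster each phase runs in parallel on different nodes; here the three lists simply stand in for the three nodes.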


Combining DB records with MapReduce

Combining DB records with MapReduce
Three input datasets, each record carrying the song ID:
Songs
{ songID: 235, title: "House Of The Rising Sun", . . . },
{ songID: 345, title: "The Magic Flute", . . . }
{ songID: 594, title: "The Marriage of Figaro", . . . }
Albums + Song ID
{ songID: 235, albumName: "The Best of The Animals", . . . },
{ songID: 345, albumName: "The Best of Mozart", . . . },
{ songID: 594, albumName: "The Best of Mozart", . . . }
Composers + Song ID
{ songID: 235, composerName: "The Animals", . . . },
{ songID: 345, composerName: "Mozart", . . . },
{ songID: 594, composerName: "Mozart", . . . }

Combining DB records with MapReduce
Each record is mapped to a <songID, field> pair:
{songID: 235, title: "House Of The Rising Sun", ... }
{songID: 594, title: "The Marriage of Figaro", . . . }
  -> <235, title:"House Of The Rising Sun">, <594, title:"The Marriage of Figaro">
{songID: 235, composerName: "The Animals", ... }
{songID: 594, composerName: "Mozart", ... }
  -> <235, composerName:"The Animals">, <594, composerName:"Mozart">
{songID: 235, albumName: "The Best of The Animals"}
{songID: 594, albumName: "The Best of Mozart"}
  -> <235, albumName:"The Best of The Animals">, <594, albumName:"The Best of Mozart">

Combining DB records with MapReduce
Pairs emitted by the map step:
<235, title:"House Of The Rising Sun">, <594, title:"The Marriage of Figaro">
<235, composerName:"The Animals">, <594, composerName:"Mozart">
<235, albumName:"The Best of The Animals">, <594, albumName:"The Best of Mozart">
Pairs with the same song ID end up on the same node (Node x, Node y):
235: title:"House Of The Rising Sun", albumName:"The Best of The Animals", composerName:"The Animals"
594: title:"The Marriage of Figaro", albumName:"The Best of Mozart", composerName:"Mozart"

Combining DB records with MapReduce
Fields grouped by song ID:
235: title: "House Of The Rising Sun", albumName: "The Best of The Animals", composerName: "The Animals", . . .
594: title: "The Marriage of Figaro", albumName: "The Best of Mozart", composerName: "Mozart", . . .
Resulting Solr documents:
{ id: 235, title: "House Of The Rising Sun", albumName: "The Best of The Animals", composers: ["The Animals"], tags: ["1964", "folk", "en"] },
{ id: 594, title: "The Marriage of Figaro", albumName: "The Best of Mozart", composers: ["Mozart"], tags: ["1786", "classic"] },
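
The same join can be sketched in plain Python to show why no SQL JOIN is needed: every dataset is read independently and the song ID acts as the MapReduce key. The helper names `map_record`, `shuffle` and `reduce_group` are hypothetical, and the flat merge in the reduce step is a simplification of the final Solr document shown on the slide.

```python
from collections import defaultdict

# The three datasets from the slides, each record carrying the song ID.
songs = [
    {"songID": 235, "title": "House Of The Rising Sun"},
    {"songID": 594, "title": "The Marriage of Figaro"},
]
albums = [
    {"songID": 235, "albumName": "The Best of The Animals"},
    {"songID": 594, "albumName": "The Best of Mozart"},
]
composers = [
    {"songID": 235, "composerName": "The Animals"},
    {"songID": 594, "composerName": "Mozart"},
]


def map_record(record):
    """Map: emit one (songID, (field, value)) pair per non-key field."""
    key = record["songID"]
    return [(key, (f, v)) for f, v in record.items() if f != "songID"]


def shuffle(pairs):
    """Shuffle: gather all the pairs that share a song ID."""
    groups = defaultdict(list)
    for key, field_value in pairs:
        groups[key].append(field_value)
    return groups


def reduce_group(song_id, field_values):
    """Reduce: fold the fields of one song into a single flat record."""
    combined = {"songID": song_id}
    for field, value in field_values:
        combined[field] = value
    return combined


pairs = [
    pair
    for dataset in (songs, albums, composers)
    for record in dataset
    for pair in map_record(record)
]
combined = [reduce_group(k, v) for k, v in sorted(shuffle(pairs).items())]
print(combined)
```

Each dataset is scanned exactly once, so the database only serves simple sequential reads and the combining work moves to the workers.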


Serverless Implementation - Ingredients
SQL Database
AWS Lambda function
AWS S3 bucket
AWS Simple Queue Service (SQS)
AWS DynamoDB
AWS Step Functions - Distributed Map
Apache Solr (Elasticsearch, OpenSearch)

Serverless Implementation - Pipeline

Serverless Implementation - Fetch
Each query is exported to a set of part files:
SELECT SongId, Title, Description, … FROM Songs ….
  -> Songs_part_000 . . . Songs_part_749
SELECT SongId, ComposerName, … FROM Composers JOIN … ….
  -> Composers_part_000 . . . Composers_part_194
SELECT SongId, AlbumName, … FROM Albums JOIN … ….
  -> Albums_part_000 . . . Albums_part_033
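
A minimal sketch of how a query result could be cut into part files named like the ones above. The slides do not say how the parts are produced or stored, so the chunk size, the JSON Lines body and the `partition_rows` helper are all assumptions; in the real pipeline each body would presumably be written to the S3 bucket.

```python
import json


def partition_rows(table_name, rows, part_size):
    """Yield (object_key, body) pairs, one per chunk of `part_size` rows.

    Keys follow the <Table>_part_NNN naming seen on the slide; the body is
    JSON Lines (one record per line), which is an assumed format.
    """
    for part_no, start in enumerate(range(0, len(rows), part_size)):
        chunk = rows[start:start + part_size]
        key = f"{table_name}_part_{part_no:03d}"
        body = "\n".join(json.dumps(row) for row in chunk)
        yield key, body


# Five made-up rows standing in for the result of the Songs query.
rows = [{"SongId": i, "Title": f"Song {i}"} for i in range(5)]
parts = dict(partition_rows("Songs", rows, part_size=2))
print(list(parts))
```

Fixed-size parts give the later map step independent units of work that can be processed in parallel.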


Serverless Implementation - Map & Shuffle
Albums_part_063
{songId:13, albumName: "A"},
{songId:13, albumName: "B"},
{songId:15, albumName: "B"},
Tags_part_394
{songId:15, tagName: "X"},
{songId:15, tagName: "Y"},
{songId:65, tagName: "Z"},
Records are routed by songId to one file per song: Song_13, Song_15, Song_65
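
The routing on this slide can be sketched as follows. A plain dictionary stands in for the S3 bucket, and `map_and_shuffle` is a hypothetical helper, not the production Lambda code; only the part-file contents and the `Song_<id>` naming come from the slide.

```python
import json
from collections import defaultdict


def map_and_shuffle(part_records, store):
    """Append every record of one part file to the entry of its song."""
    by_song = defaultdict(list)
    for record in part_records:
        by_song[record["songId"]].append(record)
    for song_id, records in by_song.items():
        key = f"Song_{song_id}"
        lines = [json.dumps(r) for r in records]
        # Read-modify-write: two workers updating the same key at once could
        # lose records, which is presumably what the per-song lock prevents.
        store[key] = store.get(key, []) + lines
    return sorted(f"Song_{song_id}" for song_id in by_song)


store = {}  # stand-in for the S3 bucket (assumption)
albums_part_063 = [
    {"songId": 13, "albumName": "A"},
    {"songId": 13, "albumName": "B"},
    {"songId": 15, "albumName": "B"},
]
tags_part_394 = [
    {"songId": 15, "tagName": "X"},
    {"songId": 15, "tagName": "Y"},
    {"songId": 65, "tagName": "Z"},
]
map_and_shuffle(albums_part_063, store)
map_and_shuffle(tags_part_394, store)
print(sorted(store))
```

After both parts are processed, `Song_15` holds one album record and two tag records, so the reduce step can build that song's document from a single entry.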

Serverless Implementation - Distributed lock
DynamoDB lock table:
ResourceId    LockId        ExpireAt
13            2049668395    1716812546
245           6739294643    1716812536
Atomic put:
ddb_table.put_item(
    Item={'ResourceId': resource_id, 'ExpireAt': now_ms + timeout_ms, 'LockId': lock_id},
    ConditionExpression='attribute_not_exists(#ResourceId) OR ExpireAt <= :now',
    ExpressionAttributeNames={"#ResourceId": "ResourceId"},
    ExpressionAttributeValues={":now": now_ms})
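
A sketch of how the atomic put could be wrapped into acquire and release helpers. The `put_item` call mirrors the slide; everything else is an assumption for illustration: the function names, the random `lock_id`, the default timeout, and the conditional delete used to release the lock. `ddb_table` is expected to be a boto3 DynamoDB `Table` resource.

```python
import time
import uuid


def try_acquire_lock(ddb_table, resource_id, timeout_ms=30_000):
    """Return a lock id if the lock was taken, or None if it is held."""
    now_ms = int(time.time() * 1000)
    lock_id = uuid.uuid4().hex
    try:
        # Succeeds only if no lock row exists or the existing one has expired.
        ddb_table.put_item(
            Item={"ResourceId": resource_id,
                  "ExpireAt": now_ms + timeout_ms,
                  "LockId": lock_id},
            ConditionExpression="attribute_not_exists(#ResourceId) OR ExpireAt <= :now",
            ExpressionAttributeNames={"#ResourceId": "ResourceId"},
            ExpressionAttributeValues={":now": now_ms},
        )
    except ddb_table.meta.client.exceptions.ConditionalCheckFailedException:
        return None
    return lock_id


def release_lock(ddb_table, resource_id, lock_id):
    """Delete the lock row, but only if we still own it (assumed scheme)."""
    try:
        ddb_table.delete_item(
            Key={"ResourceId": resource_id},
            ConditionExpression="LockId = :lock_id",
            ExpressionAttributeValues={":lock_id": lock_id},
        )
    except ddb_table.meta.client.exceptions.ConditionalCheckFailedException:
        return False
    return True
```

Because DynamoDB evaluates the condition and the write as one operation, only one of two concurrent callers can succeed. The expiry lets a lock left behind by a crashed worker be taken over once `ExpireAt` has passed.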


Serverless Implementation - Reduce
Song_34
{songId: 235, albumName: "The Animals"},
{songId: 235, tagName: "year: 1964"},
{songId: 235, composerName: "The Animals"},
{songId: 235, title: "House Of The Rising Sun"},
{songId: 235, albumName: "The Best of The Animals"},
Reduced document:
{ songId: 235, title: "House Of The Rising Sun", composer: "The Animals", albumNames: ["The Animals", "The Best of The Animals"], tags: ["year: 1964"] }
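
The reduce step on this slide can be sketched as a fold over the records of one song. The `FIELD_MAP` table and the `reduce_song` helper are hypothetical; they only encode what the slide shows, namely that `title` and `composer` are single-valued while `albumNames` and `tags` are lists.

```python
# Hypothetical mapping: record field -> (document field, is multi-valued).
FIELD_MAP = {
    "title": ("title", False),
    "composerName": ("composer", False),
    "albumName": ("albumNames", True),
    "tagName": ("tags", True),
}


def reduce_song(records):
    """Merge all the records of one song into a single document."""
    doc = {}
    for record in records:
        doc.setdefault("songId", record["songId"])
        for field, value in record.items():
            if field == "songId":
                continue
            target, multi_valued = FIELD_MAP[field]
            if multi_valued:
                values = doc.setdefault(target, [])
                if value not in values:  # skip duplicates
                    values.append(value)
            else:
                doc[target] = value
    return doc


# Records of the per-song file, in the order shown on the slide.
records = [
    {"songId": 235, "albumName": "The Animals"},
    {"songId": 235, "tagName": "year: 1964"},
    {"songId": 235, "composerName": "The Animals"},
    {"songId": 235, "title": "House Of The Rising Sun"},
    {"songId": 235, "albumName": "The Best of The Animals"},
]
doc = reduce_song(records)
print(doc)
```

The records can arrive in any order, since each field is routed by name rather than by position.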


Serverless Implementation - Batch & Push
Batch size = 2
Batch window = 60 seconds
Maximum concurrency = 2
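
A sketch of what the push side could look like as an SQS-triggered Lambda handler that forwards each batch to Solr's JSON update handler. The Solr base URL, the `songs` collection name, the one-document-per-message layout and the injectable `send` parameter are assumptions, not details from the talk. No commit is issued, so this relies on Solr's autoCommit settings.

```python
import json
import urllib.request

SOLR_BASE_URL = "http://localhost:8983"  # assumed address
COLLECTION = "songs"                     # assumed collection name


def build_solr_update(solr_base_url, collection, docs):
    """Build (without sending) a POST request for Solr's /update handler."""
    return urllib.request.Request(
        url=f"{solr_base_url}/solr/{collection}/update",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handler(event, context, send=urllib.request.urlopen):
    """Lambda entry point: index every document of one SQS batch."""
    # SQS delivers the batch under "Records"; the message text is in "body".
    # Assumption: each message body is one JSON-encoded Solr document.
    docs = [json.loads(record["body"]) for record in event["Records"]]
    request = build_solr_update(SOLR_BASE_URL, COLLECTION, docs)
    with send(request) as response:
        return {"indexed": len(docs), "status": response.status}
```

With the settings on the slide, each invocation would receive at most two messages, and at most two invocations would run at once, which caps the write pressure on Solr.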


The serverless Apache Solr indexer
Solution in production since mid-April
Indexing time reduced from about 150 hours to 5 hours (30 times faster)
Cost reduced by 90%
