Blazing-Fast Serverless MapReduce Indexer for Apache Solr
SeaseLtd
Jun 19, 2024
About This Presentation
Indexing data from databases to Apache Solr has always been an open problem: for a while, the Data Import Handler was used even though it was not recommended for production environments. Traditional indexing processes often encounter scalability challenges, especially with large datasets.
In this talk, we explore the architecture and implementation of a serverless MapReduce indexer designed for Apache Solr but extendable to any search engine. By embracing a serverless approach, we can take advantage of the elasticity and scalability offered by cloud services like AWS Lambda, enabling efficient indexing without needing to manage infrastructure.
We dig into the principles of MapReduce, a programming model for processing large datasets, and discuss how it can be adapted for indexing documents into Apache Solr. Using AWS Step Functions to orchestrate multiple Lambdas, we demonstrate how to distribute indexing tasks across multiple resources, achieving parallel processing and significantly reducing indexing times.
Through practical examples, we address key considerations such as data partitioning, fault tolerance, concurrency, and cost.
We also cover integration points with other AWS services, such as Amazon S3 for data storage and retrieval and DynamoDB for distributed locking between the Lambda instances.
WHO I AM
DANIELE ANTUZI
R&D Search Software Engineer
Master in Computer Science at University of Pisa
Passionate about algorithms and data structures
Food (and sometimes sport) lover
SEArch SErvices
www.sease.io
Headquartered in London/distributed
Open-source Enthusiasts
Apache Lucene/Solr experts
Elasticsearch/OpenSearch experts
Community Contributors
Active Researchers
Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevance Tuning
AGENDA
Combining DB records
Introduction to Map reduce
Combining DB records with Map reduce
Serverless implementation
The Apache Solr indexer
The problem of combining DB records
DB schema: Songs, Composers, Albums, Tags
Solr document:
    { id: 235, title: "House Of The Rising Sun", albumName: "The Best of The Animals", composers: ["The Animals"], tags: ["1964", "folk", "en"] },
    { id: 594, title: "That's All Right", albumName: "Rock 'n' Roll", composers: ["Elvis Presley"], tags: ["1946", "us"] },
The problem of combining DB records
    SELECT * FROM Songs JOIN Composers ON … JOIN Tags ON … JOIN Albums ON … …
Simple solution
Not scalable
High number of records
Too much work for DB server
The problem of combining DB records
    songs = getDBRecords("SELECT * FROM Songs")
    foreach song in songs:
        getDBRecords("SELECT * FROM Composers c WHERE c.songId = " + song.id)
        getDBRecords("SELECT * FROM Tags t WHERE t.songId = " + song.id)
        getDBRecords("SELECT * FROM Albums a WHERE a.songId = " + song.id)
        . . .
More scalable
Simple solution
High Database workload
Too slow
Map Reduce
Programming pattern to access big data from a distributed FS
Paper "MapReduce: Simplified Data Processing on Large Clusters" in 2004
The user only defines the functions Map and Reduce
Implemented by Apache Hadoop or Apache Spark
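To make "the user only defines the functions Map and Reduce" concrete, here is a minimal single-machine sketch of the pattern. It is not the Hadoop or Spark API: the driver `map_reduce` and the functions `word_map` and `word_reduce` are hypothetical names, and word counting is only the customary toy workload. A real framework would run the same three phases across many nodes.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in map_fn(record):
            # Shuffle phase: values with the same key end up together.
            groups[key].append(value)
    # Reduce phase: fold each group of values into one result per key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def word_map(line):
    """User-defined Map: emit (word, 1) for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def word_reduce(word, counts):
    """User-defined Reduce: add up the ones emitted for this word."""
    return sum(counts)


counts = map_reduce(["map reduce", "map only"], word_map, word_reduce)
```

Only `word_map` and `word_reduce` are specific to the problem; the driver stays the same for any other job.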
Combining DB records with Map reduce
Combining DB records with Map reduce
Songs:
    { songID: 235, title: "House Of The Rising Sun", . . . },
    { songID: 345, title: "The Magic Flute", . . . },
    { songID: 594, title: "The Marriage of Figaro", . . . }
Albums + Song ID:
    { songID: 235, albumName: "The Best of The Animals", . . . },
    { songID: 345, albumName: "The Best of Mozart", . . . },
    { songID: 594, albumName: "The Best of Mozart", . . . }
Composers + Song ID:
    { songID: 235, composerName: "The Animals", . . . },
    { songID: 345, composerName: "Mozart", . . . },
    { songID: 594, composerName: "Mozart", . . . }
Combining DB records with Map reduce
Input records:
    {songID: 235, title: "House Of The Rising Sun", . . . }
    {songID: 594, title: "The Marriage of Figaro", . . . }
    {songID: 235, albumName: "The Best of The Animals"}
    {songID: 594, albumName: "The Best of Mozart"}
    {songID: 235, composerName: "The Animals", . . . }
    {songID: 594, composerName: "Mozart", . . . }
Mapped <key, value> pairs:
    <235, title: "House Of The Rising Sun">, <594, title: "The Marriage of Figaro">
    <235, composerName: "The Animals">, <594, composerName: "Mozart">
    <235, albumName: "The Best of The Animals">, <594, albumName: "The Best of Mozart">
Combining DB records with Map reduce
Mapped <key, value> pairs:
    <235, title: "House Of The Rising Sun">, <594, title: "The Marriage of Figaro">
    <235, composerName: "The Animals">, <594, composerName: "Mozart">
    <235, albumName: "The Best of The Animals">, <594, albumName: "The Best of Mozart">
Grouped by key across nodes (Node x, Node y):
    235: title: "House Of The Rising Sun", albumName: "The Best of The Animals", composerName: "The Animals"
    594: title: "The Marriage of Figaro", albumName: "The Best of Mozart", composerName: "Mozart"
Combining DB records with Map reduce
Grouped by key:
    235: title: "House Of The Rising Sun", albumName: "The Best of The Animals", composerName: "The Animals", . . .
    594: title: "The Marriage of Figaro", albumName: "The Best of Mozart", composerName: "Mozart", . . .
Reduced documents:
    { id: 235, title: "House Of The Rising Sun", albumName: "The Best of The Animals", composers: ["The Animals"], tags: ["1964", "folk", "en"] },
    { id: 594, title: "The Marriage of Figaro", albumName: "The Best of Mozart", composers: ["Mozart"], tags: ["1786", "classic"] },
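The map, group, and reduce steps on these slides can be sketched end to end as follows. This is an in-memory illustration under stated assumptions, not the talk's implementation: the helper names `map_record`, `shuffle`, and `reduce_song` are hypothetical, the sample rows are the ones shown on the slides, and tags are left out because no Tags input appears in this part of the deck.

```python
from collections import defaultdict

songs = [
    {"songID": 235, "title": "House Of The Rising Sun"},
    {"songID": 594, "title": "The Marriage of Figaro"},
]
albums = [
    {"songID": 235, "albumName": "The Best of The Animals"},
    {"songID": 594, "albumName": "The Best of Mozart"},
]
composers = [
    {"songID": 235, "composerName": "The Animals"},
    {"songID": 594, "composerName": "Mozart"},
]


def map_record(record):
    """Map: turn one row into a <songID, remaining fields> pair."""
    fields = {k: v for k, v in record.items() if k != "songID"}
    return record["songID"], fields


def shuffle(pairs):
    """Shuffle: collect all pairs that share a songID in one list."""
    grouped = defaultdict(list)
    for song_id, fields in pairs:
        grouped[song_id].append(fields)
    return grouped


def reduce_song(song_id, partials):
    """Reduce: merge the partial rows of one song into a single document."""
    doc = {"id": song_id, "composers": []}
    for fields in partials:
        if "composerName" in fields:
            # Assumption: composer rows only contribute their name.
            doc["composers"].append(fields["composerName"])
        else:
            doc.update(fields)
    return doc


pairs = [map_record(row) for row in songs + albums + composers]
docs = [reduce_song(sid, parts) for sid, parts in sorted(shuffle(pairs).items())]
```

No table is joined to another here: each source is read once, and the song ID used as the key does the work that the SQL joins did before.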
Serverless Implementation - Ingredients
SQL Database
AWS Lambda function
AWS S3 bucket
AWS Simple Queue Service (SQS)
AWS DynamoDB
AWS Step Functions - Distributed Map
Apache Solr (Elasticsearch, OpenSearch)
Serverless Implementation - Pipeline
Serverless Implementation - Fetch
    SELECT SongId, Title, Description, … FROM Songs ….
    Songs_part_000 . . . Songs_part_749
    SELECT SongId, ComposerName, … FROM Composers JOIN … ….
    Composers_part_000 . . . Composers_part_194
    SELECT SongId, AlbumName, … FROM Albums JOIN … ….
    Albums_part_000 . . . Albums_part_033
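One way to picture the fetch step is a function that runs a single query and cuts the result into numbered parts. The sketch below is a guess at the shape of that step, not the presented code: it uses an in-memory SQLite table instead of the production database, a plain dictionary instead of the S3 bucket, and JSON lines as an assumed file format. The function name `fetch_partitions` and the `rows_per_part` parameter are hypothetical; only the `<Table>_part_NNN` naming comes from the slide.

```python
import json
import sqlite3


def fetch_partitions(conn, table, query, rows_per_part):
    """Run one query and split its rows into numbered JSON-lines parts."""
    cursor = conn.execute(query)
    columns = [col[0] for col in cursor.description]
    parts = {}
    index = 0
    while True:
        # Read a bounded chunk so the full result never sits in memory.
        rows = cursor.fetchmany(rows_per_part)
        if not rows:
            break
        lines = [json.dumps(dict(zip(columns, row))) for row in rows]
        # The dictionary stands in for object storage such as an S3 bucket.
        parts[f"{table}_part_{index:03d}"] = "\n".join(lines)
        index += 1
    return parts


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Songs (SongId INTEGER, Title TEXT)")
conn.executemany(
    "INSERT INTO Songs VALUES (?, ?)",
    [
        (235, "House Of The Rising Sun"),
        (345, "The Magic Flute"),
        (594, "The Marriage of Figaro"),
    ],
)
parts = fetch_partitions(
    conn, "Songs", "SELECT SongId, Title FROM Songs ORDER BY SongId", 2
)
```

With three rows and two rows per part, the call produces `Songs_part_000` and `Songs_part_001`; each part can then be handed to a separate worker.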
Serverless Implementation - Reduce
Song_34
    {songId: 235, albumName: "The Animals"},
    {songId: 235, tagName: "year: 1964"},
    {songId: 235, composerName: "The Animals"},
    {songId: 235, title: "House Of The Rising Sun"},
    {songId: 235, albumName: "The Best of The Animals"},
Result:
    { songId: 235, title: "House Of The Rising Sun", composer: "The Animals", albumNames: ["The Animals", "The Best of The Animals"], tags: ["year: 1964"] }
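The merge shown on this slide, where repeated `albumName` rows become one `albumNames` list, could look roughly like the sketch below. The slide shows only the input rows and the merged result, so the field-mapping tables, the function name `reduce_group`, and the choice to skip exact repeats are all assumptions made for illustration.

```python
# Hypothetical mapping from source column to document field.
SINGLE_VALUED = {"title": "title", "composerName": "composer"}
MULTI_VALUED = {"albumName": "albumNames", "tagName": "tags"}


def reduce_group(rows):
    """Merge all rows that share one songId into a single document."""
    doc = {"songId": rows[0]["songId"]}
    for row in rows:
        for column, value in row.items():
            if column in SINGLE_VALUED:
                # Single-valued field: the last row seen wins.
                doc[SINGLE_VALUED[column]] = value
            elif column in MULTI_VALUED:
                # Multi-valued field: append in arrival order, skip repeats.
                values = doc.setdefault(MULTI_VALUED[column], [])
                if value not in values:
                    values.append(value)
    return doc


rows = [
    {"songId": 235, "albumName": "The Animals"},
    {"songId": 235, "tagName": "year: 1964"},
    {"songId": 235, "composerName": "The Animals"},
    {"songId": 235, "title": "House Of The Rising Sun"},
    {"songId": 235, "albumName": "The Best of The Animals"},
]
document = reduce_group(rows)
```

Keeping the column-to-field rules in two small tables means a new source column needs one table entry rather than a change to the merge loop.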
The Apache Serverless Solr indexer
Solution in production since mid-April
Indexing time reduced from about 150 hours to 5 hours (30 times faster)
Cost reduced by 90%