TalkPlay: An LLM-based music recommendation — Its model, data, and the future

Keunwoo Choi, Oct 27, 2025

About This Presentation

Keunwoo's Invited Talk about "TalkPlay" at RIT (Rochester Institute of Technology), 2025/10/24.


Slide Content

Keunwoo Choi, October 2025
TalkPlay
An LLM-based music recommendation — Its model, data, and the future

Presenter: Keunwoo Choi

Keunwoo Choi
is a Machine Learning Engineer in..
[Career timeline figure, 2014 → today (2025/10/10): Applied Acoustics (KAIST), 3D Audio (ETRI), Deep Learning PhD (Queen Mary), Music Recsys / Music AI (Spotify, NYU), Music AI (ByteDance), GenAudio (Gaudio), LLM Music Recsys (TalkPlay), LLM for Drug Discovery (Genentech); milestones at 2014, 2020, 2023]

Scatterplot of Contents
[Same career timeline figure, with each topic tagged MIR, LLM, or BOTH]

Music Information Retrieval (MIR)

ISMIR (2000 — now)
Music Information Retrieval: "extract info!"
https://ismir.net/
https://ismir2025.ismir.net/

Industry
E.g., sponsors of ISMIR over the years (2005, 2015, 2025)

My MIR Tasks
•Tagging (various types of music classification) // during PhD & at ByteDance
•Drum Transcription // for fun & at Spotify
•Representation Learning // for Spotify
•Music Captioning // at Genentech, side project
•Lyrics Alignment // at Gaudio
•Source Separation // at ByteDance and Gaudio

[Career timeline figure again, annotated: 2022 Nov, ChatGPT release]

Large Language Models (LLM)

LLM?
•You know, ChatGPT, Gemini, Claude, Deepseek, Grok, Llama, Qwen, ..

Next Remark [1 min]:
Training LLMs

Next Token Prediction
input: "Imagine all the people" → LLM → output candidates: "living", "sharing", "that", ..

Next Token Prediction
input: "Imagine all the people living" → LLM → output candidates: "life", "as", "their", "in", "like", ..

Next Token Prediction
•Given a sequence, LLMs are trained to predict the next token at every position, auto-regressively.
•In this talk, illustrations are simplified as if it's sequence-to-sequence.
Corpus: "Imagine all the people living for today"
input: "Imagine" → output: "all"
input: "Imagine all" → output: "the"
input: "Imagine all the" → output: "people"
input: "Imagine all the people" → output: "living"
input: "Imagine all the people living" → output: "for"
input: "Imagine all the people living for" → output: "today"
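Below is a minimal PyTorch sketch (my illustration, not code from the talk) of how one sentence yields a next-token training target at every position; the token IDs and the random logits standing in for a model are made up.

```python
# Minimal sketch: next-token prediction loss (illustration only, not the talk's code).
import torch
import torch.nn.functional as F

# Made-up token IDs for "Imagine all the people living for today"
token_ids = torch.tensor([[101, 42, 7, 256, 9, 88, 12]])

inputs = token_ids[:, :-1]   # "Imagine ... for"
targets = token_ids[:, 1:]   # "all ... today": each position predicts the next token

vocab_size = 1000
logits = torch.randn(1, inputs.shape[1], vocab_size)  # stand-in for model(inputs)

# Cross-entropy at every position at once: one sequence = many training examples.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())
```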

Training with Next Token Prediction
= Learning to Solve Tasks
= Injecting knowledge
[Figure: GPT-2]

Remark 1:
Next-Token Prediction
can pretrain a powerful model
that works in many domains.

Next Remark [3 min]:
Making LLMs Talk

Pretraining vs. Post-training
•After pre-training LLMs on a huge corpus, they can auto-complete accurately in the domains included in the corpus.

Pretrained LLM:
input: "The capital city of France is" → output: "Paris, which is the biggest.."

More inputs it can continue:
"The capital city of South Korea is Seoul. The capital city of Japan is"
"The human genome consists of DNA, a molecule made of .."
"What is the capital city of France? This obvious question has been.."
"1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 .."

•But, you (may) want it to talk back!

Post-trained LLM:
input: "What's the capital city of France?" → output: "Paris!"
"Tell me the capital city of S. Korea." → "It's Seoul."
"What do human genes consist of?" → "They consist of DNA."
"What's the capital city of France?" → "It is still Paris!"
"Calculate 1 + 2 + 3 + 4?" → "The answer is 10."

•Post-training / Conversational / Instruction-following
Pretrained LLM → Post-Training → ChatGPT, Gemini, Claude, Deepseek, ..

Pretraining vs. Post-training

                                 Pretraining             Post-training
Training method (loss function)  Next token prediction   Next token prediction
Corpus                           Documents, books, ..    (Instruction-following) conversations
Goal                             Knowledge injection     Behavior control

Training is the same: next-token prediction
•Using a lot of {instruction, response} pairs.
•Technically, the post-trained LLM still performs "auto-complete".
•But the completion - the LLM response - is now about following the instruction.
"What's the capital city of France?" → "Paris!"
"Tell me the capital city of S. Korea." → "It's Seoul."
"What do human genes consist of?" → "They consist of DNA."
"What's the capital city of France?" → "It is still Paris!"
"Calculate 1 + 2 + 3 + 4?" → "The answer is 10."
Pretrained LLM → Post-trained LLM
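To make the "same loss, different data" point concrete, here is a minimal sketch (my illustration; the chat template, toy tokenizer, and helper are hypothetical) of how an {instruction, response} pair becomes a next-token-prediction example, with the loss masked to the response:

```python
# Minimal sketch: instruction tuning reuses the next-token loss (illustration only).
IGNORE_INDEX = -100  # common convention for "skip this position in the loss"

def build_example(tokenize, instruction: str, response: str):
    prompt_ids = tokenize(f"User: {instruction}\nAssistant: ")
    response_ids = tokenize(response)
    input_ids = prompt_ids + response_ids
    # Loss is computed only where labels != IGNORE_INDEX, so the model learns
    # to produce the response rather than to reproduce the instruction.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

# Toy tokenizer: one "token" per whitespace-separated word, hashed to an ID.
toy_tokenize = lambda text: [hash(w) % 50000 for w in text.split()]
ids, labels = build_example(toy_tokenize, "What's the capital city of France?", "Paris!")
print(len(ids), labels)
```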

Remark 2:
Next-Token Prediction
with {instruction, response}
makes LLMs talk.

Next Remark [5 min]:
Multimodality

Recap
•LLMs are huge auto-completion machines that turn out to be very useful.
•Why so useful?
•The model architecture is general enough for any sequence.
•"any" = any domain
•"any" = any behavior
•and..

Multimodal LLM
[Figure: some kind of multimodal data → LLM → output]

Advanced: actual audio-in, audio-out also exists
(But today, we will focus on audio "analysis" only)
https://kyutai.org/next/codec-explainer

Multimodal LLM
[Figure: text tokenizer + audio tokenizer → LLM → output: "This music is a 90's rock music, with a fast tempo and funky guitar."]

Advanced: Multimodal Inputs
•How?
•"Patchify" the image or audio, process it, and feed it to the LLM.
•I.e., make whatever data a sequence, and treat it like a sequence.
•The LLM learns to understand the input information, along with text (as usual). (A patchify sketch follows.)
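As a rough illustration of "patchify", here is a minimal PyTorch sketch (my assumption: a mel spectrogram input and a single linear projection; real systems use trained audio encoders):

```python
# Minimal sketch: turning audio into a sequence of LLM-sized vectors (illustration only).
import torch

mel = torch.randn(1, 128, 3000)          # (batch, mel bins, frames): ~30 s of audio
patches = mel.unfold(2, 10, 10)          # cut the time axis into patches of 10 frames
patches = patches.permute(0, 2, 1, 3).reshape(1, -1, 128 * 10)  # (batch, 300 patches, patch dim)

proj = torch.nn.Linear(128 * 10, 4096)   # project each patch into the LLM embedding space
audio_embeddings = proj(patches)         # (1, 300, 4096): a "sentence" of audio vectors

# These vectors are concatenated with text token embeddings and fed to the LLM as one sequence.
print(audio_embeddings.shape)
```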

E.g., LLark

E.g., Google Gemini

Let's ask Gemini!
"Describe this music clip in the best detail you can."
you and i - keunwoo.OOO

"gemini-1.5-flash": "The audio clip is a short excerpt of what sounds like Korean vocal music.
It's characterized by:* **Vocal Style:** The vocals are primarily melodic, with a focus on
a single, clear voice . The style is expressive, with subtle variations in pitch and volume that
suggest emotion. The singer uses a style that could be described as lyrical and perhaps
slightly melancholic . The pronunciation is clear, but without the strong, forceful quality of
some traditional Korean vocal styles.*
**Musical Accompaniment:** There is no discernible instrumental accompaniment in this short
clip. The audio focuses solely on the voice .*
**Tempo and Rhythm:** The tempo is relatively slow to moderate . The rhythm is somewhat free-
flowing, not strictly adhering to a rigid beat, but still possessing a distinct melodic pulse.*
**Pitch and Melody:** The melody is relatively simple, but pleasing. It moves primarily within
a limited range, suggesting a focus on expression rather than virtuosic display . There's a
gentle rise and fall in the pitch, creating a sense of movement and narrative within the
melody.
**Overall Impression:** The overall impression is one of quiet intimacy and emotional depth . It
sounds like a segment of a larger vocal piece, possibly a song or aria, characterized by a
refined and expressive vocal delivery. The lack of instrumental accompaniment enhances the
focus on the voice and its emotional conveyance.”

"gemini-2.0-flash-lite" : "This music clip features a simple yet effective arrangement.
Acoustic guitar plays a gentle, fingerpicked melody, possibly with a light reverb effect. The
music begins with a clear and slightly echoey acoustic guitar introduction, followed by a male
vocalist. The vocalist sings a soft, melancholic melody . The production emphasizes the intimacy
and emotional weight of the performance, with the simple instrumentation allowing the singer's
voice to take center stage. The overall feel is calm, a little wistful, and intimate, creating
a peaceful atmosphere. It seems like a nice piece of singer-songwriter material.”
“gemini-2.0-flash": "This is a slow, melancholic Korean pop song . It features a male vocalist
with a gentle, slightly grainy voice singing in Korean, accompanied by a simple piano melody
that emphasizes a sense of longing and reflection . The music has a heartfelt, emotional
quality."

"gemini-2.5-flash-preview-05-20" : "This is a **serene and mellow K-R&B (Korean R&B) or K-Soul ballad **, with
a strong influence of smooth jazz and contemporary R&B . The tempo is slow and unhurried , creating a very
relaxed and intimate atmosphere.
Here's a detailed breakdown:
* **Tempo:** Very slow (Adagio/Largo), estimated around 60-70 BPM, giving it a peaceful, unhurried feel.
* **Mood/Atmosphere:** Calm, contemplative, gentle, warm, and somewhat introspective. It evokes a
feeling of quiet comfort or peaceful reflection.
* **Vocalist:** A male vocalist with a smooth, gentle, and slightly breathy delivery sings in Korean.
His voice is clear and has a comforting, almost crooning quality, resting comfortably in a baritone or lower
tenor range.
* **Instrumentation:** The arrangement is sparse and tasteful, focusing on a warm, layered texture.
* **Keyboards/Synth:** A soft, sustained synth pad creates an airy, spacious, and warm backdrop
throughout the clip. There may be subtle electric piano or piano chords layered beneath the pad.
* **Drums/Percussion:** The drums are exceptionally subtle, providing a soft, understated rhythm
section. A gentle kick drum lays down the beat, accompanied by a very light snare (possibly brushed or
played with soft sticks) and delicate hi-hat work, ensuring the rhythm is felt rather than explicitly heard.
* **Bass:** A warm, smooth electric bass provides a foundational, melodic pulse without being overly
prominent, blending seamlessly into the background.
* **Guitar:** A clean-toned electric guitar plays sparse, delicate melodic lines and arpeggiated chords,
often in the higher register. It functions as a decorative, atmospheric element rather than a lead
instrument, adding touches of warmth and introspection.
**Timeline Breakdown:**
* **0:00 - 0:02:** The clip begins with a brief instrumental intro featuring the sustained synth pad and
a short, clean electric guitar motif, setting the mellow tone.
* **0:02 - 0:09:** The male vocalist enters smoothly, carrying the melody in Korean, accompanied by the
full, subtle instrumentation of the bass, drums , synth pad, and the occasional delicate guitar lick. The
dynamics remain consistently soft and gentle throughout this section.
The overall texture is very soft, airy, and layered, with no harsh or loud elements. The dynamics remain
relatively consistent and low, maintaining a sense of calm."

Remark 3:
LLMs can understand
multimodal data,
just like text.

Remark 1 + 2 + 3
•1. LLMs are large, and can learn from a large amount of data, by next token prediction.
•2. We can make them strong instruction followers with the same next token prediction.
•3. We can also make them understand audio with next token prediction.
•1 + 3: We can let an LLM learn from a large amount of audio data.
•2 + 3: We can make it perform music information retrieval tasks.
•1 + 2 + 3: We can make it a very strong audio analyzer.

Tokens

Tokens
•A token is the minimal discrete, distinguishable unit of language in LLMs.
•One token ~= one concept
•A sentence == a sequence of tokens
•An LLM usually has a vocabulary of about 100k tokens
•Inside the LLM, a single token is represented by a single vector
•100k tokens x 4096-dim x 2 bytes/number ~= 410M parameters ~= 800 MB in memory/disk
•Long sequence → many tokens → more compute and memory, sometimes O(n**2)
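The memory arithmetic on the slide checks out; here is a quick sanity check:

```python
# Back-of-the-envelope check of the slide's token-embedding memory estimate.
vocab_size, dim, bytes_per_number = 100_000, 4096, 2   # ~100k tokens, fp16
params = vocab_size * dim                               # 409,600,000 ≈ 410M parameters
print(params, params * bytes_per_number / 1e6, "MB")    # ≈ 819 MB, i.e. roughly 800 MB
```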

Tokens
•Word-by-word tokenization is not great, because there are too many words.
•Say, we select the top 1M words:
•[Memory] Token embeddings alone take 4B parameters (8 GB), not even counting the language model itself.
•[Performance] It would still have out-of-vocabulary issues.
•[Modeling Efficiency] There would be too many similar words; the LLM would need to learn all of them separately.
•Character-wise tokenization is also not great.
•(+) Just 26 * 2 = 52 tokens to represent English!
•(-) Sequences become too long. (E.g., 4 times longer → 16 times more memory with O(n**2) attention)
•(-) The semantics of a single character are too vague, making it harder for the LM to learn.

Tokens
•Therefore, sub-word tokenization.
https://belladoreai.github.io/llama-tokenizer-js/example-demo/build/
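To see sub-word tokenization in action (assuming the Hugging Face transformers package and the public GPT-2 tokenizer; the linked demo does the same in the browser):

```python
# Quick sub-word tokenization demo (assumes `pip install transformers` and internet access).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(tok.tokenize("Imagine all the people living for today"))  # common words stay whole
print(tok.tokenize("TalkPlay tokenizes multimodality"))         # rarer words split into pieces
```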

Advanced: Deep dive into Tokenization
•Deep dive: https://www.fast.ai/posts/2025-10-16-karpathy-tokenizers#gpt-2-and-gpt-4-tokenizers

Tokens, anyway.
•LLMs break texts down into sub-word units called tokens.
•A token == a vector in the LLM
•Thus, a sentence == a sequence of vectors.
•LLMs don't read texts. They take a sequence of vectors!
https://jalammar.github.io/illustrated-transformer/

What to worry about with Tokens
•Everything:
•Model performance
•Memory efficiency
•Modeling efficiency
•Sub-word tokenizer: the best among what we have.
•~100k vocab: good enough for 1+ languages

Q&A
End of the first 30ish minutes

TalkPlay

LLM Recommendation
also known as Generative Recommendation

Actually, since early October.. (Spotify x ChatGPT)
•Spotify's latest ChatGPT integration is very similar to TalkPlay, and something that has been discussed in the Recsys research community in the last 1-2 years.
•So.. great!
•But let's get back to the motivation, and to do that, I need to pretend this feature doesn't exist.

Why is an LLM alone not enough?
•LLMs know so much music - artists, albums, musical concepts - and can understand our query, respond well, speak better English than me, etc., etc..
•But the LLM itself has no connection to an actual music catalog.
•How can we make that connection?

Connect LLM <> Item DB
•LLM == next token prediction machine
•We want the LLM to recommend items just like it predicts next tokens…
•(because that's how LLMs become so powerful at understanding, and we want them to be just as good at recommending music)
•A "token" (its vector) represents a (sub)word, a concept, an entity.
•So.. we need to represent songs using tokens. How?
•→ LET'S ADD MORE TOKENS THAT REPRESENT MUSIC ITEMS!!
•Then recommendations can be made just like generation:
•Generative Recommendation
[Figure: LLM ↔ Music Database]

Connect LLM <> Item DB
•One song == one token?
•Similar to word-as-a-token
•e.g., 100M different vectors for 100M songs — not good, because:
•Too much memory usage
•Too many vectors for the model to learn
•A better solution is..
(Scale: ~100M songs, ~1M artists; 5+ B videos on YouTube; 600M items on Amazon)

•Naively (like word-as-a-token):
•one item is represented by one token
•16 items in DB → 16 new tokens needed
•1M items in DB → 1M new tokens needed
•Efficiently (like sub-word tokens):
•one item is represented by two tokens
•16 items → 8 new tokens needed (sqrt(16) * two)
•1M items → 2000 new tokens needed (sqrt(1M) * two)
•(a counting sketch follows the tables below)

Naive, one token per item:
item 1 → token_1, item 2 → token_2, ..., item 16 → token_16

Efficient, two tokens per item:
item  1 → (token_1, token_5)    item  9 → (token_3, token_5)
item  2 → (token_1, token_6)    item 10 → (token_3, token_6)
item  3 → (token_1, token_7)    item 11 → (token_3, token_7)
item  4 → (token_1, token_8)    item 12 → (token_3, token_8)
item  5 → (token_2, token_5)    item 13 → (token_4, token_5)
item  6 → (token_2, token_6)    item 14 → (token_4, token_6)
item  7 → (token_2, token_7)    item 15 → (token_4, token_7)
item  8 → (token_2, token_8)    item 16 → (token_4, token_8)
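The counting argument as a tiny script (my illustration; `two_token_ids` is a hypothetical helper, and it assumes the item count is a perfect square):

```python
# Minimal sketch: N items get unique token pairs from only 2*sqrt(N) new tokens.
import math

def two_token_ids(num_items: int):
    side = math.isqrt(num_items)  # 4 for 16 items, 1000 for 1M items
    return {i: (f"token_{i // side + 1}", f"token_{side + i % side + 1}")
            for i in range(num_items)}

codes = two_token_ids(16)         # 16 items, using only 8 distinct tokens
print(codes[0], codes[5])         # ('token_1', 'token_5') ('token_2', 'token_6')
```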

The same idea with interpretable token names (vs. the naive scheme item 1 → item_1, ..., item 16 → item_16):
item  1 → (category_1, subcategory_1)    item  9 → (category_3, subcategory_1)
item  2 → (category_1, subcategory_2)    item 10 → (category_3, subcategory_2)
item  3 → (category_1, subcategory_3)    item 11 → (category_3, subcategory_3)
item  4 → (category_1, subcategory_4)    item 12 → (category_3, subcategory_4)
item  5 → (category_2, subcategory_1)    item 13 → (category_4, subcategory_1)
item  6 → (category_2, subcategory_2)    item 14 → (category_4, subcategory_2)
item  7 → (category_2, subcategory_3)    item 15 → (category_4, subcategory_3)
item  8 → (category_2, subcategory_4)    item 16 → (category_4, subcategory_4)
•Imagine: the double-token representation may mean something like a (category, subcategory) pair.

•Also known as "Semantic ID"
•Currently the de facto method for generative recommendation
https://arxiv.org/abs/2305.05065
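The cited paper learns these IDs with an RQ-VAE; as a simpler stand-in illustrating the same idea, here is a two-level residual k-means sketch (my assumption, not the paper's exact method):

```python
# Minimal sketch: Semantic-ID-style codes via two-level residual k-means (illustration only).
import numpy as np
from sklearn.cluster import KMeans

item_embs = np.random.randn(1000, 64)        # stand-in for pretrained item embeddings

level1 = KMeans(n_clusters=32, n_init=10, random_state=0).fit(item_embs)
residual = item_embs - level1.cluster_centers_[level1.labels_]
level2 = KMeans(n_clusters=32, n_init=10, random_state=0).fit(residual)

# Each item's "semantic ID" is its pair of cluster indices; similar items share prefixes.
semantic_ids = list(zip(level1.labels_, level2.labels_))
print(semantic_ids[0])                        # e.g. (7, 19)
```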

LLM recommender
Query: "Give me music like keunwoo.OOO's songs"
→ the recommender emits item tokens, e.g. a (category_*, subcategory_*) pair
[Same 16-item (category, subcategory) table as above]

•If everything went well:
•the "category_3" token indicates some musical quality similar to keunwoo.OOO
•the "subcategory_5" token also indicates something like that
•the LLM recommender knows the meanings of all the category_* and subcategory_* tokens, perhaps because it's trained on relevant recommendation data.

•Recall the efficient scheme (one item == two tokens; 16 items → 8 tokens; 1M items → 2000 tokens).
•BUT we can't randomly assign the tokens!!
•Effectively:
•Items 1, 2, 3, 4 should have something in common, since they all share "token_1"
•Items 5, 6, 7, 8 too.
•Items 9, 10, 11, 12 too; items 13, 14, 15, 16 too.
•AND..
•Items 1, 5, 9, 13 should have something in common, since they all share "token_5"

•We need to assign tokens to a lot of songs in the DB.
•By using 2 (or more!) tokens, we keep the token space manageable (efficiency).
•But the token assignment needs to make sense, semantically.
https://arxiv.org/abs/2502.13713

TalkPlay 1

TalkPlay 1
How to make the music tokenizer make sense?

•A song is represented by five tokens
•Each token dimension has its own dedicated meaning:
•1st token: indicates the playlist embedding of the song
•2nd token: indicates the tags embedding
•3rd token: the metadata embedding
•4th token: the lyrics embedding
•5th token: the audio signal embedding
(A sketch of this per-modality lookup follows.)
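A minimal sketch of how such a five-token representation could be produced (hypothetical helper and codebooks; the released TalkPlay code may differ): each modality embedding is snapped to its nearest codebook centroid, and the centroid index becomes the token.

```python
# Minimal sketch: one song → five modality tokens via nearest-centroid lookup (illustration only).
import numpy as np

def song_to_tokens(song_embs: dict, codebooks: dict) -> str:
    prefixes = {"playlist": "p", "tags": "t", "metadata": "m", "lyrics": "l", "audio": "a"}
    tokens = []
    for modality, prefix in prefixes.items():
        emb = song_embs[modality]                            # this song's embedding for the modality
        dists = np.linalg.norm(codebooks[modality] - emb, axis=1)
        tokens.append(f"<{prefix}-{int(dists.argmin())}>")   # index of the nearest centroid
    return "".join(tokens)                                   # e.g. "<p-13><t-1><m-11><l-143><a-94>"

rng = np.random.default_rng(0)
codebooks = {m: rng.normal(size=(256, 32)) for m in ["playlist", "tags", "metadata", "lyrics", "audio"]}
song = {m: rng.normal(size=32) for m in codebooks}
print(song_to_tokens(song, codebooks))
```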

Training TalkPlay
•Train an LLM with the multimodal music tokens replacing each music item
•one music item == 5 multimodal music tokens
•e.g. <p-13><t-1><m-11><l-143><a-94>
•The LLM learns the meanings of the multimodal music tokens, associated with the query and all the other contexts.
"What was the Michael Jackson song about peace?"
"I think you're looking for 'Heal the World'" <p-13><t-1><m-11><l-143><a-94>

TalkPlay recommends (predicts) a song
•In the music DB, each song is converted into 5 tokens based on its audio, metadata, lyrics, and other information.
•TalkPlay, given the query "play me something like ..", outputs the predicted multimodal music tokens of a song.

Evaluation of TalkPlay
•It works fairly well.
•It is difficult to evaluate because there's no real baseline model, but we did our best.

TalkPlayData 2

Motivation of TPD2
•TalkPlay models are trained on conversational music recommendation datasets
•with queries, music recommendations, maybe feedback, multi-turn chats, ..
•but…
•Such a dataset doesn't exist!

Previously: TalkPlay
Essentially, "TalkPlayData 1"
•For each playlist in MPD (the Million Playlist Dataset):
•ask an LLM to convert the playlist to a conversation,
•using the technique we introduced earlier (see our ISMIR 2023 paper).

Previously: TalkPlay
•What we did well:
•First LLM-based chat synthesis for music recommendation
•Realistic conversation
•Large scale
•Music sequence is grounded by playlists
•Limitations:
•Although it's multi-turn, the turn-by-turn coherence was weak
•Each conversation flow feels too simple
•Lack of a conversation goal; every conversation sounds too similar to the others
•Lack of personalization

Solution
•ISSUE: Without any input, an LLM just generates something around the "average":
•average music chat topic, average likes/dislikes, average listener type, average goal, ..
•It speaks like the same person, with the same musical taste, and the same goal for every music conversation.
•So, we have to condition the LLM very specifically.

TalkPlayData 2
•More realism: Four separate LLMs — each of them has a dedicated role
•Enhance/Provide a listener profile
•Generate a conversation goal
https://arxiv.org/abs/2509.09685

Listener Profile
•User demographics
•Musical preferences
https://talkpl.ai/p/talkplaydata2/

Conversation Goal

Examples:
https://talkpl.ai/p/talkplaydata2/

•A listening session (sequence of tracks + other information) is sampled.
•Profiling tracks + user info go into one LLM to generate a listener profile.
•The recommendation pool and the profile go into another LLM, which generates a conversation goal.

•Separation of the Listener LLM and the Recsys LLM:
•They are exposed to different subsets of information.
•The Listener wants to achieve the Conversation Goal.
•The Listener doesn't know exactly which songs are available.
•The Recsys wants to satisfy the Listener.
•The Recsys doesn't know much about the listener; it has to rely on the current chat.

Data Generation
•The Recsys LLM can see the audio, album art, lyrics, metadata, tags, etc., of the recommendation pool.
•The Listener LLM can also see the same information for each recommended track.
•Multiple calls to Gemini 2.5 Flash per chat:
•More calls, more cost
•More calls, more problems
•→ a pretty complicated generation pipeline (sketched below)
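A minimal sketch of the Listener/Recsys separation in the generation loop (the `call_llm` stub and all names here are hypothetical; the real pipeline, with the exact Gemini 2.5 Flash prompts, is in the GitHub repo):

```python
# Minimal sketch: two role-separated LLMs generating one synthetic chat (illustration only).
def call_llm(role, context):
    # Stub standing in for one Gemini 2.5 Flash call.
    if role == "listener":
        return "Play me something mellow for a rainy evening."
    return "How about this one?", {"track_id": "dummy_track"}

def generate_conversation(profile, goal, pool, max_turns=4):
    chat = []
    for _ in range(max_turns):
        # The Listener LLM sees its profile and goal, but NOT the recommendation pool.
        user_msg = call_llm("listener", {"profile": profile, "goal": goal, "chat": chat})
        chat.append({"role": "user", "text": user_msg})
        # The Recsys LLM sees the pool (audio, album art, lyrics, metadata, tags)
        # but not the profile; it must infer the listener's taste from the chat alone.
        rec_msg, track = call_llm("recsys", {"pool": pool, "chat": chat})
        chat.append({"role": "assistant", "text": rec_msg, "track": track})
    return chat

chat = generate_conversation({"age": "20s"}, "discover rainy-day music", [{"track_id": "dummy_track"}])
print(len(chat))
```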

Does the goal/profile help?
•Yes, they make the overall conversation much more diverse.

A lot more details in the paper
•Paper: on arXiv, "TalkPlayData 2: An Agentic Synthetic Data Pipeline for Multimodal Conversational Music Recommendation"
•Dataset: Hugging Face. 16,500 conversations in total: 15,500 in the training split and 1,000 in the test split.
•Generation code: GitHub — self-contained and runnable with a small dummy set, designed to expose the exact prompts and the end-to-end process.

TalkPlay-Tools [1 min]

https://talkpl.ai/p/talkplay_tools/
https://arxiv.org/abs/2510.01698

Future Works

Public Challenge
•TalkPlayData 2 has been adopted as the official dataset for the NLP4MUSA 2026 CMR Challenge.
•Please join!
https://sites.google.com/view/nlp4musa-2026

More Realistic Data
•TalkPlayData 1 and 2 are outcomes of our data struggles doing Recsys research in academia:
•little user data, only preview audio if lucky, no user interaction, ..

More adoption from LLM Research
•RAG, multimodal RAG
•Tools
•Reinforcement Learning:
•DeepSeek, OpenAI o1 and o3, Grok, Gemini — all used RL extensively
•Works best / easiest when the answer is verifiable (e.g., math and coding)
•Is music recommendation a (very quickly) verifiable problem? Hmm.. do you know?

More expansion in Recsys
•TalkPlay (embedding prediction; somewhat close to "similar song search")

→ TalkPlay-Tools (combine keyword search, attribute filtering, and the embedding-based approach into one system)

→ then, what?
•Obvious answer: personalization; podcasts / audiobooks / samples and stems
•A lot more work to be done, towards a single LLM-only recommendation system

What's stopping me?
•There's no data — really no data, to the point that I already used so much adjacent data to synthesize TalkPlayData 2. But we want real user interactions, real audio, larger and up-to-date!
•No one pays me for this, lol

What has encouraged me?
•Being an indie / academic researcher has some benefits!
•No conflicts of interest
•I can always publish my results
•I can follow my heart and do what I believe is important
•Now that I have re-re-started Diablo 2 Resurrected, things are highly uncertain..

[Career timeline figure once more, 2014 → today, topics tagged MIR / LLM / BOTH]
Keunwoo Choi
and the journey of TalkPlay
(with Seungheon Doh, the other TalkPlayer)

Q&A
thank you for listening! - Keunwoo