Mamba: A new era or an ephemeral trend?
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
What Transformers cannot do
Drawbacks of Transformers: a finite context window, and quadratic scaling with respect to the window length
O(L^2)
Generating tokens for a sequence of length L needs roughly L² computations, which becomes costly as the sequence length grows.
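To make the quadratic cost concrete, here is a minimal sketch (not from the original slides; all sizes are hypothetical): the self-attention score matrix alone has L × L entries.

```python
import numpy as np

L, d = 1024, 64                    # hypothetical sequence length and head dimension
Q = np.random.randn(L, d)          # queries
K = np.random.randn(L, d)          # keys

scores = Q @ K.T                   # shape (L, L): L^2 entries to compute and store
print(scores.shape)                # (1024, 1024)
# Doubling L to 2048 quadruples the number of score entries: 2048^2 = 4 * 1024^2.
```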
RNN: recurrent neural network
Fast inference, linear complexity
Drawbacks
- small memory: all past context is squeezed into a fixed-size hidden state
- not parallelizable: each step depends on the previous one (see the sketch below)
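A minimal RNN step illustrating both drawbacks (the shapes and weights below are hypothetical): the fixed-size hidden state must hold all past context, and the update loop is strictly sequential.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    # One recurrent update: the new state depends on the previous one.
    return np.tanh(W @ h_prev + U @ x_t + b)

d_h, d_x, L = 16, 8, 100                 # hypothetical sizes
W = 0.1 * np.random.randn(d_h, d_h)
U = 0.1 * np.random.randn(d_h, d_x)
b = np.zeros(d_h)

h = np.zeros(d_h)                        # all past context lives in this fixed-size vector
for x_t in np.random.randn(L, d_x):      # strictly sequential: step t needs step t-1
    h = rnn_step(h, x_t, W, U, b)
```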
SSM: state space model
Input x, output y, hidden state h
Looks familiar?
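The discrete-time SSM recurrence is h_k = Ā h_{k-1} + B̄ x_k, y_k = C h_k. A minimal sketch (the matrices below are placeholders, not a real discretization) shows why it looks familiar: structurally it is the RNN loop without the tanh().

```python
import numpy as np

N, L = 4, 100                        # hypothetical state size and sequence length
A_bar = 0.9 * np.eye(N)              # placeholder discretized state matrix
B_bar = 0.1 * np.random.randn(N)     # placeholder discretized input matrix
C = np.random.randn(N)               # output matrix
x = np.random.randn(L)               # scalar input channel for simplicity

h = np.zeros(N)
y = np.empty(L)
for k in range(L):
    h = A_bar @ h + B_bar * x[k]     # linear state update: no tanh()
    y[k] = C @ h
```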
Mamba: how to solve the small-memory problem
Matrix A is responsible for updating the hidden state h; how should memory be updated as the sequence continues?
HiPPO: High-order Polynomial Projection Operators, a structured choice of A that keeps a compressed summary of the input history in the hidden state (see the sketch below).
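A sketch of the HiPPO-LegS matrix as constructed in the S4 line of work; treat the exact entries as an assumption rather than a quote from the slides.

```python
import numpy as np

def make_hippo(N):
    # HiPPO-LegS matrix: A[n, k] = -sqrt(2n+1)*sqrt(2k+1) for n > k,
    # -(n+1) on the diagonal, and 0 above the diagonal.
    p = np.sqrt(1 + 2 * np.arange(N))          # sqrt(2n + 1)
    A = p[:, None] * p[None, :]                # outer product
    A = np.tril(A) - np.diag(np.arange(N))     # lower triangle; diagonal becomes n + 1
    return -A

print(make_hippo(4))
```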
SSM: parallel training
What hinders the RNN is its tanh() nonlinearity, which forces training to proceed step by step.
By definition, an SSM is like an RNN without the tanh().
SSM: parallel training
In a way, it’s like a convolution operation: because the recurrence is linear, the whole output can be computed with a precomputed kernel (see the sketch below).
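A sketch of the two equivalent views, using the same placeholder matrices as the earlier SSM sketch: the linear recurrence unrolled as a causal convolution with kernel K = (C·B̄, C·Ā B̄, C·Ā² B̄, …).

```python
import numpy as np

N, L = 4, 100
A_bar, B_bar = 0.9 * np.eye(N), 0.1 * np.random.randn(N)
C, x = np.random.randn(N), np.random.randn(L)

# Recurrent view (inference): one step at a time
h, y_rec = np.zeros(N), np.empty(L)
for k in range(L):
    h = A_bar @ h + B_bar * x[k]
    y_rec[k] = C @ h

# Convolutional view (training): precompute K[i] = C @ A_bar^i @ B_bar, then convolve
K, v = np.empty(L), B_bar.copy()
for i in range(L):
    K[i] = C @ v
    v = A_bar @ v
y_conv = np.convolve(x, K)[:L]           # causal convolution over the whole sequence

print(np.allclose(y_rec, y_conv))        # True: both views produce the same output
```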
Mamba: selective SSM
No content-awareness: with input-independent A, B, C, the model processes every token the same way, which causes problems on tasks that require content-awareness.
In comparison, these tasks are relatively easy for Transformers, since they dynamically change their attention based on the input sequence: they can selectively “look at” or “attend to” different parts of the sequence.
Mamba: selective SSM
Mamba makes the matrices B, C, and the step size Δ depend on the input, similar in spirit to attention in Transformers (see the sketch after this slide).
This raises another issue -> the model can no longer be trained as a convolution like S4.
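A minimal sketch of the input-dependent (“selective”) parameterization described above; the projection shapes and the softplus for Δ follow the paper’s general description, but the exact details here are illustrative assumptions.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

d_model, N, L = 8, 4, 100                  # hypothetical model width, state size, length
W_B = 0.1 * np.random.randn(d_model, N)    # illustrative projection weights
W_C = 0.1 * np.random.randn(d_model, N)
W_delta = 0.1 * np.random.randn(d_model, 1)

x = np.random.randn(L, d_model)
B = x @ W_B                                # (L, N): a different B for every position
C = x @ W_C                                # (L, N): a different C for every position
delta = softplus(x @ W_delta)              # (L, 1): positive, input-dependent step size
# With B, C, delta varying per position, the fixed convolution kernel of S4 no
# longer exists, so training falls back on a parallel scan (next slide).
```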
parallel scan?
As long as the operation satisfies the associative property, it can be parallelized!
t is the index of the parallel thread
Feels like dynamic programming (DP)
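A sketch of why the scan works (scalar/diagonal case; this is the standard combine rule for linear recurrences, not code from the paper): the pairwise combine below is associative, so the prefix results can be computed in O(log L) parallel steps over a balanced tree, much like DP.

```python
import numpy as np

def combine(left, right):
    # Compose two affine updates h -> a*h + b; this operation is associative.
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def scan(pairs):
    # Sequential reference scan; a parallel version applies `combine` in a
    # balanced tree, which is valid precisely because combine is associative.
    out = [pairs[0]]
    for p in pairs[1:]:
        out.append(combine(out[-1], p))
    return out

L = 8
a = np.random.rand(L)                      # per-step decay (input-dependent in Mamba)
b = np.random.randn(L)                     # per-step input contribution
h_scan = [h for (_, h) in scan(list(zip(a, b)))]

# Check against the plain sequential recurrence h_t = a_t * h_{t-1} + b_t
h, h_ref = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    h_ref.append(h)
print(np.allclose(h_scan, h_ref))          # True
```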
Hardware-aware algorithm
Skip
Results and thoughts
- Useful
- Can it replace Transformers?