UNLOCKING THE POWER OF STATE SPACE LANGUAGE MODELS
Black mambas are majestic animals that are incredibly strong and intelligent. They are alert snakes with a keen sense of sight. - Africa Geographic
Recently, the Technology Innovation Institute (TII) released its new Falcon Mamba 7B large language model (LLM), built on a novel neural network architecture called “Mamba”. Over the past few years, transformer-based architectures, such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers), have dominated the field, achieving impressive results across a wide range of natural language processing tasks. Now a new player is emerging on the scene: state space language models (SSLMs). While transformers use self-attention mechanisms to process sequential data, Mamba borrows techniques from control theory and signal processing to model long-range dependencies in text (e.g., the clues at the beginning of a mystery novel must link to the culprit all the way at the end). It addresses a major limitation of transformers in handling long sequences of text, which could ultimately lead to more efficient, smarter models.
Traditional Transformer Architecture
Let’s back up a little and paint a picture of what’s in use today: the transformer, which powers ChatGPT, Claude, Gemini, and Meta’s Llama series. The transformer architecture, introduced in the seminal "Attention Is All You Need" paper, revolutionized natural language processing. Essentially, a transformer consists of two main parts: self-attention mechanisms and feed-forward neural networks. The self-attention mechanism allows the model to weigh the importance of every other word in a sequence when processing each individual word, so it can capture dependencies and complex relationships in the text. For the transformer, “attention” refers to this ability to connect all of the words in a sentence together. It is this attention that allows the neural network to represent and understand language; language would be much harder to learn if we had to process it one word at a time.
Transformers’ power comes from their ability to process input sequences in parallel, significantly speeding up training and inference compared to earlier sequential architectures like recurrent neural networks (RNNs). The self-attention mechanism enables this parallelization by considering all positions simultaneously rather than step by step, and it is also how transformers attend to long-range connections in text. With billions, or even trillions, of parameters, transformer-based neural networks can pull off some impressive feats, like coding, creative writing, and potentially real reasoning.
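To make that concrete, here is a minimal sketch of scaled dot-product self-attention in plain NumPy (toy dimensions and random weights, purely for illustration, not any production implementation): every token’s query is compared against every other token’s key in one batch of matrix multiplications, which is exactly what makes processing the whole sequence in parallel possible:

import numpy as np

def self_attention(x, Wq, Wk, Wv):
    # Naive scaled dot-product self-attention over a whole sequence.
    # x: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices.
    q = x @ Wq                                   # queries
    k = x @ Wk                                   # keys
    v = x @ Wv                                   # values
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (seq_len, seq_len): every token scored against every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                           # each output position mixes information from all positions

# Toy usage: 8 tokens with 16-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, Wq, Wk, Wv)              # shape (8, 16)

Notice the (seq_len, seq_len) score matrix in the middle: every position attends to every other position at once, which is both the source of the transformer’s power and, as we’ll see next, its main weakness.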
State Space Language Models (SSLMs)
The underlying architecture of transformers has one major drawback: quadratic computational complexity with respect to sequence length (I know that’s a mouthful). Basically, as the input grows (say from one word, to one sentence, to one paragraph, to one chapter…), the number of computations grows with the square of its length, demanding more memory, more computation, and more energy. So doubling the input from 10 words to 20 words doesn’t require twice the processing; it requires four times the processing. Going from 10 words to 100 requires 100x the processing (10^2), not 10x! This limitation means we have to truncate our inputs or use workarounds to handle longer contexts, and it means the energy required to process long contexts grows even faster than the contexts themselves, potentially putting a cap on how far we can scale transformers.
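A quick back-of-the-envelope sketch of that scaling (just arithmetic, not a benchmark):

# The attention score matrix has seq_len x seq_len entries,
# so compute and memory for the attention step grow quadratically.
for seq_len in [10, 20, 100, 1_000, 100_000]:
    print(f"{seq_len:>7} tokens -> {seq_len ** 2:>14,} pairwise attention scores")
# 10 -> 100, 20 -> 400 (2x the tokens, 4x the work), 100 -> 10,000 (10x the tokens, 100x the work)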
Jumping in to address this very need come State Space Language Models (SSLMs)! They treat sequential data as a continuous-time signal and model the dynamics of that signal using linear state space models. They can compress the relevant information from a sequence into a compact state, similar to how MP3 compression represents a sound file in far less space without much audible loss in quality. This approach allows SSLMs to capture both short-term (word-to-word) and long-term (chapter-to-chapter) dependencies more efficiently.
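For a flavor of what modeling a sequence with a linear state space looks like, here is a toy sketch (hypothetical dimensions and random matrices, not any particular published model): a hidden state is carried forward one step at a time, folding in each new input and reading the output off the state, so memory stays constant and compute grows only linearly with sequence length:

import numpy as np

def linear_ssm(x, A, B, C):
    # Toy discretized linear state space model: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    # x: (seq_len, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state).
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # constant memory per step, linear in sequence length
        h = A @ h + B @ x_t       # fold the new input into the running state
        ys.append(C @ h)          # read the output off the compressed state
    return np.stack(ys)

rng = np.random.default_rng(0)
seq_len, d_in, d_state, d_out = 12, 4, 8, 4
A = 0.9 * np.eye(d_state)                      # a stable, slowly decaying state transition
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state))
y = linear_ssm(rng.normal(size=(seq_len, d_in)), A, B, C)   # shape (12, 4)

The state h is the “MP3” here: a fixed-size summary of everything seen so far, no matter how long the sequence gets.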
Mamba is a particular implementation of SSLMs, developed by researchers at Carnegie Mellon University and Princeton University, that processes sequences using a state space model whose parameters depend on the input, allowing it to dynamically select which information to retain in its state. By selectively updating that state, Mamba can focus on the most relevant information at each step of sequence processing, enhancing its ability to capture important long-range dependencies while ignoring irrelevant details, like the candlestick that Colonel Mustard used in the library. Mamba’s computations are designed specifically to take advantage of modern GPU architectures for maximum parallelism and memory efficiency, which improves training and inference speed and reduces energy usage. Some Mamba-based models also mix in components that have proven effective in transformers, such as attention layers in hybrid designs. This combined approach has the potential to exceed the capabilities of transformers, pound for pound.
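To illustrate the “selective” idea, here is a toy sketch, loosely in the spirit of Mamba’s selection mechanism rather than its actual parameterization or fused GPU kernels (the gate matrix W_gate below is a made-up stand-in): a value computed from the current token decides how much of the old state to keep and how strongly to write in the new input, so irrelevant tokens can be mostly ignored:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def selective_ssm(x, A, B, C, W_gate):
    # Toy input-dependent (selective) state update, a rough illustration only:
    # a per-token gate controls how much past state is kept vs. how much new input is written.
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        g = sigmoid(W_gate @ x_t)                # gate depends on the current token
        h = (1 - g) * (A @ h) + g * (B @ x_t)    # selectively retain state or absorb the input
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(1)
seq_len, d_in, d_state, d_out = 12, 4, 8, 4
A = 0.95 * np.eye(d_state)
B = rng.normal(size=(d_state, d_in)) * 0.1
C = rng.normal(size=(d_out, d_state))
W_gate = rng.normal(size=(d_state, d_in))
y = selective_ssm(rng.normal(size=(seq_len, d_in)), A, B, C, W_gate)   # shape (12, 4)

The real model computes its state space parameters from the input and runs the whole update as a hardware-aware scan on the GPU, but the gist is the same: the update rule itself changes depending on what the model is currently reading.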
Benefits of Mamba over Transformers
If Mamba improves on transformers the way the theory suggests, we should see better efficiency when processing long sequences, which is particularly valuable for tasks like summarization, long-form content creation, analyzing lengthy code repositories, or understanding entire books. Mamba can maintain coherence and consistency across these longer sequences, potentially leading to higher-quality outputs. The efficiency also translates to faster training and inference times for long sequences, which will make the models even easier to work with. A Mamba model of similar size could potentially process hundreds of thousands of tokens with the same memory and energy budget. A larger context window helps the model maintain coherence over long passages, which could in turn lead to more sophisticated reasoning capabilities.
Size does matter. In tasks like document summarization, a large context allows the model to maintain coherence and factual consistency over longer passages. It enables the model to refer back to information mentioned much earlier in the conversation, which matters when you end up chatting for a long time. If your Python (no relation) code is long and complex, you need your AI assistant to be able to consider the whole script, not just some parts at a time. In legal document analysis, a larger context allows a model to connect related clauses separated by many paragraphs, potentially surfacing better connections. In creative writing, it helps maintain plot consistency and character development across the entire story, which is key to a good story. In scientific literature analysis, a large context allows a more comprehensive understanding by connecting methods and results spread across lengthy papers. As research progresses, we may see Mamba-based models that can maintain coherent understanding and generation over unprecedented lengths of text, potentially approaching our own ability to comprehend and reason over large volumes of information.
Challenges and Future Directions
SSLMs like Mamba could be the next evolution of LLMs, finally moving beyond the transformer. But the road ahead is difficult. As the new kid on the block, there just isn’t much known about their performance and scalability yet. Transformers have benefited from years of intensive research and optimization to get to where they are now; the original transformer paper came out in 2017, after all. SSLMs are still in the early stages and need an ecosystem of tools, libraries, and models before researchers and developers can kick the tires and understand their strengths and weaknesses. And while Mamba may exceed transformers’ capabilities in long-context processing, transformers may still outperform SSLMs on other tasks. The sequential nature of SSLM processing, while efficient, may make it harder to capture some non-sequential relationships that transformers tend to handle well. We are also still figuring out how best to train these new models, with the Falcon release being one of the few attempts so far.
Researchers are always on the lookout for the neural network architecture that will replace the transformer and deliver even larger gains in performance. Maybe it will be SSLMs like Mamba, or maybe not. Maybe it will be hybrid architectures, like Jamba, that combine the strengths of SSLMs with the tried and tested transformer. There will be particular interest in optimizing the training process for SSLMs and potentially developing specialized hardware to accelerate processing even further. Multi-modal SSLMs that can process text, images, audio, and even video could all be on the horizon. Any architecture that replaces the transformer must offer significant improvements in capabilities, require less energy for the same level of competence, or, hopefully, both. We will also have to see how Mamba scales with size, as current models tend to be small, at 7 billion parameters or fewer.
Conclusion
Mamba is a potential successor to the transformer, offering several important benefits. Its ability to process long sequences with linear time and memory complexity, along with the potential for much larger context sizes, means Mamba could be a powerful alternative for tasks that require large contexts. The efficiency gained in handling long-range dependencies not only improves performance on various language tasks but also opens up potential new applications. While transformers will likely remain relevant for many applications in the future, the unique advantages of SSLMs could move us toward more efficient, context-aware models that require less energy to run. As SSLMs attract more research and development interest, there is promise of further improvements in performance, scalability, and applicability. As they mature and the surrounding ecosystems develop, a new generation of language models may emerge, one that pushes the boundaries of what language models can do, ultimately leading to more human-like language understanding.