The pursuit of artificial intelligence systems capable of processing sequences with effectively infinite context has been a longstanding goal in the field. Early models, such as recurrent neural networks (RNNs), were theoretically designed to handle sequences of arbitrary length. However, in practice, their ability to utilize information across more than approximately 50 timesteps was limited due to issues like vanishing and exploding gradients.
Recent advancements have significantly extended the context windows of AI models. For instance, Google’s Gemini 1.5 Pro features a context window of up to 1 million tokens, enabling the processing of extensive data such as entire codebases or lengthy documents. Similarly, Magic AI’s LTM-2-mini model boasts a 100 million-token context window, equivalent to processing around 10 million lines of code or 750 novels. These developments represent impressive engineering feats that broaden the applicability of generative AI across various domains.
Despite these improvements, truly scalable in-context learning, in which AI systems improve through accumulated experience and use in-context learning to train themselves, has not yet been achieved, and any successful design must satisfy specific engineering requirements. This article outlines a set of essential criteria that an effective, scalable infinite-context model should fulfill to advance toward this objective. This is not an exhaustive list; ongoing empirical research will continue to reveal additional requirements for realizing the full potential of in-context learning at scale.
Infinite Context: Enabling In-Context Learning at Scale
The promise of infinite context lies in its ability to process and learn from data over sequences of unprecedented length, unlocking new capabilities for AI systems. Models capable of very long context learning can operate on large, complex inputs such as entire repositories, datasets, tax codes, or medical documentation. By eliminating the reliance on retrieval systems, which often face limitations in scope and accuracy, these models enable seamless comprehension of intricate relationships within expansive data. For example, a tax code analysis system with infinite context could trace dependencies across thousands of pages without losing coherence, making it a game-changer in automation and decision-making.
Infinite context models also open the door to learning as a direct result of in-context interactions. Rather than relying solely on traditional gradient-based training, these systems can adapt dynamically to new data, learning efficiently from raw sequences. They could extract meaningful knowledge from modalities like video or spatial data, even if these inputs were not explicitly part of the training distribution. This flexibility allows the models to expand their understanding into domains beyond their original design, laying the groundwork for AI systems that improve with experience. Moreover, the ability to learn from and integrate episodic context could enable these models to identify gaps in their knowledge, assess the trustworthiness of new information, and optimize their operations in real-time.
Infinite context is particularly transformative for high-bandwidth modalities. Current systems often rely on sliding window techniques that cause them to forget earlier portions of a sequence. Infinite context eliminates this limitation, allowing AI to process long video streams or other media while retaining critical early details. Imagine a model preconditioned on an entire cinematic sequence, using its understanding to generate more realistic visual effects or assess complex physical interactions. Similarly, persistent agents—powered by infinite context—could build long-term relationships with users or master complex automation tasks by retaining relevant context from prior interactions.
An infinite context model can be defined as one that consistently operates over an infinite time horizon while encoding all useful information from an unbounded operational period. Consistent operation requires that the computational and algorithmic cost per input remains constant, ensuring scalability and stability regardless of sequence length. Any increase in overhead or instability introduced by sequence length undermines the model’s ability to operate effectively in long-term scenarios. The second component, encoding useful information over an unbounded period, ensures that the model can filter, aggregate, and compress its history into meaningful representations. This capability allows the model to robustly learn new skills, retain relevant knowledge, and build upon prior understanding, ensuring that it can dynamically adapt to evolving inputs. Together, these components create a foundation for scalable in-context learning that is robust, efficient, and adaptable, paving the way for AI systems that can truly learn from their experiences.
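To make the two components concrete, here is a minimal interface sketch (the class, shapes, and update rule are illustrative assumptions, not a reference to any published architecture): the state is allocated once at a fixed size, and each step touches only that state, so per-input cost never depends on how much history has already been consumed.

```python
import numpy as np

class InfiniteContextSketch:
    """Interface sketch: a fixed-size state plus a constant-cost step function."""

    def __init__(self, d_model: int, d_state: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.normal(scale=d_model ** -0.5, size=(d_state, d_model))
        self.W_out = rng.normal(scale=d_state ** -0.5, size=(d_model, d_state))
        self.state = np.zeros(d_state)       # memory footprint fixed for the whole horizon

    def step(self, x: np.ndarray) -> np.ndarray:
        """Consume one input; compute and memory are independent of elapsed time."""
        # The additive rule below is only a placeholder: the sections that follow
        # spell out what a useful update must actually satisfy.
        self.state = self.state + np.tanh(self.W_in @ x)
        return self.W_out @ self.state
```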
Requirements for Infinite-Context Models at Scale
Designing models capable of approximating infinite context learning requires addressing several key engineering challenges, each rooted in the definition of infinite context established earlier. First, these models must enable efficient operation over an infinite horizon, ensuring that computational and memory requirements remain constant regardless of sequence length. Second, they must exhibit consistent memory dynamics, allowing information to be preserved and updated over indefinite time scales without degradation or instability. Third, the models must possess the capacity to memorize and encode large volumes of information, which means having a very large memory size while selectively filtering and compressing their history to retain what is relevant while discarding redundancies. Finally, they must overcome the bottlenecks inherent in recurrent models by developing the ability to scale parallelism, leveraging modern hardware to process sequences efficiently both within a single sequence and across distributed systems. The following subsections will explore these requirements in detail, outlining the principles that underpin scalable infinite context learning.
Efficient Operation Over an Infinite Horizon
Efficient operation over an infinite horizon is a foundational requirement for models approximating infinite context. At its core, this principle ensures that a model can process sequences of any length without encountering scaling issues that make such operations infeasible. Scalability here applies both to the computational cost of processing each additional input and to the memory resources required to maintain the model’s state. Without this efficiency, the notion of infinite context remains theoretical, as practical limitations would prevent its application to real-world problems.
To achieve compute efficiency, the processing cost per input must remain constant, regardless of the sequence’s length. Models with computational requirements that scale quadratically or higher—such as traditional attention mechanisms—are inherently unsuitable for infinite sequences. These approaches, while effective for bounded contexts, become prohibitively expensive as sequence lengths grow. Efficient infinite context models must operate at a consistent per-step cost, allowing them to handle unbounded sequences without overwhelming available compute resources.
Memory efficiency is equally critical. The model’s memory footprint must remain constant per sequence, ensuring that its storage requirements do not scale with the length of the input. This constraint ensures the model can maintain its state indefinitely without exhausting available memory. Recurrent state mechanisms, which compactly represent the entire history of the sequence in a fixed-size state vector, are an elegant solution to this challenge. By updating this state dynamically as new inputs are processed, recurrent models meet the memory efficiency criteria while preserving the ability to encode relevant information.
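As a back-of-the-envelope comparison (the layer width and state size below are assumptions for a generic decoder layer, not measurements of any model), per-token compute and memory for cached self-attention grow with position, while a fixed-size recurrent state keeps both flat:

```python
# Rough per-layer, per-token costs; constants are illustrative only.
D_MODEL, D_STATE = 4096, 4096

def attention_step(t: int) -> tuple[int, int]:
    # Cached attention compares one query against t cached keys and mixes t values,
    # and must keep the full key/value cache resident in memory.
    flops = 2 * t * D_MODEL
    memory = 2 * t * D_MODEL          # keys + values cached so far
    return flops, memory

def recurrent_step(t: int) -> tuple[int, int]:
    # A recurrent update reads and writes only the fixed-size state.
    flops = D_STATE * D_STATE
    memory = D_STATE                  # unchanged no matter how long the sequence is
    return flops, memory

for t in (1_000, 1_000_000, 1_000_000_000):
    print(t, attention_step(t), recurrent_step(t))
```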
Recurrent models naturally satisfy the requirement for efficient operation, as their design inherently processes one input at a time with constant computational and memory requirements. However, this efficiency comes at a cost: the temporal bottleneck. Since recurrent operations must process each timestep sequentially, they fail to leverage the parallelism of modern hardware, limiting their throughput compared to other architectures. Overcoming this bottleneck is a key challenge in designing scalable infinite context models.
In contrast, attention mechanisms fundamentally struggle to meet the efficiency requirements for infinite context. While flexible in capturing dependencies, they incur quadratic computational and memory costs, making them impractical for truly infinite sequences. These limitations highlight the necessity of developing new architectural approaches that combine the efficiency of recurrent models with the parallelism of modern frameworks.
Consistent Memory Dynamics Over an Infinite Horizon
For a model to operate effectively over an infinite horizon, its memory dynamics must remain consistent regardless of the sequence’s length or the accumulation of information. This means ensuring that the model neither forgets critical information nor becomes overwhelmed by the sheer volume of accumulated data. Consistent memory dynamics are crucial for enabling the robust learning, retention, and integration of information, laying the groundwork for scalable in-context learning.
A fundamental requirement for consistent memory dynamics is the ability to preserve information indefinitely unless it is explicitly overwritten by the model’s update dynamics. Many techniques for maintaining numerical stability include a decay mechanism, under which older information becomes less accessible or less impactful over time. This decay is particularly problematic for long-term tasks, as it limits the model’s ability to leverage earlier context effectively. An ideal memory mechanism must ensure that information persists unless it is explicitly deemed irrelevant.
One clear way to identify a decaying state is the inclusion of a decay factor in the state update equation:

$$S_t = \alpha_t \odot S_{t-1} + \Delta S_t$$

If $\alpha_t$ is not a function of the state $S_{t-1}$, then the information held within $S_{t-1}$ is decayed naively at every timestep.
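To make the distinction concrete, here is a small sketch in numpy (the gating functions and shapes are illustrative assumptions, following the notation above): a decay factor fixed in advance, or computed from the input alone, shrinks old content unconditionally, whereas a gate that also reads the current state can choose to preserve it.

```python
import numpy as np

def naive_decay_update(state, x, W_u, base_decay=0.99):
    """alpha_t is a constant (or a function of x only), so old content fades regardless."""
    alpha = base_decay                          # not a function of the state
    delta = np.tanh(W_u @ x)                    # new information from the input
    return alpha * state + delta

def state_aware_update(state, x, W_g, W_u):
    """alpha_t reads both the state and the input, so preservation is a learned choice."""
    gate_in = W_g @ np.concatenate([state, x])
    alpha = 1.0 / (1.0 + np.exp(-gate_in))      # sigmoid gate, elementwise over the state
    delta = np.tanh(W_u @ x)
    return alpha * state + (1.0 - alpha) * delta
```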
Equally important is the ability to integrate new information without saturation. Saturation in memory-based models occurs when the model’s state ceases to meaningfully update due to ineffective integration of new information. This happens when the state update equation fails to account for prior state information, causing new updates to diminish over time. Consider a generic state update:

$$S_t = S_{t-1} + \Delta S_t$$

If $\Delta S_t$ becomes independent of $S_{t-1}$, such as when it is computed without reference to the existing state, the model risks saturating. Over time, the state accumulates information, resulting in diminishing updates:

$$\frac{\lVert \Delta S_t \rVert}{\lVert S_t \rVert} \;\longrightarrow\; 0 \quad \text{as } t \to \infty$$
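The effect is easy to see numerically (a toy simulation of the additive update above; the dimensionality and input distribution are arbitrary): with state-independent updates, the relative change of the state shrinks toward zero as history accumulates.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 256
state = np.zeros(d_state)

checkpoints = {10, 1_000, 100_000}
for t in range(1, 100_001):
    delta = rng.normal(size=d_state)          # dS_t computed without reference to S_{t-1}
    state = state + delta                     # S_t = S_{t-1} + dS_t
    if t in checkpoints:
        rel_change = np.linalg.norm(delta) / np.linalg.norm(state)
        print(f"t={t:>7}  relative update size = {rel_change:.4f}")
# The printed ratio falls roughly like 1/sqrt(t): new inputs barely move the state.
```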
Infinite context models operate over an unbounded temporal landscape, potentially without an explicit ending or even a start. For effective operation, these models must eliminate the need for fixed temporal dependencies such as positional embeddings. Many existing architectures rely on positional embeddings to organize sequence information; notably, they serve as a temporal encoding for attention mechanisms, which lack an explicit measure of order. While these embeddings (e.g., sinusoidal or rotary) are useful for contexts that appear during gradient descent training, they struggle to extrapolate to sequences longer than those seen during training. Fixed dependencies introduce a structural bias that limits the model’s adaptability to unbounded sequences. To overcome this limitation, infinite context models must adopt dynamic, data-dependent embeddings, or use sequence processing algorithms that are inherently sequential. Data-dependent embeddings could include hybrid mechanisms that combine sliding-context attention, convolutional layers, and explicit embeddings that approximate the behavior of rotary embeddings.
Eliminating fixed temporal dependencies does not require discarding all forms of positional embeddings. For certain tasks, local reference frames, such as those found in image subsequences, remain valuable. Establishing embeddings tied to offsets within these local frames can provide additional structure and improve adaptability without sacrificing scalability. By intelligently blending dynamic and localized embeddings, models can retain the benefits of positional encoding while extending their applicability to infinite contexts.
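A rough sketch of the distinction (the rotary-style angle computation is standard, but the frame boundary and stream offset below are hypothetical, purely for illustration): positions measured from the start of a local frame stay within a range the model has seen during training, no matter how long the overall stream has run.

```python
import numpy as np

def rotary_angles(offsets: np.ndarray, dim: int, base: float = 10_000.0) -> np.ndarray:
    """Rotary-style angles computed from positions; the offsets decide what 'position' means."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(offsets, inv_freq)                    # (seq, dim / 2)

# Absolute positions keep growing with the stream and eventually leave the trained range...
absolute = np.arange(5_000_000, 5_000_008)
# ...whereas offsets within a local frame (here: a hypothetical 8-token image patch row)
# stay inside a small, familiar range no matter how long the stream has run.
frame_start = 5_000_000
local = absolute - frame_start

print(rotary_angles(absolute, dim=64)[:2, :4])
print(rotary_angles(local, dim=64)[:2, :4])
```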
Very Large Capacity State
Infinite context models must operate with an exceptionally large capacity state for encoding information. This capacity underpins their ability to memorize, retrieve, and build upon knowledge accumulated over time. At its core, this requirement involves managing a very large recurrent state, even if the full state is not materialized during computation. The scale of this state determines the model’s ability to encode nuanced patterns, retain relevant details, and adaptively integrate new information without overwriting prior knowledge.
Large language models (LLMs) provide an instructive benchmark for understanding the necessary scale of such capacity. These models, with parameter counts ranging from billions to trillions, have demonstrated deep comprehension of their pre-training datasets, effectively memorizing significant portions of the data they were trained on. If infinite context models are to achieve similar feats through experience and interaction, their state parameter count may need to approach or even exceed the scale of LLM parameters.
The capacity of a vector can be quantified as the number of orthogonal vectors that can be stored independently and accessed via dot product. If the query vectors are unconstrained, this capacity is equal to the dimensionality of the vector itself. Translating this principle to infinite context models implies that the dimensionality of the model’s recurrent state determines how much independent information it can encode. While increasing dimensionality aligns with the need for higher capacity, doing so introduces significant computational challenges. Clever strategies are required to maintain efficiency while scaling state size to meet these demands.
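This can be checked with a toy associative memory (an illustrative construction, summing key-value outer products into a fixed matrix; not tied to any specific architecture): d mutually orthogonal keys fit into a d-by-d state and are recovered exactly by a dot product, while any additional key must interfere with what is already stored.

```python
import numpy as np

d = 64
keys = np.eye(d)                         # d mutually orthogonal keys
values = np.random.default_rng(0).normal(size=(d, d))

# Write: accumulate key-value outer products into a fixed-size state matrix.
state = np.zeros((d, d))
for k, v in zip(keys, values):
    state += np.outer(k, v)

# Read: a dot product against each key recovers the stored values exactly.
recovered = keys @ state
print(np.allclose(recovered, values))    # True: capacity equals dimensionality

# A (d+1)-th key cannot be orthogonal to the first d, so writing it collides
# with existing content and retrieval of earlier values degrades.
```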
One useful metric for understanding the required scale is the state expansion ratio, introduced in Zhang et al. (2024) in “Gated Slot Attention for Efficient Linear-Time Sequence Modeling.” The state expansion ratio is defined as the state parameter count of a layer divided by the hidden size used for its input. For example, a layer with a head key size of 128 and a value vector equal to the hidden size divided by the number of heads would have a state expansion ratio of 128×. Models designed for infinite context are likely to exhibit very large state expansion ratios—on the scale of 100,000×, 1,000,000×, or even 1,000,000,000×.
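The arithmetic behind the 128x example, and what the larger targets imply about raw state size, looks roughly like this (the hidden size and head count are illustrative assumptions):

```python
# State expansion ratio = state parameters per layer / hidden size of the layer's input.
hidden_size = 4096
num_heads = 32
head_key_size = 128
head_value_size = hidden_size // num_heads                    # 128: hidden size / heads

# Linear-attention-style state: one (key_size x value_size) matrix per head.
state_params = num_heads * head_key_size * head_value_size    # 32 * 128 * 128 = 524,288
expansion_ratio = state_params // hidden_size                 # 524,288 / 4,096 = 128

# At the ratios discussed above, the per-layer state alone becomes very large:
for target in (100_000, 1_000_000, 1_000_000_000):
    print(f"{target:>13,}x  ->  {target * hidden_size:,} state parameters per layer")
```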
Achieving such massive state expansion ratios requires innovative techniques to represent and process large states efficiently. Sparse representations, kernel tricks, delayed state materialization, distributed memory architectures, quantization, and dynamic allocation mechanisms may play a critical role in enabling these models to operate with both scalability and flexibility.
Scalable Parallelism
For infinite context models to fulfill their potential as a transformative training and inference technique, they must scale to match the computational feats achieved by current gradient descent methods. Today’s large-scale training runs, mobilizing entire datacenters of GPUs, process trillions of tokens in just days. To enable in-context learning as a viable alternative or complement to gradient-based training, these models must achieve similar throughput while maintaining their efficiency and adaptability. Achieving this level of scalability necessitates addressing two significant barriers to parallelism in recurrent models.
Recurrent models, identified as a promising class of architectures for infinite context learning, naturally lend themselves to constant compute and memory requirements per input. However, their inherently sequential nature makes it difficult to parallelize computation across the time dimension. Scalable infinite context models must be able to take advantage of asynchronous hardware and be parallelizable across the time dimension. Recent developments in chunk-wise parallel computation, such as those applied to linear transformers, offer a potential solution. By dividing sequences into chunks, these methods sacrifice linear-time execution within each chunk to enable parallel processing across multiple timesteps, leading to over 100x speedups in some cases.
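A minimal numpy sketch of the idea (unnormalized causal linear attention with illustrative shapes; production kernels add feature maps, normalization, and fused GPU implementations): the sequential recurrence and the chunk-wise form give identical outputs, but the chunk-wise form turns most of the work into batched matrix multiplications that parallelize across each chunk.

```python
import numpy as np

def linear_attention_sequential(Q, K, V):
    """Reference recurrence: o_t = q_t @ S_t with S_t = S_{t-1} + k_t v_t^T."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def linear_attention_chunkwise(Q, K, V, chunk: int = 64):
    """Same result, but work inside each chunk is batched matrix math
    (parallel-friendly); only the small state crosses chunk boundaries."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for start in range(0, T, chunk):
        q, k, v = Q[start:start+chunk], K[start:start+chunk], V[start:start+chunk]
        causal = np.tril(np.ones((len(q), len(q))))          # intra-chunk causal mask
        out[start:start+chunk] = q @ S + (causal * (q @ k.T)) @ v
        S = S + k.T @ v                                      # carry state to the next chunk
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 256, 32))
print(np.allclose(linear_attention_sequential(Q, K, V),
                  linear_attention_chunkwise(Q, K, V)))      # True
```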
The second major challenge is scaling parallelism beyond a single device or node to the cluster level. For infinite context models to process and learn from trillion-token datasets, they must support distributed parallelism in the assembly and updating of recurrent states. Traditional gradient-based methods achieve this by dividing data across nodes, synchronizing updates via gradients. However, recurrent models present unique obstacles: their state must reflect the entirety of the sequence history, making state synchronization across nodes complex. To address this, new paradigms for cluster-level parallelism must be developed, allowing recurrent states to be efficiently assembled, updated, and shared across distributed systems without sacrificing fidelity.
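To make the idea concrete, here is a toy sketch (an additive outer-product state with made-up shapes and a hypothetical four-worker split; not a description of any existing system): partial states built independently on each shard can be reduced in sequence order, because the combination operator is associative.

```python
import numpy as np

def shard_partial_state(K_shard, V_shard):
    """Each worker summarizes its slice of the sequence independently."""
    return K_shard.T @ V_shard                      # (d_k, d_v) partial state

def combine(left_state, right_state):
    """Associative combination for a purely additive state: an order-preserving sum.
    Gated or decaying recurrences need a richer operator (e.g., carrying a decay
    term alongside the state), but the tree-reduction pattern is the same."""
    return left_state + right_state

# Hypothetical 4-node setup: split one long sequence across workers,
# build partial states in parallel, then reduce them in sequence order.
rng = np.random.default_rng(0)
K, V = rng.normal(size=(2, 4096, 32))
shards = np.split(np.arange(4096), 4)
partials = [shard_partial_state(K[idx], V[idx]) for idx in shards]

state = partials[0]
for p in partials[1:]:
    state = combine(state, p)

print(np.allclose(state, K.T @ V))                  # matches the single-node state
```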
Ultimately, any system aspiring to scale infinite context learning as a training mechanism must achieve seamless parallelism both within sequences and across clusters. By overcoming these barriers, infinite context models could unlock unprecedented scalability, enabling them to learn from trillion-token datasets and beyond while maintaining the efficiency and adaptability that define their potential. These advancements would not only establish in-context learning as a competitive training paradigm but also open new possibilities for real-time, experience-driven AI systems.
Key Challenges
Transformers have become the gold standard for many sequence modeling tasks, largely due to their exceptional performance on long-range dependencies and their ability to scale to large datasets. Recurrent models, while theoretically well-suited for infinite context due to their ability to process sequences with constant memory and compute requirements, often struggle to match the inference quality of transformers. Designing recurrent architectures that can match the performance of transformers remains an open problem. This requires innovations in state representations, update mechanisms, and optimization strategies to ensure recurrent models achieve parity with or surpass transformer performance.
The requirement for very large recurrent states introduces substantial computational complexity. Efficient manipulation of very large states is critical for scaling infinite context models. Managing these large states also demands careful handling of memory constraints, as naively increasing state size can quickly exceed the capabilities of modern hardware. Techniques such as sparse representations, state compression, and distributed memory architectures may be necessary to address these issues, but implementing these solutions without sacrificing performance or fidelity remains a significant challenge.
Recurrent models inherently process data sequentially, but must be capable of parallelization across the time dimension to fully utilize modern hardware. Their sequential nature contrasts sharply with transformers, which excel at leveraging parallel computation across the entire sequence. Overcoming this bottleneck is essential for scaling recurrent models to handle long sequences efficiently. Emerging techniques, such as chunk-wise parallelization in linear transformers, offer promising directions by enabling parallel computation within subsequences.
Scaling infinite context models to operate on trillion-token datasets requires leveraging distributed systems to compute new states. Current distributed training methods for transformers rely on splitting gradients and parameters across nodes, but these techniques may not be directly applicable to recurrent state updates. Developing methods for efficient state assembly, synchronization, and sharing across clusters is critical for achieving the throughput necessary for large-scale in-context learning.
Roadmap
Fulfilling the engineering requirements for infinite context AI is essential for unlocking scalable, efficient, and transformative systems capable of dynamic learning and real-time adaptation. These challenges—efficient operation, consistent memory dynamics, large capacity, and computational scalability—define the path toward making infinite-context learning a practical reality. We’ve developed a design that satisfies all these requirements, leveraging innovations in linear transformers, large matrix manipulation, and mathematical reformulations. While much work remains to gather empirical evidence, scale, and eventually demonstrate this approach, we are confident in its potential. Over the coming weeks and months, we will share details of our design, including experiments and benchmarks, to demonstrate its capabilities and refine its utility. Our roadmap begins with validating the core principles of the design, benchmarking performance, and iterating based on results. Each phase will further align the system’s design with the vision of scalable in-context learning. We’re excited to explore this technology and look forward to sharing our progress with you.