Thought Leaders

Five Steps to Turn Memory From AI’s Biggest Constraint Into a Competitive Advantage

For the past few years, AI infrastructure has focused on compute above all else. More accelerators, larger clusters and higher FLOPS dominated the conversation about getting the most out of GPUs. That approach made sense when model progress depended mainly on training scale. Now that production deployments take priority, a new constraint demands attention: memory.

Today, many of the toughest constraints in AI show up in memory: capacity, bandwidth, latency and the time and energy cost of moving data through a system. Context windows keep expanding, with companies like Anthropic now offering million-token windows at standard pricing. Inference workloads are growing. Multi-agent systems pass ever-larger volumes of data from one stage to the next. Operators can keep adding GPUs, yet performance still falls short, because each server is limited to its own in-system RAM and cannot feed its accelerators efficiently.

This shift affects both throughput and cost for hyperscalers and data center operators. When memory becomes the limiting factor, organizations often respond by overprovisioning expensive hardware, leaving GPU capacity underused and absorbing higher power and infrastructure costs. The next stage of AI scale will depend less on adding raw compute and more on building memory architectures that fit the way production AI actually runs.

Here are five steps infrastructure leaders can take now to prepare for ever-growing memory demands.

1. Start by measuring the real bottleneck

Many organizations still evaluate AI performance through a compute-first lens. They track cluster utilization, accelerator counts and top-line throughput, then assume improvements will come from adding more accelerators. That view often misses the real issue.

Memory pressure often shows up in stalled accelerators, higher per-token latency and inconsistent throughput under load. A GPU may look underutilized if it’s waiting for data to arrive from another memory tier, another server or another stage in the application. Inference makes that problem more visible as KV cache size grows and more simultaneous sessions compete for bandwidth.

Operators need better visibility into effective memory utilization, looking at bytes moved per token, accelerator stall time and memory access patterns across CPUs, GPUs and adjacent memory tiers. They also need pipeline tracing that can separate memory-related delays from network or storage issues. Without that visibility, teams risk spending more on compute without addressing the actual source of the slowdown.
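The metrics above can be sketched in a few lines. This is a hypothetical illustration of deriving memory-pressure signals from raw counters an operator might already collect; the counter names, the 30% stall threshold and the function itself are assumptions for illustration, not any vendor's telemetry API.

```python
# Hypothetical sketch: turning raw counters into memory-pressure metrics.
# Counter names and the stall threshold are illustrative assumptions.

def memory_pressure_metrics(bytes_moved, tokens_served,
                            stall_cycles, total_cycles):
    """Summarize how memory-bound a serving node is over a sample window."""
    bytes_per_token = bytes_moved / max(tokens_served, 1)
    stall_fraction = stall_cycles / max(total_cycles, 1)
    return {
        "bytes_per_token": bytes_per_token,
        "stall_fraction": stall_fraction,
        # A high stall fraction alongside low compute utilization suggests
        # the accelerator is waiting on data, not short on FLOPS.
        "memory_bound": stall_fraction > 0.3,
    }

metrics = memory_pressure_metrics(
    bytes_moved=512 * 2**30,   # 512 GiB moved during the window
    tokens_served=1_000_000,
    stall_cycles=4_200,
    total_cycles=10_000,
)
```

Tracked over time, a rising bytes-per-token figure or stall fraction points at memory, not compute, as the place to spend the next dollar.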

2. Reduce data movement before adding more capacity

In large AI systems, moving data can create as much overhead as processing the data.

This is especially true in inference. As context windows expand, the KV cache can become one of the largest consumers of system memory in the stack. Multi-tenant serving and multi-agent workflows can add even more. The first stage generates an output, then another consumes it and the infrastructure handles this handoff by copying large blocks of data between GPUs, across servers or through framework-level serialization.

Those copies carry a real cost. They consume bandwidth, add latency and leave expensive compute resources waiting for the next transfer to finish. They also push operators to buy more high-cost memory than the workload really requires.

Before investing in more accelerators, teams should identify where in a system data is moving more than necessary. GPU-to-GPU transfers, server-to-server copies and repeated movement of intermediate states across agent pipelines are good places to start. In many environments, cutting unnecessary movement delivers more usable performance than another server.
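The difference between a copying handoff and a zero-copy one can be shown in miniature. This is a single-process sketch with illustrative stage names; in a real serving stack the buffer would be a KV cache or activation tensor and the copy would cross GPUs or servers.

```python
# Minimal sketch: copy-based vs zero-copy handoff between two pipeline
# stages. Names are illustrative; the point is that a view shares the
# producer's bytes while a copy duplicates them.

def stage_a_output(size):
    # Pretend this is a large intermediate state produced by stage A.
    return bytearray(size)

def handoff_with_copy(buf):
    # Serialization-style handoff: every byte is duplicated.
    return bytes(buf)

def handoff_zero_copy(buf):
    # A memoryview references the same storage: no bytes are duplicated.
    return memoryview(buf)

buf = stage_a_output(1024)
copied = handoff_with_copy(buf)
view = handoff_zero_copy(buf)

buf[0] = 7
# The copy is now stale; the view observes the producer's update in place.
```

Multiplied across agent pipelines and multi-tenant sessions, each avoided copy is bandwidth and latency returned to useful work.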

3. Build memory tiers around workload behavior

AI infrastructure works better when operators stop treating memory as a single source and start treating it as a hierarchy with distinct roles.

The hottest data should stay closest to the accelerator. That includes working sets that demand the lowest latency and the highest bandwidth. Other active buffers and frequently accessed states can sit in DRAM. Larger structures that need scale more than absolute speed can move into pooled memory. Colder data and less active models belong farther down the stack.

This approach requires teams to understand which data changes constantly, which data many processes share and which data can tolerate a modest latency tradeoff without affecting service quality. Too many deployments still default to pushing everything into the fastest HBM tier because it feels safer. That approach drives up cost and usually leaves efficiency on the table.
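A placement policy along these lines can be expressed as a simple rule set. The tier names, thresholds and access-pattern fields below are illustrative assumptions, not a production heuristic.

```python
# Toy placement policy for the tiering described above. Thresholds and
# field names are assumptions for illustration.

def place(access_hz, shared_by, latency_tolerant):
    """Pick a memory tier for a data object based on its observed behavior."""
    if access_hz > 1000 and not latency_tolerant:
        return "HBM"      # hottest working sets stay next to the accelerator
    if access_hz > 10:
        return "DRAM"     # active buffers and frequently read state
    if shared_by > 1 or latency_tolerant:
        return "pooled"   # large, shareable structures that need scale
    return "cold"         # inactive models and archival state

tier = place(access_hz=5000, shared_by=1, latency_tolerant=False)
```

The useful part is not the specific thresholds but the discipline: each object gets a tier for a stated reason, rather than defaulting to the fastest memory available.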

A tiered memory strategy gives operators more control over both performance and economics. In production AI, that balance is becoming a core design requirement.

4. Treat shared memory as part of the architecture for agentic AI

Multi-agent AI is raising the cost of fragmented memory design.

In many agentic systems, one agent produces output that another agent uses immediately. A third service may rank that output, add context or route it into another model. If each step creates a fresh copy of the same state, traffic rises quickly. As context grows, the size of that copied data grows with it. The system spends more time moving data than processing it.

This is where shared memory becomes increasingly important, particularly for shared KV cache and other states that multiple agents or services need to access. Shared memory can reduce redundant copies, lower network traffic and improve utilization across the full application path. It can also help agentic systems scale effectively as different nodes or agents are able to reuse KV cache with shared memory.
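One common form of this idea is keying a shared cache by the token prefix, so a KV state is computed once and reused by every agent that starts from the same prefix. The class and method names below are assumptions sketched for illustration, not a specific serving framework's API.

```python
# Sketch of prefix-keyed KV-cache sharing between agents. Class and
# method names are illustrative assumptions.

import hashlib

class SharedKVCache:
    def __init__(self):
        self._store = {}  # prefix hash -> cached KV state (one copy)

    def _key(self, tokens):
        return hashlib.sha256(" ".join(tokens).encode()).hexdigest()

    def get_or_compute(self, tokens, compute_fn):
        key = self._key(tokens)
        if key not in self._store:
            self._store[key] = compute_fn(tokens)  # computed exactly once
        return self._store[key]                    # later agents reuse it

cache = SharedKVCache()
calls = []
kv1 = cache.get_or_compute(["system", "prompt"],
                           lambda t: calls.append(1) or len(t))
kv2 = cache.get_or_compute(["system", "prompt"],
                           lambda t: calls.append(1) or len(t))
# compute_fn ran once; both callers received the same cached state.
```

In a real deployment the store would live in shared or pooled memory so that agents on different nodes, not just different calls in one process, can reuse the same state.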

For hyperscalers, this is no longer an edge case. As agentic AI matures, shared memory is becoming a practical requirement for efficient deployment.

5. Embrace CXL for production infrastructure

For the past few years, the industry viewed CXL as a promising standard that needed more time to mature as it moved quickly from version 1.x to 2.0. Now, with 3.x hardware arriving, CXL is reaching the point of being feature-complete, backward compatible and ready to take on production loads.

CXL has reached a level of maturity where hyperscalers and data center operators should treat it as a practical option for production memory expansion, pooling and shared-memory architectures. It now belongs in serious infrastructure planning, especially for environments that need more flexible memory scaling and better economics around inference.

That does not mean every workload should move to CXL-based memory. Local memory will remain essential for the hottest and most latency-sensitive data. But operators no longer need to wait for some future version of the standard before they act. The more useful question is where CXL can solve real production problems today.

The clearest opportunities are in memory expansion, pooled memory and shared-memory designs that reduce unnecessary copies across AI workflows. Those use cases line up directly with current pressure points: rising KV cache demands, growing agent-to-agent data transfer and the need to improve GPU utilization without pushing total cost of ownership even higher.

Operators still need to engineer carefully. Latency, predictability and software support still matter. Memory management policies need to place data in the right tier at the right time. But those are implementation questions, not reasons to postpone planning.

At XCENA, we see memory, data movement and utilization as the central constraints in production AI infrastructure. That is why we focus on CXL-based computational memory and architectures that reduce unnecessary copying, support shared access and help operators make better use of expensive compute resources.

The industry spent years treating memory as a supporting resource behind the real engine of AI progress. That view no longer fits the production deployment reality. Memory now shapes utilization, efficiency and cost at every level of the stack. The operators that recognize that shift early will have an advantage that is measured not just in performance, but in how effectively they scale AI in the real world.

Jin Kim is the CEO and co-founder of XCENA, a South Korea–based fabless semiconductor company focused on building next-generation memory solutions for AI and large-scale data processing. With a background that includes senior leadership roles at SK Hynix—where he was one of the youngest corporate vice presidents—Kim brings deep expertise in data-centric computing and semiconductor architecture.