DRAM’s Whac‑A‑Mole Security Crisis
Key takeaways:
- Rowhammer remains a DRAM security threat, and Rowpress has emerged as a related one.
- New commands issued by the memory controller can help manage refreshes, but they’re not a perfect solution.
- A smaller, vertical DRAM cell may eliminate the problem, but it’s years away.
Rowhammer has been a persistent DRAM issue across several memory generations, worsening as manufacturing technology has advanced. It’s also been joined by a related issue called Rowpress. New refresh commands can help avoid ill effects, but the only hope for eliminating the issues may rest on a new DRAM cell.
“Rowhammer is caused by cell-to-cell interference, leading to victim bits to flip,” said Xi-Wei Lin, executive director, applications engineering at Synopsys. “It can be exploited for security breaches and get worse as cells get closer to each other when scaling the 6F² architecture down.”
Rowhammer has spurred countless ideas for how to fix it, test for it, and mitigate damage, but efforts generally have been thwarted by new attacks. New commands intended to solve these issues have been employed to mount new attacks, and a permanent fix, if it happens, is still years away.
“Rowhammer and related neighbor-disturb issues like Rowpress remain an industry concern and have a growing impact at smaller process nodes,” said Steven Woo, fellow and distinguished inventor at Rambus. “Countermeasures continue to evolve.”
DRAM layout secrecy contributes to the problem, but there’s no indication that it will change. “We argue that keeping internal DRAM topologies secret hurts DRAM customers in several ways,” wrote Microsoft’s Stefan Saroiu, Alec Wolman, and Lucian Cojocar in a research report.
In theory, this isn’t a tale of weak parts that weren’t caught at some test point. Every memory has some level of vulnerability. It’s just that some are more vulnerable than others. “If you do a Rowhammer test for every DRAM product over time, you always find a failure,” noted Jongsin Yun, director of memory research at Siemens EDA.
A puff of electrons
Rowhammer is caused by trapped electrons getting pushed into the bulk silicon around the cell. Imperfections in etched sidewalls act as traps, catching electrons. When the word line is activated, some of those electrons are pushed out. Once in the bulk silicon, they can migrate to neighboring cells that share the same bulk silicon.
A single such event isn’t notable, but repeatedly activating, or hammering, the word line can push enough electrons into neighboring cells to change their state. Refreshing undoes some of the damage, as long as a cell hasn’t already flipped. But once a cell has flipped, refreshes will reinforce the flipped state. For that reason, a Rowhammer attack must pack enough accesses into the interval between refreshes to be effective.
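The race between hammering and refreshing can be put into rough numbers. The sketch below is illustrative only: the 64 ms refresh window and the 50 ns row cycle time are assumed round values chosen for the example, not figures from any vendor or from the sources quoted here, and the function names are hypothetical.

```python
def max_activations(refresh_window_ns: int, row_cycle_ns: int) -> int:
    """Upper bound on how often one row can be activated between
    two consecutive refreshes of a neighboring (victim) row."""
    return refresh_window_ns // row_cycle_ns


def can_reach_threshold(threshold: int, refresh_window_ns: int, row_cycle_ns: int) -> bool:
    """True if an attacker has enough time between refreshes to hit
    the row's hammering threshold."""
    return max_activations(refresh_window_ns, row_cycle_ns) >= threshold


# Assumed, illustrative timing values (not taken from a datasheet).
REFRESH_WINDOW_NS = 64_000_000  # 64 ms between refreshes of a given row
ROW_CYCLE_NS = 50               # minimum time between activations of one row

budget = max_activations(REFRESH_WINDOW_NS, ROW_CYCLE_NS)
print(budget)  # 1280000 activations fit in one refresh window
```

Under these assumptions, any row whose threshold is below about 1.28 million activations can in principle be attacked within a single refresh window. That is why falling thresholds matter so much.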
The row being hammered is typically called the aggressor. The rows being affected are called the victims.
As processing nodes have evolved, memory bit cells have been packed ever closer together, making the problem worse. Each row has a “threshold” of repeated accesses, at which point neighbors will flip bits. Manufacturing variations result in each row potentially having a different threshold, but with each process improvement, those thresholds have come down. It now takes fewer activations to cause problems.
And it’s not just the next-door neighbor row that may be impacted. Rows farther away can also change state, especially as rows have come closer together. The number of neighboring victim rows that may be affected is called the blast radius.
Numerous “fixes” have been proposed. Unfortunately, they have all fallen short of solving the problem so far. Most of the focus has been on mitigation and managing refreshes.
“We have a regular refresh schedule, and in the meantime, we can add an extra refresh based on access counting,” said Siemens’ Yun.
A slow, hot wind
More recently, a new related but different phenomenon has emerged. It’s triggered not by repeated accesses, but by one prolonged access. It’s called Rowpress, because the row is pressed instead of hammered.
Rowpress occurs because of the pass-gate effect (PGE) between cells. When a word line is activated, it alters the threshold voltage of a neighboring cell, and that cell’s leakage rises. After enough time, it can flip state. A dummy word line exists between cells to add some isolation, but it’s insufficient.
Rowpress and Rowhammer feel similar. Both ultimately rely on neighboring cells sharing the same bulk silicon, through which electrons can migrate, and both can be countered by a timely refresh. The key difference is that their effects go in opposite directions with temperature changes.
Counting accesses
Managing refreshes to keep the data clean might appear to be an easy approach: keep track of the rows that are vulnerable as neighboring rows are hammered (or pressed), and issue preemptive refreshes when necessary, even if this is outside of the standard refresh schedule. But that requires knowing how rows are laid out so that one can identify neighbors. And this is where part of the problem lies.
The first mitigation attempt was called target row refresh (TRR), a feature built into the memory chip that tracks row accesses and, when it finds a row that appears to be hammered, refreshes that row’s neighbors. Keeping it on the memory chip protects the layout secrets.
TRR has been only partially successful, however. It’s not standardized, and each memory maker has a proprietary way of identifying access patterns. Attacks have been able to work around it. In addition, it does not protect against Rowpress because it’s focused on counting activations within a period of time.
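The general idea of counting activations and refreshing neighbors can be sketched in a few lines. This is not any vendor’s TRR, since those implementations are proprietary and differ from one another. The class name, the threshold, and the assumption that rows N-1 and N+1 are the physical neighbors of row N are all hypothetical.

```python
from collections import Counter


class RowActivationTracker:
    """Toy model of counter-based tracking: count activations per row and
    refresh the adjacent rows once a row looks hammered."""

    def __init__(self, threshold: int, num_rows: int):
        self.threshold = threshold   # activations tolerated before acting
        self.num_rows = num_rows
        self.counts = Counter()      # per-row activation counts
        self.refreshed = []          # log of rows given an extra refresh

    def activate(self, row: int) -> list:
        """Record one activation; return the victim rows refreshed, if any."""
        self.counts[row] += 1
        if self.counts[row] < self.threshold:
            return []
        # Assumption: logical neighbors are physical neighbors.
        victims = [r for r in (row - 1, row + 1) if 0 <= r < self.num_rows]
        self.refreshed.extend(victims)
        self.counts[row] = 0         # start counting again for this row
        return victims

    def regular_refresh(self) -> None:
        """A scheduled refresh restores all rows, so counts start over."""
        self.counts.clear()


tracker = RowActivationTracker(threshold=3, num_rows=8)
for _ in range(3):
    refreshed_now = tracker.activate(4)
print(refreshed_now)  # [3, 5]
```

Because the tracker only counts activations, a single long activation of the kind Rowpress relies on never trips the threshold. That illustrates why counting-based schemes do not cover it.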
Let the controller handle it
To combat this, a new command called refresh management (RFM) was standardized. In this case, it’s the memory controller monitoring accesses, not the chip itself. But it only knows which row is being hammered. It doesn’t know which neighbors might be affected.
RFM is a bank-oriented command. The memory controller can track how often a bank is accessed, and if it suspects an attack, it can issue an RFM command to refresh the bank and undo any damage. As a result, it’s not very precise.
Early implementations issued RFM on a fixed schedule, making it similar to standard refreshes, although it stalls a bank for less time. When the number of accesses in a bank exceeds a threshold, a refresh is initiated, and then the access count is reduced so that counting can restart.
A new variant called adaptive refresh management (ARFM) allows software to adjust both the access threshold and the number by which that count is reduced after a refresh. This provides flexibility for adapting to different workloads.
“Once RFM is executed, the access counter is reduced by a designated number to restart counting,” said Yun. “ARFM allows the user to modify both the threshold number and the decremental value of the counter for a single bank or for all banks.”
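The counting scheme Yun describes can be modeled as a simple per-bank counter. The following is a minimal sketch and not the standardized mechanism: the class, method, and parameter names are hypothetical, and a real controller would follow the counter rules defined in the relevant memory standard.

```python
class BankRfmCounter:
    """Toy per-bank activation counter in the spirit of RFM and ARFM."""

    def __init__(self, threshold: int, decrement: int):
        self.threshold = threshold   # activations that trigger an RFM
        self.decrement = decrement   # amount removed from the count per RFM
        self.count = 0
        self.rfm_issued = 0

    def configure(self, threshold: int, decrement: int) -> None:
        """ARFM-style adjustment of both values, e.g. for a new workload."""
        self.threshold = threshold
        self.decrement = decrement

    def on_activate(self) -> bool:
        """Count one activation in this bank; return True if an RFM is issued."""
        self.count += 1
        if self.count < self.threshold:
            return False
        self.rfm_issued += 1
        # The count is reduced rather than cleared, so counting restarts
        # from whatever remains.
        self.count = max(0, self.count - self.decrement)
        return True


bank = BankRfmCounter(threshold=4, decrement=3)
events = [bank.on_activate() for _ in range(8)]
print(events)  # [False, False, False, True, False, False, True, False]
```

With a decrement smaller than the threshold, the counter does not return to zero after an RFM, so the next one arrives sooner. Tuning that pair of values is the flexibility that ARFM exposes.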
An even newer command called directed refresh management (DRFM) has now been standardized. “They record the accessing address and refresh not just on the direct neighboring cells, but those from two or three rows away, depending on their setting,” said Yun.
DRFM operates at the row level, not the bank level, so it’s more precise. “RFM and DRFM allow systems to proactively target vulnerable victim rows in Rowhammer attacks and reduce disturbance effects without excessive blanket refreshing,” Rambus’ Woo explained. “This helps improve data security and system reliability while minimizing the impact to performance and power, which are especially important in HBM as stacks get denser and systems demand more performance.”
It’s possible that DRFM could increase power by issuing more refreshes than other approaches, some of which may be redundant. For instance, the memory controller might see two rows being hammered and issue DRFM commands for both of them. It’s also possible that the two rows share victim rows, so refreshing for one might have fixed the other. This is particularly true for so-called double-sided attacks, where two rows that share a victim are hammered, reducing the threshold since electrons are coming at the victim from both sides.
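The redundancy in a double-sided attack is easy to see with a small calculation. The sketch below assumes that consecutive row numbers are physically adjacent, which is exactly the layout information a real memory controller does not have. The function names and the blast radius value are hypothetical.

```python
def victim_rows(aggressor: int, blast_radius: int, num_rows: int) -> set:
    """Rows within blast_radius of the aggressor, excluding the aggressor."""
    low = max(0, aggressor - blast_radius)
    high = min(num_rows - 1, aggressor + blast_radius)
    return {r for r in range(low, high + 1) if r != aggressor}


def refresh_plan(aggressors: list, blast_radius: int, num_rows: int) -> tuple:
    """Return (refreshes if each aggressor is handled separately,
    sorted list of distinct rows that actually need a refresh)."""
    separate = sum(len(victim_rows(a, blast_radius, num_rows)) for a in aggressors)
    merged = set()
    for a in aggressors:
        merged |= victim_rows(a, blast_radius, num_rows)
    return separate, sorted(merged)


# Double-sided attack: rows 10 and 12 are hammered, and both disturb row 11.
separate, merged = refresh_plan([10, 12], blast_radius=1, num_rows=1024)
print(separate, merged)  # 4 [9, 11, 13]
```

Handling each aggressor on its own refreshes row 11 twice, so four refreshes are spent where three would do. The gap grows as the blast radius widens.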
DDR5, LPDDR5, and HBM3 support RFM and ARFM. DDR5, LPDDR5, and HBM4 support DRFM. GDDR supports none of the commands, relying on internal techniques.
These commands can backfire
Although commands such as DRFM are intended to protect against attacks, they also can be used for attacks. Refreshes activate rows, and directed refreshes activate specific rows. Repeated DRFM commands have been found to act as transitive Rowhammer attacks because the victim row being refreshed now becomes a new aggressor row, affecting a row next to it.
Part of the issue is that memory controllers are operating blind. They can see which rows are attacked, but they don’t know the victim rows, the blast radius, or the potential victims of transitive attacks. Refreshing an entire bank may be overkill, and multiple DRFM refreshes may be redundant. Both may raise power due to the additional — and potentially excessive — refreshes.
Making the memory layout public could improve the situation. Instead of tracking aggressor rows, controllers could track victims. The controller would have full visibility, making TRR and other circuits on the DRAM chip unnecessary.
“Memory controllers could provide stronger and more efficient Rowhammer defenses than what is possible today,” argued the Microsoft team. “Rowhammer defenses could be more targeted and consume less DRAM bandwidth and power. Memory controllers could refresh victim rows using traditional row activation commands without the need for a new DRAM command, saving on implementation and testing costs.”
The motivation for secrecy appears to have two parts. First is the desire to keep knowledge from memory competitors, while the second is a concern that customers could use such information in bake-offs to pick a vendor. The Microsoft team felt that these concerns were misplaced.
“DRAM vendors have the budgets and the know-how to reverse-engineer each other’s parts and learn about their internal IP secrets,” the team explained. “DRAM customers will be unlikely to use internal DRAM topology information to decide which vendor’s devices are best. Some DRAM devices already report timing and other information through their Serial Presence Detect (SPD) chips. Yet there is no evidence that SPD-based information dictates DRAM customers’ purchases.”
Memory makers have so far declined to make their layouts public, so we’re left with the current Whac-A-Mole approach to closing holes to prevent attacks.
It’s going to take a new cell
Two characteristics of traditional DRAM cells contribute to this problem. The traps collecting electrons contribute to Rowhammer, and the fact that neighboring cells share bulk silicon enables both Rowhammer and Rowpress. There are no simple fixes for these characteristics.
For other reasons, several memory companies are researching a new DRAM cell that moves to a vertical transistor. Although the details vary, two aspects of these cells may eliminate both attacks.
Most importantly, in the new proposals, the cells do not share bulk silicon, which cuts off a path for electrons to float from one cell to another. In addition, some reports suggest that manufacturing steps have replaced etching with epitaxy. That can eliminate many of the traps at the edges because epitaxially grown surfaces are cleaner than etched sidewalls.
“With the advent of 4F² architecture, the Rowhammer effect is expected to be largely diminished, because adjacent vertical-channel transistors no longer share the same substrate body as in 6F²,” said Lin.
This cell is some years away from hitting production, however. And even then, it will benefit only new memories. Existing memory generations tend to have long lifetimes, so we’ll still be dealing with these attacks for many years to come.
Editor’s Note: Internet searches on the new commands may provide highly conflicting, confusing, and sometimes erroneous results, more than usual. Readers looking for more information should use such searches cautiously, following through directly with suggested sources to confirm any results.
