Ambitious hacker reduces worst-case memory latency by up to 93%, but with severe downsides: 1960s bottleneck overcome by hedging memory accesses to avoid running into DRAM refresh stalls

Clever software trick works on x86 and Arm to radically reduce worst-case memory latency, but there's a trade-off

A method devised by YouTuber, Googler, and security researcher LaurieWired could have huge implications for a small set of specific workloads that are highly sensitive to "tail latency," or near-worst-case memory access latency. The project is called TailSlayer, and it fundamentally works by hedging memory accesses to avoid running into DRAM refresh stalls.

Without getting fully into the weeds, the type of memory we all use, DRAM, has one serious downside: it has to be refreshed constantly. The cells where DRAM stores its data are fundamentally tiny capacitors, and they leak charge by their very nature, so the charge has to be continually topped up to make sure they retain their data. This is known as DRAM refresh. The refresh cadence varies widely depending on the system and the type of DRAM in question, but the interval between refresh commands is generally measured in microseconds, meaning your memory is refreshed thousands of times in the time it takes you to blink.
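For a rough sense of scale, here's a back-of-the-envelope calculation assuming JEDEC's nominal DDR4 refresh-command interval (tREFI of about 7.8 µs) and a roughly 100 ms blink; both figures are illustrative assumptions, not measurements from the TailSlayer project:

```cpp
// Back-of-the-envelope refresh cadence. tREFI ~= 7.8 us is the nominal
// JEDEC refresh-command interval for DDR4; ~100 ms is a rough human
// blink. Both are assumptions for illustration only.
#include <cstdio>

int main() {
    constexpr double tREFI_us = 7.8;    // assumed interval between refresh commands
    constexpr double blink_ms = 100.0;  // assumed duration of a blink
    double refreshes = (blink_ms * 1000.0) / tREFI_us;
    std::printf("~%.0f refresh commands per blink\n", refreshes);  // ~12,800
    return 0;
}
```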

Now, DRAM refreshes aren't performed synchronously with memory accesses. Because of that, it's entirely possible for your system to try to access memory that's currently being refreshed. If that happens, the request simply stalls until the refresh cycle is finished. This can cause a stall of hundreds of nanoseconds, which isn't a long time in absolute terms, but at the speeds of modern chips, even a 200-nanosecond stall can be a thousand cycles where a CPU core isn't getting any work done.
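The cycle math is straightforward; assuming an illustrative 5 GHz core clock (not a figure from her tests), a 200 ns stall looks like this:

```cpp
// Converting a refresh stall into wasted CPU cycles. The 5 GHz clock is
// an assumed, illustrative figure; the 200 ns stall is the example from
// the text above.
#include <cstdio>

int main() {
    constexpr double stall_ns  = 200.0;  // example near-worst-case refresh stall
    constexpr double clock_ghz = 5.0;    // assumed core clock (cycles per ns)
    std::printf("~%.0f idle cycles per stall\n", stall_ns * clock_ghz);  // 1000
    return 0;
}
```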

This isn't a major problem for most use cases because we have all kinds of strategies in place to deal with stalls like this; the problem has been known since the 1960s, when DRAM was invented, so naturally, hundreds of very smart people have put their genius to work on developing workarounds. However, certain very specific workloads are extremely sensitive to non-deterministic memory latency, and these DRAM refresh stalls are a major source of exactly that kind of behavior.

So what can you do? LaurieWired decided to tackle this problem for reasons she never fully elucidates, but which probably boil down to "it was interesting," given her other work. Her initial idea involved attempting to predict DRAM refreshes and synchronize around them, but that turns out to be impossible for several reasons she goes over in her video. Her next idea was parallelism on a single core, but she was stymied by CPU caches and reorder buffers, the very features that largely obviate the issue in the first place.

Her breakthrough came when she realized she didn't necessarily have to do all of this on one CPU core. In the end, she elected to fully duplicate the working set across memory addressing boundaries, ensuring that each copy resides on a different physical memory channel with independent timing behavior, and then run her operations simultaneously on two different CPU cores, each accessing a different memory channel. She could then simply let them race and take the result of whichever finishes first; if one core hits a DRAM refresh interval, the likelihood that the other core also does is pretty low. This method allowed her to reduce the tail latency of DRAM accesses on her consumer Ryzen desktop system by more than half, which is huge.
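Here's a minimal sketch of that first-wins race, with a hypothetical buffer-summing workload standing in for the real memory walk; the platform-specific parts (pinning each thread to a core and allocating each replica so it lands on a distinct DRAM channel) are deliberately omitted, and none of this is TailSlayer's actual code:

```cpp
// Minimal sketch of hedged memory accesses: two threads walk identical
// replicas of the working set, and the first to finish publishes its
// result. Channel-aware allocation and core pinning are omitted.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

static uint64_t walk(const std::vector<uint64_t>& replica) {
    // Stand-in for any latency-sensitive pass over the working set.
    return std::accumulate(replica.begin(), replica.end(), uint64_t{0});
}

int main() {
    // Two identical copies of the working set. In the real scheme, each
    // copy would be placed so it maps to a different memory channel.
    std::vector<uint64_t> replica_a(1 << 20, 1);
    std::vector<uint64_t> replica_b(1 << 20, 1);

    std::atomic<bool> won{false};
    std::atomic<uint64_t> result{0};

    auto racer = [&](const std::vector<uint64_t>& replica) {
        uint64_t r = walk(replica);
        bool expected = false;
        // Only the first finisher publishes; the loser's work is discarded.
        if (won.compare_exchange_strong(expected, true))
            result.store(r, std::memory_order_release);
    };

    std::thread t1(racer, std::cref(replica_a));
    std::thread t2(racer, std::cref(replica_b));
    t1.join();
    t2.join();

    std::printf("winner's result: %llu\n",
                (unsigned long long)result.load(std::memory_order_acquire));
    return 0;
}
```

Since the loser's work is simply thrown away, the scheme trades total memory capacity and bandwidth (every replica is a full copy of the working set, and every racer burns a core) for a tighter tail, which is exactly the trade-off the headline refers to.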

By renting server time on Amazon AWS instances, she was able to test on high-end AMD, Intel, and Arm server hardware. She achieved far greater results on these machines for a few reasons: they have slower CPU clock rates, slower memory, and more conservative memory timings, which means stalls are even more costly. The real difference, though, is the number of available memory channels. An EPYC Turin processor has twelve memory channels, and by executing her strategy there, hedging across all twelve channels, she was able to cut near-worst-case memory latency (tail latency) by a staggering 89%.
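The two-way race generalizes directly: one replica and one racing thread per channel, first finisher wins. A compact sketch with an assumed count of twelve (to mirror the EPYC Turin channel count; allocation and pinning details are again elided):

```cpp
// The same first-wins race generalized to N replicas, one per memory
// channel. kReplicas = 12 is an assumption mirroring EPYC Turin's
// channel count; channel-aware allocation and core pinning are again
// platform-specific and omitted.
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

int main() {
    constexpr int kReplicas = 12;  // assumed: one replica per channel
    std::vector<std::vector<uint64_t>> replicas(
        kReplicas, std::vector<uint64_t>(1 << 20, 1));

    std::atomic<bool> won{false};
    std::atomic<uint64_t> result{0};

    std::vector<std::thread> racers;
    for (int i = 0; i < kReplicas; ++i) {
        racers.emplace_back([&, i] {
            uint64_t sum = 0;
            for (uint64_t v : replicas[i]) sum += v;  // walk this replica
            bool expected = false;
            if (won.compare_exchange_strong(expected, true))
                result.store(sum);  // first finisher wins
        });
    }
    for (auto& t : racers) t.join();
    return 0;
}
```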
