
The clever hacker achieved even better results on Intel and Arm hardware. On Intel Xeon processors from the Sapphire Rapids and Diamond Rapids families, she managed to achieve gains as high as 93.3%, or in other words, she slashed p99.99 memory latency from 1697ns all the way down to 113ns. Considering the lowest value on the chart is around 105ns, that means the Xeon managed to achieve unbelievably deterministic memory latency.
I've said a few times that "certain workloads" benefit from TailSlayer. The most obvious place where determinism in memory latency is absolutely critical is in the slightly absurd world of high-frequency trading (HFT). If you're not familiar, imagine a bunch of hyper-caffeinated algorithms in a cage match where the winner is whoever can buy or sell a stock a few microseconds before everyone else. HFT firms spend obscene amounts of money on servers co-located right next to the exchange's matching engine, shaving off nanoseconds with custom hardware, microwave links instead of fiber optics, and code so obsessively optimized it would make a demoscene guru blush.
These systems operate with such tight tolerances that if a memory access runs into a DRAM refresh cycle, the opportunity is likely missed, potentially costing the firm millions of dollars. Because of that, the HFT world is the most obvious place where technology like this could be deployed, and it's one of the only places where it makes sense. There are other workloads that benefit from eliminating DRAM refresh stalls, to be sure: high-QPS microservices, matching engines, real-time ranking structures, anything using concurrent queues, and potentially even simulators or game servers, particularly those operating with a high level of precision.
The problem with using TailSlayer for most of these workloads will already be apparent to many of you reading this: Laurie's method requires fully duplicating the application's working set for each memory channel you're hedging across. You're trading memory capacity and CPU cores for latency determinism, as this effectively multiplies an application's memory requirements by the number of hedges you're willing to make. For some tasks, like HFT again, the actual memory requirements are quite modest, so accepting a twelve-fold increase in memory usage in exchange for a 15× drop in p99.99 memory latency makes sense. For most workloads, it doesn't.
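LaurieWired's repository is the authoritative reference, but the hedging idea itself can be illustrated with a minimal toy sketch. The Python below (all names are invented for illustration; the real technique pins threads to cores and places each duplicate of the working set on a distinct physical memory channel, which Python cannot express) races the same lookup against several copies of the data and keeps whichever copy answers first:

```python
import concurrent.futures


def read_replica(replicas, index, key):
    # In the real technique each replica lives on a different physical
    # memory channel; here the copies are ordinary Python dicts.
    return replicas[index][key]


def hedged_read(replicas, key):
    """Issue the same read against every replica and return the first
    answer that comes back, discarding the stragglers."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [
            pool.submit(read_replica, replicas, i, key)
            for i in range(len(replicas))
        ]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED
        )
        return next(iter(done)).result()


# Twelve full copies of the working set -- the twelve-fold memory cost the
# article describes -- in exchange for taking whichever copy answers first.
replicas = [{"AAPL": 198.5} for _ in range(12)]
print(hedged_read(replicas, "AAPL"))
```

On real hardware the payoff comes from the fact that a refresh stall on one channel rarely coincides with a refresh stall on another, so the race winner is usually a copy whose channel wasn't mid-refresh.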
Of course, LaurieWired acknowledges this in her 54-minute video about the technique. In general, while she's (understandably!) quite pleased to have come up with TailSlayer, she's also quite frank about its relatively limited utility. Her video goes into significant depth about the research she had to do, including reverse-engineering undocumented memory scrambling behavior and devising a way to make the method work on Amazon's Graviton Arm-based CPUs, since they don't expose the same level of hardware counters that x86-64 CPUs do. It's well worth watching if you're interested in the topic. Alternatively, you can head over to her GitHub repository to check out the demo code for yourself.
Zak Killian, Contributor: Zak is a freelance contributor to Tom's Hardware with decades of PC benchmarking experience who has also written for HotHardware and The Tech Report. A modern-day Renaissance man, he may not be an expert on anything, but he knows just a little about nearly everything.
bit_user This is clever. The article said: You're trading memory capacity and CPU cores for latency determinism, as this effectively multiplies memory requirements for any given application by a factor of the number of hedges you're willing to make. Not just memory capacity, but also bandwidth! I'd argue that's the limited quantity being burned by this technique much more than cores. Modern server CPUs have oodles of cores, but they have just ~12 memory channels. So, if you tie up 12 cores, in a race to see which achieves the lowest latency, it might only be ~10% of the total core count (e.g. on a 128-core CPU), but you could be burning close to 100% of the memory bandwidth! Moreover, it would be self-defeating to try and have those other cores doing very much that would compete for that memory bandwidth, so most of them will need to be kept fairly idle. Basically, you're sacrificing scalability for the sake of reducing latency. In mostly-serial, realtime applications, that might be a worthwhile tradeoff. But, where you can effectively parallelize your workload – even in latency-sensitive applications – it should always be preferable to do so! I think this point is worth clearly spelling out: this technique effectively turns a modern, multi-core, multi-channel CPU into a single-threaded machine! That said, if you don't need the absolute minimum latency , you could regain a little parallelism, but the most distinct threads you can run is no more than half of the memory channels. Quite a severe tradeoff! Reply
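The core-versus-bandwidth arithmetic in the comment above can be made concrete with a quick back-of-the-envelope calculation (the 128-core, 12-channel figures are the hypothetical example from the comment, not measurements of any particular CPU):

```python
# Hypothetical server matching the comment's example figures.
total_cores = 128
memory_channels = 12

# One racing core per channel: a small share of the cores...
hedge_cores = memory_channels
core_share = hedge_cores / total_cores  # 0.09375, i.e. under 10%

# ...but every channel is kept busy serving redundant reads.
bandwidth_share = hedge_cores / memory_channels  # 1.0, i.e. 100%

# Relaxing to hedging across just two channels per thread, the most
# distinct hedged threads you can run is half the channel count.
max_hedged_threads = memory_channels // 2  # 6

print(f"{core_share:.1%} of cores, {bandwidth_share:.0%} of bandwidth, "
      f"up to {max_hedged_threads} hedged threads")
```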
bit_user I hope DDR6 has more granular self-refresh. That would go a long ways towards both minimizing collisions and reducing the amount of time they take, when they do happen. Reply
IntelUser2000 bit_user said: I hope DDR6 has more granular self-refresh. That would go a long ways towards both minimizing collisions and reducing the amount of time they take, when they do happen. That would require a significant architectural change on the DRAM chips themselves, since self-refresh is fundamental to the operation of the memory. If just using another core reduces latency that much, though, essentially solving the problem, I doubt the hardware "fix" would happen. Reply
davidjkay This seems like a flaw that doesn't need to exist… dram refresh is a read/write cycle. So a cache that reads the dram and writes it back and counts as a cache hit… the cache also answers cpu call when cpu accesses same ram would turn these worst case scenarios into best case Reply
davidjkay davidjkay said: This seems like a flaw that doesn't need to exist… dram refresh is a read/write cycle. So a cache that reads the dram and writes it back and counts as a cache hit… the cache also answers cpu call when cpu accesses same ram would turn these worst case scenarios into best case In other words turn a refresh collision into a cache hit with a special line of cache Reply
bit_user IntelUser2000 said: If just using another core reduces latency that much though, essentially solving the problem, I doubt the hardware "fix" would happen. Like I said, this isn't merely tying up another core. It's burning a multiple of the memory bandwidth and you have to basically idle the rest of the cores, so that they don't fill up the transaction queues in the SoC and the memory controller. In other words, it basically reverts your CPU to a single core, like we had more than 20 years ago. Reply
bit_user davidjkay said: dram refresh is a read/write cycle. So a cache that reads the dram and writes it back and counts as a cache hit… the cache also answers cpu call when cpu accesses same ram would turn these worst case scenarios into best case I had a similar thought: why can't a simple read be used to drive a row refresh, if it happens to come in at the right time? Then again, my understanding of DRAM is fairly simplistic. I think probably the issue with that is that refreshes happen at a granularity that's far too high for that to work. I also suspect there could be some parallelism happening to make refreshes faster, that would interfere with single cell access. davidjkay said: In other words turn a refresh collision into a cache hit with a special line of cache First of all, this can only work if refreshes are granular and happen concurrently with accesses to other parts of the chip. Otherwise, the cache would have to be as large as the entire DRAM chip, which is clearly infeasible. If those preconditions were true, then refreshes would be far less intrusive than they currently are, and that much more ignorable. However, if you did also bolt on a cache to hold values that have just been refreshed, then it only solves half of the problem, because you're still stuck waiting to read cells that haven't yet been refreshed, or writes to cells that have already been refreshed. Reply
bit_user I wonder whether GDDR6 or GDDR7 memory does refreshes any differently than DDR5? I think LPDDR and DDR memory is basically designed around optimizing GB/$, whereas GDDR memory might make more concessions in area-efficiency to enable higher performance. Reply
N6543 This is a cool idea, but the article is flawed. The title is wrong, for a start: the technique doesn't reduce worst-case latency, despite greatly reducing p99.99 latency. It's also wrong in my view to say it is "trading memory capacity and CPU cores for latency determinism", since the behaviour remains probabilistic. Reply