
The clever hacker achieved even better results on Intel and Arm hardware. On Intel Xeon processors from the Sapphire Rapids and Diamond Rapids families, she managed gains as high as 93.3%, or in other words, she slashed p99.99 memory latency from 1697ns all the way down to 113ns. Considering the lowest value on the chart is around 105ns, that means the Xeon delivered remarkably deterministic memory latency.
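As a quick sanity check on those numbers, the quoted percentage follows directly from the two latency figures in the chart:

```python
# p99.99 memory latency figures quoted above, in nanoseconds
before_ns = 1697  # worst-case latency without hedging
after_ns = 113    # worst-case latency with TailSlayer

reduction = (before_ns - after_ns) / before_ns
print(f"{reduction:.1%}")  # -> 93.3%
```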
I've said a few times that "certain workloads" benefit from TailSlayer. The most obvious place where determinism in memory latency is absolutely critical is in the slightly absurd world of high-frequency trading (HFT). If you're not familiar, imagine a bunch of hyper-caffeinated algorithms in a cage match where the winner is whoever can buy or sell a stock a few microseconds before everyone else. HFT firms spend obscene amounts of money on servers co-located right next to the exchange's matching engine, shaving off nanoseconds with custom hardware, microwave links instead of fiber optics, and code so obsessively optimized it would make a demoscene guru blush.
These systems operate with such tight tolerances that if a memory access runs into a DRAM refresh cycle, the opportunity is likely missed, potentially costing the firm millions of dollars. Because of that, the HFT world is the most obvious place where technology like this could be deployed, and it's one of the only places where it makes sense. There are other workloads that benefit from eliminating DRAM refresh stalls, sure; high-QPS microservices, matching engines, real-time ranking structures, anything using concurrent queues, and even potentially simulators or game servers, particularly those operating with a high level of precision.
The problem with using TailSlayer for these workloads will already be apparent to many of you reading this: Laurie's method requires fully duplicating the working set of the application for each memory channel you're hedging across. You're trading memory capacity and CPU cores for latency determinism, as this effectively multiplies memory requirements for any given application by a factor of the number of hedges you're willing to make. For some tasks—like, again, HFT—the actual memory requirements are quite modest, and so accepting a twelve-fold increase in memory usage in exchange for a 15× drop in p99.99 memory latency makes sense. For most workloads, it doesn't.
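The hedging idea itself is straightforward to sketch. The following is a conceptual illustration only, not LaurieWired's actual implementation: her technique places the replicas on distinct physical memory channels, so at most one copy can be stalled behind a DRAM refresh at any instant, and a dedicated core races each one. Here, plain Python lists and threads stand in for the per-channel copies and cores; the `hedged_read` helper is a hypothetical name.

```python
import concurrent.futures as cf

def hedged_read(replicas, index):
    """Issue the same read against every replica of the working set
    and return whichever answers first. Conceptually, if each replica
    sits on its own memory channel, a refresh stall on one channel
    can't delay the winner of the race."""
    with cf.ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        # One "core" per replica, all racing to service the same read.
        futures = [pool.submit(lambda r: r[index], rep) for rep in replicas]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

# Two hypothetical replicas of the same working set (2x memory cost):
data = list(range(1000))
replicas = [data[:], data[:]]
print(hedged_read(replicas, 42))  # -> 42
```

Note that this is exactly where the capacity multiplier comes from: every replica in the list is a full copy of the working set, so hedging across N channels costs N times the memory.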
Of course, LaurieWired acknowledges this in her 54-minute video on the technique. In general, while she's (understandably!) quite pleased to have come up with TailSlayer, she's also quite frank about its relatively limited utility. Her video goes into significant depth about the research she had to do, including reverse-engineering undocumented memory scrambling behavior as well as devising a way to make the method work on Amazon's Graviton Arm-based CPUs, since they don't expose the same level of hardware counters that x86-64 CPUs do. It's highly recommended viewing if you're interested in the topic. Alternatively, you can head over to her GitHub repository to check out the demo code for yourself.
Zak Killian, Contributor: Zak is a freelance contributor to Tom's Hardware with decades of PC benchmarking experience who has also written for HotHardware and The Tech Report. A modern-day Renaissance man, he may not be an expert on anything, but he knows just a little about nearly everything.
bit_user: This is clever. The article said: "You're trading memory capacity and CPU cores for latency determinism, as this effectively multiplies memory requirements for any given application by a factor of the number of hedges you're willing to make."

Not just memory capacity, but also bandwidth! I'd argue that's the limited quantity being burned by this technique much more than cores. Modern server CPUs have oodles of cores, but they have just ~12 memory channels. So, if you tie up 12 cores in a race to see which achieves the lowest latency, it might only be ~10% of the total core count (e.g. on a 128-core CPU), but you could be burning close to 100% of the memory bandwidth!

Moreover, it would be self-defeating to try and have those other cores doing very much that would compete for that memory bandwidth, so most of them will need to be kept fairly idle. Basically, you're sacrificing scalability for the sake of reducing latency. In mostly-serial, realtime applications, that might be a worthwhile tradeoff. But, where you can effectively parallelize your workload – even in latency-sensitive applications – it should always be preferable to do so!

I think this point is worth clearly spelling out: this technique effectively turns a modern, multi-core, multi-channel CPU into a single-threaded machine! That said, if you don't need the absolute minimum latency, you could regain a little parallelism, but the most distinct threads you can run is no more than half of the memory channels. Quite a severe tradeoff!
bit_user: I hope DDR6 has more granular self-refresh. That would go a long way toward both minimizing collisions and reducing the amount of time they take when they do happen.
IntelUser2000: bit_user said: "I hope DDR6 has more granular self-refresh. That would go a long way toward both minimizing collisions and reducing the amount of time they take when they do happen."

That would require a significant architectural change on the DRAM chips themselves, since self-refresh is fundamental to the operation of memory. If just using another core reduces latency that much, though, essentially solving the problem, I doubt the hardware "fix" would happen.