Moving to Real World Benchmarks in SSD Reviews

Many of our readers embrace our "real world" approach to hardware reviews. We have not published an SSD review for almost two years while we have been looking to revamp our SSD evaluation program. Today we wanted to give you some insight into how we learned to stop worrying and love the real world SSD benchmark.


Understanding I/O

The Command Queue: Get in Line

On an average PC, there are numerous little I/O operations going on in the background: small reads and writes here and there from processes you don’t even realize exist. Each of these operations represents a command sent to your storage device. AHCI processes commands in a mostly sequential way; there is a single queue that can hold up to 32 commands, and if too much disk access takes place at once, congestion occurs. NVMe introduces multiple queues (up to 65,535 of them, each up to 65,536 commands deep), to the point where the queues are unlikely to ever fill up.
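The difference is easy to sketch in a few lines of Python. This toy model borrows only the queue limits from the two interfaces; the burst size and the number of NVMe queues are arbitrary illustrative choices, not anything from either specification's actual data structures.

```python
# Toy model: AHCI's single 32-command queue vs. NVMe's many deep queues.
# The 100-command burst and 4-queue count are arbitrary for illustration.
from queue import Queue, Full

ahci_queue = Queue(maxsize=32)          # AHCI: one queue, depth 32

submitted, congested = 0, 0
for cmd in range(100):                  # a burst of 100 I/O commands
    try:
        ahci_queue.put_nowait(cmd)
        submitted += 1
    except Full:                        # queue is full: command must wait
        congested += 1

# NVMe allows up to 65,535 queues, each up to 65,536 commands deep;
# here, four per-core queues swallow the same burst without congestion.
nvme_queues = [Queue(maxsize=65536) for _ in range(4)]
for cmd in range(100):
    nvme_queues[cmd % 4].put_nowait(cmd)

print(submitted, congested)             # 32 slots fill; 68 commands wait
```

The point of the sketch is the shape of the problem, not the numbers: one shallow queue congests under a burst that a handful of deep queues barely notices.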

Another limitation of AHCI relates to the creation of queue items from multiple threads or logical processors. In practice, AHCI requires locks on shared objects, which leads to threads waiting for other threads’ locks to clear. Locks are no longer necessary with NVMe, so situations where multiple threads are accessing the same data should see much lower overhead, higher efficiency, and generally higher throughput.
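The two submission models can be sketched in Python (the thread and command counts here are arbitrary): with one shared queue, every thread serializes on the same lock, while per-thread queues give each thread its own structure with nothing to contend on.

```python
# Sketch of the two submission models. AHCI-style: one queue guarded by
# one lock. NVMe-style: a private queue per thread, no lock needed.
# Thread and command counts are made up for illustration.
import threading

NUM_THREADS, CMDS_PER_THREAD = 4, 1000

shared_queue = []                       # AHCI-style shared queue
shared_lock = threading.Lock()          # everyone waits on this

def submit_shared(thread_id):
    for i in range(CMDS_PER_THREAD):
        with shared_lock:               # serialization point
            shared_queue.append((thread_id, i))

private_queues = [[] for _ in range(NUM_THREADS)]   # NVMe-style

def submit_private(thread_id):
    for i in range(CMDS_PER_THREAD):
        private_queues[thread_id].append((thread_id, i))  # no lock

for target in (submit_shared, submit_private):
    threads = [threading.Thread(target=target, args=(t,))
               for t in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

print(len(shared_queue))                # 4000 commands submitted
```

Both models move the same 4,000 commands; the difference is that the shared-queue version forces every submission through a single lock.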

It took years for developers to catch up with the widespread presence of multiple cores. Because developers have spent their careers thinking of I/O locking as something that makes parallel file access difficult, these capabilities will take some time to be widely used in commercial software. Certain applications (mainly on the server side, in the form of things like virtualization or database applications) will be able to use a high queue depth effectively, but for many users, NVMe will be solving a problem they never really had.

You can easily view your queue depth by using the Resource Monitor built into Windows, and looking at the Disk Queue Length column on the Disk tab.
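The number in that column follows Little's law: average queue depth equals I/O rate times average latency. A quick sanity check in Python, using invented but plausible IOPS and latency figures:

```python
# Little's law: average queue depth = IOPS * average latency (seconds).
# Both workloads below use made-up but representative figures.
def avg_queue_depth(iops, latency_s):
    return iops * latency_s

# A light desktop load: 500 IOPS at 0.4 ms barely queues at all.
light = avg_queue_depth(500, 0.0004)      # ~0.2

# A heavy copy: 4,000 IOPS at 2 ms keeps ~8 commands in flight.
heavy = avg_queue_depth(4000, 0.002)      # ~8.0

print(light, heavy)
```

This is why the column usually reads near zero on an idle desktop and only climbs during sustained, overlapping disk activity.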

Article Image

In this example, I’m creating a copy of a folder of 632 CSV files totaling 20.0GB, so an average of about 32MB each. This is part of my real-world workflow; I deal with a lot of time series data on stocks. Windows is clever enough to use some concurrent I/O when copying a large number of files, which creates a higher queue depth. Most of the time, users will not see such high queue depths; for example, while I’m creating a ZIP file of that same folder of 632 CSVs, my queue depth hovers between 0.1 and 0.5. And while Windows knows to use concurrent I/O when copying a folder of smaller files, when I copy a ZIP file (now 3.3GB) of that same folder, it’s not quite able to scale as well:

Article Image
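The kind of concurrency Windows applies when copying many files can be sketched with a thread pool. Everything below (the file names, sizes, and worker count) is invented for the demo; the idea is simply that each in-flight copy is another command sitting in the disk queue.

```python
# Sketch: copying many small files with several workers at once, the way
# Windows overlaps I/O when copying a folder. Paths are throwaway temp
# directories; the worker count of 4 is arbitrary.
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

src = Path(tempfile.mkdtemp())
dst = Path(tempfile.mkdtemp())

# Stand-in for the folder of stock CSVs: eight small files.
for i in range(8):
    (src / f"stocks_{i}.csv").write_text("date,open,close\n" * 100)

# Four workers issue copies concurrently, keeping the disk queue deeper
# than a strictly one-file-at-a-time copy would.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda p: shutil.copy(p, dst / p.name), src.iterdir()))

print(sorted(f.name for f in dst.iterdir()))
```

A single large file, like the ZIP, offers no such file-level parallelism, which is one reason the copy of the archive doesn't scale the same way.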

Random and Sequential Access

The IOPS numbers that manufacturers typically cite pertain to random 4KB reads and writes. 4KB is a good number to start with for a client drive because it’s the default NTFS allocation size, and there’s a fair amount of I/O at that level. The little operations happening in the background are often 4KB; it’s the basic unit applications and the OS use when saving tiny pieces of information.

When you start doing operations with larger files, whether it’s loading large files as part of an application start or copying a large ISO file, disk I/O will happen in larger block sizes, and will typically be sequential. When manufacturers cite throughput numbers in MBps, they’re referring to some sort of large-block sequential transfer.
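A little arithmetic ties the two kinds of spec-sheet numbers together. The IOPS and MB/s figures below are invented examples, not any particular drive's ratings:

```python
# Random IOPS at 4 KB vs. sequential MB/s: the same drive can look very
# different depending on which number you quote. Figures are invented.
BLOCK_4K = 4096
BLOCK_128K = 128 * 1024

def mb_per_s(iops, block_bytes):
    return iops * block_bytes / 1_000_000

# 90,000 random 4 KB IOPS moves only ~369 MB/s of actual data...
random_throughput = mb_per_s(90_000, BLOCK_4K)

# ...while 550 MB/s of sequential 128 KB transfers needs only ~4,200 IOPS.
sequential_iops = 550 * 1_000_000 / BLOCK_128K

print(round(random_throughput, 2), round(sequential_iops))
```

The takeaway: a huge random IOPS number does not imply huge throughput, and big sequential MB/s figures are achieved at trivially low command rates.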

It only makes sense to match your workload to a drive that’s strong in relevant areas. To truly understand what your needs are, HD Tune Pro is an excellent tool. Install it, start the Disk Monitor, and let it listen and log your disk access as you go about your business. I turned it on for a while as I wrote this article, and the chart below shows what an I/O profile looks like while running Chrome with a bunch of tabs open, Word 2010, Visual Studio 2013, Notepad++, and Pandora. I also launched Photoshop CS6 in this window and saved some screenshots.

Article Image

You’ll probably notice right away that the vast majority of I/O in this fairly standard application load is small reads and writes. If I were copying or saving large files, of course, that wouldn’t be the case. You’ll also notice that I have more data written than read, which is at first glance unusual for a client workload. SSD manufacturers typically optimize client SSDs for read-heavy workloads. In my specific case, I can drill into the programs causing the I/O in the chart with HD Tune and see that I have a lot of writes coming in from Chrome (thanks to HTML5’s local storage standards and Chrome’s use of caching), and a lot coming from Photoshop as well, due to its creation of a scratch file on startup.

Compressible Data

A few years ago, when SSDs based on SandForce controllers hit the market, semiconductor industry analysts everywhere were beyond excited. At the time, SSD and NAND endurance were still largely open questions, and SandForce offered a solution: through real-time write compression, the wear on the precious flash chips could be reduced by minimizing the amount of data actually written. As a business unit, SandForce has had a difficult few years; the company was acquired, then changed hands, and hasn’t really done much in the consumer space. What’s more, the market has moved on, and savvy buyers realize that the best-case throughput numbers put up by SandForce drives are mostly fluff: workloads in the wild will have some level of compressibility, but that level certainly won’t allow you to approach the numbers on the box.

The same sort of data that compresses nicely into a ZIP file, such as my CSVs of stock data above (which shrank from 20GB to 3.3GB), is the sort of data that SandForce and other SSD controllers using similar technologies compress well. Anything repetitive will compress well; this also includes the non-random data that many synthetic benchmarks generate.
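You can get a feel for the gap with zlib. The sample row below is invented, and a SandForce controller's compressor is a different, much lighter-weight beast, so treat this purely as a sketch of relative compressibility:

```python
# Repetitive CSV-style data vs. random bytes under zlib: a stand-in for
# what a compressing SSD controller sees. The sample row is made up.
import os
import zlib

csv_like = b"2015-01-02,AAPL,109.33,111.44\n" * 10_000   # repetitive
random_data = os.urandom(len(csv_like))                  # incompressible

csv_ratio = len(zlib.compress(csv_like)) / len(csv_like)
rand_ratio = len(zlib.compress(random_data)) / len(random_data)

# The CSV data shrinks to a tiny fraction of its size; the random data
# doesn't shrink at all (zlib framing can even make it slightly larger).
print(f"csv: {csv_ratio:.1%}  random: {rand_ratio:.1%}")
```

Synthetic benchmarks that write buffers of zeroes or repeating patterns sit at the far left of this spectrum, which is exactly why they flatter compressing controllers.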

You might wonder at this point, "If write compression was developed to solve problems that have been subsequently addressed by better NAND, why are we still talking about it?" There are still real-world benefits in some cases, and a number of manufacturers have done things like combine multiple SandForce controllers in RAID to maximize performance.

Using a 240GB ASUS ROG RAIDR Express PCIe SSD, which is equipped with dual SandForce SF-2281 controllers in RAID 0, I ran some quick tests illustrating how write performance falls apart when incompressible data is used, by benchmarking the time to copy 9.85GB of different types of data:

Article Image

Looking at copy times isn’t the most scientific example (though these are the average times across 3 runs), but we can see a very clear difference here. We can copy 9.85GB of highly-compressible data generated by a synthetic benchmarking tool (zeroes) in less than half the time of a ZIP file of the same size. The CSVs of stock data I used are very compressible examples of real-world data, but even here, performance is drastically reduced as compared to the best-case synthetic data.
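The methodology scales down to a few lines of Python. The file size here is a tiny stand-in for 9.85GB, and on a non-compressing drive, or behind a big filesystem cache, the two times will come out close, which is rather the point of running it yourself:

```python
# Timing a write of synthetic zeroes vs. random data, scaled way down
# from the 9.85 GB copy test. On a compressing controller the zero
# write can finish much sooner; we fsync so the data at least heads
# toward the disk instead of sitting in the page cache.
import os
import tempfile
import time

SIZE = 8 * 1024 * 1024            # 8 MB per pass (illustrative size)

def timed_write(buf):
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        f.write(buf)
        f.flush()
        os.fsync(f.fileno())
        return time.perf_counter() - start

zero_time = timed_write(b"\x00" * SIZE)    # best-case synthetic data
rand_time = timed_write(os.urandom(SIZE))  # worst-case incompressible
print(f"zeroes: {zero_time:.3f}s  random: {rand_time:.3f}s")
```

Averaging several runs, as in the chart above, matters: single-pass timings of small writes are noisy enough to mislead.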

TRIM: Taking out the Trash

All modern SSDs and operating systems support TRIM, which is essential to keeping SSDs humming along in regular use. You may also see the same functionality referred to as SCSI UNMAP if you’re running certain RAID controllers or a drive that has one built in. TRIM lets the operating system tell the drive which blocks of formerly-occupied flash no longer hold valid data, so the controller’s garbage collection can clean them up and restore them to their original state. Before TRIM was commonplace, SSD performance would permanently degrade with heavy use; something to keep in mind if using RAID, as TRIM is still not supported across all RAID controllers.
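A toy model shows why the drive needs to be told. A real controller juggles pages, erase blocks, and data relocation; this sketch collapses all of that into a single "must erase before reuse" cost:

```python
# Toy SSD: blocks are free, live, or stale. Without TRIM, deleted
# blocks stay "stale" from the drive's point of view, so new writes to
# a full drive pay an erase-before-reuse penalty. Real controllers work
# in pages and erase blocks; this is deliberately simplified.
class ToySSD:
    def __init__(self, blocks=100):
        self.state = ["free"] * blocks

    def write(self, n):
        """Write n blocks; return how many needed an erase first."""
        erases = 0
        for _ in range(n):
            if "free" in self.state:
                self.state[self.state.index("free")] = "live"
            else:                        # no clean blocks left
                erases += 1
                self.state[self.state.index("stale")] = "live"
        return erases

    def delete(self, n, trimmed):
        """OS deletes n blocks; TRIM tells the drive they're free."""
        for _ in range(n):
            i = self.state.index("live")
            self.state[i] = "free" if trimmed else "stale"

extra_erases = {}
for trimmed in (False, True):
    ssd = ToySSD()
    ssd.write(100)                       # fill the drive
    ssd.delete(50, trimmed)              # OS deletes half the files
    extra_erases[trimmed] = ssd.write(50)    # reuse the freed space

print(extra_erases)                      # {False: 50, True: 0}
```

Without TRIM, every one of those 50 rewrites pays the cleanup cost at the worst possible moment: while you're waiting on the write.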

Most SSD manufacturers now provide utilities to perform a "secure erase" of a drive, which will delete all of its contents as well as performing drive-wide garbage collection. Others, such as Intel, have optimization tools that will just force garbage collection without destroying data. Mostly, however, TRIM garbage collection is supposed to take place in the background.

I don’t want to keep beating up on SandForce, but it’s well-known that the TRIM process takes some time to restore performance on SandForce-powered drives. It’s difficult to quantify performance consistency in the real world; that point where you’re doing something fairly disk-intensive, and then have to wonder why it’s suddenly taking so long. Anecdotally, I find myself hitting this point with SandForce drives when doing things that shouldn’t be all that taxing. Steady-state testing, while not a perfect proxy for this scenario, demonstrates that performance of these drives often degrades if not given time to "rest" and allow TRIM to run.