Benchmarking Workload Storage Performance – An Expert Q&A

Nearly 1,000 people have watched our most recent SNIA ESF webcast, Storage Performance Benchmarking: Workloads. We hope you didn’t miss this 5th and final installment of the now famous Storage Performance Benchmarking webcast series where our experts, Mark Rogov and Chris Coniff, explained how to measure and optimize storage performance of workloads. If you haven’t seen it, it’s available on-demand. The live audience had many great questions. Here are answers to them all.

Q. Is it good to assume that sequential IO would benefit from large IO size and, conversely, random IO from small IO size?

A. I don’t think “benefit” is a practical way to look at this. Workloads come in all sizes and mixes, and it is the job of the storage array to handle whatever is thrown at it. Storage admins (and placement algorithms) need to configure the system to produce the best performance given the current load. Historically, random workloads are harder to optimize than sequential ones. Mixing block sizes, even with sequential workloads, is also tough to deal with. It all comes down to figuring out where the bottlenecks are and how to overcome them.

Q. But pre-fetch, cache, etc. will show benefit to sequential IO on all Flash array versus random?

A. Technically, we need to look at the effects of cache and pre-fetch on Reads and Writes separately. For Reads, pre-fetching data into cache does show a lot of benefit, especially for sequential IO. For Writes, pre-fetching is not effective, but cache is: it is, generally, faster to save data to cache than straight to disk (assuming, of course, there is free space in cache to write to).
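
As a toy illustration of the write-side point (purely a sketch in Python, not any vendor's cache design), the hypothetical class below acknowledges writes from cache while free space remains and only falls through to disk when the cache is full:

    class WriteBackCache:
        """Toy model: writes land in cache when there is free space (fast
        acknowledgement) and fall through to disk when there is not."""

        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.dirty = []  # blocks waiting to be destaged to disk

        def write(self, block):
            if len(self.dirty) < self.capacity:   # free space: ack from cache
                self.dirty.append(block)
                return "acknowledged from cache"
            return "forced to disk"               # cache full: pay the disk latency

        def flush(self):
            flushed, self.dirty = self.dirty, []  # background destage to disk
            return flushed

    cache = WriteBackCache(capacity_blocks=2)
    print(cache.write("A"), cache.write("B"), cache.write("C"))
    # -> acknowledged from cache acknowledged from cache forced to disk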

Q. Don’t you see [concatenation of small IOs into larger IO] when apps are inside a VM versus physical nodes? I ask because [block size] changes with versions of hypervisors.

A. This is a great question, and it touches on a big misconception floating around out there. Hypervisors have three primary methods for accessing storage: direct block, NAS, or an internal filesystem.

The direct block method, aka raw device mapping, is simple: all IOs are sent to the storage array as they are. No concatenation, folding, compression, etc.

The internal filesystem, VMFS, has a concept of a block size. These blocks are used for the internal management of the filesystem (see our File webcast for an explanation of how those work). A common misconception is that when the write IO size is smaller than the block size, the filesystem issues a write equal to its block size. In reality, the writes are simply passed to the underlying driver, plus some additional metadata IO. The amount of data being written doesn’t change just because it is written into a larger block container. In cases where the write IO is larger than the block size of the FS, yes, the storage will see a multiple of IOs, depending on the ratio (see the sketch below). In VMFS, though, block sizes are usually quite large: 1MiB, 4MiB, 8MiB. Very few (if any!) workloads have IOs that big. VMFS also has a concept of “sub blocks”, which are smaller than 1MiB but still quite large: 64KiB, with the same logic applied.
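
To make the ratio concrete, here is a minimal sketch (in Python) of the pass-through logic described above; the function name and sizes are illustrative only and are not VMFS internals:

    import math

    def ios_seen_by_array(write_kib, fs_block_kib):
        """Illustrative only: a write no larger than the filesystem block is
        passed to the underlying driver as-is (plus some metadata IO); a
        larger write is split into a multiple of IOs."""
        if write_kib <= fs_block_kib:
            return 1
        return math.ceil(write_kib / fs_block_kib)

    print(ios_seen_by_array(4, 1024))  # 4 KiB write into a 1 MiB block   -> 1 IO
    print(ios_seen_by_array(256, 64))  # 256 KiB write into 64 KiB blocks -> 4 IOs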

NAS communication is the most complex. For the purposes of this question, consider the Ethernet Maximum Transmission Unit (MTU). It regulates the size of the frame, which by default is 1500 bytes. Therefore, all data must either be smaller than 1500 bytes or be cut into 1500-byte chunks to be sent across the wire. For example, a 4KiB IO will be split into 3 frames: 1500, 1500, and 1096 bytes. Sometimes MTU is set to a “jumbo” size of 9000. Then each 4KiB IO will fit into one frame. With NFSv4, the protocol allows combining several NFS calls into one frame. Theoretically, that means two NFS write calls for 4KiB could fit into a single 9KB jumbo Ethernet frame. In reality, one needs to examine closely which specific NFS calls are truly being used; see our File webcast for details of how block commands get translated to and from filesystem calls, and extrapolate that into NFS calls (FS and NFS are not the same!).
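
To illustrate the framing arithmetic above, here is a small sketch (payload sizes only; Ethernet, IP, and TCP header overhead is deliberately ignored, so real frame counts can differ):

    def split_into_frames(io_bytes, mtu=1500):
        """Cut an IO into MTU-sized chunks, ignoring protocol header overhead."""
        chunks = []
        while io_bytes > 0:
            chunks.append(min(io_bytes, mtu))
            io_bytes -= chunks[-1]
        return chunks

    print(split_into_frames(4096))        # default MTU  -> [1500, 1500, 1096]
    print(split_into_frames(4096, 9000))  # jumbo frames -> [4096]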

Bottom line: regardless of the datastore access method, in most cases your workload IO will be passed to the storage array as-is, without coalescing.

Q. There are several other factors that have to be considered: 1) More than random/sequential it’s I/O adjacency that matters most. Think how differently a hybrid storage system would handle random I/O to 5% of the volume vs. even sequential to the whole volume. 2) Does the data dedupe/compress using the array’s algorithm?

A. I agree somewhat with this comment. Adjacency of the IO is a good way to think about things. Intelligent placement algorithms do have a concept of a “Working Area,” and may elect to promote whole regions to the faster storage tier to speed up all the “adjacent” random requests. Data reduction (compression, deduplication, single-instancing, zero-padding, etc.) introduces an overhead in some arrays and therefore muddies the picture somewhat, yet the underlying principles remain the same. Keep in mind that this is a vendor-neutral presentation; the differences in handling data reduction are very commonly and heavily marketed between different solutions.

Q. I think Mark is at a different company now 🙂 2, 4, 8 is NOT sequential; it is strided, or geometric. Holistically, if proc A reads 2, 4, 6 and proc B reads 1, 3, 5, 7, then the strides are within a proc, but holistically this would be sequential. Though few operating systems do this level of pre-fetch logic.

A. Good catch, and I agree. In a later revision of the slides, we changed the 2, 4, 6 sequence into a “predictable” pattern, not a sequential one.

Q. If you have an all-Flash array, what kind of performance hit do you take between sequential and random reads? I would think that it isn’t as impactful as with a mechanical drive.

A. That assumption can be true under one condition and not another (e.g., the LUN is in a write-pending state, cache flushing is taking place, etc.).

Generally speaking, reads always perform better than writes on flash drives. As for random vs. sequential IO, they are measured with two different metrics (IOPS vs. throughput). In order to answer that part of the question, their measurements must be normalized to a common KPI so that they can be compared. And to do that, we’d have to know what IO block size the question assumes, as it must be known to solve for ‘IO size x IOPS = Throughput in MB/s’ (a worked example follows the spec sheet below).

This is from the spec sheet for one vendor’s SSD:

  • Sequential Read (up to): 450 MB/s
  • Sequential Write (up to): 380 MB/s
  • Random Read (100% Span): 67500 IOPS
  • Random Write (100% Span): 17500 IOPS
  • Latency – Read: 40 µs
  • Latency – Write: 42 µs
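
For illustration, here is how the ‘IO size x IOPS = Throughput’ normalization could be applied to the numbers above. The 4 KiB IO size is an assumption (the spec sheet does not state one), so the resulting figures are only indicative:

    def iops_to_mb_per_s(iops, io_kib):
        """IO size x IOPS = Throughput, in decimal MB/s as spec sheets report it."""
        return iops * io_kib * 1024 / 1_000_000

    # Assumed 4 KiB IO size:
    print(iops_to_mb_per_s(67_500, 4))  # random read  ~276 MB/s vs. 450 MB/s sequential
    print(iops_to_mb_per_s(17_500, 4))  # random write  ~72 MB/s vs. 380 MB/s sequential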

So, the “It Depends” answer is based on the manufacturer & model of drive at a specific code level, and (if it’s in an array) how the specific array vendor implemented it in their design. If it’s an Integrated Cache Disk Array (ICDA), then pre-fetching (read-ahead) and caching algorithms behave differently at various code levels for each vendor. There are also specific user-defined configuration parameters that can negate the answer to the question above, for instance High & Low water marks, Dynamic Cache Partitioning, Workload QoS, etc.

In the case of an ICDA, what’s more important than read vs. write or random vs. sequential is whether or not the IO was a cache hit. A cache hit for a single random read IO in an ICDA whose LUN is on a 7.2K SATA drive will perform better than a random read miss on that same array if the LUN were on a flash drive.

So, as we’ve seen throughout this series, there is more to the overall performance benchmark than any one variable.

Q. What observations do you have on rebuild time for Flash disk? By what magnitude is it faster than spinning disks, considering a high-end hybrid or AFA storage system?

A. This question is dangerous, as it crosses into vendor specifics. Rebuild time depends on more than just the type of drive: the RAID type and configuration, drive utilization, array busyness, and many more factors come into play here. Generally, an SSD will rebuild faster than a similarly sized spinning drive.

Q. What about the fact that you get cache effects in the OS stack and also in the Flash (DRAM landing areas) that actually improve write latency versus reads? Isn’t it worth mentioning that writes can actually appear more performant on Flash? Or am I missing something? (Probably the latter) 🙂

A. You are 100% correct! 🙂 But do consider the size of DRAM… how much data can it hold? Does the entire “working area” fit there? If it does, voilà! If it does not – welcome to the rest of the world!

Q. Can you send links to all the webcasts in this series?

A. Of course, please see below and happy viewing!

  1. Storage Performance Benchmarking: Introduction and Fundamentals
  2. Storage Performance Benchmarking: Part 2 – Solution under Test
  3. Storage Performance Benchmarking: Block Components
  4. Storage Performance Benchmarking: File Components
  5. Storage Performance Benchmarking: Workloads
