Life of a Storage Packet (Walk): Q&A

We got a lot of great questions at our recent Ethernet Storage Forum webcast “The Life of a Storage Packet (Walk).” As promised, we’ve compiled all the questions with fairly detailed answers. We hope this blog helps to clear up any confusion or uncertainties. If you think of additional questions, please comment below and we’ll get back to you as soon as we can. Thanks to everyone who watched the live webcast. If you missed it, it’s now available on-demand.

Q. Does size of a block depend on OS?

A. Yes, some OS’ only support 512 byte blocks. Some OS’ provide a method to support both 512 byte blocks and 4096 (aka 4K) byte blocks; for example, via a qualifier in their format command. Some devices are built using 4K block sizes, but then emulate 512 byte blocks to the host (aka 512E devices). Many modern versions of OS’ automatically detect the block size of each device they discover and do the right thing based on what they discover. You have to check the documentation for your OS to know its capabilities.

Q. CIFS is not a file system.

A. Not in the same way that ext, ntfs, FAT, HFS, UFS, etc. are, no. However, in terms of certain functionalities that we discussed on the presentation – that is, the ability to manipulate files with operations such as read, write, create, delete, and rename – are all file system functionalities. The difference being, of course, that the files are not on the local computer and are actually on a remote computer.

For what it’s worth, the term ‘CIFS’ has been deprecated from usage, and SMB is the preferred term for precisely the reasons that it should not be confused with local OS systems like the ones mentioned above.

Q. Is it safe to say that a “block” at the file system level is equal to an IO?

A. Not specifically. The difference in the use of those terms, is that a “block” is a place that data is stored – it has an address and it contains data (512 byte of data or 4K bytes of data). An IO is an operation that requests access to a block. An IO may perform a read operation on a block, or it may perform a write operation on a block.

Remember, the term IOPS is I/O operations per second – so it is not about blocks or bytes or bits – it is operations. If the operations are performed on a 512 byte block, they produce a different number of MB/sec than if they operate on a 4K byte block.

So, let’s take an example. A 1MB/second bandwidth on a 4K block size device is the same speed as 1MB/second bandwidth on a 512 byte block size device (observe, however, that the 4K block size device will have only 1/8 the IOPS that the 512 byte device has – because it will take only 1/8 the operations to transfer the same number of Mega-bytes). However, 1M IOPS on a 4K block size device is much better than 1M IOPS on a 512 byte block size device (because the 4K block size device is moving 8X the amount of data than the 512 byte block size device in each operation.

There is an excellent explanation and walk-through of IOPS in our Storage Performance Benchmarking webinar.

Q. Don’t all those Inodes also live on the disk and so don’t the IOs to read those blocks also have to go to the SCSI controller?

A. Well-spotted!

The communication back and forth between the file system and the Inodes traverses the controller for all access to blocks on disk. This is one of the reasons why needing to access disk is considered “expensive.” When you add in a network to the mix, these kinds of situations need particular careful consideration.

Q. Shall we allocate blocks and inodes or is it an automatic process?

A. It’s an automated process, and there is no user intervention at all.

Q. Are Inodes created during OS installation?

A. Inodes are a particular block type, among some other block types (e.g., data blocks, boot blocks, superblocks, group descriptor blocks). These block types are combined into functional groups. These groups are OS-dependent. The block layout therefore, including Inodes, is created during OS installation. If the filesystem needs more Inodes after the OS installation, the OS dynamically adds them to the Inode pool.

Q. Which physical hardware does volume manager reside?

A. Volume managers are not hardware. They are software layers that create pseudo devices that are presented to layers of the OS above them (typically to the file system layer). The volume manager software fits into the OS to accept requests from the file system and pass those requests down to the device driver. In some systems they might be called “filter drivers”.

Q. In flash media, is there also an iSCSI controller that converts PCIe into iSCSI to interact with the flash?

A. I want to make sure that the answer to the question is clear.

On the host, we need to convert PCIe commands to SCSI, so we send them to an iSCSI controller/adapter to be sent across the Ethernet wire. . That adapter can be either software or hardware.

Flash drives are basic media, just like spinning disk drives. Flash will have its own controller, which can be SCSI. If you wish to access the drive over Ethernet using the iSCSI protocol, you will have an adapter on the flash drive (which can be either software or hardware) that will do the SCSI translation between the Flash media and the SCSI commands. This is often called the FTL – or the flash translation layer. Again, the SCSI commands are translated into an Ethernet-friendly packet to be sent along the wire.

There are other types of communication forms for working with Flash, too. The most recent is NVMe. You can see the SNIA webinar on The Performance Impact of NVMe and NVMe over Fabrics for more information.

Q. Why is there a SCSI language in between Storage and the hosts?

A. Before storage standards (like SCSI and ATA), you would purchase storage from Vendor X, and you would also buy a storage controller for that vendor X storage.   If you wanted Vendor Y storage, you could not use the vendor X controller, you had to purchase a new controller from vendor Y. Every vendor had their own language, and you had to purchase matching components.   Once you got locked into one vendor, you were stuck – at the hardware level.

Today with standards (such as SCSI), you can buy a SCSI device from any vendor and connect it to any storage controller that you buy from any vendor – and it just works.   That is the point of the SCSI standard. When the storage standardization efforts began, there were many competing ideas. It just so happened that SCSI won out, and so now it’s everywhere.

Q. If the storage side is flash, do we still need a SCSI controller between host and flash storage or is the controller different for a flash storage?

Yes… and no.

Yes, to use the SCSI part of the OS, the flash device must continue to speak the SCSI protocol and so a SCSI controller is needed. This method of communicating with flash devices enables all the existing S/W on the host OS to just continue working without even knowing it is flash.

No, flash memory chips can be connected to the system using non-SCSI methods. Most of those methods are generally special purpose applications and so the existing OS S/W simply cannot use that device (only the special purpose S/W designed for that device can use it). However, NVMe is a new protocol that is enabling more general use of the flash memory technology with the hopes to provide new capabilities that are beyond what SCSI provides. This is also discussed in our webcast on The Performance Impact of NVMe and NVMe over Fabrics.

Q. What’s the difference between partition, logical disk, volume, LUN, etc?

Excellent question!

Let’s work this one backwards – LUN – that is a SCSI term (actually an acronym) that refers to the Logical Unit Number – it is part of the address used to access a logical unit. Small SCSI devices (such as a single spindle disk drive) have only a single logical unit. Large storage arrays may contain 100s of logical units. Each of those logical units appears to the host OS as if it were a single spindle disk. So, the logical unit is the SCSI object that contains the blocks where the data is stored (where that object may be an individual piece of hardware, or a logical entity within a larger SCSI device). To access a SCSI logical unit, the OS must specify the address of the SCSI device, and then the LUN (logical unit number) for the logical unit within that SCSI device.

The other terms (partition, logical disk, and volume) are OS terms that have to do with virtualization and how the storage blocks are managed. When a SCSI (or other storage device) is formatted by the OS, it may be broken into multiple partitions. Each partition is then treated by the OS as if it were a unique device (a virtual device, or a logical disk). Each of those partitions may then be used independently.

For example, on a Unix system, the “a” partition may contain a file system that has all the files that are necessary to boot. The “b” partition may be setup without a file system and used as the swap or paging storage for use by the virtual memory subsystem. The “d” partition may then contain a file system that contains all the user’s data files. Each partition is unique storage space and may even use a different file system to organize the data located there.

Notice, that I skipped the “c” partition. That partition is often setup to access all the blocks of the physical device. So, on a 500GB disk, maybe “a” contains 10GB, “b” contains 90GB, and “d” contains 400GB; while “c” contains all 500GB. Partitions “a” and “d” may be backed up or restored independently, and partition “c” may be used to perform an image copy of the entire device.

Now, to the other terms:

Logical disk and volume are terms often related to volume managers.

  • Logical disk may be a term used to refer to a partition, but usually that is not the case. Logical disks are typically the devices created by the volume management layer when they combine individual devices into a single larger device (a.k.a. a logical disk).
  • Volume managers also may be used to divide up a large device into small chunks (just like partitions), and those smaller chunks are referred to as logical disks.
  • Volume is a more vague term that typically is used as another term for a logical disk. In some circles, the term Volume is used to refer to a RAID set in a SCSI storage controller (but this is a much less often used definition for Volume).

Q. Will the “complete” presentation somewhere we can go review?

Yes. Click here to access the on-demand webcast as well as a PDF of the webcast slides

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

New Webcast: The Life of a Storage Packet (Walk)

Wonder how storage really works? When we talk about “Storage” in the context of data centers, it can mean different things to different people. Someone who is developing applications will have a very different perspective than someone who is responsible for managing that data on some form of media. Moreover, someone who is responsible for transporting data from one place to another has their own view that is related to, and yet different from, the previous two.

Add in virtualization and layers of abstraction, from file systems to storage protocols, and things can get very confusing very quickly. Pretty soon people don’t even know the right questions to ask! That’s why we’re hosting our next SNIA Ethernet Storage Webcast, “Life of a Storage Packet (Walk).”

Join us on November 19th to learn how applications and workloads get information. Find out what happens when you need more of it, or faster access to it, or move it far away. This Webcast will take a step back and look at “storage” with a “big picture” perspective, looking at the whole piece and attempting to fill in some of the blanks for you. We’ll be talking about:

  • Applications and RAM
  • Servers and Disks
  • Networks and Storage Types
  • Storage and Distances
  • Tools of the Trade/Offs

The goal of the Webcast is not to make specific recommendations, but equip you with information that will help you ask the relevant questions, as well as get a keener insight to the consequences of storage choices.  As always, this event is live, so please bring your questions, we’ll answer as many as we can on the spot. I encourage you to register today. Hope to see you on November 19th!

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

New White Paper: An Updated Overview of NFSv4

Maybe you’ve asked yourself recently; “Hmm, I wonder what’s new in NFSv4?” Maybe (and more likely) you haven’t; but you should.

During the last few years, NFSv4 has become the version of choice for many users, and there are lots of great reasons for making the transition from NFSv3 to NFSv4. Not the least of which is that it’s a relatively straightforward transition.

But there’s more; NFSv4 offers features unavailable in NFSv3. Parallelization, better security, WAN awareness and many other features make it suitable as a file protocol for the next generation of applications. As a proof point, lately we’ve seen new clients of NFSv4 servers beyond the standard Linux client, including support in VMware’s vSphere for virtual machine datastores accessible via NFSv4.

In this updated white paper, An Updated Overview of NFSv4, we explain how NFSv4 is better suited to a wide range of datacenter and high performance compute (HPC) uses than its predecessor NFSv3, as well as providing resources for migrating from v3 to v4.

You’ll learn:

  • How NFSv4 overcomes statelessness issues associated with NFSv3
  • Advantages and features of NFSv4.1 & NFSv4.2
  • What parallel NFS (pNFS) and layouts do
  • How NFSv4 supports performant WAN access

We believe this document makes the argument that users should, at the very least, be evaluating and deploying NFSv4 for use in new projects; and ideally, should be using it wholesale in their existing environments. The information in this white paper  is meant to be comprehensive and educational and we hope you find it helpful.

If you have questions or comments after reading this white paper, please comment on this blog and we’ll get back to you as soon as possible.

OpenStack Manila – A Q&A on Liberty and Mitaka

Our recent Webcast with OpenStack Manila OpenStack Manila Project Team Lead (PTL), Ben Swartzlander, generated a lot of great questions. As promised, we’ve complied answers for all of the questions that came in. If you think of additional questions, please feel free to comment on this blog. And if you missed the live Webcast, it’s now available on-demand.

Q. Is Hitachi Data Systems contributing to the Manila project?

A. Yes, Hitachi contributed a new driver and also contributed a major new feature (migration) during Liberty. HDS was also active during the Kilo release with a different driver which is unfortunately no longer maintained.

Q. EMC has open sourced ViPER as CopperHD. Do you see any overlap between Manila/Cinder from one side and CopperHD from the other hand?

A. I’m not familiar enough with CoprHD to answer authoritatively, but I understand that there is definitely some overlap between it and Cinder, and I also expect there is some overlap with Manila. Assuming there is some overlap, I think that’s a great thing because competition within open source drives greater quality, and it’s confirmation that there is real demand for what we’re building.

Q. Could Manila be used stand-alone (without OpenStack) to create a fileshare server?

A. Yes, the only OpenStack service Manila depends on is Keystone (for authentication). Running Manila in a stand-alone fashion is a specific use case the team supports.

Q. If we are mapping the snap shot images what is the guarantee for data integrity?

A. Snapshots are typically crash-consistent copies of the filesystem at a point in time. In reality the exact guarantee depends on the backend used, and that’s something we’d like to avoid, so that the snapshot semantics are clear to the user. In the future, backends which cannot meet the crash-consistent guarantee will probably be forced to advertise a different capability so end users are aware of what they’re getting.

Q. Is there manila automation with ansible?

A. As far as I know this hasn’t been done yet.

Q. For kilo deployed in production does it work for all commercial drivers or is there a chart that says which commercial drivers support kilo?

A. The developer doc now has a table which attempts to answer this question. However, the most reliable way to see which drivers are part of the stable/kilo release would be to look at the driver directory of the code. This is an area where the docs need to improve.

Q. Could you explain consistency groups?

A. Consistency groups are a mechanism to ensure that 2 or more shares can be snapshotted in a single operation. Without CGs, you can take 2 snapshots of 2 shares but there is no guarantee that those snapshots will represent the same point in time. CGs allow you to guarantee that the snapshots are synchronized, which makes it possible to use multiple shares together for a single application and to take snapshots of that application’s data in a consistent way.

Q. How is the consistency group in Manila different from Cinder? Is it similar?

A. The designs are very similar. There are some semantic differences in terms of how you modify the membership of the CGs, but the snapshot functionality is identical.

Q. Are you considering pNFS, but I guess this will be hard since it has req. on the client as well?

A. Manila is agnostic to the data protocol so if the backend supports pNFS and Manila is asked to create an NFS share, it may very well get a share with pNFS support. Certainly Manila supports shares with multiple export locations so that on a system with multiple network interfaces, or a clustered system, Manila will tell the clients about all of the paths to the share. In the future we may want Manila to actually know the capabilities of the backends w.r.t. what version of NFS they support so that if a user requires a minimum version we can guarantee that they get that version or get a sensible error if it’s not possible.

Q. Share Replication. In what mode, Async and/or Sync?

A. We plan to support both, and the choice of which is used will be up to the administrator. Communication about which is used and any relevant information like RPO time would be out of band from Manila. The goal of the feature in Manila is to make Manila able to configure the replication relationship, and able to initiate failovers. The intention is for planned failovers to be disruptive but with no data loss, and for unplanned failovers to be disruptive, with data loss corresponding to the RPO that the administrator configured (which would be zero for synchronous replication).

Q. Can you point me to any resources with SNIA available for OpenStack? Where can I download document, videos, etc?

A. You can find several informative OpenStack on-demand Webcasts on the SNIA BrightTalk channel here.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Storage Performance Benchmarking – The Sequel

We at the Ethernet Storage Forum heard you loud and clear. You need more info on storage performance benchmarking. Our first Webcast, “Storage Performance Benchmarking: Introduction and Fundamentals” was tremendously popular – reaching over 2x the average audience, while hundreds more have read our Q&A blog on the same topic. So, back by popular demand Mark Rogov, Advisory Systems Engineer at EMC, Ken Cantrell, Performance Engineering Manager at NetApp, and I will move past the basics to the second Webcast in this series Storage Performance Benchmarking Part 2. With a focus on System Under Test (SUT), we’ll cover:

  • Commonalities and differences between basic Block and File terminology
  • Basic file components and the meaning of data workloads
  • Main characteristics of various workloads and their respective dependencies, assumptions and environmental components
  • The complexity of the technology benchmark interpretations
  • The importance to System Under Test:
    • What are the elements of a SUT?
    • Why are caches so important to understanding performance of a SUT?
    • Bottlenecks and threads and why they matter

SUT

I hope you’ll join us on October 21st at 9:00 a.m. PT. to learn why file performance benchmarking truly is an art. My colleagues and I plan to deliver another informative and interactive hour. Please register today and bring your questions. I hope to see you there.

Update: This webcast is part of a series on storage performance benchmarking. Check out the others:

Congestion Control in New Storage Architectures Q&A

We had a great response to last week’s Webcast “Controlling Congestion in New Storage Architectures” where we introduced CONGA, a new congestion control mechanism that is the result of research at Stanford University. We had many good questions at the live event and have complied answers for all of them in this blog. If you think of additional questions, please feel free to comment here and we’ll get back to you as soon as possible.

Q. Isn’t the leaf/spine network just a Clos network?   Since the network has loops, isn’t there a deadlock hazard if pause frames are sent within the network?

A. CLOS/Spine-Leaf networks are based on routing, which has its own loop prevention (TTLs/RPF checks).

Q. Why isn’t the congestion metric subject to the same delays as the rest of the data traffic?    

A. It is, but since this is done in the data plane with 40/100g within a data center fabric it can be done in near real time and without the delay of sending it to a centralized control plane.

Q. Are packets dropped in certain cases?

A.  Yes, there can be certain reasons why a packet might be dropped.

Q. Why is there no TCP reset? Is it because the Ethernet layer does the flowlet retransmission before TCP has to do a resend?

A. There are many reasons for a TCP reset, CONGA does not prevent them, but it can help with how the application responds to a loss.   If there is a loss of the flowlet it is less detrimental to how the application performs because it will resend what it has lost versus the potential for  full TCP connection to be reset.

Q. Is CONGA on an RFC standard track?

A. CONGA is based on research done at Stanford. It is not currently an RFC.

The research information can be found here.

Q. How does ECN fit into CONGA?

A. ECN can be used in conjunction with CONGA, as long as the host/networking hardware supports it.

 

 

Storage Performance Benchmarking Q&A

Our recent SNIA-ESF webcast, “Storage Performance Benchmarking: Introduction and Fundamentals” really struck a chord! It was our most highly rated and well attended webcast to date with more than 300 people at the live event. If you missed it, it’s now available on-demand. Thanks again to my colleagues, Ken Cantrell and Mark Rogov, who did an outstanding job explaining the basics and setting the stage for our next webcast in this series, “Storage Performance Benchmarking: Part 2” on October 21st. Mark your calendar! Our audience had many great questions and there just wasn’t enough time to answer them all, so as promised, here are answers to all the questions we received.

Q. Can you explain the difference between MiB and MB?

A. The difference lies between the way that bytes are calculated. A megabyte (MB) is defined in decimal (base-10), and a mebibyte (MiB) in binary (base-2). There is a similar relationship between kilobytes (base-10) and kibibytes (KiB). So, if you begin with a single byte:

1 kilobyte (KB) = 1000 bytes (B)

1 kibibyte (KiB) = 1024 bytes (B)

1 megabyte (MB) = 1000 KB = 1000 * 1000 bytes = 1,000,000 bytes

1 mebibyte (MiB) = 1024 KiB = 1024 * 1024 bytes = 1,048,576 bytes

The distinction is important because very few applications or vendors make use of the MiB notation and instead use the MB notation to refer to binary numbers. If vendor or tool A is using MB to refer to base-10 measurements, and vendor or tool B is using MB to refer to base-2 measurements, it may falsely appear that the two vendors or tools are performing differently (or the same), when in fact the difference (or similarity) is simply because they are labeling two different things (MB and MiB) as the same (MB).

Telecommunications and networking generally measure and report in MB.  Disk drive manufacturers tend label storage in terms of MB. Many operating systems report storage labeled as MB, but calculated as MiB.  Storage performance (whether from an application’s view, or the storage vendor’s own tools) is generally measured in binary (MiB/s), but reported in decimal (MB/s).

Q. I disagree with the conflation of the terms “IOPS” and “throughput.” Throughput generally refers to overall concept of aggregate performance. IOPS measure throughput at a given request size, which most clients assume to be small blocks (~512B-4K). Many clients would say that bandwidth, measured in MB/s or GB/s, is a measure of throughput @ large block size (>~128K). Performance typically varies significantly @ different block sizes. So clients need to understand that all 3 of these concepts differ.

A. The SNIA dictionary equates throughput to IOPS, which is why we listed it as a common alternative for IOPS.   However, the problem with “throughput” is in the industry’s lack of agreement on what this term means. See the examples below, where sometimes it clearly refers to IOPS and other times to MB/s.   This is why, in the end, we recommend you simply don’t use the term at all, and use either IOPS or MB/s.

Throughput = MB/s

Techterms.com:  “Throughput refers to how much data can be transferred from one location to another in a given amount of time. It is used to measure the performance of hard drives and RAM, as well as Internet and network connections.”

Meriam-Webster:  “The amount of material, data, etc., that enters and goes through something (such as a machine or system)”

Throughput = IOPS

SNIA Dictionary: “Throughput:   [Computer System] the number of I/O requests satisfied per unit time.     Expressed in I/O requests/second, where a request is an application request to a storage subsystem to perform a read or write operation.”

Wikipedia:   “When used in the context of communication networks, such as Ethernet or packet radio, throughput or network throughput is the rate of successful message delivery over a communication channel. The data these messages belong to may be delivered over a physical or logical link, or it can pass through a certain network node. Throughput is usually measured in bits per second (bit/s or bps), and sometimes in data packets per second (p/s or pps) or data packets per time slot.”

TechTarget:   “Throughput is a term used in information technology that indicates how many units of information can be processed in a set amount of time.”

Throughput = Either

Dictionary.com:   “1. The rate at which a processor can work expressed in instructions per second or jobs per hour or some other unit of performance. 2.  Data transfer rate.”

Q. So, did I hear it right? IOPS and Throughput are the same and thus can be used interchangeably?

A. Please see definitions above.

Q. The difference between FE/BE is the raid level + Write percentage = overhead IOPS. It is different at times due to RAID or any virtualization at the storage level, like if you have RAID 5, every front-end write would have 4 backend write. What happens if the throughput at the back end of a controller is more than at the front end?

A. The difference between Front End (FE) and Back End (BE) is generally the vantage point. One of the objectives of the presentation was to show that when comparing two results (or systems) the IOPS must be measured at the same point. BE IOPS depend, in part, on the protection overhead and on the software engine’s ability to counteract it. But instead of defining why there is a difference, we aimed at exposing that such difference exists. The “whys” will be addressed in the future webcasts.

Q. Also, another point that would dictate which system is the storage capacity each has to offer. Very good compare & contrast illustration.

A. Agreed, although capacity itself has several facets: raw capacity or the sum of manufacture advertised capacities of all internal drives; usable capacity or the amount of data that could be written on one system after all formatting and partitioning; and logical capacity or the amount of data that external hosts could write onto the system. Logical may be different from usable based on some data reduction services that a system could offer.

Q. Why is there a lower limit on your acceptable latency band? Storage can be too fast for your business needs?

A. We had some discussions looking at whether you show it as a cap, or a band. In many cases, customers will have a target latency cap, but allow some level of exceptions (either % of time it is exceeded, or % by which it is exceeded). That initially suggested a soft cap and a hard cap. One of us also has seen customers that highly value consistent behavior. In a service provider context, providing “too fast” performance could actually build a downstream customer expectation that will cause satisfaction issues if performance later degrades, even if it degrades to levels within the original targets.   You could certainly see this more generally as a cap though, which Ken stated verbally in the presentation. The band represents not only the cap, but also variance.

Q. What is an “OP”? An”Operation Per”? Shouldn’t this be “$/OPS” or “dollars per operation per second”?

A. This question was raised during the verbal Q&A. An “OP” in our context was any of the potential protocol-level operations. For example, in an NFSv3 context, this could be a CREATE, SETATTR, GETATTR, or FINDFILE operation. The number of these that occur in a second is what we referred to as OPS. There is certainly potential confusion here, and we struggled for consistency as we put the presentation together.

For example, sometimes we (and others) refer to operations per second as “op/s” instead of OPS, or use “Ops” or “OPs” or “ops” to mean “more than one operation” instead of “operations per second.” We chose to use OPS because it is consistent with IOPS, which we see frequently in this space, and are trying to push others towards this terminology for consistency.

Q. More IOPS in the same amount of time seems to imply that the system with more IOPS is performing each IO in less time than the other system. So how do we explain a system that can do more IOPS than another system but with a higher response time for each IO than the other system?

A. Excellent question! Two different angles on a reply follow …

First angle:

A simple IOP count doesn’t include the size of the IOP, or the work that the storage controller must perform to process it. Therefore comparing just the IOPS does not provide a good comparison: what if system 1 was doing large IOs (64KiB or more) and system 2 was doing small IOs (4KiB or less)? To avoid that, most people fix the IO size or run controlled tests changing the IO size in steps and then compare the arrays.

The point that we are making with the webcast is that IOPS alone are not enough for a comparison, one must look beyond that into IO size, response time and above all business requirements.

Second angle:

This is an excellent question, and there are at least two answers.

To begin, we need to explain the relationship between IOPS and response time.

We generally talk about response time (latency) in terms of seconds or milliseconds. We say things like “a response time of 5ms.” But this is actually a shortcut. It is really “seconds per I/O” or “milliseconds per I/O.” Also, remember that in “IOPS” the “p” stands for “per”, so this is “IOs per second.” So, ignoring the conversions between “seconds” and “milliseconds”, the relationship between IOPS and response time is easy…they are inverses of each other:

blog image 1

So:

blog image 2

Implies:

blog image 3

Example:

blog image 4

 

blog image 5

Given this:

The first answer to your question is tied to idle time. Consider the following simple example.

System 1:

For system1, assume that the storage array service time is the only significant contributor to latency.

Client issues an I/O

Storage array services the I/O in 10ms

As soon as the client sees the response, it immediately issues a new I/O

Process continues, with all I/Os serviced by the storage array in 10ms each

How many IOPS is system 1 doing?   Measured at the client, or the storage array, it is doing 100 IOPS with an average response time of 10ms.   Remember the relationship between IOPS and response time from the intro above.

System 2:

For system2, assume that both the client and storage array contribute to latency.

Client issues an I/O

Storage array services the I/O in 5ms.

Client waits 5ms before issuing a new I/O.

Process continues, with all I/Os being serviced by the storage array in 5ms each, and the client waiting 5ms between issuing each I/O.

How many IOPS is system2 doing? If you measure the total number of I/Os completed in a minute, it is completing the same # as System1 and so doing 100 IOPS. But what is the latency? If you measured the latency at the storage array, it would be 5ms. If you measured at the client, it would depend on whether your measurement was above or below the point where the 5ms delay per I/O was inserted, and would see either 5ms or 10ms.

The second answer is tied to “concurrency,” an element we didn’t discuss in the webcast.

Concurrency might also be called “parallelism.”  In the examples above, the concurrency was ‘1′. There was only ‘1′ I/O active at a time.

What if we had more though?   Example:

For both systems, assume that the storage array service time is the only significant contributor to latency.

System 1 (same as above)

Client issues an I/O

Storage array services the I/O in 10ms

As soon as the client sees the response, it immediately issues a new I/O

Process continues, with all I/Os serviced by the storage array in 10ms each

How many IOPS is system 1 doing?   Measured at the client, or the storage array, it is doing 100 IOPS with an average response time of 10ms. Remember the relationship between IOPS and response time from the intro above.

System 2:   Increased concurrency

Client issues two I/Os

Storage array receives both and is able to handle both in parallel, but they take 20ms to complete

As soon as the client sees the response, it immediately issues two new I/Os

Process continues, with all I/Os serviced by the storage array in 20ms each, and the client always issuing two at a time

How many IOPS is system 2 doing? In the aggregate, measured at the client or the storage array, it is doing 100 IOPS. What is the response time? This is where the relationship between IOPS and response time introduced before can break down, depending on what you’re really asking for and it really takes some queuing theory to properly explain this. Ignoring the different kinds of queuing models and simplifying for this discussion, at a high level you could say that the average response time is still 10ms. Some reporting tools are likely to do this. That is misleading however, since each I/O really took 20ms. You’d only see a 20ms response if (a) you actually kept track of the real response times somewhere [which many systems do … they’ll create many different response time buckets and store counts of the # of I/Os with a given response time in a given bucket], or (b) you make use of Little’s Law. Little’s Law would at least give you a better mean response time. We didn’t discuss Little’s Law in the webcast, and I won’t go into much detail of it here (follow the Wikipedia link if you want to know more), but in short, Little’s Law can be rearranged to state that:

blog image 6

or, in our case above:

blog image 7

Q. What is valuable from SPEC’s SFS benchmark in this context?

A. Excellent question. We think very highly of SPEC SFS © 2014 for a variety of reasons, but in this context, a few stick out in particular:

1) The workloads provided by SPEC SFS 2014 are the result of multi-vendor research, actual workload traces, and efforts to target real business use cases. The VDI (virtual desktop infrastructure), SWBUILD (software build), DB (OLTP database) and VDA (video data acquisition) workloads are all distinct workloads with very different I/O patterns, I/O size mixes, and operation mixes. We’ll talk more in another session about these aspects of a workload, but the important element is that each workload is designed to help users understand performance in a specific business context.

2) To encourage folks to think in a business context, SPEC SFS 2014 primarily reports results  not  in IOPS or MB/S, but in the relevant “business units” at a given response time, along with a measure of overall response time, aptly referred to as ORT (Overall Response Time).   For example, the SWBUILD workload reports load points as “number of concurrent software builds.” MB/s information is still available for those that want to dig in, but what is really important is not the MB/s, but how many software builds, or databases, or virtual desktops, or video streams can a system support, and this is where the reporting focus is.

3) SPEC SFS 2014 recognizes that  consistent  performance is generally an important element of  good performance. SPEC SFS 2014 implements a variety of checks during execution to make sure that each element doing I/O is doing roughly the same amount of I/O and that they are doing the amount of work that was requested of them and aren’t lagging behind.

More information on SPEC SFS 2014 (and SPEC in general) is available from  http://www.spec.org

Q. Anything different/special about measuring and testing object storage?

A. Object Storage is a hot topic in the industry right now, but unfortunately fell outside the scope of this fundamentals presentation. We will be examining object storage – and corresponding network protocols – in a later webcast. Stay tuned!

Q. Does Queue depth not matter in response times?

A. We did not define Queue Depth as a term in this webcast because we classified it as an advanced subject. We are planning to address queue depth during a future webcast. In short, the queue depth does play an important role in the performance world, but response time is affected by it only indirectly. When the target Queue is full, the Initiator can’t send any more IOs and waits for the Queue to free up.

The time spent waiting is not response time. Response time measures the time the Target spends processing the IO it was able to receive (place into its queue). However, some scenarios may keep an IO inside a Queue for a long time (due to storage controller doing something else), and thus increase the response time for the waiting IO. In the latter case, increasing the depth of the Queue will increase the response time.

Keep in mind thought that “It Depends”. It depends on where you measure. For example, a client may have no visibility into what is happening at the storage array. It cares about how long the storage array takes to respond … it doesn’t know or care about the differences between the storage array’s view of its service time (how long the storage array is doing work) and its queuing delay / wait time (how long it is waiting to do work).

As far as the client is concerned, this is all response time. It just wants a response. Most “response time” metrics reported by tools / apps/ storage arrays really decompose into a set of true service times and queuing delays, although you may never get to see how those components come together w/o using other tools.

Q. Order is important, in addition to mix ratios. For IOs, sequential vs. random (and specifics on randomness) is important. For file operations, where the protocol is stateful, the sequence of operations can strongly impact performance. Both block and file storage can have caches, which behave very differently depending on ordering and working set size. I suggest including the concept of order in a storage benchmarking intro.

A. These are excellent points, but given the amount of time we had, we had to make some hard choices about what to include and what to drop (or defer). This fell on the chopping block. Someone once told me that the best talks are those where you cry over what you’ve been forced to leave on the cutting room floor. There was a lot of good stuff we couldn’t include. I expect that we will address this idea in the planned discussion of workloads.

Q. The “S” in “OPS” and “IOPS” means Seconds and isn’t a pluralization. Some people make a mistake here and use the term “OP” and “IOP” to mean “1 operation/second” or “1 IO/second”. This mistake happens in this presentation, where “$/OP” is used. The numbers suggest what was intended is “# of dollars to get 1 operations/second.” The term “$/OP” can be misinterpreted to mean “dollars per operation,” which is a different but also relevant metric for perishable storage, like flash.

A. Good catch. We tried hard to be consistent in using “OPS” instead of “op/s” or some variant, when we meant “operations per second”.  There is another Q&A that addresses this. Technically, we slipped up in saying “$/OP” instead of “$/OPS” when we meant, as you noted, “Dollars per (operations per second)”. That said, I think you’ll find that nearly everyone makes this shortcut/mistake. That doesn’t make it right, but does mean you should watch out for it.

Q. Isn’t capacity reported in MiB, Gib etc and bandwidth in MB/s today?

A. Unfortunately, our impression is that capacity is also reported in decimal, at least in marketing literature. All we were trying to point out is that units of measurements must be consistent at all points. Many people mix decimal and binary numbers creating ambiguities and errors (remember that at a TiB vs. TB level, the difference is 10%!) For example, graphs that show IO size in decimal and bandwidth in binary.

Q. Would you also please talk about different SAN protocols and how they differ/impact the performance?

A. There are additional webcasts under development that discuss the storage networking protocols, their performance trade/offs, and how performance can be affected by their use. Due to time constraints we needed to postpone this portion because it deserves more attention than we could provide in this session.

Q. So even though IOPS could be more, the response time could be less…so it’s important to consider both. Is that correct?

A. Absolutely! But just response time is not enough either! You need more load points, and a business objective to truly assess the value of a solution.

Update: This webcast is part of a series on storage performance benchmarking. Check out the others:

 

 

New Webcast: Data Center Congestion Control

How do new architectures being deployed in today’  s data centers affect IP-based storage? Find out on September 15th in our next SNIA Ethernet Storage Forum live Webcast, “Data Center Congestion Control,” where we will discuss new  architectures and a new congestion control mechanism called CONGA. Developed from research done at Stanford, CONGA is  a network-based distributed congestion-aware load balancing mechanism. It is being researched for use in next generation data centers to help enhance IP-based storage networks and is becoming available in commercial switches. This Webcast will dive into:

  • A definition of CONGA
  • How CONGA efficiently handles load balancing and asymmetry without TCP modifications
  • CONGA as part of a new data center fabric
  • Spine-Leaf/CLOS architectures
  • Affects of 40g/100g in these architectures
  • The CONGA impact on IP storage networks

Discover the new data center architectures that will support the most demanding applications such as big data analytics and large-scale web services. As always, this Webcast will be live. I encourage you to register today and bring your questions.

 

New Webcast: Storage Performance Benchmarking 101

There’s an art to making sense of storage performance benchmarks. That’s why ESF is hosting a live Webcast on this important topic. Please join us on July 30th for “Storage Performance Benchmarking: Introduction and Fundamentals.”     At this Webcast, you’ll gain an understanding of the complexities of benchmarking modern storage arrays and learn the terminology foundations necessary for the rest of the series as Mark Rogov, Advisory Systems Engineer at EMC, Ken Cantrell, Performance Engineering Manager at NetApp, and I discuss:

  • The different kinds of performance benchmarking engagements
  • An introduction to the variety of relevant metrics
  • How to determine the “right” metrics for your business
  • Terminology basics:  IOPS, op/s, throughput, bandwidth, latency/response time

If you’re untrained in the storage performance arts, this Webcast will bring you up-to-speed on the basics. My colleagues and I hope it will be an informative and interactive hour. Please register today and bring your questions. I hope to see you there.

Update: This webcast is part of a series on storage performance benchmarking. Check out the others:

Ethernet Roadmap for Networked Storage Q&A

Almost 200 people attended our joint Webcast with the Ethernet Alliance: “The 2015 Ethernet Roadmap for Networked Storage.” We had a lot of great questions during the live event, but we did not have time to answer them all. As promised, we’ve complied answers for all of the questions that came in. If you think of additional questions, please feel free to comment on this blog.

Q. What did you mean by parity of flash with HDD?

A. We were referring to the O’Reilly article in “Network Computing.”   O’Reilly is predicting parity in BOTH capacity and price in 2016.

Q. When do we expect IEEE standards ratification for 25G speed?

A. 2016.   You can see the exact schedule here.

Q. Do you envision the Enterprise, Cloud Providers, HPC, Financials getting rid of their 10/40GbE infrastructure and replacing that with 25/100GbE infrastructure in 2017? Will these customers deploy 100GbE/25GbE switch in the leaf layer in 2017?

A. Deployment will occur over a multi-year time span overall if only because switch infrastructure is expensive to upgrade, as reflected in the Crehan Research forecast.   New deployments will likely move to 25/100GbE as new switches with 100GbE downstream ports become available in 2016.     Just because the Cloud Service Providers are currently the most aggressive in driving new infrastructure purchases, they represent the largest early volumes for 25/100 GbE.   Enterprise is still in the midst of the transition from 1GbE to 10GbE.

Q. What are some of the developments on spanning-tree derivatives vs. Dykstra based derivatives such as OSPF, FSPF for switches?

A. Beyond the scope of this presentation on Ethernet.   Ethernet is defined by the IEEE for L1 and L2 in the ISO model.   Your questions are at L3 and L4, which is handled by organizations like IETF.

Q. With all the speeds possible who is working on flow control?

A. Flow control at the 802.1 level is supported in the Layer 1/2 PHY & MAC by setting upper bounds on the delay through each layer which allows higher layers to comprehend the delays & response times to pause frames. Each new speed & PHY in 802.3 is accompanied by delay constraint specifications to support this.

Q.   Do you have an overlay graphic that shows the Ethernet RDMA roadmap?   If so, is Ethernet storage the primary driver for that technology?

A.   Beyond the scope of this presentation on Ethernet.   Ethernet is defined by the IEEE for L1 and L2 in the ISO model.   Your questions are at L3 and L4, which is handled by organizations like IETF and the InfiniBand Trade Association.

Q. The adoption of faster and new Ethernet always has to do with the costs of acquiring new technology. How long do you think it will take to adopt/acquire faster Ethernet in datacenters now that the development is happening much faster than the last 20 years?

A. Please see the chart on slide 7 where Crehan Research predicts how fast the technology will diffuse into deployments.

Q. What do you expect as cost comparison between Ethernet and InfiniBand going forward?
Also, what work is being done to reduce latency?

A. Beyond the scope of this presentation.   Latency is primarily a consequence of design methodologies and semiconductor process technology, and thus under the control of the silicon device manufacturers.   Some vendors prioritize latency more than others.

Q. What’s the technical limitation as speeds go higher and higher?

A. A number of factors limit speeds going faster and faster, but the main problem is that materials attenuate signals as they travel at higher frequencies.

Q. Will 1GbE used for manageability purposes disappear from public cloud? If so, what is the expected time frame?

A. This is a choice for end users.   Most equipment is managed on a separate network for security concerns, but users can eliminate these management networks at any time.

Q. What are the relative market size predictions for the expanding number of standards (25G, 50G, 100G, 200G, etc.)?

A. See the Crehan Research forecast in the presentation.

Q. What is the major difference between SMF & MMF for the not so initiated?

A. The SMF has a 9um core while the MMF has a 50um core.   Different lasers are used for each fiber type and MMF typically goes 100 meters above 10GbE and SMF goes from 500m to 10km.

Q. Will 25G be available through both copper and fibre connectivity?

A. Yes.   IEEE 802.3 work is currently underway to specify 25Gb/s on twinax (“direct attach copper)” to 5 meters, printed circuit backplane up to ~1m, twisted pair copper to 30m, multimode fiber to 100m.   There is no technology barrier to 25G on SMF, just that a standards project to specify it has not started yet.

Q. This is interesting from a hardware viewpoint, but has nothing to do with storage yet.   Are we going to get to how this relates to storage other than saying flash drives are fast and only Ethernet can keep up?

A. Beyond the scope of this presentation on Ethernet.   Ethernet is defined by the IEEE for L1 and L2 in the ISO model.   Your questions are directed at the higher layers.   The key point of this webcast is that storage networking engineers need to pay much more attention to the Ethernet roadmap than they have historically, primarily because of NVM.

Q. How does “SFP 28” fit in this mix?   Is it required for 25G?

A. SFP28 connectors and modules are required for 25GbE because they give better performance than SFP+ that only works to 10GbE.

Q. Can you provide the quick difference between copper & optical on speed & costs?

A. Copper and optical Ethernet links are usually standardized at the same speed.   400GbE is not defining a copper link but an active Direct Attached Cable (DAC) will probably support 400GbE.   Cost depends on volume and many factors and is beyond the scope of this presentation.   Copper is usually a fraction of the cost of optical links.

Q. Do you think people will try to use multiple CAT 5e to get more aggregate bandwidth to the access points to avoid having to run Fibre to them?

A. IEEE is defining 2.5GBASE-T and 5GBASE-T to enable Cat5e to support faster wireless access points.

Q. When are higher speeds and PoE going to reach the point when copper based Ethernet will become a viable heat source for buildings thus helping the environment?

A. 🙂   IEEE is defining 4 wire PoE to deliver at least 60W to end devices.   You can find out more here.

Q. What are the use cases for 2.5Gb and 5.0Gb Base-T?

A. The leading use case for 2.5G/5GBASE-T is to provide the uplink for wireless LAN access points that support 802.11ac and future wireless technology.   Wireless LAN technology has advanced to the point where >1Gb/s BW is needed upstream from the AP, and 2.5G/5G provide a higher speed uplink while preserving the user’s investment in Cat5e/Cat6 cabling.

Q. Why not have only CFP2 sockets right away with things disabled for lower speeds for all the intervening years leading to full-fledged CFP2?

A. CFP2 is defined for 100GbE and 8 ports can be used on a 1U switch. 100GbE switches are shifting to QSFP28 so that 32 ports of 100GbE is supported in a 1U switch at low cost.   The CFP2 is much more expensive than QSFP28 and will not be used for lower speeds because of the high cost.