What Does Software Defined Storage Mean for Storage Networking?

Software defined storage (SDS) is growing in popularity in both cloud and enterprise accounts. But why is it appealing to some customers and what is the impact on storage networking? Find out at our SNIA Networking Storage Forum webcast on October 22, 2019 “What Software Defined Storage Means for Storage Networking” where our experts will discuss:

  • What makes SDS different from traditional storage arrays?
  • Does SDS have different networking requirements than traditional storage appliances?
  • Does SDS really save money?
  • Does SDS support block, file and object storage access?
  • How data availability is managed in SDS vs. traditional storage
  • What are potential issues when deploying SDS?

Register today to save your spot on Oct. 22nd. This event is live, so, as always, our SNIA experts will be on hand to answer your questions.

Storage Congestion on the Network Q&A

As more storage traffic traverses the network, the risk of congestion leading to higher-than-expected latencies and lower-than-expected throughput has become common. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast earlier this month, Introduction to Incast, Head of Line Blocking, and Congestion Management. In this webcast (which is now available on-demand), our SNIA experts discussed how Ethernet, Fibre Channel and InfiniBand each handles increased traffic.

The audience at the live event asked some great questions and, as promised, here are answers to them all.

Q. How many IP switch vendors today support Data Center TCP (DCTCP)?

Intro to Incast, Head of Line Blocking, and Congestion Management

For a long time, the architecture and best practices of storage networks have been relatively well-understood. Recently, however, advanced capabilities have been added to storage that could have broader impacts on networks than we think.

The three main storage network transports – Fibre Channel, Ethernet, and InfiniBand – all have mechanisms to handle increased traffic, but they are not all affected or implemented the same way. For instance, a protocol such as NVMe over Fabrics will rely on very different methodologies for congestion avoidance, burst handling, and queue management on one network transport than on another.

Unfortunately, many network administrators may not understand how different storage solutions place burdens upon their networks. As more storage traffic traverses the network, customers face the risk of congestion leading to higher-than-expected latencies and lower-than-expected throughput.

That’s why the SNIA Networking Storage Forum (NSF) is hosting a live webcast on June 18, 2019, Introduction to Incast, Head of Line Blocking, and Congestion Management where our NSF experts will cover:

  • Typical storage traffic patterns
  • What is Incast, what is head of line blocking, what is congestion, what is a slow drain, and when do these become problems on a network?
  • How Ethernet, Fibre Channel, InfiniBand handle these effects
  • The proper role of buffers in handling storage network traffic
  • Potential new ways to handle increasing storage traffic loads on the network

Register today to save your spot for June 18th. As always, our experts will be available to answer your questions. We hope to see you there.

A Q&A from the FCoE vs. iSCSI vs. iSER Debate

It’s become quite clear to those of us in the SNIA Ethernet Storage Forum (ESF) that everyone loves a great debate. We’ve proved that with our “Great Storage Debates” webcast series which has had over 3,500 views in just a few months! Last month we had another friendly debate on FCoE vs. iSCSI vs. iSER. If you missed the live event, you can watch it now on-demand and download a pdf of the webcast slides.  Our live audience asked a lot of interesting questions. As promised, here are answers to them all.

Q. How often are iSCSI offload adapters used in customer environments as compared to software initiators? Can these adapters be used for all IP traffic or do they only run iSCSI?

A. iSCSI offload adapters are ideally suited for enabling high-performance storage access at up to 100Gbps data rates for business-critical applications, for example, latency-sensitive transactional applications and large-file business intelligence applications. iSCSI offload adapters typically also support offload of other storage protocols such as NVMe-oF, iSER, and FCoE, as well as regular Ethernet traffic using offload or non-offload means.

Q. What you’ve missed with iSCSI is Jumbo Frames. That payload size is one of the biggest advantages over Fibre Channel. The biggest problem with both FCoE and iSCSI is they build the networks too complex, with too many hops, without true redundant isolation. Best Practices with block based FC is to keep the host and storage as close to each other as possible. And to have separate isolated redundant networks/fabric.

A. The Jumbo Frame (JF) argument is quite contentious among iSCSI storage and network administrators, even beyond anything to do with Fibre Channel.

The performance advantages of JFs are minimal – only a 3%-5% performance boost over the default MTU size of 1500 bytes. In mixed workload environments (which dominate data center application deployments), JFs simply do not provide the kind of benefits that people expect in real-world scenarios. The only time JFs can “move the needle,” so to speak, is when you have massively scaled systems with 100s or 1000s of devices, but this raises other issues.

One of those issues is that every device in the system needs to have JFs enabled. This can be something of a problem when systems get as large as they need to be in order to take advantage of JFs. Ensuring that every device is configured properly – especially over time, and especially when considering how iSCSI devices are added to networked environments – is a job that requires the coordination of the server/virtualization teams, the networking teams, and the storage teams. By and large, many people find QoS to be a more productive means of performance improvement for iSCSI systems than JFs.
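
A rough back-of-the-envelope sketch (in Python) of why the jumbo frame benefit sits in that 3%-5% range: compare the fraction of each frame that carries application payload at the two MTU sizes. The header sizes assumed here – standard IPv4 and TCP with no options, plus an 18-byte Ethernet header/FCS – are illustrative assumptions, not measurements of any particular stack.

  # Rough wire-efficiency comparison: standard vs. jumbo frames.
  # Assumes IPv4 (20 B) + TCP (20 B) headers and 18 B of Ethernet
  # header/FCS per frame; real stacks and TCP options will vary.
  ETH_OVERHEAD = 18
  IP_TCP_OVERHEAD = 40

  def efficiency(mtu: int) -> float:
      """Fraction of each frame that is application payload."""
      return (mtu - IP_TCP_OVERHEAD) / (mtu + ETH_OVERHEAD)

  std = efficiency(1500)    # ~96.2%
  jumbo = efficiency(9000)  # ~99.4%
  print(f"jumbo frame gain: {jumbo - std:.1%}")  # roughly 3%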

Fibre Channel, on the other hand, has a maximum frame size of 2112 bytes. FCoE, then, only requires “baby jumbo” frames (~2.5k), for which the configuration is pushed from the switch to the end devices. What FC has that iSCSI does not have is the concept of “sequences” and “exchanges,” which ensures that a long flow of frames (regardless of their size) is sent as an entity. So, regardless of what the frame size is (2.5k or 9k), the data flow is sent with consistency and low jitter because of the way that sequences and exchanges are handled.

The concern about “too complex” and “too many hops” is an interesting one, as Fibre Channel (and, correspondingly, FCoE) are deliberately kept as simple and straightforward as possible. A FC network, for instance, rarely goes beyond 2 hops (“hops” in FC are measured as the links between switches, whereas in Ethernet “hops” are measured as the switches themselves).

Logically, then, there is usually, at most, an edge-core-edge topology with a deterministic path to be followed, thanks to Fibre Channel’s FSPF routing algorithm.

iSCSI topologies, on the other hand, can be complex (as Ethernet topologies sometimes can be). For larger iSCSI environments, it is often recommended to isolate the storage traffic out into its own, simplified topology. iSCSI SANs that have grown organically, however, can sometimes struggle to be reined in over time.

The best practice for all storage, not just block, is to keep it as close to the host/source as is reasonably possible. In backup scenarios, for example, you want the storage far enough away to be safe from any catastrophe, but close enough to meet recovery objectives. As mentioned in the webinar, it is also important that architectural principles ensure high availability (HA) to offset the rigidity that block storage systems require to compensate for weaker upper-layer protocol (ULP) recovery mechanisms.

Q. Most servers today have enough compute power to not need offload adapters.

A. This statement might be true in some situations, but definitely not most. With more and more virtual machines being deployed on physical systems, and with new storage technologies such as SSDs and NVMe devices greatly lowering latencies, servers are often CPU bound when moving or retrieving data from storage. Offloading storage-related activities to an adapter frees the CPU and increases overall server performance.

Q. In which industry is each protocol (i.e., FCoE, iSCSI, and iSER) widely used, and where?

A. iSCSI is the most widely supported Ethernet SAN protocol, with native initiator support integrated into all the major operating systems and hypervisors, built-in RDMA for high-performance offloaded implementations supporting up to 100Gbps, and support across major storage platforms. It is thus ideally suited for deployment across cloud and enterprise data center environments.

Q. Do iSCSI offload adapters provide the IPSec encryption, or is this done in software only solutions? Please answer from both initiator and target perspective.

A. Yes, iSCSI protocol offload adapters can optionally provide offload of IPSec encryption for both iSCSI (as well as NVMe-oF) initiator and target operation at data rates of up to 100 gigabits per second. This results in overall higher server and target efficiency, including power, cooling, memory, and CPU savings.

Q. Does iSER support direct connection, or is a switch between the endpoints required?

A. A switch is not required.

Q. J, you left out the centralized management that Fibre Channel provides for FCoE as a positive.

A. I got there eventually! But you are correct: the Fibre Channel tools for a centralized management plane with the name server – regardless of the number of switches in the fabric – are a tremendous positive for FCoE/FC solutions at scale.

Q. Is multipath possible on the initiator with iSER, and will it scale with high IOPS?

A. Yes. Multipath is possible on the initiator with iSER and scales with high IOPS.

Q. FCoE has been around for a while, but I noticed that some storage vendors are dropping support for it. Do you still see a big future for FCoE?

A. As a protocol, FCoE has always been able to be used wherever and whenever needed. Almost all converged infrastructure systems use FCoE, for instance. Given that the key advantage of FCoE has been traffic/protocol consolidation, there is an extremely strong use case for FCoE at “the first hop” – that is, from the server to the first network switch.

Q. What is the MTU for iSER?

A. iSER is a protocol that sits above the Layer 2 Data Link Layer, which is where the MTU is set. As a result, iSER will accept/accommodate any MTU setting that is configured at that layer. Please see the answer earlier about Jumbo Frames for more information.

Ready for more great storage debates? Our next one will be RoCE vs. iWARP on August 22, 2018. Save your place by registering here.

And you can check out our previous debates “File vs. Block vs. Object Storage” and “Fibre Channel vs. iSCSI” on-demand at your convenience too. Happy debating!

Benchmarking Workload Storage Performance – An Expert Q&A

Nearly 1,000 people have watched our most recent SNIA ESF webcast, Storage Performance Benchmarking: Workloads. We hope you didn’t miss this 5th and final installment of the now famous Storage Performance Benchmarking webcast series, where our experts, Mark Rogov and Chris Conniff, explained how to measure and optimize storage performance of workloads. If you haven’t seen it, it’s available on-demand. The live audience had many great questions. Here are answers to them all.

Q. Is it good to assume that sequential IO would benefit from large IO size and, conversely, random IO from small IO size?

A. I don’t think “benefit” is a practical way to look at this. Workloads come in all sizes and mixes, and it is the job of the storage array to handle what is thrown at it. Storage admins (and placement algorithms) need to configure the system to produce the best performance given the current load. Historically, random workloads are harder to optimize than sequential ones. Mixing block sizes, even with sequential workloads, is also tough to deal with. It all comes down to figuring out where the bottlenecks are and how to overcome them.

Q. But pre-fetch, cache, etc. will show benefit to sequential IO on an all-flash array versus random?

A. Technically, we need to look at the effects of cache and pre-fetch on Reads and Writes separately. For Reads, pre-fetching data into cache does show a lot of benefit, especially for sequential IO. For Writes, pre-fetching is not effective, but cache is: it is, generally, faster to save data to cache than straight to disk (assuming, of course, there is free space in cache to write to).

Q. Don’t you see [concatenation of small IOs into larger IO] when apps are inside a VM versus physical nodes? I ask because [block size] changes with versions of hypervisors.

A. This is a great question, and a big misconception floating out there. Hypervisors have three primary methods for accessing storage: direct block, NAS, or via internal filesystem.

The direct block, aka raw device mapping, is simple—all IOs are simply sent to the storage array as they are. No concatenation, folding, compression, etc.

Internal filesystem, VMFS, has a concept of a block size. These blocks are used for internal management of the filesystem (see our File webcast for an explanation of how those work). A common misconception is that when a write IO is smaller than the block size, the filesystem issues a write equal to its block size. In reality, the writes are simply passed to the underlying driver, plus some additional metadata IO. The amount of data being written doesn’t change just because it is written into a larger block container. In cases where the write IO is larger than the FS block size, yes, the storage will see multiple IOs, depending on what the ratio is. In VMFS, though, block sizes are usually quite large: 1MiB, 4MiB, 8MiB. Very few (if any!) workloads have IOs that big. VMFS also has a concept of “sub blocks,” which are smaller than 1MiB but are also quite large: 64KiB, with the same logic applied.
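
Here is a minimal sketch of that pass-through behavior; the sizes are illustrative assumptions rather than VMFS internals, and metadata IO is ignored. A write only turns into multiple back-end IOs when it is larger than the filesystem block size:

  import math

  def backend_write_ios(io_size: int, fs_block_size: int) -> int:
      """Simplified model: a write no larger than the FS block passes
      through as a single IO; a larger write is split by the ratio."""
      return max(1, math.ceil(io_size / fs_block_size))

  print(backend_write_ios(8 * 1024, 1024 * 1024))         # 8 KiB write, 1 MiB block -> 1
  print(backend_write_ios(4 * 1024 * 1024, 1024 * 1024))  # 4 MiB write, 1 MiB block -> 4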

NAS communication is the most complex. For the purposes of this question, consider the Ethernet Maximum Transfer Unit (MTU) size. It regulates the size of the frame, which by default is 1500 bytes. Therefore, all data must either be smaller than 1500 bytes or be cut into 1500-byte chunks to be sent across the wire. For example, a 4KiB IO will be split into 3 frames: 1500, 1500, and 1096. Sometimes, the MTU is set to a “jumbo setting” of 9000. Then each 4KiB IO fits into one frame. With NFSv4, the protocol allows combining several NFS calls into one frame. Theoretically, that means that two NFS write calls for 4KiB could fit into a single 9KB jumbo Ethernet frame. In reality, one needs to examine closely which specific NFS calls are truly being used – see our File webcast for details of how block commands get translated to and from filesystem calls, and extrapolate that into NFS calls (FS and NFS are not the same!).
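
A minimal sketch of the frame-count arithmetic above; like the 1500/1500/1096 example, it deliberately ignores Ethernet/IP/TCP header overhead:

  def split_into_frames(io_bytes: int, mtu: int) -> list[int]:
      """Chop an IO into MTU-sized chunks (header overhead ignored)."""
      return [min(mtu, io_bytes - offset) for offset in range(0, io_bytes, mtu)]

  print(split_into_frames(4096, 1500))  # [1500, 1500, 1096] -> 3 frames
  print(split_into_frames(4096, 9000))  # [4096]             -> 1 jumbo frame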

Bottom line, regardless of the datastore access method, in most cases, your workload IO will be passed to the storage array as is without coalescing.

Q. There are several other factors that have to be considered: 1) More than random/sequential it’s I/O adjacency that matters most. Think how differently a hybrid storage system would handle random I/O to 5% of the volume vs. even sequential to the whole volume. 2) Does the data dedupe/compress using the array’s algorithm?

A. I agree somewhat with this comment. Adjacency of the IO is a good way to think about things. Intelligent placement algorithms do have a concept of a “working area,” and could elect to promote whole regions to the faster storage tier to speed up all the “adjacent” random requests. Data reduction (compression, deduplication, single-instancing, zero-padding, etc.) introduces an overhead in some arrays and therefore muddies the picture somewhat, yet the underlying principles remain the same. Keep in mind that this is a vendor-neutral presentation; the differences in how solutions handle data reduction are very commonly heavily marketed.

Q. I think Mark is at a different company now 🙂 2 4 8 is NOT sequential; it is strided, or geometric. Holistically, if proc A reads 2, 4, 6 and proc B reads 1, 3, 5, 7, then the strides are within a proc, but holistically this would be sequential. Though few operating systems do this level of pre-fetch logic.

A. Good catch, and I agree. In a later revision of the slides, we relabeled the 2, 4, 6 sequence as a “predictable” pattern, not a sequential one.

Q. If you have an all-flash array, what kind of performance hit do you take between sequential and random reads? I would think it isn’t as impactful as with a mechanical drive.

A. It depends: the answer can be true under one condition and not another (e.g., the LUN is in a write-pending condition, cache flushing is taking place, etc.).

Generally speaking, reads always perform better than writes on flash drives. As for random vs. sequential IO, they are measured with two different metrics (IOPS vs. throughput). In order to answer that part of the question, the measurements must be normalized to a common KPI so that they can be compared. And to do that, we’d have to know what IO block size the question assumes, as it must be known to solve ‘IO size x IOPS = throughput in MB/s’ (see the quick sketch after the spec sheet below).

This is from the spec sheet for one vendor’s SSD:

  • Sequential Read (up to): 450 MB/s
  • Sequential Write (up to): 380 MB/s
  • Random Read (100% Span): 67,500 IOPS
  • Random Write (100% Span): 17,500 IOPS
  • Latency – Read: 40 µs
  • Latency – Write: 42 µs
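
To make the normalization concrete, here is a quick sketch that applies ‘IO size x IOPS = throughput’ to the random IOPS figures above, assuming 4 KiB IOs – the spec sheet does not state the block size behind its IOPS numbers, so that assumption is ours:

  def iops_to_mb_per_s(iops: int, io_size_bytes: int) -> float:
      """IO size x IOPS = throughput, expressed in MB/s (10^6 bytes)."""
      return iops * io_size_bytes / 1_000_000

  print(iops_to_mb_per_s(67_500, 4096))  # ~276 MB/s random read vs. 450 MB/s sequential
  print(iops_to_mb_per_s(17_500, 4096))  # ~72 MB/s random write vs. 380 MB/s sequential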

So, the “it depends” answer is based on the manufacturer & model of drive at a specific code level, and (if it’s in an array) how the specific array vendor implemented it in their design. If it’s an Integrated Cache Disk Array (ICDA), then pre-fetching (read-ahead) and caching algorithms behave differently at various code levels for each vendor. There are also specific user-defined configuration parameters that can negate the answer – for instance, high & low water marks, dynamic cache partitioning, workload QoS, etc.

In the case of an ICDA, what’s more important than read vs. write or random vs. sequential is whether or not the IO was a cache hit. A cache hit for a single random read IO in an ICDA whose LUN is on a 7.2K SATA drive will have better performance than a random read miss on that same array if the LUN were on a flash drive.

So, as we’ve seen throughout this series, there is more to the overall performance benchmark than any one variable.

Q. What observations do you have on rebuild time for Flash disk? On what magnitude is it faster than spinning disks considering a high-end hybrid or AFA storage system?

A. This question is dangerous, as it crosses into vendor specifics. Rebuild time depends on more than just the type of drive: the RAID type and configuration, drive utilization, how busy the array is, and many more factors come into play here. Generally, an SSD drive will rebuild faster than a similarly sized spinning drive.

Q. What about the fact that you get cache effects in the OS stack and also in the Flash (DRAM landing areas) that actually improve write latency versus reads? Isn’t it worth mentioning that writes can actually appear more performant on Flash? Or am I missing something? (Probably the latter) 🙂

A. You are 100% correct! 🙂 But do consider the size of DRAM… how much data can it take? Does the entire “working area” fit there? If it does, voilà! If it does not – welcome to the rest of the world!

Q. Can you send links to all the webcasts in this series?

A. Of course, please see below and happy viewing!

  1. Storage Performance Benchmarking: Introduction and Fundamentals
  2. Storage Performance Benchmarking: Part 2 – Solution under Test
  3. Storage Performance Benchmarking: Block Components  
  4. Storage Performance Benchmarking: File Components  
  5. Storage Performance Benchmarking: Workloads

Storage Performance Benchmarking: Workloads

The SNIA Ethernet Storage Forum is very pleased to announce that the hugely popular “Storage Performance Benchmarking” webcast series continues with a 5th installment! Join us on February 14th at 10:00 am PT for “Storage Performance Benchmarking: Workloads.”

Benchmarking storage performance is both an art and a science. In this 5th installment, our experts, Mark Rogov and Chris Conniff, take on optimizing performance for various workloads. Attendees will gain an understanding of workload profiles and their characteristics for common Independent Software Vendor (ISV) applications and learn how to identify application workloads based on I/O profiles to better understand the implications on storage architectures and design patterns. This webcast will cover:

  • An introduction to benchmarking storage performance of workloads
  • Workload characteristics
  • Common Workloads (OLTP, OLAP, VMware, etc.)
  • Graph fun!

Did you notice this webcast is on February 14th? We did that on purpose, because we know you’ll love it! So, register now and spend an hour of your Valentine’s Day with us. We hope to see you there.

And if you have not yet had a chance to watch any of our previous “Storage Performance Benchmarking” webcasts, they are all available on-demand.