Intro to Incast, Head of Line Blocking, and Congestion Management

For a long time, the architecture and best practices of storage networks have been relatively well-understood. Recently, however, advanced capabilities have been added to storage that could have broader impacts on networks than we think.

The three main storage network transports – Fibre Channel, Ethernet, and InfiniBand – all have mechanisms to handle increased traffic, but they are not all affected or implemented the same way. For instance, utilizing a protocol such as NVMe over Fabrics will offer very different methodologies for handling congestion avoidance, burst handling, and queue management when looking at one networking in comparison to another.

Unfortunately, many network administrators may not understand how different storage solutions place burdens upon their networks. As more storage traffic traverses the network, customers face the risk of congestion leading to higher-than-expected latencies and lower-than expected throughput.

That’s why the SNIA Networking Storage Forum (NSF) is hosting a live webcast on June 18, 2019, Introduction to Incast, Head of Line Blocking, and Congestion Management where our NSF experts will cover:

  • Typical storage traffic patterns
  • What is Incast, what is head of line blocking, what is congestion, what is a slow drain, and when do these become problems on a network?
  • How Ethernet, Fibre Channel, InfiniBand handle these effects
  • The proper role of buffers in handling storage network traffic
  • Potential new ways to handle increasing storage traffic loads on the network

Register today to save your spot for June 18th. As always, our experts will be available to answer your questions. We hope to see you there.

The Impact of New Network Speeds on Storage

In the last few years, Ethernet equipment vendors have announced big increases in line speeds, shipping 25, 50, and 100 Gigabits-per -second (Gb/s) speeds and announcing 200/400 Gb/s. At the same time Fibre Channel vendors have launched 32GFC, 64GFC and 128GFC technology while InfiniBand has reached 200Gb/s (called HDR) speed.

But who exactly is asking for these faster new networking speeds, and how will they use them? Are there servers, storage, and applications that can make good use of them? How are these new speeds achieved? Are new types of signaling, cables and transceivers required? How will changes in PCIe standards and bandwidth keep up? And do the faster speeds come with different distance limitations?

These are among the questions our panel of experts will answer at the next live SNIA Networking Storage Forum (NSF) webcast on May 21, 2019, “New Landscape of Network Speeds.” Join us to learn:

  • How these new speeds are achieved
  • Where they are likely to be deployed for storage
  • What infrastructure changes are needed to support them

Register today to save your spot. And don’t forget to bring your questions. Our experts will be available to answer them on the spot.

Everything You Wanted to Know about Memory

Many followers (dare we say fans?) of the SNIA Networking Storage Forum (NSF) are familiar with our popular webcast series “Everything You Wanted To Know About Storage But Were Too Proud To Ask.” If you’ve missed any of the nine episodes we’ve done to date, they are all available on-demand and provide a 101 lesson on a range of storage related topics like buffers, storage controllers, iSCSI and more.

Our next “Too Proud to Ask” webcast on May 16, 2019 will be “Everything You Wanted To Know About Storage But Were Too Proud To Ask – Part Taupe – The Memory Pod.” Traditionally, much of the IT infrastructure that we’ve built over the years can be divided fairly simply into storage (the place we save our persistent data), network (how we get access to the storage and get at our data) and compute (memory and CPU that crunches on the data). In fact, so successful has this model been that a trip to any cloud services provider allows you to order (and be billed for) exactly these three components.

The only purpose of storage is to persist the data between periods of processing it on a CPU. And the only purpose of memory is to provide a cache of fast accessible data to feed the huge appetite of compute. Currently, we build effective systems in a cost-optimal way by using appropriate quantities of expensive and fast memory (DRAM for instance) to cache our cheaper and slower storage. But fast memory has no persistence at all; it’s only storage that provides the application the guarantee that storing, modifying or deleting data does exactly that.

Memory and storage differ in other ways. For example, we load from memory to registers on the CPU, perform operations there, and then store the results back to memory by loading from and storing to byte addresses. This load/store technology is different from storage, where we tend to move data back and fore between memory and storage in large blocks, by using an API (application programming interface).

It’s clear the lines between memory and storage are blurring as new memory technologies are challenging the way we build and use storage to meet application demands. New memory technologies look like storage in that they’re persistent, if a lot faster than traditional disks or even Flash based SSDs, but we address them in bytes, as we do memory like DRAM, if more slowly. Persistent memory (PM) lies between storage and memory in latency, bandwidth and cost, while providing memory semantics and storage persistence. In this webcast, our SNIA experts will discuss:

  • Fundamental terminology relating to memory
  • Traditional uses of storage and memory as a cache
  • How can we build and use systems based on PM?
  • Persistent memory over a network
  • Do we need a new programming model to take advantage of PM?
  • Interesting use cases for systems equipped with PM
  • How we might take better advantage of this new technology

Register today for this live webcast on May 16th. Our experts will be available to answer the questions that you should not be too proud to ask!

And if you’re curious to know why each of the webcasts in this series is associated with a different color (rather than a number), check out this SNIA NSF blog that explains it all.

Scale-Out File Systems FAQ

On February 28th, the SNIA Networking Storage Forum (NSF) took at look at what’s happening in Scale-Out File Systems. We discussed general principles, design considerations, challenges, benchmarks and more. If you missed the live webcast, it’s now available on-demand. We did not have time to answer all the questions we received at the live event, so here are answers to them all.

Q. Can scale-out file systems do Erasure coding?

A. Indeed, Erasure coding is a common method to improve resilience.

Q. How does one address the problem of a specific disk going down? Where does scale-out architecture provide redundancy?

A. Disk failures typically are covered by RAID software. Some of scale-out software also use multiple replicators to mitigate the impact of disk failures.

Q. Are there use cases where a hybrid of these two styles is needed?

A. Yes, for example, in some environments, the foundation layer might be using the dedicated storage server to form the large storage pool, which is the 1st style, and then export LUNs or virtual disks to the compute nodes (either physical or virtual) to run the applications, which is the 2nd style.

Q. Which scale-out file systems present on windows, Linux platforms?

A.   Some of  the scale-out file systems provide  native client software across multiple  platforms. Another approach is to use Samba to build SMB  gateways to make the  scale-out file system  available to Windows computers.

Q. Is Amazon elastic file system (EFS) on AWS scale-out file systems?

A. Please see:

https://docs.aws.amazon.com/efs/latest/ug/performance.html

“Amazon EFS file systems are distributed across an unconstrained number of storage servers, enabling file systems to grow elastically to petabyte scale and allowing massively parallel access from Amazon EC2 instances to your data. The distributed design of Amazon EFS avoids the bottlenecks and constraints inherent to traditional file servers.”

Q. Where are the most cost effective price/performance uses of NVMe?  

A. NVMe can support very high IOPS and very high throughput as well. The best use case would be to couple NVMe with high performance storage software that would not limit the NVMe.

 

What Are the Networking Requirements for HCI?

Hyperconverged infrastructures (also known as “HCI”) are designed to be easy to set up and  manage. All  you need to do is add networking. In practice, the “add networking” part has been more difficult than most anticipated. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast “The Networking Requirements for Hyperconverged Infrastructure” where we covered what HCI is, storage characteristics of HCI, and important networking considerations. If you missed it, it’s available on-demand.

We had some interesting questions during the live webcast and as we promised during the live presentation, here are answers from our expert presenters: Read More

Experts Answer Virtualization and Storage Networking Questions

The SNIA Networking Storage Forum (NSF) kicked off the New Year with a live webcast “Virtualization and Storage Networking Best Practices.” We guessed it would be popular and boy, were we right! Nearly 1,000 people registered to attend. If you missed out, it’s available on-demand. You can also download a copy of the webcast slides.

Our experts, Jason Massae from VMware and Cody Hosterman from Pure Storage, did a great job sharing insights and lessons learned on best practices on configuration, troubleshooting and optimization. Here’s the blog we promised at the live event with answers to all the questions we received.

Q. What is a fan-in ratio?

A. fan-in ratio (sometimes also called an “oversubscription ratio”) refers to the number of hosts links related to the links to a storage device. Using a very simple example can help understand the principle:

Say you have a Fibre Channel network (the actual speed or protocol does not matter for our purposes here). You have 60 hosts, each with a 4GFC link, going through a series of switches, connected to a storage device, just like in the diagram below:

This is a 10:1 oversubscription ratio; the “fan-in” part refers to the number of host bandwidth in comparison to the target bandwidth.

Block storage protocols like Fibre Channel, FCoE, and iSCSI have much lower fan-in ratios than file storage protocols such as NFS, and deterministic storage protocols (like FC) have lower than non-deterministic (like iSCSI). The true arbiter of what the appropriate fan-in ratio is determined by the application. Highly transactional applications, such as databases, often require very low ratios.

Q. If there’s a mismatch in the MTU between server and switch will the highest MTU between the two get negotiated or else will a mismatch persist?

A. No, the lowest value will be used, but there’s a caveat to this. The switch and the network in the path(s) can have MTU set higher than the hosts, but the hosts cannot have a higher MTU than the network. For example, if your hosts are set to 1500 and all the network switches in the path are set to 9k, then the hosts will communicate over 1500.

However, what can, and usually does, happen is someone sets the host(s) or target(s) to 9k but never changes the rest of the network. When this happens, you end up with unreliable or even loss of connectivity. Take a look at the graphic below:

A large ball can’t fit through a hole smaller than itself. Consequently, a 9k frame cannot pass through a 1500 port. Unless you and your network admin both understand and use jumbo frames, there’s no reason to implement in your environment.

Q. Can you implement port binding when using two NICs for all traffic including iSCSI?

A. Yes you can use two NICs for all traffic including iSCSI, many organizations use this configuration. The key to this is making sure you have enough bandwidth to support  all  the traffic/ IO that will use those NICs. You should, at the very least, use 10Gb NICs faster if possible.

Remember, now all your management, VM and storage traffic are using the same network devices. If you don’t plan accordingly, everything can be impacted in your virtual environment. There are some hypervisors capable of granular network controls to manage which type of traffic uses which NIC, certain failover details and allow setting QoS limits on the different traffic types. Subsequently, you can ensure storage traffic gets the required bandwidth or priority in a dual NIC configuration.

Q. I’ve seen HBA drivers that by default set their queue depth to 128 but the target port only handles 512. So two HBAs would saturate one target port which is undesirable. Why don’t the HBA drivers ask what the depth should be at installation?

A. There are a couple of possible reasons for this. One is that many do not know what it even means, and are likely to make a poor decision (higher is better, right?!). So vendors tend to set these things at defaults and let people change them if needed—and usually that means they have purpose to change them. Furthermore, every storage array handles these things differently, and that can make it more difficult to size these things. It is usually better to provide consistency—having things set uniformly makes it easier to support and will give more consistent expectations even across storage platforms.

Second, many environments are large—which means people usually are not clicking and type through installation. Things are templatized, or sysprepped, or automated, etc. During or after the deployment their automation tools can configure things uniformly in accordance with their needs.

In short, it is like most things: give defaults to keep one-off installations simple (and decrease the risks from people who may not know exactly what they are doing), complete the installations without having to research a ton of settings that may not ultimately matter, and yet still provide experienced/advanced users, or automaters, ways to make changes.

Q. A number of white papers show the storage uplinks on different subnets. Is there a reason to have each link on its own subnet/VLAN or can they share a common segment?  

A. One reason is to reduce the number of logical paths. Especially in iSCSI, the number of paths can easily exceed supported limits if every port can talk to every target. Using multiple subnets or VLANs can drop this in half—and all you really use is logical redundancy, which doesn’t really matter. Also, if everything is in the same subnet or VLAN and someone make some kind of catastrophic change to that subnet or VLAN (or some device in it causes other issues), it is less likely to affect both subnets/VLANs. This gives some management “oops” protection. One change will bring all storage connectivity down.

 

 

 

The Ins and Outs of a Scale-Out File System Architecture

To meet the increasingly higher demand on both capacity and performance in large cluster computing environments, the storage subsystem has evolved toward a modular and scalable design. The scale-out file system has emerged as one implementation of the trend, in addition to scale-out object and block storage solutions.

What are the key principles when architecting a scale-out file system? Find out on February 28th when the SNIA Networking Storage Forum (NSF) hosts The Scale-Out File System Architecture Overview, a live webcast where we will present an overview of scale-out file system architectures. This presentation will provide an introduction to scale-out-file systems and cover:

  • General principles when architecting a scale-out file system storage solution
  • Hardware and software design considerations for different workloads
  • Storage challenges when serving a large number of compute nodes, e.g. name space consistency, distributed locking, data replication, etc.
  • Use cases for scale-out file systems
  • Common benchmark and performance analysis approaches

Register today to save your spot. We hope you will join us.

Networking Questions for Ethernet Scale-Out Storage

Unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast “Networking Requirements for Ethernet Scale-Out Storage.” Our audience had some insightful questions. As promised, our experts are answering them in this blog.

Q. How does scale-out flash storage impact Ethernet networking requirements?

A.  Scale-out flash storage demands higher bandwidth and lower latency than scale-out storage using hard drives. As noted in the webcast, it’s more likely to run into problems with TCP Incast and congestion, especially with older or slower switches. For this reason it’s more likely than scale-out HDD storage to benefit from higher bandwidth networks and modern datacenter Ethernet solutions–such as RDMA, congestion management, and QoS features.

Q. What are your thoughts on NVMe-oF TCP/IP and availability?

A.  The NVMe over TCP specification was ratified in November 2018, so it is a new standard. Some vendors already offer this as a pre-standard implementation. We expect that several of the scale-out storage vendors who support block storage will support NVMe over TCP as a front-end (client connection) protocol in the near future. It’s also possible some vendors will use NVMe over TCP as a back-end (cluster) networking protocol.

Q. Which is better: RoCE or iWARP?

A.  SNIA is vendor-neutral and does not directly recommend one vendor or protocol over another. Both are RDMA protocols that run on Ethernet, are supported by multiple vendors, and can be used with Ethernet-based scale-out storage. You can learn more about this topic by viewing our recent Great Storage Debate webcast “RoCE vs. iWARP” and checking out the Q&A blog from that webcast.

Q. How would you compare use of TCP/IP and Ethernet RDMA networking for scale-out storage?

A.  Ethernet RDMA can improve the performance of Ethernet-based scale-out storage for the front-end (client) and/or back-end (cluster) networks. RDMA generally offers higher throughput, lower latency, and reduced CPU utilization when compared to using normal (non-RDMA) TCP/IP networking. This can lead to faster storage performance and leave more storage node CPU cycles available for running storage software. However, high-performance RDMA requires choosing network adapters that support RDMA offloads and in some cases requires modifications to the network switch configurations. Some other types of non-Ethernet storage networking also offer various levels of direct memory access or networking offloads that can provide high-performance networking for scale-out storage.

Q. How does RDMA networking enable latency reduction?

A. RDMA typically bypasses the kernel TCP/IP stack and offloads networking tasks from the CPU to the network adapter. In essence it reduces the total path length which consequently reduces the latency. Most RDMA NICs (rNICs) perform some level of networking acceleration in an ASIC or FPGA including retransmissions, reordering, TCP operations flow control, and congestion management.

Q. Do all scale-out storage solutions have a separate cluster network?

A.  Logically all scale-out storage systems have a cluster network. Sometimes it runs on a physically separate network and sometimes it runs on the same network as the front-end (client) traffic. Sometimes the client and cluster networks use different networking technologies.

 

 

 

 

Virtualization and Storage Networking Best Practices from the Experts

Ever make a mistake configuring a storage array or wonder if you’re maximizing the value of your virtualized environment? With all the different storage arrays and connectivity protocols available today, knowing best practices can help improve operational efficiency and ensure resilient operations. That’s why the SNIA Networking Storage Forum is kicking off 2019 with a live webcast “Virtualization and Storage Networking Best Practices.”

In this webcast, Jason Massae from VMware and Cody Hosterman from Pure Storage will share insights and lessons learned as reported by VMware’s storage global services by discussing:

  • Common mistakes when setting up storage arrays
  • Why iSCSI is the number one storage configuration problem
  • Configuring adapters for iSCSI or iSER
  • How to verify your PSP matches your array requirements
  • NFS best practices
  • How to maximize the value of your array and virtualization
  • Troubleshooting recommendations

Register today to join us on January 17th. Whether you’ve been configuring storage for VMs for years or just getting started, we think you will pick up some useful tips to optimize your storage networking infrastructure.

RDMA for Persistent Memory over Fabrics – FAQ

In our most recent SNIA Networking Storage Forum (NSF) webcast Extending RDMA for Persistent Memory over Fabrics, our expert speakers, Tony Hurson and Rob Davis outlined extensions to RDMA protocols that confirm persistence and additionally can order successive writes to different memories within the target system. Hundreds of people have seen the webcast and have given it a 4.8 rating on a scale of 1-5! If you missed, it you can watch it on-demand at your convenience. The webcast slides are also available for download.

We had several interesting questions during the live event. Here are answers from our presenters:

Q. For the RDMA Message Extensions, does the client have to qualify a WRITE completion with only Atomic Write Response and not with Commit Response?
A. If an Atomic Write must be confirmed persistent, it must be followed by an additional Commit Request. Built-in confirmation of persistence was dropped from the Atomic Request because it adds latency and is not needed for some application streams.

Q. Why do you need confirmation for writes? From my point of view, the only thing required is ordering.

A. Agreed, but only if the entire target system is non-volatile! Explicit confirmation of persistence is required to cover the “gap” between the Write completing in the network and the data reaching persistence at the target.

Q. Where are these messages being generated? Does NIC know when the data is flushed or committed?

A. They are generated by the application that has reserved the memory window on the remote node. It can write using RDMA writes to that window all it wants, but to guarantee persistence it must send a flush.

Q. How is RPM presented on the client host?

A. The application using it sees it as memory it can read and write.

Q. Does this RDMA commit response implicitly ACK any previous RDMA sends/writes to same or different MR?

A. Yes, the new Commit (and Verify and Atomic Write) Responses have the same acknowledgement coalescing properties as the existing Read Response. That is, a Commit Response is explicit (non-coalesced); but it coalesces/implies acknowledgement of prior Write and/or Send Requests.

Q. Does this one still have the current RMDA Write ACK?

A. See previous general answer. Yes. A Commit Response implicitly acknowledges prior Writes.

Q. With respect to the Race Hazard explained to show the need for explicit completion response, wouldn’t this be the case even with a non-volatile Memory, if the data were to be stored in non-volatile memory. Why is this completion status required only on the non-volatile case?

A. Most networked applications that write over the network to volatile memory do not require explicit confirmation at the writer endpoint that data has actually reached there. If so, additional handshake messages are usually exchanged between the endpoint applications. On the other hand, a writer to PERSISTENT memory across a network almost always needs assurance that data has reached persistence, thus the new extension.

Q. What if you are using multiple RNIC with multiple ports to multiple ports on a 100Gb fabric for server-to-server RDMA? How is order kept there…by CPU software or ‘NIC teaming plus’?

A. This would depend on the RNIC vendor and their implementation.

Q. What is the time frame for these new RDMA messages to be available in verbs API?

A. This depends on the IBTA standards approval process which is not completely predicable, roughly sometime the first half of 2019.

Q. Where could I find more details about the three new verbs (what are the arguments)?

A. Please poll/contact/Google the IBTA and IETF organizations towards the end of calendar year 2018, when first drafts of the extension documents are expected to be available.

Q. Do you see this technology used in a way similar to Hyperconverged systems now use storage or could you see this used as a large shared memory subsystem in the network?

A. High-speed persistent memory, in either NVDIMM or SSD form factor, has enormous potential in speeding up hyperconverged write replication. It will require however substantial re-write of such storage stacks, moving for example from traditional three-phase block storage protocols (command/data/response) to an RDMA write/confirm model. More generally, the RDMA extensions are useful for distributed shared PERSISTENT memory applications.

Q. What would be the most useful performance metrics to debug performance issues in such environments?

A. Within the RNIC, basic counts for the new message types would be a baseline. These plus total stall times encountered by the RNIC awaiting Commit Responses from the local CPU subsystem would be useful. Within the CPU platform basic counts of device write and read requests targeting persistent memory would be useful.

Q. Do all the RDMA NIC’s have to update their firmware to support this new VERB’s? What is the expected performance improvement with the new Commit message?

A. Both answers would depend on the RNIC vendor and their implementation.

Q. Will the three new verbs be implemented in the RNIC alone, or will they require changes in other places (processor, memory controllers, etc.)?

A. The new Commit request requires the CPU platform and its memory controllers to confirm that prior write data has reached persistence. The new Atomic Write and Verify messages however may be executed entirely within the RNIC.

Q. What about the future of NVMe over TCP – this would be much simpler for people to implement. Is this a good option?

A. Again this would depend on the NIC vendor and their implementation. Different vendors have implemented various tests for performance. It is recommended that readers do their own due diligence.