Experts Answer Virtualization and Storage Networking Questions

The SNIA Networking Storage Forum (NSF) kicked off the New Year with a live webcast “Virtualization and Storage Networking Best Practices.” We guessed it would be popular and boy, were we right! Nearly 1,000 people registered to attend. If you missed out, it’s available on-demand. You can also download a copy of the webcast slides.

Our experts, Jason Massae from VMware and Cody Hosterman from Pure Storage, did a great job sharing insights and lessons learned on best practices on configuration, troubleshooting and optimization. Here’s the blog we promised at the live event with answers to all the questions we received.

Q. What is a fan-in ratio?

A. fan-in ratio (sometimes also called an “oversubscription ratio”) refers to the number of hosts links related to the links to a storage device. Using a very simple example can help understand the principle:

Say you have a Fibre Channel network (the actual speed or protocol does not matter for our purposes here). You have 60 hosts, each with a 4GFC link, going through a series of switches, connected to a storage device, just like in the diagram below:

This is a 10:1 oversubscription ratio; the “fan-in” part refers to the number of host bandwidth in comparison to the target bandwidth.

Block storage protocols like Fibre Channel, FCoE, and iSCSI have much lower fan-in ratios than file storage protocols such as NFS, and deterministic storage protocols (like FC) have lower than non-deterministic (like iSCSI). The true arbiter of what the appropriate fan-in ratio is determined by the application. Highly transactional applications, such as databases, often require very low ratios.

Q. If there’s a mismatch in the MTU between server and switch will the highest MTU between the two get negotiated or else will a mismatch persist?

A. No, the lowest value will be used, but there’s a caveat to this. The switch and the network in the path(s) can have MTU set higher than the hosts, but the hosts cannot have a higher MTU than the network. For example, if your hosts are set to 1500 and all the network switches in the path are set to 9k, then the hosts will communicate over 1500.

However, what can, and usually does, happen is someone sets the host(s) or target(s) to 9k but never changes the rest of the network. When this happens, you end up with unreliable or even loss of connectivity. Take a look at the graphic below:

A large ball can’t fit through a hole smaller than itself. Consequently, a 9k frame cannot pass through a 1500 port. Unless you and your network admin both understand and use jumbo frames, there’s no reason to implement in your environment.

Q. Can you implement port binding when using two NICs for all traffic including iSCSI?

A. Yes you can use two NICs for all traffic including iSCSI, many organizations use this configuration. The key to this is making sure you have enough bandwidth to support  all  the traffic/ IO that will use those NICs. You should, at the very least, use 10Gb NICs faster if possible.

Remember, now all your management, VM and storage traffic are using the same network devices. If you don’t plan accordingly, everything can be impacted in your virtual environment. There are some hypervisors capable of granular network controls to manage which type of traffic uses which NIC, certain failover details and allow setting QoS limits on the different traffic types. Subsequently, you can ensure storage traffic gets the required bandwidth or priority in a dual NIC configuration.

Q. I’ve seen HBA drivers that by default set their queue depth to 128 but the target port only handles 512. So two HBAs would saturate one target port which is undesirable. Why don’t the HBA drivers ask what the depth should be at installation?

A. There are a couple of possible reasons for this. One is that many do not know what it even means, and are likely to make a poor decision (higher is better, right?!). So vendors tend to set these things at defaults and let people change them if needed—and usually that means they have purpose to change them. Furthermore, every storage array handles these things differently, and that can make it more difficult to size these things. It is usually better to provide consistency—having things set uniformly makes it easier to support and will give more consistent expectations even across storage platforms.

Second, many environments are large—which means people usually are not clicking and type through installation. Things are templatized, or sysprepped, or automated, etc. During or after the deployment their automation tools can configure things uniformly in accordance with their needs.

In short, it is like most things: give defaults to keep one-off installations simple (and decrease the risks from people who may not know exactly what they are doing), complete the installations without having to research a ton of settings that may not ultimately matter, and yet still provide experienced/advanced users, or automaters, ways to make changes.

Q. A number of white papers show the storage uplinks on different subnets. Is there a reason to have each link on its own subnet/VLAN or can they share a common segment?  

A. One reason is to reduce the number of logical paths. Especially in iSCSI, the number of paths can easily exceed supported limits if every port can talk to every target. Using multiple subnets or VLANs can drop this in half—and all you really use is logical redundancy, which doesn’t really matter. Also, if everything is in the same subnet or VLAN and someone make some kind of catastrophic change to that subnet or VLAN (or some device in it causes other issues), it is less likely to affect both subnets/VLANs. This gives some management “oops” protection. One change will bring all storage connectivity down.

 

 

 

The Ins and Outs of a Scale-Out File System Architecture

To meet the increasingly higher demand on both capacity and performance in large cluster computing environments, the storage subsystem has evolved toward a modular and scalable design. The scale-out file system has emerged as one implementation of the trend, in addition to scale-out object and block storage solutions.

What are the key principles when architecting a scale-out file system? Find out on February 28th when the SNIA Networking Storage Forum (NSF) hosts The Scale-Out File System Architecture Overview, a live webcast where we will present an overview of scale-out file system architectures. This presentation will provide an introduction to scale-out-file systems and cover:

  • General principles when architecting a scale-out file system storage solution
  • Hardware and software design considerations for different workloads
  • Storage challenges when serving a large number of compute nodes, e.g. name space consistency, distributed locking, data replication, etc.
  • Use cases for scale-out file systems
  • Common benchmark and performance analysis approaches

Register today to save your spot. We hope you will join us.

Networking for Hyperconvergence

“Why can’t I add a 33rd node?”

One of the great advantages of Hyperconverged infrastructures (also known as “HCI”) is that, relatively speaking, they are extremely easy to set up and manage. In many ways, they’re the “Happy Meals” of infrastructure, because you have compute and storage in the same box. All you need to do is add networking.

In practice, though, many consumers of HCI have found that the “add networking” part isn’t quite as much of a no-brainer as they thought it would be. Because HCI hides a great deal of the “back end” communication, it’s possible to severely underestimate or misunderstand the requirements necessary to run a seamless environment. At some point, “just add more nodes” becomes a more difficult proposition.

That’s why the SNIA Networking Storage Forum (NSF) is hosting a live webcast “Networking Requirements for Hyperconvergence” on February 5, 2019. At this webcast, we’re going to take a look behind the scenes, peek behind the GUI, so to speak. We’ll be talking about what goes on back there, and shine the light behind the bezels to see:

  • The impact of metadata on the network
  • What happens as we add additional nodes
  • How to right-size the network for growth
  • Networking best practices to make your HCI work better
  • And more…

Now, not all HCI environments are created equal, so we’ll say in advance that your mileage will vary. However, understanding some basic concepts of how storage networking impacts HCI performance may be particularly useful when planning your HCI environment, or contemplating whether or not it is appropriate for your situation in the first place.

Register here to save your spot for February 5th. Our experts will be on hand to answer your questions.

This webcast is the second installment of our Storage Networking series. Our first was “Networking Requirements for Ethernet Scale-Out Storage.” It’s available on-demand as are all our educational webcasts. I encourage you to peruse the more than 60 vendor-neutral presentations is the NSF webcast library at your convenience.

When NVMe™ over Fabrics Meets TCP

In the storage world, NVMe™ is arguably the hottest thing going right now. Go to any storage conference – either vendor-related or vendor-neutral, and you’ll see NVMe as the latest and greatest innovation. It stands to reason, then, that when you want to run NVMe over a  network, you must understand NVMe over Fabrics (NVMe-oF). Meanwhile, TCP is by far the most popular networking transport protocol both for storage and non-storage traffic.

TCP – the long-standing mainstay of networking – is the newest transport technology to be approved by the NVM Express ® organization, enabling NVMe/TCP. This can mean really good things for storage and storage networking – but what are the tradeoffs?

With any new technology, though, there can still be a bit of confusion.  No  technology is a panacea; and with any new development there will always be a need to gauge where it is best used (like a tool in a toolbox).

Learn more on January 22nd when the SNIA Networking Storage Forum hosts a live webcast, What NVMe™/TCP Means for Networked Storage. In this webcast, we’ve brought together the lead author of the NVMe/TCP specification, Sagi Grimberg, and J. Metz, member of the SNIA and NVMe Boards of Directors, to discuss:

  • What is NVMe/TCP
  • How NVMe/TCP works
  • What are the trade-offs?
  • What should network administrators know?
  • What kind of expectations are realistic?
  • What technologies can make NVMe/TCP work better?
  • And more…

Obviously, we can’t cover the entire world of NVMe and TCP networking in an hour, but we  can  start to raise the questions – and approach the answers – that must be addressed in order to make informed decisions. Speaking of questions, bring yours. Sagi and J. will be answering them on the 22nd. Register today to save your spot.

 

 

Networking Questions for Ethernet Scale-Out Storage

Unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast “Networking Requirements for Ethernet Scale-Out Storage.” Our audience had some insightful questions. As promised, our experts are answering them in this blog.

Q. How does scale-out flash storage impact Ethernet networking requirements?

A.  Scale-out flash storage demands higher bandwidth and lower latency than scale-out storage using hard drives. As noted in the webcast, it’s more likely to run into problems with TCP Incast and congestion, especially with older or slower switches. For this reason it’s more likely than scale-out HDD storage to benefit from higher bandwidth networks and modern datacenter Ethernet solutions–such as RDMA, congestion management, and QoS features.

Q. What are your thoughts on NVMe-oF TCP/IP and availability?

A.  The NVMe over TCP specification was ratified in November 2018, so it is a new standard. Some vendors already offer this as a pre-standard implementation. We expect that several of the scale-out storage vendors who support block storage will support NVMe over TCP as a front-end (client connection) protocol in the near future. It’s also possible some vendors will use NVMe over TCP as a back-end (cluster) networking protocol.

Q. Which is better: RoCE or iWARP?

A.  SNIA is vendor-neutral and does not directly recommend one vendor or protocol over another. Both are RDMA protocols that run on Ethernet, are supported by multiple vendors, and can be used with Ethernet-based scale-out storage. You can learn more about this topic by viewing our recent Great Storage Debate webcast “RoCE vs. iWARP” and checking out the Q&A blog from that webcast.

Q. How would you compare use of TCP/IP and Ethernet RDMA networking for scale-out storage?

A.  Ethernet RDMA can improve the performance of Ethernet-based scale-out storage for the front-end (client) and/or back-end (cluster) networks. RDMA generally offers higher throughput, lower latency, and reduced CPU utilization when compared to using normal (non-RDMA) TCP/IP networking. This can lead to faster storage performance and leave more storage node CPU cycles available for running storage software. However, high-performance RDMA requires choosing network adapters that support RDMA offloads and in some cases requires modifications to the network switch configurations. Some other types of non-Ethernet storage networking also offer various levels of direct memory access or networking offloads that can provide high-performance networking for scale-out storage.

Q. How does RDMA networking enable latency reduction?

A. RDMA typically bypasses the kernel TCP/IP stack and offloads networking tasks from the CPU to the network adapter. In essence it reduces the total path length which consequently reduces the latency. Most RDMA NICs (rNICs) perform some level of networking acceleration in an ASIC or FPGA including retransmissions, reordering, TCP operations flow control, and congestion management.

Q. Do all scale-out storage solutions have a separate cluster network?

A.  Logically all scale-out storage systems have a cluster network. Sometimes it runs on a physically separate network and sometimes it runs on the same network as the front-end (client) traffic. Sometimes the client and cluster networks use different networking technologies.

 

 

 

 

Virtualization and Storage Networking Best Practices from the Experts

Ever make a mistake configuring a storage array or wonder if you’re maximizing the value of your virtualized environment? With all the different storage arrays and connectivity protocols available today, knowing best practices can help improve operational efficiency and ensure resilient operations. That’s why the SNIA Networking Storage Forum is kicking off 2019 with a live webcast “Virtualization and Storage Networking Best Practices.”

In this webcast, Jason Massae from VMware and Cody Hosterman from Pure Storage will share insights and lessons learned as reported by VMware’s storage global services by discussing:

  • Common mistakes when setting up storage arrays
  • Why iSCSI is the number one storage configuration problem
  • Configuring adapters for iSCSI or iSER
  • How to verify your PSP matches your array requirements
  • NFS best practices
  • How to maximize the value of your array and virtualization
  • Troubleshooting recommendations

Register today to join us on January 17th. Whether you’ve been configuring storage for VMs for years or just getting started, we think you will pick up some useful tips to optimize your storage networking infrastructure.

RDMA for Persistent Memory over Fabrics – FAQ

In our most recent SNIA Networking Storage Forum (NSF) webcast Extending RDMA for Persistent Memory over Fabrics, our expert speakers, Tony Hurson and Rob Davis outlined extensions to RDMA protocols that confirm persistence and additionally can order successive writes to different memories within the target system. Hundreds of people have seen the webcast and have given it a 4.8 rating on a scale of 1-5! If you missed, it you can watch it on-demand at your convenience. The webcast slides are also available for download.

We had several interesting questions during the live event. Here are answers from our presenters:

Q. For the RDMA Message Extensions, does the client have to qualify a WRITE completion with only Atomic Write Response and not with Commit Response?
A. If an Atomic Write must be confirmed persistent, it must be followed by an additional Commit Request. Built-in confirmation of persistence was dropped from the Atomic Request because it adds latency and is not needed for some application streams.

Q. Why do you need confirmation for writes? From my point of view, the only thing required is ordering.

A. Agreed, but only if the entire target system is non-volatile! Explicit confirmation of persistence is required to cover the “gap” between the Write completing in the network and the data reaching persistence at the target.

Q. Where are these messages being generated? Does NIC know when the data is flushed or committed?

A. They are generated by the application that has reserved the memory window on the remote node. It can write using RDMA writes to that window all it wants, but to guarantee persistence it must send a flush.

Q. How is RPM presented on the client host?

A. The application using it sees it as memory it can read and write.

Q. Does this RDMA commit response implicitly ACK any previous RDMA sends/writes to same or different MR?

A. Yes, the new Commit (and Verify and Atomic Write) Responses have the same acknowledgement coalescing properties as the existing Read Response. That is, a Commit Response is explicit (non-coalesced); but it coalesces/implies acknowledgement of prior Write and/or Send Requests.

Q. Does this one still have the current RMDA Write ACK?

A. See previous general answer. Yes. A Commit Response implicitly acknowledges prior Writes.

Q. With respect to the Race Hazard explained to show the need for explicit completion response, wouldn’t this be the case even with a non-volatile Memory, if the data were to be stored in non-volatile memory. Why is this completion status required only on the non-volatile case?

A. Most networked applications that write over the network to volatile memory do not require explicit confirmation at the writer endpoint that data has actually reached there. If so, additional handshake messages are usually exchanged between the endpoint applications. On the other hand, a writer to PERSISTENT memory across a network almost always needs assurance that data has reached persistence, thus the new extension.

Q. What if you are using multiple RNIC with multiple ports to multiple ports on a 100Gb fabric for server-to-server RDMA? How is order kept there…by CPU software or ‘NIC teaming plus’?

A. This would depend on the RNIC vendor and their implementation.

Q. What is the time frame for these new RDMA messages to be available in verbs API?

A. This depends on the IBTA standards approval process which is not completely predicable, roughly sometime the first half of 2019.

Q. Where could I find more details about the three new verbs (what are the arguments)?

A. Please poll/contact/Google the IBTA and IETF organizations towards the end of calendar year 2018, when first drafts of the extension documents are expected to be available.

Q. Do you see this technology used in a way similar to Hyperconverged systems now use storage or could you see this used as a large shared memory subsystem in the network?

A. High-speed persistent memory, in either NVDIMM or SSD form factor, has enormous potential in speeding up hyperconverged write replication. It will require however substantial re-write of such storage stacks, moving for example from traditional three-phase block storage protocols (command/data/response) to an RDMA write/confirm model. More generally, the RDMA extensions are useful for distributed shared PERSISTENT memory applications.

Q. What would be the most useful performance metrics to debug performance issues in such environments?

A. Within the RNIC, basic counts for the new message types would be a baseline. These plus total stall times encountered by the RNIC awaiting Commit Responses from the local CPU subsystem would be useful. Within the CPU platform basic counts of device write and read requests targeting persistent memory would be useful.

Q. Do all the RDMA NIC’s have to update their firmware to support this new VERB’s? What is the expected performance improvement with the new Commit message?

A. Both answers would depend on the RNIC vendor and their implementation.

Q. Will the three new verbs be implemented in the RNIC alone, or will they require changes in other places (processor, memory controllers, etc.)?

A. The new Commit request requires the CPU platform and its memory controllers to confirm that prior write data has reached persistence. The new Atomic Write and Verify messages however may be executed entirely within the RNIC.

Q. What about the future of NVMe over TCP – this would be much simpler for people to implement. Is this a good option?

A. Again this would depend on the NIC vendor and their implementation. Different vendors have implemented various tests for performance. It is recommended that readers do their own due diligence.

 

How Scale-Out Storage Changes Networking Demands

Scale-out storage is increasingly popular for Cloud, High-Performance Computing, Machine Learning, and certain Enterprise applications. It offers the ability to grow both capacity and performance at the same time and to distribute I/O workloads across multiple machines.

But unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. Clients often access multiple storage servers simultaneously; data typically replicates or migrates from one storage node to another; and metadata or management servers must stay in sync with each other as well as communicating with clients.

Due to these demands, traditional network architectures and speeds may not work well for scale-out storage, especially when it’s based on flash. That’s why the SNIA Networking Storage Forum (NSF) is hosting a live webcast “Networking Requirements for Scale-Out Storage” on November 14th. I hope you will join my NSF colleagues and me to learn about:

  • Scale-out storage solutions and what workloads they can address
  • How your network may need to evolve to support scale-out storage
  • Network considerations to ensure performance for demanding workloads
  • Key considerations for all flash scale-out storage solutions

Register today. Our NSF experts will be on hand to answer your questions.

Introducing the Networking Storage Forum

At SNIA, we are dedicated to staying on top of storage trends and technologies to fulfill our mission as a globally recognized and trusted authority for storage leadership, standards, and technology expertise. For the last several years, the Ethernet Storage Forum has been working hard to provide high quality educational and informational material related to all kinds of storage.

From our “Everything You Wanted To Know About Storage But Were Too Proud To Ask” series, to the absolutely phenomenal (and required viewing) “Storage Performance Benchmarking” series to the “Great Storage Debates” series, we’ve produced dozens of hours of material.

Technologies have evolved and we’ve come to a point where there’s a need to understand how these systems and architectures work – beyond just the type of wire that is used. Today, there are new systems that are bringing storage to completely new audiences. From scale-up to scale-out, from disaggregated to hyperconverged, RDMA, and NVMe-oF – there is more to storage networking than just your favorite transport.

For example, when we talk about NVMe™ over Fabrics, the protocol is broader than just one way of accomplishing what you need. When we talk about virtualized environments, we need to examine the nature of the relationship between hypervisors and all kinds of networks. When we look at “Storage as a Service,” we need to understand how we can create workable systems from all the tools at our disposal.

Bigger Than Our Britches

As I said, SNIA’s Ethernet Storage Forum has been working to bring these new technologies to the forefront, so that you can see (and understand) the bigger picture. To that end, we realized that we needed to rethink the way that our charter worked, to be even more inclusive of technologies that were relevant to storage and networking.

So…

Introducing the Networking Storage Forum. In this group we’re going to continue producing top-quality, vendor-neutral material related to storage networking solutions. We’ll be talking about:

  • Storage Protocols (iSCSI, FC, FCoE, NFS, SMB, NVMe-oF, etc.)
  • Architectures (Hyperconvergence, Virtualization, Storage as a Service, etc.)
  • Storage Best Practices
  • New and developing technologies

… and more!

Generally speaking, we’ll continue to do the same great work that we’ve been doing, but now our name more accurately reflects the breadth of work that we do.

We’re excited to launch this new chapter of the Forum. If you work for a vendor, are a systems integrator, university or someone who manages storage, we welcome you to join the NSF. We are an active group that honestly has a lot of fun. If you’re one of our loyal followers, we hope you will continue to keep track of what we’re doing. And if you’re new to this Forum, we encourage you to take advantage of the library of webcasts, white papers, and published articles that we have produced here. There’s a wealth of un-biased, educational information there, we don’t think you’ll find anywhere else!

If there’s something that you’d like to hear about – let us know! We are always looking to hear about headaches, concerns, and areas of confusion within the industry where we can shed some light. Stay current with all things NSF:

 

 

Oh What a Tangled Web We Weave: Extending RDMA for PM over Fabrics

For datacenter applications requiring low-latency access to persistent storage, byte-addressable persistent memory (PM) technologies like 3D XPoint and MRAM are attractive solutions. Network-based access to PM, labeled here Persistent Memory over Fabrics (PMoF), is driven by data scalability and/or availability requirements. Remote Direct Memory Access (RDMA) network protocols are a good match for PMoF, allowing direct RDMA data reads or writes from/to remote PM. However, the completion of an RDMA Write at the sending node offers no guarantee that data has reached persistence at the target.

Join the Networking Storage Forum (NSF) on October 25, 2018 for out next live webcast, Extending RDMA for Persistent Memory over Fabrics. In this webcast, we will outline extensions to RDMA protocols that confirm such persistence and additionally can order successive writes to different memories within the target system. Learn:

  • Why we can’t just treat PM just like traditional storage or volatile memory
  • What happens when you write to memory over RDMA
  • Which programming model and protocol changes are required for PMoF
  • How proposed RDMA extensions for PM would work

We believe this webcast will appeal to developers of low-latency and/or high-availability datacenter storage applications and be of interest to datacenter developers, administrators and users. I encourage you to register today. Our NSF experts will be on hand to answer you questions. We look forward to your joining us on October 25th.