Scale-Out File Systems FAQ

On February 28th, the SNIA Networking Storage Forum (NSF) took a look at what’s happening in Scale-Out File Systems. We discussed general principles, design considerations, challenges, benchmarks and more. If you missed the live webcast, it’s now available on-demand. We did not have time to answer all the questions we received at the live event, so here are answers to them all.

Q. Can scale-out file systems do erasure coding?

A. Yes. Erasure coding is a common method scale-out file systems use to improve resilience.
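
For readers new to the technique, below is a minimal Python sketch of its simplest form, single-parity (XOR) coding, where any one lost chunk can be rebuilt from the survivors. Real scale-out file systems generally use Reed-Solomon codes with configurable data/parity ratios; the chunk contents here are purely illustrative.

```python
# Minimal single-parity (XOR) erasure coding sketch. Production systems
# typically use Reed-Solomon codes that can survive multiple failures.

def xor_parity(chunks):
    """XOR equal-sized data chunks together into one parity chunk."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            parity[i] ^= byte
    return bytes(parity)

data = [b"AAAA", b"BBBB", b"CCCC"]     # chunks striped across three nodes
parity = xor_parity(data)              # parity stored on a fourth node

lost = data.pop(1)                     # one node (or disk) fails
rebuilt = xor_parity(data + [parity])  # XOR of survivors plus parity
assert rebuilt == lost                 # the missing chunk is recovered
```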

Q. How does one address the problem of a specific disk going down? Where does scale-out architecture provide redundancy?

A. Disk failures are typically handled by RAID software. Some scale-out software also uses multiple replicas to mitigate the impact of disk failures.
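
As a generic illustration of replica-based redundancy in a scale-out design (this is a sketch, not any particular product’s placement algorithm; the node names are hypothetical), the Python below uses rendezvous hashing to pin three replicas of each chunk to distinct nodes, so no single disk or node failure can lose every copy:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical
REPLICA_COUNT = 3

def replica_nodes(chunk_id, nodes=NODES, count=REPLICA_COUNT):
    """Rendezvous (highest-random-weight) hashing: rank every node by a
    hash of (chunk, node) and keep the top-ranked distinct nodes."""
    ranked = sorted(
        nodes,
        key=lambda node: hashlib.sha256(f"{chunk_id}:{node}".encode()).digest(),
        reverse=True,
    )
    return ranked[:count]

print(replica_nodes("file-0001/chunk-17"))  # three distinct nodes
```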

Q. Are there use cases where a hybrid of these two styles is needed?

A. Yes. For example, in some environments the foundation layer uses dedicated storage servers to form a large storage pool (the first style) and then exports LUNs or virtual disks to the compute nodes (either physical or virtual) that run the applications (the second style).

Q. Which scale-out file systems are available on Windows and Linux platforms?

A. Some scale-out file systems provide native client software for multiple platforms. Another approach is to use Samba to build SMB gateways that make the scale-out file system available to Windows computers.
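
As a sketch of the gateway approach (the share name and mount point are hypothetical), once a gateway node has mounted the scale-out file system locally, a Samba share definition along these lines can export it over SMB; a production deployment would also need authentication configured and, for multiple gateways, clustered Samba (CTDB):

```ini
# Fragment of smb.conf on a gateway node; the share name and path
# are hypothetical. "path" is the gateway's local mount of the
# scale-out file system.
[scaleout]
    path = /mnt/scaleoutfs
    read only = no
    browseable = yes
```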

Q. Is Amazon Elastic File System (EFS) on AWS a scale-out file system?

A. Please see:

https://docs.aws.amazon.com/efs/latest/ug/performance.html

“Amazon EFS file systems are distributed across an unconstrained number of storage servers, enabling file systems to grow elastically to petabyte scale and allowing massively parallel access from Amazon EC2 instances to your data. The distributed design of Amazon EFS avoids the bottlenecks and constraints inherent to traditional file servers.”

Q. Where are the most cost-effective price/performance uses of NVMe?

A. NVMe can support very high IOPS as well as very high throughput. The best use case is to couple NVMe with high-performance storage software that does not itself limit the NVMe devices.
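
To make “software that does not limit the NVMe” concrete, here is a quick back-of-the-envelope calculation; the device and software figures are illustrative assumptions, not vendor specifications:

```python
# Illustrative arithmetic only; the IOPS figure is an assumption, not a spec.
iops = 1_000_000           # e.g., a fast NVMe SSD doing 4 KiB random reads
block_bytes = 4 * 1024

gbits_per_sec = iops * block_bytes * 8 / 1e9
print(f"{gbits_per_sec:.1f} Gb/s")   # ~32.8 Gb/s from a single device

# If the storage software could only drive, say, 200,000 IOPS per CPU core,
# it would take about 5 cores just to keep one such drive busy, which is why
# pairing NVMe with efficient storage software matters.
```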


The Ins and Outs of a Scale-Out File System Architecture

To meet the increasingly higher demand on both capacity and performance in large cluster computing environments, the storage subsystem has evolved toward a modular and scalable design. The scale-out file system has emerged as one implementation of the trend, in addition to scale-out object and block storage solutions.

What are the key principles when architecting a scale-out file system? Find out on February 28th when the SNIA Networking Storage Forum (NSF) hosts The Scale-Out File System Architecture Overview, a live webcast where we will present an overview of scale-out file system architectures. This presentation will provide an introduction to scale-out file systems and cover:

  • General principles when architecting a scale-out file system storage solution
  • Hardware and software design considerations for different workloads
  • Storage challenges when serving a large number of compute nodes, e.g., namespace consistency, distributed locking, data replication, etc.
  • Use cases for scale-out file systems
  • Common benchmark and performance analysis approaches

Register today to save your spot. We hope you will join us.

Networking Questions for Ethernet Scale-Out Storage

Unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast “Networking Requirements for Ethernet Scale-Out Storage.” Our audience had some insightful questions. As promised, our experts are answering them in this blog.

Q. How does scale-out flash storage impact Ethernet networking requirements?

A. Scale-out flash storage demands higher bandwidth and lower latency than scale-out storage using hard drives. As noted in the webcast, it’s more likely to run into problems with TCP Incast and congestion, especially with older or slower switches. For this reason it’s more likely than scale-out HDD storage to benefit from higher-bandwidth networks and modern datacenter Ethernet solutions such as RDMA, congestion management, and QoS features.

Q. What are your thoughts on NVMe-oF TCP/IP and availability?

A.  The NVMe over TCP specification was ratified in November 2018, so it is a new standard. Some vendors already offer this as a pre-standard implementation. We expect that several of the scale-out storage vendors who support block storage will support NVMe over TCP as a front-end (client connection) protocol in the near future. It’s also possible some vendors will use NVMe over TCP as a back-end (cluster) networking protocol.

Q. Which is better: RoCE or iWARP?

A.  SNIA is vendor-neutral and does not directly recommend one vendor or protocol over another. Both are RDMA protocols that run on Ethernet, are supported by multiple vendors, and can be used with Ethernet-based scale-out storage. You can learn more about this topic by viewing our recent Great Storage Debate webcast “RoCE vs. iWARP” and checking out the Q&A blog from that webcast.

Q. How would you compare use of TCP/IP and Ethernet RDMA networking for scale-out storage?

A.  Ethernet RDMA can improve the performance of Ethernet-based scale-out storage for the front-end (client) and/or back-end (cluster) networks. RDMA generally offers higher throughput, lower latency, and reduced CPU utilization when compared to using normal (non-RDMA) TCP/IP networking. This can lead to faster storage performance and leave more storage node CPU cycles available for running storage software. However, high-performance RDMA requires choosing network adapters that support RDMA offloads and in some cases requires modifications to the network switch configurations. Some other types of non-Ethernet storage networking also offer various levels of direct memory access or networking offloads that can provide high-performance networking for scale-out storage.

Q. How does RDMA networking enable latency reduction?

A. RDMA typically bypasses the kernel TCP/IP stack and offloads networking tasks from the CPU to the network adapter. In essence, it shortens the total path length, which reduces latency. Most RDMA NICs (rNICs) perform some level of networking acceleration in an ASIC or FPGA, including retransmissions, reordering, TCP operations, flow control, and congestion management.
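
As a rough illustration of the path-length argument (the microsecond figures below are assumptions chosen for the example, not measurements of any product):

```python
# Toy one-way latency model; every number is an illustrative assumption.
wire_and_switch_us = 2.0    # propagation + switching, same on both paths

kernel_tcp_stack_us = 15.0  # syscalls, buffer copies, interrupts, TCP work
rdma_bypass_us = 1.0        # user-space verbs; the rNIC runs the protocol

print(f"kernel TCP/IP: {wire_and_switch_us + kernel_tcp_stack_us:.0f} us")
print(f"RDMA:          {wire_and_switch_us + rdma_bypass_us:.0f} us")
# The wire time is fixed; RDMA wins by removing host software from the path.
```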

Q. Do all scale-out storage solutions have a separate cluster network?

A. Logically, all scale-out storage systems have a cluster network. Sometimes it runs on a physically separate network and sometimes it runs on the same network as the front-end (client) traffic. Sometimes the client and cluster networks use different networking technologies.

How Scale-Out Storage Changes Networking Demands

Scale-out storage is increasingly popular for Cloud, High-Performance Computing, Machine Learning, and certain Enterprise applications. It offers the ability to grow both capacity and performance at the same time and to distribute I/O workloads across multiple machines.

But unlike traditional local or scale-up storage, scale-out storage imposes different and more intense workloads on the network. Clients often access multiple storage servers simultaneously; data typically replicates or migrates from one storage node to another; and metadata or management servers must stay in sync with each other while also communicating with clients.

Due to these demands, traditional network architectures and speeds may not work well for scale-out storage, especially when it’s based on flash. That’s why the SNIA Networking Storage Forum (NSF) is hosting a live webcast “Networking Requirements for Scale-Out Storage” on November 14th. I hope you will join my NSF colleagues and me to learn about:

  • Scale-out storage solutions and what workloads they can address
  • How your network may need to evolve to support scale-out storage
  • Network considerations to ensure performance for demanding workloads
  • Key considerations for all flash scale-out storage solutions

Register today. Our NSF experts will be on hand to answer your questions.

Congestion Control in New Storage Architectures Q&A

We had a great response to last week’s webcast “Controlling Congestion in New Storage Architectures” where we introduced CONGA, a new congestion control mechanism that is the result of research at Stanford University. We had many good questions at the live event and have compiled answers to all of them in this blog. If you think of additional questions, please feel free to comment here and we’ll get back to you as soon as possible.

Q. Isn’t the leaf/spine network just a Clos network? Since the network has loops, isn’t there a deadlock hazard if pause frames are sent within the network?

A. Clos/spine-leaf networks are based on routing, which has its own loop prevention (TTL and RPF checks).

Q. Why isn’t the congestion metric subject to the same delays as the rest of the data traffic?

A. It is, but since this is done in the data plane at 40/100 GbE speeds within a data center fabric, it can be done in near real time, without the delay of sending the metric to a centralized control plane.
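
To sketch the mechanism (simplified from the CONGA paper; the congestion values are made up for illustration): each source leaf tracks, per uplink, both the congestion it measures locally and the congestion reported back from the destination leaf, and steers each new flowlet onto the path whose worst-case congestion is lowest.

```python
# Simplified CONGA-style path selection at a source leaf.
# Congestion values (0.0 = idle, 1.0 = saturated) are illustrative.
local_congestion  = {"uplink-1": 0.20, "uplink-2": 0.70, "uplink-3": 0.10}
remote_congestion = {"uplink-1": 0.50, "uplink-2": 0.10, "uplink-3": 0.60}

def pick_uplink():
    """Choose the uplink whose worst-case congestion, i.e. the max of
    the local and remote metrics, is the smallest."""
    return min(
        local_congestion,
        key=lambda up: max(local_congestion[up], remote_congestion[up]),
    )

print(pick_uplink())   # -> "uplink-1" (worst-case 0.50 beats 0.70 and 0.60)
```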

Q. Are packets dropped in certain cases?

A. Yes. CONGA does not eliminate packet drops; packets can still be dropped, for example when buffers overflow under severe congestion.

Q. Why is there no TCP reset? Is it because the Ethernet layer does the flowlet retransmission before TCP has to do a resend?

A. There are many reasons for a TCP reset; CONGA does not prevent them, but it can help with how the application responds to a loss. If a flowlet is lost, the impact on the application is smaller, because only the lost flowlet is resent rather than the full TCP connection potentially being reset.
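
For readers new to the term: a flowlet is a burst of packets within a flow, separated from the next burst by an idle gap long enough that the bursts can take different paths without arriving out of order. The splitting rule is simple; the threshold and timestamps below are illustrative:

```python
# Split a flow's packet arrival times into flowlets. The gap threshold
# is a tunable; the 500-microsecond value here is only an example.
FLOWLET_GAP_S = 0.0005

def split_flowlets(packet_times):
    flowlets, current = [], [packet_times[0]]
    for prev, now in zip(packet_times, packet_times[1:]):
        if now - prev > FLOWLET_GAP_S:
            flowlets.append(current)   # idle gap exceeded: new flowlet
            current = []
        current.append(now)
    flowlets.append(current)
    return flowlets

times = [0.0, 0.0001, 0.0002, 0.001, 0.0011, 0.005]   # seconds, illustrative
print(len(split_flowlets(times)))   # -> 3 flowlets
```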

Q. Is CONGA on an RFC standard track?

A. CONGA is based on research done at Stanford. It is not currently an RFC.

The research information can be found in the SIGCOMM 2014 paper “CONGA: Distributed Congestion-Aware Load Balancing for Datacenters.”

Q. How does ECN fit into CONGA?

A. ECN can be used in conjunction with CONGA, as long as the host/networking hardware supports it.