Questions on the 2017 Ethernet Roadmap for Networked Storage

Last month, experts from Dell EMC, Intel, Mellanox and Microsoft convened to take a look ahead at what’s in store for Ethernet Networked Storage this year. It was a fascinating discussion of anticipated updates. If you missed the webcast, “2017 Ethernet Roadmap for Networked Storage,” it’s now available on-demand. We had a lot of great questions during the live event and we ran out of time to address them all, so here are answers from our speakers.

Q. What’s the future of twisted pair cable? What is the new speed being developed with twisted pair cable?

A. By twisted pair I assume you mean UTP Cat 5, 6, 7, etc. The problem going forward with high-speed signaling is that UTP stands for Unshielded Twisted Pair, and the signal radiates off the wire very quickly. At 25G and 50G this is a real problem and forces the line-card end to have a big, power-consuming and costly chip to dig the signal out of the noise. Anything can be done, but at what cost? 25GBASE-T is being developed, but the reach is somewhere around 30 meters. Cost, size and power consumption are all going up while reach is going down – all opposite to the trends in modern high-speed data centers. BASE-T will always have a place for those applications that don’t need the faster rates.

Q. What do you think of RCx standards and cables?

A. So far, Amphenol, JAE and Volex are the suppliers who are members of the MSA. Very few companies have announced or discussed RCx. In addition to a smaller connector, not having an EEPROM eliminates steps in cable assembly manufacturing, which helps lower the cost compared to traditional DAC cabling. The biggest advantage of RCx is that it can help eliminate bulky breakout cables within a rack, since a single RCx4 receptacle can accept a number of combinations of single-lane, 2-lane or 4-lane cable with the same connector on the host. RCx ports can be connected to existing QSFP/SFP infrastructure with appropriate cabling. It remains to be seen, however, whether it becomes a standard and popular product or remains a custom solution.

Q. How long does AOC normally reach, 3m or 30m?  

A. AOCs pick up where DAC drops off, at about 3m. The most popular reaches are 3, 5, and 10m, and volume drops rapidly beyond 15, 20, 30, 50, and 100m. We are seeing Ethernet-connected HDDs at 2.5GbE x 2 ports, and Ceph touting this solution. This seems to play well into the 25/50/100GbE standards with the massive parallelism possible.

Q. How do we scale PCIe lanes to support NVMe drives to scale, and to replace the capacity we see with storage arrays populated completely with HDDs?

A. With the advent of PCIe Gen 4, the per-lane rate of PCIe is going from 8 GT/s to 16 GT/s. Scaling of PCIe is already happening.

Q. How many NVMe drives does it take to saturate 100GbE?

A.  3 or 4 depending on individual drives.
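
For a rough sense of where “3 or 4” comes from, here is a back-of-the-envelope sketch; the per-drive throughput figure is an assumption typical of PCIe Gen3 x4 NVMe SSDs of that era, not a quoted spec.

```python
# Back-of-the-envelope check: drives needed to saturate a 100GbE link.
link_gbps = 100                      # 100GbE line rate, gigabits per second
link_gbytes_per_s = link_gbps / 8    # ~12.5 GB/s of raw link bandwidth

drive_gbytes_per_s = 3.2             # assumed sequential read per NVMe drive (GB/s)

drives = link_gbytes_per_s / drive_gbytes_per_s
print(f"~{drives:.1f} drives to fill 100GbE")   # ~3.9, i.e. 3 or 4 drives
```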

Q. How about the reliability of Ethernet? A lot of people think Fibre Channel has better reliability than Ethernet.

A. It’s true that Fibre Channel is a lossless protocol. Ethernet frames are sometimes dropped by the switch; however, networked storage using TCP has built-in error correction. TCP was designed at a time when networks were far less robust than today, and Ethernet networks these days are far more reliable.

Q. Do the 2.5GbE and 5GbE refer to the client side Ethernet port or the server Ethernet port?

A.  It can exist on both the client side and the server side Ethernet port.

Q. Are there any 25GbE or 50GbE NICs available on the market?

A. Yes, there are many on the market from a number of vendors, including Dell, Mellanox, Intel, and others.

Q.  Commonly used Ethernet speeds are either 10GbE or 40GbE. Do the new 25GbE and 50GbE require new switches?

A. Yes, you need new switches to support 25GbE and 50GbE. This is, in part, because the SerDes rate per lane at 25 and 50GbE is 25Gb/s, which is not supported by the 10 and 40GbE switches with a maximum SerDes rate of 10Gb/s.
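
The arithmetic behind that answer is simple lane math; here is a small sketch using the lane counts and per-lane rates discussed above (nothing switch-specific is assumed).

```python
# Ethernet speeds expressed as (SerDes lanes, per-lane rate in Gb/s).
speeds = {
    "10GbE":  (1, 10),
    "40GbE":  (4, 10),
    "25GbE":  (1, 25),
    "50GbE":  (2, 25),
    "100GbE": (4, 25),
}
for name, (lanes, rate) in speeds.items():
    print(f"{name}: {lanes} x {rate} Gb/s = {lanes * rate} Gb/s")

# A 10 Gb/s SerDes cannot carry a 25 Gb/s lane, hence the new switch silicon.
# The same 25 Gb/s lanes also explain the next answer: four lanes give 100GbE
# rather than 40GbE, i.e. 100/40 = 2.5x the bandwidth per port.
```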

Q.  With a certain number of SerDes coming off the switch ASIC, which would you prefer to use 100G or 40G if assuming both are at the same cost?

A.  Certainly 100G. You get 2.5X the bandwidth for the same cost under the assumptions made in the question.

Q.  Are there any 100G/200G/400G switches and modulation available now?

A. There are many 100G Ethernet switches available on the market today, including Dell’s Z9100 and S6100, Mellanox’s SN2700, and a number of others. The 200G and 400G IEEE standards are not yet complete. I’m sure all switch vendors will come out with switches supporting those rates in the future.

Q. What does lambda mean?

A.  Lambda is the symbol for wavelength.

Q. Is the 50GbE standard ratified now?

A. IEEE 802.3 just recently started development of a 50GbE standard based upon a single-lane 50 Gb/s physical layer interface. That standard is probably about 2 years away from ratification. The 25G Ethernet Consortium has a ratified specification for 50GbE based upon a dual-lane 25 Gb/s physical layer interface.

Q. Are there any parallel options for using 2 or 4 lanes like in 128GFCp?

A.  Many Ethernet specifications are based upon parallel options. 10GBASE-T is based upon 4 twisted-pairs of copper cabling. 100GBASE-SR4 is based upon 4 lanes (8 fibers) of multimode fiber. Even the industry MSA for 100G over CWDM4 is based upon four wavelengths on a duplex single-mode fiber. In some instances, the parallel option is based upon the additional medium (extra wires or fibers) but with fiber optics, parallel can be created by using different wavelengths that don’t interfere with each other.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

Ethernet Networked Storage – FAQ

At our SNIA Ethernet Storage Forum (ESF) webcast “Re-Introduction to Ethernet Networked Storage,” we provided a solid foundation on Ethernet networked storage, the move to higher speeds, challenges, use cases and benefits. Here are answers to the questions we received during the live event.

Q.  Within the iWARP protocol there is a layer called MPA (Marker PDU Aligned Framing for TCP) inserted for storage applications. What is the point of this protocol?

A. MPA is an adaptation layer between the iWARP Direct Data Placement Protocol and TCP/IP. It provides framing and CRC protection for Protocol Data Units.   MPA enables packing of multiple small RDMA messages into a single Ethernet frame.   It also enables an iWARP NIC to place frames received out-of-order (instead of dropping them), which can be beneficial on best-effort networks. More detail can be found in IETF RFC 5044 and IETF RFC 5041.

Q. What is the API for RDMA network IPC?

A. The general API for RDMA is called verbs. The OpenFabrics Verbs Working Group oversees the development of verbs definition and functionality in the OpenFabrics Software (OFS) code. You can find the training content from the OpenFabrics Alliance here. General information about RDMA for Ethernet (RoCE) is available at the InfiniBand Trade Association website. Information about the Internet Wide Area RDMA Protocol (iWARP) can be found at the IETF: RFC 5040, RFC 5041, RFC 5042, RFC 5043, RFC 5044.

Q. RDMA requires TCP/IP (iWARP), InfiniBand, or RoCE to operate over with respect to NVMe over Fabrics. Therefore, what are the advantages and disadvantages of iWARP vs. RoCE?

A. Both RoCE and iWARP support RDMA over Ethernet. iWARP uses TCP/IP while RoCE uses UDP/IP. Debating which one is better is beyond the scope of this webcast, but you can learn more by watching the SNIA ESF webcast, “How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics.”

Q. 100Gb Ethernet Optical Data Center solution?

A. 100Gb Ethernet optical interconnect products were first available around 2011 or 2012 in a 10x10Gb/s design (100GBASE-CR10 for copper, 100GBASE-SR10 for optical), which required thick cables and CXP or CFP MSA housings. These were generally used only for switch-to-switch links. Starting in late 2015, the more compact 4x25Gb/s design (using the QSFP28 form factor) became available in copper (DAC), optical cabling (AOC), and transceivers (100GBASE-SR4, 100GBASE-LR4, 100GBASE-PSM4, etc.). The optical transceivers allow 100GbE connectivity up to 100m, or 2km and 10km distances, depending on the type of transceiver and fiber used.

Q. Where is FCoE being used today?

A. FCoE is primarily used in blade server deployments where there could be contention for PCI slots and only one built-in NIC. These NICs typically support FCoE at 10Gb/s speeds, passing both FC and Ethernet traffic by connecting to a Top-of-Rack FCoE switch, which parses traffic to the respective fabrics (FC and Ethernet). However, it has not gained much acceptance outside of the blade server use case.

Q. Why did iSCSI start out mostly in lower-cost SAN markets?

A. When it first debuted, iSCSI packets were processed by software initiators, which consumed CPU cycles and showed higher latency than Fibre Channel. Achieving high performance with iSCSI required expensive NICs with iSCSI hardware acceleration, and iSCSI networks were typically limited to 100Mb/s or 1Gb/s while Fibre Channel was running at 4Gb/s. Fibre Channel is also a lossless protocol, while TCP/IP is lossy, which caused concerns for storage administrators. Now, however, iSCSI can run on 25, 40, 50 or 100Gb/s Ethernet with various types of TCP/IP acceleration or RDMA offloads available on the NICs.

Q. What are some of the differences between iSCSI and FCoE?

A. iSCSI runs SCSI protocol commands over TCP/IP (except iSER which is iSCSI over RDMA) while FCoE runs Fibre Channel protocol over Ethernet. iSCSI can run over layer 2 and 3 networks while FCoE is Layer 2 only. FCoE requires a lossless network, typically implemented using DCB (Data Center Bridging) Ethernet and specialized switches.

Q. You pointed out at least twice that people incorrectly predicted the end of Fibre Channel, but it didn’t happen. What makes you say Fibre Channel is actually going to decline this time?

A. Several things are different this time. First, Ethernet is now much faster than Fibre Channel instead of the other way around. Second, Ethernet networks now support lossless and RDMA options that were not previously available. Third, several new solutions–like big data, hyper-converged infrastructure, object storage, most scale-out storage, and most clustered file systems–do not support Fibre Channel. Fourth, none of the hyper-scale cloud implementations use Fibre Channel and most private and public cloud architects do not want a separate Fibre Channel network–they want one converged network, which is usually Ethernet.

Q. Which storage protocols support RDMA over Ethernet?

A. The Ethernet RDMA options for storage protocols are iSER (iSCSI Extensions for RDMA), SMB Direct, NVMe over Fabrics, and NFS over RDMA. There are also storage solutions that use proprietary protocols supporting RDMA over Ethernet.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

2017 Ethernet Roadmap for Networked Storage

When SNIA’s Ethernet Storage Forum (ESF) last looked at the Ethernet Roadmap for Networked Storage in 2015, we anticipated a world of rapid change. The list of advances in 2016 is nothing short of amazing:

  • New adapters, switches, and cables have been launched supporting 25, 50, and 100Gb Ethernet speeds including support from major server vendors and storage startups
  • Multiple vendors have added or updated support for RDMA over Ethernet
  • The growth of NVMe storage devices and release of the NVMe over Fabrics standard are driving demand for both faster speeds and lower latency in networking
  • The growth of cloud, virtualization, hyper-converged infrastructure, object storage, and containers is increasing the popularity of Ethernet as a storage fabric

The world of Ethernet in 2017 promises more of the same. Now we revisit the topic with a look ahead at what’s in store for Ethernet in 2017. Join us on December 1, 2016 for our live webcast, “2017 Ethernet Roadmap for Networked Storage.”

With all the incredible advances and learning vectors, SNIA ESF has assembled a great team of experts to help you keep up. Here are some of the things to keep track of in the upcoming year:

  • Learn what is driving the adoption of faster Ethernet speeds and new Ethernet storage models
  • Understand the different copper and optical cabling choices available at different speeds and distances
  • Debate how other connectivity options will compete against Ethernet for the new cloud and software-defined storage networks
  • And finally look ahead with us at what Ethernet is planning for new connectivity options and faster speeds such as 200 and 400 Gigabit Ethernet

The momentum is strong with Ethernet, and we’re here to help you stay informed of the lightning-fast changes. Come join us to look at the future of Ethernet for storage and join this SNIA ESF webcast on December 1st. Register here.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

It’s Time for a Re-Introduction to Ethernet Networked Storage

Ethernet technology has been a proven standard for over 30 years, and there are many networked storage solutions based on Ethernet. While storage devices are evolving rapidly with new standards and specifications, Ethernet is moving towards higher speeds as well: 10Gbps, 25Gbps, 50Gbps and 100Gbps… making it time to re-introduce Ethernet Networked Storage.

That’s exactly what Rob Davis and I plan to do on August 4th in a live SNIA Ethernet Storage Forum Webcast, “Re-Introduction to Ethernet Networked Storage.” We will start by providing a solid foundation on Ethernet networked storage and move to the latest advancements, challenges, use cases and benefits. You’ll hear:

  • The evolution of storage devices – spinning media to NVM
  • New standards: NVMe and NVMe over Fabric
  • A retrospective of traditional networked storage, including SAN and NAS
  • How new storage devices and new standards will impact Ethernet networked storage
  • Ethernet based software-defined storage and the hyper-converged model
  • A look ahead at new Ethernet technologies optimized for networked storage in the future

I hope you will join us on August 4th at 10:00 a.m. PT. We’re confident you will learn some new things about Ethernet networked storage. Register today!

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

Principles of Networked Solid State Storage – Q&A

At this month’s SNIA Ethernet Storage Forum Webcast, “Architectural Principles for Networked Solid State Storage Access,” Doug Voigt, Chair of the SNIA NVM Programming Technical Working Group, and a member of the SNIA Technical Council, outlined key architectural principles surrounding the application of networked solid state technologies. We had a flurry of questions near the end of the Webcast that we did not have enough time to answer. Here are Doug’s answers to all the questions we received during the event:

Q. Are there wait cycles in accessing persistent memory?

A. It depends entirely on which persistent memory (PM) technology is being accessed and how the memory interconnect is used.   Some technologies have write times that are quite different from read times.   When using tightly timed interconnects such as DDR with those technologies it may be difficult to avoid wait cycles.

Q. How do Pmalloc and malloc share the virtual address space of the application?

A. This is entirely up to the OS and other libraries operating within any constraints of the processor architecture-specific memory management units.   A good mental model would be fairly large regions of contiguous address space in both the physical and virtual domains, where each region will comprise a single type of memory. Capacity will be reserved for pmalloc and malloc in the appropriate regions.

Q. Always flush after doing your memory-mapped IO.   Is that simply good hygiene?

A. Not exactly. The term “Memory Mapped IO” is used to reference control plane (as opposed to data plane) access.   It is often reasonable to set up control plane memory as uncacheable. The need for strict order of access to physical control plane registers is so pervasive that caching is generally not useful. Uncacheable writes are always flushed by the processor, as opposed to the application.

Generally with memory mapped IO devices the data plane uses direct memory access (DMA).   With memory mapped files (as opposed to memory mapped IO) Load/Store (more commonly referred to as “Ld/St”), not DMA, is used in the data plane. Disabling caching in the data plane is generally a big performance sacrifice for small byte range access.

In the Ld/St datapath, strategically placed flushing is required to retain both performance and power failure recovery. The SNIA NVM Programming Model describes this type of functionality.
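
As a loose illustration of “strategically placed flushing,” here is a minimal sketch using an ordinary memory-mapped file — not real persistent memory and not the SNIA NVM Programming Model API; the path and sizes are made up.

```python
# Ld/St-style access through a memory mapping, with an explicit flush at the
# point where the data must be durable.
import mmap

path = "/tmp/pm_sketch.bin"   # illustrative path
size = 4096

with open(path, "w+b") as f:
    f.truncate(size)                       # back the mapping with 4 KiB of file
    with mmap.mmap(f.fileno(), size) as m:
        m[0:5] = b"hello"                  # store: an ordinary memory write
        m.flush()                          # flush only where recovery requires it
```

On real persistent memory the flush step would map to cache-line flush and fence instructions rather than an msync-style call, but the placement principle is the same.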

Q. Once NVDIMM support becomes pervasive, with support from NVMe drives in the server box, should network storage be more focused on SAS Flash or just SAS HDDs?

A. Not necessarily.   NVMe over Fabric, Fibre Channel and iSCSI are also types of networked storage that will likely retain significant market share relative to SAS.

Q. Are the ‘Big Data’ Data Warehouse applications starting to use persistent memory and persistence domain technologies in their applications?

A. It is too early to see much of this yet. PM technologies might become a priority as a staging area for analytic applications with high ingest or checkpoint rates. NVDIMMs are likely to be too expensive to store anything “big” for quite a while.

Q. Also, are persistent memory/persistence domains being used in hyper-converged and converged hardware infrastructures?

A. Persistent memory is quintessentially (Hyper-) converged.   It wouldn’t be unreasonable to expect some traction with hyper-converged solutions that experience high storage-performance demand.

Q. What distance would you associate with 10’s of microseconds?

A. In terms of transmission delay, 10s of microseconds align with a campus or small-city scale, but the distance itself is often not the primary factor. Switching delays, transmission-line properties and software overhead are generally bigger factors.
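
A quick sanity check on that scale, using the rule of thumb that light in fiber covers roughly 200 km per millisecond (a rough figure, not a vendor number):

```python
# Propagation delay alone: how far do "tens of microseconds" get you?
km_per_us = 0.2                      # ~200,000 km/s in fiber
for delay_us in (10, 50, 100):
    print(f"{delay_us:>3} us one-way ~ {delay_us * km_per_us:.0f} km of fiber")
# 10 us ~ 2 km, 100 us ~ 20 km: campus/metro distances, before switching
# delays and software overhead are even counted.
```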

Q. So latency would be the binding factor for distances…not a question, an observation.

A. Yes, in effect, either through transmission or relay.   See above.

Q. Aren’t there multi-threaded SSDs?

A. Yes, but since the primary metric in this presentation is latency we ignore multi-threading.   It can enable more work to get done, but it generally increases latency rather than reducing it.

Q. Is Pmalloc in universal usage?

A. The term is starting to be recognized among developers and has been used in research. Various similar names have been used in early research prototypes, such as pmalloc in Mnemosyne and nvmalloc in SCMFS.

Q. So how would PM help in a (stock broking) requirement, where we currently prophesize an RDMA or iWARP solution?

A. With PM the answer is always lower latency. PM can be integrated like memory or like flash. RDMA network paths for both of these options were discussed in the presentation. In either case, PM is low-latency enough that networking and software overheads will completely determine performance, even when using RDMA. The performance boost from PM is greatest when it is accessed locally. If remote access is a requirement, then the new work being done in the RDMA community should help.

Q. If data stored in memory needs to be copied to a different host’s memory (for consistency), how does PM assist, or is there an extension to PM? Coherency between multiple hosts in a cluster, if you will?

A. PM technology does not help with this; the methods of managing consistency across hosts remain unchanged by PM.   All PM offers is low latency persistence.

Coordination across hosts or nodes in a cluster must use existing clustering techniques such as locking and quorums. In addition, the relative timescales of memory access and network communication suggest the application of asynchronous remote replication techniques used in today’s storage solutions.

Regarding coherency, PM brings nothing new to the known techniques for managing coherency.   Classical cluster architecture must be applied outside of symmetric multi-processing coherency domains. Within coherency domains, all of the logic is above the PM level in a processor side memory controller or a software emulation of the same algorithms.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

Questions Aplenty on NVMe over Fabrics

Our live SNIA-ESF Webcast, “Under the Hood with NVMe over Fabrics,” generated more questions than we anticipated, proving to us that this topic is worthy of future discussions. Here are answers to both the questions we took during the live event as well as those we didn’t have time for.

Q. So fabric is an alternative to PCIe, for those of us familiar with PCIe-attached NVMe devices, yes?

A. Yes, fabric is the term used in the specification that represents a variety of physical interconnects and transports for NVM Express.

Q. How are the namespaces shared in a fabric?

A. Namespaces are NVM subsystem resources and are accessible by all controllers in the NVM subsystem. Multi-host access may be coordinated using reservations.

Q. If there are multiple subsystems accessing the same NVMe devices over the fabric, then how is the namespace shared?

A. The mapping of fabric NVM subsystem resources (namespaces and controllers) to PCIe NVMe device subsystems is implementation specific. They may be mapped 1-to-1 or N-to-1, depending on the functionality of the NVMe bridge.

Q. Are namespace reservations similar to SCSI reservations?

A. Yes

Q. Are there plans for defining bindings for Intel Omni Path fabric?

A. Intel Omni-Path is a good candidate fabric for NVMe over Fabrics.

Q. Is hybrid attachment allowed? Could a single namespace be attached to a fabric and PCIe (through two controllers) concurrently?

A. At this moment, such a hybrid configuration is not permitted within the specification.

Q. Is an NVM sub-system purpose-built or commodity server hardware?

A. This is a difficult question to answer. At the time of this writing there are not enough “off-the-shelf” commodity components to be able to construct NVMe over Fabric subsystems.

Q. Does NVMeoF use the same NVMe PCIe controller register map?

A. A subset of the NVMe controller register mapping was retained for fabrics but renamed to “Properties” to avoid confusion.

Q. So does NVMe over Fabric act like an extension of the PCIe bus? Meaning that I see the same MMIO registers and queues remotely? Or is it a completely different protocol that is solely message based? Will current NVMe host drivers work on the fabric or does it really require a different driver stack?

A. Fabrics is not an extension of PCIe, it’s an extension of NVMe. It uses the same NVMe Submission and Completion Queue model and Descriptors as the PCIe NVMe. Most of the original NVMe host driver stack is retained and shared between PCIe and Fabrics, the bottom side was modified to allow for multiple transports.

Q. Does NVMe over Fabrics support immediate data for writes, or must write data always be fetched by the NVMe controller?

A. Yes, immediate data is termed “in-capsule” and is used to send the NVMe command data with the NVMe submission entry.

Q. As far as I know, Linux introduced a multi-queue model at the block layer recently. Is it the same thing you are mentioning?

A. No, but NVMe uses the Linux Block-MQ layer. NVMe Multi-Queue is used between the host and the NVMe controller for both PCIe and fabric based controllers.

Q.  Are there situations where you might want to have more than one queue pair per CPU? What are they?

A. Queue-Pairs are matched up by CPU cores, not CPUs, which allows the creation of multiple namespace entities per CPU. This, in turn, is very useful for virtualization and application separation.

Q. What are three mandatory commands? Do they refer to read/write/sync cache?

A. Actually, there are 13 required commands. Kevin Marks has a very good presentation from the Flash Memory Summit that provides a list of these commands within the broader NVMe context. You can download it here.  

Q. Please talk about queue depths? Arbitrary? Limited?

A. Queue depths are controller-defined, up to a maximum of 64K entries.

Q. Where will SQs and CQs be physically located? Are they on host memory or SSD memory?

A. For fabrics, the SQ is located on the controller side to avoid the inefficiency of having to pull SQEs across a fabric. CQs reside on the host.

Q. How do you create ordering guarantee when that is needed for correctness?

A. For commands that require sequencing, there is a concept called “Fused Commands” which get sent as a single unit.

Q. In NVMeoF how are devices discovered?

A. NVMeoF devices are discoverable via a couple of different means, depending on whether you are using Fibre Channel (which has its own discovery and login process) or an iSCSI-like name server. Mike Shapiro goes over the discovery mechanism in considerable detail in this BrightTALK Webcast.

Q. I guess all new drivers will be required for NVMeoF?

A. Yes, new drivers are being written and will be required for NVMeoF.

Q. Why can’t the doorbell+ communication model apply to PCIe? I mean, why doesn’t PCIe use doorbell+?

A. NVMe 1.2 defines controller resident buffers that can be used for pushing SQ Entries from the host to the controller. Doorbells are still required for PCIe to inform the controller about the new SQ entries.

Q. If there are two hosts connected to the same subsystem, will the NVMe controller have two queues, one for each host?

A. Yes

Q. So with your command and data description, does NVMe over Fabric require RDMA or does it have a “Data Ready” type message to tell the host when to send write data?

A. Data transfer operations are fabric dependent. RDMA uses RDMA_READ, another transport may use some form of Data Ready model.

Q. Can you quantify the protocol translation overhead? In reality, it does not look that big from a performance perspective.

A. Submission Queue entries are 64 bytes and Completion Queue entries are 16 bytes. These are sufficiently small for block storage traffic, which is typically in 4K+ size requests.
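
The point is easy to quantify; here is a tiny worked example using the entry sizes above and an assumed 4K request.

```python
# Queue-entry overhead relative to a typical 4K block I/O.
sqe_bytes = 64        # NVMe Submission Queue entry
cqe_bytes = 16        # NVMe Completion Queue entry
io_bytes = 4 * 1024   # assumed block storage request size

overhead = (sqe_bytes + cqe_bytes) / io_bytes
print(f"Per-I/O queue-entry overhead: {overhead:.1%}")   # ~2.0%
```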

Q. Do Dual Port SSDs need to support two Admin Qs since they have two paths to the same host?

A. Dual-Port or multi-path capable NVM subsystems require using two NVMe controllers each with one AdminQ and one or more IO queues.  

Q. For a Dual Port SSD, does each port need to have its Submission Q on a different CPU core in the host? I assume the SQs for the two ports cannot be on the same CPU core.

A. The mapping of controller queues to host CPU cores is typically per controller. If the host was connected to two controllers, there would be two queues per core. One queue to controller 1 and one queue to controller 2 per host core.
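
A small illustrative sketch of that layout follows; the core and controller counts are assumptions, not requirements of the specification.

```python
# One I/O queue per (controller, host core) pair: two controllers means
# two queues per core.
import os

host_cores = os.cpu_count() or 8     # fall back to an assumed count
controllers = 2                      # e.g. a dual-port NVM subsystem

queues = [(ctrl, core) for ctrl in range(controllers) for core in range(host_cores)]
print(f"{len(queues)} I/O queues = {controllers} controllers x {host_cores} cores")
```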

Q. As you mentioned, there is currently LBA addressing in the standard. What will happen when Intel goes to market with its new media (3D XPoint), which is announced to be byte addressable?

A. The NVMe NVM command set is block based and is independent of the type and access method of the NVM media used in a subsystem implementation.  

Q. Is there a real benefit of this architecture in a NAS environment?

A. There is a natural advantage to making any storage access more efficient. A network-attached system still requires block access at the lower levels, and NVMe (either local or over a Fabric) can improve NAS design and flexibility immensely. This is particularly true for pNFS and scale-out SMB paradigms.

Q. How do you handle authentication across many servers (hosts) on the fabric? How do you decide what host can access what part of each device? Does it have to be namespace specific?

A. The fabrics specification defines an Authentication model and also defines the naming format for NVM subsystems and hosts. A target implementation can choose to provision NVM subsystems to specific hosts based on the naming format.

Q. Having the same structure at all layers means that at the transport layer of a flash appliance we should also maintain the Submission and Completion Queue model, with these mapped to the physical queues of the NVMe sub-controller?

A. The NVMe Submission Queue and Completion Queue entries are common between fabrics and PCIe NVMe. This simplifies the steps required to bridge between NVMe fabrics and NVMe PCIe. An implementation may choose to map the fabrics SQ directly to a PCIe NVMe SSD SQ to provide a very efficient, simple NVMe transport bridge.

Q. With an RDMA based transport, how will each host discover the NVME controller(s) that it has been granted access to?

A. Please see the answer above.

Q. Traditionally SAS supports SAS expander for scaling purpose. How does NVMe over fabric solve the issue as there is no expander concept in NVMe world?

A. Recall that SAS expanders compensate for SCSI’s inherent lack of scalability. NVMe perpetuates the multi-queue model (which does not exist for SCSI) natively, so SAS expander-like pieces are not required for scale-out.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.  

Update: Want to learn more about NVMe? Check out these SNIA ESF webcasts:

How Ethernet RDMA Protocols iWARP and RoCE Support NVMe over Fabrics

NVMe (Non-Volatile Memory Express) over Fabrics is of tremendous interest among storage vendors, flash manufacturers, and cloud and Web 2.0 customers. Because it offers efficient remote and shared access to a new generation of flash and other non-volatile memory storage, it requires fast, low latency networks, and the first version of the specification is expected to take advantage of RDMA (Remote Direct Memory Access) support in the transport protocol.

Many customers and vendors are now familiar with the advantages and concepts of NVMe over Fabrics but are not familiar with the specific protocols that support it. Join us on January 26th for this live Webcast that will explore and compare the Ethernet RDMA protocols and transports that support NVMe over Fabrics and the infrastructure needed to use them. You’ll hear:

  • Why NVMe Over Fabrics requires a low-latency network
  • How the NVMe protocol is mapped to the network transport
  • How RDMA-capable protocols work
  • Comparing available Ethernet RDMA transports: iWARP and RoCE
  • Infrastructure required to support RDMA over Ethernet
  • Congestion management methods

The event is live, so please bring your questions. We look forward to answering them.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

Under the Hood with NVMe over Fabrics

Non-Volatile Memory Express (NVMe) has piqued the interest of many people in the storage world. Using a robust, efficient, and highly flexible transport protocol for SSDs, Flash, and future Non-Volatile Memory storage devices, the NVM Express group is working on extending these advantages over a networked Fabric.

Our first Webcast on The Performance Impact of NVMe over Fabrics was very well received. If you missed it, check it out on-demand. On December 15th, Dave Minturn, Storage Architect at Intel, will join me for a deeper dive in a live Webcast, “Under the Hood with NVMe over Fabrics.” At this Webcast we’ll explain not only what NVMe over Fabrics is, but also pay specific attention to how it works. We’ll be exploring:

  • Key terms and concepts
  • Differences between NVMe-based fabrics and SCSI-based fabrics
  • Practical examples of NVMe over Fabrics solutions
  • Important future considerations

Register now and join us as we discuss the next iteration of NVMe.  I hope to “see” you on the 15th when Dave and I will be anxious to answer your questions.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

Life of a Storage Packet (Walk): Q&A

We got a lot of great questions at our recent Ethernet Storage Forum webcast “The Life of a Storage Packet (Walk).” As promised, we’ve compiled all the questions with fairly detailed answers. We hope this blog helps to clear up any confusion or uncertainties. If you think of additional questions, please comment below and we’ll get back to you as soon as we can. Thanks to everyone who watched the live webcast. If you missed it, it’s now available on-demand.

Q. Does size of a block depend on OS?

A. Yes, some OSes only support 512-byte blocks. Some OSes provide a method to support both 512-byte blocks and 4096-byte (aka 4K) blocks, for example via a qualifier in their format command. Some devices are built using 4K block sizes but then emulate 512-byte blocks to the host (aka 512E devices). Many modern OSes automatically detect the block size of each device they discover and do the right thing based on what they find. You have to check the documentation for your OS to know its capabilities.
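
On Linux you can check what a given device reports without digging through documentation; here is a minimal sketch (the device name is an assumption — adjust it for your system).

```python
# Read the logical and physical block sizes a block device advertises (Linux).
dev = "sda"                                   # illustrative device name
base = f"/sys/block/{dev}/queue"

with open(f"{base}/logical_block_size") as f:
    logical = int(f.read())
with open(f"{base}/physical_block_size") as f:
    physical = int(f.read())

# A 512E drive typically reports logical=512 / physical=4096;
# a native 4K drive reports 4096 / 4096.
print(f"{dev}: logical={logical} B, physical={physical} B")
```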

Q. CIFS is not a file system.

A. Not in the same way that ext, ntfs, FAT, HFS, UFS, etc. are, no. However, the capabilities we discussed in the presentation – the ability to manipulate files with operations such as read, write, create, delete, and rename – are all file system functionalities. The difference, of course, is that the files are not on the local computer; they are actually on a remote computer.

For what it’s worth, the term ‘CIFS’ has been deprecated from usage, and SMB is the preferred term for precisely the reasons that it should not be confused with local OS systems like the ones mentioned above.

Q. Is it safe to say that a “block” at the file system level is equal to an IO?

A. Not specifically. The difference in the use of those terms is that a “block” is a place where data is stored – it has an address and it contains data (512 bytes of data or 4K bytes of data). An IO is an operation that requests access to a block. An IO may perform a read operation on a block, or it may perform a write operation on a block.

Remember, the term IOPS is I/O operations per second – so it is not about blocks or bytes or bits – it is operations. If the operations are performed on a 512 byte block, they produce a different number of MB/sec than if they operate on a 4K byte block.

So, let’s take an example. 1MB/second of bandwidth on a 4K block size device is the same speed as 1MB/second of bandwidth on a 512-byte block size device (observe, however, that the 4K block size device will need only 1/8 the IOPS that the 512-byte device does, because it takes only 1/8 the operations to transfer the same number of megabytes). However, 1M IOPS on a 4K block size device is much better than 1M IOPS on a 512-byte block size device, because the 4K block size device is moving 8X as much data as the 512-byte block size device in each operation.
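
Here is that example as plain arithmetic (using the decimal convention of 1 MB = 1,000,000 bytes):

```python
def mb_per_s(iops, block_bytes):
    """Bandwidth in MB/s for a given operation rate and block size."""
    return iops * block_bytes / 1_000_000

# Same ~1 MB/s of bandwidth: the 512-byte device needs ~8x the operations.
print(round(1_000_000 / 512), "IOPS at 512 B  ->", mb_per_s(1_000_000 / 512, 512), "MB/s")
print(round(1_000_000 / 4096), "IOPS at 4 KiB ->", mb_per_s(1_000_000 / 4096, 4096), "MB/s")

# Same 1M IOPS: the 4 KiB device moves 8x the data per second.
print(mb_per_s(1_000_000, 4096), "MB/s vs", mb_per_s(1_000_000, 512), "MB/s")
```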

There is an excellent explanation and walk-through of IOPS in our Storage Performance Benchmarking webinar.

Q. Don’t all those Inodes also live on the disk and so don’t the IOs to read those blocks also have to go to the SCSI controller?

A. Well-spotted!

The communication back and forth between the file system and the Inodes traverses the controller for all access to blocks on disk. This is one of the reasons why needing to access disk is considered “expensive.” When you add a network to the mix, these kinds of situations need particularly careful consideration.

Q. Shall we allocate blocks and inodes or is it an automatic process?

A. It’s an automated process, and there is no user intervention at all.

Q. Are Inodes created during OS installation?

A. Inodes are a particular block type, among some other block types (e.g., data blocks, boot blocks, superblocks, group descriptor blocks). These block types are combined into functional groups. These groups are OS-dependent. The block layout therefore, including Inodes, is created during OS installation. If the filesystem needs more Inodes after the OS installation, the OS dynamically adds them to the Inode pool.
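
A small illustration on a Unix or Linux system (the paths are just examples):

```python
# Peek at inode details that the OS manages automatically.
import os

st = os.stat("/etc/hosts")      # any existing file will do
print("inode number:", st.st_ino)

vfs = os.statvfs("/")           # file-system-wide inode accounting
print("inodes total:", vfs.f_files, "inodes free:", vfs.f_ffree)
```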

Q. On which physical hardware does the volume manager reside?

A. Volume managers are not hardware. They are software layers that create pseudo devices that are presented to layers of the OS above them (typically to the file system layer). The volume manager software fits into the OS to accept requests from the file system and pass those requests down to the device driver. In some systems they might be called “filter drivers”.

Q. In flash media, is there also an iSCSI controller that converts PCIe into iSCSI to interact with the flash?

A. I want to make sure that the answer to the question is clear.

On the host, we need to convert PCIe commands to SCSI, so we send them to an iSCSI controller/adapter to be sent across the Ethernet wire. That adapter can be either software or hardware.

Flash drives are basic media, just like spinning disk drives. Flash will have its own controller, which can be SCSI. If you wish to access the drive over Ethernet using the iSCSI protocol, you will have an adapter on the flash drive (which can be either software or hardware) that will do the SCSI translation between the Flash media and the SCSI commands. This is often called the FTL – or the flash translation layer. Again, the SCSI commands are translated into an Ethernet-friendly packet to be sent along the wire.

There are other types of communication forms for working with Flash, too. The most recent is NVMe. You can see the SNIA webinar on The Performance Impact of NVMe and NVMe over Fabrics for more information.

Q. Why is there a SCSI language in between Storage and the hosts?

A. Before storage standards (like SCSI and ATA), you would purchase storage from vendor X, and you would also buy a storage controller for that vendor X storage. If you wanted vendor Y storage, you could not use the vendor X controller; you had to purchase a new controller from vendor Y. Every vendor had their own language, and you had to purchase matching components. Once you got locked into one vendor, you were stuck – at the hardware level.

Today with standards (such as SCSI), you can buy a SCSI device from any vendor and connect it to any storage controller that you buy from any vendor – and it just works.   That is the point of the SCSI standard. When the storage standardization efforts began, there were many competing ideas. It just so happened that SCSI won out, and so now it’s everywhere.

Q. If the storage side is flash, do we still need a SCSI controller between host and flash storage or is the controller different for a flash storage?

A. Yes… and no.

Yes, to use the SCSI part of the OS, the flash device must continue to speak the SCSI protocol and so a SCSI controller is needed. This method of communicating with flash devices enables all the existing S/W on the host OS to just continue working without even knowing it is flash.

No, flash memory chips can be connected to the system using non-SCSI methods. Most of those methods are generally special purpose applications and so the existing OS S/W simply cannot use that device (only the special purpose S/W designed for that device can use it). However, NVMe is a new protocol that is enabling more general use of the flash memory technology with the hopes to provide new capabilities that are beyond what SCSI provides. This is also discussed in our webcast on The Performance Impact of NVMe and NVMe over Fabrics.

Q. What’s the difference between partition, logical disk, volume, LUN, etc?

A. Excellent question!

Let’s work this one backwards – LUN – that is a SCSI term (actually an acronym) that refers to the Logical Unit Number – it is part of the address used to access a logical unit. Small SCSI devices (such as a single spindle disk drive) have only a single logical unit. Large storage arrays may contain 100s of logical units. Each of those logical units appears to the host OS as if it were a single spindle disk. So, the logical unit is the SCSI object that contains the blocks where the data is stored (where that object may be an individual piece of hardware, or a logical entity within a larger SCSI device). To access a SCSI logical unit, the OS must specify the address of the SCSI device, and then the LUN (logical unit number) for the logical unit within that SCSI device.

The other terms (partition, logical disk, and volume) are OS terms that have to do with virtualization and how the storage blocks are managed. When a SCSI (or other storage device) is formatted by the OS, it may be broken into multiple partitions. Each partition is then treated by the OS as if it were a unique device (a virtual device, or a logical disk). Each of those partitions may then be used independently.

For example, on a Unix system, the “a” partition may contain a file system that has all the files that are necessary to boot. The “b” partition may be setup without a file system and used as the swap or paging storage for use by the virtual memory subsystem. The “d” partition may then contain a file system that contains all the user’s data files. Each partition is unique storage space and may even use a different file system to organize the data located there.

Notice, that I skipped the “c” partition. That partition is often setup to access all the blocks of the physical device. So, on a 500GB disk, maybe “a” contains 10GB, “b” contains 90GB, and “d” contains 400GB; while “c” contains all 500GB. Partitions “a” and “d” may be backed up or restored independently, and partition “c” may be used to perform an image copy of the entire device.
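
On Linux you can see the whole-device/partition relationship directly; here is a minimal sketch (the output will differ from system to system).

```python
# List the block devices and partitions the kernel currently exposes (Linux).
with open("/proc/partitions") as f:
    print(f.read())
# Typical output pairs a whole device (e.g. sda) with its partitions
# (sda1, sda2, ...), each of which the OS then treats as independent storage.
```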

Now, to the other terms:

Logical disk and volume are terms often related to volume managers.

  • Logical disk may be a term used to refer to a partition, but usually that is not the case. Logical disks are typically the devices created by the volume management layer when they combine individual devices into a single larger device (a.k.a. a logical disk).
  • Volume managers also may be used to divide up a large device into small chunks (just like partitions), and those smaller chunks are referred to as logical disks.
  • Volume is a more vague term that typically is used as another term for a logical disk. In some circles, the term Volume is used to refer to a RAID set in a SCSI storage controller (but this is a much less often used definition for Volume).

Q. Will the “complete” presentation be somewhere we can go to review it?

A. Yes. Click here to access the on-demand webcast as well as a PDF of the webcast slides.

Update: If you missed the live event, it’s now available  on-demand. You can also  download the webcast slides.

Next Webcast: The 2015 Ethernet Roadmap for Networked Storage

The ESF is excited to announce our next live Webcast, “The 2015 Ethernet Roadmap for Networked Storage.”

For over three decades, Ethernet has advanced through simple “powers-of-ten” speed increases, and this model has served the industry well. Ethernet is changing in big ways, and the Ethernet Alliance has captured the latest changes in the 2015 Ethernet Roadmap.

On June 30th at 10:00 a.m. PT, an expert panel comprised of Scott Kipp, President of the Ethernet Alliance; David Chalupsky, Chair of the IEEE P802.3bq/bz Task Forces and the Ethernet Alliance BASE-T Subcommittee; and myself will present the Ethernet Alliance’s 2015 Ethernet Roadmap for the networking technology that underlies most future network storage.

SNIA has focused on protocols and usage models and more or less just takes Ethernet for granted.   The biggest technology disruption in the storage space is the emergence into the mainstream of Non-Volatile Memory (NVM), FLASH in particular.   NVM increasingly moves system bottlenecks from the storage subsystem to the network.   Developments in NVM — most recently 3D FLASH — assure that the cost per GB will continue aggressive declines and demand for bandwidth will go up.   NVM will become more prevalent, making the roadmap for Ethernet increasingly more important to the storage networking community.

This will be a live and interactive session. I encourage you to register now and bring your questions for our experts. I hope to see you on June 30th.