New ESF Webcast: Benefits of RDMA in Accelerating Ethernet Storage Connectivity

We’re kicking off our 2015 ESF Webcasts on March 4th with what we believe is an intriguing topic – how RDMA technologies can accelerate Ethernet Storage. Remote Direct Memory Access (RDMA) has existed for many years as an interconnect technology, providing low latency and high bandwidth in computing clusters.  More recently, RDMA has gained traction as a method for accelerating storage connectivity and interconnectivity on Ethernet.  In this Webcast, experts from Emulex, Intel and Microsoft will discuss:

  • Storage protocols that take advantage of RDMA
  • Overview of iSER for block storage
  • Deep dive into SMB Direct for file storage
  • Benefits of available RDMA technologies to accelerate your Ethernet storage connectivity, both iWARP and RoCE

Register now. This live Webcast will provide attendees with a vendor-neutral look at RDMA technologies and should prove to be an interactive and informative event. I hope you’ll join us!

Real-World FCoE Best Practices Q&A

At our recent live Webcast “Real-World FCoE Designs and Best Practices,” IT leaders from Thermo Fisher Scientific and Gannett Co. shared their experiences from their FCoE deployments – one single-hop, one multi-hop. It was a candid discussion on the lessons they learned. If you missed the Webcast, it’s now available on demand. We polled the audience to see what stage of FCoE deployment they’re in (see the poll results at the end of this blog). Just over half said they’re still in learning mode. To that end, here are answers to the questions we got during the Webcast. As you will see, many of these questions were directed to our guest end users regarding their experiences. I hope that it will help you in your journey. If you have additional questions, please ask them in the comments section in this blog and we’ll get back to you as soon as possible.

Q. Have any issues come up where the storage team needed to upgrade SAN switch firmware to solve a problem, but the network team objected to upgrading the FCFs? This assumes a shared firmware release on both network and SAN switch products (i.e., Cisco NX-OS).

A. No. We work together as a team, so as long as the upgrade is planned out in advance, this has not been an issue.

Q. Is there any overhead at the host CPU level when using FCoE/CNA vs. using FC/HBA? Has anyone done any benchmarking on this?

A. To the host OS, the CNA presents an HBA and a 10G Ethernet adapter, so there is no difference from what is normally presented for Ethernet and FC adapters. In a software FCoE implementation there might be some overhead, but you should check with the OS vendor’s particular implementation for this information.

Q. Are there any high-level performance considerations when compared to typical FC SAN? Any obvious impact to IO latency as hosts are moved to FCoE compared to FC?

A. There is a performance increase in comparison to 8Gb Fibre Channel, since FCoE uses Ethernet’s 64b/66b encoding vs. the 8b/10b encoding that native 8Gb FC uses. On dedicated links it could be around a 50% increase in performance for 10Gb FCoE vs. 8Gb FC.
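
For a rough sense of where that figure comes from, here is a back-of-the-envelope sketch. The line rates are nominal published values and the calculation ignores FCoE framing overhead, so treat it as illustrative rather than a benchmark:

```python
# Back-of-the-envelope comparison of usable line rate (illustrative only).
# 8GFC signals at ~8.5 GBaud with 8b/10b encoding; 10GbE at 10.3125 GBaud with 64b/66b.

fc8_usable_gbps = 8.5 * (8 / 10)          # ~6.8 Gbps of payload bandwidth
fcoe10_usable_gbps = 10.3125 * (64 / 66)  # ~10.0 Gbps, before FCoE framing overhead

print(f"8Gb FC   : {fc8_usable_gbps:.1f} Gbps usable")
print(f"10Gb FCoE: {fcoe10_usable_gbps:.1f} Gbps usable")
print(f"Increase : {(fcoe10_usable_gbps / fc8_usable_gbps - 1) * 100:.0f}%")
```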

Q. Have you planned to use 40G FCoE in your edge-core design?

A. We have purchased the hardware to go 40G if we choose to.

Q. Was DCB used to isolate the network traffic from the FC traffic at the CNA?

A. DCB is a set of technologies – including DCBX, PFC, and ETS – that are used with FCoE.

Q. Was FCoE implemented on existing hosts or just on new ones being added to the SAN?

A. Only on new hosts.

Q. Can you expand on Domain_ID sprawl?

A. In FC or FCoE fabrics, each storage vendor supports only a certain number of switches per fabric. Each full FC or FCoE switch consumes a Domain ID, so it is important to consider how many switches or Domain IDs are allowed in a supported fabric based on the storage vendor’s fabric recommendations. Technologies such as NPIV, along with various vendor-specific features, can help limit Domain ID sprawl in your fabrics.

Remember the poll I mentioned during the Webcast? Here are the results. Let us know where you are in your FCoE deployment plans.

[Poll results chart: FCoE deployment stages]

Take Our ESF Quick Poll

ESF has some exciting plans for 2015! We’re busy covering all things “Ethernet Storage” with topics on FCoE and iSCSI use cases, Cloud File Services, Object Storage, NVMe over Fabrics, SMB 3.0, NFS and more. We’re writing White Papers, hosting live expert Webcasts, publishing articles, and of course using this blog and Twitter to keep you updated on all that’s going on.

To help us in our mission to drive the broad adoption of Ethernet-connected storage networking technologies, we want to deliver content on the Ethernet Storage topics that matter most to you. Please take this quick poll – really it’s quick – only two questions – and help us shape the conversation for 2015. We look forward to your input and appreciate your support of SNIA-ESF. SNIA-ESF quick poll.


Webcast Preview: End Users Share their FCoE Stories

Fibre Channel over Ethernet (FCoE) has been growing in popularity year after year. From access layer, to multi-hop and beyond, FCoE has established itself as a true solution in the data center.

Are you interested in learning how customers are using FCoE? Join us on December 10th, at 3:00 pm ET, 1:00 pm PT for our live Webcast, “Real World FCoE Designs and Best Practices.” This live SNIA Webcast examines the most common FCoE designs and looks at how they are being used in real-world customer implementations. You will hear from two IT leaders who have implemented FCoE and why they did so. We will cover:

  • Real-world Use Cases and Customer Implementations of:
    • Single-Hop FCoE
    • Multi-Hop FCoE
    • Use of FCoE for Inter-Switch Links (ISLs)

This will be a vendor-neutral live presentation. Please join us on December 10th and bring your questions for our panel.

The Performance Impact of NVMe and NVMe over Fabrics – Q&A

More than 400 people have already seen our recent live ESF Webcast, “The Performance Impact of NVMe and NVMe over Fabrics.” If you missed it, it’s now available on-demand. It was a great session with a lot of questions from attendees. We did not have time to address them all – so here is a complete Q&A courtesy of our experts from Cisco, EMC and Intel. If you think of additional questions, please feel free to comment on this blog.

Q. Are you saying that just by changing the interface to NVMe for any SSD, one would greatly bump up the IOPS?

A. NVMe SSDs have higher IOPS than SAS or SATA SSDs due to several factors, including the low latency of PCIe and the efficiency of the NVMe protocol.

Q. How much of the speed of NVMe you have shown is due to the simpler NVMe protocol vs. using Flash? i.e., how would the SAS performance change if you attach an SSD to SAS?

A. The performance differences shown comparing NVMe to SAS to SATA were all using solid-state drives (all NAND Flash). Thus, the difference shown was due to the interface.

Q. Can you comment on the test conditions under which these results were obtained, and what are the main reasons NVMe outperforms the others?

A. The most important reason NVMe outperforms other interfaces is that it was architected for NVM – rather than inheriting the legacy of HDDs. NVMe is built on the foundation of a very efficient multi-queue model and a simple, hardware-automatable command set that results in very low latency and high performance. Details for the IOPS and bandwidth comparisons are shown in the footnotes of the corresponding foils. For the efficiency tests, the detailed setup information was inadvertently removed from the backup. This will be corrected.

Q. What is the IOPS difference between NVMe and SAS at the same queue depth?

A. At a queue depth of 32, for the particular devices shown with 4K random reads, NVMe delivers ~267K IOPS and SAS ~149K IOPS. SAS does not improve when the queue depth is increased to 128; NVMe performance increases to ~472K IOPS at a queue depth of 128.
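
If you want to run this kind of queue-depth comparison on your own hardware, a sweep with a standard benchmarking tool such as fio is one way to do it. The sketch below is illustrative only – the device path and job parameters are assumptions, and you should only point it at a scratch device:

```python
# Illustrative 4K random-read queue-depth sweep using fio.
# Assumes fio is installed and /dev/nvme0n1 is a scratch NVMe device (hypothetical path).
import json
import subprocess

DEVICE = "/dev/nvme0n1"

for qd in (1, 32, 128):
    result = subprocess.run(
        ["fio", "--name=qd_sweep", f"--filename={DEVICE}",
         "--rw=randread", "--bs=4k", f"--iodepth={qd}",
         "--ioengine=libaio", "--direct=1",
         "--runtime=30", "--time_based", "--output-format=json"],
        capture_output=True, text=True, check=True)
    job = json.loads(result.stdout)["jobs"][0]
    print(f"iodepth={qd:3d}  read IOPS ~ {job['read']['iops']:,.0f}")
```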

Q. Why not use PCIe directly instead of the NVMe layer on PCIe?

A. PCI Express is used directly. NVM Express is the standard software interface for high-performance PCI Express storage devices. PCI Express does not define a register interface, DMA model, command set, or feature set for PCIe storage devices. NVM Express replaces the proprietary software interfaces and drivers previously used by PCIe SSDs in the market.

Q. Is the Working Group considering adding things like enclosure identification in the transport abstraction so the host/client can identify where the NVMe drives reside?

A. The NVM Express organization is developing a Management Interface specification, set for release in Q1 2015, that will enable standardized enclosure management. The intent is that these features could be used regardless of fabric type (PCIe, RDMA, etc.).

Q. Are there APIs in the software interface for device query information and device RAID configuration?

A. NVMe includes Identify Controller and Identify Namespace commands that provide information about the NVMe subsystem, controllers, and namespaces. It is possible to create a RAID controller that uses the NVMe interface if desired. Higher-level software APIs are typically defined by the OSV.
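
On a Linux host, the open-source nvme-cli utility exposes this identify information directly. A minimal sketch (the device names are assumptions for illustration; adjust them for your system):

```python
# Dump NVMe Identify Controller and Identify Namespace data via nvme-cli.
# Assumes nvme-cli is installed and that /dev/nvme0 and /dev/nvme0n1 exist (hypothetical names).
import subprocess

# Identify Controller: model, serial number, firmware revision, supported features
print(subprocess.run(["nvme", "id-ctrl", "/dev/nvme0"],
                     capture_output=True, text=True).stdout)

# Identify Namespace: capacity, LBA formats, and other per-namespace details
print(subprocess.run(["nvme", "id-ns", "/dev/nvme0n1"],
                     capture_output=True, text=True).stdout)
```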

Q. 1. Are NVMe drivers today multi-threaded? 2. If I were to buy an NVMe device today, can you suggest a list of vendors whose solutions are used today in data centers (i.e., production and not proof of concept or prototype)?

A. The NVM Express drivers are designed for multi-threading – each I/O queue may be owned/controlled by one thread without synchronization with other driver threads. A list of devices that have passed NVMe interoperability and conformance testing is on the NVMe Integrator’s List.

Q. When do you think the market will consolidate around NVMe/PCIe-based SSDs and the SATA era will end?

A. By 2018, IDC predicts that Enterprise SSDs by Interface will be PCIe=38%, SAS=28%, and SATA=34%. By 2018, Samsung predicts over 70% of client SSDs will be PCIe. Based on forecasts like this, we expect strong growth for NVMe as the standard PCIe SSD interface in both Enterprise and Client segments.

Q. Why can’t it be like a graphics card, which does memory transactions?

A. SSDs of today have much longer latency than memory – where a read from a typical NAND page takes > 50 microseconds. However, as next generation NVM comes to market over the next few years, there may be blurring of the lines between storage and memory, where next generation NVM may be used as very fast storage (like NVMe) or as memory as in NVDIMM type of usage models.

Q. It seems that most NVMe drive vendors supply proprietary drivers for their drives. What’s the value of NVMe over proprietary interfaces given this? Will we eventually converge on the open source driver?

A. As the NVMe ecosystem matures, we would expect most implementations would use inbox drivers that are present in many OSes, like Windows, Linux, and Solaris. However, in some Enterprise applications, a vendor may have a value added feature that could be delivered via their own software driver. OEMs and customers will decide whether to use inbox drivers or a vendor specific driver based on whether the value provided by the vendor is significant.

Q. To create an interconnect to a scale-out storage system with many NVMe drives, would you need an aggregated fabric link (with multiple RDMA links) to provide enough bandwidth for multiple NVMe drives?

A. It depends on the speed of the fabric links and the number of NVMe drives. Ideally, the target system would be configured such that the front-end fabric and back-end NVMe drives are bandwidth balanced. Scaling out multi-drive subsystems on a fabric may require the use of fat-tree switch topologies, which may be constructed using some form of link aggregation. The performance of the PCIe NVMe drives is expected to put high bandwidth demands on the front-end network interconnect. Each NVMe SFF-8639 2.5″ drive has a PCIe Gen3 x4 interconnect with the capability to produce 3+ GB/s (~24 Gbps) of sustained storage bandwidth. There are multiple production server systems with 4-8 NVMe SFF-8639 drive bays, which puts these platforms in the 200 Gbps range when used as NVMe over Fabrics storage servers. The combination of PCIe NVMe drives and NVMe over Fabrics targets is going to have a significant impact on datacenter storage performance.
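
To make the balancing point concrete, here is a quick, hypothetical sanity check of front-end fabric bandwidth against back-end drive bandwidth. All of the figures are illustrative assumptions, not measurements of any particular system:

```python
# Hypothetical bandwidth-balance check for an NVMe over Fabrics target.
# Figures are assumptions for illustration only.

drives = 8
per_drive_gbps = 24                       # PCIe Gen3 x4 NVMe drive, ~3 GB/s sustained
backend_gbps = drives * per_drive_gbps    # ~192 Gbps of back-end drive bandwidth

fabric_ports = 2
per_port_gbps = 100                       # e.g., two 100GbE RDMA-capable front-end ports
frontend_gbps = fabric_ports * per_port_gbps

print(f"Back-end drives : {backend_gbps} Gbps")
print(f"Front-end fabric: {frontend_gbps} Gbps")
print("Roughly balanced" if frontend_gbps >= backend_gbps else "Front-end fabric is the bottleneck")
```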

Q. In other forums we heard about NVMe extensions to deliver vendor specific value add features. Do we have any updates?

A. Each vendor is allowed to add their own vendor specific features and value. It would be best to discuss any vendor specific features with the appropriate vendor.

Q. Given that PCIe is not a scalable fabric at least from a storage perspective, do you see the need for SAS SSDs to increase or diminish over time? Or is your view that NVMe SSDs will populate the tier between DRAM and say, rotating media like SAS HDDs?

A. NVMe SSDs are the highest performance SSDs available today. If there is a box of NVMe SSDs, the most appropriate connection to that JBOD may be Ethernet or another fabric that then fans out inside the JBOD to PCIe/NVMe SSDs.

Q. From a storage industry perspective, what deficiencies does NVMe have to displace SAS? Will that transition ever happen?

A. NVMe SSDs are seeing initial broad deployment primarily in server use cases that prize high performance. Storage applications require a robust, high-availability interface. NVMe has defined support for dual port, reservations, and other high-availability features. NVMe will be used in storage applications as these high-availability features mature in products.

Q. Will NVMe over Fabrics allow DMA reads/writes to the NVMe device directly (without going through system memory)?

A. The locality of the NVMe over Fabrics buffers on the target side is target-implementation specific. That said, one could construct a target that used a pool of PCIe NVMe subsystem controller resident memory as the source and/or sink buffers of a fabric NIC’s NVMe data exchanges. This type of configuration has the limitation of having to pre-determine the fabric-data-to-NVMe-device locality, or else the data could end up in the wrong drive’s controller memory.

Q. Intel True Scale fabric technology was based on the Fulcrum ASIC. Could you please provide input on how Intel Omni Scale differs from the Intel True Scale fabric?

A. In the context of NVMe over Fabrics, the Intel Omni-Path fabric is a possible future fabric candidate for an NVMe over Fabrics definition. Specifics of the fabric itself are outside the scope of the NVMe over Fabrics definition. For information on Omni-Path, please refer to http://www.intel.com/content/www/us/en/omni-scale/intel-omni-scale-fabric-demo.html?wapkw=omni-scale.

Q. Can the host side NVMe client be in user mode since it is using RDMA?

A. It is possible since RDMA QP communications allow for both user and kernel mode access to the RDMA verbs. However, there are implications to consider. The NVMe host software currently resides in multiple operating systems as a kernel level block-storage driver. The goal is for NVMe over Fabrics to share common NVMe code between multiple fabric types in order to provide a consistent and sustainable core NVMe software. When NVMe over Fabrics is moved to user level, it essentially becomes a separate single-fabric software solution that must be maintained independently of the multi-fabric kernel NVMe software. There are some performance advantages of having a user-level interface, such as not having to go through the O/S system calls and the ability to poll the completions. These have to be weighed against the loss of kernel resident functionality, such as upper level kernel storage software, and the cost of sustaining the software.

Q. Which role will, or could, the InfiniBand protocol play in the NVMe concept?

A. InfiniBand™ is one of the supported RDMA fabrics for NVMe over RDMA. NVMe over RDMA will support the family of RDMA fabrics through use of a common set of RDMA verbs. This will allow users to select the RDMA fabric type based on their fabric requirements and not be limited to any one RDMA fabric type for NVMe over RDMA.

Q. Where is this experimental code for NVMe over Fabrics (driver and FIO) available?

A. FIO is a common Linux storage benchmarking tool and is available from multiple Internet sites. The drivers used in the demo were developed specifically as a proof of concept and demonstration for Intel IDF 2014. They were based on a pre-standardized implementation of the NVMe over RDMA wire protocol. The standard NVMe over RDMA wire protocol is currently under definition in NVM Express, Inc. Once the standard is complete, both Host and reference Target drivers for Linux will be developed.

Q. Was polling on the completion queue used on the target side in the prototype?

A. The target-side PoC implementation used a polling technique for both the NVMe over RDMA CQ and the NVMe CQ. This was to minimize latency by eliminating the interrupt latency on the target for both CQs. Depending on the O/S and both the RDMA and NVMe devices’ interrupt moderation settings, interrupt latency is typically around 2 microseconds. If polling is not the desired model, Intel processors enable another form of event signaling called Monitor/Mwait, where latency is typically in the 500 ns range.

Q. In the prototype over iWARP, did the remote device DMA write/read the NVMe device directly, or did it go through remote system memory?

A. In the PoC, all NVMe commands and command data went through the remote system memory. Only the NVMe commands were accessed by the CPU; the data was not touched.

Q. Are there any dependencies between NVMe over Fabrics using RDMA and iWARP? Can standard software RDMA in Linux distributions be used without the need for iWARP support?

A. As mentioned, the NVMe over RDMA will be RDMA type agnostic and work with all RDMA providers that support a common set of RDMA verbs.

Q. PCIe doesn’t support multi-host access to devices. Does NVMe over fabric require movement away from PCIe?

A. The NVMe 1.1 specification specifically added features for multi-host support – allowing NVMe subsystems to have multiple NVMe controllers and multiple fabric ports. This model is supported in PCI Express by multi-function/multi-port PCIe drives (typically referred to as dual-port). Depending on the fabric type, NVMe over Fabrics will extend to configurations with many hosts sharing a single NVMe subsystem consisting of multiple NVMe controllers.

Q. In light of the fact that NVMe over Fabrics reintroduces more of the SCSI architecture, can you compare and contrast NVMe over Fabrics with ‘SCSI Express’ (SAM/SPC/SBC/SOP/PQI)?

A. NVMe over Fabrics is not a SCSI model, it’s extending the NVMe model onto other fabric types. The goal is to maintain the simplicity of the NVMe model, such as the small amount of NVMe command types, multi-queue interface model, and efficient NVM oriented host and controller implementations. We chose the RDMA fabric as the first fabric because it too was architected with a small number of operations, multi-queue interface model, and efficient low-latency operations.

Q. Is there an open source for NVMe over Fabrics, which was used for the IDF demo? If not, can that be made available to others to see how it was done?

A. Most of the techniques used in the PoC drivers will be implemented in future open-source Host and reference Target drivers. The PoC was both a learning and demonstration vehicle. Because the PoC drivers use a pre-standards NVMe over RDMA protocol, we feel it’s best not to propagate the implementation.

Q. What is the overhead of the protocol? Did you try putting NVMe in front of just DRAM? I’d assume you’ll get much better results, and understand the limitations of the protocol much better. In front of DRAM it won’t be NVM, but it will give good data regarding protocol latency.

A. The overhead of the protocol on the host-side matched the PCIe NVMe driver. On the target, the POC protocol efficiency was around 600ns of compute latency for a complete 4K I/O. For low-latency media, such as DRAM or next generation NVM, the reduced latency of a solution similar to the PoC will enable the effective use of the media’s low latency characteristics.

Q. Do you have FC performance comparison with NVMe?

A. We did implement a 100GbE/FCoE target with NVMe back-end storage for an Intel IDF 2012 demonstration. FCoE is a combination of two models, FCP and SCSI. Our experience with this target implementation showed that we were adding a significant amount of computational latency on both the host (initiator) and target FCoE/FC/SCSI storage stacks that reduced the performance and efficiency advantages gained in the back-end PCIe NVMe SSDs. A significant component of this computational latency was due to the multiple storage models and the associated translations that occurred between the host application and the back-end NVMe drives. This experience led us down the path of enabling an end-to-end NVMe model by expanding the NVMe model onto a range of fabric types.

Update: Want to learn more about NVMe? Check out these SNIA ESF webcasts:

A Beginner’s Guide to NVMe

When I first started in storage technology (it doesn’t seem like that long ago, really!) the topic seemed like it was becoming rather stagnant. The only thing that seemed to be happening was that disks were getting bigger (more space) and the connections were getting faster (more speed).

More speed, more space; more space, more speed. Who doesn’t like that? After all, you can never have too much bandwidth, or too much disk space! Even so, it does get rather routine. It gets boring. It gets, well, “what have you done for me lately?”

Then came Flash.

People said that Flash memory was a game changer, and though I believed it, I just didn’t understand how, really. I mean, sure, it’s really, really fast storage drives, but isn’t that just the same story as before? This is coming, of course, from someone who was far more familiar with storage networks than storage devices.

Fortunately, I kept that question to myself (well, at least until I was asked to write this blog), thus saving myself from potential embarrassment.

There is no shortage of Flash vendors out there who (rightfully) would have jumped at the chance to set my misinformed self on the straight and narrow; they would have been correct, too. Flash isn’t just “cool,” it allows the coordination, access, and retrieval of data in ways that simply weren’t possible with more traditional media.

There are different ways to use Flash, of course, and different architectures abound in the marketplace – from fully “All Flash Arrays” (AFA), “Hybrid Arrays” (which are a combination of Flash and spinning disk), to more traditional systems that have simply replaced the spinning drives with Flash drives (without much change to the architecture).

Even with these architectures, though, Flash is still constrained by the very basic tenets of SCSI connectivity. While this type of connectivity works very well (and has a long and illustrious history), the characteristics of Flash memory allow for some capabilities that SCSI simply wasn’t built for.

Enter NVMe.

What’s NVMe?

Glad you asked. NVMe stands for “Non-Volatile Memory Express.” If that doesn’t clear things up, let’s unpack this a bit.

Flash and Solid State Devices (SSDs) are a type of non-volatile memory (NVM). At the risk of oversimplifying, NVM is a type of memory that keeps its content when the power goes out. It also implies a way by which the data stored on the device is accessed. NVMe is a way that you can access that memory.

If it’s not quite clear yet, don’t worry. Let’s break it down with pretty pictures.

Starting Before the Beginning – SCSI

For this, you better brace yourself. This is going to get weird, but I promise that it will make sense when it’s all over.

Let’s imagine for the moment that you are responsible for programming a robot in a factory. The factory has a series of conveyor belts, each with things that you have to put and get with the robot.

Graphic 1

The robot is on a track and can only move sideways along the track. Whenever it needs to get a little box on the conveyor belt, it has to move from side to side until it gets to the correct belt, and then wait for the correct orange block (below) to arrive as the belt rotates.

Now, to make things just a little bit more complicated, each conveyor belt is a little slower than the previous one. For example, belt 1 is the fastest, belt 2 is a little slower, belt 3 even slower, and so on.

Graphic 2

In a nutshell, this is analogous to how spinning disk drives work. The robot arm – called a read/write head – moves across a spinning disk from sector to sector (analogous, imperfect as it may be, to our trusty conveyor belts) to pick up blocks of data.

(As an aside, this is the reason why there are differences in performance between long contiguous blocks to read and write – called sequential data – and randomly placing blocks down willy-nilly in various sectors/conveyor belts – called random read/writes.)

Now remember, our trusty robot needs to be programmed. That is, you – in the control room – need to send instructions to the robot so that it can get/put its data. The command set that is used to do this, in our analogy, is SCSI.

Characteristics of SCSI

SCSI is a tried-and-true command structure. It is, after all, the protocol for controlling most devices in the world. In fact, its ubiquity is so prevalent, most people nowadays think that it’s simply a given. It works on so many levels and layers as an upper layer protocol, with so many different devices, that it’s easily the default “go-to” application. It’s used with Fibre Channel, FCoE, iSCSI (obviously), InfiniBand – even the hard drives in your desktop and laptops.

It was also built for devices that relied heavily on the limitations of these conveyor belts. Rotational media has long been shown to be superior to linear media (i.e., tape) in terms of speed and performance, but the engineering required to make up for the changes in speed from the inside of the drive (where the rotational speed is much slower) to the outside – similar to the difference between our conveyor belts “4” and “1” – results in some pretty fancy footwork on the storage side.

Because the robot arm must move back and forth, it’s okay that it can only handle one series of commands at a time. Even if you could tell it to get a block from conveyor belt 1 and 3 at the same time, it couldn’t do it. It would have to queue the commands and get to each in turn.

So, the fact that SCSI only sent commands one-at-a-time was okay, because the robot could only do that anyway.

Then came Flash, and suddenly things seemed a bit… constricted.

The Flash Robot

Let’s continue with the analogy, shall we? We still have our mandate – control a robot to get and put blocks of information onto storage media. This time, it’s Flash.

First, let’s do away with the conveyor belts altogether. Instead, let’s lay out all the blocks (in a nice OCD-friendly way) on the media, so that the robot arm can see it all at once:

Graphic 3

From its omniscient view, the robot can “see” all the blocks at once. Nothing moves, of course, as this is solid-state. However, as it stands, this is how we currently use Flash with a robot arm that responds to SCSI commands. We still have to address our needs one-at-a-time.

Now, it’s important to note that because we’re not talking about spinning media, the robot arm can go really fast (of course, there’s no real moving read/write head in a Flash drive, but remember we’re looking at this from a SCSI standpoint). The long and the short of it is that even though we can see all of the data from our vantage point, we still only ask for information one at a time, and we still only have one queue for those requests.

Enter NVMe

This is where things get very interesting.

NVMe is the standardized interface for PCIe SSDs (it’s important to note, however, that NVMe is not PCIe!). It was architected from the ground-up, specifically for the characteristics of Flash and NVM technologies.

Most of the technical specifics are available at the NVM Express website, but here are a couple of the main highlights.

First, whereas SCSI has one queue for commands, NVMe is designed to have up to 64 thousand queues. Moreover, each of those queues in turn can have up to 64 thousand commands. Simultaneously. Concurrently. That is, at the same time.

That’s a lot of freakin’ commands goin’ on at once!

Let’s take a look at our programmable robot. To complete the analogy, instead of one arm, our intrepid robot has 64 thousand arms, each with the ability to handle 64 thousand commands.

Graphic 4


Second, NVMe streamlines those commands to only the basics that Flash technologies need: 13 to be exact.
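
To make the contrast between the two queuing models concrete, here is a toy sketch. It is purely conceptual – Python lists standing in for what are really hardware and driver submission queues – and not how you would actually talk to a drive:

```python
# Toy illustration of single-queue (SCSI-style) vs. multi-queue (NVMe-style) submission.
# Conceptual only: real NVMe queues live in the driver and the device, not in Python.
from collections import deque

# SCSI-style: a single command queue that every requester shares.
scsi_queue = deque()

# NVMe-style: up to 64K queues, each up to 64K commands deep.
# A common arrangement is one submission queue per CPU core, so threads never
# contend on a shared lock when issuing I/O.
NUM_CORES = 8
nvme_queues = {core: deque() for core in range(NUM_CORES)}

def submit(core, command):
    # Each core owns its own queue; no cross-thread synchronization required.
    nvme_queues[core].append(command)

for core in range(NUM_CORES):
    submit(core, ("READ", core * 4096, 4096))  # (opcode, byte offset, length)

total = sum(len(q) for q in nvme_queues.values())
print(f"{total} commands in flight across {NUM_CORES} independent queues")
```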

Remember when I said that Flash has certain characteristics that allow for radically changing the way data centers store and retrieve data? This is why.

Flash is already fast. NVMe can make it even faster than it is today. How fast? Very fast.

Graphic 5

Source: The Performance Impact of NVMe and NVMe over Fabrics  

This is just an example, of course, because enterprise data centers have more workloads than just those that run 4K random reads. Even so, it’s a pretty nifty example:

  • For 100% random reads, NVMe has 3x better IOPS than 12Gbps SAS
  • For 70% random reads, NVMe has 2x better IOPS than 12Gbps SAS
  • For 100% random writes, NVMe has 1.5x better IOPS than 12Gbps SAS

What about sequential data?

Graphic 6

Source:  The Performance Impact of NVMe and NVMe over Fabrics  

Again, this is just one scenario, but the results are still quite impressive. For one, NVMe delivers greater than 2.5 GB/s read performance and ~2 GB/s write performance:

  • 100% reads: NVMe has 2x performance of 12Gbps SAS
  • 100% writes: NVMe has 2.5x performance of 12Gbps SAS

Of course, there is more to data center life than IOPS! Those efficiencies of command structure that I mentioned above also cut CPU cycles by half, as well as reduce latency by more than 200 microseconds compared to 12 Gbps SAS.

I know it sounds like I’m picking on poor 12 Gbps SAS, but at the moment it is the closest thing to the NVMe type of architecture. The reason is NVMe’s relationship with PCIe.

Relationship with PCIe

If there’s one place where there is likely to be some confusion, it’s here. I have to confess, when I first started going deep into NVMe, I got somewhat confused, too. I understood what PCIe was, but I was having a much harder time figuring out where NVMe and PCIe intersected, because most of the time the conversations tend to blend the two technologies together in the discussion.

That’s when I got it: they don’t intersect.

When it comes to “hot data,” we in the industry have been seeing a progressive migration towards the CPU. Traditional hosts contain an I/O controller that sits in between the CPU and the storage device. By using PCIe, however, it is possible to eliminate that I/O controller from the data path, making the data flows very, very quick.

Because of the direct connection to the CPU, PCIe has some pretty nifty advantages, including (among others):

  • Lower latency
  • Scalable performance (roughly 1 GB/s per lane, and PCIe 3.0 x8 cards have 8 lanes – that’s what the “x8” means; see the quick arithmetic after this list)
  • Increased I/O (up to 40 PCIe lanes per CPU socket)
  • Low power
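
Here is that lane arithmetic from the scalability bullet, spelled out. The ~1 GB/s per lane figure is an approximation for PCIe 3.0; real devices lose a little to protocol overhead:

```python
# Approximate PCIe 3.0 bandwidth by link width (~1 GB/s per lane, per direction).
GB_PER_S_PER_LANE = 1.0  # rough figure; usable throughput is slightly lower in practice

for lanes in (1, 4, 8, 16):
    print(f"PCIe 3.0 x{lanes}: ~{lanes * GB_PER_S_PER_LANE:.0f} GB/s per direction")

# A typical NVMe SSD uses an x4 link, so ~4 GB/s of raw interface bandwidth per drive.
```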

This performance of PCIe, as shown above, is significant. Placing a SSD on that PCIe interface was, and is, inevitable. However, there needed to be a standard way to communicate with the SSDs through the PCIe interface, or else there would be a free-for-all for implementations. The interoperability matrices would be a nightmare!

Enter NVMe.

NVM Express is that standardized way of communicating with a NVM storage device, and is backed by an ever-expanding consortium of hardware and software vendors to make your life easier and more productive with Flash technologies.

Think of PCIe as the physical interface and NVMe as the protocol for managing NVM devices using that interface.

Not Just PCIe, Fabrics!

Just like with SCSI, an interest has emerged in moving the storage outside of the host chassis itself. It’s possible to do this with PCIe, but it requires extending the PCIe hardware interface outside as well, and as a result there has been some interest in using NVMe with other interface and storage networking technologies.

Work is progressing on not just point-to-point communication (PCIe, RDMA-friendly technologies such as RoCE and iWARP), but also Fabrics as well (InfiniBand-, Ethernet- and Fibre Channel-based, including FCoE).

By expanding these capabilities, it will be possible to attach hundreds, maybe thousands, of SSDs in a network – far more than can be accommodated via PCIe-based systems. There are some interesting aspects for NVMe using Fabrics that are best left for another blog post, but it was worth mentioning here as an interesting tidbit.

Bottom Line

NVM Express has been in development since 2007, believe it or not, and has just released version 1.2  as of this writing. Working prototypes of NVMe using PCIe- and Ethernet-based connectivity have been demonstrated for several years. It’s probably the most-developed standards work that few people have ever heard about!

Want to learn more? I encourage you to listen/watch the SNIA ESF Webcast “The Performance Impact of NVMe and NVMe over Fabrics”  which goes into much more technical detail, sponsored by the NVMe Promoter’s Board and starring some of the brainiacs behind the technical work. Oh, and I’m there too (as the moderator, of course).

Keep an eye open for more information about NVMe in the coming months. The progress made is remarkably rapid, and more companies are joining each year. It will be extremely interesting to see just how creative the data center architectures can get in the coming years, as Flash technology comes to realize its full potential.

Update: Want to learn more about NVMe? Check out these SNIA ESF webcasts:


Cloud File Services: SMB/CIFS and NFS…in the Cloud – Q&A

At our recent live ESF Webcast, “Cloud File Services: SMB/CIFS and NFS…in the Cloud” we talked about evaporating your existing file server into the cloud. Over 300 people have viewed the Webcast. If you missed it, it’s now available on-demand. It was an interactive session with a lot of great questions from attendees. We did not have time to address them all – so here is a complete Q&A from the Webcast. If you think of additional questions, please feel free to comment on this blog.

Q. Can your Storage OS take advantage of born-in-the-cloud File Storage like Zadara Storage at AWS and Azure?

A. The concept presented is generic in nature.   Whichever storage OS the customer chooses to use in the cloud will have its own requirements on the underlying storage beneath it.   Most Storage OS’s used for Cloud File Services will likely use block or object backends rather than a file backend.

Q. Regarding Cloud File Services for “client file services”: traditional file services require the client and server to be connected and on the same network, and they are tied to identities available in the network. How can the SMB/NFS protocols be used to serve data from the cloud to clients that could be coming from different networks (4G/corporate)? Isn’t REST the appropriate interface for that model?

A. The answer depends on the use case. There are numerous examples of SMB over the WAN, for example, so it’s not far-fetched to imagine someone using Cloud File Services as an alternative to a “Sync & Share” solution for client file services. REST (or similar) may be appropriate for some, while file-based protocols will work better for others. Cloud File Services provides the capability to extend into this environment where it couldn’t before.

Q. Is Manila like VMware VSAN or VASA?

A. Please take a look at the Manila project on OpenStack’s website https://wiki.openstack.org/wiki/Manila

Q. How do you take care of data security while moving data from on-premises to cloud (Storage OS)?

A. The answer depends on the Storage OS you are using for your Cloud File Services platform.   If your Storage OS supports encryption, for example, in its storage-to-storage in-flight data transport, then data security in-flight would be taken care of.   There are many facets to security that need to be thought through, including security at rest, some of which may depend on the environment (private/on-premises, service provider, hyperscalar) the Storage OS is sitting in.

Q. How do you get the data out of the cloud?   I think that’s been a traditional concern with cloud storage.

A. That’s the beauty of Cloud File Services!   With data movement and migration provided at the storage-level by the same Storage OS across all locations, you can simply move the data between on-premises and off-premises and expect similar behavior on both ends.   If you choose to put data into a native environment specific to a hyperscalar or service provider, you run the risk of lock-in.

Q. 1. How does one address the issue of “chatty” applications over the cloud? 2. File services have “poor” performance for small files. How does one address that issue? Block & object do address that issue. 3. Why not expose SMB, NFS, and object interfaces on the compute node?

A. 1. We should take this opportunity to make the applications less chatty! 🙂   One possible solution here is to operate the application and Storage OS in the same environment, in much the same way you would have on-premises.   If you choose a hyperscalar or service provider, for chatty use cases, it may be best to keep the application and storage pieces “closer” together.

2. Newer file protocols are getting much better at this.   SMB 3.02 for instance, was optimized for 8K transactions.   With a modern Storage OS, you will be able to take advantage of new developments.

3. That is precisely the idea: the Storage OS operating in the “compute nodes,” serving out their interfaces, while taking advantage of different backend offerings for cost and scalability.

Q. Most storage arrays (NetApp, EMC, etc.) can provide 5 9s of resilience; cloud VMs typically offer 3 9s. How do you get to 5 9s with CFS?

A. Cloud File Services (CFS) as a platform can span across all of your environments, and as such, the availability guarantees will depend upon each environment in which CFS is operating.

Q. Why are we “adding” another layer? Why can we just use powerful “NAS” devices that can have different media like NVMe, Flash SSD or HDDs?

A. Traditional applications may not want to change, but this architecture should suit those well.   It’s worth examining that “cloud-ready” model.   Is the goal to be “cloud-ready,” or is the goal to support the scaling, failover, and on-demand-ness that the cloud has the ability to provide?   Shared nothing is a popular way of accomplishing some of this, but it may not be the only way.

The existing interfaces provided by hyperscalars do provide abstraction, but if you are building an application, you run a strong risk of lock-in on any particular abstraction.   What is your exit strategy then?   How do you move your data (and applications) out?

By leveraging a common Storage OS across your entire infrastructure (on-premises, service providers, and hyperscalars), you have a very simple exit strategy, and your exit and mobility strategy become very similar, if not the same, with the ability to scale or move across any environment you choose.

Q. How do you virtualize storage OS? What happens to native storage OS hardware/storage?

A. A Storage OS can be virtualized similar to a PC or traditional server OS.   Some pieces may have to be switched or removed, but it is still an operating system.

Q. Why is your Storage displayed as part of your Compute layer?

A. In the hyperscalar model, the Storage OS is sitting in the compute layer because it is, in effect, running as a virtual machine the same as any other.   It can then take advantage of different tiers of storage offered to it.

Q. My concern is that it would be slower as a VM than a storage controller. There’s really no guarantee of storage performance in the cloud; in fact, most hyperscalers won’t give me a good SLA without boatloads of money. How might you respond to this?

A. Of course with on-premises infrastructure, a company or service provider will have more of a guarantee in the sense that they control the hardware behind it.   However, as we’ve seen, SLA’s continue to improve over time, and costs continue to come down for the Public Cloud.

Q. Does FreeNAS qualify as a Storage OS?

A. I recommend checking with their team.

Q. Isn’t this similar to Hybrid cloud?

A. Cloud File Services (CFS) is one way of looking at Hybrid Cloud.   Savvy readers and listeners will pick up that having the same Storage OS everywhere doesn’t necessarily limit you to only File Services. iSCSI or RESTful interfaces could work exactly the same.

Q. What do you mean by Storage OS? Can you give some examples?

A. As I work for NetApp, one example is Data ONTAP.   EMC has several as well, such as one for the VNX platform.   Most major storage vendors will have their own OS.

Q. I think one of the key questions is data access latency over the WAN: how can I move my data to the cloud, and how can I move it back when needed – for example, when the service is terminated?

A. Latency is a common concern, and connectivity is always important.   Moving your data into and out of the cloud is the beauty of the Cloud File Services platform, as I mentioned in other answers.   If one of your environments goes down (for example, your on-premises datacenter) then you would feasibly be able to shift your workloads over to one of your other environments, similar to a DR situation.   That is one example of where storage replication and application awareness across sites is important.

Q. Running applications like Oracle, Exchange through SMB/NFS (NAS), don’t you think it will be slow compared to FC (block storage)?

A. Oracle has had great success running over NFS, and it is extremely popular.   While Exchange doesn’t currently support running directly over SMB at this time, it’s not ludicrous to think that it may happen at some point in the future, in the same way that SQL has.

Q. What about REST and S3 API or are they just for object storage?   What about CINDER?

A. The focus of this presentation was only File Services, but as I mentioned in another answer, if your Storage OS supports these services (like REST or S3), it’s feasible to imagine that you could span them in the same way that we discussed CFS.

Q. Why are SAN-based applications moving to NAS?

A. This was discussed in one of the early slides in the presentation (slide 10, I believe).   Data mobility and granular management were discussed, as it’s easier to move, delete, and otherwise manage files than LUNs, an admin can operate at a more granular level, and it’s easier to operate and maintain.   No HBA’s, etc.   File protocols are generally considered “easier” to use.


New Webcast: Cloud File Services: SMB/CIFS and NFS…in the Cloud

Imagine evaporating your existing file server into the cloud with the same experience and level of control that you currently have on-premises. On October 1st, ESF will host a live Webcast that introduces the concept of Cloud File Services and discusses the pros and cons you need to consider.

There are numerous companies and startups offering “sync & share” solutions across multiple devices and locations, but for the enterprise, caveats are everywhere. Register now for this Webcast to learn:

  • Key drivers for file storage
  • Split administration with sync & share solutions and on-premises file services
  • Applications over File Services on-premises (SMB 3, NFS 4.1)
  • Moving to the cloud: your storage OS in a hyperscalar or service provider
  • Accommodating existing File Services workloads with Cloud File Services
  • Accommodating  cloud-hosted applications over Cloud File Services

This Webcast will be a vendor-neutral and informative discussion on this hot topic. And since it’s live, we encourage you to bring your questions for our experts. I hope you’ll register today and we’ll look forward to having you attend on October 1st.


Upcoming Plugfests at SDC

This year’s SNIA Storage Developer Conference (SDC) will take place in Santa Clara, CA Sept. 15-18.   In addition to an exciting agenda with great speakers, there is an opportunity for vendors to participate in SNIA Plugfests. Two Plugfests that I think are worth noting are: SMB2/SMB3 and iSCSI.

These Plugfests enable a vendor to bring their implementations of SMB2/SMB3 and/or iSCSI to test, identify, and fix bugs in a collaborative setting with the goal of providing a forum in which companies can develop interoperable products. SNIA provides and supports networks and infrastructure for the Plugfest, creating a collaborative framework for testing. Plugfest participants work together to define the testing process, assuring that objectives are accomplished.

Still Time to Register

Great news! There is still time to register. Setup for the Plugfest begins on September 13, 2014 and testing begins on September 14th.

Register here for the SMB2/SMB3 Plugfest

Register here for the iSCSI Plugfest

What to Expect at a Plugfest

Learn more about what takes place at the Plugfests by watching the  video interview of Jeremy Allison, Co-Creator of Samba, as he candidly talks about what to expect at an SDC Plugfest.

Learn more about the Plugfest registration process. If you have additional questions, please contact Arnold Jones (arnold@snia.org).


Expanding Your Data Center with FCoE – Q&A

At our recent live ESF Webcast, “Expert Insights: Expanding the Data Center with FCoE,” we examined the current state of FCoE and looked at how this protocol can expand the agility of the data center. If you missed it, it’s now available on-demand. We did not have time to address all the questions, so here are answers to them all. If you think of additional questions, please feel free to comment on this blog.

Q. You mentioned using 40 and 100G for inter-switch links.   Are there use cases for end point (FCoE target and initiator) 40 and 100G connectivity?

A. Today most end points are only supporting 10G, but we are starting to see 40G server offerings enter the market, and activity among the storage vendors designing these 40G products into their arrays.

Q. What about interoperability between FCoE switch vendors?

A. Each switch vendor has its own support matrix, which would need to be examined independently.

Q. Is FCoE supported on copper cable?

A. Yes, FCoE supports “Twinax” copper, which is widely used for server to top-of-rack switch connections up to seven meters. In fact, Converged Network Adapters are now available that support 10GBASE-T copper cables with the familiar RJ-45 jack. At least one major switch vendor has qualified FCoE running over 10GBASE-T to 30 meters.

Q. What distance does FCoE support?

A. Distance limits are dependent on the hardware in use and the buffering available for Priority Flow Control. The lengths can vary from 3m up to over 80km. Top of rack switches would fall into the 3m range while larger class switch/directors would support longer lengths.

Q. Can FCoE take part in management/orchestration by OpenStack Neutron?

A. As of this writing there are no OpenStack extensions in Neutron for FCoE-specific plugins.

Q. So how is this FC-BB-6 different than FIP snooping?

A. FIP Snooping is a part of FC-BB-5 (Appendix D), which allows switch devices to identify an FCoE Frame format and create a forwarding ACL to a known FCF. FC-BB-6 creates additional architectural elements for deployments, including a “switch-less” environment (VN2VN), and a distributed switch architecture with a controlling FCF. Each of these cases is independent from the other, and you would choose one instead of the others. You can learn more about VN2VN from our SNIA-ESF Webcast, “How VN2VN Will Help Accelerate Adoption of FCoE.”

Q. You mentioned DCB at the beginning of the presentation. Are there other purposes for DCB? Seems like a lot of change in the network to create a DCB environment for just FCoE. What are some of the other technologies that can take advantage of DCB?

A. First, DCB is becoming very ubiquitous. Unlike the early days of the standard, when only a few switches supported it, today most enterprise switches support the DCB protocols. As far as other use cases for DCB, iSCSI benefits from it, since a lossless fabric eliminates dropped packets and the TCP/IP backoff algorithm that kicks in when packets are dropped, smoothing out response time for iSCSI traffic. There is also a protocol known as RoCE, or RDMA over Converged Ethernet. RoCE requires the lossless fabric that DCB creates to achieve consistent low latency and high bandwidth; this is basically the InfiniBand API running over Ethernet. Microsoft’s latest file serving protocol, SMB Direct, and Hyper-V Live Migration can utilize RoCE, and there is an extension to iSCSI known as iSER, which replaces TCP/IP with RDMA for the iSCSI datamover, enabling all iSCSI reads and writes to be done as RDMA operations using RoCE.

Q. Great point about RoCE. iSCSI RDMA (iSER) requires DCB if the adapters use RoCE, right?

A. Agreed. Please see the answer above to the DCB question.

Q. Did that Boeing Aerospace diagram still have traditional FC links, and if yes, where?

A. There was no Fibre Channel storage attached in that environment. Having the green line in the legend was simply to show that Fibre Channel would have its own color should there be any such links.

Q. What is the price of a 10 Gbps CNA compared to a 10 Gbps NIC?

A. Price is dependent on vendor and economics. But, there are several approaches to delivering the value of FCoE which can influence pricing:

  • Purpose-built silicon that offloads the FC and Ethernet protocol functions offers a number of advantages, including high performance, low CPU overhead, advanced features, etc., though even this depends on the vendor’s implementation. These added features come with the expectation of additional cost. But the processing of the protocols has to be done somewhere, and if you need your server CPUs to process applications instead of network protocols, then the value is justified.
  • With the introduction of Open FCoE drivers with DCB supported NICs, new options are available for customers to deploy the value of FCoE at the host. Open FCoE offloads the FC processing onto the host CPU and standard 10GbE NICs with DCB support can be used to manage the Ethernet transport functions. Where you have excess CPU capacity on your server, you might be in a position to reduce costs and deploy a software driver with  a 10GbE or faster NIC enhanced with the limited set of hardware offloads necessary to achieve full performance with Open FCoE. However, Open FCoE isn’t available with every OS or every NIC, so you need to consider OS support and availability.
  • A third consideration is that most enterprise servers include some form of advanced 10GbE networking on the motherboard that either supports purpose built silicon or DCB enabled silicon. So, depending upon which server and OS you deploy, you may have several options via embedded silicon.