Block Storage - SNIA on Data, Networking & Storage

File, Block and Object Storage: Real-world Questions, Expert Answers

May 16, 2018 John Kim

More than 1,200 people have already watched our Ethernet Storage Forum (ESF) Great Storage Debate webcast “File vs. Block vs. Object Storage.” If you haven’t seen it yet, it’s available on demand. This great debate generated many interesting questions. As promised, our experts have answered them all here.

Q. What about the encryption technologies on file storage? Do they exist, and how do they affect the performance compared to unencrypted storage?

A. Yes, encryption of file data at rest can be done by the storage software, operating system, or the drives themselves (self-encrypting drives). Encryption of file data on the wire can be done by the storage software, OS, or specialized network cards. These methods can usually also be applied to block and object storage. Encryption requires processing power so if it’s done by the main CPU it might affect performance. If encryption is offloaded to the HBA, drive, or SmartNIC then it might not affect performance.

Q. Regarding block size, I thought that block size settings were also used to tune and optimize file protocol transfer, for example in NFS, am I wrong?

A. That is correct, block size refers to the size of data in each I/O and can be applied to block, file and object storage, though it may not be used very often for object storage. NFS and SMB both let you specify the block I/O size.

Q. What is the main difference between object and file? Is it true that File has a hierarchical structure, while object does not?

A. Yes that is one important difference. Another difference is the access method–folder/file/offset for files and key-value for objects. File storage also often allows access to specific data within a file and in many cases shared writes to the same file, while object storage typically offers only shared reads and most object storage systems do not allow direct updates to existing objects.

Q. What is the best way to back up a local Object store system?

A. Most object storage systems have built-in data protection using either replication or erasure coding which often replicates the data to one or more remote locations. If you deploy local object storage that does not include any remote replication or erasure coding protection, you should implement some other form of backup or replication, perhaps at the hardware or operating system level.

Q. I feel that this discussion conflates object storage with cloud storage features, and presumes certain cloud features (for example security) that are not universally available or really part of Object Storage. This is a very common problem with discussions of objects — they typically become descriptions of one vendor’s cloud features.

A. Cloud storage can be block, file, and/or object, though object storage is perhaps more popular in public and private cloud than it is in non-cloud environments. Security can be required and deployed in both enterprise and cloud storage environments, and for block, file and object storage. It was not the intention of this webinar to conflate cloud and object storage; we leave that to the SNIA Cloud Storage Initiative (CSI).

Q. How do open source block, file and object storage products play into the equation?

A. Open source software solutions are available for block, file and object storage. As is usually the case with other open-source, these solutions typically make storage (block, file or object) available at a lower acquisition cost than commercial storage software or appliances, but at the cost of higher complexity and higher integration/support effort by the end user. Thus customers who care most about simplicity and minimizing their integration/support work tend to buy commercial appliances or storage software, while large customers who have enough staff to do their own storage integration, testing and support may prefer open-source solutions so they don’t have to pay software license fees.

Q. How is data [0s and 1s in hard disk] converted to objects or vice versa?

A. In the beginning there were electrons, with conductors, insulators, and semi-conductors (we skipped the quantum physics level of explanation). Then there were chip companies, storage companies, and networking companies. Then The Storage Networking Industry Association (SNIA) came along… The short answer is some software (running in the storage server, storage device, or the cloud) organizes the 0s and 1s into objects stored in a file system or object store. The software makes these objects (full of 0s and 1s) available via a key-value system and/or a RESTful API. You submit data (stream of 1s and 0s) and get a key-value in return. Or you submit a key-value and get the object (stream of 1s and 0s) in return.
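As a rough illustration of the put/get exchange described above, here is a minimal in-memory sketch. The class name `ToyObjectStore`, its methods, and the use of a SHA-256 content hash as the key are assumptions made for illustration only; real object stores may assign keys differently and typically expose this exchange through a RESTful API.

```python
import hashlib


class ToyObjectStore:
    """Hypothetical in-memory object store: bytes in, key out; key in, bytes out."""

    def __init__(self):
        self._objects = {}  # maps key -> stored bytes

    def put(self, data: bytes) -> str:
        # Assumption: the key is derived from a SHA-256 hash of the content.
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = bytes(data)
        return key

    def get(self, key: str) -> bytes:
        # Raises KeyError if the key is unknown.
        return self._objects[key]


store = ToyObjectStore()
key = store.put(b"\x00\x01\x01\x00")          # submit a stream of 0s and 1s
assert store.get(key) == b"\x00\x01\x01\x00"  # submit the key, get the object back
```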

Q. What is the difference (from an operating system perspective where the file/object resides) between a file in mounted NFS drive and object in, for example Google drive? Isn’t object storage (under the hood) just network file system with rest API access?

A. Correct–under the hood there are often similarities between file and object storage. Some object storage systems store the underlying data as file and some file storage systems store the underlying data as objects. However, customers and applications usually just care about the access method, performance, and reliability/availability, not the underlying storage method.

Q. I’ve heard that an Achilles’ Heel of Object is that if you lose the name/handle, then the object is essentially lost. If true, are there ways to mitigate this risk?

A. If you lose the name/handle or key-value, then you cannot access the object, but most solutions using object storage keep redundant copies of the name/handle to avoid this. In addition, many object storage systems also store metadata about each object and let you search the metadata, so if you lose the name/handle you can regain access to the object by searching the metadata.

Q. Why don’t you mention concepts like time to first byte for object storage performance?

A. Time to first byte is an important performance metric for some applications and that can be true for block, file, and object storage. When using object storage, an application that is streaming out the object (like online video streaming) or processing the object linearly from beginning to end might really care about time to first byte. But an application that needs to work on the entire object might care more about time to load/copy the entire object instead of time to first byte.

Q. Could you describe how storage supports data temperatures?

A. Data temperatures describe how often data is accessed, where “hot” data is accessed often, “warm” data occasionally, and “cold” data rarely. A storage system can tier data so the hottest data is on the fastest storage while the coldest data is on the least expensive (and presumably slowest) storage. This could mean using block storage for the hot data, file storage for the warm data, and object storage for the cold data, but that is just one option. For example, block storage could be for cold data while file storage is for hot data, or you could have three tiers of file storage.

Q. Fibre channel uses SCSI. Does NVMe over Fibre Channel use SCSI too? That would diminish NVMe performance greatly.

A. NVMe over Fabrics over Fibre Channel does not use the Fibre Channel Protocol (FCP) and does not use SCSI. It runs the NVMe protocol over a FC-NVMe transport on top of the physical Fibre Channel network. In fact none of the NVMe over Fabrics options use SCSI.

Q. I get confused when someone says block size for block storage, and also block size for NFS storage and object storage. Does block size mean something different for different storage types?

A. In this case “block size” refers to the size of the data access and it can apply to block, file, or object storage. You can use 4KB “block size” to access file data in 4KB chunks, even though you’re accessing it through a folder/file/offset combination instead of a logical block address. Some implementations may limit which block sizes you can use. Object storage tends to use larger block sizes (128KB, 1MB, 4MB, etc.) than block storage, but this is not required.
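To make the file case concrete, the sketch below reads a file in 4KB chunks through a path and offset rather than a logical block address. The helper name `read_in_blocks` and the temporary 10,000-byte demo file are illustrative assumptions, not part of any storage protocol.

```python
import os
import tempfile

BLOCK_SIZE = 4096  # 4KB "block size" used for each read


def read_in_blocks(path, block_size=BLOCK_SIZE):
    """Yield (offset, chunk) pairs, reading the file block_size bytes at a time."""
    with open(path, "rb") as f:
        offset = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield offset, chunk
            offset += len(chunk)


# Demo on a temporary 10,000-byte file (made-up data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10000)
    demo_path = tmp.name

chunks = list(read_in_blocks(demo_path))
os.remove(demo_path)
print([(off, len(c)) for off, c in chunks])
```

The last chunk is shorter than 4KB because the file size is not a multiple of the block size; the access unit is a property of the I/O, not of the file.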

Q. One could argue that file system is not really a good match for big data. Would you agree?

A. It depends on the type of big data and the access patterns. Big data that consists of large SQL databases might work better on block storage if low latency is the most important criterion. Big data that consists of very large video or image files might be easiest to manage and protect on object storage. And big data for Hadoop or some machine learning applications might work best on file storage.

Q. It is my understanding that the unit for both File Storage & Object storage is File – so what is the key/fundamental difference between the two?

A. The unit for file storage is a file (folder/file/offset or directory/file/offset) and the unit for object storage is an object (key-value or object name). They are similar but not identical. For example file storage usually allows shared reads and writes to the same file, while object storage usually allows shared reads but not shared writes to the object. In fact many object storage systems do not allow any writes or updates to the middle of an object–they either allow only appends to the end of the object or don’t allow any changes to an object at all once it has been created.

Q. Why is key value store more efficient and less costly for PCIe SSD? Can you please expand?

A. If the SSD supports key-value storage directly, then the applications or storage servers don’t have to perform the key-value translation. They simply submit the key value and then write or read the related data directly from the SSDs. This reduces the cost of the servers and software that would otherwise have to manage the key-value translations, and could also increase object storage performance. (Key-value storage is not inherently more efficient for PCIe SSDs than for other types of SSDs.)

Interested in more SNIA ESF Great Storage Debates? Check out:

If you have an idea for another storage debate, let us know by commenting on this blog. Happy debating!

File vs. Block vs. Object Storage – Are Worlds Colliding?

March 16, 2018 John Kim

When it comes to storage, a byte is a byte is a byte, isn’t it?

One of the enduring truths about simplicity is that scale makes everything hard, and with that comes complexity. And when we’re not processing the data, how do we store it and access it?

The only way to manage large quantities of data is to make it addressable in larger pieces, above the byte level. For that, we’ve designed sets of data management protocols that help us do several things: address large lumps of data by some kind of name or handle, organize it for storage on external storage devices with different characteristics, and provide protocols that allow us to programmatically write, find, and read it.

On April 17th, the SNIA Ethernet Storage Forum will host another of its “Great Debates” webcasts. This time, it’s “File vs. Block vs. Object Storage.” In this live webcast, our experts, Mark Carlson, Alex McDonald and Saqib Jang will compare three types of data organization: file, block and object storage, and the access methods that support them. Each has its own set of use cases, advantages and disadvantages. Each provides data management ranging from simple to sophisticated, and each makes different demands on storage devices and programming technologies.

Perhaps you’re comfortable with block and file, but are interested in investigating the more recent class of object storage and access. Perhaps you’re happy with your understanding of objects, but would really like to understand files a bit better. Or perhaps you want to understand how file, block and object are implemented on the underlying storage systems – and how one can be made to look like the other, depending on how the storage is accessed. Join us as we discuss and debate:

Storage devices
- How different types of storage drive different management & access solutions
- Which use cases tend to favor block, file or object
Block
- Where everything is in fixed-size chunks
- SCSI and SCSI-based protocols, and how FC and iSCSI fit in
Files
- When everything is a stream of bytes
- NFS and SMB
Objects
- When everything is a BLOB
- HTTP, key value and RESTful interfaces
Altogether…
- When files, blocks and objects collide, it will rock your world!

I will be moderating this “friendly debate” where there won’t be winners or losers, just more information on these three popular data storage technologies. We hope you will register today to come join the debate on April 17th.

And if you missed our first hugely popular “Great Debate” – Fibre Channel vs. iSCSI, it’s now available on-demand.

An FAQ to Make Your Storage System Hum

May 23, 2017 / July 18, 2017 Fred Zhang

In our most recent “Everything You Wanted To Know About Storage But Were Too Proud To Ask” webcast series – Part Sepia – Getting from Here to There, we discussed terms and concepts that have a profound impact on storage design and performance. If you missed the live event, I encourage you to check it out on-demand. We had many great questions on encapsulation, tunneling, IOPS, latency, jitter and quality of service (QoS). As promised, our experts have gotten together to answer them all.

Q. Is there a way to measure jitter?

A. Jitter can be measured directly as a statistical function of the latency, typically as the variance or standard deviation of the latency. For example, a storage device might show an average latency of 5ms with a standard deviation of 1.5ms. If latencies are roughly normally distributed, this means about 95% of the transactions have a latency between 2ms and 8ms (average latency plus/minus two standard deviations). However, many storage customers measure jitter indirectly by reporting the 99.9%, 99.99%, or 99.999% latency. For example, if my storage system has a 99.99% latency of 8ms, it means 99.99% of transactions have latency <=8ms and 1/10,000 of transactions have latency >8ms. Percentile latency is an indirect measure of jitter but is often easier to calculate or understand than the actual jitter.
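Both measures are easy to compute once per-transaction latencies are recorded. The sketch below is one possible way to do it; the function names, the nearest-rank percentile method, and the sample values are assumptions chosen for illustration.

```python
import math
import statistics


def jitter_stats(latencies_ms):
    """Return (mean, population standard deviation) of a latency sample."""
    return statistics.mean(latencies_ms), statistics.pstdev(latencies_ms)


def percentile_latency(latencies_ms, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latency sample in milliseconds, with one slow outlier.
sample = [4.0, 5.0, 6.0, 5.0, 3.5, 6.5, 5.0, 12.0, 5.0, 4.5]

mean, stdev = jitter_stats(sample)
print(f"mean={mean:.2f}ms stdev={stdev:.2f}ms")
print(f"p90={percentile_latency(sample, 90)}ms p100={percentile_latency(sample, 100)}ms")
```

A single outlier inflates the standard deviation but only shows up in the highest percentiles, which is one reason percentile latency is often easier to interpret.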

Q. Can jitter be easily characterized for storage, media, and networks? How, and what tools are available for doing this?

A. Jitter is usually easy to measure on a network using standard network monitoring and reporting tools. It may or may not be easy to measure on storage systems or storage media, depending on the tools available (either built-in to the storage OS or using an external management or monitoring tool). If you can record the latency of each transaction or packet, then it’s easy to calculate and show the jitter using standard statistical measures such as Variance or Standard Deviation of the latency. What most customers do is just measure the 99.9%, 99.99%, or 99.999% latency. This is an indirect measure of jitter but is often much easier to report and understand than the actual jitter.

Q. Generally IOPS numbers are published for a particular block size like 8k write/read size, but in reality, IO request per second could be of mixed sizes, what is your perspective on this?

A. Most IOPS benchmarks test only one I/O size at a time. Most individual real workloads (for example databases) also use only one I/O size. It is true that a storage controller or HDD/SSD might need to support multiple workloads simultaneously, each with a different I/O size. While it is possible to run benchmarks with a mix of different I/O sizes, it’s rarely done because then there are too many workload combinations to test and publish. Some storage devices do not perform well if they must handle both small random and large sequential workloads simultaneously, so a smart storage controller might assign different workload types to different disk groups.

Q. One often misconfigured parameter is queue depth. Can you talk about how this relates to IOPS, latency and jitter?

A. Queue depth indicates how many tasks or I/Os can be lined up for a particular resource, such as a storage controller, network interface, or CPU. Having a higher queue depth ensures the resource stays highly utilized because it always has a new task to do as soon as it finishes its current task(s). This can result in higher IOPS because the CPU is less likely to have idle time waiting for new tasks to be put into its queue. But it could also increase latency because longer queues mean each task spends more time waiting in a queue. It’s easy to misconfigure queue depth because it needs to be deep enough to keep the resource (CPU/controller/interface) busy but not so deep that each transaction spends a long time in the queue.
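One way to reason about this trade-off is Little's Law, which at steady state relates the average number of outstanding I/Os to throughput and latency (outstanding I/Os = IOPS × latency). This is a simplified model that was not part of the webcast, and the function name and the latency figures below are made-up assumptions.

```python
def iops_from_queue_depth(queue_depth, avg_latency_s):
    """Little's Law at steady state: throughput = outstanding I/Os / average latency."""
    return queue_depth / avg_latency_s


# Illustrative (queue depth, average latency in ms) pairs for a hypothetical device.
for qd, latency_ms in [(1, 0.5), (8, 0.5), (32, 1.0), (128, 4.0)]:
    iops = iops_from_queue_depth(qd, latency_ms / 1000)
    print(f"QD={qd:>3} latency={latency_ms}ms -> {iops:,.0f} IOPS")
```

In this made-up data, going from a queue depth of 32 to 128 quadruples latency without adding any IOPS, which is the "too deep" case described above.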

Q. Can you please repeat all your examples of tunneling? GRE, MPLS, what others? How can it be IPv4 via IPv6?

A. VXLAN, LISP, GRE, MPLS, IPSEC. Any time you encapsulate one protocol, send it over another, and decapsulate at the other end to recover the original frame, that process is tunneling. In the case we showed of IPv6 over IPv4, you take an original IPv6 frame, whose header carries IPv6 source and destination addresses, and send it over an IPv4-enabled network by encapsulating the IPv6 frame with an IPv4 header, thereby “tunneling” IPv6 over the IPv4 network.
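The encapsulate/decapsulate steps can be sketched at the byte level. The code below prepends a minimal 20-byte IPv4 header (IP protocol number 41, which identifies an encapsulated IPv6 packet) and strips it again. It only builds bytes in memory and sends nothing; the function names, the documentation-range addresses, and the placeholder inner packet are assumptions for illustration.

```python
import ipaddress
import struct

IPPROTO_IPV6 = 41  # IP protocol number for IPv6 encapsulated in IPv4


def checksum(header: bytes) -> int:
    """Internet checksum: one's-complement sum of the header's 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def encapsulate(ipv6_packet: bytes, src_v4: str, dst_v4: str) -> bytes:
    """Prepend a minimal IPv4 header (no options) to an IPv6 packet."""
    def build(csum):
        return struct.pack(
            "!BBHHHBBH4s4s",
            0x45, 0, 20 + len(ipv6_packet),  # version/IHL, TOS, total length
            0, 0,                            # identification, flags/fragment offset
            64, IPPROTO_IPV6, csum,          # TTL, protocol, header checksum
            ipaddress.IPv4Address(src_v4).packed,
            ipaddress.IPv4Address(dst_v4).packed,
        )
    # Compute the checksum over a header whose checksum field is zero.
    return build(checksum(build(0))) + ipv6_packet


def decapsulate(ipv4_packet: bytes) -> bytes:
    """Strip the outer IPv4 header and return the original IPv6 packet."""
    header_len = (ipv4_packet[0] & 0x0F) * 4
    if ipv4_packet[9] != IPPROTO_IPV6:
        raise ValueError("not an IPv6-in-IPv4 packet")
    return ipv4_packet[header_len:]


inner = b"\x60" + b"\x00" * 39  # placeholder 40-byte IPv6 header, no payload
outer = encapsulate(inner, "192.0.2.1", "198.51.100.2")
assert decapsulate(outer) == inner
```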

Q. I think it’d be possible to configure QoS to a point that exceeds the system capacity. Are there any safeguards on avoiding this scenario?

A. Some types of QoS allow over-provisioning and others do not. For example a QoS that imposes only maximum limits (and no minimum guarantees) on workloads might not prevent many workloads from exceeding system capacity. If the QoS allows over-provisioning, then you should use system monitoring and alerts to warn you when system capacity has been exceeded, or when any workloads are not getting their minimum guaranteed performance.
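A simple safeguard is to compare the configured QoS values against system capacity before applying them. The sketch below assumes each workload has a minimum guarantee and a maximum limit expressed in IOPS; the function name, the data layout, and all numbers are hypothetical.

```python
def check_qos(system_capacity_iops, workloads):
    """workloads maps a name to (min_guarantee_iops, max_limit_iops)."""
    total_min = sum(lo for lo, _ in workloads.values())
    total_max = sum(hi for _, hi in workloads.values())
    return {
        # Minimum guarantees can only all be honored if their sum fits.
        "guarantees_fit": total_min <= system_capacity_iops,
        # Maximum limits alone do not stop the total from exceeding capacity.
        "limits_overprovisioned": total_max > system_capacity_iops,
    }


# Made-up workloads on a hypothetical 100,000 IOPS system.
workloads = {
    "database": (20_000, 50_000),
    "vdi": (10_000, 40_000),
    "backup": (0, 30_000),
}
report = check_qos(100_000, workloads)
print(report)
```

Here the guarantees fit, but the limits add up to more than the system can deliver, so monitoring and alerts are still needed.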

Q. Is there any research being done on using storage analytics along with artificial intelligence (AI) to assist with QoS?

A. There are a number of storage analytics products, both third party and storage vendor specific that help with QoS. Whether any of these tools may be described as using AI is debatable, since we’re in the early days of using AI to do much in the storage arena. There are many QoS research projects, and no doubt they will eventually make their way into commercially available products if they prove useful.

Q. Are there any methods (measurements) to calculate IOPS/MBps in tier capable storage? Would it be wrong metric if we estimate based on medium level, example tier 2 (between 1 and 3)?

A. This question needs refinement, since tiering is sometimes a cache model rather than a data movement model. And knowing the answer may not actually help! Vendors do have tools (normally internal, since they are quite complex) that can help with the planning of tiered storage.

By now, we hope you’re not “too proud” to ask some of these storage networking questions. We’ve produced four other webcasts in this “Everything You Wanted To Know About Storage” series to date. They are all available on-demand. And you can register here for our next one on July 6th where we’ll bring in experts to discuss:

Storage APIs and POSIX
Block, File, and Object storage
Byte Addressable and Logical Block Addressing
Log Structures and Journaling Systems

The Ethernet Storage Forum team and I hope to see you there!

Update: If you missed the live event, it’s now available on-demand. You can also download the webcast slides.

Clustered File Systems: No Limits

October 7, 2016 / July 21, 2017 John Kim

Today’s storage world would appear to have been divided into three major and mutually exclusive categories: block, file and object storage. The marketing that shapes much of the user demand would appear to suggest that these are three quite distinct animals, and many systems are sold as exclusively either SAN for block, NAS for file or object. And object is often conflated with cloud, a consumption model that can in reality be block, file or object.

A fixed taxonomy that divides the storage world this way is very limiting, and can be confusing; for instance, when we talk about cloud. How should providers and users buy and consume their storage? Are there other classifications that might help in providing storage solutions to meet specific or more general application needs? What about customers who need file access performance beyond what one storage box can provide? Which options support those who want a scale-out solution like object storage with file protocol semantics?

To clear up the confusion, the SNIA Ethernet Storage Forum is hosting a live Webcast, “Clustered File Systems: No Limits.” In this Webcast we will explore clustered storage solutions that not only provide multiple end users access to shared storage over a network, but allow the storage itself to be distributed and managed over multiple discrete storage systems. You’ll hear:

General principles and specific clustered and distributed systems and the facilities they provide built on the underlying storage
Better known file systems like NFS, IBM Spectrum Scale (GPFS) and Lustre, along with a few of the less well known
How object based systems like S3 have blurred the lines between them and traditional file based solutions

This Webcast should appeal to those interested in exploring some of the different ways of accessing & managing storage, and how that might affect how storage systems are provisioned and consumed. POSIX and other acronyms may be mentioned, but no rocket science beyond a general understanding of the principles of storage will be assumed. Contains no nuts and is suitable for vegans!

As always, our experts will be on hand to answer your questions on the spot. Register now for this October 25th event.

Update: If you missed the live event, it’s now available on-demand. You can also download the webcast slides.

A Q&A on Storage Performance Benchmarking: Block Components

April 21, 2016 / July 20, 2017 David Fair

For the third time, our storage performance benchmarking experts, Ken Cantrell and Mark Rogov, have generated an abundance of interest (in the form of questions) on block storage performance. If you missed the Webcast, “Storage Performance Benchmarking: Block Components,” it’s available on demand. It was no small effort to answer all the great questions that we received. And for those of you who have been waiting, we apologize, but we think the detailed and thoughtful answers Mark and Ken have put together are well worth the wait.

Q1: Are these numbers applicable to the 90th percentile for any given storage array, please?

Mark: These numbers represent HDD/SSD performance numbers. They aren’t meant to represent any particular storage array vendor’s performance. See the end of our presentation (bottleneck analysis) as to why it is really really hard to answer your question.

Q2: How about NVDIMM-F or NVDIMM-P or NVDIMM-X claiming 3-4M IOPS type of Enterprise storage devices?

Ken: Yup. They’re fast.

There’s a great presentation by Jim Handy titled “Understanding the Intel/Micron 3D XPoint Memory” presented at SDC2015 that I’d recommend you take a look at to understand more about this kind of memory and its possible positioning.

Mark: Great question. I think the conclusion of our presentation answers it. Flash (and we use flash as a collective term, defining everything that is not spinning storage to be “flash”) is drastically faster than spinning drives. But even within Flash, there are plenty of new technologies which compete with each other and improve the overall performance landscape. So, within the scope of our presentation, even a simple good old SLC drive tops the capability of a SAS line. If we improve on one drive, by switching the technology to a faster/newer/better variant (e.g., NVDIMM-F), or by stacking the drives, the resulting set will much more likely expose the limitations of the “regular” storage array.

Q3: I’d like to know which tool you are using to measure IOPS if possible.

Ken: The SNIA Solid State Storage Initiative (SSSI) has developed substantial expertise in the area of SSD performance and behavior. The SSS Performance Test Specifications were developed by the SNIA SSS Technical Work Group (TWG) and define how to measure SSD performance in a manner that is accurate, repeatable and enables comparison between different manufacturers’ products. Learn more about the SSD Performance Project here.

All of the Flash and HDD numbers at the beginning of the presentation were taken directly from the Solid State Storage Performance Test Specification summary results (SSS PTS). The SSS PTS provides a comprehensive method for measuring flash performance in the most vendor neutral approach that I’ve seen.

The Flash and HDD numbers at the end of the presentation were 80% of the starting numbers – scaled down to make them slightly more like what we’ve seen in a greater number of environments (that aren’t pushing their drives as hard).

Q4: Throughputs with SSD is not as much as one can get from a spinning drive when one keeps cost/GB on the axis. Comments please.

Ken: Now we have 3 axes? I’m not even sure how to visualize what you’re asking, but I’m pretty sure I understand the intent … and this is a harder question than it would appear on the surface. Why?

First off, prices aren’t my thing – I tend to focus on the internals and let the sales guys talk prices. Additionally, vendors often engage in significant discounting or bundling that makes it difficult for the average person (i.e., me) to understand true costs.
The astounding random I/O performance of flash enables support for compression and deduplication without dramatically increasing client-perceived latency. There’s a reason you see so many vendors offering inline deduplication and inline compression now when they did not even five years ago – flash is the enabler that makes this happen. So what is the true comparison? Raw HDD vs Raw flash? Or Raw HDD vs flash plus the storage efficiency (SE) savings it enables? If flash with SE features (dedupe and compression), then what is the savings that you can/should expect for your dataset? 1.5x? 5x? 50x? Knowing this is a prerequisite to answering the question, and the answer will be dependent both on the vendor’s features and your own data set characteristics.
As we discussed in the first session, if your application/user base have some sort of minimum performance expectations, particularly around latency, then HDDs may simply not be able to provide you the performance you need. You DID mention throughput (IOPS?) explicitly and with IOPS, OPS, or data rates (MB/s), you can always match flash data rates with HDDs – it just might take a LOT more HDD drives than flash devices. Latency/response time is different though – depending on whether you are drive bound and what your I/O characteristics look like (read vs write, random vs sequential), you may simply be unable to ever hit your latency targets with HDD.
The world, it is a-changing. Six years ago it was easy to say “SSD for performance sensitive niche applications!” and smile. Today, prices continue to drop, vendors are making new decisions around the use of consumer grade vs enterprise grade flash, and overall flash/SSD is moving much more mainstream. And … consider the new 16TB (yes 16 TERABYTE) SSD drives announced by Samsung. My personal view (and I’m explicitly disclaiming that I’m speaking on my behalf, and not NetApp’s – which honestly, you should assume for all my answers) is that these are going to change the landscape almost as dramatically as SSD itself has.
There are definitely vendors that believe in the cost benefits of HDD. We chose not to mention specific vendors in the webcast, but consider Backblaze. In their blog, they are extremely open about how they have configured their data center – and they are an (all?) HDD shop. In fact, “by the end of 2015, the Backblaze datacenter had 56,224 spinning hard drives containing customer data.” Speaking of Backblaze, you might be interested in their assessment of the 16TB drive, for their shop.

You might also be interested in slide 21 of the following, which includes some price/performance numbers from EMC and Oracle.

Q5: Does NVMe drive technology move things to a higher level?

Ken: If you truly mean NAND-based flash accessed via NVMe instead of SAS/SATA, yes. Look at the perf results linked out of question 3. If you mean the use of next-generation non-volatile memory (NVM) instead of NAND-based flash, then yes. The following chart is contained in a lot of SNIA presentations; it does a good job of pointing out just how much faster we can get.

I also strongly recommend a look through of Advances in Non-Volatile Storage Technologies by Tom Coughlin from Coughlin Associates. If you care about these topics, the SNIA Storage Developer Conference is a great opportunity to learn more.

Q6: Why NAND gates and not AND gates?

Mark: NAND and NOR gates are known as “universal gates”–they can be combined in various groups and combinations to do any basic operations, i.e., AND, NOT, OR, etc. So, flash manufacturers had to choose between NAND and NOR. And just like with any technology, the price drove the choice. NAND gates are simply cheaper and slower. NORs are faster and more expensive. Actually, there are some NOR products in the market.
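The “universal gate” property is easy to check in a few lines. The sketch below builds NOT, AND and OR from NAND alone and verifies them against every input combination; the function names are illustrative, and this models only the logic, not flash cell behavior.

```python
def nand(a: bool, b: bool) -> bool:
    """The only primitive: true unless both inputs are true."""
    return not (a and b)


def not_(a: bool) -> bool:
    # NOT from NAND: feed the same input to both pins.
    return nand(a, a)


def and_(a: bool, b: bool) -> bool:
    # AND from NAND: invert the NAND output.
    return not_(nand(a, b))


def or_(a: bool, b: bool) -> bool:
    # OR from NAND: NAND of the inverted inputs (De Morgan's law).
    return nand(not_(a), not_(b))


# Exhaustively check all input combinations.
for a in (False, True):
    assert not_(a) == (not a)
    for b in (False, True):
        assert and_(a, b) == (a and b)
        assert or_(a, b) == (a or b)
```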

Q7: Mark accidentally said 15K was 15,000/sec when it’s 15,000/minute.

Ken: Thanks! (Shame on you Mark!)

Mark: Thank you… I can’t believe that I misspoke! I never do! Never! Ahh!!!

Mark’s Lawyer: On behalf of my client, I move to remove this question and the digital recording from Exhibit A to Exhibit B (aka “never again section”)

Q8: Do you guys have any data about how expensive an erase-modify-write operation is, compared with spinning disks in terms of performance?

Ken: This is what we were attempting to demonstrate in the first set of slides. The PTS (see question 3) forces flash devices into a steady state mode where they are continuously doing program-erase cycles. So the results shown there demonstrate the difference between HDD writes (seek, spin, write) and flash writes (erase and program).

Your question made me wonder though … so I also did a quick literature search. Interesting to see how rates have changed over time, and how they vary by device:

From M-Systems, in 2002: Erase cycle was 3ms

From Micron, in 2006: The erase time for a 128KB erase block was 500 µs

From AnandTech, in 2012: Erase time for SLC was 1.5-2ms, MLC was 3ms and TLC was ~4.5ms (huh? SLC vs MLC vs TLC?)

Q9: Why can’t the pointer be at the page level instead of a block level (say, metadata within a block)? I’m sure that there is a reason. What do we gain by treating an entire block as a monolithic?

Mark: This is an excellent question to ask Google. I think the reasons for selecting a NAND gate technology, and for bundling a bunch of NAND gates into groups and for creating blocks (in essence, super groups) is power. It takes less power to operate the drives with NAND gates and blocks.

Q10: I heard someone mention NOR gates, instead of NAND, are NOR gates persistent, over a power cycle?

Ken: Yes.

Mark: There are plenty of other “Logic gates” see this article on Wikipedia for more information.

Q11: So, there is no advantage in keeping IO sequentially in an SSD?

Ken: Technically, or practically? Technically speaking, I think it does matter. Micron documented this in 2006, noting that “Random access time on NOR Flash is specified at 0.075μs; on NAND Flash, random access time for the first byte only is significantly slower–25μs (see Table 2 on page 5). However, after initial access has been made, the remaining 2111 bytes are shifted out of NAND at a mere 0.025μs per byte.” The raw numbers have changed over the years, but I don’t believe the principle has. Violin Memory stated in 2013 that, “The idea of sequential I/O doesn’t exist with flash memory, because there is no physical concept of blocks being adjacent or contiguous. Logically, two blocks may have consecutive block addresses, but this has no bearing on where the actual information is electronically stored. You might therefore say that all flash I/O is random, but in truth the principles of random I/O versus sequential I/O are disk concepts so they don’t really apply.”

Practically speaking, I agree. Sequential vs random I/O is irrelevant for flash. Given (a) average I/O sizes for workloads and (b) the incredible performance of flash devices compared to the needs of the vast majority of people using them, it doesn’t much matter if you can access subsequent bytes in a NAND-based flash device faster than you can access the first bytes. They are plenty fast enough.

Note that it is hard to find public info on this. Sequential I/O tends to use larger I/O sizes, and random I/O uses smaller I/O sizes. So finding apples-to-apples comparisons between sequential and random I/O is difficult.
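To put the 2006 Micron figures quoted above into perspective, here is a small sketch of the arithmetic. It assumes a 2112-byte page (the first byte plus the “remaining 2111 bytes”); the function name and the single-page restriction are illustrative choices, not anything from the Micron note, and modern parts will have very different numbers.

```python
# Figures taken from the 2006 Micron quote above; treat them as historical.
FIRST_BYTE_US = 25.0   # random access time for the first byte on NAND
PER_BYTE_US = 0.025    # time to shift out each subsequent byte
PAGE_BYTES = 2112      # assumption: 1 first byte + the "remaining 2111 bytes"


def nand_read_time_us(num_bytes: int) -> float:
    """Time to read num_bytes from one page: one initial access, then serial shift-out."""
    if not 1 <= num_bytes <= PAGE_BYTES:
        raise ValueError("num_bytes must fit within a single page")
    return FIRST_BYTE_US + (num_bytes - 1) * PER_BYTE_US


if __name__ == "__main__":
    for n in (1, 512, PAGE_BYTES):
        total = nand_read_time_us(n)
        print(f"{n:5d} bytes: {total:8.3f} us total, {total / n:8.4f} us per byte")
```

The cost of the initial access is amortized as more bytes are read from the same page, which is the “technically it matters” half of the answer.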

Mark: Yes, the flash drive doesn’t care anymore. But the hosts and application still do. Where it matters is in the workloads. Ken and I are still planning to dedicate an entire hour talking about workloads, and Random vs. Sequential will surely be a large part of it. However, we will admit that in the future, when all storage will be flash (which is, of course, a pipe dream) it won’t matter anymore.

Q12: What is the acceptance level to Erasure Coding, and hence the change in the way Storage Performance testing will change?

Mark: As we said during the webcast, RAID is a special case of Erasure Coding. Therefore its acceptance rate is 100% 🙂 But on a more serious note, Erasure Coding is necessary for any scale-out system, and every vendor uses their own N+M rules.

Q13: Is RAID-1 always half the write performance? If the writes go to both drives simultaneously, I could see write performance being less than 100% of what one drive can do, but not half.

Ken: This was asked in a dry run as well. You’ve hit on something that seems to be a sticking point for multiple people. Perhaps consider it this way. It looks mathy and complicated, but bear with me …

Consider two physical drives. Call them P1 and P2.

Let the write performance (in iops) of P1 be P1_w.

Let the write performance (in iops) of P2 be P2_w.

How fast can P1 write? P1_w.

How fast can P2 write? P2_w.

If you can write to both P1 and P2 at the same time, independently, and completely in parallel, how fast can you write in aggregate? P1_w + P2_w.

For the previous question, what if P1_w = P2_w?

Then P1_w + P2_w = P1_w + P1_w = (2)*P1_w.

Now …

Consider a RAID-1 pair comprised of the same P1 and P2. Call it R1.

Writes can be sent (in a good implementation) to both P1 and P2 at the same time.

But, before a write is considered complete, it must be acknowledged by BOTH P1 and P2.

If P1_w > P2_w, what is the best performance of R1? P2_w. P2 is slower, so we’ll always be waiting on it (assuming performance is consistent), so the best we can do is P2_w.

Same logic if P1_w < P2_w.

What if P1_w = P2_w? What is the best performance of R1? Same logic … but since they are the same speed, it is simply P1_w.

So …

In the non-RAID-1 case, our performance (assuming P1_w = P2_w) was 2 * P1_w.

In the RAID-1 case, our performance (assuming P1_w = P2_w) is P1_w.

50% reduction.

RAID-1 only achieves ½ of what the physical pair could.
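The same reasoning can be written as a few lines of code. This is only a sketch of the idealized model described above (consistent drive performance, perfectly parallel writes); the function names are made up for illustration.

```python
def independent_write_iops(p1_w: float, p2_w: float) -> float:
    """Two drives written independently and in parallel: rates add up."""
    return p1_w + p2_w


def raid1_write_iops(p1_w: float, p2_w: float) -> float:
    """RAID-1 pair: every write must be acknowledged by BOTH drives,
    so the pair is limited by the slower drive."""
    return min(p1_w, p2_w)


if __name__ == "__main__":
    p1_w = p2_w = 200.0  # hypothetical drives of equal speed, in IOPS
    raw = independent_write_iops(p1_w, p2_w)
    mirrored = raid1_write_iops(p1_w, p2_w)
    print(f"independent: {raw} IOPS, RAID-1: {mirrored} IOPS, "
          f"ratio: {mirrored / raw:.0%}")
```

With unequal drives the ratio drops below 50%, since the faster drive spends time waiting on the slower one.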

Mark: What Ken said.

Q14: Is there any kind of “asynch” RAID1 so that I can keep the performance of the disks but keep the mirroring?

Ken: See the previous answer also.

For reads, certainly. For writes, not that I know of, although you can make it much less visible. For example, if you have a caching RAID controller/system, your writes will go to memory and then go to disk whenever the controller/system decides to flush them. Perhaps the cache is big enough that it turns random I/O into sequential I/O (and you’re on HDDs), and the performance improvement from doing sequential instead of random I/O is enough that you don’t notice the effect of RAID itself.

Mark: I think that in reality, the behavior of a particular implementation is always vendor-dependent. Generally speaking, RAID1 does allow reading from both drives, but budgets or software bugs or just plain ignorance could result in an implementation where that is not true. Consult vendor documentation to know for sure.

Q15: Why do you need to read old parity to recalculate and write a new one? Isn’t the parity only calculated based on the data being written?

Ken: See answer to question #14.

Mark: It is a math trick… reading the old parity saves reading the rest of the blocks in the full stripe. With 3 drives the savings are non-obvious, but with 5 or 14 they are significant.
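The “math trick” is that XOR is its own inverse. A minimal sketch, assuming simple single-parity (RAID-5 style) XOR parity and equally sized blocks; the helper names are illustrative and real implementations deal with many more edge cases:

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def full_stripe_parity(data_blocks: list) -> bytes:
    """Parity computed from scratch: needs every data block in the stripe."""
    return xor_blocks(*data_blocks)


def updated_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """Parity after overwriting ONE block: XOR the old data out and the new
    data in. Needs only two reads (old data, old parity), whatever the width."""
    return xor_blocks(old_parity, old_data, new_data)


if __name__ == "__main__":
    stripe = [bytes([i] * 4) for i in (1, 2, 3, 4)]  # 4 data blocks
    parity = full_stripe_parity(stripe)

    new_block = bytes([9] * 4)
    shortcut = updated_parity(parity, stripe[2], new_block)

    stripe[2] = new_block
    print(shortcut == full_stripe_parity(stripe))
```

Recomputing from scratch means reading every other data block in the stripe, a number that grows with the drive count, while the shortcut always needs two reads.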

Q16: This calculation is correct for 3 disks, right? If there are more disks and partial write is for stripe on single drive then you need to read more to calculate parity

Ken: No. There are some great write-ups about how RAID-5 works. Instead of pasting those here, I strongly encourage you to visit http://rickardnobel.se/how-raid5-works/ AND http://rickardnobel.se/raid-5-write-penalty/ and then tweet Mark (@markrogov) or Ken (@kencantrelljr) with questions/follow-up.

(I have no connection to Rickard … I just think he’s done a great job in his write-up.)

Mark: Yes, Rickard’s write up is spot on. Our goal is to introduce a fairly complex subject in a deceptively simple manner. There are many edge cases that we don’t address: partial write to a sector, partial write to a block, partial write to a stripe… all those have their own consequences, and storage vendors deal with those differently.

Q17: I am also interested in Data Recovery on NAND technology

Ken: Me too. It isn’t a topic we’re planning to cover though.

Q18: Does caching write data help when one uses SSD?

Ken: It can. Memory is still faster than flash. It depends entirely on how the memory is used. For example, with writes, if memory were used as a write-through cache (look it up if you need), it wouldn’t make things faster. If it were used as a write-back cache, it would. If it is used as a read cache, it will almost certainly make reads of data faster. But even there, life is never simple. Why? Because if you’re using memory to cache data, you’re not using it for something else … and it is possible that the memory could be better used for caching metadata, for example.
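To make the write-through versus write-back distinction concrete, here is a toy model. The latency constants are invented for illustration (they are not measurements of any device), and the class is a sketch, not how any real cache is implemented.

```python
MEMORY_WRITE_US = 1.0    # assumed cost of a write into memory
FLASH_WRITE_US = 100.0   # assumed cost of a write to the SSD


class CachedDevice:
    """Toy model of a memory cache in front of a flash device."""

    def __init__(self, policy: str):
        if policy not in ("write-through", "write-back"):
            raise ValueError(f"unknown policy: {policy}")
        self.policy = policy
        self.cache = {}     # lba -> data held in memory
        self.backing = {}   # lba -> data safely on flash
        self.dirty = set()  # lbas in cache but not yet on flash

    def write(self, lba: int, data: bytes) -> float:
        """Store data and return the latency the caller sees, in microseconds."""
        self.cache[lba] = data
        if self.policy == "write-through":
            # Acknowledge only after flash has the data: no speed-up for writes.
            self.backing[lba] = data
            return MEMORY_WRITE_US + FLASH_WRITE_US
        # Write-back: acknowledge from memory, flush to flash later.
        self.dirty.add(lba)
        return MEMORY_WRITE_US

    def flush(self) -> int:
        """Push dirty data to flash; returns the number of blocks flushed."""
        for lba in sorted(self.dirty):
            self.backing[lba] = self.cache[lba]
        flushed = len(self.dirty)
        self.dirty.clear()
        return flushed


if __name__ == "__main__":
    for policy in ("write-through", "write-back"):
        dev = CachedDevice(policy)
        print(policy, dev.write(0, b"data"), "us, on flash:", 0 in dev.backing)
```

The write-back case is faster from the caller’s point of view, but until `flush()` runs the data exists only in memory.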

Mark: Here, I’d like to recall our good friend, Dr. J Metz, who created an excellent presentation comparing computer caches to pizza delivery in “Life of a Storage Packet (Walk).” In his example, caching will keep the pizza warmer, even if a flash drive is used.

Q19: If the customer is interested in throughput in MB/s then they probably won’t do IOs with 4KB size…

Ken: Agreed. I’m fairly certain that you’re referring to adding MB/s numbers on slide 41. We had a discussion about doing that when putting the slides together. The transition between slide 40 and 42 changed the I/O size from 4KiB to 128KiB, changed from writes to reads, and changed from random I/O to sequential I/O. Adding the MB/s numbers to slide 40/41 was meant to ease the transition between slide 40 and 42. You’re absolutely right though … rarely does anyone want to talk data rates (MB/s) when using small I/O sizes.

Mark: Agreed. Although a true performance guru would recognize that these are the two sides of the same coin.
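The “two sides of the same coin” relationship is simply data rate = IOPS × I/O size. A quick sketch, using decimal megabytes for MB/s and hypothetical IOPS figures chosen only to show the conversion:

```python
KIB = 1024
MB = 1_000_000  # decimal megabyte, as typically used for MB/s


def throughput_mb_s(iops: float, io_size_bytes: int) -> float:
    """Data rate implied by an I/O rate at a given I/O size."""
    return iops * io_size_bytes / MB


def iops_for_throughput(mb_s: float, io_size_bytes: int) -> float:
    """I/O rate needed to reach a data rate at a given I/O size."""
    return mb_s * MB / io_size_bytes


if __name__ == "__main__":
    # Hypothetical example: the same 10,000 IOPS at two different I/O sizes.
    for size in (4 * KIB, 128 * KIB):
        print(f"{size // KIB:4d} KiB: {throughput_mb_s(10_000, size):9.2f} MB/s")
```

At small I/O sizes the MB/s figure looks unimpressive even when the IOPS number is high, which is why the two metrics are usually quoted for different workloads.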

Update: This webcast is part of a series on storage performance benchmarking. Check out the others:

Storage Performance Benchmarking Q&A – Take 2

December 16, 2015 / July 20, 2017 J Metz

Our recent Ethernet Storage Forum Webcast, “Storage Performance Benchmarking: Part 2,” has already been viewed by more than 500 people. If you haven’t seen it yet, it’s now available on-demand. Our expert presenters, Ken Cantrell and Mark Rogov, did a great job fielding questions during the live event, but of course there wasn’t time to get to them all. So, as promised, here are their answers to all of them. If you have additional questions or thoughts, please comment on this blog and we’ll get back to you as soon as we can.

Q: “As an example, am I right to presume workloads are generated by VMs”

A: Ken: It is probably a good idea at this point to define a workload, since we continue to use the term. At a very high level, think of a workload as the mix of operations issued by an application related to the accessing of data. In our case, data stored or made available by a storage solution. With that in mind, absolutely, workloads can be generated by VMs. But they don’t have to be. In other words, “it depends.”

For example, consider these 3 cases:

1) If your SUT (solution under test) was just a simple laptop with no hypervisor and a traditional OS, then there would be no VM in the mix. Your workload would be generated by the application you were measuring (whether that was a simple file copy or something complex like a local database installation).

2) Your SUT is composed of a physical client (like the laptop above) attached to a machine with a hypervisor installed on it and a local guest OS installation that is capable of exporting NFS or SMB shares. The laptop sends I/O via Ethernet to the guest OS. In this example, there is a VM, but it is acting as the storage system, not the workload generator.

3) Now reverse the I/O of example 2. Have the laptop export an SMB share and have the guest OS issue I/Os to that share. Now you finally have VMs generating workloads.

A: Mark: If one examines the solution under test (SUT), and considers the general data flow, then the workload is generated by the clients/hosts layer. Yes, we indicated that the clients/hosts can be VMs, but they also could be physical systems, and, in the case of a SUT consisting of just one laptop, the workload is generated by the application.

Q: Are you going to get around to file performance benchmarking? This infrastructure stuff is not new to me. I have done block all my life, I am interested in stuff about file.

A: Ken: That’s the plan. We are still working out the exact sequence, timing and content for future presentations, but had a dedicated section on both block and file on the original roadmap. If you have specific topics within “file” that you’d like covered, respond in the comments. No promises to cover them, but knowing the desires of the audience is always a good thing.

Keep in mind the intention of the webcast series – lay a strong, but simple foundation, for storage performance fundamentals and then build on that foundation.

A: Mark: The main intent of the series is to lay down basic performance principles first, then build on them to go to more complex topics. Both Ken and I refer to ourselves as “File Heads” and we can’t wait to concentrate on just file, but it would only make sense given that the infrastructure foundations are firm and understood by our audience.

Q: Why doesn’t SPEC SFS have performance testing for such failure models?

A: Ken: Brighttalk provided this comment in isolation, so it isn’t entirely clear which failure models you’re asking for. I’m assuming you mean a failure like a drive or controller failure. With that assumption, the SFS subcommittee welcomes publications that illustrate a failure condition. SPEC SFS 2014 provides an excellent opportunity for someone to publish once in a non-failure scenario and then again in a failure condition of some sort – as long as that failure condition doesn’t violate any of the run rules regarding stable storage and the failure condition doesn’t generate user visible errors.

Note that SPEC SFS 2014 doesn’t mandate any demonstration of failure scenarios. We’ve discussed this in the past, but it has never been a priority for those that participate in SPEC (which is open to all – see http://www.spec.org for instructions on how to join the SPEC Open Systems Group).

Q: Why is write cache turned off for enterprise drives?

A: Ken: I knew this question was coming. 🙂 This is related to “stable storage” – the guarantee from your storage provider that data they say is safely stored on disk is actually stored on disk. I should clarify that the comment is a little dated and refers to caches designed around volatile memory; this wouldn’t apply to a hybrid SSD/spinning media drive that used the SSD to cache/stage data, since SSDs are non-volatile.

Consider the failure scenario where the enterprise drive has write caching enabled and then experiences a power failure. In most every system sold, the storage controller treats drives pretty much as black boxes – they tell the drive to read or write data at a certain location and expect the drive to do as told. So, when the drive says “yup, I got that data, you’re good!” the storage solution trusts the drive and, when it doesn’t need it in memory any longer, throws it away (that data is safely stored on disk, so this is ok). If the drive chose to cache that information in volatile memory, and loses power, the information is gone.

Midrange and enterprise storage vendors often (I think I can say generally) provide some sort of battery backup in case of power failures. These battery units keep power to at least some of the drives – but remember that drives (especially spinning ones) suck down a lot of power, and often the implementation chooses to keep power only to certain drives that the storage controller uses to flush its own volatile memory structures to.

A quick Internet search shows some specific comments on this topic:

From Seagate (http://knowledge.seagate.com/articles/en_US/FAQ/187751en):

Windows 2000 Professional / Server, Windows XP Home / Professional, Windows Vista and Windows 7 have a nifty little feature called write caching buried within the depths of property tabs. Normally, this type of feature is used with SCSI drives in server applications to provide greater data integrity.

When drives employ write-back cache, any interruption of power to the drive or system may cause lost or corrupted data because the drive does not have time to write the cached data to the disk before the power is lost. However, when write cache is turned off, drive performance slows down.

From Microsoft (https://support.microsoft.com/en-us/kb/259716):

…In addition, enabling disk write caching may increase operating system performance. This article describes how to enable or disable disk write caching…

NOTE: Enabling write caching generates the following warning. This is normal:

By enabling write caching, file system corruption and/or data loss could occur if the machine experiences a power, device or system failure and cannot be shutdown properly.

Q: What’s the difference between CPU and ASIC? When to use which word?

A: Ken: Unfortunately, the SNIA dictionary doesn’t define either term. At the easiest level, both are acronyms. CPU = central processing unit, and ASIC = application specific integrated circuit. At the next level, think of a CPU as a general purpose processing element and an ASIC as a custom microchip designed for a special application or purpose. Once created, ASICs are non-programmable – they do something very specific (and hopefully very well and very quickly/efficiently). A CPU can run your bitcoin mining program overnight, wake you to Spotify in the morning, let you use your favorite word processor in between games of Plants vs. Zombies, and still let you watch Hulu before you head off to bed.

An ASIP (Application Specific Instruction-Set Processor) bridges the gap between a general purpose processor (CPU) and the highly specific, targeted design of an ASIC. An ASIP will have a much reduced instruction set and a more targeted design towards a specific application (say, digital signal processing), but still allow the execution of a specific instruction set given to it.

Q: Can you mention tools to identify the bottlenecks?

A: Ken: We are trying very hard, particularly in the webcasts themselves, to stay vendor neutral. I don’t mind violating that though here in the Q&A a little bit.

From an open source standpoint, there are a number of tools. One of the more popular approaches now is to use something like Grafana as a front-end to Graphite, and use that to monitor a set of open source (or privately designed) sensors, including sensors from OPM below, that you place throughout your environment.

Here are a few other open-source benchmarking and performance tools, and what aspect of performance to which they apply. Please note that this is not a comprehensive list, nor is this a recommendation for their use. We are providing the link as a convenience, not an endorsement. [http://www.opensourcetesting.org/performance.php]

Still free, but NetApp-centric, is OnCommand Performance Manager (OPM), specifically OPM v2.0 and later. In addition to providing performance metrics for your NetApp storage array, OPM offers up the concept of a “bully” and “victim” scenario – it specifically watches for components that are performing poorly (the “victims”) and helps identify which other components are causing that poor behavior (the “bullies”). My team helps develop OPM.

Not free, and not NetApp centric, but a NetApp product, is OnCommand Insight (OCI). This is a premier product for looking at the performance of the components across your datacenter.

A: Mark: I didn’t want to break the vow of neutrality. 🙂 EMC has a number of tools as well: vRealize suite, ViPR SRM, Unisphere, plus platform specific tools… However, it has been my experience that the most important tool a performance expert has is still the critical mind. One observes the problem, and then walks the entire set of the SUT layers looking for incongruences. Too often, the perceived bottleneck is not the problem, but a manifestation of a problem somewhere else. For example, as we pointed out in the “MiB/s section” of this webcast, the network layer was a bottleneck due to badly configured OS multipath drivers. Telling cause from symptom requires several things: a good understanding of the SUT and its layers; a critical mind to analyze problem conditions; and a large dose of curiosity. The last is a personal trait that drives us to ask “what if I change this to that?” Asking questions while troubleshooting is, IMHO, a cornerstone requirement and, inherently, a very human trait. My personal view is that tools are just tools, and they require a human hand to operate and a human mind to analyze the results.

Q: Did All flash arrays almost eliminate bottleneck..,at least the Storage controller bottleneck can be eliminated if enterprise can afford all flash arrays?

A: Ken: Actually, almost exactly the opposite. Spinning drives are (now, at least) relatively slow. Over the past 10 years the drives have gotten much bigger, although HDD speeds haven’t really changed all that much. Because of this, what I’ve observed is that the IOPS/GB ratio for HDD has, if anything, been getting worse* and the most common bottleneck for an HDD-based customer turns out to be the speed of their drives.

Now consider what happens when a customer moves to SSDs. The SSDs that are sold (and folks can afford) are generally much smaller than the HDDs they are used to, so customers buy as many of them as they can in order to meet their capacity requirements. And the SSDs are, one-for-one, much faster. So what happens? High drive counts + really fast drives = the drives aren’t your bottleneck anymore. Instead, the bottleneck shifts upstream … in a well architected solution, generally to the storage controller or clients.

*For those that know the terms, we could have a long discussion about working set sizes over the years, how fast data ages, tiered storage and such, and the effect that these have on observed iops/GB … but I think we could agree that since HDD speeds aren’t increasing, the iops/GB ratio isn’t generally getting better.
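The IOPS/GB point is easy to see with a little arithmetic. The numbers below are purely hypothetical (a fixed per-drive IOPS figure and three made-up capacities), chosen only to show the direction of the trend when capacity grows and drive speed does not.

```python
ASSUMED_HDD_IOPS = 150.0  # hypothetical per-drive figure, held constant


def iops_per_gb(drive_iops: float, capacity_gb: float) -> float:
    """Performance density of a single drive."""
    return drive_iops / capacity_gb


if __name__ == "__main__":
    for capacity_gb in (500, 2000, 8000):  # hypothetical drive generations
        density = iops_per_gb(ASSUMED_HDD_IOPS, capacity_gb)
        print(f"{capacity_gb:5d} GB drive: {density:.4f} IOPS/GB")
```

For a fixed capacity requirement, bigger drives mean fewer spindles, so the IOPS available per stored gigabyte shrinks.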

Q: Can I download this slides?

A: Ken: Absolutely. Here are the links to Part 1 of our series:

PPT and PDF: http://www.snia.org/forums/esf/knowledge/webcasts (look for “Storage Performance Benchmarking: Introduction and Fundamentals (July 2015)”)

Presentation Recording: https://www.brighttalk.com/webcast/663/164323

Q&A Blog: https://sniansfblog.org/?p=447

Here are the links to Part 2:

PPT and PDF: http://www.snia.org/forums/esf/knowledge/webcasts (look for “Storage Performance Benchmarking: Part 2 (October 2015)”)

Presentation Recording: https://www.brighttalk.com/webcast/663/164335

Q&A Blog: That’s what you’re reading now. 🙂

Q: Storage controller, is a compute node, right? And for hyper converged systems, storage controller and compute nodes are the same, right?

A: Mark: Most certainly, a Storage Controller can be a compute node, but in our webcast it is not. The term “compute node” is typically interpreted to be a part of the client/hosts layer. A compute node computes for the application, and such application is generating the workload (please see the question above about where workload is initiated).

A good example of a compute node would be a system that renders cartoons, or geodesic fields. Such a compute node computes something (the application does the work), and stores the results onto the storage controller.

However, in the case of hyper-converged infrastructure, the storage controller is often virtualized among the client hosts, making every compute node a part of a larger storage controller.

Q: Are the performance numbers that vendors publish typically front-end?

A: Mark: I don’t want to generalize published numbers as being one way or another. I recommend reading every publication for specific details. Vendors publish numbers to cover use cases, and each use case may come with its own set of expected measurement points and metrics. Ken and I talked about how metrics matter in the first Storage Performance Benchmarking webcast. 🙂

Q: “We did an R&D PoC using 32 flash 400GB elements attached on DIMM slots (not through SAS controller, not a direct PCIe attach) and seven 40Gbps cards. We were able to pump 5.5M 4KB-IOPS resulting in 30GB/s (240Gbps) of traffic on the front-end connect. When do you expect the front-end connect be the bottleneck for more standard environment?”

A: Ken: Woot! That sounds like a lot of fun. If you’re in the Raleigh, NC area and can talk about that not under NDA, we should have lunch. I’d like to hear more.

I have a suspicion that this answer won’t satisfy you, because it isn’t going to be as empirical as your example. The problem in answering with raw numbers is that there isn’t a standard configuration for a SUT. An enterprise class storage array with a mix of 40GbE and 32Gb FC connections (with traffic over both) will look very different than someone using their old Windows XP box with a single 100Mbit link to share out an SMB share, and both will look different than someone accessing their photos on their favorite cloud provider. So, I’ll answer the question by saying that I expect the front-end connect to be the bottleneck anytime the rest of the components in the SUT are capable of hitting your performance metrics (whether that be in terms of response time, IOPS, or data rates), and the front-end connect isn’t.

THAT said, you’d be astounded how often, even today, MTU mismatches result in terrible front-end performance (and functionality).

Q: “Example of cache in front end connection?”

A: Ken: I’ll cheat and note that SUTs can be a lot more complicated than we showed. For example, our picture looked like this:

Consider a SUT then where you have:

At one level, “the internet” is just a big black box acting as our front-end connect. But if we zoom in on it, perhaps we find that somewhere along the line there’s a caching server. Then we have an easy answer to where you find cache in the front-end connect.

In the much simpler model that you’ll find in many enterprise data environments though, you’ll find that the front-end connect consists of some relatively short length cables and a set of switches – either SAN or NAS switches. And in those environments, you won’t find a lot of cache. You will find memory, but you’ll find it used for buffering more than for caching.

We tried to minimize this in the presentation since there’s not a universally agreed upon distinction between these two terms. I think of a buffer (in this context) primarily as memory set aside to hold data very briefly, after which it is consumed and removed from the buffer. I think of cache, on the other hand, as a storage medium that holds data specifically to speed up data access (storage or retrieval). Data held in a cache can be held for a very long time, and not all data in a cache may ever be consumed/used.

Q: Even SSDs suck at random write but they are good for random read, is there that much difference?

A: Ken: Yes. The data we pulled for drive speeds was real data. And keep in mind that “sucks” is pretty relative here. Enterprise SSDs still tend to be at least 4x faster than spinning media. And, very importantly, their performance is much more consistent and deterministic since seek time is irrelevant with an SSD. New NVRAM technologies, like 3D XPoint, promise to dramatically improve write performance.

Q: DRAM is volatile though so replacing HDDs with that wouldn’t really work, right? But if capacity requirements are high, we cannot replace disks with the cache, right?

A: Mark: Cache should never replace capacity. Cache is temporary storage, and by design must eventually move its data to a permanent storage location. The size of the cache should be matched to the size of the data that the application uses, the so-called “working area”. For example, if an application writes to a 4GB file (think VMware vmdk), then for best performance the entire 4GB should fit into cache. However, capacity requirements for a VMware datastore can be as high as several TB. If the application (ESX server) is running many VMs, perhaps only the performance-critical few need to fit into cache, while all other VMs would use cache for sub-portions of their vmdks.

A: Ken: You’re right, not permanently. As Mark points out, it isn’t to permanently replace the slower storage with cache necessarily, just supplement it enough that the working set fits in.

Q: How can we make client do less IO? Will it make sense?

A: Mark: A client does less IO by using a larger IO size, for example. A classic use case is the read and write sizes within the NFS protocol, which can be increased from the NFS client side via mount options. By default, some Linux environments use 32KB for reads and writes, so reading a 1GB file takes 32768 32KB IOs. If the read size is increased to 1MB, then it takes only 1024 IOs – a 32x reduction!
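The arithmetic behind that reduction is just file size divided by I/O size, rounded up. A short sketch (binary units assumed; on Linux NFS clients the relevant mount options are typically `rsize` and `wsize`, and the server may cap the value actually negotiated):

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def io_count(file_bytes: int, io_bytes: int) -> int:
    """Number of I/Os needed to transfer a file at a given I/O size (rounded up)."""
    return -(-file_bytes // io_bytes)  # integer ceiling division


if __name__ == "__main__":
    for io_size in (32 * KIB, 1 * MIB):
        print(f"{io_size // KIB:5d} KiB I/Os to read 1 GiB: {io_count(GIB, io_size)}")
```

A 1 GiB file needs 32768 I/Os at 32 KiB and 1024 I/Os at 1 MiB, which is where the 32x figure comes from.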

A: Ken: Other options involve app re-writes (yes, sometimes these ARE possible) and OS upgrades. Perhaps in the “app-rewrite” category, or maybe a new category, I’ve also worked with developers to rewrite their DB queries to be much less disk intensive, for example.

Q: Can you elaborate on some of the client level cache types other than file system or OS?

A: Mark: Other than file system and OS? Hmm… Let’s see: PCI-based cards that cache at the block-device level, e.g. EMC VFCache, SanDisk Fusion-IO, NetApp Flash Cache. Native network protocol (CIFS and NFS) caches. Local database caching, e.g., SafePeak, TimesTen, Windows Azure Caching.

Q: Please add more sessions, which goes into every detail

A: Mark: Absolutely! We will! Promise! We’re shooting for Part 3 in Q1 2016.

Update: This webcast is part of a series on storage performance benchmarking. Check out the others:

Block Storage in OpenStack Q&A

June 10, 2015 Walter Boring

The team at SNIA-ESF and I were very pleased with how many people attended our live “Block Storage in the Open Source Cloud called OpenStack.” If you missed it, please check it out on demand. We had several great questions during the live event. As promised here are answers to all of them. If you have additional questions, please feel free to comment on this blog.

Q. How is the support for OpenStack, if we hit a roadblock or need some features?

A. The OpenStack community has many avenues for contacting developers for support. The official place to report issues, file bugs or ask for new features is Launchpad: https://launchpad.net/openstack. It is the central place for all of the many OpenStack projects to file bugs or feature requests. This is also the location where every OpenStack project tracks its current release cycle and all of the features, called blueprints. Another good source of information is the public mailing lists. A good place to start for the mailing lists is here: https://wiki.openstack.org/wiki/Mailing_Lists. Finally, developers are also on the public Internet Relay Chat channels associated with their projects. The developers are live and interactive on each of the channels. You can find the information about the IRC system that OpenStack developers use here: https://wiki.openstack.org/wiki/IRC.

Q. Why was Python chosen as the programming language? Which version of Python is used as there are incompatibilities between versions?

A. The short answer here is that Python is a great language for rapid development and deployment that is mature and has a wide variety of publicly available libraries for doing work. The current released version of OpenStack uses Python 2.7. The OpenStack community is making efforts to ensure that we can eventually migrate to Python 3.x. New libraries that are being developed have to be Python 3.x compatible.

Q. Is it possible to replicate the backed up volumes at the OpenStack layer or do you defer to the back end array for data replication?

A. Currently, there is no built-in support for volume replication in Cinder. The Cinder community is actively working on how to implement volume replication in the next release, Liberty, which will ship in the fall of 2015. As with any major new feature in Cinder, the community has to design the core of the new feature such that it works consistently with the 40+ vendor arrays. As the array support grows, the amount of up front design becomes more and more important and difficult at the same time. We have a specification that we are currently working on that will get us closer to implementing replication.

Q. Who, or what, creates the FC zones?

A. In Cinder, the block storage project, the component that creates and manages Fibre Channel zones is called the Fibre Channel Zone manager. A good document to read up on the zone manager is here: http://www.brocade.com/downloads/documents/at_a_glance/fc-zone-manager-ag.pdf. The official OpenStack documentation on the zone manager is here: http://docs.openstack.org/kilo/config-reference/content/section_fc-zoning.html. The zone manager is automatically called after Cinder Fibre Channel volume driver exports its volume from the array. The zone manager then adds the zones as requested by the driver to make the volume available to the virtual machine.

Q. Does the Cinder and Nova attachment process work over VLANs?

A. Yes. It’s entirely dependent on how the OpenStack admin deploys the Nova and Cinder services. As long as the Nova hosts can see the Cinder services and arrays behind the Cinder volume drivers, then it should just work.

Q. Is the FCZM a native component of the Cinder project? Or is it an add-on?

A. As I mentioned earlier, the Fibre Channel zone manager is part of the Cinder project. There have been some discussions, as part of the Cinder community, to possibly break out the zone manager into its own Python library, in which case it would be available to any Python project. Currently, it’s built into Cinder itself.

Q. Does Cinder involve itself in the I/O path as well or is it only the control path responsible for allocating storage?

A. Cinder is almost entirely a control plane provisioning mechanism. There are a few operations where the Cinder services actually do I/O. When a user wants to create an image from a volume, then Cinder attaches the volume to itself, and then copies the bytes from the volume into an image. Cinder also has a backup service that allows a user to back up a volume to an external service. In that case, the Cinder backup service directs copying the bytes into the configured backup storage. When Cinder attaches a volume to a Nova VM, or a bare metal node, Cinder is not involved in any I/O. Cinder’s job is to simply ensure that the volume is exported from the back-end array and make it available for Nova to see. After that, it’s entirely up to the transport protocol, iSCSI, FC, NFS, etc. to do the I/O for the volume.

Q. Is Nova aware of the LUN usage %?

A. Nova doesn’t track statistics against the volumes that it has attached to its virtual machines.

Q. Where do the vendor specific parts of Cinder fit in? Are there vendor specific “volume managers”?

A. The vendor-specific components of Cinder exist in what are called Cinder volume drivers. Each driver is really nothing more than a Python module that conforms to a volume driver API defined by the Cinder volume manager. You can get an idea of which features the drivers support from the Cinder Support Matrix here:

https://wiki.openstack.org/wiki/CinderSupportMatrix
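
As a rough illustration of "a Python module that conforms to a volume driver API," the sketch below defines a small abstract interface and a toy back-end that implements it. This is not Cinder's actual base class: `VolumeDriverSketch`, `InMemoryDriver`, the simplified method signatures, and the returned dictionary layout are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class VolumeDriverSketch(ABC):
    """Hypothetical stand-in for a volume driver API (not Cinder's real class)."""

    @abstractmethod
    def create_volume(self, name, size_gb):
        """Create a volume on the back-end."""

    @abstractmethod
    def initialize_connection(self, name, host):
        """Export the volume to a host and return connection details."""

    @abstractmethod
    def terminate_connection(self, name, host):
        """Remove the export of the volume to a host."""


class InMemoryDriver(VolumeDriverSketch):
    """Toy back-end that records state in memory instead of calling an array."""

    def __init__(self):
        self.volumes = {}     # volume name -> size in GB
        self.exports = set()  # (volume name, host) pairs currently exported

    def create_volume(self, name, size_gb):
        self.volumes[name] = size_gb

    def initialize_connection(self, name, host):
        if name not in self.volumes:
            raise KeyError(name)
        self.exports.add((name, host))
        # Assumed return shape, for illustration only.
        return {"protocol": "iscsi", "data": {"volume": name, "host": host}}

    def terminate_connection(self, name, host):
        self.exports.discard((name, host))


driver = InMemoryDriver()
driver.create_volume("vol1", 10)
print(driver.initialize_connection("vol1", "compute1"))
```

A real vendor driver would replace the in-memory bookkeeping with calls to its own array, over whatever interface the vendor prefers.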

Q. If Cinder is only for control plane, which project in OpenStack is for data path?

A. There isn’t a project in OpenStack that manages the data path for volumes.

Q. Is there a volume detachment process as well and when does that come into play?

A. My presentation primarily focused on one aspect of the interaction between Nova and Cinder, which was volume attachment. I briefly discussed the volume detachment process, which is conducted in basically the same way. An end user asks Nova to detach the volume. Nova then removes the volume from the VM, removes the SCSI device from the compute host itself, and then tells Cinder to terminate the connection from the array to the compute host.
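
The ordering of the detach steps above can be sketched in a few lines of Python. The dictionaries standing in for the VM, the compute host, and the back-end are hypothetical, as is the function name `detach_volume`; the real work is done by Nova and Cinder services, not by a single function.

```python
def detach_volume(vm, host, backend, volume_id):
    """Hypothetical detach flow following the order described above."""
    steps = []

    # 1. Nova removes the volume from the VM.
    vm["volumes"].remove(volume_id)
    steps.append("nova: removed volume from VM")

    # 2. Nova removes the SCSI device from the compute host.
    host["scsi_devices"].remove(volume_id)
    steps.append("nova: removed SCSI device from compute host")

    # 3. Nova asks Cinder to terminate the array-to-host connection.
    backend["connections"].discard((volume_id, host["name"]))
    steps.append("cinder: terminated connection from array to host")

    return steps


vm = {"volumes": ["vol1"]}
host = {"name": "compute1", "scsi_devices": ["vol1"]}
backend = {"connections": {("vol1", "compute1")}}

for step in detach_volume(vm, host, backend, "vol1"):
    print(step)
```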

Q. If a virtual machine is moved to a different physical machine, how’s that handled in Cinder?

A. This process in OpenStack is called live migration. Nova does all of the work of moving the VM’s data from one host to another. One facet of that is migrating any Cinder volumes that may be attached to the VM. Nova understands which volumes are attached to the VM and knows which of those are Cinder volumes. When the VM is migrated, Nova coordinates with Cinder to ensure that all volumes are attached to the destination host and VM, and that the volumes are detached from the originating compute host.
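
A minimal sketch of the volume side of live migration follows. It assumes, for illustration, that connections are tracked as a set of (volume, host) pairs and that the destination is connected before the source is disconnected; the function name and data layout are hypothetical and are not Nova or Cinder APIs.

```python
def migrate_volume_attachments(connections, volume_ids, source_host, dest_host):
    """Move a VM's volume connections from source_host to dest_host.

    connections is a set of (volume_id, host) pairs; it is modified in place
    and also returned. The attach-then-detach order is an assumption.
    """
    # Make every volume reachable from the destination host first.
    for volume_id in volume_ids:
        connections.add((volume_id, dest_host))

    # Then drop the connections to the originating host.
    for volume_id in volume_ids:
        connections.discard((volume_id, source_host))

    return connections


connections = {("vol1", "compute1"), ("vol2", "compute1")}
migrate_volume_attachments(connections, ["vol1", "vol2"], "compute1", "compute2")
print(sorted(connections))
```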

Q. Why doesn’t Cinder use the SNIA SMI-S API to manage/consume SAN, NAS or switch fabric instead of each storage vendor building Cinder drivers? SMI already covers all of the Cinder scenarios for FC, iSCSI, SAS, etc.

A. Cinder itself doesn’t really manage the storage array communication. It’s entirely up to each vendor driver to decide how best to communicate with its storage array. The HP 3PAR volume driver uses REST to communicate with the array, as do several other vendor drivers in Cinder. Other drivers use ssh. There are no strict rules on how a Cinder volume driver communicates with its back-end. This allows vendors to make the best use of their array interfaces as they see fit.

Q. Are there Horizon extensions or extension points for showing what physical resources your storage is coming from? Or is that something a storage vendor would need to implement?

A. Horizon doesn’t really know much about where storage is coming from other than it’s a Cinder volume. Horizon uses the available Cinder APIs to talk to Cinder to do work and fetch information about Cinder’s resources. I know of a few vendors that are writing Horizon plugins that add extra capabilities to view more detailed information about their specific array. As of today though, there is no API in Cinder to describe the internals of a volume on the vendor’s array.