Object Storage: Got Questions? - SNIA on Data, Networking & Storage

Over 900 people (and counting) have watched our SNIA Networking Storage Forum (NSF) webcast, “Object Storage: Trends, Use Cases,” where our expert panelists had a lively discussion on object storage characteristics, use cases and performance acceleration. If you have not seen this session yet, we encourage you to check it out on-demand. The conversation included several interesting questions related to object storage. As promised, here are answers to them:

Q: Today object storage allows many new capabilities but also new challenges, such as the need for geographic and local load balancers in a distributed scale out infrastructure that at the same time do not become the bottleneck of the object services at an unsustainable cost. Are there any solutions available today that have these features built in?

A: Some object storage solutions have features such as load balancing and geographic distribution built into the software, though often the storage administrator must manually configure parts of these features at the network and/or server level. Most cloud object storage (StaaS) offerings include a distributed, scale-out infrastructure, including load balancing.

Q: What’s the approximate current market share of block vs. file vs. object storage deployed today? Where do you see this going in the next 5 years?

A: The answer depends on whether you measure spending or capacity, since object storage typically costs less per terabyte than block or file storage. Including all private and public cloud storage worldwide, object storage probably makes up 20-30% of spending and 40-60% of all storage capacity. If we look only at enterprise (not cloud) storage, then object storage probably constitutes 10-15% of spending and 20-30% of capacity.
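To see what these shares imply about cost per terabyte, here is a rough back-of-the-envelope sketch. It assumes the midpoints of the ranges above (an assumption made for illustration, not a figure from the webcast), and the function name `relative_cost_per_tb` is hypothetical.

```python
def relative_cost_per_tb(spend_share: float, capacity_share: float) -> float:
    """Object storage cost per TB divided by block/file cost per TB.

    Both arguments are object storage's fraction of the total (0 to 1).
    """
    object_cost = spend_share / capacity_share
    other_cost = (1 - spend_share) / (1 - capacity_share)
    return object_cost / other_cost


# Assumed midpoints of the ranges quoted above (illustrative only).
print(round(relative_cost_per_tb(0.25, 0.50), 2))    # all storage, cloud included
print(round(relative_cost_per_tb(0.125, 0.25), 2))   # enterprise storage only
```

With these assumed midpoints, object storage works out to well under half the per-terabyte cost of block and file storage in both cases. Different points within the quoted ranges would shift the ratio.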

Q: There was a comment at the start of the discussion that object storage being less performant than block/file is a myth. Can you share some performance numbers for a given size of data?

A: On average, existing object storage is less performant than existing block/file storage because it is usually deployed on top of slower storage media, slower servers, and slower networks. But there is no reason object storage needs to be any slower than block/file storage for throughput and large I/O sizes. If deployed using fast infrastructure, the fastest object storage solutions run just as fast—in throughput terms—as the fastest block or file storage. However, in many cases, object storage may not be appropriate for highly-transactional small I/O workloads, which typically run on top of block or file storage.

Q: Do I need to transform to key value or can I just query S3?

A: You don’t query S3. Retrieving an object via S3 is simply an HTTP GET request, which can be done from a browser. Many types of object storage support the S3 API, either natively or through translation, but some types may require changing your applications to support a different key value storage API.
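As a minimal sketch of what “simply an HTTP GET request” looks like, the snippet below builds (but does not send) a GET for one object using the virtual-hosted-style S3 URL form. The bucket, key, and helper names are hypothetical, and the sketch assumes a publicly readable object; private objects additionally need a signed request.

```python
from urllib.parse import quote
from urllib.request import Request


def s3_object_url(bucket: str, key: str, region: str) -> str:
    """Build a virtual-hosted-style URL for one object (hypothetical helper)."""
    # Keep "/" unescaped so key prefixes stay readable in the path.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"


def build_get(bucket: str, key: str, region: str) -> Request:
    """Prepare, but do not send, the HTTP GET for that object."""
    return Request(s3_object_url(bucket, key, region), method="GET")


# Hypothetical bucket and key, used only to show the request shape.
req = build_get("example-bucket", "reports/2021 summary.csv", "us-east-1")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the download. Pasting the same URL into a browser issues an equivalent GET.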

Q: Where does NVMe KV Command Set (in NVMe 2.0) sit in the S3 Amazon stack? How does it change the API structure?

A: The NVMe Key Value Command Set does not sit at the same level as the S3 API. The S3 API sits above protocols like the NVMe KV Command Set. The SNIA Key Value Storage API allows a library to be written to the NVMe KV Command Set specification, which is part of NVMe 2.0. Amazon S3 today supports use of key value pairs but does not currently employ the SNIA Key Value Storage API.

Q: Aren’t analytics on Object Storage slow and difficult? Have there been any changes in this area that make analytics faster?

A: This is one of the myths about object storage that we wanted to debunk in this webcast. Analytics on object storage is only slow if the storage itself is slow. It’s difficult only if the analytics tools or query engine cannot access object storage. While it is true that most traditional object storage deployed in the past ran on slower storage media (and connected with slower networks), there are now fast object storage solutions that can perform just as well as block or file storage solutions. In fact, some object storage software/service options include analytics capabilities built into the storage servers, and computational storage can include analytics capabilities within the drives themselves.

Q: For Kubernetes, if the client is the app why is CSI required (COSI)?

A: CSI provides an interface between the containerized app and persistent storage outside of the Kubernetes orchestrator. It allows storage vendors to support containerized applications.

Q: Is the entire KV database from a given S3 bucket being downloaded to the local drive?

A: AWS S3 sync can be used to synchronize an entire bucket to a local directory, but there are multiple ways to move data between AWS S3 and your local directories or other instance types.

Q: Given the volume, sensitivity, and the hybrid nature of data generation, location, and access — does object storage include security/encryption/key management built into the solution deployments?

A: Some object storage products include encryption and key management. Others do encryption while integrating with an external key management solution. At a high level, any object storage solution should include support for encryption and other security features.

Q: Does object storage support compression and dedupe?

A: Most object storage solutions include the ability to support dedupe or single-instance storage (storing only one copy of identical objects if the same object is submitted multiple times). Some object storage solutions include support for compression performed within the storage service, but it’s more common for objects to be compressed by the application or client before being sent to the object storage system.
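Single-instance storage can be illustrated with a toy content-addressed store: each payload is hashed, and identical payloads share one stored copy. This is a sketch under simplifying assumptions (in-memory, whole-object SHA-256 hashing, a hypothetical `SingleInstanceStore` class), not how any particular product implements dedupe.

```python
import hashlib


class SingleInstanceStore:
    """Toy in-memory object store that keeps one copy per unique payload."""

    def __init__(self):
        self._blobs = {}   # content digest -> payload bytes
        self._names = {}   # object name -> content digest

    def put(self, name, data):
        """Store data under name; return True if a new copy was written."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self._blobs
        if is_new:
            self._blobs[digest] = data
        self._names[name] = digest
        return is_new

    def get(self, name):
        """Return the payload stored under name."""
        return self._blobs[self._names[name]]

    def stored_copies(self):
        """Number of distinct payloads actually held."""
        return len(self._blobs)


store = SingleInstanceStore()
store.put("a/report.pdf", b"same bytes")
store.put("b/report-copy.pdf", b"same bytes")   # identical content, new name
print(store.stored_copies())
```

Both names resolve to the same payload, but only one copy is held.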

Q: Amazon’s S3 in-the-cloud storage means saving on data ingress-egress, but losing on the Amazon CPU to perform the analysis in Amazon’s cloud compute platform, doesn’t it? Not understanding how data remains “local.”

A: If you’re comparing AWS S3 to on-premises local storage, whether it is less expensive to run analytics in AWS or on your own on-prem servers depends on the scale, maturity, and efficiency of your in-house analytics. Typically, an IT department building a small or new analytics operation will find it less costly to use AWS cloud storage and cloud analytics, while a large IT organization running a scalable, mature and efficient analytics operation would find it can do so at a lower cost than outsourcing to AWS. Whether on-prem or in the cloud, object storage solutions can typically scale out further in capacity, while supporting a customizable level of processing performance based on the user’s requirements.

Q: Cheap and deep describes Openstack Swift, which claims to be hardware agnostic (deploys on readily available commodity hardware) – then you have to add network bandwidth, CPU, SSD, etc, for what you want to do at speed that makes it cheaper in the long run to go for a purpose-built array and fabric. Why not stay client-server at the outset, with a fast array, fast processing and fast network?

A: For geographic location, data remains local if you store it in your local data center without replicating it to remote locations. For data analytics purposes, data is “local” if it’s stored in the same data center or on the same network segment as the analytics servers. When the data and the analytics servers are in different data centers and not connected by a high-bandwidth, low-latency network, then analytics performance may suffer. This is true for object or any other type of storage solution. If the data is stored in Amazon servers, there may be less control over where data remains.

Q: Does supporting the NVMe KV Command Set in NVMe SSD/HDDs improve the performance or latency when compared to standard NVM Command Set?

A: Using SSDs/HDDs that support the NVMe KV Command Set should improve performance and latency over using the standard NVM Command Set when objects are stored as key value pairs.

Q: Do SSDs need to support both command sets or just one?

A: An SSD can support just the NVMe KV Command Set, just the NVM Command Set, or both. A namespace on an NVMe SSD is formatted for one or the other. To get the benefits of the NVMe KV Command Set, an SSD only needs to implement that command set.

Q: Are there any recent updates on the KV Command Set ecosystem in Linux?

A: The latest drivers for Linux are available on a public GitHub site at: https://github.com/OpenMPDK/KVSSD

Q: Computational storage with S3 SELECT: Usually, an object storage solution doesn’t write objects to a single disk, there is some kind of erasure coding for data protection and probably some file system as an abstraction layer which the disk may not be aware of. Also, the data is usually encrypted. How would S3 SELECT be able to parse the original object data on a single drive?

A: Yes, most object storage solutions use erasure coding or a simple mirroring mechanism to ensure each object is stored in redundant locations, and yes, erasure coding usually splits up each object across multiple drives. A storage-side query such as AWS S3 Select runs on or near the object storage servers and returns a subset of the data to the requestor instead of returning the entire object. In this type of query, the object storage servers can decrypt the data before executing the local query, if the encryption was done on the object server side. (If the encryption was done by the client before the data was sent to the object storage, then the queries would not be able to run on the storage servers.) The storage servers would also be able to reassemble an erasure-coded object locally to run the query, or possibly distribute and run the query on the multiple erasure coding destinations for that object.
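The “reassemble, then query” path can be sketched with a toy model. The example below uses simple XOR parity across two shards as a stand-in for real erasure coding (production systems typically use more general codes such as Reed-Solomon), and a plain CSV filter as a stand-in for a storage-side query. All function names are hypothetical, and nothing here reflects the actual S3 Select implementation.

```python
import csv
import io


def split_with_parity(data: bytes):
    """Split data into two equal-length shards plus one XOR parity shard."""
    half = (len(data) + 1) // 2
    a = data[:half]
    b = data[half:].ljust(half, b"\0")   # pad so both shards match in length
    parity = bytes(x ^ y for x, y in zip(a, b))
    return a, b, parity


def reassemble(a, b, parity, length):
    """Rebuild the object; at most one of a or b may be None (lost)."""
    if a is None:
        a = bytes(x ^ y for x, y in zip(b, parity))
    if b is None:
        b = bytes(x ^ y for x, y in zip(a, parity))
    return (a + b)[:length]              # drop any padding


def select_rows(obj: bytes, column: str, value: str):
    """Storage-side filter: return only the rows where column == value."""
    reader = csv.DictReader(io.StringIO(obj.decode("utf-8")))
    return [row for row in reader if row[column] == value]


original = b"id,region\n1,eu\n2,us\n3,eu\n"
a, b, parity = split_with_parity(original)
rebuilt = reassemble(None, b, parity, len(original))   # shard "a" was lost
matches = select_rows(rebuilt, "region", "eu")
print([row["id"] for row in matches])
```

Only the matching rows leave the storage side, which is the point of a storage-side query. The client never has to download and parse the whole object.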

Interested in more information on object storage? Check out the SNIA Educational Library.
