As more storage traffic traverses the network, congestion that leads to higher-than-expected latencies and lower-than-expected throughput has become increasingly common. That’s why the SNIA Networking Storage Forum (NSF) hosted a live webcast earlier this month, Introduction to Incast, Head of Line Blocking, and Congestion Management. In this webcast (which is now available on-demand), our SNIA experts discussed how Ethernet, Fibre Channel, and InfiniBand each handle increased traffic.
The audience at the live event asked some great questions and, as promised, here are answers to them all.
Q. How many IP switch vendors today support Data Center TCP (DCTCP)?
A. In order to maintain vendor neutrality, we won’t get into the details. But several IP switch vendors do support DCTCP. Note that many Ethernet switches support basic explicit congestion notification (ECN), but DCTCP requires a more detailed version of ECN marking on the switch and also requires that at least some of the endpoints (servers and storage) support DCTCP.
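As an illustration only (not something covered in the webcast), here is a minimal sketch of the endpoint side of this, assuming a Linux host whose kernel includes the dctcp congestion-control module; the switch-side ECN marking thresholds still have to be configured per the vendor’s documentation.

```python
# Illustrative sketch: check for and enable DCTCP on a Linux endpoint.
# Assumes a Linux kernel that ships the 'dctcp' congestion-control module;
# switch-side ECN marking must still be configured separately.
from pathlib import Path

SYSCTL = Path("/proc/sys/net/ipv4")

def available_congestion_controls() -> set[str]:
    """Return the congestion-control algorithms the kernel currently offers."""
    return set((SYSCTL / "tcp_available_congestion_control").read_text().split())

def enable_dctcp() -> None:
    """Make DCTCP the default TCP congestion control and enable ECN negotiation."""
    if "dctcp" not in available_congestion_controls():
        raise RuntimeError("dctcp module not available on this kernel")
    (SYSCTL / "tcp_congestion_control").write_text("dctcp\n")
    (SYSCTL / "tcp_ecn").write_text("1\n")  # negotiate ECN on TCP connections

if __name__ == "__main__":
    enable_dctcp()   # requires root privileges
    print("DCTCP enabled; remember to configure ECN marking on the switches")
```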
Q. One point I missed around ECN/DCTCP was that the configuration for DCTCP on the switches is virtually identical to what you need to set for DCQCN (RoCE) – but you’d still want two separate queues for DCTCP and RoCE since they don’t really play well together.
A. Yes, RoCE congestion control also takes advantage of ECN and has some similarities to DCTCP. If you are using Priority Flow Control (PFC) and RoCE is being kept in a no-drop traffic class, you will want to ensure that RoCE storage traffic and TCP-based storage traffic are in separate priorities. If you are not using lossless transport for RoCE, using different priorities for DCTCP and RoCE traffic is still recommended, but not required.
Q. Is over-subscription a case of the server and/or switch/endpoint being faster than the link?
A. Over-subscription is not usually caused when one server is faster than one link; in that case the fast server’s throughput is simply limited to the link speed (like a 32G FC HBA plugged into a 16G FC switch port). But over-subscription can be caused when multiple nodes write or read more data than one switch port, switch, or switch uplink can handle. For example, if six 8G Fibre Channel nodes are connected to one 16G FC switch port, that port is 3X oversubscribed. Or if sixteen 16G FC servers connect to a switch and all of them simultaneously try to send or receive traffic to the rest of the network through two 64G FC switch uplinks, then those switch uplinks are 2X oversubscribed (16x16G is two times the bandwidth of 2x64G). Similar oversubscription scenarios can be created with Ethernet and InfiniBand. Oversubscription is not always bad, especially if the “downstream” links are not all expected to be handling data at full line rate all of the time.
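To make the arithmetic concrete, here is a small sketch (not from the webcast) that computes the oversubscription ratio as aggregate node bandwidth divided by aggregate uplink bandwidth; the numbers mirror the FC examples in the answer above.

```python
# Oversubscription ratio = bandwidth the nodes can offer / bandwidth available upstream.
def oversubscription_ratio(node_speeds_gbps, uplink_speeds_gbps):
    """Ratio of aggregate node bandwidth to aggregate uplink bandwidth."""
    return sum(node_speeds_gbps) / sum(uplink_speeds_gbps)

# Six 8G FC nodes feeding one 16G FC switch port -> 3.0 (3X oversubscribed)
print(oversubscription_ratio([8] * 6, [16]))

# Sixteen 16G FC servers sharing two 64G FC uplinks -> 2.0 (2X oversubscribed)
print(oversubscription_ratio([16] * 16, [64, 64]))
```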
Q. Can’t the switch regulate the incoming flow?
A. Yes. If you have flow control or a lossless network, each switch can pause incoming traffic on any port when its buffers are getting too full. However, if a switch pauses incoming traffic for too long in a lossless network, the congestion can spread to nearby switches. In a lossy network, the switch could instead selectively drop packets to signal congestion to the senders.
While the lossless mechanism allows the switch to regulate the incoming flow and deal with congestion, it does not avoid congestion. To avoid generating too much traffic in the first place, the traffic sources (server or storage) need to throttle their transmission rates. The aggregate traffic generated across all sources to one destination needs to stay below the link speed of the destination port to prevent oversubscription. The FC standards committee is working on exactly such a proposal; see the answer to the question below. A simple sketch of this condition follows.
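As a back-of-the-envelope illustration of that last point (a sketch under the simplifying assumption of equal sharing, not a description of any product or standard), if each source throttles to an equal share of the destination port’s link speed, the aggregate stays at or below the point of oversubscription:

```python
# Sketch: the fair transmit rate each source should throttle to so that the
# aggregate offered load does not exceed the destination port's link speed.
def fair_share_gbps(destination_link_gbps, active_sources):
    """Equal per-source rate that keeps the destination port from being oversubscribed."""
    return destination_link_gbps / active_sources

# Eight servers writing to one 32G FC storage port -> each should average <= 4 Gb/s
print(fair_share_gbps(32, 8))
```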
Q. Is the FC protocol considering a back-off mechanism like DCTCP?
A. The Fibre Channel standards organization T11 recently began investigating methods for providing notifications from the Fabric to the end devices to address issues associated with link integrity, congestion, and discarded frames. This effort began in December 2018 and is expected to be complete in 2019.
Q. Do long distance FC networks need to have giant buffers to handle all the data required to keep the link full for the time that it takes to release credit? If not, how is the long-distance capability supported at line speed, given the time delay to return credit?
A. As the link comes up, the transmitter is initialized with credits equal to the number of buffers available in the receiver. This preloaded credit count has to be high enough to cover the time it takes for credits to come back from the receiver. A longer delay in credit return requires a higher number of buffers/credits to maintain maximum performance on the link. In general, the credit delay increases with link distance because of the increased propagation delay for the frame from transmitter to receiver and for the credit from receiver to transmitter. So yes, you do need more buffers for longer distances. This is true with any lossless network – Fibre Channel, InfiniBand, and lossless Ethernet.
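As a rough illustration of why distance drives buffer count (a sketch using approximate figures: ~5 µs of propagation delay per km of fiber, nominal line rate, and full-sized frames; real sizing should follow the switch vendor’s guidance), the credits needed to keep a link full are roughly the round-trip time multiplied by the line rate, divided by the frame size:

```python
import math

# Rough sketch: buffer-to-buffer credits needed to keep a long-distance FC link full.
# Assumptions (approximate): ~5 microseconds of propagation delay per km of fiber,
# nominal line rate, and full-sized frames.
PROPAGATION_US_PER_KM = 5.0   # one-way delay per km of fiber (approximate)
FULL_FRAME_BYTES = 2112       # approximate full-size FC frame payload

def credits_needed(distance_km, line_rate_gbps, frame_bytes=FULL_FRAME_BYTES):
    round_trip_s = 2 * distance_km * PROPAGATION_US_PER_KM * 1e-6
    bits_in_flight = round_trip_s * line_rate_gbps * 1e9
    return math.ceil(bits_in_flight / (frame_bytes * 8))

# A 100 km link at a nominal 16 Gb/s needs on the order of ~950 credits
print(credits_needed(100, 16))
```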
Q. Shouldn’t storage systems have the same credit-based system to regulate the incoming flow to the switch from the storage systems?
A. Yes, in a credit-based lossless network (Fibre Channel or InfiniBand), every port, including the port on the storage system, is required to implement the credit-based system to maintain the lossless characteristics. This allows the switch to control how much traffic is sent by the storage system to the switch.
Q. Is the credit issuance from the switch or from the tx device?
A. The credit mechanism works both ways on a link; it is bidirectional. So if a server is exchanging data with a switch, the switch uses credits to regulate traffic coming from the server, and the server uses credits to regulate traffic coming from the switch. This mechanism is the same on every Fibre Channel link, be it Server-to-Switch, Switch-to-Switch or Switch-to-Server.
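To show the idea in the abstract, here is a deliberately simplified sketch of one direction of a credited link (not the actual Fibre Channel state machine): the transmitter starts with the credit count advertised by its peer, spends a credit per frame, stalls at zero, and regains a credit when the receiver signals a freed buffer (an R_RDY in FC terms).

```python
# Simplified sketch of credit-based flow control on one direction of a link.
# Not the real FC state machine; it only shows why a sender stalls when credits run out.
class CreditedTransmitter:
    def __init__(self, initial_credits):
        self.credits = initial_credits   # advertised by the receiver at link bring-up

    def send_frame(self):
        if self.credits == 0:
            return False                 # must wait for a returned credit before sending
        self.credits -= 1                # one receive buffer consumed at the far end
        return True

    def receive_credit(self):
        self.credits += 1                # receiver freed a buffer (e.g., sent R_RDY)

tx = CreditedTransmitter(initial_credits=2)
print(tx.send_frame(), tx.send_frame(), tx.send_frame())  # True, True, False (stalled)
tx.receive_credit()
print(tx.send_frame())                                    # True again
```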
Q. Can you comment on DCTCP (datacenter TCP), and the current work @IETF (L4S – low loss, low latency, scalable transport)?
A. There are several possible means by which congestion can be observed and quite a few ways of managing that congestion. ECN and DCTCP were selected for the simple reason that they are established technologies (even if not widely known), and have been completed. As the commenter notes, however, there are other means by which congestion is being handled. One of these is L4S, which is currently (as of this writing) a work in progress in the IETF. Learn more here.
Q. Virtual Lanes/Virtual Channels would be equivalent to Priority Flow Control – the trick is that, in standard TCP/IP, no one really uses different queues/PCP/QoS to differentiate between flows of the same application but different sessions, only between different applications (VoIP, data, storage, …).
A. This is not quite correct. PFC has to do with an application of flow control upon a priority; it’s not the same thing as a priority/virtual lane/virtual channel itself. The commenter is correct, however, that most people do not see a need for isolating out storage applications on their TCP priorities, but then they wonder why they’re not getting stellar performance.
Q. Can every ECN capable switch be configured to support DCTCP?
A. Switches are, by their nature, stateless. That means that there is no need for a switch to be ‘configured’ for DCTCP, regardless of whether or not ECN is being used. So, in the strictest sense, any switch that is capable of ECN is already “configured” for DCTCP.
Q. Is it true that admission control (the FC buffer credit scheme) has the drawback of usually under-utilizing the links, especially if your workload uses many small frames rather than full-sized frames?
A. This is correct in certain circumstances. Early in the presentation we discussed how it’s important to plan for the application, not the protocol (see slide #9). As noted in the presentation, “the application is King.”
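As a hedged, back-of-the-envelope illustration of the questioner’s point (approximate numbers chosen for this sketch, not figures from the presentation), if the credit count was sized assuming full frames but the workload sends small frames, the bytes in flight can no longer cover the link’s bandwidth-delay product and utilization drops roughly in proportion to frame size:

```python
# Rough sketch: link utilization when a fixed credit count carries small frames.
# Approximate model: utilization is capped by (credits * frame_size) divided by the
# bytes needed to fill the round-trip pipe (the bandwidth-delay product).
def utilization(credits, frame_bytes, link_gbps, round_trip_us):
    bdp_bytes = link_gbps * 1e9 / 8 * round_trip_us * 1e-6
    return min(1.0, credits * frame_bytes / bdp_bytes)

# 100 credits on a 16 Gb/s link with a 100 microsecond round trip:
print(utilization(100, 2112, 16, 100))   # 1.0 with full-size frames (pipe stays full)
print(utilization(100, 256, 16, 100))    # ~0.13 with small frames (link under-utilized)
```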