Tuesday, September 30, 2008

A Comparison of Mechanisms for Improving TCP Performance over Wireless Links

"Wireless TCP" is probably a popular prelim question; it also reappears from time to time in conferences; I've thought about working on it myself. The basic problem is at least clearly stated: TCP believes that all packet losses are caused by congestion, but in wireless this is likely to be incorrect. Thus, it does not perform very well over wireless links. The result of this paper is "yes, but local repair fixes the problem."

The authors evaluate a number of different possible solutions: local retransmission and a series of end-to-end modifications. Intuitively, we should be predisposed to local repair for several reasons: since it is closest to the link, it should have the best information about link quality and should be able to react the fastest to loss. Split connections are a rather heavyweight solution to a not-that-deep problem, and furthermore use significantly more network resources. SACKs are a worthy modification to TCP, but are best thought of as solving the bursty loss problem and helping throughput on high bandwidth-delay links.

What this problem essentially comes down to is that, if a link has a packet reception rate of "p", then with "r" transmission attempts, the probability of packet delivery when using link-level acknowledgments is 1-((1-p)^r). What this says is that we can make a poor link look like a good link by estimating the value of p on the fly and tuning r so as to guarantee a certain level of reliability. Of course, this can be complicated by bursty losses, but the intuition is correct.
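
Just to make the intuition concrete, here is a little back-of-the-envelope sketch (Python, with made-up PRR values and a made-up 99.9% delivery target) of how many transmission attempts that formula implies:

    import math

    def attempts_needed(prr, target):
        """Smallest r such that 1 - (1 - prr)**r >= target."""
        # Each attempt fails independently with probability (1 - prr),
        # so r attempts all fail with probability (1 - prr)**r.
        return math.ceil(math.log(1.0 - target) / math.log(1.0 - prr))

    for prr in (0.5, 0.7, 0.9):
        print(prr, attempts_needed(prr, 0.999))
    # 0.5 -> 10 attempts, 0.7 -> 6, 0.9 -> 3

So a handful of retries is enough to make even a fairly lossy link look reliable, which is the whole game; in practice r would be adapted as the estimate of p changes, and bursts break the independence assumption.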

The results of the paper bear out this intuition: local repair works as well as anything, and the end-to-end protocols are hurt by the fact that their controllers are far from where the losses are taking place. Split connections also work almost as well as local repair, but they offer no gain over just running the local protocol directly on the TCP packets.

The upshot of all of this is that when the original internet papers say that the maximum drop rate of a link should be around 1%, a lossier wireless link can still meet that bar by hiding its losses behind local retransmission.

MACAW: A Media Access Protocol for Wireless LANs

The authors present their design for a media access control protocol for a shared 256 kbps channel. Their overriding design consideration is fairness rather than throughput, a decision which appears at numerous points in the paper. They begin with a typical RTS-CTS-DATA scheme very similar to CSMA/CA and then consider a number of modifications to address fairness: the receiver can contend on behalf of a sender using RRTS, the data can use link-level recovery with ACKs, or the sender can acknowledge the CTS message with a DS message before sending the payload.

The design of a workable MAC is somewhat complicated, and its design will affect power consumption, throughput, latency, and reliability. Of these axes, the authors did a reasonable job evaluating throughput; the others are not well treated (although low latency can be a function of fair sharing between senders). My main objection to their MAC is the high protocol overhead of the RTS-CTS-DS-DATA-ACK exchange, which occurs even in the absence of channel contention. What we have found in our own research is that the authors are exactly right about local repair; the lossy nature of these links makes it inadvisable to just send a packet and assume it got there. However, our lesson has been that the additional packets involved in more complex coordination schemes result in a noisier channel, which can be an important effect with many senders. Furthermore, if the data packets are short, the data can effectively be piggybacked on the RTS; if it is received at all then no further transmissions are necessary, and power consumption is reduced because of fewer sends and receives.
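
To put a rough number on the overhead complaint, here's a toy calculation; the frame sizes are purely hypothetical and are not MACAW's actual formats, but they show how short payloads end up dominated by coordination traffic:

    # Hypothetical control-frame sizes in bytes, just for illustration.
    CONTROL = {"RTS": 30, "CTS": 30, "DS": 30, "ACK": 30}

    def overhead_fraction(payload_bytes):
        """Fraction of airtime spent on control frames for one exchange."""
        control = sum(CONTROL.values())
        return control / float(control + payload_bytes)

    for payload in (32, 256, 1500):
        print(payload, round(overhead_fraction(payload), 2))
    # 32 -> ~0.79, 256 -> ~0.32, 1500 -> ~0.07

Which is why, for short packets, piggybacking the data on the RTS is so attractive.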

Several things about this paper mean that our design lessons are not directly applicable. For one, they use significantly different radio technology, with a "near-field" PHY and short range. Second, their workload model is significantly different (they cite a data rate at one point of 64 packets per second). Finally, all communication is to central collection points. Given these parameters, I probably would have tried to reduce contention by designing a protocol for assigning different channels to nearby receivers to reduce cross-cell interference (this should solve the RRTS problem), and then evaluated whether or not the full scheme they propose is necessary in practice.

This brings us to a final criticism of the paper, which is its evaluation methodology. MAC protocols have been shown to be very sensitive to workload and environmental factors; if a protocol is designed for a specific workload and environment, it would be significantly more convincing to have implementation results for that environment than packet-level simulation. Modeling RF communication is notoriously difficult, and there is still active work in conferences on improving the simulations in ns-2 and TOSSIM. Implementation results, even if simplistic, are an important sanity check.

Tuesday, September 23, 2008

Scaling Internet Routers Using Optics

McKeown again presents a router architecture, except this time using optics. He begins with the claim that the centralized crossbar and scheduler approach from his previous paper does not scale; this is believable because of the n^2 complexity of the matching problem necessary to efficiently switch packets, and also because of his argument that multistage systems lead to unpredictable latencies.

He presents his new queuing design, where multiple sets of input queues are connected to output queues cyclically; I think the goal there is to reduce the non-determinism so that the switch can run faster with more predictable latency. He also considers power to be the limiting factor of how much equipment can be stuffed into a rack.

I find it difficult to tell what the magnitude of this paper's contribution is; it seems like it could be either a seminal work laying out a new architecture, or a pie-in-the-sky paper which could never be implemented.

A Fast Switched Backplane for a Gigabit Switched Router

McKeown presents a series of design decisions to allow high throughput between different ports on a network switch. He points to trends in router design towards more hardware, more parallelism, and fewer shared buses to reduce contention, and so argues that a switched backplane will be best because it reduces contention. His design consists of a star topology where each line card is connected to a single switch. The switch can connect pairs of cards, and so the challenge is to quickly form a matching between enqueued messages and receive buffers which minimizes latency and is fair. The algorithm he proposes is called iSLIP, and consists of multiple rounds of proposals after which the switch can choose a conflict-free match.

He is especially interested in reducing blocking; HOL (head-of-line) blocking is particularly problematic. To solve that, he introduces the VOQ, or virtual output queue, where each card has multiple output queues, one for each other line card. In this scheme, a packet can never become blocked behind a packet destined for a different port. He also notes that a combination of queuing discipline (in the sense of more VOQs plus priorities) has the same effect as more speedup (running the switch at a multiple of the line card speed).
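
Here's a minimal sketch of the VOQ idea plus a single request/grant/accept round, loosely in the spirit of iSLIP (the real algorithm iterates and uses rotating round-robin pointers; the lowest-index choices below are just a stand-in):

    from collections import deque

    N = 4  # number of ports

    # One virtual output queue per (input, output) pair, so a packet headed for
    # a busy output never sits in front of a packet headed for an idle one.
    voq = [[deque() for _ in range(N)] for _ in range(N)]

    def one_match_round():
        """One request/grant/accept round; returns {input: output} pairs."""
        # Requests: each input asks for every output it has queued traffic for.
        requests = {i: [o for o in range(N) if voq[i][o]] for i in range(N)}
        # Grants: each output picks one requesting input (lowest index here;
        # iSLIP uses rotating pointers to desynchronize the grants).
        grants = {}
        for o in range(N):
            askers = [i for i in range(N) if o in requests[i]]
            if askers:
                grants[o] = min(askers)
        # Accepts: each input accepts at most one of the grants it received.
        accepted = {}
        for o, i in grants.items():
            if i not in accepted:
                accepted[i] = o
        return accepted

    voq[0][1].append("pkt-a"); voq[0][2].append("pkt-b"); voq[1][1].append("pkt-c")
    print(one_match_round())  # {0: 1}; unmatched pairs wait for later iterations or slots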

This paper outlines the full design of a small switch; I thought it was an interesting read. The techniques he mentions are probably still valid today, since a 16-way switch with 2.4 Gb/s ports is still a reasonable piece of hardware. It was light on details, though, and parts were confusing; the frequent notes that the Cisco 12000 uses some of the techniques didn't actually contribute much to the paper, because he does not go into detail beyond the general design of that system.

Wednesday, September 17, 2008

Supporting Real-Time Applications

This second Clark, Shenker, and Zhang paper presents a queueing algorithm and architecture for differentiated services. The conceit is to separate traffic into two classes, one with real-time requirements and one with more flexible latency requirements. To do so, they allow the network to make service commitments, essentially providing admission control and resource reservations. They also propose a new queuing discipline, FIFO+, to address sharing between flows.

This work seems incomplete, at best, to me. The authors mention off-hand that a pricing model will be essential to provide proper incentives for the system to work correctly, yet there is no discussion of how this would fit into the rest of the scheme. Even Shenker admits in his later paper that this work is particularly unmotivated; one does not get a sense that the complexity introduced here is necessary. Furthermore, the evaluation is lacking, given the incomplete architecture presented. The largest contribution may be the paper's identification of "predicted service" applications; it seems that all internet applications may enter this class by attempting to compute the network capacity available to them on the fly.

Fundamental Design Issues for the Future Internet

This 1995 paper by Scott Shenker takes a step back from the networking research of the day and tries to predict what architectural changes will be necessary for the Internet to scale into the next decade. At that time, the internet had about 3 million hosts and tens of thousands of web sites; its transition from academic curio to a piece of public infrastructure was well underway. The challenge he focuses on is Quality of Service (QoS) guarantees, with streaming audio and video as the driving applications. The architectural question behind these demands is whether or not new service models are necessary for these applications to succeed. While IP provides best-effort delivery, real-time applications might benefit from bounded latency. The solutions he explores are admission control and fair queuing within the network core.

He considers the alternatives to and problems with a differentiated services model: the largest problem is fairness, in the sense that end users would have no incentive to request anything less than the best service, which would defeat the point of the service model. The only solution he proposes is the only one that probably has any chance of working: charging different service classes different amounts for their utilization. However, he does not seem to seriously believe this will emerge as an evolution of the internet. Shenker also seriously considers maintaining the existing best-effort service model while overprovisioning the network to allow latency- and bandwidth-sensitive applications to perform well; footnote 12, where he contends that if "bandwidth does become extremely inexpensive, some of our design decisions will be altered. It might be preferable to merely add bandwidth rather than extend the service model," is prescient.

With the benefit of hindsight, it appears that internet architects may have underestimated the cost and difficulty of modifying the IP model; even the roughly contemporaneous IPv6 transition has yet to have a major impact. The basic conservatism of network operators, faced with a choice between new, untested technology that might improve utilization and a lot more of the old technology (more pipes), has so far come down in favor of the well-understood solution of adding more bandwidth, and with good results.

It would be simplistic, though, to argue that adding bandwidth is the only way the internet has adapted to the demands of realtime traffic in the past ten years. The emergence of traffic-shaping middleboxes near endpoints to effectively prioritize traffic based on deep, stateful packet inspection is one development; another is the proliferation of content delivery overlay networks, which geographically distribute latency-sensitive content to put it closer to consumers. Both of these have apparently been easier to do than modifying the underlying network model.

Monday, September 15, 2008

XCP

XCP is designed to deal with links with very high bandwidth-delay products. The authors claim that TCP will inevitably become unstable when run over these links; their claim is based on the fact that once the feedback period is too long, TCP will exhibit oscillatory behavior as it increases its congestion window, only to be forced to dramatically decrease it once the link becomes congested; because the feedback loop is so long, it will have already injected a good deal of data by the time it receives the control signal. They also argue that slow start is too slow for these links.

As a solution, the authors propose to separate the utilization and fairness goals and to give them different control policies: MIMD and AIMD. Multiplicative increase, multiplicative decrease allows the protocol to quickly grow the amount of bandwidth a connection uses, while the multiplicative decrease still allows a quick response to congestion. Further, by carrying the congestion window in a header on the transmitted data, senders and receivers gain better information about the congestion state of the network, effectively extending the one bit of feedback other algorithms use.
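
A toy calculation of why the utilization controller wants multiplicative growth: starting from a small window and aiming for a (made-up) bandwidth-delay product of 100,000 packets, additive increase takes on the order of the BDP in round trips while doubling takes a handful:

    def rtts_to_fill(start, target, additive_step=1, multiplier=2.0):
        """Count RTTs for additive vs. multiplicative window growth (toy model)."""
        w, a_rtts = start, 0
        while w < target:
            w += additive_step    # additive increase: +1 packet per RTT
            a_rtts += 1
        w, m_rtts = start, 0
        while w < target:
            w *= multiplier       # multiplicative (slow-start-style) doubling per RTT
            m_rtts += 1
        return a_rtts, m_rtts

    print(rtts_to_fill(10, 100_000))   # roughly (99990, 14)

Which is basically the argument for decoupling: ramp utilization multiplicatively, then let AIMD handle convergence to fairness.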

One great point of the paper that the authors emphasize is the protocol's TCP-friendliness. I think this is essential for modern transport protocols; no one would risk deploying a new protocol if they did not have at least a reasonably well-founded belief that it would not clobber existing streams. Much of the rest of the paper is the necessary boilerplate to show that the protocol achieves its design goals. However, it's not really clear that the need for this protocol in the wide area has emerged since it was presented in 2002. While there are certainly high bandwidth-delay links, often the TCP sessions running over them are just one in a sea of many others. If this is used anywhere, I would expect it to be in special computing facilities like data centers, supercomputers, and potentially between installations like the various national supercomputer centers.

Sunday, September 14, 2008

Random Early Detection

The basic observation this paper makes is that, since dropping packets is interpreted by many transport protocols as a sign of congestion, dropping packets can be used to request that endpoints communicating through the router reduce their traffic. The goals of RED are twofold: one is to maintain short queues in the router so as to minimize latency, and the other is to maintain a fair allocation of bandwidth between competing flows. The protocol uses relatively simple estimators to approximate the running average of the queue depth and to drop packets at a rate which increases with the queue length. As the length of the queue increases, RED drops proportionally more of the incoming packet stream in order to quench more senders. Since the drops are approximately uniformly distributed in time, flows using more of the router's capacity should experience more dropped packets.
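
A minimal sketch of the estimator and drop decision, leaving out details like the count-based spreading of drops and the idle-time correction; the parameters are just illustrative, not recommended values:

    import random

    class RedQueue:
        def __init__(self, w_q=0.002, min_th=5, max_th=15, max_p=0.02):
            self.w_q, self.min_th, self.max_th, self.max_p = w_q, min_th, max_th, max_p
            self.avg = 0.0   # exponentially weighted moving average of queue length

        def on_arrival(self, current_queue_len):
            """Return True if the arriving packet should be dropped (or marked)."""
            # Low-pass filter so transient bursts don't trigger drops.
            self.avg = (1 - self.w_q) * self.avg + self.w_q * current_queue_len
            if self.avg < self.min_th:
                return False
            if self.avg >= self.max_th:
                return True
            # Drop probability grows linearly between the two thresholds.
            p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
            return random.random() < p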

RED differs from FQ in several important respects. Clearly, RED is stateless with respect to individual conversations, a significant advantage. RED also explicitly extends the congestion feedback loop to include the senders, by making the assumption that they will respond to dropped or marked packets by reducing their rate. As the authors mention, they are considering networks where most of the traffic is TCP streams (or DECbit). However, unlike the authors of FQ, they do not evaluate the effect of "firehose" senders being present on the network; it appears that this scheme would fail in those cases because, although packets are dropped randomly, the altruistic senders would reduce their rate in response to the feedback, leaving the greedy ones a greater share of the capacity. It is a useful result that cooperating senders eliminate the need for per-connection state in such a simple way.

This paper definitely has the feel of the journal paper that it is; a conference never would have allowed the authors to go on for so long about relatively basic math. Pedagogically speaking, though, I thought it was pretty interesting to see the problem mapped out in such detail. Although the idea is very simple, the authors did a good job of examining some of the premises underlying their work; in particular, it seems like you could really get in a hole with parameter selection, where you have only rules of thumb about the actual values. They attempt to quantify the tradeoffs of each setting; with any dynamic system, I suspect the devil is in the details, so it is good to see it worked out. It strikes me that these past few papers we have read on queuing discipline have been a bit more rigorous than your run-of-the-mill CS paper; for me, this is a reason to keep them.

Wednesday, September 10, 2008

Core-Stateless Fair Queueing (CSFQ)

If the biggest objection to FQ is the per-flow state, the authors show how that state can be pushed to the edges of the network, allowing core devices to operate on individual packets while maintaining the nice properties of fair queueing. The actual exposition of their algorithm only takes a couple of pages, and the rest of the paper is an evaluation of the architecture and performance. In CSFQ, all per-flow state is maintained at ingress routers, which tag packets with a label representing the rate of the flow. They use an exponential filter to estimate the arrival rate of packets for a particular flow, which keeps the per-flow state to a few variables, and write the estimated rate into an MPLS-style tag at the head of the packet. What seems like it could be a crucial part of the algorithm is rewriting the rate estimates in the core, by re-estimating the "fair rate" downward when packets are dropped.
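
Roughly, the two halves of the mechanism look like this (a sketch: the edge-side exponential filter and the core-side drop/relabel step, with the core's own fair-rate estimation left out and passed in as a parameter):

    import math, random

    def update_rate(old_rate, pkt_bytes, interarrival, K=0.1):
        """Edge-side exponential rate estimate, roughly in the paper's form:
           r_new = (1 - e^(-T/K)) * (len/T) + e^(-T/K) * r_old."""
        w = math.exp(-interarrival / K)
        return (1 - w) * (pkt_bytes / interarrival) + w * old_rate

    def core_forward(label_rate, fair_rate):
        """Core-side decision: drop probabilistically, relabel survivors."""
        drop_prob = max(0.0, 1.0 - fair_rate / label_rate)
        if random.random() < drop_prob:
            return None                       # dropped
        return min(label_rate, fair_rate)     # new label carried downstream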

The authors' simulation seems to show CSFQ working about as well as DRR (Deficit Round Robin), which seems to be the benchmark here since it maintains a per-flow queue discipline. The advantage of CSFQ is the careful state management. The other algorithms really do not perform well at all, for the same reasons the authors argued in Analysis and Simulation...

The authors mention MPLS as one example of existing labeling and classification at the edge; there's a more recent modification (MPLS-TE, for Traffic Engineering) that provides some administrative control over routes. However, I think that modern routers do implement forms of fair queueing. It may be the case that the amount of state needed to implement fair queueing at every router is no longer prohibitive; also, it seems that some routers implement fair queueing between administratively defined traffic classes, which both reduces the state, since N is now the number of classes and not the number of conversations, and accomplishes part of the intent of FQ.

I'm not sure I would include both this paper and the original FQ one.

Fair Queueing

The main observation that this paper makes is that if you have a set of different workloads presented to a router, the traditional FIFO (FCFS) with tail-drop policy doesn't perform very well in the face of congestion (when the demand for link bandwidth is greater than the supply). The authors propose a new queuing scheme where packet sources receive their fair share of the available bandwidth, defined to be 1/N with N concurrent flows. Furthermore, they want to provide lower-latency service to flows which use less than their fair share.

To achieve this, they consider a bitwise FCFS model where each flow submits one bit at a time for transmission. They notice that this model has the properties they want; for instance, if a set of large packets and one small packet are all submitted at about the same time, the small packet will finish before the large ones. Since routers can't actually act on bits, they develop a packetized queuing scheme which has the property that packets are transmitted in roughly the same order that they would be using the bitwise algorithm. They evaluate their new algorithm with theoretical results and with microbenchmark simulations on a number of network topologies. The motivation for the simulations is twofold: first, they wish to show that their basic derivation has good properties and is worth the cost over other existing queuing algorithms. They also mention that they have been unable to prove properties about a network of gateways all using fair queuing, and so want to validate that there isn't a breakdown when more than one router uses this algorithm.
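
A simplified sketch of the packetized emulation: each packet gets a finish number derived from the flow's previous finish number and a virtual "round" clock, and the smallest finish number goes out first. Computing the round number exactly is the subtle part of the paper, so here it is just passed in:

    import heapq

    class FairQueue:
        def __init__(self):
            self.last_finish = {}   # per-flow finish number of the last enqueued packet
            self.heap = []          # (finish_number, seq, flow, packet)
            self.seq = 0

        def enqueue(self, flow, packet, length, round_number):
            # F = max(previous finish for this flow, current round) + packet length,
            # mimicking where the packet's last bit would finish under
            # bit-by-bit round robin.
            start = max(self.last_finish.get(flow, 0.0), round_number)
            finish = start + length
            self.last_finish[flow] = finish
            heapq.heappush(self.heap, (finish, self.seq, flow, packet))
            self.seq += 1

        def dequeue(self):
            if not self.heap:
                return None
            _, _, flow, packet = heapq.heappop(self.heap)
            return flow, packet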

The basic idea underlying this queuing algorithm, that participants who are not greedy and use less than their share of a resource should be rewarded with better service, seems to have some generality; it is also used in many modern process schedulers (like MLFQ) to improve response time for interactive processes, a very similar goal to what they are trying to accomplish here.

This one can be a bit of a slog, but they explain what is going on well in the text and the math is mostly understandable. I never did get a good sense of what the difference between the "preemptive" and "nonpreemptive" versions of the algorithm was; the math on page 5 seems to bound the average response time, and in the preemptive case it is less... A table of constants might have helped, since there are a whole bunch of them. Fair queuing as a paper has some value, although it seems a bit hard to understand out of context. A big weakness of this work is the per-flow state; I also would have liked it if the authors had explicitly stated what state is kept for each flow.

Monday, September 8, 2008

Increase and Decrease

It was the spring of hope, it was the winter of despair, and so it is with protocols; the trick is knowing when you're in spring and when you're in winter. This paper is by far the more interesting of the two, since I think it manages to take the control problem of endpoints communicating at rates subject to interference by other senders and collapse it to a linear model in which they demonstrate stability under conditions not depending on the number of participants. One key idea is to reduce the feedback from the network to a single congestion signal; of course, in TCP there isn't actually a congestion bit, but rather congestion is inferred from loss; in any case, the idea is the same. The authors demonstrate how additive increase and multiplicative decrease lead to streams which are "fair" and "efficient" for all participants. The work was very closely contemporaneous with Congestion Avoidance and Control, and so represents an effort underway at the time to more thoroughly understand the conditions for stable transport protocols.
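
A toy version of the phase-plane picture: two synchronized flows that add a constant when the single feedback bit says "underloaded" and multiply by a factor less than one when it says "overloaded" end up converging toward each other no matter where they start (capacity and constants below are arbitrary):

    def aimd(x1, x2, capacity=100.0, add=1.0, mult=0.5, steps=200):
        """Two synchronized AIMD flows; returns their final rates."""
        for _ in range(steps):
            if x1 + x2 <= capacity:      # single bit of feedback: "not congested"
                x1, x2 = x1 + add, x2 + add
            else:                        # "congested": both back off multiplicatively
                x1, x2 = x1 * mult, x2 * mult
        return x1, x2

    print(aimd(80.0, 5.0))
    # The two rates converge toward each other (fairness) while sawtoothing
    # below capacity (efficiency).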

Without much of a background in control theory, the authors provide enough information on linear controls to assert that the major benefit these controls will have over nonlinear (say, quadratic) control is that they will be less brittle because they will require less information about the state of the system to function; in particular, not knowing the number of participants in the control scheme seems like a key requirement for internet protocols where this number ("n") is going to vary by many orders of magnitude.

An interesting observation is that the authors seem to have been more concerned with the OSI suite than with the internet protocols. The OSI protocols, although eventually finished and implemented, were generally regarded as failures because they were too complicated and were never available in partly-finished yet functional forms the way TCP was. By the time solid implementations were available, it was too late; this paper seems to be a symptom of the impulse to research and design every facet of the protocols up front. While this can probably be both a virtue and a vice, it certainly led to commercial failure...

The paper doesn't include any measurement studies of actual network connections, although it can serve to more fully explain figures 8 & 9 in Congestion Avoidance. I think this work is mostly valuable as an examination of the assumptions underlying the internet congestion control algorithms, grounding them in a series of justified assumptions. They do note that they don't consider the asynchronous case where streams come and go rather than all starting at once; this is actually an important divergence from the real world, although apparently the control scheme is stable in this regime as well.

Congestion Avoidance and Control


It makes sense to put this paper first, since it is good at framing the problems TCP was experiencing. The authors introduce a number of new algorithms for improving throughput and fairness; they are all classics now, including the RTT variance estimation and slow start. One of the design points they emphasize, which has contributed to TCP's robustness, is focusing on a wide spectrum of bandwidth-delay product regimes, from slow packet-radio links to fast local wired networks. They develop a very simple variance estimator, implementable in just a few lines of C. They also outline "slow start", which really starts exponentially fast. I actually find their diagram on page 4 pretty confusing; I think it's because they seem to be assuming that every packet will be acked, and I don't think that's really the way TCP works.
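
For reference, the flavor of the estimator, in floating point with the commonly cited gains rather than the paper's scaled integer arithmetic:

    def update_rto(srtt, rttvar, measurement, g=0.125, h=0.25):
        """Mean/deviation RTT estimator in the style of Jacobson's appendix."""
        err = measurement - srtt
        srtt = srtt + g * err                      # low-pass filter on the mean
        rttvar = rttvar + h * (abs(err) - rttvar)  # low-pass filter on the deviation
        rto = srtt + 4 * rttvar                    # retransmit timeout
        return srtt, rttvar, rto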

What's really interesting about this paper is its generality: any stream protocol will have to discover the available bandwidth and respond to latency changes due to causes beyond the endpoints' control. This leads to the maxim that "those who don't understand TCP are doomed to reinvent it," and based on my experience in sensor nets, we've imperfectly reinvented it at least several times. Although this is a bit of a digression, it's really interesting how TCP can take advantage of huge differences in buffering capacity (and thus window size) when devices with drastically different capabilities communicate.

The paper addresses a whole range of concerns, from the robustness of its estimators to how easily they can be implemented with only integer arithmetic; I rather suspect that the SIGCOMM version didn't include some of the sample implementation. Nonetheless, part of the beauty of this work is that they show you don't need complicated solutions to complicated problems; they convinced the community that although there were sophisticated ideas behind their fixes, no sophisticated implementation was required.


Tuesday, September 2, 2008

Internetworking Principles

End-to-end Arguments in System Design

Saltzer, Reed, and Clark work out the principle that the most meaningful functionality in networking, and in many other systems, should be placed at the "ends," rather than distributed throughout the "middle" or "low levels" of the system. This has a number of very interesting corollaries which follow immediately from the argument; some of them are laid out in the paper.

One design decision which follows immediately from this argument is the semantics of IP packets: packets are delivered zero or more times, in any order. The decision to settle on these best-effort delivery semantics has allowed the protocols on top to be very widely applicable. One reason for this seems to be that these semantics reflect reality. This is true in the sense that it is "easy" to build a contention-based protocol like Ethernet, or anything using wireless, which follows these guidelines using nothing more complicated than backoff and retry, while attempting to achieve any of the other properties the authors mention, like FIFO delivery or duplicate suppression, almost inevitably results in something like TCP. Furthermore, if those assumptions are baked into the networking fabric, it becomes much more difficult to handle broadcast domains and multicast communication patterns in the network.

A great section of the paper is the part where they discuss the problem of "identifying the endpoints." One of their principal arguments against network-level acknowledgments is the fact that application-layer acks are almost always what is desired, not message acknowledgments. An interesting way that the paper has aged is the emergence of middleboxes all over networks conducting application-layer stateful packet inspection. The way this splits "end to end" into "end-middle-end" is interesting. I think that as far as the data path is concerned, the end-to-end principle still applies, since what the middlebox is interposing is policy control and perhaps routing based on application criteria. In most cases, these boxes are still transparent with respect to the data the endpoints receive.


The Design Philosophy of the DARPA Internet Protocols

What's fascinating about this paper is how little has changed on the broader internet. Better tools for distributed management and policy-based routing? Accountability? Security? All of these remain challenges in the internet, and none of them seem to have migrated to the core; they remain point solutions within ASes. Perhaps this is to be expected from the internet architecture: newcomers will want to do the minimum possible to participate. It is interesting that even though the design emphasizes robustness to failure, it doesn't mention security as we currently understand it, which is tolerance of malicious hosts within the network. The paper's ordered goals are:
  1. Internet communication must continue despite loss of networks or gateways.
  2. The Internet must support multiple types of communications services.
  3. The Internet architecture must accommodate a variety of networks.
  4. The Internet architecture must permit distributed management of resources.
  5. The Internet architecture must be cost effective.
  6. The Internet architecture must permit host attachment with a low level of effort.
  7. The resources used in the internet architecture must be accountable.
Most of these, and in fact their ordering, still seem reasonable. I think you would have to amend 1 to read "Internet communication must continue despite loss of networks or gateways and the introduction of malicious hosts."