Friday, November 14, 2008

Development of the Domain Name System

This paper presents an essential Internet architectural component: a naming service that translates Internet ASCII names into other information, most commonly the Internet address of the host with a given name. One interesting design decision is that DNS is actually somewhat more general than I think we know from our daily usage, with a notion of record types and support for many different kinds of lookups. This functionality doesn't seem to be used all that much, because DNS servers these days are typically run by professional hosting staff who often have somewhat ossified views on what is possible. Some record types have nevertheless emerged with real importance; MX records, for example, are often used to help filter spam by designating the few mail servers allowed for a domain.
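To make the record-type idea concrete, here is a minimal sketch of querying a few different record types. It assumes the third-party dnspython package (newer versions expose dns.resolver.resolve; older ones call it query), and the domain is just a placeholder.

```python
# Sketch: querying different DNS record types, assuming the dnspython
# package is installed; the domain below is a placeholder.
import dns.resolver

def show_records(domain: str, rtype: str) -> None:
    """Print every record of the given type for a domain."""
    try:
        answers = dns.resolver.resolve(domain, rtype)  # dnspython >= 2.0
    except dns.resolver.NoAnswer:
        print(f"{domain} has no {rtype} records")
        return
    for rdata in answers:
        print(f"{domain} {rtype} -> {rdata}")

# A records map a name to an address; MX records list the hosts that
# receive mail for the domain, which is what spam filters can check;
# TXT records are the general-purpose "other information" bucket.
show_records("example.com", "A")
show_records("example.com", "MX")
show_records("example.com", "TXT")
```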

There have been some enhancements made to DNS since it was first developed, although the overall structure does not seem to have changed much. There are numerous security problems with the caching architecture of the DNS system, since poisoning a cache near a large number of users can easily redirect their requests to an adversary. There are a large number of RFCs defining DNSSEC, a set of cryptographic extensions to DNS which signs all data to prevent a whole class of attacks that are currently possible if DNS servers are compromised. Dynamic DNS is another set of extensions, allowing a mobile host to update its A record on the fly so that it can migrate between subnets and still have lookups resolve correctly. Of course, if this is being used, caching becomes much less effective, because caching for more than a few minutes is very problematic. This (I believe) has received some real traction and is implemented in modern versions of BIND.
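A toy sketch of why that is: a resolver can only reuse a cached entry until its TTL expires, so a frequently updated record has to carry a short TTL and most later lookups fall through the cache. Everything below is made up for illustration.

```python
# Minimal sketch of why short TTLs (as with dynamic DNS updates) hurt
# cache effectiveness. All names and numbers are invented.
import time

class DnsCache:
    def __init__(self):
        self._entries = {}  # (name, rtype) -> (expiry_time, value)

    def put(self, name, rtype, value, ttl):
        self._entries[(name, rtype)] = (time.time() + ttl, value)

    def get(self, name, rtype):
        entry = self._entries.get((name, rtype))
        if entry is None:
            return None          # miss: must ask an upstream server
        expiry, value = entry
        if time.time() > expiry:
            del self._entries[(name, rtype)]
            return None          # expired: also a miss
        return value

cache = DnsCache()
# A mobile host that re-registers its A record every few minutes has to
# publish a short TTL, so most lookups after a minute miss the cache.
cache.put("laptop.example.com", "A", "10.0.0.7", ttl=60)
print(cache.get("laptop.example.com", "A"))
```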

One thing which is interesting to me is the question of how far down the hierarchy users were originally envisioned to have control. To put it differently, it seems desirable to give users control over a portion of a globally accessible namespace for storing their own data; I might want a DNS server for my subdomain which allows me to store my public key and other things. However, as DNS has evolved, this type of usage has not emerged. I wonder if there is room for another naming service with a more "grassroots" service model.

Thursday, November 13, 2008

End-to-End Internet Packet Dynamics

First of all, I really like this study; the writing is straightforward, almost matter-of-fact, and the paper is very clear about what its goals are, what the contributions are, and where the data or analysis is incomplete. This is a "pure" measurement paper in that it treats the results as interesting in and of themselves and does not lower itself by implementing something based on the results. This isn't sarcasm; I think this is actually a good thing in a measurement paper.

The analysis starts with a good data set; Paxson collects traces from many 100 KB TCP transfers between 39 measurement points, geographically distributed across North America and Europe. To me, the most interesting result was the prevalence of packet reordering and packet damage in the traffic he observed. It wasn't possible to verify the checksum on whole packets, and there may have been too much traffic to log every packet in full, but it seems like this could easily be solved either by computing the checksum on the fly as packets come in, or by actually storing the whole trace; the latter may not have been feasible in 1995 but would be no problem today.
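For reference, computing the standard Internet checksum incrementally is straightforward. The sketch below is just the generic RFC 1071 ones'-complement algorithm in Python, not anything from Paxson's toolchain, and the sample bytes are arbitrary.

```python
# Sketch of computing the 16-bit ones'-complement Internet checksum
# incrementally as a packet streams in, instead of logging whole packets.
def internet_checksum(chunks) -> int:
    """Fold 16-bit words from an iterable of byte chunks."""
    total = 0
    leftover = b""
    for chunk in chunks:
        data = leftover + chunk
        if len(data) % 2:                 # keep an odd trailing byte
            data, leftover = data[:-1], data[-1:]
        else:
            leftover = b""
        for i in range(0, len(data), 2):
            total += (data[i] << 8) | data[i + 1]
    if leftover:                          # pad the final odd byte
        total += leftover[0] << 8
    while total >> 16:                    # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# A packet whose stored checksum field (zeroed before summing) does not
# match this value arrived damaged.
print(hex(internet_checksum([b"\x45\x00\x00\x1c", b"\x00\x01\x00\x00"])))
```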

X-Trace

I like X-Trace; it provides some really good insights about what is actually happening during complex data center and Internet transactions spanning many machines and services. Critical to its ability to do this is its acceptance of a limited amount of application modification to enable the detailed tracing X-Trace performs. Other studies which attempt to infer causality from snooped data traffic are significantly more limited in the conclusions they can draw, because they cannot incorporate a notion of application-level semantics into their analyses. X-Trace's notion of a persistent transaction ID allows the application being traced to define the operation which is meaningful. The ID also persists across layers, so the result is a trace showing all operations initiated by a single transaction anywhere in the system. The authors present several compelling examples of where the tool was used to diagnose performance problems and bugs.
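The core mechanism can be caricatured in a few lines: generate a task identifier once and thread it through every layer the request touches, logging as you go. This is only an illustrative sketch of the idea, not the real X-Trace metadata format or API; all the function names are invented.

```python
# Rough sketch of X-Trace-style metadata propagation: a task identifier
# is generated once and carried along every call the transaction makes,
# so all log records can later be stitched into one causal graph.
import uuid

def new_task_id() -> str:
    return uuid.uuid4().hex

def log_event(task_id: str, layer: str, event: str) -> None:
    # A real system would also record parent/child edge identifiers.
    print(f"task={task_id} layer={layer} event={event}")

def http_request(task_id: str, url: str) -> None:
    log_event(task_id, "http", f"GET {url}")
    tcp_connect(task_id, "10.0.0.1", 80)   # metadata crosses layers

def tcp_connect(task_id: str, host: str, port: int) -> None:
    log_event(task_id, "tcp", f"connect {host}:{port}")

task = new_task_id()
http_request(task, "http://example.com/index.html")
```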

I think the main value of this work is as a window into the operation of complicated distributed systems which are otherwise appallingly opaque. This system may be only the first step toward greater visibility, because for a sufficiently large system one also needs a mechanism for usefully interpreting the resulting causal graphs, which may be very large.

Sorry

For all you avid readers, sorry about the delay. I'm about to introduce some message reordering by posting in reverse order the papers which I missed while off at SenSys.

Thursday, October 30, 2008

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications

In what has got to be the most-cited DHT paper, the authors develop a protocol providing "put" and "lookup" primitives in the wide area and in a completely distributed manner. The basic structure is a "ring" of keys, which is just to say that keys are chosen from Z/2^n Z. What is very nice about their design is that correctness depends only on a single invariant: that the "successor" (next) pointer each Chord node maintains is correct. This allows them to analyze some of the basic properties of the system without additional complexity; for instance, the optimization for log-time lookups (the finger table) is not required for lookups to work, since you can always reach any key in linear time by following the successor pointers around the ring. They also work out in some detail the join, leave, repair, and stabilization protocols which allow the DHT ring to remain stable in the face of various types of events.
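A toy version of that invariant in code: with nothing but correct successor pointers, a linear walk around the ring finds the node responsible for any key; the finger table only shortens the walk. This is a simplification for illustration, not the authors' protocol, and the identifiers are arbitrary.

```python
# Toy sketch of the Chord lookup invariant: with only correct
# "successor" pointers, any key can be found by walking the ring;
# the finger table is purely an optimization for O(log N) hops.

def in_interval(x, a, b):
    """True if x lies in the half-open ring interval (a, b]."""
    if a < b:
        return a < x <= b
    return x > a or x <= b        # interval wraps past zero

class Node:
    def __init__(self, ident):
        self.id = ident
        self.successor = self      # the one pointer correctness needs

    def find_successor(self, key):
        node = self
        # Linear walk around the ring using only successor pointers.
        while not in_interval(key, node.id, node.successor.id):
            node = node.successor
        return node.successor

# Build a tiny 6-bit ring (identifiers in Z/2^6 Z) by hand.
ids = [1, 12, 23, 40, 55]
nodes = [Node(i) for i in ids]
for a, b in zip(nodes, nodes[1:] + nodes[:1]):
    a.successor = b

print(nodes[0].find_successor(30).id)   # key 30 is stored at node 40
print(nodes[0].find_successor(60).id)   # wraps around to node 1
```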

I don't think there's a question that DHTs are a nice idea with some good applications; the MIT group responsible for this paper went on to explore some more or less obvious uses of them, such as Ivy, a DHT-based file system (actually, there are many systems named Ivy). In the first paper the authors contrast their flat key space and completely distributed algorithm with more "traditional" distributed systems which function by imposing hierarchy on the collection of machines; they cite FastTrack and DNS. While there is a benefit to being distributed, such a system also seems to suffer from a lack of the "authoritative answers" that DNS can provide; DNS is fundamentally hierarchical, which makes it perhaps not the best point of comparison. The completely distributed nature also makes caching more of an issue. For instance, in a hierarchical system with high fanout but relatively low tree depth, if the nodes closer to the leaves of the tree cache results, they can deflect a good deal of traffic from traveling up the tree. This is the equivalent of your ISP's DNS server caching recent DNS queries: since all traffic flows through that one server, it can have a high hit rate. In a distributed system like Chord, each request would require a lookup traversing mostly different sets of fingers before reaching the ultimate destination.

Looking up Data in P2P Systems

The authors of this paper (the same group that came up with Chord, and perhaps also Tapestry) present a bit of a "state of the field" in distributed hash tables. Although it is a short paper in CACM, they actually go into some detail about four different DHT algorithms: Chord, Tapestry, Pastry, and Kademlia. It's an interesting read to hear about the differences, such as the distance functions, where they identify the most important properties as symmetry and unidirectionality. They also look briefly at other lookup ideas, like tree routing and CAN (which are not really unrelated).
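To make the distance-function comparison concrete, here is a tiny sketch of Chord's clockwise ring distance next to Kademlia's XOR metric; the identifiers are arbitrary. Roughly, symmetry lets a node learn useful routing-table entries from the queries it receives, while unidirectionality means lookups for a key converge toward the same node regardless of where they start.

```python
# Small sketch contrasting two of the distance functions the survey
# discusses: Chord's clockwise ring distance (unidirectional but not
# symmetric) and Kademlia's XOR distance (symmetric and unidirectional).
M = 8
RING = 1 << M            # 8-bit identifiers for the example

def chord_distance(a: int, b: int) -> int:
    """Clockwise distance from a to b on the identifier ring."""
    return (b - a) % RING

def xor_distance(a: int, b: int) -> int:
    """Kademlia's metric: bitwise XOR of the two identifiers."""
    return a ^ b

a, b = 17, 200
print(chord_distance(a, b), chord_distance(b, a))  # differ: not symmetric
print(xor_distance(a, b), xor_distance(b, a))      # equal: symmetric
```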

They note that dealing with concurrent and asynchronous failures and joins is a challenge that is hard to address in all of these systems, although Chord presents some formal proofs of resilience under failures. I think the most interesting piece of future work they note is "proximity routing", which looks like an attempt to choose fingers which are nearby in the underlying network so as to minimize lookup latency. This seems like only one way you might incorporate topological information into the hash table's decisions, but the issue of trying to generate an overlay which respects the underlying topology seems like a good research direction.

Tuesday, October 28, 2008

Resilient Overlay Networks

I think the central insight of this paper is that some things, like multi-path routing and optimizing for different application metrics, are not well handled at the IP layer. The reason is that the Internet is addressed in a flat address space, and so it's generally frowned upon to make assumptions about which traffic is important or which destinations are being used. Therefore, in order to achieve the performance reported in the paper, a great deal of routing state would need to be exchanged by BGP, most of which would remain unused.

The authors' solution is to create an overlay network: essentially, just another network layer with its own routing protocol, which uses the routing provided by an underlying layer to create virtual links between the overlay routers. This design is reminiscent of a lot of other work, like Application Layer Multicast, and maybe even Ken Birman's Isis (hi Ari...). In fact, routing is even easier with virtual links than with physical links, because each router is directly connected to every other router through a virtual link (i.e., IP hosts can talk to any other host). Therefore, flooding routing state is just sending N-1 Internet packets to the other routers; we don't have to worry much about complicated flooding mechanisms.
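In other words, dissemination degenerates to a loop over the peer list. The sketch below illustrates that with made-up node names and a stand-in send() function rather than anything from the RON implementation.

```python
# Sketch of how simple "flooding" becomes on an overlay that is a full
# mesh of virtual links: each router just sends its latest link
# measurements directly to the N-1 other routers. Names and the send()
# transport are placeholders, not the RON code.
from typing import Dict, Iterable

def send(peer: str, message: dict) -> None:
    # Stand-in for a UDP/TCP send over the underlying IP path.
    print(f"-> {peer}: {message}")

def flood_link_state(self_name: str,
                     peers: Iterable[str],
                     link_metrics: Dict[str, float]) -> None:
    """Push this node's measured virtual-link metrics to every peer."""
    update = {"origin": self_name, "metrics": link_metrics}
    for peer in peers:
        if peer != self_name:
            send(peer, update)

flood_link_state("ron-a", ["ron-a", "ron-b", "ron-c"],
                 {"ron-b": 23.5, "ron-c": 101.0})   # latencies in ms
```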

One thing I really like about this design is that it optimizes both state and control traffic for endpoints which are in use, while ignoring the rest of them. On an Internet scale, a 50-node overlay is pretty insignificant, and optimizing the same metrics for every 50-node subset would probably be impossible. I do dislike the generally arrogant tone of this paper, but it's interesting that it works.

Their result that at most one intermediate node is sufficient to overcome faults is actually pretty obvious in hindsight; they even have a nice explanation of why in the paper that bears repeating. The underlying problem is that (1) BGP converges slowly, since it's a distance-vector-style protocol, and (2) multihoming is kind of broken. Essentially, all they do with one backup intermediate router is make the endpoints' ASes "virtually multihomed." From this angle, you really might get the impression that their whole argument comes down to fixing BGP, which might not even be that hard; it's mainly a deployment issue. Nonetheless, an enjoyable read.
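Picking that one intermediate is also about as simple as it sounds: compare the direct virtual link against every single-relay alternative and take the best. The sketch below uses an invented latency table purely for illustration, not RON's actual metrics or code.

```python
# Sketch of the "at most one intermediate hop" idea: given measured
# latencies between overlay nodes, pick either the direct virtual link
# or the best single-relay path.

latency = {                       # ms, measured between overlay nodes
    ("a", "b"): 40.0, ("a", "c"): 15.0, ("c", "b"): 12.0,
}

def best_path(src: str, dst: str, nodes) -> tuple:
    direct = latency.get((src, dst), float("inf"))
    best = (direct, [src, dst])
    for relay in nodes:
        if relay in (src, dst):
            continue
        via = (latency.get((src, relay), float("inf")) +
               latency.get((relay, dst), float("inf")))
        if via < best[0]:
            best = (via, [src, relay, dst])
    return best

print(best_path("a", "b", ["a", "b", "c"]))  # (27.0, ['a', 'c', 'b'])
```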