Thursday, November 13, 2008

XTrace

I like XTrace; it provides really good insight into what is actually happening during complex data center and internet transactions that span many machines and services. Critical to its ability to do this is its acceptance that a limited amount of application modification is needed to enable the detailed tracing it performs. Other studies, which attempt to infer causality from snooped data traffic, are significantly more limited in the conclusions they can draw because they cannot incorporate application-level semantics into their analyses. XTrace's persistent transaction ID lets the application being traced define which operation is meaningful. The ID also persists across layers, so the result is a trace showing every operation initiated by a single transaction anywhere in the system. The authors present several compelling examples in which the tool was used to diagnose performance problems and bugs.
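To make the idea of a transaction ID that persists across layers concrete, here is a minimal Python sketch. It is not XTrace's actual API or metadata format; the layer functions (`http_request`, `db_query`, `cache_lookup`, `tcp_send`), the `record` helper, and the in-memory `reports` list are all hypothetical names invented for illustration. Each layer passes the same task ID down, logs its own operation with a link to its parent, and the logged reports are later stitched into one causal tree per task.

```python
import itertools
from collections import defaultdict

# Hypothetical stand-ins, not XTrace's real API: a global operation counter
# and an in-memory list playing the role of a central report collector.
_op_ids = itertools.count(1)
reports = []


def record(task_id, parent_op, layer, label):
    """Log one traced operation and return its id so callees can link to it."""
    op_id = next(_op_ids)
    reports.append({"task": task_id, "op": op_id, "parent": parent_op,
                    "layer": layer, "label": label})
    return op_id


# Each layer is "modified" only to accept and forward the task id and parent op.
def http_request(task_id, url):
    op = record(task_id, None, "http", f"GET {url}")
    db_query(task_id, op, "SELECT 1")
    cache_lookup(task_id, op, url)


def db_query(task_id, parent_op, sql):
    op = record(task_id, parent_op, "db", sql)
    tcp_send(task_id, op, "db-host")


def cache_lookup(task_id, parent_op, key):
    record(task_id, parent_op, "cache", f"lookup {key}")


def tcp_send(task_id, parent_op, host):
    record(task_id, parent_op, "tcp", f"send to {host}")


def render(task_id):
    """Rebuild the causal tree for one task from the flat report list."""
    children = defaultdict(list)
    for r in reports:
        if r["task"] == task_id:
            children[r["parent"]].append(r)

    lines = []

    def walk(parent, depth):
        for r in children[parent]:
            lines.append("  " * depth + f'{r["layer"]}: {r["label"]}')
            walk(r["op"], depth + 1)

    walk(None, 0)
    return lines


# Two independent transactions share the same collector but stay separable.
http_request("task-A", "/index")
http_request("task-B", "/about")
print("\n".join(render("task-A")))
```

Because every report carries the task ID, the reports for "task-A" can be pulled out of the shared collector and ordered by their parent links, without inferring anything from timing or packet contents. That is the property the snooping-based approaches lack.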

I think the main value of this work is as a window into the operation of complicated distributed systems, which are otherwise appallingly opaque. This system may be only the first step toward greater visibility: for a sufficiently large system, the resulting causal graphs may be very large, so one also needs a mechanism for interpreting them usefully.
