tcptrace problems with AVL vs. hash

From: Russ Fink (russfink@hotmail.com)
Date: 08/12/04


From: "Russ Fink" <russfink@hotmail.com>
Subject: tcptrace problems with AVL vs. hash
Date: Thu, 12 Aug 2004 09:20:35 -0400
Message-ID: <BAY1-F8JkYvwOwTURGt000352e4@hotmail.com>

Hello,

I am a new user to tcptrace, version 6.6.1, and ran it successfully on
several smaller data sets. However, when I cranked this up on a huge data
file running it overnight, top(1) reported 99%CPU, over 250M of resident
memory, and apparently ceased making progress and was hung the next morning.

The data set was taken from MIT's Lincoln Labs set of network intrusion
data, one of the "clean" data sets available at:
http://www.ll.mit.edu/IST/ideval/data/data_index.html . The specific data
set was ./1999/testing/week5/thursday/outside.tcpdump.gz (uncompressed). It
has roughly 81,000 connections total by my best estimate.

On further investigation of the code, I ran across a comment in tcptrace.h
that a hash table approach for storing tcp connections was replaced in favor
of an AVL tree approach. The specific reason for doing so was that using a
hash "might lead to a worst case scenario when many connections hash to the
same hash table entry. In such a case, searching for the connections
degrades to searching a linked list with a worst case complexity of O(number
of connections in list)."

Sorry to say, this logic holds no water. Any reasonable hash function will
ensure that the number of collisions is minimal, giving an overall O(1)
performance. When maintaining tcp connections, the best hash approach is to
hash the address quad of the connection (source/dest IP/port). Balanced AVL
trees on the other hand, and any recursive algorithm, tend to suffer greatly
in terms of stack size and efficiency. There is much wasted time trying to
reorder the tree with each and every new connection that comes in. Although
it is O(ln(n)), this is greatly worse than O(1) on any large data set.

As further merit to the hash function, the libnids library
(http://libnids.sourceforge.net/) essentially solves the same problem of
maintaining tcp connections in a table, but implements a hash approach to
connection lookup and the average number of collisions turns out to be very
small.

As for the problem of getting hung overnight, I am not saying the AVL tree
is the cause of the problem. I have not investigated the cause at all,
beyond running top. What happened was that initial grepping of code turned
up this discussion of AVL trees, and a comment with some flawed logic, which
led to my post.

I am going to try an older version of tcptrace to see if it can successfully
run the 81,000 data set in a reasonable amount of time.

Other than this, the interface and the design of the program is well done.

Russ Fink

_________________________________________________________________
On the road to retirement? Check out MSN Life Events for advice on how to
get there! http://lifeevents.msn.com/category.aspx?cid=Retirement

----------------------------------------------------------------------------
To unsubscribe, send a message with body containing "unsubscribe tcptrace" to
majordomo@tcptrace.org.



This archive was generated by hypermail 2.1.7 : 08/12/04 EDT