tcptrace-bugs Re: patch included: big performance improvement for long-lived conns w/ loss PART 2

From: Mike Lowell (m.lowell@f5.com)
Date: 09/04/05


Message-ID: <431A9BFF.10706@f5.com>
Date: Sun, 04 Sep 2005 00:02:23 -0700
From: Mike Lowell <m.lowell@f5.com>
Subject: tcptrace-bugs Re: patch included: big performance improvement for long-lived conns w/ loss PART 2




After patching IsRTO(), I profiled tcptrace again on the same tcpdump file and found that addseg() was the next-biggest offender, for similar reasons (it walks a very long linked list starting from the head, when the most likely match is near the tail).

gprof output with only the IsRTO() patch applied:

slak2:~$ gprof tcptrace-6.6.7/tcptrace
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls   s/call   s/call  name   
 65.98     17.40    17.40  2346707     0.00     0.00  addseg
 20.63     22.84     5.44  2338805     0.00     0.00  IsRTO
  3.79     23.84     1.00  2870065     0.00     0.00  dotrace
  1.97     24.36     0.52                             __fixunsdfdi
  1.59     24.78     0.42  2870065     0.00     0.00  FindTTP
  1.54     25.18     0.41  8612471     0.00     0.00  MemCpy
[ ... ]



Unfortunately, the loop in addseg() cannot be easily inverted the way IsRTO()'s was, because that would break the logic that detects out-of-order segments and runs rtt_retrans().  The next-best thing is to keep the existing forward-walking logic but, instead of starting at the head of the list, "prime the pump" by walking the list backwards from the tail until we find the first segment the main addseg() algorithm should start from.
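
For illustration, here is the core of the change in isolation (the actual patch below keeps the original surrounding context and comments):

    /* prime the pump: walk backwards from the tail until we pass the
     * last segment whose first byte is still >= thisseg_firstbyte */
    for (pseg = pquad->seglist_tail;
         pseg != NULL && thisseg_firstbyte <= pseg->seq_firstbyte;
         pseg = pseg->prev)
        ;  /* empty body, the loop only positions pseg */

    /* empty list, or the new segment goes before everything: fall
     * back to the head and let the normal forward walk handle it */
    if (pseg == NULL)
        pseg = pquad->seglist_head;

    /* ...then continue with the existing forward-walking logic from pseg */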

Patch included below.

In the interest of completeness, I created a standalone version of addseg() and its supporting functions, along with a set of test segments designed to exercise every corner case in the addseg() algorithm.  I then made a copy of addseg() called "new_addseg()" that includes my changes.  The tests cover both correctness and performance: after running through about 30 functionally stressful segments (i.e. overlapping and not overlapping in every relevant way), I populate each function's segment list with 30000 segments and measure the performance.  In all cases I compare the resulting lists for equality (to make sure I didn't break any part of the algorithm).  Also, to be extra careful that I didn't change the algorithm's behavior, I diff'd the "-l" output of a stock tcptrace-6.6.7 against my patched version on a 10Mbyte file that I know exercises these functions.
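
The equality check is just a walk down both lists comparing the byte ranges; roughly this (the helper below is from my standalone harness, not part of tcptrace itself):

    /* sketch from my standalone harness (not part of tcptrace): two
     * segment lists are considered equal if they describe the same
     * byte ranges in the same order */
    static Bool seglists_equal(segment *a, segment *b)
    {
        while (a != NULL && b != NULL) {
            if (a->seq_firstbyte != b->seq_firstbyte ||
                a->seq_lastbyte  != b->seq_lastbyte)
                return FALSE;
            a = a->next;
            b = b->next;
        }
        return (a == NULL && b == NULL);  /* both lists must end together */
    }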

The summary of the performance results is that new_addseg() is as fast or faster in my theoretical tests (with my standalone application), and the same was true when the patch was applied to the real/full tcptrace application.  In both cases (tcptrace or my standalone program), the entire dataset was stored in memory on a Linux tmpfs filesystem (no swap is configured; this machine has 4GB of RAM).  Further, the program being tested (tcptrace or my standalone program) was modified to mlockall() all memory and to call sched_setscheduler() to set real-time priority, ensuring that no other factors could affect performance (it also makes the test results very repeatable).
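
For reference, that setup amounts to the usual two calls at the top of main(); this is the gist of it (error handling trimmed to perror(), and the exact priority value doesn't matter as long as it's a real-time one):

    #include <stdio.h>
    #include <sched.h>
    #include <sys/mman.h>

    struct sched_param sp;

    /* lock all current and future pages into RAM so paging can't
     * perturb the timings */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
        perror("mlockall");

    /* switch to real-time FIFO scheduling so nothing else on the
     * box can preempt the test */
    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0)
        perror("sched_setscheduler");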

Here is some ugly output showing the results from my standalone program:

slak2:~/save/tcptrace-6.6.7# gcc -o standalone_test standalone_test.c && ./standalone_test
populating segment list with original addseg
populating segment list with new new_addseg
starting performance test
no overlap
addseg: 0.005144
new_addseg: 0.005141
all overlap of last test
addseg: 4.827112
new_addseg: 4.274208
last byte overlaps next segment by 1
addseg: 18.154892
new_addseg: 0.005650
make every 100th packet look like a rexmit from 2pkt ago
addseg: 0.005253
new_addseg: 0.005218


More important than these theoretical tests: on my real 302Mbyte tcpdump file (described below), the patch cuts the analysis time roughly in half, from 40.150298 seconds to 20.342623 seconds (measured by adding a gettimeofday() call at the start and end of tcptrace.c:main()).  Trying tcptrace on a few other random files (not as extreme as the 302Mbyte example file I've been using) showed that performance is essentially unchanged, or maybe slightly faster, with new_addseg().
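
The measurement is nothing fancy, just a gettimeofday() pair wrapped around main()'s body, roughly:

    #include <stdio.h>
    #include <sys/time.h>

    struct timeval tv_start, tv_end;

    gettimeofday(&tv_start, NULL);       /* first thing in main() */
    /* ... all of tcptrace's normal processing ... */
    gettimeofday(&tv_end, NULL);         /* just before main() returns */

    fprintf(stderr, "elapsed: %f seconds\n",
            (tv_end.tv_sec  - tv_start.tv_sec) +
            (tv_end.tv_usec - tv_start.tv_usec) / 1000000.0);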

In summary, as with the last patch to IsRTO(), this patch shouldn't have any noticeable impact on performance in the standard cases (maybe a slightly positive one), but it should have a dramatic impact on the corner case where a long-lived connection has lots of retransmits.


Lastly, IsRTO() is again the biggest offender in the corner case I'm working on, and at this point there isn't much more that can be done to optimize the existing algorithm.  Even a carefully aligned, hand-optimized assembly version of IsRTO() and the data it uses is unlikely to be any faster than what gcc generates, so the only real way to make it faster would be to change the algorithm, perhaps by using a hash table as was done with the connection lookup code.  After the most recent changes I'm quite satisfied with the performance, so I guess reworking the segment handling code into a full hash table will have to wait. ;)
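
For anybody who does want to pick that up later, the idea would be something along these lines (a very rough sketch, nothing like this exists in tcptrace today; I'm assuming the seqnum type tcptrace uses for sequence numbers):

    /* Hash each segment by the sequence number of the byte that
     * follows it -- that's exactly the value IsRTO() is given -- so
     * the lookup becomes a short bucket chain instead of a walk of
     * the whole seglist. */
    #define SEG_HASH_SIZE 1021              /* arbitrary prime */

    typedef struct seg_bucket {
        segment *pseg;
        struct seg_bucket *next;
    } seg_bucket;

    static seg_bucket *seg_hash[SEG_HASH_SIZE];

    static unsigned SegHash(seqnum s)
    {
        return (unsigned) (s % SEG_HASH_SIZE);
    }

    /* addseg() would insert each new segment into
     * seg_hash[SegHash(pseg->seq_lastbyte + 1)], and IsRTO() would
     * only walk that one bucket chain looking for a match on s. */

In practice the table would probably have to live inside each quadrant next to the seglist, but that's a project for another day.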

Thanks for the great software!



gprof output with both patches applied:


slak2:~$ gprof tcptrace-6.6.7/tcptrace
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total          
 time   seconds   seconds    calls   s/call   s/call  name   

 70.03      9.02     9.02  2345235     0.00     0.00  IsRTO
  6.75      9.89     0.87  2870065     0.00     0.00  dotrace
  4.81     10.51     0.62                             __fixunsdfdi
  4.58     11.10     0.59  8612447     0.00     0.00  MemCpy
  2.41     11.41     0.31 14343074     0.00     0.00  elapsed
  1.94     11.66     0.25  2870065     0.00     0.00  FindTTP
[ ... ]



############## PATCH START ##############
diff -ur tcptrace-6.6.7-orig/rexmit.c tcptrace-6.6.7/rexmit.c
--- tcptrace-6.6.7-orig/rexmit.c        2003-11-19 06:38:05.000000000 -0800
+++ tcptrace-6.6.7/rexmit.c     2005-09-03 23:30:25.000000000 -0700
@@ -173,13 +173,28 @@
     int rexlen = 0;
     Bool split = FALSE;
 
-    /* check each segment in the segment list */
-    pseg = pquad->seglist_head;
-
     /* (optimize expected case, it just goes at the end) */
     if (pquad->seglist_tail &&
        (thisseg_firstbyte > pquad->seglist_tail->seq_lastbyte))
-       pseg = NULL;
+       goto ADDSEG_END;
+
+       /* Optimize the most likely case of the segment being near the end *
+        * Can't just perform the entire function backwards because of the *
+        * rtt_retrans() and *pout_order detection logic, but instead we   *
+        * can start the normal forward-looking logic near the right place *
+        * by searching for a good starting place from the tail backwards  */
+
+       for (pseg = pquad->seglist_tail;
+               pseg != NULL &&
+               thisseg_firstbyte <= pseg->seq_firstbyte;
+               pseg = pseg->prev);
+
+       /* seglist is either empty or thisseg_ goes at the beginning *
+        * my profiling suggests it's not worthwhile to optimize for *
+        * this as a special-case                                    */
+       if (pseg == NULL)
+               pseg = pquad->seglist_head;
+
     for (; pseg != NULL; pseg = pseg->next) {
        if (thisseg_firstbyte > pseg->seq_lastbyte) {
            /* goes beyond this one */
@@ -239,6 +254,7 @@
            len = thisseg_lastbyte - thisseg_firstbyte + 1;
        }
     }
+ADDSEG_END:
 
 
     /* if we got to the end, then it doesn't go BEFORE anybody, */
@@ -780,7 +796,7 @@
   quadrant *pquad = whichquad(ptcb->ss,s);
   segment *pseg;
 
-  for (pseg = pquad->seglist_head; pseg != NULL; pseg = pseg->next) {
+  for (pseg = pquad->seglist_tail; pseg != NULL; pseg = pseg->prev) {
     if (s == (pseg->seq_lastbyte+1)) {
       if (pseg->acked < 4) return TRUE;
       else return FALSE;
############## PATCH END ##############


Mike Lowell

----- Original Message -----
From: Mike Lowell <m.lowell@f5.com>
To: tcptrace-bugs@tcptrace.org
Subject: patch included: big performance improvement for long-lived conns w/ loss


Hi,

While trying to analyze a 302Mbyte tcpdump file with tcptrace, I found that tcptrace was taking an unusually long time (about 7 hours).  On closer inspection, I realized that the tcpdump file contained only a few connections, and that each connection was VERY long (hundreds of thousands of packets).  I profiled tcptrace, and to my surprise the problem was immediately evident:

                                    
slak2:~$ gprof tcptrace-6.6.7/tcptrace
Flat profile:
                                 
Each sample counts as 0.01 seconds.              
  %   cumulative   self              self     total     
 time   seconds   seconds    calls   s/call   s/call  name
 98.11      3.63     3.63    25945     0.00     0.00  IsRTO
  0.81      3.66     0.03    32000     0.00     0.00  dotrace  
  0.54      3.68     0.02        1     0.02     3.70  ProcessFile
  0.27      3.69     0.01    32000     0.00     0.00  FindTTP
  0.27      3.70     0.01    32000     0.00     0.00  callback
[ ... ]

 
As you can see, IsRTO() was murdering the performance.  For my initial analysis of the 302Mbyte file, retransmit information was not necessary, so I commented out the code which calls IsRTO() and realized a performance gain of several orders of magnitude -- my 7 hour analysis took only 54 seconds. ;)
 
The problem: IsRTO() is walking a linked list of every single packet in my obscenely long TCP connections after a single retransmit is found.  Given that the first retransmit occurs at about packet #100 (of about 200000), the processing required becomes essentially O(n^2).

The fix: modify IsRTO() to walk the list from the tail end backwards.  On long TCP connections it's substantially more likely that the packet you're looking for is near the end of the segment list rather than the beginning.  In fact, in all cases it should be as fast or faster to start from the tail end of the segment list.  With my proposed change, the analysis takes 1 minute 15 seconds to complete.

There are obviously more efficient solutions that would change the higher-level algorithms of tcptrace, but this patch is only 1 line (i.e. very low risk), and given the huge performance benefit for this fairly obscure case, it seems good enough for now.


The patch is attached, and is inline:

########## CUT HERE START ##########
diff -ur tcptrace-6.6.7-orig/rexmit.c tcptrace-6.6.7/rexmit.c
--- tcptrace-6.6.7-orig/rexmit.c        2003-11-19 06:38:05.000000000 -0800
+++ tcptrace-6.6.7/rexmit.c     2005-09-01 18:20:57.000000000 -0700
@@ -780,7 +780,7 @@
   quadrant *pquad = whichquad(ptcb->ss,s);
   segment *pseg;
 
-  for (pseg = pquad->seglist_head; pseg != NULL; pseg = pseg->next) {
+  for (pseg = pquad->seglist_tail; pseg != NULL; pseg = pseg->prev) {
     if (s == (pseg->seq_lastbyte+1)) {
       if (pseg->acked < 4) return TRUE;
       else return FALSE;
########## CUT HERE END ##########


If this patch is okay, can it be included in a later stable version of tcptrace?

Thanks!



