tcptrace-bugs Re: patch included: big performance improvement for long-lived conns w/ loss PART 2
Message-ID: <431A9BFF.10706@f5.com>
Date: Sun, 04 Sep 2005 00:02:23 -0700
From: Mike Lowell <m.lowell@f5.com>
Subject: tcptrace-bugs Re: patch included: big performance improvement for long-lived conns w/ loss PART 2
After patching IsRTO(), I profiled tcptrace again on the same tcpdump
file and found that addseg() was the next biggest offender, for similar
reasons (walking a very long linked list starting from the head when
the most likely match will be near the tail).
gprof output with only the IsRTO() patch applied:
slak2:~$ gprof tcptrace-6.6.7/tcptrace
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 65.98     17.40    17.40  2346707     0.00     0.00  addseg
 20.63     22.84     5.44  2338805     0.00     0.00  IsRTO
  3.79     23.84     1.00  2870065     0.00     0.00  dotrace
  1.97     24.36     0.52                             __fixunsdfdi
  1.59     24.78     0.42  2870065     0.00     0.00  FindTTP
  1.54     25.18     0.41  8612471     0.00     0.00  MemCpy
[ ... ]
Unfortunately, the loop in addseg() cannot simply be inverted the way
IsRTO() was, because that would break the logic that detects out-of-order
segments and calls rtt_retrans(). The next-best thing is to keep the
existing forward-walking logic, but instead of starting at the head of the
list, first "prime the pump" by walking the list backwards until we find
the first segment that should be run through the main addseg() algorithm.
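For context, addseg() operates on a per-quadrant doubly-linked list of
segments, which is why starting the scan at the tail is cheap. A simplified
sketch of the structures involved (only the fields the patch touches are
shown, and the field types are my guesses; the real declarations are in
rexmit.h):

/* Simplified sketch -- NOT the actual tcptrace declarations (see rexmit.h) */
typedef struct segment {
    unsigned long   seq_firstbyte;  /* sequence number of first byte in segment */
    unsigned long   seq_lastbyte;   /* sequence number of last byte in segment  */
    unsigned char   acked;          /* how many times this segment has been ACKed */
    struct segment *next;           /* toward the tail (newer segments) */
    struct segment *prev;           /* toward the head (older segments) */
} segment;

typedef struct quadrant {
    segment *seglist_head;  /* oldest segment in this quadrant */
    segment *seglist_tail;  /* newest segment -- where new data usually lands */
    /* ... other bookkeeping fields ... */
} quadrant;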
Patch included below.
In the interest of completeness, I created a standalone version of
addseg() and its supporting functions, along with a set of test segments
that exercise every corner case in the addseg() algorithm. I then made a
copy of addseg() called "new_addseg()" that includes my changes. The tests
cover both functionality and performance: after running through about 30
functionally stressful segments (i.e. overlapping and not overlapping in
every relevant way), I populate the segment list of each function with
30000 segments and measure the performance. In all cases I compare the
resulting lists for equality (to make sure I didn't break any part of the
algorithm).
Also, to be extra careful that I hadn't changed the algorithm, I ran a
stock tcptrace-6.6.7 and my patched version with "-l" on a 10Mbyte file
that I know exercises these functions, and diff'd the two outputs.
The summary of the performance results is that new_addseg() is as fast or
faster in my synthetic tests (with the standalone application), and the
same holds when the change is applied to the real/full tcptrace
application. In both cases (tcptrace or my standalone program) the entire
dataset was stored in memory on a Linux tmpfs filesystem (no swap is
configured; this machine has 4GB of RAM). Further, the program under test
(tcptrace or my standalone program) was modified to mlockall() all of its
memory and to call sched_setscheduler() to get real-time priority, so that
no other factors could affect performance (it also makes the test results
very repeatable).
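For anyone trying to reproduce the setup, the locking/priority code amounts
to something like the following (a rough sketch of what standalone_test.c
does, not the file itself; the function name and the priority value are
mine):

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

static void pin_for_benchmark(void)
{
    struct sched_param sp;

    /* Lock all current and future pages into RAM so paging can't skew timings */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        perror("mlockall");
        exit(1);
    }

    /* Run with real-time (SCHED_FIFO) priority so nothing else preempts the test */
    sp.sched_priority = 1;
    if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
        perror("sched_setscheduler");
        exit(1);
    }
}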
Here is some ugly output with the results from my standalone program:
slak2:~/save/tcptrace-6.6.7# gcc -o standalone_test standalone_test.c && ./standalone_test
populating segment list with original addseg
populating segment list with new new_addseg
starting performance test
no overlap
addseg: 0.005144
new_addseg: 0.005141
all overlap of last test
addseg: 4.827112
new_addseg: 4.274208
last byte overlaps next segment by 1
addseg: 18.154892
new_addseg: 0.005650
make every 100th packet look like a rexmit from 2pkt ago
addseg: 0.005253
new_addseg: 0.005218
More important than these synthetic tests: on my real 302Mbyte tcpdump
file (described below), the patch cuts the analysis time roughly in half,
from 40.150298 to 20.342623 seconds (measured by adding a gettimeofday()
call at the start and end of tcptrace.c:main()). Trying tcptrace on a few
other random files (not as extreme as the 302Mbyte example file I've been
using) showed that performance is essentially unchanged, or perhaps
slightly faster, with new_addseg().
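The timing hook itself is trivial; roughly (my paraphrase, not the exact
lines I added to tcptrace.c):

#include <stdio.h>
#include <sys/time.h>

/* Wall-clock time in seconds; called at the start and end of main() */
static double now_seconds(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

/* In tcptrace.c:main():
 *     double t0 = now_seconds();
 *     ... existing processing ...
 *     fprintf(stderr, "elapsed: %f seconds\n", now_seconds() - t0);
 */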
In summary, as with the last patch to IsRTO(), this patch shouldn't have
any noticeable impact on performance in the standard cases (maybe a slight
improvement), but it should have a dramatic impact on the corner case
where a long-lived connection has lots of retransmits.
Lastly, IsRTO() is again the biggest offender in the corner case I'm
working on, and at this point there isn't much more that can be done to
optimize the existing algorithm. Even a carefully hand-optimized assembly
version of IsRTO() and the data it uses is unlikely to be any faster than
what gcc generates, so the only real way to make it faster would be to
change the algorithm, perhaps by using a hash table as was done with the
connection lookup code. After the most recent changes I'm quite satisfied
with the performance, so I guess reworking the segment handling code into
a full hash table will have to wait. ;)
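For the record, the kind of hash-table approach I have in mind would look
roughly like this (purely an illustration of the idea -- the names, bucket
count, and structure are invented, and none of this is in the patch below):

#define SEG_HASH_BUCKETS 4096

struct segment;                     /* the real type lives in rexmit.h */

/* Hypothetical index keyed on the sequence number that immediately follows
 * a segment (seq_lastbyte + 1), so IsRTO() could find its candidate segment
 * in O(1) instead of walking the segment list */
typedef struct seg_hash_entry {
    unsigned long           next_seq;   /* pseg->seq_lastbyte + 1 */
    struct segment         *pseg;       /* the segment itself */
    struct seg_hash_entry  *chain;      /* collision chain */
} seg_hash_entry;

static seg_hash_entry *seg_hash[SEG_HASH_BUCKETS];

static struct segment *seg_lookup(unsigned long s)
{
    seg_hash_entry *e;

    for (e = seg_hash[s % SEG_HASH_BUCKETS]; e != NULL; e = e->chain)
        if (e->next_seq == s)
            return e->pseg;
    return NULL;
}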
Thanks for the great software!
gprof output with both patches applied:
slak2:~$ gprof tcptrace-6.6.7/tcptrace
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 70.03      9.02     9.02   2345235     0.00     0.00  IsRTO
  6.75      9.89     0.87   2870065     0.00     0.00  dotrace
  4.81     10.51     0.62                              __fixunsdfdi
  4.58     11.10     0.59   8612447     0.00     0.00  MemCpy
  2.41     11.41     0.31  14343074     0.00     0.00  elapsed
  1.94     11.66     0.25   2870065     0.00     0.00  FindTTP
[ ... ]
############## PATCH START ##############
diff -ur tcptrace-6.6.7-orig/rexmit.c tcptrace-6.6.7/rexmit.c
--- tcptrace-6.6.7-orig/rexmit.c 2003-11-19 06:38:05.000000000 -0800
+++ tcptrace-6.6.7/rexmit.c 2005-09-03 23:30:25.000000000 -0700
@@ -173,13 +173,28 @@
int rexlen = 0;
Bool split = FALSE;
- /* check each segment in the segment list */
- pseg = pquad->seglist_head;
-
/* (optimize expected case, it just goes at the end) */
if (pquad->seglist_tail &&
(thisseg_firstbyte > pquad->seglist_tail->seq_lastbyte))
- pseg = NULL;
+ goto ADDSEG_END;
+
+ /* Optimize the most likely case of the segment being near the end *
+ * Can't just perform the entire function backwards because of the *
+ * rtt_retrans() and *pout_order detection logic, but instead we *
+ * can start the normal forward-looking logic near the right place *
+ * by searching for a good starting place from the tail backwards */
+
+ for (pseg = pquad->seglist_tail;
+ pseg != NULL &&
+ thisseg_firstbyte <= pseg->seq_firstbyte;
+ pseg = pseg->prev);
+
+ /* seglist is either empty or thisseg_ goes at the beginning *
+ * my profiling suggests it's not worthwhile to optimize for *
+ * this as a special-case */
+ if (pseg == NULL)
+ pseg = pquad->seglist_head;
+
for (; pseg != NULL; pseg = pseg->next) {
if (thisseg_firstbyte > pseg->seq_lastbyte) {
/* goes beyond this one */
@@ -239,6 +254,7 @@
len = thisseg_lastbyte - thisseg_firstbyte + 1;
}
}
+ADDSEG_END:
/* if we got to the end, then it doesn't go BEFORE anybody, */
@@ -780,7 +796,7 @@
quadrant *pquad = whichquad(ptcb->ss,s);
segment *pseg;
- for (pseg = pquad->seglist_head; pseg != NULL; pseg = pseg->next) {
+ for (pseg = pquad->seglist_tail; pseg != NULL; pseg = pseg->prev) {
if (s == (pseg->seq_lastbyte+1)) {
if (pseg->acked < 4) return TRUE;
else return FALSE;
############## PATCH END ##############
Mike Lowell
----- Original Message -----
Subject: patch included: big performance improvement for long-lived conns w/ loss
Hi,
While trying to analyze a 302Mbyte tcpdump file with tcptrace, I found
that tcptrace was taking an unusually long time (about 7 hours). After
a closer inspection, I realized that the tcpdump file contained only a
few connections, and that each connection was VERY long (hundreds of
thousands of packets). I profiled tcptrace, and to my surprise the
problem was immediately evident:
slak2:~$ gprof tcptrace-6.6.7/tcptrace
Flat profile:
Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 98.11      3.63     3.63    25945     0.00     0.00  IsRTO
  0.81      3.66     0.03    32000     0.00     0.00  dotrace
  0.54      3.68     0.02        1     0.02     3.70  ProcessFile
  0.27      3.69     0.01    32000     0.00     0.00  FindTTP
  0.27      3.70     0.01    32000     0.00     0.00  callback
[ ... ]
As you can see, IsRTO() was murdering the performance. For my initial
analysis of the 302Mbyte file, retransmit information was not necessary,
so I commented out the code that calls IsRTO() and realized a performance
gain of several orders of magnitude -- my 7-hour analysis took only 54
seconds. ;)
The problem: IsRTO() is walking a linked list of every single packet in
my obscenely long TCP connections after a single retransmit is found.
Given that the first retransmit occurs at about packet #100 (of about
200000), the processing required becomes essentially O(n^2).
The fix: modify IsRTO() to walk the list from the tail end backwards. On
long TCP connections it's substantially more likely that the packet you're
looking for is near the end of the segment list rather than the beginning.
In fact, in all cases it should be as fast or faster to start from the
tail end of the segment list. With my proposed change, the same analysis
completes in 1 minute 15 seconds.
There are obviously more efficient solutions that would involve changing
tcptrace's higher-level algorithms, but this patch is only 1 line (i.e.
very low risk), and given the huge performance benefit for this fairly
obscure case, it seems good enough for now.
The patch is attached, and also included inline:
########## CUT HERE START ##########
diff -ur tcptrace-6.6.7-orig/rexmit.c tcptrace-6.6.7/rexmit.c
--- tcptrace-6.6.7-orig/rexmit.c 2003-11-19 06:38:05.000000000 -0800
+++ tcptrace-6.6.7/rexmit.c 2005-09-01 18:20:57.000000000 -0700
@@ -780,7 +780,7 @@
quadrant *pquad = whichquad(ptcb->ss,s);
segment *pseg;
- for (pseg = pquad->seglist_head; pseg != NULL; pseg = pseg->next) {
+ for (pseg = pquad->seglist_tail; pseg != NULL; pseg = pseg->prev) {
if (s == (pseg->seq_lastbyte+1)) {
if (pseg->acked < 4) return TRUE;
else return FALSE;
########## CUT HERE END ##########
If this patch is okay, can it be included in a later stable version of
tcptrace?
Thanks!