[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new WRR sched, updated TBF and "real" ingres implementation

On Tue, 8 Aug 2000 devik@cdi.cz wrote:

> Hello Jamal, I merged both our "threads" into this one
> mail, it will be more readable I hope ;-)

thanks ;->

> Yep this is the point ;) Policer can do nothing with this. But SFQ
> queue can. When there are several paralel conversations, SFQ will
> try to queue each of them into independent queue. 

pending no hash collisions. I guess if you have filters which are
"narrow" SFQ will narrow it even further; so Ok -- lets assume this for
sake of discussion.

> yes it is. But better than nothing. I've hacked it to display actual
> and maximal number of conversations and in my gateway it is max 8.
> SFQ hash has 1024 buckets so here is low probability of colision
> (8/1024).
> Also perturb parameter helps even more.
> I think for BE traffic SFQ is quite good solution.

Yes, but delay is only one part of the whole overall TCP throughput.
Mainly because TCP is clocked by ACKs. And i am not sure how much SFQ
in your one hop can really add to the overall RTT. probably not that
[Recall Mathis et al equation: TCP b/width = C*MSS/(RTT*sqrt(p)) 
where p is the drop probability and C is a constant which changes under
different conditions]

> > I think if you want to fix this unfairness, you have to write a stateful
> > classifier (urgh!); packeteer for example play games with adjusting tcp
> > window sizes based on flow state they maintain such as RTT etc.
> > It is dirty but works.
> huh .. you are right it is dirty .. what is packeteer by the way ?

oh-oh ;-> i hope i dont give you ideas;-> but look at www.packeteer.com
of course these guys dont use CBQ (their competitor Xedia does so look
through the marketing propaganda carefully).

> > > It doesn't depend on HW rate - I tested is also on loopback and it
> > > works.
> > 
> > Loopback is a bad example. It is a high speed interface ;->
> > try to test across two machines with a null serial connection (something
> > like 28-56Kbps)
> I wrote "also". I tried it on my 56k link and loopback. The point
> is that on 56k original CBQ works well too. But on loopback (or
> another iff with variable rate - like wireless eth) it doesn't.
> TBF+CBQ works well on both.

Sorry, yes TBF as the estimator would work fine. The speed i described is
not a factor. 
Did you limit the peak rate as well on TBF? I think it would work nicely
if you did. Infact i think peak rate control would help a lot.

> > > Currently I'm thinking to change SFQ and add new parameter to it:
> > > maximal depth of each flow. It should have one important result:
> > > ability to handle several subflows along with low delay.
> > 
> > Read the comments Alexey makes about maximal 128 packet queues and fitting
> > in 4K pages.
> Yes I spent several days with his code. SFQ has not 128 pks
> queueS instead whole qdisc can queue up to 128 packets. Each
> packet is then linked into one of 1024 hash buckets.
> So that SFQ can have one queue with 128 packets or 128 active 
> queues (of 1024) with 1 packet each.
> When I limit each queue's length to say 2, I can attach them 
> to CBQ leaf and ensure TCP conversation fairness along with 
> relatively small delay. Note that you probably can't do fair 
> (between conversations) queuing using
> dropping only - you need some metric which conversation should drop now.
> SFQ uses as metric it's queue len - when conversation's queue exceed
> certain limit (which is not tested currently) it will drop subsequent
> packets into that class (queue).
> By this way you can slowdown all conversations equaly. Using policers
> only you can only decide whether drop (or mark) CURRENT packet only.
> With weird arrival pattern, each drop can affect the same conversation
> thus policing will be done by denying arbitrary flow ..
> Unfortunately on my gateway it is true. When I installed SFQ instead of
> FIFO our customers noticed better thoughtput without starvations.

very smart. Totaly stateless. So you are enforcing the BW*RTT product and
at the same time by keeping the queue small you actually enforce a certain
drop probability. Of course, TCP is smart and will adjust itself to fit
this so it would work. Good stuff.
I think this would work very well if you can enforce each flow into a
queue i.e no hash collisions. but as the number of flows grow, you'll run
into problems. In your case probably not a very big deal since you dont
have many customers. You should also probably hack the SFQ hash function
to reduce the chances of collions (TCP ports?)

> > This is nice to say but hard to do;-> How do you know what HT1 is?
> Why ? I don't need to know HT1. I know HT1+WT1. 
> And HT1+WT1-PS1/desired_rate = idle. And idle is what I need
> to know. Right ?

How do you know what HT1+WT1 is? I am curious.

> This is exactly what you see in CBQ code. It uses right edges, ok,
> but see snippet from Alexander's code:
> idle = PSCHED_TDIFF(q->now, cl->last);
> idle -= L2T(cl, len);
> avgidle += idle - (avgidle>>cl->ewma_log);
> q->now is right edge of just finished packet, cl->last is the
> same for prev packet. So that you can see that their diff is
> equal to HT1+WT1. 

I think this is where you flaw is. Are you expecting that the dequeue is a
good enough event to be able to figure out this? 

> Then classe's len/rate is substracted and
> average computed. On several next lines you can see:
> idle -= L2T(&q->link, len);
> idle += L2T(cl, len);
> PSCHED_TADD2(q->now, idle, cl->undertime);
> well, here device's HW rate is used (q->link) but only to
> correct cl->undertime ! 

The reason you need to correct is because you dont know when
the previous packet from your class has left the network.
The CBQ estimator "self adjusts" its estimate of the b/width this way.
i.e this is how it shapes (introduces a cbr like data stream).

> It's because undertime was computed as time when avgidle 
> will be zero - BUT avgidle is updated always at right edge
> so that we get right edge's undertime too. But we need
> to know when class BEGINS to be undertime. Thus we substract
> HT1 to get left edge's undertime.
> Is it clear ?

I am afraid not. If this was a token bucket I'll agree with you: you dont
need to know the h/ware rate there. Your job is burst control with long 
term averages approaching alloc rate. The CBQ estimator is trying to do
the same thing as TB as well as shape (by continously adjusting its
estimate of what the allocated b/width is). This is by design. In that
adjustment algorithm, it needs this hardware rate to add isochrinity
as shown in appendix 1 of the doc.

> So you can see that when we will use left edges we will
> not need to use phys bw because avgidle will be computed
> in the same way as now (and now we don't need it too) and
> undertime value will already result as left edge's time.
> IMHO using right edge is heritage from first Floyd's
> implementation and probably nobody tried to change it.
> But for linux it would be more suitable I think.
> Do you still see any inconsistency in my theory ?

Lets solve this problem in a simple way ;-> Tell me how you intend on
finding what HT1+WT1 is. Maybe there is something you are onto that i dont
see yet. Note that:
The only true event notifier is when a packet (of your class) has left the
NIC. Dequeue doesnt tell you that a packet has left, it just asks you for
a packet. It is up to you to say yay or nay. Trying to write callbacks to
do this kind of notification might not be as trivial or practical at all.
With token bucket the design is to do pre-estimation before-hand; build
your parameters (protect against bursts by having peak rates etc) and the
algorithm doesnt try to adjust any of its computation. So you dont need
the h/ware rate; perhaps, the h/ware rate is not as important as i make it
out to be (for example in high speed networks it is insignificant).
Just tell me how you will find out what HT1+WT1 is and we might be able to
settle this.

> > total_packet_transm_time has 2 time factors which are hidden: These are
> > the interpacket gap and the h/ware transmission time. In order to
> > shape/leak at a certain allocated rate you need to account for
> > both.
> yes as I said we use it both but always as sum which we know.
> It is the same fot TBF, it shapes very well and doesn't use
> both factors separately, only as sum.
> In fact you need to assure that over arbitrary time interval
> you transmit at most T*rate bits. But it doesn't depend if
> you transmit 1kb in 1ms and 99ms gap or transmit it in 50ms
> with 50ms gap.
> It is still rate 10kbit/sec. Right ?

The only thing token bucket guarantees you is that you bound the
burstiness (burst + t*R) where t is the measurement period and R is the
allocated rate. It _does not_ guarantee you nice isochronous traffic.

Maybe this is where our misunderstanding is: In order to do shaping (as in
nice interpacket gaps approaching a certain rate), you need to know the
'service' time.
Shaping is not exactly the strength of token bucket (unless you limit the
peak rates which guarantees you a short term burstiness bound or you throw
in a leaky bucket as a controller). CBQ is meant for shaping as well as
burst control (look at the max/minburst paramaters).
This is the difference between the two. 

> > > I have some results but not in good form. But is seems
> > > that each class which is underlimit delays packet at most
> > > 1ms (486DX2/66), bounded classes are throttled exactly
> > > to specified ratio and delay with 5kbps flow and 500B packets
> > > is 45ms avg (using DROP overlimit action).
> > 
> > What are the similar numbers with the current estimator? 1 ms for
> > underlimit sounds like too much.
> :-) this parameter is the same, it's due to my slow computer
> probably. I'm doing a lot of debug printk in code.

ah, that would explain it.

> > Good. Like i said in my earlier mail; just make sure there is a selectable
> > alterntive for the old estimator.
> I was thinking to implement is as another qdisc (say TCBQ). 
> It's because a lot of code was changed in fundamental way
> and currently tc tool will need a much less parameters to
> setup TCBQ. In fact, qdisc needs no parameter (really).
> The classes each needs only rate. Burst, offtime, weight,
> overlimit strategy (which is missing in current tc), 
> priority and lower priority are optional.
> bandwidth, avpkt, ewma, minburst and maxburst are not
> needed anymore.
> Because almost everyone is familiar with TBF parameters
> it should be much simpler to use it now.

Ok. I think if you use both peak and allocated rate, you should be fine.

> I'm occupied too, but it was always one of my dreams to be one
> of linux developers ;-)

Looking for a job? ;-> (actually i also have a daytime job if thats of any
consolation to you).

> So that is should be possible. Can you point me to some papers
> about W2FQ and HPFSC ?

http://www.cs.cmu.edu/People/hzhang/HFSC/main.html is agood start;

but i think an interesting one is W2FQ at:

J.C.R. Bennett and H. Zhang, ``WF2Q: Worst-case Fair Weighted Fair
Queueing'', INFOCOM'96, Mar, 1996.
which can be found at:

scroll down the list.