
Re: new WRR sched, updated TBF and "real" ingress implementation



Hello Jamal, I merged both of our "threads" into this one
mail; it will be more readable, I hope ;-)

> have to do that. The point is dev->qdisc/qdisc_sleeping is a very simple
> scheme which works well; it is cheap; and you really don't have to worry
> about any invalid state that qdisc_sleeping might be in.
> In the case of ingress if the device is down, the packet will be junked
> much lower down in the stack; so ingress doesn't have to worry about it.

Yes, that explains it well. That's exactly what I wanted to know :-)

> > Different subflows within one class (subflows are different
> > in tcp port/ip addreses) doesn't share bw fairly.
> > It would work if I create one policer for each flow (ip/port pairs)
> > and assign each one the same rate.
> 
> You can do this.
> Did you try the example which uses policer "index"?
> You can share the same policer across multiple flows.
> Look at the "color aware" example Edge32-ca-u32 (also some description in
> the README in that directory)
>
> > But when one policer polices several flows then there is always
> > case where "faster" sender is able to use more bandwidth than
> > slower (more distant) hosts.
> 
> Yeah; but this has to do with the protocol in question (TCP) i.e nothing
> to do with the policer;

Yep, this is the point ;) A policer can do nothing about this, but an SFQ
queue can. When there are several parallel conversations, SFQ will
try to queue each of them into an independent queue. When the gateway is
congested (or an output TBF or CBQ starts throttling), several queues
build up. Because SFQ queues are served round-robin, the consequence
is that the faster host builds a longer queue and is delayed more.
The only drawback is ... see the paragraph about SFQ below.
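
To make the round-robin point concrete, here is a tiny standalone sketch
(not the kernel SFQ code; the hash and the numbers are just for
illustration):

#include <stdio.h>

#define BUCKETS 4                  /* the real SFQ uses 1024 */

static int qlen[BUCKETS];          /* packets queued per conversation */

static unsigned hash_flow(unsigned saddr, unsigned daddr, unsigned ports)
{
        return (saddr ^ daddr ^ ports) % BUCKETS;
}

int main(void)
{
        /* a fast sender enqueues 6 packets, a slow one only 2 */
        for (int i = 0; i < 6; i++) qlen[hash_flow(1, 9, 80)]++;
        for (int i = 0; i < 2; i++) qlen[hash_flow(2, 9, 80)]++;

        /* round-robin service: each non-empty queue sends one packet per
         * round, so both conversations get the same share of the link and
         * the fast sender only lengthens (and delays) its own queue */
        for (int round = 0; round < 3; round++)
                for (int b = 0; b < BUCKETS; b++)
                        if (qlen[b]) {
                                qlen[b]--;
                                printf("round %d: bucket %d sends\n", round, b);
                        }
        return 0;
}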

> > Queuing can solve it because when you use SFQ it divides flows
> > fairly. But you can't use SFQ when dropping (instead of delaying)
> > because then no packets are queued and we can't choose which
> > packet to send.
> >
> 
> SFQ is still stochastic in its fairness (eg when hash collisions occur)

Yes, it is. But it's better than nothing. I've hacked it to display the actual
and maximal number of conversations, and on my gateway the maximum is 8.
The SFQ hash has 1024 buckets, so the probability of a collision is low
(8 conversations spread over 1024 buckets).
The perturb parameter helps even more.
I think SFQ is quite a good solution for best-effort (BE) traffic.
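
A quick back-of-the-envelope check (just my arithmetic, nothing from the
SFQ code) says the chance that 8 active conversations collide somewhere in
a 1024-bucket hash is only a few percent:

#include <stdio.h>

int main(void)
{
        /* probability that 8 flows all land in distinct buckets of 1024 */
        double p_distinct = 1.0;
        for (int i = 1; i < 8; i++)
                p_distinct *= (1024.0 - i) / 1024.0;
        printf("P(some collision) = %.3f\n", 1.0 - p_distinct);  /* ~0.027 */
        return 0;
}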

> I think if you want to fix this unfairness, you have to write a stateful
> classifier (urgh!); packeteer for example play games with adjusting tcp
> window sizes based on flow state they maintain such as RTT etc.
> It is dirty but works.

huh .. you are right, it is dirty .. what is Packeteer, by the way?

> > It doesn't depend on HW rate - I tested it also on loopback and it
> > works.
> 
> Loopback is a bad example. It is a high speed interface ;->
> try to test across two machines with a null serial connection (something
> like 28-56Kbps)

I wrote "also". I tried it on my 56k link and loopback. The point
is that on 56k original CBQ works well too. But on loopback (or
another iff with variable rate - like wireless eth) it doesn't.
TBF+CBQ works well on both.

> > Currently I'm thinking to change SFQ and add new parameter to it:
> > maximal depth of each flow. It should have one important result:
> > ability to handle several subflows along with low delay.
> 
> Read the comments Alexey makes about maximal 128 packet queues and fitting
> in 4K pages.

Yes, I spent several days with his code. SFQ does not have 128-packet
queues; instead the whole qdisc can queue up to 128 packets. Each
packet is then linked into one of 1024 hash buckets.
So SFQ can have one queue with 128 packets, or 128 active
queues (out of 1024) with 1 packet each.
When I limit each queue's length to, say, 2, I can attach it
to a CBQ leaf and ensure fairness between TCP conversations along with
a relatively small delay. Note that you probably can't do fair
(between conversations) queuing using
dropping only - you need some metric to decide which conversation should drop now.
SFQ uses its queue length as that metric - when a conversation's queue exceeds
a certain limit (which is not tested currently), it will drop subsequent
packets for that class (queue).
This way you can slow down all conversations equally. Using policers
alone, you can only decide whether to drop (or mark) the CURRENT packet.
With a weird arrival pattern, each drop can hit the same conversation,
so the policing ends up denying an arbitrary flow ..
Unfortunately this was true on my gateway: when I installed SFQ instead of
FIFO, our customers noticed better throughput without starvations.
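
Roughly what I have in mind for the enqueue side, as a standalone sketch
(my own code and names, not a patch against sch_sfq.c; the per-flow depth
of 2 is just for illustration):

#include <stdio.h>

#define BUCKETS   1024     /* as in SFQ                           */
#define TOTAL_MAX 128      /* whole-qdisc limit, as in SFQ        */
#define FLOW_MAX  2        /* new: maximal depth per conversation */

static int qlen[BUCKETS];  /* packets queued per bucket           */
static int total;          /* packets queued in the whole qdisc   */

/* returns 1 if the packet is accepted, 0 if it must be dropped */
static int enqueue(unsigned bucket)
{
        if (total >= TOTAL_MAX || qlen[bucket] >= FLOW_MAX)
                return 0;  /* this conversation already holds its share */
        qlen[bucket]++;
        total++;
        return 1;
}

int main(void)
{
        /* one aggressive conversation (bucket 7), one polite one (bucket 3) */
        for (int i = 0; i < 5; i++)
                printf("fast flow pkt %d: %s\n", i, enqueue(7) ? "queued" : "dropped");
        printf("slow flow pkt 0: %s\n", enqueue(3) ? "queued" : "dropped");
        return 0;
}

The fast flow keeps at most FLOW_MAX packets queued, so its excess is
dropped while the slow flow still gets queued - exactly the "slow down all
conversations equally" behaviour described above.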

> > When packet P1 of size PS1 is sent at time 0, it completes transmission
> > over the wire at time HT1. Then there will be a (possibly zero) pause, say WT1.
> > The transmission rate of the link is then P1/(HT1+WT1). The estimators in CBQ
> > measure just this rate.
> 
> agreed (btw, you mean PS1/(HT1+WT1)).

yes ;)

> 
> > When you go thru 100Mb eth or a 56k modem, only the
> > HT1/WT1 ratio will differ. But the rate doesn't depend on the ratio, only on the
> > sum.
> > I can't believe that the data rate should depend on the HW bandwidth - it
> > would affect only the upper limit of the rate, wouldn't it ?
> 
> This is nice to say but hard to do;-> How do you know what HT1 is?

Why? I don't need to know HT1. I know HT1+WT1.
And HT1+WT1 - PS1/desired_rate = idle. And idle is what I need
to know. Right?
This is exactly what you see in the CBQ code. It uses right edges, ok,
but see this snippet from Alexey's code:

idle = PSCHED_TDIFF(q->now, cl->last);
idle -= L2T(cl, len);
avgidle += idle - (avgidle>>cl->ewma_log);

q->now is the right edge of the just-finished packet, cl->last is the
same for the previous packet. So you can see that their difference is
equal to HT1+WT1. Then the class's len/rate is subtracted and the
average is computed. A few lines further down you can see:

idle -= L2T(&q->link, len);
idle += L2T(cl, len);
PSCHED_TADD2(q->now, idle, cl->undertime);

well, here the device's HW rate is used (q->link), but only to
correct cl->undertime !
It's because undertime was computed as the time when avgidle
will be zero - BUT avgidle is always updated at the right edge,
so we get the right edge's undertime too. But we need
to know when the class BEGINS to be under its limit. Thus we subtract
HT1 to get the left edge's undertime.
Is it clear ?
So you can see that if we use left edges we will
not need the phys bw, because avgidle will be computed
the same way as now (and we don't need it there either) and
the undertime value will already come out as the left edge's time.
IMHO using the right edge is a heritage from Floyd's first
implementation and probably nobody tried to change it.
But for Linux it would be more suitable, I think.
Do you still see any inconsistency in my theory ?
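
To make this easier to follow, here is a toy standalone rendering of the
quoted computation with example numbers (the variable names and values are
mine; only the formulas follow the quoted code):

#include <stdio.h>

#define EWMA_LOG 3                        /* W = 1/8 */

int main(void)
{
        /* all times in microseconds */
        long now  = 100000;               /* right edge of this packet     */
        long last =  60000;               /* right edge of previous packet */
        long len  = 500;                  /* packet size in bytes          */
        long cl_us_per_byte   = 100;      /* class rate  = 10 kbyte/s      */
        long link_us_per_byte = 1;        /* link rate   =  1 Mbyte/s      */
        long avgidle = -4000;             /* class currently overlimit     */

        /* idle = (HT1+WT1) - PS1/desired_rate, as in the first snippet */
        long idle = (now - last) - len * cl_us_per_byte;
        avgidle += idle - (avgidle >> EWMA_LOG);

        if (avgidle <= 0) {
                /* delay after which avgidle would be back at zero (in the
                 * real code this step comes between the two quoted fragments) */
                idle = (-avgidle) - ((-avgidle) >> EWMA_LOG);
                /* right edge -> left edge: subtract HT1 = len/link_rate;
                 * this is the only place the physical bandwidth enters */
                idle -= len * link_us_per_byte;
                idle += len * cl_us_per_byte;
                printf("undertime = now + %ld usec\n", idle);
        }
        return 0;
}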

> total_packet_transm_time has 2 time factors which are hidden: These are
> the interpacket gap and the h/ware transmission time. In order to
> shape/leak at a certain allocated rate you need to account for
> both.

yes, as I said, we use them both, but always as the sum, which we know.
It is the same for TBF: it shapes very well and doesn't use the
two factors separately, only their sum.
In fact you need to ensure that over an arbitrary time interval T
you transmit at most T*rate bits. But it doesn't matter whether
you transmit 1 kb in 1 ms followed by a 99 ms gap, or transmit it in 50 ms
with a 50 ms gap.
It is still a rate of 10 kbit/sec. Right ?
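
A minimal token-bucket sketch of this point (a simplification of my own,
not the kernel TBF code): the refill only ever sees the elapsed time
now - last, i.e. the sum HT+WT, never the two parts separately.

#include <stdio.h>

struct tbf {
        double rate_bps;    /* allowed rate, bits per second */
        double burst_bits;  /* bucket depth                  */
        double tokens;      /* current tokens, in bits       */
        double last;        /* last update time, seconds     */
};

/* returns 1 if a packet of `bits` may be sent at time `now` */
static int tbf_admit(struct tbf *t, double now, double bits)
{
        t->tokens += (now - t->last) * t->rate_bps;  /* only the sum matters */
        if (t->tokens > t->burst_bits)
                t->tokens = t->burst_bits;
        t->last = now;
        if (t->tokens < bits)
                return 0;
        t->tokens -= bits;
        return 1;
}

int main(void)
{
        struct tbf t = { 10000.0, 1000.0, 1000.0, 0.0 };  /* 10 kbit/s */
        /* a 1000-bit packet every 100 ms: whether that is 1 ms of wire
         * time + 99 ms gap or 50 ms + 50 ms, every one is admitted,
         * because 0.1 s * 10 kbit/s = 1000 bits either way */
        for (int i = 1; i <= 5; i++)
                printf("packet %d: %s\n", i,
                       tbf_admit(&t, i * 0.1, 1000.0) ? "sent" : "dropped");
        return 0;
}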

> > I have some results but not in good form. But it seems
> > that each class which is underlimit delays a packet at most
> > 1ms (486DX2/66), bounded classes are throttled exactly
> > to the specified ratio and the delay with a 5kbps flow and 500B packets
> > is 45ms avg (using the DROP overlimit action).
> 
> What are the similar numbers with the current estimator? 1 ms for
> underlimit sounds like too much.

:-) this parameter is the same; it's probably due to my slow computer.
I'm doing a lot of debug printk's in the code.

> Good. Like I said in my earlier mail; just make sure there is a selectable
> alternative for the old estimator.

I was thinking of implementing it as another qdisc (say TCBQ).
It's because a lot of code was changed in a fundamental way,
and the tc tool will need many fewer parameters to
set up TCBQ. In fact, the qdisc needs no parameters at all (really).
Each class needs only a rate. Burst, offtime, weight,
overlimit strategy (which is missing in the current tc),
priority and lower priority are optional.
bandwidth, avpkt, ewma, minburst and maxburst are not
needed anymore.
Because almost everyone is familiar with the TBF parameters,
it should be much simpler to use now.

> con in this case translated as "to trick or convince"; i.e. can I convince
> you to write a new scheduler or something else ;-> You seem to be a very
> capable guy and we need the manpower if we can get it ;-> Instead of you
> trying to tune existing things, I think your skills could be more useful
> to add new things (eg think of W2FQ for one or HPFSC; it would be useful
> to have them around. I have been meaning to implement one of them but too
> occupied at the moment).

I'm occupied too, but it has always been one of my dreams to be one
of the Linux developers ;-)
So it should be possible. Can you point me to some papers
about W2FQ and HPFSC ?

regards 
devik