net/sched: Add codel and fq_codel from Linux 3.5-rc1

svn path=/dists/sid/linux-2.6/; revision=19042
This commit is contained in:
Ben Hutchings 2012-05-28 01:50:18 +00:00
parent f27223075d
commit ec8c372709
13 changed files with 2457 additions and 0 deletions

1
debian/changelog vendored
View File

@ -4,6 +4,7 @@ linux-2.6 (3.2.18-2) UNRELEASED; urgency=low
* be2net: Backport most changes up to Linux 3.5-rc1, thanks to
Sarveshwar Bandi (Closes: #673391)
- Add support for Skyhawk cards
* net/sched: Add codel and fq_codel from Linux 3.5-rc1
-- Ben Hutchings <ben@decadent.org.uk> Sun, 27 May 2012 01:12:44 +0100

View File

@ -4554,6 +4554,8 @@ CONFIG_NET_SCH_DRR=m
CONFIG_NET_SCH_MQPRIO=m
CONFIG_NET_SCH_CHOKE=m
CONFIG_NET_SCH_QFQ=m
CONFIG_NET_SCH_CODEL=m
CONFIG_NET_SCH_FQ_CODEL=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS_BASIC=m
CONFIG_NET_CLS_TCINDEX=m

View File

@ -0,0 +1,771 @@
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 10 May 2012 07:51:25 +0000
Subject: [PATCH 1/7] codel: Controlled Delay AQM
commit 76e3cc126bb223013a6b9a0e2a51238d1ef2e409 upstream.
An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson.
http://queue.acm.org/detail.cfm?id=2209336
This AQM main input is no longer queue size in bytes or packets, but the
delay packets stay in (FIFO) queue.
As we don't have infinite memory, we still can drop packets in enqueue()
in case of massive load, but mean of CoDel is to drop packets in
dequeue(), using a control law based on two simple parameters :
target : target sojourn time (default 5ms)
interval : width of moving time window (default 100ms)
Based on initial work from Dave Taht.
Refactored to help future codel inclusion as a plugin for other linux
qdisc (FQ_CODEL, ...), like RED.
include/net/codel.h contains codel algorithm as close as possible than
Kathleen reference.
net/sched/sch_codel.c contains the linux qdisc specific glue.
Separate structures permit a memory efficient implementation of fq_codel
(to be sent as a separate work) : Each flow has its own struct
codel_vars.
timestamps are taken at enqueue() time with 1024 ns precision, allowing
a range of 2199 seconds in queue, and 100Gb links support. iproute2 uses
usec as base unit.
Selected packets are dropped, unless ECN is enabled and packets can get
ECN mark instead.
Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and
tg3 drivers (BQL enabled).
Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ]
[ interval TIME ] [ ecn ]
qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn
Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0)
rate 202365Kbit 16708pps backlog 113550b 75p requeues 0
count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us
maxpacket 1514 ecn_mark 84399 drop_overlimit 0
CoDel must be seen as a base module, and should be used keeping in mind
there is still a FIFO queue. So a typical setup will probably need a
hierarchy of several qdiscs and packet classifiers to be able to meet
whatever constraints a user might have.
One possible example would be to use fq_codel, which combines Fair
Queueing and CoDel, in replacement of sfq / sfq_red.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/linux/pkt_sched.h | 26 ++++
include/net/codel.h | 332 +++++++++++++++++++++++++++++++++++++++++++++
net/sched/Kconfig | 11 ++
net/sched/Makefile | 1 +
net/sched/sch_codel.c | 275 +++++++++++++++++++++++++++++++++++++
5 files changed, 645 insertions(+)
create mode 100644 include/net/codel.h
create mode 100644 net/sched/sch_codel.c
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index ffe975c..cde56c2 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -655,4 +655,30 @@ struct tc_qfq_stats {
__u32 lmax;
};
+/* CODEL */
+
+enum {
+ TCA_CODEL_UNSPEC,
+ TCA_CODEL_TARGET,
+ TCA_CODEL_LIMIT,
+ TCA_CODEL_INTERVAL,
+ TCA_CODEL_ECN,
+ __TCA_CODEL_MAX
+};
+
+#define TCA_CODEL_MAX (__TCA_CODEL_MAX - 1)
+
+struct tc_codel_xstats {
+ __u32 maxpacket; /* largest packet we've seen so far */
+ __u32 count; /* how many drops we've done since the last time we
+ * entered dropping state
+ */
+ __u32 lastcount; /* count at entry to dropping state */
+ __u32 ldelay; /* in-queue delay seen by most recently dequeued packet */
+ __s32 drop_next; /* time to drop next packet */
+ __u32 drop_overlimit; /* number of time max qdisc packet limit was hit */
+ __u32 ecn_mark; /* number of packets we ECN marked instead of dropped */
+ __u32 dropping; /* are we in dropping state ? */
+};
+
#endif
diff --git a/include/net/codel.h b/include/net/codel.h
new file mode 100644
index 0000000..bce2cef
--- /dev/null
+++ b/include/net/codel.h
@@ -0,0 +1,332 @@
+#ifndef __NET_SCHED_CODEL_H
+#define __NET_SCHED_CODEL_H
+
+/*
+ * Codel - The Controlled-Delay Active Queue Management algorithm
+ *
+ * Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
+ * Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
+ * Copyright (C) 2012 Michael D. Taht <dave.taht@bufferbloat.net>
+ * Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ * derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <linux/types.h>
+#include <linux/ktime.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/inet_ecn.h>
+
+/* Controlling Queue Delay (CoDel) algorithm
+ * =========================================
+ * Source : Kathleen Nichols and Van Jacobson
+ * http://queue.acm.org/detail.cfm?id=2209336
+ *
+ * Implemented on linux by Dave Taht and Eric Dumazet
+ */
+
+
+/* CoDel uses a 1024 nsec clock, encoded in u32
+ * This gives a range of 2199 seconds, because of signed compares
+ */
+typedef u32 codel_time_t;
+typedef s32 codel_tdiff_t;
+#define CODEL_SHIFT 10
+#define MS2TIME(a) ((a * NSEC_PER_MSEC) >> CODEL_SHIFT)
+
+static inline codel_time_t codel_get_time(void)
+{
+ u64 ns = ktime_to_ns(ktime_get());
+
+ return ns >> CODEL_SHIFT;
+}
+
+#define codel_time_after(a, b) ((s32)(a) - (s32)(b) > 0)
+#define codel_time_after_eq(a, b) ((s32)(a) - (s32)(b) >= 0)
+#define codel_time_before(a, b) ((s32)(a) - (s32)(b) < 0)
+#define codel_time_before_eq(a, b) ((s32)(a) - (s32)(b) <= 0)
+
+/* Qdiscs using codel plugin must use codel_skb_cb in their own cb[] */
+struct codel_skb_cb {
+ codel_time_t enqueue_time;
+};
+
+static struct codel_skb_cb *get_codel_cb(const struct sk_buff *skb)
+{
+ qdisc_cb_private_validate(skb, sizeof(struct codel_skb_cb));
+ return (struct codel_skb_cb *)qdisc_skb_cb(skb)->data;
+}
+
+static codel_time_t codel_get_enqueue_time(const struct sk_buff *skb)
+{
+ return get_codel_cb(skb)->enqueue_time;
+}
+
+static void codel_set_enqueue_time(struct sk_buff *skb)
+{
+ get_codel_cb(skb)->enqueue_time = codel_get_time();
+}
+
+static inline u32 codel_time_to_us(codel_time_t val)
+{
+ u64 valns = ((u64)val << CODEL_SHIFT);
+
+ do_div(valns, NSEC_PER_USEC);
+ return (u32)valns;
+}
+
+/**
+ * struct codel_params - contains codel parameters
+ * @target: target queue size (in time units)
+ * @interval: width of moving time window
+ * @ecn: is Explicit Congestion Notification enabled
+ */
+struct codel_params {
+ codel_time_t target;
+ codel_time_t interval;
+ bool ecn;
+};
+
+/**
+ * struct codel_vars - contains codel variables
+ * @count: how many drops we've done since the last time we
+ * entered dropping state
+ * @lastcount: count at entry to dropping state
+ * @dropping: set to true if in dropping state
+ * @first_above_time: when we went (or will go) continuously above target
+ * for interval
+ * @drop_next: time to drop next packet, or when we dropped last
+ * @ldelay: sojourn time of last dequeued packet
+ */
+struct codel_vars {
+ u32 count;
+ u32 lastcount;
+ bool dropping;
+ codel_time_t first_above_time;
+ codel_time_t drop_next;
+ codel_time_t ldelay;
+};
+
+/**
+ * struct codel_stats - contains codel shared variables and stats
+ * @maxpacket: largest packet we've seen so far
+ * @drop_count: temp count of dropped packets in dequeue()
+ * ecn_mark: number of packets we ECN marked instead of dropping
+ */
+struct codel_stats {
+ u32 maxpacket;
+ u32 drop_count;
+ u32 ecn_mark;
+};
+
+static void codel_params_init(struct codel_params *params)
+{
+ params->interval = MS2TIME(100);
+ params->target = MS2TIME(5);
+ params->ecn = false;
+}
+
+static void codel_vars_init(struct codel_vars *vars)
+{
+ vars->drop_next = 0;
+ vars->first_above_time = 0;
+ vars->dropping = false; /* exit dropping state */
+ vars->count = 0;
+ vars->lastcount = 0;
+}
+
+static void codel_stats_init(struct codel_stats *stats)
+{
+ stats->maxpacket = 256;
+}
+
+/* return interval/sqrt(x) with good precision
+ * relies on int_sqrt(unsigned long x) kernel implementation
+ */
+static u32 codel_inv_sqrt(u32 _interval, u32 _x)
+{
+ u64 interval = _interval;
+ unsigned long x = _x;
+
+ /* Scale operands for max precision */
+
+#if BITS_PER_LONG == 64
+ x <<= 32; /* On 64bit arches, we can prescale x by 32bits */
+ interval <<= 16;
+#endif
+
+ while (x < (1UL << (BITS_PER_LONG - 2))) {
+ x <<= 2;
+ interval <<= 1;
+ }
+ do_div(interval, int_sqrt(x));
+ return (u32)interval;
+}
+
+static codel_time_t codel_control_law(codel_time_t t,
+ codel_time_t interval,
+ u32 count)
+{
+ return t + codel_inv_sqrt(interval, count);
+}
+
+
+static bool codel_should_drop(struct sk_buff *skb,
+ unsigned int *backlog,
+ struct codel_vars *vars,
+ struct codel_params *params,
+ struct codel_stats *stats,
+ codel_time_t now)
+{
+ bool ok_to_drop;
+
+ if (!skb) {
+ vars->first_above_time = 0;
+ return false;
+ }
+
+ vars->ldelay = now - codel_get_enqueue_time(skb);
+ *backlog -= qdisc_pkt_len(skb);
+
+ if (unlikely(qdisc_pkt_len(skb) > stats->maxpacket))
+ stats->maxpacket = qdisc_pkt_len(skb);
+
+ if (codel_time_before(vars->ldelay, params->target) ||
+ *backlog <= stats->maxpacket) {
+ /* went below - stay below for at least interval */
+ vars->first_above_time = 0;
+ return false;
+ }
+ ok_to_drop = false;
+ if (vars->first_above_time == 0) {
+ /* just went above from below. If we stay above
+ * for at least interval we'll say it's ok to drop
+ */
+ vars->first_above_time = now + params->interval;
+ } else if (codel_time_after(now, vars->first_above_time)) {
+ ok_to_drop = true;
+ }
+ return ok_to_drop;
+}
+
+typedef struct sk_buff * (*codel_skb_dequeue_t)(struct codel_vars *vars,
+ struct Qdisc *sch);
+
+static struct sk_buff *codel_dequeue(struct Qdisc *sch,
+ struct codel_params *params,
+ struct codel_vars *vars,
+ struct codel_stats *stats,
+ codel_skb_dequeue_t dequeue_func,
+ u32 *backlog)
+{
+ struct sk_buff *skb = dequeue_func(vars, sch);
+ codel_time_t now;
+ bool drop;
+
+ if (!skb) {
+ vars->dropping = false;
+ return skb;
+ }
+ now = codel_get_time();
+ drop = codel_should_drop(skb, backlog, vars, params, stats, now);
+ if (vars->dropping) {
+ if (!drop) {
+ /* sojourn time below target - leave dropping state */
+ vars->dropping = false;
+ } else if (codel_time_after_eq(now, vars->drop_next)) {
+ /* It's time for the next drop. Drop the current
+ * packet and dequeue the next. The dequeue might
+ * take us out of dropping state.
+ * If not, schedule the next drop.
+ * A large backlog might result in drop rates so high
+ * that the next drop should happen now,
+ * hence the while loop.
+ */
+ while (vars->dropping &&
+ codel_time_after_eq(now, vars->drop_next)) {
+ if (++vars->count == 0) /* avoid zero divides */
+ vars->count = ~0U;
+ if (params->ecn && INET_ECN_set_ce(skb)) {
+ stats->ecn_mark++;
+ vars->drop_next =
+ codel_control_law(vars->drop_next,
+ params->interval,
+ vars->count);
+ goto end;
+ }
+ qdisc_drop(skb, sch);
+ stats->drop_count++;
+ skb = dequeue_func(vars, sch);
+ if (!codel_should_drop(skb, backlog,
+ vars, params, stats, now)) {
+ /* leave dropping state */
+ vars->dropping = false;
+ } else {
+ /* and schedule the next drop */
+ vars->drop_next =
+ codel_control_law(vars->drop_next,
+ params->interval,
+ vars->count);
+ }
+ }
+ }
+ } else if (drop) {
+ if (params->ecn && INET_ECN_set_ce(skb)) {
+ stats->ecn_mark++;
+ } else {
+ qdisc_drop(skb, sch);
+ stats->drop_count++;
+
+ skb = dequeue_func(vars, sch);
+ drop = codel_should_drop(skb, backlog, vars, params,
+ stats, now);
+ }
+ vars->dropping = true;
+ /* if min went above target close to when we last went below it
+ * assume that the drop rate that controlled the queue on the
+ * last cycle is a good starting point to control it now.
+ */
+ if (codel_time_before(now - vars->drop_next,
+ 16 * params->interval)) {
+ vars->count = (vars->count - vars->lastcount) | 1;
+ } else {
+ vars->count = 1;
+ }
+ vars->lastcount = vars->count;
+ vars->drop_next = codel_control_law(now, params->interval,
+ vars->count);
+ }
+end:
+ return skb;
+}
+#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index 75b58f8..fadd252 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -250,6 +250,17 @@ config NET_SCH_QFQ
If unsure, say N.
+config NET_SCH_CODEL
+ tristate "Controlled Delay AQM (CODEL)"
+ help
+ Say Y here if you want to use the Controlled Delay (CODEL)
+ packet scheduling algorithm.
+
+ To compile this driver as a module, choose M here: the module
+ will be called sch_codel.
+
+ If unsure, say N.
+
config NET_SCH_INGRESS
tristate "Ingress Qdisc"
depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 8cdf4e2..30fab03 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -37,6 +37,7 @@ obj-$(CONFIG_NET_SCH_PLUG) += sch_plug.o
obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
obj-$(CONFIG_NET_SCH_CHOKE) += sch_choke.o
obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o
+obj-$(CONFIG_NET_SCH_CODEL) += sch_codel.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
new file mode 100644
index 0000000..b4a1a81
--- /dev/null
+++ b/net/sched/sch_codel.c
@@ -0,0 +1,275 @@
+/*
+ * Codel - The Controlled-Delay Active Queue Management algorithm
+ *
+ * Copyright (C) 2011-2012 Kathleen Nichols <nichols@pollere.com>
+ * Copyright (C) 2011-2012 Van Jacobson <van@pollere.net>
+ *
+ * Implemented on linux by :
+ * Copyright (C) 2012 Michael D. Taht <dave.taht@bufferbloat.net>
+ * Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ * notice, this list of conditions, and the following disclaimer,
+ * without modification.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ * 3. The names of the authors may not be used to endorse or promote products
+ * derived from this software without specific prior written permission.
+ *
+ * Alternatively, provided that this notice is retained in full, this
+ * software may be distributed under the terms of the GNU General
+ * Public License ("GPL") version 2, in which case the provisions of the
+ * GPL apply INSTEAD OF those given above.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
+ * DAMAGE.
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/pkt_sched.h>
+#include <net/codel.h>
+
+
+#define DEFAULT_CODEL_LIMIT 1000
+
+struct codel_sched_data {
+ struct codel_params params;
+ struct codel_vars vars;
+ struct codel_stats stats;
+ u32 drop_overlimit;
+};
+
+/* This is the specific function called from codel_dequeue()
+ * to dequeue a packet from queue. Note: backlog is handled in
+ * codel, we dont need to reduce it here.
+ */
+static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
+{
+ struct sk_buff *skb = __skb_dequeue(&sch->q);
+
+ prefetch(&skb->end); /* we'll need skb_shinfo() */
+ return skb;
+}
+
+static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
+{
+ struct codel_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+
+ skb = codel_dequeue(sch, &q->params, &q->vars, &q->stats,
+ dequeue, &sch->qstats.backlog);
+ /* We cant call qdisc_tree_decrease_qlen() if our qlen is 0,
+ * or HTB crashes. Defer it for next round.
+ */
+ if (q->stats.drop_count && sch->q.qlen) {
+ qdisc_tree_decrease_qlen(sch, q->stats.drop_count);
+ q->stats.drop_count = 0;
+ }
+ if (skb)
+ qdisc_bstats_update(sch, skb);
+ return skb;
+}
+
+static int codel_qdisc_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+ struct codel_sched_data *q;
+
+ if (likely(qdisc_qlen(sch) < sch->limit)) {
+ codel_set_enqueue_time(skb);
+ return qdisc_enqueue_tail(skb, sch);
+ }
+ q = qdisc_priv(sch);
+ q->drop_overlimit++;
+ return qdisc_drop(skb, sch);
+}
+
+static const struct nla_policy codel_policy[TCA_CODEL_MAX + 1] = {
+ [TCA_CODEL_TARGET] = { .type = NLA_U32 },
+ [TCA_CODEL_LIMIT] = { .type = NLA_U32 },
+ [TCA_CODEL_INTERVAL] = { .type = NLA_U32 },
+ [TCA_CODEL_ECN] = { .type = NLA_U32 },
+};
+
+static int codel_change(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct codel_sched_data *q = qdisc_priv(sch);
+ struct nlattr *tb[TCA_CODEL_MAX + 1];
+ unsigned int qlen;
+ int err;
+
+ if (!opt)
+ return -EINVAL;
+
+ err = nla_parse_nested(tb, TCA_CODEL_MAX, opt, codel_policy);
+ if (err < 0)
+ return err;
+
+ sch_tree_lock(sch);
+
+ if (tb[TCA_CODEL_TARGET]) {
+ u32 target = nla_get_u32(tb[TCA_CODEL_TARGET]);
+
+ q->params.target = ((u64)target * NSEC_PER_USEC) >> CODEL_SHIFT;
+ }
+
+ if (tb[TCA_CODEL_INTERVAL]) {
+ u32 interval = nla_get_u32(tb[TCA_CODEL_INTERVAL]);
+
+ q->params.interval = ((u64)interval * NSEC_PER_USEC) >> CODEL_SHIFT;
+ }
+
+ if (tb[TCA_CODEL_LIMIT])
+ sch->limit = nla_get_u32(tb[TCA_CODEL_LIMIT]);
+
+ if (tb[TCA_CODEL_ECN])
+ q->params.ecn = !!nla_get_u32(tb[TCA_CODEL_ECN]);
+
+ qlen = sch->q.qlen;
+ while (sch->q.qlen > sch->limit) {
+ struct sk_buff *skb = __skb_dequeue(&sch->q);
+
+ sch->qstats.backlog -= qdisc_pkt_len(skb);
+ qdisc_drop(skb, sch);
+ }
+ qdisc_tree_decrease_qlen(sch, qlen - sch->q.qlen);
+
+ sch_tree_unlock(sch);
+ return 0;
+}
+
+static int codel_init(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct codel_sched_data *q = qdisc_priv(sch);
+
+ sch->limit = DEFAULT_CODEL_LIMIT;
+
+ codel_params_init(&q->params);
+ codel_vars_init(&q->vars);
+ codel_stats_init(&q->stats);
+
+ if (opt) {
+ int err = codel_change(sch, opt);
+
+ if (err)
+ return err;
+ }
+
+ if (sch->limit >= 1)
+ sch->flags |= TCQ_F_CAN_BYPASS;
+ else
+ sch->flags &= ~TCQ_F_CAN_BYPASS;
+
+ return 0;
+}
+
+static int codel_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct codel_sched_data *q = qdisc_priv(sch);
+ struct nlattr *opts;
+
+ opts = nla_nest_start(skb, TCA_OPTIONS);
+ if (opts == NULL)
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_CODEL_TARGET,
+ codel_time_to_us(q->params.target)) ||
+ nla_put_u32(skb, TCA_CODEL_LIMIT,
+ sch->limit) ||
+ nla_put_u32(skb, TCA_CODEL_INTERVAL,
+ codel_time_to_us(q->params.interval)) ||
+ nla_put_u32(skb, TCA_CODEL_ECN,
+ q->params.ecn))
+ goto nla_put_failure;
+
+ return nla_nest_end(skb, opts);
+
+nla_put_failure:
+ nla_nest_cancel(skb, opts);
+ return -1;
+}
+
+static int codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ const struct codel_sched_data *q = qdisc_priv(sch);
+ struct tc_codel_xstats st = {
+ .maxpacket = q->stats.maxpacket,
+ .count = q->vars.count,
+ .lastcount = q->vars.lastcount,
+ .drop_overlimit = q->drop_overlimit,
+ .ldelay = codel_time_to_us(q->vars.ldelay),
+ .dropping = q->vars.dropping,
+ .ecn_mark = q->stats.ecn_mark,
+ };
+
+ if (q->vars.dropping) {
+ codel_tdiff_t delta = q->vars.drop_next - codel_get_time();
+
+ if (delta >= 0)
+ st.drop_next = codel_time_to_us(delta);
+ else
+ st.drop_next = -codel_time_to_us(-delta);
+ }
+
+ return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static void codel_reset(struct Qdisc *sch)
+{
+ struct codel_sched_data *q = qdisc_priv(sch);
+
+ qdisc_reset_queue(sch);
+ codel_vars_init(&q->vars);
+}
+
+static struct Qdisc_ops codel_qdisc_ops __read_mostly = {
+ .id = "codel",
+ .priv_size = sizeof(struct codel_sched_data),
+
+ .enqueue = codel_qdisc_enqueue,
+ .dequeue = codel_qdisc_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .init = codel_init,
+ .reset = codel_reset,
+ .change = codel_change,
+ .dump = codel_dump,
+ .dump_stats = codel_dump_stats,
+ .owner = THIS_MODULE,
+};
+
+static int __init codel_module_init(void)
+{
+ return register_qdisc(&codel_qdisc_ops);
+}
+
+static void __exit codel_module_exit(void)
+{
+ unregister_qdisc(&codel_qdisc_ops);
+}
+
+module_init(codel_module_init)
+module_exit(codel_module_exit)
+
+MODULE_DESCRIPTION("Controlled Delay queue discipline");
+MODULE_AUTHOR("Dave Taht");
+MODULE_AUTHOR("Eric Dumazet");
+MODULE_LICENSE("Dual BSD/GPL");
--
1.7.10

View File

@ -0,0 +1,189 @@
From: Eric Dumazet <edumazet@google.com>
Date: Sat, 12 May 2012 03:32:13 +0000
Subject: [PATCH 2/7] codel: use Newton method instead of sqrt() and divides
commit 536edd67109df5e0cdb2c4ee759e9bade7976367 upstream.
As Van pointed out, interval/sqrt(count) can be implemented using
multiplies only.
http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots
This patch implements the Newton method and reciprocal divide.
Total cost is 15 cycles instead of 120 on my Corei5 machine (64bit
kernel).
There is a small 'error' for count values < 5, but we don't really care.
I reuse a hole in struct codel_vars :
- pack the dropping boolean into one bit
- use 31bit to store the reciprocal value of sqrt(count).
Suggested-by: Van Jacobson <van@pollere.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Matt Mathis <mattmathis@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/codel.h | 68 ++++++++++++++++++++++++++++-----------------------
1 file changed, 37 insertions(+), 31 deletions(-)
diff --git a/include/net/codel.h b/include/net/codel.h
index bce2cef..bd8747c 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -46,6 +46,7 @@
#include <linux/skbuff.h>
#include <net/pkt_sched.h>
#include <net/inet_ecn.h>
+#include <linux/reciprocal_div.h>
/* Controlling Queue Delay (CoDel) algorithm
* =========================================
@@ -123,6 +124,7 @@ struct codel_params {
* entered dropping state
* @lastcount: count at entry to dropping state
* @dropping: set to true if in dropping state
+ * @rec_inv_sqrt: reciprocal value of sqrt(count) >> 1
* @first_above_time: when we went (or will go) continuously above target
* for interval
* @drop_next: time to drop next packet, or when we dropped last
@@ -131,7 +133,8 @@ struct codel_params {
struct codel_vars {
u32 count;
u32 lastcount;
- bool dropping;
+ bool dropping:1;
+ u32 rec_inv_sqrt:31;
codel_time_t first_above_time;
codel_time_t drop_next;
codel_time_t ldelay;
@@ -158,11 +161,7 @@ static void codel_params_init(struct codel_params *params)
static void codel_vars_init(struct codel_vars *vars)
{
- vars->drop_next = 0;
- vars->first_above_time = 0;
- vars->dropping = false; /* exit dropping state */
- vars->count = 0;
- vars->lastcount = 0;
+ memset(vars, 0, sizeof(*vars));
}
static void codel_stats_init(struct codel_stats *stats)
@@ -170,38 +169,37 @@ static void codel_stats_init(struct codel_stats *stats)
stats->maxpacket = 256;
}
-/* return interval/sqrt(x) with good precision
- * relies on int_sqrt(unsigned long x) kernel implementation
+/*
+ * http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots
+ * new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
+ *
+ * Here, invsqrt is a fixed point number (< 1.0), 31bit mantissa)
*/
-static u32 codel_inv_sqrt(u32 _interval, u32 _x)
+static void codel_Newton_step(struct codel_vars *vars)
{
- u64 interval = _interval;
- unsigned long x = _x;
+ u32 invsqrt = vars->rec_inv_sqrt;
+ u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 31;
+ u64 val = (3LL << 31) - ((u64)vars->count * invsqrt2);
- /* Scale operands for max precision */
-
-#if BITS_PER_LONG == 64
- x <<= 32; /* On 64bit arches, we can prescale x by 32bits */
- interval <<= 16;
-#endif
+ val = (val * invsqrt) >> 32;
- while (x < (1UL << (BITS_PER_LONG - 2))) {
- x <<= 2;
- interval <<= 1;
- }
- do_div(interval, int_sqrt(x));
- return (u32)interval;
+ vars->rec_inv_sqrt = val;
}
+/*
+ * CoDel control_law is t + interval/sqrt(count)
+ * We maintain in rec_inv_sqrt the reciprocal value of sqrt(count) to avoid
+ * both sqrt() and divide operation.
+ */
static codel_time_t codel_control_law(codel_time_t t,
codel_time_t interval,
- u32 count)
+ u32 rec_inv_sqrt)
{
- return t + codel_inv_sqrt(interval, count);
+ return t + reciprocal_divide(interval, rec_inv_sqrt << 1);
}
-static bool codel_should_drop(struct sk_buff *skb,
+static bool codel_should_drop(const struct sk_buff *skb,
unsigned int *backlog,
struct codel_vars *vars,
struct codel_params *params,
@@ -274,14 +272,16 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
*/
while (vars->dropping &&
codel_time_after_eq(now, vars->drop_next)) {
- if (++vars->count == 0) /* avoid zero divides */
- vars->count = ~0U;
+ vars->count++; /* dont care of possible wrap
+ * since there is no more divide
+ */
+ codel_Newton_step(vars);
if (params->ecn && INET_ECN_set_ce(skb)) {
stats->ecn_mark++;
vars->drop_next =
codel_control_law(vars->drop_next,
params->interval,
- vars->count);
+ vars->rec_inv_sqrt);
goto end;
}
qdisc_drop(skb, sch);
@@ -296,7 +296,7 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
vars->drop_next =
codel_control_law(vars->drop_next,
params->interval,
- vars->count);
+ vars->rec_inv_sqrt);
}
}
}
@@ -319,12 +319,18 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
if (codel_time_before(now - vars->drop_next,
16 * params->interval)) {
vars->count = (vars->count - vars->lastcount) | 1;
+ /* we dont care if rec_inv_sqrt approximation
+ * is not very precise :
+ * Next Newton steps will correct it quadratically.
+ */
+ codel_Newton_step(vars);
} else {
vars->count = 1;
+ vars->rec_inv_sqrt = 0x7fffffff;
}
vars->lastcount = vars->count;
vars->drop_next = codel_control_law(now, params->interval,
- vars->count);
+ vars->rec_inv_sqrt);
}
end:
return skb;
--
1.7.10

View File

@ -0,0 +1,850 @@
From: Eric Dumazet <edumazet@google.com>
Date: Fri, 11 May 2012 09:30:50 +0000
Subject: [PATCH 3/7] fq_codel: Fair Queue Codel AQM
commit 4b549a2ef4bef9965d97cbd992ba67930cd3e0fe upstream.
Fair Queue Codel packet scheduler
Principles :
- Packets are classified (internal classifier or external) on flows.
- This is a Stochastic model (as we use a hash, several flows might
be hashed on same slot)
- Each flow has a CoDel managed queue.
- Flows are linked onto two (Round Robin) lists,
so that new flows have priority on old ones.
- For a given flow, packets are not reordered (CoDel uses a FIFO)
- head drops only.
- ECN capability is on by default.
- Very low memory footprint (64 bytes per flow)
tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ]
[ target TIME ] [ interval TIME ] [ noecn ]
[ quantum BYTES ]
defaults : 1024 flows, 10240 packets limit, quantum : device MTU
target : 5ms (CoDel default)
interval : 100ms (CoDel default)
Impressive results on load :
class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0
Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0)
rate 201691Kbit 28595pps backlog 0b 312p requeues 0
lended: 33063109 borrowed: 0 giants: 0
tokens: -912 ctokens: -912
class fq_codel 10:1735 parent 10:
(dropped 1292, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:4524 parent 10:
(dropped 1291, overlimits 0 requeues 0)
backlog 16654b 11p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:4e74 parent 10:
(dropped 1290, overlimits 0 requeues 0)
backlog 6056b 4p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms
class fq_codel 10:628a parent 10:
(dropped 1289, overlimits 0 requeues 0)
backlog 7570b 5p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms
class fq_codel 10:a4b3 parent 10:
(dropped 302, overlimits 0 requeues 0)
backlog 16654b 11p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:c3c2 parent 10:
(dropped 1284, overlimits 0 requeues 0)
backlog 13626b 9p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 5.9ms
class fq_codel 10:d331 parent 10:
(dropped 299, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.0ms
class fq_codel 10:d526 parent 10:
(dropped 12160, overlimits 0 requeues 0)
backlog 35870b 211p requeues 0
deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us
class fq_codel 10:e2c6 parent 10:
(dropped 1288, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
class fq_codel 10:eab5 parent 10:
(dropped 1285, overlimits 0 requeues 0)
backlog 16654b 11p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 5.9ms
class fq_codel 10:f220 parent 10:
(dropped 1289, overlimits 0 requeues 0)
backlog 15140b 10p requeues 0
deficit 1514 count 1 lastcount 1 ldelay 7.1ms
qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17
Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71)
rate 201697Kbit 28602pps backlog 0b 260p requeues 71
qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn
Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0)
rate 201697Kbit 28602pps backlog 189352b 260p requeues 0
maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593
new_flows_len 0 old_flows_len 11
PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data.
64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms
64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms
64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms
64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms
64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms
64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms
64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms
64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms
64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms
64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms
10 packets transmitted, 10 received, 0% packet loss, time 8999ms
rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms
Much better than SFQ because of priority given to new flows, and fast
path dirtying less cache lines.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/linux/pkt_sched.h | 54 ++++
net/sched/Kconfig | 11 +
net/sched/Makefile | 1 +
net/sched/sch_fq_codel.c | 624 +++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 690 insertions(+)
create mode 100644 net/sched/sch_fq_codel.c
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index cde56c2..32aef0a 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -681,4 +681,58 @@ struct tc_codel_xstats {
__u32 dropping; /* are we in dropping state ? */
};
+/* FQ_CODEL */
+
+enum {
+ TCA_FQ_CODEL_UNSPEC,
+ TCA_FQ_CODEL_TARGET,
+ TCA_FQ_CODEL_LIMIT,
+ TCA_FQ_CODEL_INTERVAL,
+ TCA_FQ_CODEL_ECN,
+ TCA_FQ_CODEL_FLOWS,
+ TCA_FQ_CODEL_QUANTUM,
+ __TCA_FQ_CODEL_MAX
+};
+
+#define TCA_FQ_CODEL_MAX (__TCA_FQ_CODEL_MAX - 1)
+
+enum {
+ TCA_FQ_CODEL_XSTATS_QDISC,
+ TCA_FQ_CODEL_XSTATS_CLASS,
+};
+
+struct tc_fq_codel_qd_stats {
+ __u32 maxpacket; /* largest packet we've seen so far */
+ __u32 drop_overlimit; /* number of time max qdisc
+ * packet limit was hit
+ */
+ __u32 ecn_mark; /* number of packets we ECN marked
+ * instead of being dropped
+ */
+ __u32 new_flow_count; /* number of time packets
+ * created a 'new flow'
+ */
+ __u32 new_flows_len; /* count of flows in new list */
+ __u32 old_flows_len; /* count of flows in old list */
+};
+
+struct tc_fq_codel_cl_stats {
+ __s32 deficit;
+ __u32 ldelay; /* in-queue delay seen by most recently
+ * dequeued packet
+ */
+ __u32 count;
+ __u32 lastcount;
+ __u32 dropping;
+ __s32 drop_next;
+};
+
+struct tc_fq_codel_xstats {
+ __u32 type;
+ union {
+ struct tc_fq_codel_qd_stats qdisc_stats;
+ struct tc_fq_codel_cl_stats class_stats;
+ };
+};
+
#endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index fadd252..e7a8976 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -261,6 +261,17 @@ config NET_SCH_CODEL
If unsure, say N.
+config NET_SCH_FQ_CODEL
+ tristate "Fair Queue Controlled Delay AQM (FQ_CODEL)"
+ help
+ Say Y here if you want to use the FQ Controlled Delay (FQ_CODEL)
+ packet scheduling algorithm.
+
+ To compile this driver as a module, choose M here: the module
+ will be called sch_fq_codel.
+
+ If unsure, say N.
+
config NET_SCH_INGRESS
tristate "Ingress Qdisc"
depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 30fab03..5940a19 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -38,6 +38,7 @@ obj-$(CONFIG_NET_SCH_MQPRIO) += sch_mqprio.o
obj-$(CONFIG_NET_SCH_CHOKE) += sch_choke.o
obj-$(CONFIG_NET_SCH_QFQ) += sch_qfq.o
obj-$(CONFIG_NET_SCH_CODEL) += sch_codel.o
+obj-$(CONFIG_NET_SCH_FQ_CODEL) += sch_fq_codel.o
obj-$(CONFIG_NET_CLS_U32) += cls_u32.o
obj-$(CONFIG_NET_CLS_ROUTE4) += cls_route.o
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
new file mode 100644
index 0000000..a7b3754
--- /dev/null
+++ b/net/sched/sch_fq_codel.c
@@ -0,0 +1,624 @@
+/*
+ * Fair Queue CoDel discipline
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ *
+ * Copyright (C) 2012 Eric Dumazet <edumazet@google.com>
+ */
+
+#include <linux/module.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/jiffies.h>
+#include <linux/string.h>
+#include <linux/in.h>
+#include <linux/errno.h>
+#include <linux/init.h>
+#include <linux/skbuff.h>
+#include <linux/jhash.h>
+#include <linux/slab.h>
+#include <linux/vmalloc.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/flow_keys.h>
+#include <net/codel.h>
+
+/* Fair Queue CoDel.
+ *
+ * Principles :
+ * Packets are classified (internal classifier or external) on flows.
+ * This is a Stochastic model (as we use a hash, several flows
+ * might be hashed on same slot)
+ * Each flow has a CoDel managed queue.
+ * Flows are linked onto two (Round Robin) lists,
+ * so that new flows have priority on old ones.
+ *
+ * For a given flow, packets are not reordered (CoDel uses a FIFO)
+ * head drops only.
+ * ECN capability is on by default.
+ * Low memory footprint (64 bytes per flow)
+ */
+
+struct fq_codel_flow {
+ struct sk_buff *head;
+ struct sk_buff *tail;
+ struct list_head flowchain;
+ int deficit;
+ u32 dropped; /* number of drops (or ECN marks) on this flow */
+ struct codel_vars cvars;
+}; /* please try to keep this structure <= 64 bytes */
+
+struct fq_codel_sched_data {
+ struct tcf_proto *filter_list; /* optional external classifier */
+ struct fq_codel_flow *flows; /* Flows table [flows_cnt] */
+ u32 *backlogs; /* backlog table [flows_cnt] */
+ u32 flows_cnt; /* number of flows */
+ u32 perturbation; /* hash perturbation */
+ u32 quantum; /* psched_mtu(qdisc_dev(sch)); */
+ struct codel_params cparams;
+ struct codel_stats cstats;
+ u32 drop_overlimit;
+ u32 new_flow_count;
+
+ struct list_head new_flows; /* list of new flows */
+ struct list_head old_flows; /* list of old flows */
+};
+
+static unsigned int fq_codel_hash(const struct fq_codel_sched_data *q,
+ const struct sk_buff *skb)
+{
+ struct flow_keys keys;
+ unsigned int hash;
+
+ skb_flow_dissect(skb, &keys);
+ hash = jhash_3words((__force u32)keys.dst,
+ (__force u32)keys.src ^ keys.ip_proto,
+ (__force u32)keys.ports, q->perturbation);
+ return ((u64)hash * q->flows_cnt) >> 32;
+}
+
+static unsigned int fq_codel_classify(struct sk_buff *skb, struct Qdisc *sch,
+ int *qerr)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ struct tcf_result res;
+ int result;
+
+ if (TC_H_MAJ(skb->priority) == sch->handle &&
+ TC_H_MIN(skb->priority) > 0 &&
+ TC_H_MIN(skb->priority) <= q->flows_cnt)
+ return TC_H_MIN(skb->priority);
+
+ if (!q->filter_list)
+ return fq_codel_hash(q, skb) + 1;
+
+ *qerr = NET_XMIT_SUCCESS | __NET_XMIT_BYPASS;
+ result = tc_classify(skb, q->filter_list, &res);
+ if (result >= 0) {
+#ifdef CONFIG_NET_CLS_ACT
+ switch (result) {
+ case TC_ACT_STOLEN:
+ case TC_ACT_QUEUED:
+ *qerr = NET_XMIT_SUCCESS | __NET_XMIT_STOLEN;
+ case TC_ACT_SHOT:
+ return 0;
+ }
+#endif
+ if (TC_H_MIN(res.classid) <= q->flows_cnt)
+ return TC_H_MIN(res.classid);
+ }
+ return 0;
+}
+
+/* helper functions : might be changed when/if skb use a standard list_head */
+
+/* remove one skb from head of slot queue */
+static inline struct sk_buff *dequeue_head(struct fq_codel_flow *flow)
+{
+ struct sk_buff *skb = flow->head;
+
+ flow->head = skb->next;
+ skb->next = NULL;
+ return skb;
+}
+
+/* add skb to flow queue (tail add) */
+static inline void flow_queue_add(struct fq_codel_flow *flow,
+ struct sk_buff *skb)
+{
+ if (flow->head == NULL)
+ flow->head = skb;
+ else
+ flow->tail->next = skb;
+ flow->tail = skb;
+ skb->next = NULL;
+}
+
+static unsigned int fq_codel_drop(struct Qdisc *sch)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+ unsigned int maxbacklog = 0, idx = 0, i, len;
+ struct fq_codel_flow *flow;
+
+ /* Queue is full! Find the fat flow and drop packet from it.
+ * This might sound expensive, but with 1024 flows, we scan
+ * 4KB of memory, and we dont need to handle a complex tree
+ * in fast path (packet queue/enqueue) with many cache misses.
+ */
+ for (i = 0; i < q->flows_cnt; i++) {
+ if (q->backlogs[i] > maxbacklog) {
+ maxbacklog = q->backlogs[i];
+ idx = i;
+ }
+ }
+ flow = &q->flows[idx];
+ skb = dequeue_head(flow);
+ len = qdisc_pkt_len(skb);
+ q->backlogs[idx] -= len;
+ kfree_skb(skb);
+ sch->q.qlen--;
+ sch->qstats.drops++;
+ sch->qstats.backlog -= len;
+ flow->dropped++;
+ return idx;
+}
+
+static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ unsigned int idx;
+ struct fq_codel_flow *flow;
+ int uninitialized_var(ret);
+
+ idx = fq_codel_classify(skb, sch, &ret);
+ if (idx == 0) {
+ if (ret & __NET_XMIT_BYPASS)
+ sch->qstats.drops++;
+ kfree_skb(skb);
+ return ret;
+ }
+ idx--;
+
+ codel_set_enqueue_time(skb);
+ flow = &q->flows[idx];
+ flow_queue_add(flow, skb);
+ q->backlogs[idx] += qdisc_pkt_len(skb);
+ sch->qstats.backlog += qdisc_pkt_len(skb);
+
+ if (list_empty(&flow->flowchain)) {
+ list_add_tail(&flow->flowchain, &q->new_flows);
+ codel_vars_init(&flow->cvars);
+ q->new_flow_count++;
+ flow->deficit = q->quantum;
+ flow->dropped = 0;
+ }
+ if (++sch->q.qlen < sch->limit)
+ return NET_XMIT_SUCCESS;
+
+ q->drop_overlimit++;
+ /* Return Congestion Notification only if we dropped a packet
+ * from this flow.
+ */
+ if (fq_codel_drop(sch) == idx)
+ return NET_XMIT_CN;
+
+ /* As we dropped a packet, better let upper stack know this */
+ qdisc_tree_decrease_qlen(sch, 1);
+ return NET_XMIT_SUCCESS;
+}
+
+/* This is the specific function called from codel_dequeue()
+ * to dequeue a packet from queue. Note: backlog is handled in
+ * codel, we dont need to reduce it here.
+ */
+static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
+{
+ struct fq_codel_flow *flow;
+ struct sk_buff *skb = NULL;
+
+ flow = container_of(vars, struct fq_codel_flow, cvars);
+ if (flow->head) {
+ skb = dequeue_head(flow);
+ sch->qstats.backlog -= qdisc_pkt_len(skb);
+ sch->q.qlen--;
+ }
+ return skb;
+}
+
+static struct sk_buff *fq_codel_dequeue(struct Qdisc *sch)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ struct sk_buff *skb;
+ struct fq_codel_flow *flow;
+ struct list_head *head;
+ u32 prev_drop_count, prev_ecn_mark;
+
+begin:
+ head = &q->new_flows;
+ if (list_empty(head)) {
+ head = &q->old_flows;
+ if (list_empty(head))
+ return NULL;
+ }
+ flow = list_first_entry(head, struct fq_codel_flow, flowchain);
+
+ if (flow->deficit <= 0) {
+ flow->deficit += q->quantum;
+ list_move_tail(&flow->flowchain, &q->old_flows);
+ goto begin;
+ }
+
+ prev_drop_count = q->cstats.drop_count;
+ prev_ecn_mark = q->cstats.ecn_mark;
+
+ skb = codel_dequeue(sch, &q->cparams, &flow->cvars, &q->cstats,
+ dequeue, &q->backlogs[flow - q->flows]);
+
+ flow->dropped += q->cstats.drop_count - prev_drop_count;
+ flow->dropped += q->cstats.ecn_mark - prev_ecn_mark;
+
+ if (!skb) {
+ /* force a pass through old_flows to prevent starvation */
+ if ((head == &q->new_flows) && !list_empty(&q->old_flows))
+ list_move_tail(&flow->flowchain, &q->old_flows);
+ else
+ list_del_init(&flow->flowchain);
+ goto begin;
+ }
+ qdisc_bstats_update(sch, skb);
+ flow->deficit -= qdisc_pkt_len(skb);
+ /* We cant call qdisc_tree_decrease_qlen() if our qlen is 0,
+ * or HTB crashes. Defer it for next round.
+ */
+ if (q->cstats.drop_count && sch->q.qlen) {
+ qdisc_tree_decrease_qlen(sch, q->cstats.drop_count);
+ q->cstats.drop_count = 0;
+ }
+ return skb;
+}
+
+static void fq_codel_reset(struct Qdisc *sch)
+{
+ struct sk_buff *skb;
+
+ while ((skb = fq_codel_dequeue(sch)) != NULL)
+ kfree_skb(skb);
+}
+
+static const struct nla_policy fq_codel_policy[TCA_FQ_CODEL_MAX + 1] = {
+ [TCA_FQ_CODEL_TARGET] = { .type = NLA_U32 },
+ [TCA_FQ_CODEL_LIMIT] = { .type = NLA_U32 },
+ [TCA_FQ_CODEL_INTERVAL] = { .type = NLA_U32 },
+ [TCA_FQ_CODEL_ECN] = { .type = NLA_U32 },
+ [TCA_FQ_CODEL_FLOWS] = { .type = NLA_U32 },
+ [TCA_FQ_CODEL_QUANTUM] = { .type = NLA_U32 },
+};
+
+static int fq_codel_change(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ struct nlattr *tb[TCA_FQ_CODEL_MAX + 1];
+ int err;
+
+ if (!opt)
+ return -EINVAL;
+
+ err = nla_parse_nested(tb, TCA_FQ_CODEL_MAX, opt, fq_codel_policy);
+ if (err < 0)
+ return err;
+ if (tb[TCA_FQ_CODEL_FLOWS]) {
+ if (q->flows)
+ return -EINVAL;
+ q->flows_cnt = nla_get_u32(tb[TCA_FQ_CODEL_FLOWS]);
+ if (!q->flows_cnt ||
+ q->flows_cnt > 65536)
+ return -EINVAL;
+ }
+ sch_tree_lock(sch);
+
+ if (tb[TCA_FQ_CODEL_TARGET]) {
+ u64 target = nla_get_u32(tb[TCA_FQ_CODEL_TARGET]);
+
+ q->cparams.target = (target * NSEC_PER_USEC) >> CODEL_SHIFT;
+ }
+
+ if (tb[TCA_FQ_CODEL_INTERVAL]) {
+ u64 interval = nla_get_u32(tb[TCA_FQ_CODEL_INTERVAL]);
+
+ q->cparams.interval = (interval * NSEC_PER_USEC) >> CODEL_SHIFT;
+ }
+
+ if (tb[TCA_FQ_CODEL_LIMIT])
+ sch->limit = nla_get_u32(tb[TCA_FQ_CODEL_LIMIT]);
+
+ if (tb[TCA_FQ_CODEL_ECN])
+ q->cparams.ecn = !!nla_get_u32(tb[TCA_FQ_CODEL_ECN]);
+
+ if (tb[TCA_FQ_CODEL_QUANTUM])
+ q->quantum = max(256U, nla_get_u32(tb[TCA_FQ_CODEL_QUANTUM]));
+
+ while (sch->q.qlen > sch->limit) {
+ struct sk_buff *skb = fq_codel_dequeue(sch);
+
+ kfree_skb(skb);
+ q->cstats.drop_count++;
+ }
+ qdisc_tree_decrease_qlen(sch, q->cstats.drop_count);
+ q->cstats.drop_count = 0;
+
+ sch_tree_unlock(sch);
+ return 0;
+}
+
+static void *fq_codel_zalloc(size_t sz)
+{
+ void *ptr = kzalloc(sz, GFP_KERNEL | __GFP_NOWARN);
+
+ if (!ptr)
+ ptr = vzalloc(sz);
+ return ptr;
+}
+
+static void fq_codel_free(void *addr)
+{
+ if (addr) {
+ if (is_vmalloc_addr(addr))
+ vfree(addr);
+ else
+ kfree(addr);
+ }
+}
+
+static void fq_codel_destroy(struct Qdisc *sch)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+
+ tcf_destroy_chain(&q->filter_list);
+ fq_codel_free(q->backlogs);
+ fq_codel_free(q->flows);
+}
+
+static int fq_codel_init(struct Qdisc *sch, struct nlattr *opt)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ int i;
+
+ sch->limit = 10*1024;
+ q->flows_cnt = 1024;
+ q->quantum = psched_mtu(qdisc_dev(sch));
+ q->perturbation = net_random();
+ INIT_LIST_HEAD(&q->new_flows);
+ INIT_LIST_HEAD(&q->old_flows);
+ codel_params_init(&q->cparams);
+ codel_stats_init(&q->cstats);
+ q->cparams.ecn = true;
+
+ if (opt) {
+ int err = fq_codel_change(sch, opt);
+ if (err)
+ return err;
+ }
+
+ if (!q->flows) {
+ q->flows = fq_codel_zalloc(q->flows_cnt *
+ sizeof(struct fq_codel_flow));
+ if (!q->flows)
+ return -ENOMEM;
+ q->backlogs = fq_codel_zalloc(q->flows_cnt * sizeof(u32));
+ if (!q->backlogs) {
+ fq_codel_free(q->flows);
+ return -ENOMEM;
+ }
+ for (i = 0; i < q->flows_cnt; i++) {
+ struct fq_codel_flow *flow = q->flows + i;
+
+ INIT_LIST_HEAD(&flow->flowchain);
+ }
+ }
+ if (sch->limit >= 1)
+ sch->flags |= TCQ_F_CAN_BYPASS;
+ else
+ sch->flags &= ~TCQ_F_CAN_BYPASS;
+ return 0;
+}
+
+static int fq_codel_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ struct nlattr *opts;
+
+ opts = nla_nest_start(skb, TCA_OPTIONS);
+ if (opts == NULL)
+ goto nla_put_failure;
+
+ if (nla_put_u32(skb, TCA_FQ_CODEL_TARGET,
+ codel_time_to_us(q->cparams.target)) ||
+ nla_put_u32(skb, TCA_FQ_CODEL_LIMIT,
+ sch->limit) ||
+ nla_put_u32(skb, TCA_FQ_CODEL_INTERVAL,
+ codel_time_to_us(q->cparams.interval)) ||
+ nla_put_u32(skb, TCA_FQ_CODEL_ECN,
+ q->cparams.ecn) ||
+ nla_put_u32(skb, TCA_FQ_CODEL_QUANTUM,
+ q->quantum) ||
+ nla_put_u32(skb, TCA_FQ_CODEL_FLOWS,
+ q->flows_cnt))
+ goto nla_put_failure;
+
+ nla_nest_end(skb, opts);
+ return skb->len;
+
+nla_put_failure:
+ return -1;
+}
+
+static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ struct tc_fq_codel_xstats st = {
+ .type = TCA_FQ_CODEL_XSTATS_QDISC,
+ .qdisc_stats.maxpacket = q->cstats.maxpacket,
+ .qdisc_stats.drop_overlimit = q->drop_overlimit,
+ .qdisc_stats.ecn_mark = q->cstats.ecn_mark,
+ .qdisc_stats.new_flow_count = q->new_flow_count,
+ };
+ struct list_head *pos;
+
+ list_for_each(pos, &q->new_flows)
+ st.qdisc_stats.new_flows_len++;
+
+ list_for_each(pos, &q->old_flows)
+ st.qdisc_stats.old_flows_len++;
+
+ return gnet_stats_copy_app(d, &st, sizeof(st));
+}
+
+static struct Qdisc *fq_codel_leaf(struct Qdisc *sch, unsigned long arg)
+{
+ return NULL;
+}
+
+static unsigned long fq_codel_get(struct Qdisc *sch, u32 classid)
+{
+ return 0;
+}
+
+static unsigned long fq_codel_bind(struct Qdisc *sch, unsigned long parent,
+ u32 classid)
+{
+ /* we cannot bypass queue discipline anymore */
+ sch->flags &= ~TCQ_F_CAN_BYPASS;
+ return 0;
+}
+
+static void fq_codel_put(struct Qdisc *q, unsigned long cl)
+{
+}
+
+static struct tcf_proto **fq_codel_find_tcf(struct Qdisc *sch, unsigned long cl)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+
+ if (cl)
+ return NULL;
+ return &q->filter_list;
+}
+
+static int fq_codel_dump_class(struct Qdisc *sch, unsigned long cl,
+ struct sk_buff *skb, struct tcmsg *tcm)
+{
+ tcm->tcm_handle |= TC_H_MIN(cl);
+ return 0;
+}
+
+static int fq_codel_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+ struct gnet_dump *d)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ u32 idx = cl - 1;
+ struct gnet_stats_queue qs = { 0 };
+ struct tc_fq_codel_xstats xstats;
+
+ if (idx < q->flows_cnt) {
+ const struct fq_codel_flow *flow = &q->flows[idx];
+ const struct sk_buff *skb = flow->head;
+
+ memset(&xstats, 0, sizeof(xstats));
+ xstats.type = TCA_FQ_CODEL_XSTATS_CLASS;
+ xstats.class_stats.deficit = flow->deficit;
+ xstats.class_stats.ldelay =
+ codel_time_to_us(flow->cvars.ldelay);
+ xstats.class_stats.count = flow->cvars.count;
+ xstats.class_stats.lastcount = flow->cvars.lastcount;
+ xstats.class_stats.dropping = flow->cvars.dropping;
+ if (flow->cvars.dropping) {
+ codel_tdiff_t delta = flow->cvars.drop_next -
+ codel_get_time();
+
+ xstats.class_stats.drop_next = (delta >= 0) ?
+ codel_time_to_us(delta) :
+ -codel_time_to_us(-delta);
+ }
+ while (skb) {
+ qs.qlen++;
+ skb = skb->next;
+ }
+ qs.backlog = q->backlogs[idx];
+ qs.drops = flow->dropped;
+ }
+ if (gnet_stats_copy_queue(d, &qs) < 0)
+ return -1;
+ if (idx < q->flows_cnt)
+ return gnet_stats_copy_app(d, &xstats, sizeof(xstats));
+ return 0;
+}
+
+static void fq_codel_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
+ unsigned int i;
+
+ if (arg->stop)
+ return;
+
+ for (i = 0; i < q->flows_cnt; i++) {
+ if (list_empty(&q->flows[i].flowchain) ||
+ arg->count < arg->skip) {
+ arg->count++;
+ continue;
+ }
+ if (arg->fn(sch, i + 1, arg) < 0) {
+ arg->stop = 1;
+ break;
+ }
+ arg->count++;
+ }
+}
+
+static const struct Qdisc_class_ops fq_codel_class_ops = {
+ .leaf = fq_codel_leaf,
+ .get = fq_codel_get,
+ .put = fq_codel_put,
+ .tcf_chain = fq_codel_find_tcf,
+ .bind_tcf = fq_codel_bind,
+ .unbind_tcf = fq_codel_put,
+ .dump = fq_codel_dump_class,
+ .dump_stats = fq_codel_dump_class_stats,
+ .walk = fq_codel_walk,
+};
+
+static struct Qdisc_ops fq_codel_qdisc_ops __read_mostly = {
+ .cl_ops = &fq_codel_class_ops,
+ .id = "fq_codel",
+ .priv_size = sizeof(struct fq_codel_sched_data),
+ .enqueue = fq_codel_enqueue,
+ .dequeue = fq_codel_dequeue,
+ .peek = qdisc_peek_dequeued,
+ .drop = fq_codel_drop,
+ .init = fq_codel_init,
+ .reset = fq_codel_reset,
+ .destroy = fq_codel_destroy,
+ .change = fq_codel_change,
+ .dump = fq_codel_dump,
+ .dump_stats = fq_codel_dump_stats,
+ .owner = THIS_MODULE,
+};
+
+static int __init fq_codel_module_init(void)
+{
+ return register_qdisc(&fq_codel_qdisc_ops);
+}
+
+static void __exit fq_codel_module_exit(void)
+{
+ unregister_qdisc(&fq_codel_qdisc_ops);
+}
+
+module_init(fq_codel_module_init)
+module_exit(fq_codel_module_exit)
+MODULE_AUTHOR("Eric Dumazet");
+MODULE_LICENSE("GPL");
--
1.7.10

View File

@ -0,0 +1,37 @@
From: Geert Uytterhoeven <geert@linux-m68k.org>
Date: Mon, 14 May 2012 09:47:05 +0000
Subject: [PATCH 4/7] net/codel: Add missing #include <linux/prefetch.h>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
commit ce5b4b977127ee20c3f9c3fd3637cd3796f649f5 upstream.
m68k allmodconfig:
net/sched/sch_codel.c: In function dequeue:
net/sched/sch_codel.c:70: error: implicit declaration of function prefetch
make[1]: *** [net/sched/sch_codel.o] Error 1
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/sched/sch_codel.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index b4a1a81..213ef60 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -46,6 +46,7 @@
#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/skbuff.h>
+#include <linux/prefetch.h>
#include <net/pkt_sched.h>
#include <net/codel.h>
--
1.7.10

View File

@ -0,0 +1,55 @@
From: Sasha Levin <levinsasha928@gmail.com>
Date: Mon, 14 May 2012 11:57:06 +0000
Subject: [PATCH 5/7] net: codel: fix build errors
commit 669d67bf777def468970f2dcba1537edf3b2d329 upstream.
Fix the following build error:
net/sched/sch_fq_codel.c: In function 'fq_codel_dump_stats':
net/sched/sch_fq_codel.c:464:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:464:3: warning: missing braces around initializer
net/sched/sch_fq_codel.c:464:3: warning: (near initialization for 'st.<anonymous>')
net/sched/sch_fq_codel.c:465:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:465:3: warning: excess elements in struct initializer
net/sched/sch_fq_codel.c:465:3: warning: (near initialization for 'st')
net/sched/sch_fq_codel.c:466:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:466:3: warning: excess elements in struct initializer
net/sched/sch_fq_codel.c:466:3: warning: (near initialization for 'st')
net/sched/sch_fq_codel.c:467:3: error: unknown field 'qdisc_stats' specified in initializer
net/sched/sch_fq_codel.c:467:3: warning: excess elements in struct initializer
net/sched/sch_fq_codel.c:467:3: warning: (near initialization for 'st')
make[1]: *** [net/sched/sch_fq_codel.o] Error 1
Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
net/sched/sch_fq_codel.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index a7b3754..337ff20 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -461,13 +461,14 @@ static int fq_codel_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
struct fq_codel_sched_data *q = qdisc_priv(sch);
struct tc_fq_codel_xstats st = {
.type = TCA_FQ_CODEL_XSTATS_QDISC,
- .qdisc_stats.maxpacket = q->cstats.maxpacket,
- .qdisc_stats.drop_overlimit = q->drop_overlimit,
- .qdisc_stats.ecn_mark = q->cstats.ecn_mark,
- .qdisc_stats.new_flow_count = q->new_flow_count,
};
struct list_head *pos;
+ st.qdisc_stats.maxpacket = q->cstats.maxpacket;
+ st.qdisc_stats.drop_overlimit = q->drop_overlimit;
+ st.qdisc_stats.ecn_mark = q->cstats.ecn_mark;
+ st.qdisc_stats.new_flow_count = q->new_flow_count;
+
list_for_each(pos, &q->new_flows)
st.qdisc_stats.new_flows_len++;
--
1.7.10

View File

@ -0,0 +1,90 @@
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sat, 12 May 2012 21:23:23 +0000
Subject: [PATCH 6/7] codel: use u16 field instead of 31bits for rec_inv_sqrt
commit 6ff272c9ad65eda219cd975b9da2dbc31cc812ee upstream.
David pointed out gcc might generate poor code with 31bit fields.
Using u16 is more than enough and permits a better code output.
Also make the code intent more readable using constants, fixed point arithmetic
not being trivial for everybody.
Suggested-by: David Miller <davem@davemloft.net>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/codel.h | 25 +++++++++++++++----------
1 file changed, 15 insertions(+), 10 deletions(-)
diff --git a/include/net/codel.h b/include/net/codel.h
index bd8747c..7546517 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -133,13 +133,17 @@ struct codel_params {
struct codel_vars {
u32 count;
u32 lastcount;
- bool dropping:1;
- u32 rec_inv_sqrt:31;
+ bool dropping;
+ u16 rec_inv_sqrt;
codel_time_t first_above_time;
codel_time_t drop_next;
codel_time_t ldelay;
};
+#define REC_INV_SQRT_BITS (8 * sizeof(u16)) /* or sizeof_in_bits(rec_inv_sqrt) */
+/* needed shift to get a Q0.32 number from rec_inv_sqrt */
+#define REC_INV_SQRT_SHIFT (32 - REC_INV_SQRT_BITS)
+
/**
* struct codel_stats - contains codel shared variables and stats
* @maxpacket: largest packet we've seen so far
@@ -173,17 +177,18 @@ static void codel_stats_init(struct codel_stats *stats)
* http://en.wikipedia.org/wiki/Methods_of_computing_square_roots#Iterative_methods_for_reciprocal_square_roots
* new_invsqrt = (invsqrt / 2) * (3 - count * invsqrt^2)
*
- * Here, invsqrt is a fixed point number (< 1.0), 31bit mantissa)
+ * Here, invsqrt is a fixed point number (< 1.0), 32bit mantissa, aka Q0.32
*/
static void codel_Newton_step(struct codel_vars *vars)
{
- u32 invsqrt = vars->rec_inv_sqrt;
- u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 31;
- u64 val = (3LL << 31) - ((u64)vars->count * invsqrt2);
+ u32 invsqrt = ((u32)vars->rec_inv_sqrt) << REC_INV_SQRT_SHIFT;
+ u32 invsqrt2 = ((u64)invsqrt * invsqrt) >> 32;
+ u64 val = (3LL << 32) - ((u64)vars->count * invsqrt2);
- val = (val * invsqrt) >> 32;
+ val >>= 2; /* avoid overflow in following multiply */
+ val = (val * invsqrt) >> (32 - 2 + 1);
- vars->rec_inv_sqrt = val;
+ vars->rec_inv_sqrt = val >> REC_INV_SQRT_SHIFT;
}
/*
@@ -195,7 +200,7 @@ static codel_time_t codel_control_law(codel_time_t t,
codel_time_t interval,
u32 rec_inv_sqrt)
{
- return t + reciprocal_divide(interval, rec_inv_sqrt << 1);
+ return t + reciprocal_divide(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
}
@@ -326,7 +331,7 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
codel_Newton_step(vars);
} else {
vars->count = 1;
- vars->rec_inv_sqrt = 0x7fffffff;
+ vars->rec_inv_sqrt = ~0U >> REC_INV_SQRT_SHIFT;
}
vars->lastcount = vars->count;
vars->drop_next = codel_control_law(now, params->interval,
--
1.7.10

View File

@ -0,0 +1,140 @@
From: Eric Dumazet <edumazet@google.com>
Date: Wed, 16 May 2012 04:39:09 +0000
Subject: [PATCH 7/7] fq_codel: should use qdisc backlog as threshold
commit 865ec5523dadbedefbc5710a68969f686a28d928 upstream.
codel_should_drop() logic allows a packet being not dropped if queue
size is under max packet size.
In fq_codel, we have two possible backlogs : The qdisc global one, and
the flow local one.
The meaningful one for codel_should_drop() should be the global backlog,
not the per flow one, so that thin flows can have a non zero drop/mark
probability.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Dave Taht <dave.taht@bufferbloat.net>
Cc: Kathleen Nichols <nichols@pollere.com>
Cc: Van Jacobson <van@pollere.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
include/net/codel.h | 15 +++++++--------
net/sched/sch_codel.c | 4 ++--
net/sched/sch_fq_codel.c | 5 +++--
3 files changed, 12 insertions(+), 12 deletions(-)
diff --git a/include/net/codel.h b/include/net/codel.h
index 7546517..550debf 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -205,7 +205,7 @@ static codel_time_t codel_control_law(codel_time_t t,
static bool codel_should_drop(const struct sk_buff *skb,
- unsigned int *backlog,
+ struct Qdisc *sch,
struct codel_vars *vars,
struct codel_params *params,
struct codel_stats *stats,
@@ -219,13 +219,13 @@ static bool codel_should_drop(const struct sk_buff *skb,
}
vars->ldelay = now - codel_get_enqueue_time(skb);
- *backlog -= qdisc_pkt_len(skb);
+ sch->qstats.backlog -= qdisc_pkt_len(skb);
if (unlikely(qdisc_pkt_len(skb) > stats->maxpacket))
stats->maxpacket = qdisc_pkt_len(skb);
if (codel_time_before(vars->ldelay, params->target) ||
- *backlog <= stats->maxpacket) {
+ sch->qstats.backlog <= stats->maxpacket) {
/* went below - stay below for at least interval */
vars->first_above_time = 0;
return false;
@@ -249,8 +249,7 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
struct codel_params *params,
struct codel_vars *vars,
struct codel_stats *stats,
- codel_skb_dequeue_t dequeue_func,
- u32 *backlog)
+ codel_skb_dequeue_t dequeue_func)
{
struct sk_buff *skb = dequeue_func(vars, sch);
codel_time_t now;
@@ -261,7 +260,7 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
return skb;
}
now = codel_get_time();
- drop = codel_should_drop(skb, backlog, vars, params, stats, now);
+ drop = codel_should_drop(skb, sch, vars, params, stats, now);
if (vars->dropping) {
if (!drop) {
/* sojourn time below target - leave dropping state */
@@ -292,7 +291,7 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
qdisc_drop(skb, sch);
stats->drop_count++;
skb = dequeue_func(vars, sch);
- if (!codel_should_drop(skb, backlog,
+ if (!codel_should_drop(skb, sch,
vars, params, stats, now)) {
/* leave dropping state */
vars->dropping = false;
@@ -313,7 +312,7 @@ static struct sk_buff *codel_dequeue(struct Qdisc *sch,
stats->drop_count++;
skb = dequeue_func(vars, sch);
- drop = codel_should_drop(skb, backlog, vars, params,
+ drop = codel_should_drop(skb, sch, vars, params,
stats, now);
}
vars->dropping = true;
diff --git a/net/sched/sch_codel.c b/net/sched/sch_codel.c
index 213ef60..2f9ab17 100644
--- a/net/sched/sch_codel.c
+++ b/net/sched/sch_codel.c
@@ -77,8 +77,8 @@ static struct sk_buff *codel_qdisc_dequeue(struct Qdisc *sch)
struct codel_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
- skb = codel_dequeue(sch, &q->params, &q->vars, &q->stats,
- dequeue, &sch->qstats.backlog);
+ skb = codel_dequeue(sch, &q->params, &q->vars, &q->stats, dequeue);
+
/* We cant call qdisc_tree_decrease_qlen() if our qlen is 0,
* or HTB crashes. Defer it for next round.
*/
diff --git a/net/sched/sch_fq_codel.c b/net/sched/sch_fq_codel.c
index 337ff20..9fc1c62 100644
--- a/net/sched/sch_fq_codel.c
+++ b/net/sched/sch_fq_codel.c
@@ -217,13 +217,14 @@ static int fq_codel_enqueue(struct sk_buff *skb, struct Qdisc *sch)
*/
static struct sk_buff *dequeue(struct codel_vars *vars, struct Qdisc *sch)
{
+ struct fq_codel_sched_data *q = qdisc_priv(sch);
struct fq_codel_flow *flow;
struct sk_buff *skb = NULL;
flow = container_of(vars, struct fq_codel_flow, cvars);
if (flow->head) {
skb = dequeue_head(flow);
- sch->qstats.backlog -= qdisc_pkt_len(skb);
+ q->backlogs[flow - q->flows] -= qdisc_pkt_len(skb);
sch->q.qlen--;
}
return skb;
@@ -256,7 +257,7 @@ begin:
prev_ecn_mark = q->cstats.ecn_mark;
skb = codel_dequeue(sch, &q->cparams, &flow->cvars, &q->cstats,
- dequeue, &q->backlogs[flow - q->flows]);
+ dequeue);
flow->dropped += q->cstats.drop_count - prev_drop_count;
flow->dropped += q->cstats.ecn_mark - prev_ecn_mark;
--
1.7.10

View File

@ -0,0 +1,90 @@
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2011 20:30:35 +0000
Subject: [PATCH 2/3] flow_dissector: use a 64bit load/store
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
commit 4d77d2b567ec66a443792d99e96ac760991d80d0 upstream.
Le lundi 28 novembre 2011 à 19:06 -0500, David Miller a écrit :
> From: Dimitris Michailidis <dm@chelsio.com>
> Date: Mon, 28 Nov 2011 08:25:39 -0800
>
> >> +bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys
> >> *flow)
> >> +{
> >> + int poff, nhoff = skb_network_offset(skb);
> >> + u8 ip_proto;
> >> + u16 proto = skb->protocol;
> >
> > __be16 instead of u16 for proto?
>
> I'll take care of this when I apply these patches.
( CC trimmed )
Thanks David !
Here is a small patch to use one 64bit load/store on x86_64 instead of
two 32bit load/stores.
[PATCH net-next] flow_dissector: use a 64bit load/store
gcc compiler is smart enough to use a single load/store if we
memcpy(dptr, sptr, 8) on x86_64, regardless of
CONFIG_CC_OPTIMIZE_FOR_SIZE
In IP header, daddr immediately follows saddr, this wont change in the
future. We only need to make sure our flow_keys (src,dst) fields wont
break the rule.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
index e4cb285..80461c1 100644
--- a/include/net/flow_keys.h
+++ b/include/net/flow_keys.h
@@ -2,6 +2,7 @@
#define _NET_FLOW_KEYS_H
struct flow_keys {
+ /* (src,dst) must be grouped, in the same way than in IP header */
__be32 src;
__be32 dst;
union {
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index f0516d9..0985b9b 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -8,6 +8,16 @@
#include <linux/ppp_defs.h>
#include <net/flow_keys.h>
+/* copy saddr & daddr, possibly using 64bit load/store
+ * Equivalent to : flow->src = iph->saddr;
+ * flow->dst = iph->daddr;
+ */
+static void iph_to_flow_copy_addrs(struct flow_keys *flow, const struct iphdr *iph)
+{
+ BUILD_BUG_ON(offsetof(typeof(*flow), dst) !=
+ offsetof(typeof(*flow), src) + sizeof(flow->src));
+ memcpy(&flow->src, &iph->saddr, sizeof(flow->src) + sizeof(flow->dst));
+}
bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
{
@@ -31,8 +41,7 @@ ip:
ip_proto = 0;
else
ip_proto = iph->protocol;
- flow->src = iph->saddr;
- flow->dst = iph->daddr;
+ iph_to_flow_copy_addrs(flow, iph);
nhoff += iph->ihl * 4;
break;
}
--
1.7.10

View File

@ -0,0 +1,25 @@
From: Jesper Dangaard Brouer <hawk@comx.dk>
Date: Tue, 24 Jan 2012 16:03:33 -0500
Subject: [PATCH 3/3] net: flow_dissector.c missing include linux/export.h
commit c452ed70771cea3af73d21a5914989137fbd28b8 upstream.
The file net/core/flow_dissector.c seems to be missing
including linux/export.h.
Signed-off-by: Jesper Dangaard Brouer <hawk@comx.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 0985b9b..a225089 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -1,4 +1,5 @@
#include <linux/skbuff.h>
+#include <linux/export.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
#include <linux/if_vlan.h>
--
1.7.10

View File

@ -0,0 +1,195 @@
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 28 Nov 2011 05:22:18 +0000
Subject: [PATCH 1/3] net: introduce skb_flow_dissect()
commit 0744dd00c1b1be99a25b62b1b48df440e82e57e0 upstream.
We use at least two flow dissectors in network stack, with known
limitations and code duplication.
Introduce skb_flow_dissect() to factorize this, highly inspired from
existing dissector from __skb_get_rxhash()
Note : We extensively use skb_header_pointer(), this permits us to not
touch skb at all.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/include/net/flow_keys.h b/include/net/flow_keys.h
new file mode 100644
index 0000000..e4cb285
--- /dev/null
+++ b/include/net/flow_keys.h
@@ -0,0 +1,15 @@
+#ifndef _NET_FLOW_KEYS_H
+#define _NET_FLOW_KEYS_H
+
+struct flow_keys {
+ __be32 src;
+ __be32 dst;
+ union {
+ __be32 ports;
+ __be16 port16[2];
+ };
+ u8 ip_proto;
+};
+
+extern bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow);
+#endif
diff --git a/net/core/Makefile b/net/core/Makefile
index 3606d40..c4ecc86 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -3,7 +3,7 @@
#
obj-y := sock.o request_sock.o skbuff.o iovec.o datagram.o stream.o scm.o \
- gen_stats.o gen_estimator.o net_namespace.o secure_seq.o
+ gen_stats.o gen_estimator.o net_namespace.o secure_seq.o flow_dissector.o
obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
new file mode 100644
index 0000000..f0516d9
--- /dev/null
+++ b/net/core/flow_dissector.c
@@ -0,0 +1,134 @@
+#include <linux/skbuff.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/if_vlan.h>
+#include <net/ip.h>
+#include <linux/if_tunnel.h>
+#include <linux/if_pppox.h>
+#include <linux/ppp_defs.h>
+#include <net/flow_keys.h>
+
+
+bool skb_flow_dissect(const struct sk_buff *skb, struct flow_keys *flow)
+{
+ int poff, nhoff = skb_network_offset(skb);
+ u8 ip_proto;
+ __be16 proto = skb->protocol;
+
+ memset(flow, 0, sizeof(*flow));
+
+again:
+ switch (proto) {
+ case __constant_htons(ETH_P_IP): {
+ const struct iphdr *iph;
+ struct iphdr _iph;
+ip:
+ iph = skb_header_pointer(skb, nhoff, sizeof(_iph), &_iph);
+ if (!iph)
+ return false;
+
+ if (ip_is_fragment(iph))
+ ip_proto = 0;
+ else
+ ip_proto = iph->protocol;
+ flow->src = iph->saddr;
+ flow->dst = iph->daddr;
+ nhoff += iph->ihl * 4;
+ break;
+ }
+ case __constant_htons(ETH_P_IPV6): {
+ const struct ipv6hdr *iph;
+ struct ipv6hdr _iph;
+ipv6:
+ iph = skb_header_pointer(skb, nhoff, sizeof(_iph), &_iph);
+ if (!iph)
+ return false;
+
+ ip_proto = iph->nexthdr;
+ flow->src = iph->saddr.s6_addr32[3];
+ flow->dst = iph->daddr.s6_addr32[3];
+ nhoff += sizeof(struct ipv6hdr);
+ break;
+ }
+ case __constant_htons(ETH_P_8021Q): {
+ const struct vlan_hdr *vlan;
+ struct vlan_hdr _vlan;
+
+ vlan = skb_header_pointer(skb, nhoff, sizeof(_vlan), &_vlan);
+ if (!vlan)
+ return false;
+
+ proto = vlan->h_vlan_encapsulated_proto;
+ nhoff += sizeof(*vlan);
+ goto again;
+ }
+ case __constant_htons(ETH_P_PPP_SES): {
+ struct {
+ struct pppoe_hdr hdr;
+ __be16 proto;
+ } *hdr, _hdr;
+ hdr = skb_header_pointer(skb, nhoff, sizeof(_hdr), &_hdr);
+ if (!hdr)
+ return false;
+ proto = hdr->proto;
+ nhoff += PPPOE_SES_HLEN;
+ switch (proto) {
+ case __constant_htons(PPP_IP):
+ goto ip;
+ case __constant_htons(PPP_IPV6):
+ goto ipv6;
+ default:
+ return false;
+ }
+ }
+ default:
+ return false;
+ }
+
+ switch (ip_proto) {
+ case IPPROTO_GRE: {
+ struct gre_hdr {
+ __be16 flags;
+ __be16 proto;
+ } *hdr, _hdr;
+
+ hdr = skb_header_pointer(skb, nhoff, sizeof(_hdr), &_hdr);
+ if (!hdr)
+ return false;
+ /*
+ * Only look inside GRE if version zero and no
+ * routing
+ */
+ if (!(hdr->flags & (GRE_VERSION|GRE_ROUTING))) {
+ proto = hdr->proto;
+ nhoff += 4;
+ if (hdr->flags & GRE_CSUM)
+ nhoff += 4;
+ if (hdr->flags & GRE_KEY)
+ nhoff += 4;
+ if (hdr->flags & GRE_SEQ)
+ nhoff += 4;
+ goto again;
+ }
+ break;
+ }
+ case IPPROTO_IPIP:
+ goto again;
+ default:
+ break;
+ }
+
+ flow->ip_proto = ip_proto;
+ poff = proto_ports_offset(ip_proto);
+ if (poff >= 0) {
+ __be32 *ports, _ports;
+
+ nhoff += poff;
+ ports = skb_header_pointer(skb, nhoff, sizeof(_ports), &_ports);
+ if (ports)
+ flow->ports = *ports;
+ }
+
+ return true;
+}
+EXPORT_SYMBOL(skb_flow_dissect);
--
1.7.10

View File

@ -293,3 +293,15 @@
+ features/all/be2net/0056-be2net-Record-receive-queue-index-in-skb-to-aid-RPS.patch
+ features/all/be2net/0057-be2net-Fix-EEH-error-reset-before-a-flash-dump-compl.patch
+ features/all/be2net/0058-be2net-avoid-disabling-sriov-while-VFs-are-assigned.patch
# Add CoDel from 3.5, and prerequisites
+ features/all/net-introduce-skb_flow_dissect.patch
+ features/all/flow_dissector-use-a-64bit-load-store.patch
+ features/all/net-flow_dissector.c-missing-include-linux-export.h.patch
+ features/all/codel/0001-codel-Controlled-Delay-AQM.patch
+ features/all/codel/0002-codel-use-Newton-method-instead-of-sqrt-and-divides.patch
+ features/all/codel/0003-fq_codel-Fair-Queue-Codel-AQM.patch
+ features/all/codel/0004-net-codel-Add-missing-include-linux-prefetch.h.patch
+ features/all/codel/0005-net-codel-fix-build-errors.patch
+ features/all/codel/0006-codel-use-u16-field-instead-of-31bits-for-rec_inv_sq.patch
+ features/all/codel/0007-fq_codel-should-use-qdisc-backlog-as-threshold.patch