Hi All,
Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.
Previous versions of the patches were posted here:
(V1) http://lkml.org/lkml/2009/3/11/486
(V2) http://lkml.org/lkml/2009/5/5/275
(V3) http://lkml.org/lkml/2009/5/26/472
(V4) http://lkml.org/lkml/2009/6/8/580
(V5) http://lkml.org/lkml/2009/6/19/279
This patchset is still a work in progress, but I want to keep posting
snapshots of my tree at regular intervals to get feedback, hence V6.
Changes from V5
===============
- Broke down two of the biggest patches into smaller patches. The core bfq
scheduler changes are now a separate patch, which should make review a bit
easier. I will try to break the patches down even more.
- Broke out bfq core scheduler changes from flat fair queuing code.
- Created a separate patch for in-class preemption logic.
- Created a separate patch for core bfq hierarchical scheduler
changes.
- Created a separate patch for cgroup related bits.
- Introduced a new patch to wait for requests to complete from previous
queue before next queue is scheduled. It helps in achieving better
accounting of disk time used by writes and hence better isolation between
reads and buffered writes. This helps achieve fairness between sync queues
and buffered writes.
- Merged Gui's patch for optimization during io group deletion.
- Merged Gui's per device rule interface patch resulting from Paul Menage's
feedback.
- Merged Gui's patch to read group data under rcu lock instead of taking
spin lock.
- Took care of some of Balbir's review comments on V5.
- Got rid of additional user defined data types: bfq-timestamp-t,
bfq-weight-t and bfq-service-t.
- Changed data type of "weight" to unsigned int.
- replaced *-extract() function names with *-remove().
- Renamed some of the bfq-* functions to io-* in comments.
- Misc code cleanups
- Moved io-get-io-group() and other common changes from patch
"implement per group bdi congestion interface" to upper patches.
- Made lots of functions static.
- Got rid of some forward declarations.
- Replaced rq-ioq() with req-ioq() and moved it to blkdev.h
- Some comment cleanups.
- Got rid of elv-ioq-set-slice-end()
- Got rid of redundant declaration of io-disconnect-groups().
- Got rid of io-group-ioq()
Limitations
===========
- This IO controller provides bandwidth control at the IO scheduler
level (leaf node in a stacked hierarchy of logical devices). So there can
be cases (depending on configuration) where an application does not see
proportional BW division at a higher-level logical device.
LWN has written an article about the issue here.
http://lwn.net/Articles/332839/
How to solve the issue of fairness at higher level logical devices
==================================================================
A couple of suggestions have come forward.
- Implement IO control at the IO scheduler layer and then, with the help of
some daemon, adjust the weight on underlying devices dynamically, depending
on what kind of BW guarantees are to be achieved at higher level logical
block devices.
- Also implement a higher level IO controller along with IO scheduler
based controller and let user choose one depending on his needs.
A higher level controller does not know about the assumptions/policies
of the underlying IO scheduler, hence it has the potential to break down
the IO scheduler's policy within a cgroup. A lower level controller
can work with IO scheduler much more closely and efficiently.
Other active IO controller developments
=======================================
IO throttling
for.
- Work on a better interface (possibly cgroup based) for configuring per
group request descriptor limits.
- Debug and fix some of the areas like page cache where higher weight cgroup
async writes are stuck behind lower weight cgroup async writes.
Testing
=======
I have been able to do some testing as follows. All my testing is with ext3
file system with a SATA drive which supports queue depth of 31.
Test1 (Fairness for synchronous reads)
======================================
- Created two cgroups with weights 1000 and 500 and ran one "dd" in each
(with CFQ scheduler and /sys/block/<device>/queue/fairness = 1).
The higher weight dd finishes first, and at that point my script reads the
cgroup files io.disk-time and io.disk-sectors for both the groups and
displays the results.
dd if=/mnt/$BLOCKDEV/zerofile1 of=/dev/null &
dd if=/mnt/$BLOCKDEV/zerofile2 of=/dev/null &
234179072 bytes (234 MB) copied, 3.9065 s, 59.9 MB/s
234179072 bytes (234 MB) copied, 5.19232 s, 45.1 MB/s
group1 time=8 16 2471 group1 sectors=8 16 457840
group2 time=8 16 1220 group2 sectors=8 16 225736
First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.
This patchset tries to provide fairness in terms of disk time received. group1
got almost double the disk time of group2 (2471 ms vs 1220 ms at the time the
first dd finished). These time and sectors statistics can be read from the
io.disk-time and io.disk-sectors files in the cgroup. More about it in the
documentation file.
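To make the statistics format concrete, here is a minimal userspace sketch in C (not part of the patchset) that parses a line of the form "<major> <minor> <value>" and computes the ratio between two groups. The names io_stat, parse_io_stat and ratio_x100 are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* One statistics line: "<major> <minor> <value>". */
struct io_stat {
	unsigned int major;
	unsigned int minor;
	unsigned long value;	/* disk time in ms, or sectors transferred */
};

/* Parse one line; returns 0 on success, -1 if a field is missing. */
static int parse_io_stat(const char *line, struct io_stat *st)
{
	if (sscanf(line, "%u %u %lu", &st->major, &st->minor, &st->value) != 3)
		return -1;
	return 0;
}

/* Ratio a/b scaled by 100, so 202 means roughly 2.02x. Integer math only. */
static unsigned long ratio_x100(const struct io_stat *a, const struct io_stat *b)
{
	assert(b->value != 0);
	return a->value * 100 / b->value;
}
```

For the numbers above, 2471 ms against 1220 ms gives 202 (about 2.02x), and the sector counts 457840 and 225736 give the same value.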
Test2 (Reader Vs Buffered Writes)
================================
Buffered writes can be problematic and can overwhelm readers, especially with
noop and deadline. IO controller can provide isolation between readers and
buffered (async) writers.
First I ran the test without io controller to see the severity of the issue.
Ran a hostile writer and then after 10 seconds started a reader and then
monitored the completion time of reader. Reader reads a 256 MB file. Tested
this with noop scheduler.
sample script
For more accurate accounting of disk time for buffered writes with queuing
hardware I had to set /sys/block/<disk>/queue/iosched/fairness to "2".
sample script
Note that the reader now finishes in much less time, and both group1 and
group2 got almost 3 seconds of disk time. Hence the io-controller provides
isolation from buffered writes.
Test3 (AIO)
===========
AIO reads
echo $$ > /cgroup/bfqio/test2/tasks
fio $fio-args
Note that the disk time given to group test1 is almost double that of group
test2.
AIO writes
echo $$ > /cgroup/bfqio/test2/tasks
fio $fio-args
The above shows that by the time the first fio (higher weight) finished, group
test1 got 28085 ms of disk time and group test2 got 14652 ms of disk time.
Similarly, the statistics for the number of sectors transferred are also shown.
Note that the disk time given to group test1 is almost double that of group
test2.
Test4 (Writes with O_SYNC)
==========================
Created two groups with weight 1000 and 500 and launched two fio jobs doing
sync writes.
sample script
# some code to read group data upon completion of first fio job
in proportional manner.
For example, consider two dd threads reading /dev/zero as input file and doing
writes of huge files. Very soon we will cross vm-dirty-ratio and dd thread will
be forced to write out some pages to disk before more pages can be dirtied. But
not necessarily dirty pages of same thread are picked. It can very well pick
the inode of lesser priority dd thread and do some writeout. So effectively
higher weight dd is doing writeouts of lower weight dd pages and we don't see
service differentiation.
IOW, the core problem with async write fairness is that the higher weight
thread does not throw enough IO traffic at the IO controller to keep the queue
continuously backlogged. In my testing, there are many .2 to .8 second
intervals where the higher weight queue is empty, and in that duration the
lower weight queue gets a lot of work done, giving the impression that there
was no service differentiation.
In summary, from the IO controller's point of view, async write support is
there. Because the page cache has not been designed so that a higher
prio/weight writer can do more writeout than a lower prio/weight writer,
getting service differentiation is hard: it is visible in some cases and not
in others.
Do we really care that much for fairness among two writer cgroups? One can
choose to do direct writes or sync writes if fairness for writes really
matters for him.
Following is the only case where it is hard to ensure fairness between cgroups.
- Buffered writes Vs Buffered Writes.
So to test async writes I created two partitions on a disk and created ext3
file systems on both the partitions. Also created two cgroups and generated
lots of write traffic in two cgroups (50 fio threads) and watched the disk
time statistics in respective cgroups at the interval of 2 seconds. Thanks to
Ryo Tsuruta for the test case.
*****************************************************************
sync
echo 3 > /proc/sys/vm/drop_caches
fio-args="***********************************************************************
I watched the disk time and sector statistics for both the cgroups
every 2 seconds using a script. Here is a snippet from the output.
test1 statistics: time=8 48 1315 sectors=8 48 55776 dq=8 48 1
test2 statistics: time=8 48 633 sectors=8 48 14720 dq=8 48 2
test1 statistics: time=8 48 5586 sectors=8 48 339064 dq=8 48 2
test2 statistics: time=8 48 2985 sectors=8 48 146656 dq=8 48 3
test1 statistics: time=8 48 9935 sectors=8 48 628728 dq=8 48 3
test2 statistics: time=8 48 5265 sectors=8 48 278688 dq=8 48 4
test1 statistics: time=8 48 14156 sectors=8 48 932488 dq=8 48 6
test2 statistics: time=8 48 7646 sectors=8 48 412704 dq=8 48 7
test1 statistics: time=8 48 18141 sectors=8 48 1231488 dq=8 48 10
test2 statistics: time=8 48 9820 sectors=8 48 548400 dq=8 48 8
test1 statistics: time=8 48 21953 sectors=8 48 1485632 dq=8 48 13
test2 statistics: time=8 48 12394 sectors=8 48 698288 dq=8 48 10
test1 statistics: time=8 48 25167 sectors=8 48 1705264 dq=8 48 13
test2 statistics: time=8 48 14042 sectors=8 48 817808 dq=8 48 10
First two fields in time and sectors statistics represent major and minor
number of the device. Third field represents disk time in milliseconds and
number of sectors transferred respectively.
So the disk time consumed by test1 is almost double that of test2 in this case.
Thanks
Vivek
PATCH 01/25 - io-controller: Documentation by Vivek Goyal on
2009-07-02T20:05:04+00:00
o Documentation for io-controller.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff - Deadline IO scheduler tunables
+io-controller.txt
+ - IO controller for providing hierarchical IO scheduling
ioprio.txt
- Block io priorities (in CFQ scheduler)
request.txt
diff + =============
+
+Overview
+========
+
+This patchset implements a proportional weight IO controller. That is one
+can create cgroups and assign prio/weights to those cgroups and task group
+will get access to disk proportionate to the weight of the group.
+
+These patches modify the elevator layer and individual IO schedulers to do
+IO control. Hence this io controller works only on block devices which use
+one of the standard io schedulers; it can not be used with an arbitrary
+logical block device.
+
+The assumption/thought behind modifying IO scheduler is that resource control
+is needed only on leaf nodes where the actual contention for resources is
+present and not on intermediate logical block devices.
+
+Consider the following hypothetical scenario. Let's say there are three physical
+disks, namely sda, sdb and sdc. Two logical volumes (lv0 and lv1) have been
+created on top of these. Some part of sdb is in lv0 and some part is in lv1.
+
+               lv0      lv1
+              /    \   /    \
+           sda      sdb      sdc
+
+Also consider following cgroup hierarchy
+
+                 root
+                /    \
+               A      B
+              / \    / \
+             T1  T2 T3  T4
+
+A and B are two cgroups and T1, T2, T3 and T4 are tasks within those cgroups.
+Assuming T1, T2, T3 and T4 are doing IO on lv0 and lv1, these tasks should
+get their fair share of bandwidth on disks sda, sdb and sdc. There is no
+IO control on intermediate logical block nodes (lv0, lv1).
+
+So if tasks T1 and T2 are doing IO on lv0 and T3 and T4 are doing IO on lv1
+only, there will not be any contention for resources between group A and B if
+IO is going to sda or sdc. But if actual IO gets translated to disk sdb, then
+the IO scheduler associated with sdb will distribute disk bandwidth to
+group A and B proportionate to their weight.
+
+CFQ already has the notion of fairness and it provides differential disk
+access based on priority and class of the task. It is just that it is flat,
+and with cgroups it needs to be made hierarchical to achieve good
+hierarchical control on IO.
+
+The rest of the IO schedulers (noop, deadline and AS) don't have any notion
+of fairness among various threads. They maintain only one queue where all
+the IO gets queued (internally this queue is split in read and write queue
+for deadline and AS). With this patchset, now we maintain one queue per
+cgroup per device and then try to do fair queuing among those queues.
+
+One of the concerns raised with modifying IO schedulers was that we don't
+want to replicate the code in all the IO schedulers. These patches share
+the fair queuing code which has been moved to a common layer (elevator
+layer). Hence we don't end up replicating code across IO schedulers. Following
+diagram depicts the concept.
+
+
PATCH 05/25 - io-controller: Charge for time slice based on average disk rate by Vivek Goyal on
2009-07-02T20:05:04+00:00
o There are situations where a queue gets expired very soon and it looks
as if the time slice used by that queue is zero. For example, an async
queue may dispatch a bunch of requests and be expired before the first
request completes. Another example is where a queue is expired as soon
as the first request completes and the queue has no more requests (sync
queues on SSD).
o Currently we just charge 25% of slice length in such cases. This patch tries
to improve on that approximation by keeping track of the average disk rate
and charging for time by nr-sectors/disk-rate.
o This is still experimental; I am not sure whether it gives a measurable
improvement. Maybe a better scheme is to use something more granular than
jiffies for time keeping for io queues.
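The charging scheme described above can be sketched in userspace C as follows. This is a simplified model that mirrors the arithmetic in this patch (7/8 decay, new samples scaled by 256) but leaves out the sampling window and requests-in-driver checks. The names rate_state, rate_update and charge_jiffies are hypothetical and do not appear in the patch.

```c
#include <assert.h>

/* Smoothed disk rate state; all fields start at 0 (no data yet). */
struct rate_state {
	unsigned long rate_sectors;	/* decayed sector count, scaled by 256 */
	unsigned long rate_time;	/* decayed jiffies, scaled by 256 */
	unsigned long mean_rate;	/* sectors per jiffy, 0 = unknown */
};

/* Fold one sample (sectors finished over elapsed jiffies) into the average. */
static void rate_update(struct rate_state *rs, unsigned long sectors,
			unsigned long elapsed)
{
	/* A sample shorter than one jiffy is counted as one jiffy. */
	if (!elapsed)
		elapsed = 1;

	rs->rate_sectors = (7 * rs->rate_sectors + 256 * sectors) / 8;
	rs->rate_time = (7 * rs->rate_time + 256 * elapsed) / 8;

	/* rate_time is at least 32 here, so the division is safe. */
	assert(rs->rate_time != 0);
	/* Adding rate_time / 2 rounds to nearest instead of truncating. */
	rs->mean_rate = (rs->rate_sectors + rs->rate_time / 2) / rs->rate_time;
}

/*
 * Jiffies to charge a queue that dispatched nr_sectors.
 * Falls back to 25% of the budget while no rate is known.
 */
static unsigned long charge_jiffies(const struct rate_state *rs,
				    unsigned long nr_sectors,
				    unsigned long budget)
{
	unsigned long used;

	if (!rs->mean_rate)
		return budget / 4;

	used = nr_sectors / rs->mean_rate;
	return used ? used : 1;	/* never charge less than one jiffy */
}
```

Starting from an empty state, one sample of 800 sectors in 10 jiffies yields a mean rate of 80 sectors per jiffy, so a queue that dispatched 1600 sectors would be charged 20 jiffies.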
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
+/* Maximum Window length for updating average disk rate */
+static int elv-rate-sampling-window = HZ / 10;
+
#define ELV-SLICE-SCALE (5)
#define ELV-HW-QUEUE-MIN (5)
@@ -941,6 +944,47 @@ static void elv-ioq-update-io-thinktime(struct io-queue *ioq)
ioq->ttime-mean = (ioq->ttime-total + 128) / ioq->ttime-samples;
}
+static void elv-update-io-rate(struct elv-fq-data *efqd, struct request *rq)
+{
+ long elapsed = jiffies - efqd->rate-sampling-start;
+ unsigned long total;
+
+ /* sampling window is off */
+ if (!efqd->rate-sampling-start)
+ return;
+
+ efqd->rate-sectors-current += blk-rq-sectors(rq);
+
+ if (efqd->rq-in-driver && (elapsed < elv-rate-sampling-window))
+ return;
+
+ efqd->rate-sectors = (7*efqd->rate-sectors +
+ 256*efqd->rate-sectors-current) / 8;
+
+ if (!elapsed) {
+ /*
+ * updating rate before a jiffy could complete. Could be a
+ * problem with fast queuing/non-queuing hardware. Should we
+ * look at higher resolution time source?
+ *
+ * In case of non-queuing hardware we will probably not try to
+ * dispatch from multiple queues and will be able to account
+ * for disk time used and will not need this approximation
+ * anyway?
+ */
+ elapsed = 1;
+ }
+
+ efqd->rate-time = (7*efqd->rate-time + 256*elapsed) / 8;
+ total = efqd->rate-sectors + (efqd->rate-time/2);
+ efqd->mean-rate = total/efqd->rate-time;
+
+ elv-log(efqd, "mean-rate=%d, t=%d s=%d", efqd->mean-rate,
+ elapsed, efqd->rate-sectors-current);
+ efqd->rate-sampling-start = 0;
+ efqd->rate-sectors-current = 0;
+}
+
/*
* Disable idle window if the process thinks too long.
* This idle flag can also be updated by io scheduler.
@@ -1231,6 +1275,34 @@ static void elv-del-ioq-busy(struct elevator-queue *e, struct io-queue *ioq,
}
/*
+ * Calculate the effective disk time used by the queue based on how many
+ * sectors queue has dispatched and what is the average disk rate
+ * Returns disk time in ms.
+ */
+static inline unsigned long elv-disk-time-used(struct request-queue *q,
+ struct io-queue *ioq)
+{
+ struct elv-fq-data *efqd = &q->elevator->efqd;
+ struct io-entity *entity = &ioq->entity;
+ unsigned long jiffies-used = 0;
+
+ if (!efqd->mean-rate)
+ return entity->budget/4;
+
+ /* Charge the queue based on average disk rate */
+ jiffies-used = ioq->nr-sectors/efqd->mean-rate;
+
+ if (!jiffies-used)
+ jiffies-used = 1;
+
+ elv-log-ioq(efqd, ioq, "disk time=%ldms sect=%lu rate=%ld",
+ jiffies-to-msecs(jiffies-used),
+ ioq->nr-sectors, efqd->mean-rate);
+
+ return jiffies-used;
+}
+
+/*
* Do the accounting. Determine how much service (in terms of time slices)
* current queue used and adjust the start, finish time of queue and vtime
* of the tree accordingly.
@@ -1248,8 +1320,10 @@ static void elv-del-ioq-busy(struct elevator-queue *e, struct io-queue *ioq,
* from next queue.
*
* Not sure how to determine the time consumed by queue in such scenarios.
- * Currently as a crude approximation, we are charging 25% of time slice
- * for such cases. A better mechanism is needed for accurate accounting.
+ * Currently as a crude approximation, try to keep track of average disk rate
+ * and charge the queue based on number of sectors transferred. If sufficient
+ * disk rate data is not available then we are charging 25% of time slice
+ * for such cases. A better mechanism is needed for accurate accounting.
*/
void - * to do but at the same time not very sure what's the next best
- * thing to do.
+ * Charge the queue based on average disk rate or the 25% slice if
+ * mean rate is 0. This is not the best thing to do but at the same
+ * time not very sure what's the next best thing to do.
*
* This arises from that fact that we don't have the notion of
* one queue being operational at one time. io scheduler can dispatch
@@ -1282,7 +1356,7 @@ void if (time-after(ioq->slice-end, jiffies)) {
slice-unused = ioq->slice-end - jiffies;
@@ -1292,7 +1366,7 @@ void slice-used = entity->budget - slice-unused;
} else {
@@ -1310,6 +1384,8 @@ void elv-del-ioq-busy(q->elevator, ioq, 1);
else
@@ -1671,6 +1747,7 @@ void elv-fq-dispatched-request(struct elevator-queue *e, struct request *rq)
BUG-ON(!ioq);
elv-ioq-request-dispatched(ioq);
+ ioq->nr-sectors += blk-rq-sectors(rq);
elv-ioq-request-removed(e, rq);
elv-clear-ioq-must-dispatch(ioq);
}
@@ -1683,6 +1760,10 @@ void elv-fq-activate-rq(struct request-queue *q, struct request *rq)
return;
efqd->rq-in-driver++;
+
+ if (!efqd->rate-sampling-start)
+ efqd->rate-sampling-start = jiffies;
+
elv-log-ioq(efqd, rq->ioq, "activate rq, drv=%d",
efqd->rq-in-driver);
}
@@ -1746,6 +1827,8 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
efqd->rq-in-driver
diff
+ /* Number of sectors dispatched in current dispatch round */
+ unsigned long nr-sectors;
+
/* Keep a track of think time of processes in this queue */
unsigned long last-end-request;
unsigned long ttime-total;
@@ -228,6 +231,14 @@ struct elv-fq-data {
/* Base slice length for sync and async queues */
unsigned int elv-slice[2];
+
+ /* Fields for keeping track of average disk rate */
+ unsigned long rate-sectors; /* number of sectors finished */
+ unsigned long rate-time; /* jiffies elapsed */
+ unsigned long mean-rate; /* sectors per jiffy */
+ unsigned long long rate-sampling-start; /*sampling window start jiffies*/
+ /* number of sectors finished io during current sampling window */
+ unsigned long rate-sectors-current;
};
/* Logging facilities. */
PATCH 14/25 - io-controller: Separate out queue and data by Vivek Goyal on
2009-07-02T20:05:05+00:00
o So far noop, deadline and AS had one common structure called *-data which
contained both the queue information where requests are queued and also
common data used for scheduling. This patch breaks down this common
structure in two parts, *-queue and *-data. This is along the lines of
cfq where all the requests are queued in queue and common data and tunables
are part of data.
o It does not change the functionality but this re-organization helps once
noop, deadline and AS are changed to use hierarchical fair queuing.
o Looks like the queue-empty function is not required and we can check for
q->nr-sorted in elevator layer to see if ioscheduler queues are empty or
not.
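The split can be pictured with a small userspace model in C. This is only a sketch under simplifying assumptions: toy_queue stands in for the per-group *-queue, toy_data for the per-device *-data, and the nr_sorted counter for q->nr-sorted. None of these names come from the patch.

```c
#include <assert.h>
#include <stddef.h>

enum { TOY_ASYNC = 0, TOY_SYNC = 1 };

struct toy_request {
	struct toy_request *next;
};

struct toy_list {
	struct toy_request *head;
	struct toy_request *tail;
};

/* Per-group part: only the lists where requests are queued. */
struct toy_queue {
	struct toy_list fifo[2];
};

/* Per-device part: tunables and state shared by every toy_queue. */
struct toy_data {
	unsigned long fifo_expire[2];
	unsigned int nr_sorted;	/* total requests queued on this device */
};

/* Append a request to one group's FIFO and bump the device-wide count. */
static void toy_add(struct toy_data *td, struct toy_queue *tq,
		    struct toy_request *rq, int dir)
{
	struct toy_list *l = &tq->fifo[dir];

	rq->next = NULL;
	if (l->tail)
		l->tail->next = rq;
	else
		l->head = rq;
	l->tail = rq;
	td->nr_sorted++;
}

/* Remove the oldest request from one group's FIFO, or return NULL. */
static struct toy_request *toy_pop(struct toy_data *td, struct toy_queue *tq,
				   int dir)
{
	struct toy_list *l = &tq->fifo[dir];
	struct toy_request *rq = l->head;

	if (!rq)
		return NULL;
	l->head = rq->next;
	if (!l->head)
		l->tail = NULL;
	assert(td->nr_sorted > 0);
	td->nr_sorted--;
	return rq;
}

/* Emptiness check that needs no per-scheduler queue-empty hook. */
static int toy_all_empty(const struct toy_data *td)
{
	return td->nr_sorted == 0;
}
```

Because the count lives in the shared per-device structure, the emptiness check does not need to walk each group's lists, which is the reason a per-scheduler queue-empty function becomes unnecessary in this model.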
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
-struct as-data {
- /*
- * run time data
- */
-
- struct request-queue *q; /* the "owner" queue */
-
+struct as-queue {
/*
* requests (as-rq s) are present on both sort-list and fifo-list
*/
@@ -90,6 +84,14 @@ struct as-data {
struct list-head fifo-list[2];
struct request *next-rq[2]; /* next in sort order */
+ unsigned long last-check-fifo[2];
+ int write-batch-count; /* max # of reqs in a write batch */
+ int current-write-count; /* how many requests left this batch */
+ int write-batch-idled; /* has the write batch gone idle? */
+};
+
+struct as-data {
+ struct request-queue *q; /* the "owner" queue */
sector-t last-sector[2]; /* last SYNC & ASYNC sectors */
unsigned long exit-prob; /* probability a task will exit while
@@ -103,21 +105,17 @@ struct as-data {
sector-t new-seek-mean;
unsigned long current-batch-expires;
- unsigned long last-check-fifo[2];
int changed-batch; /* 1: waiting for old batch to end */
int new-batch; /* 1: waiting on first read complete */
- int batch-data-dir; /* current batch SYNC / ASYNC */
- int write-batch-count; /* max # of reqs in a write batch */
- int current-write-count; /* how many requests left this batch */
- int write-batch-idled; /* has the write batch gone idle? */
enum anticipation-status antic-status;
unsigned long antic-start; /* jiffies: when it started */
struct timer-list antic-timer; /* anticipatory scheduling timer */
- struct work-struct antic-work; /* Deferred unplugging */
+ struct work-struct antic-work; /* Deferred unplugging */
struct io-context *io-context; /* Identify the expected process */
int ioc-finished; /* IO associated with io-context is finished */
int nr-dispatched;
+ int batch-data-dir; /* current batch SYNC / ASYNC */
/*
* settings that change how the i/o scheduler behaves
@@ -258,13 +256,14 @@ static void as-put-io-context(struct request *rq)
/*
* rb tree support functions
*/
-#define RQ-RB-ROOT(ad, rq) (&(ad)->sort-list[rq-is-sync((rq))])
+#define RQ-RB-ROOT(asq, rq) (&(asq)->sort-list[rq-is-sync((rq))])
static void as-add-rq-rb(struct as-data *ad, struct request *rq)
{
struct request *alias;
+ struct as-queue *asq = elv-get-sched-queue(ad->q, rq);
- while ((unlikely(alias = elv-rb-add(RQ-RB-ROOT(ad, rq), rq)))) {
+ while ((unlikely(alias = elv-rb-add(RQ-RB-ROOT(asq, rq), rq)))) {
as-move-to-dispatch(ad, alias);
as-antic-stop(ad);
}
@@ -272,7 +271,9 @@ static void as-add-rq-rb(struct as-data *ad, struct request *rq)
static inline void as-del-rq-rb(struct as-data *ad, struct request *rq)
{
- elv-rb-del(RQ-RB-ROOT(ad, rq), rq);
+ struct as-queue *asq = elv-get-sched-queue(ad->q, rq);
+
+ elv-rb-del(RQ-RB-ROOT(asq, rq), rq);
}
/*
@@ -366,7 +367,7 @@ as-choose-req(struct as-data *ad, struct request *rq1, struct request *rq2)
* what request to process next. Anticipation works on top of this.
*/
static struct request *
-as-find-next-rq(struct as-data *ad, struct request *last)
+as-find-next-rq(struct as-data *ad, struct as-queue *asq, struct request *last)
{
struct rb-node *rbnext = rb-next(&last->rb-node);
struct rb-node *rbprev = rb-prev(&last->rb-node);
@@ -382,7 +383,7 @@ as-find-next-rq(struct as-data *ad, struct request *last)
else {
const int data-dir = rq-is-sync(last);
- rbnext = rb-first(&ad->sort-list[data-dir]);
+ rbnext = rb-first(&asq->sort-list[data-dir]);
if (rbnext && rbnext != &last->rb-node)
next = rb-entry-rq(rbnext);
}
@@ -789,9 +790,10 @@ static int as-can-anticipate(struct as-data *ad, struct request *rq)
static void as-update-rq(struct as-data *ad, struct request *rq)
{
const int data-dir = rq-is-sync(rq);
+ struct as-queue *asq = elv-get-sched-queue(ad->q, rq);
/* keep the next-rq cache up to date */
- ad->next-rq[data-dir] = as-choose-req(ad, rq, ad->next-rq[data-dir]);
+ asq->next-rq[data-dir] = as-choose-req(ad, rq, asq->next-rq[data-dir]);
/*
* have we been anticipating this request?
@@ -812,25 +814,26 @@ static void update-write-batch(struct as-data *ad)
{
unsigned long batch = ad->batch-expire[BLK-RW-ASYNC];
long write-time;
+ struct as-queue *asq = elv-get-sched-queue(ad->q, NULL);
write-time = (jiffies - ad->current-batch-expires) + batch;
if (write-time < 0)
write-time = 0;
- if (write-time > batch && !ad->write-batch-idled) {
+ if (write-time > batch && !asq->write-batch-idled) {
if (write-time > batch * 3)
- ad->write-batch-count /= 2;
+ asq->write-batch-count /= 2;
else
- ad->write-batch-count else
- ad->write-batch-count++;
+ asq->write-batch-count++;
}
- if (ad->write-batch-count < 1)
- ad->write-batch-count = 1;
+ if (asq->write-batch-count < 1)
+ asq->write-batch-count = 1;
}
/*
@@ -901,6 +904,7 @@ static void as-remove-queued-request(struct request-queue *q,
const int data-dir = rq-is-sync(rq);
struct as-data *ad = q->elevator->elevator-data;
struct io-context *ioc;
+ struct as-queue *asq = elv-get-sched-queue(q, rq);
WARN-ON(RQ-STATE(rq) != AS-RQ-QUEUED);
@@ -914,8 +918,8 @@ static void as-remove-queued-request(struct request-queue *q,
* Update the "next-rq" cache if we are about to remove its
* entry
*/
- if (ad->next-rq[data-dir] == rq)
- ad->next-rq[data-dir] = as-find-next-rq(ad, rq);
+ if (asq->next-rq[data-dir] == rq)
+ asq->next-rq[data-dir] = as-find-next-rq(ad, asq, rq);
rq-fifo-clear(rq);
as-del-rq-rb(ad, rq);
@@ -929,23 +933,23 @@ static void as-remove-queued-request(struct request-queue *q,
*
* See as-antic-expired comment.
*/
-static int as-fifo-expired(struct as-data *ad, int adir)
+static int as-fifo-expired(struct as-data *ad, struct as-queue *asq, int adir)
{
struct request *rq;
long delta-jif;
- delta-jif = jiffies - ad->last-check-fifo[adir];
+ delta-jif = jiffies - asq->last-check-fifo[adir];
if (unlikely(delta-jif < 0))
delta-jif = -delta-jif;
if (delta-jif < ad->fifo-expire[adir])
return 0;
- ad->last-check-fifo[adir] = jiffies;
+ asq->last-check-fifo[adir] = jiffies;
- if (list-empty(&ad->fifo-list[adir]))
+ if (list-empty(&asq->fifo-list[adir]))
return 0;
- rq = rq-entry-fifo(ad->fifo-list[adir].next);
+ rq = rq-entry-fifo(asq->fifo-list[adir].next);
return time-after(jiffies, rq-fifo-time(rq));
}
@@ -954,7 +958,7 @@ static int as-fifo-expired(struct as-data *ad, int adir)
* as-batch-expired returns true if the current batch has expired. A batch
* is a set of reads or a set of writes.
*/
-static inline int as-batch-expired(struct as-data *ad)
+static inline int as-batch-expired(struct as-data *ad, struct as-queue *asq)
{
if (ad->changed-batch || ad->new-batch)
return 0;
@@ -964,7 +968,7 @@ static inline int as-batch-expired(struct as-data *ad)
return time-after(jiffies, ad->current-batch-expires);
return time-after(jiffies, ad->current-batch-expires)
- || ad->current-write-count == 0;
+ || asq->current-write-count == 0;
}
/*
@@ -973,6 +977,7 @@ static inline int as-batch-expired(struct as-data *ad)
static void as-move-to-dispatch(struct as-data *ad, struct request *rq)
{
const int data-dir = rq-is-sync(rq);
+ struct as-queue *asq = elv-get-sched-queue(ad->q, rq);
BUG-ON(RB-EMPTY-NODE(&rq->rb-node));
@@ -995,12 +1000,12 @@ static void as-move-to-dispatch(struct as-data *ad, struct request *rq)
ad->io-context = NULL;
}
- if (ad->current-write-count != 0)
- ad->current-write-count+ asq->next-rq[data-dir] = as-find-next-rq(ad, asq, rq);
/*
* take it off the sort and fifo list, add to dispatch queue
@@ -1024,9 +1029,16 @@ static void as-move-to-dispatch(struct as-data *ad, struct request *rq)
static int as-dispatch-request(struct request-queue *q, int force)
{
struct as-data *ad = q->elevator->elevator-data;
- const int reads = !list-empty(&ad->fifo-list[BLK-RW-SYNC]);
- const int writes = !list-empty(&ad->fifo-list[BLK-RW-ASYNC]);
struct request *rq;
+ struct as-queue *asq = elv-select-sched-queue(q, force);
+ int reads, writes;
+
+ if (!asq)
+ return 0;
+
+ reads = !list-empty(&asq->fifo-list[BLK-RW-SYNC]);
+ writes = !list-empty(&asq->fifo-list[BLK-RW-ASYNC]);
+
if (unlikely(force)) {
/*
@@ -1042,25 +1054,25 @@ static int as-dispatch-request(struct request-queue *q, int force)
ad->changed-batch = 0;
ad->new-batch = 0;
- while (ad->next-rq[BLK-RW-SYNC]) {
- as-move-to-dispatch(ad, ad->next-rq[BLK-RW-SYNC]);
+ while (asq->next-rq[BLK-RW-SYNC]) {
+ as-move-to-dispatch(ad, asq->next-rq[BLK-RW-SYNC]);
dispatched++;
}
- ad->last-check-fifo[BLK-RW-SYNC] = jiffies;
+ asq->last-check-fifo[BLK-RW-SYNC] = jiffies;
- while (ad->next-rq[BLK-RW-ASYNC]) {
- as-move-to-dispatch(ad, ad->next-rq[BLK-RW-ASYNC]);
+ while (asq->next-rq[BLK-RW-ASYNC]) {
+ as-move-to-dispatch(ad, asq->next-rq[BLK-RW-ASYNC]);
dispatched++;
}
- ad->last-check-fifo[BLK-RW-ASYNC] = jiffies;
+ asq->last-check-fifo[BLK-RW-ASYNC] = jiffies;
return dispatched;
}
/* Signal that the write batch was uncontended, so we can't time it */
if (ad->batch-data-dir == BLK-RW-ASYNC && !reads) {
- if (ad->current-write-count == 0 || !writes)
- ad->write-batch-idled = 1;
+ if (asq->current-write-count == 0 || !writes)
+ asq->write-batch-idled = 1;
}
if (!(reads || writes)
@@ -1069,14 +1081,14 @@ static int as-dispatch-request(struct request-queue *q, int force)
|| ad->changed-batch)
return 0;
- if (!(reads && writes && as-batch-expired(ad))) {
+ if (!(reads && writes && as-batch-expired(ad, asq))) {
/*
* batch is still running or no reads or no writes
*/
- rq = ad->next-rq[ad->batch-data-dir];
+ rq = asq->next-rq[ad->batch-data-dir];
if (ad->batch-data-dir == BLK-RW-SYNC && ad->antic-expire) {
- if (as-fifo-expired(ad, BLK-RW-SYNC))
+ if (as-fifo-expired(ad, asq, BLK-RW-SYNC))
goto fifo-expired;
if (as-can-anticipate(ad, rq)) {
@@ -1100,7 +1112,7 @@ static int as-dispatch-request(struct request-queue *q, int force)
*/
if (reads) {
- BUG-ON(RB-EMPTY-ROOT(&ad->sort-list[BLK-RW-SYNC]));
+ BUG-ON(RB-EMPTY-ROOT(&asq->sort-list[BLK-RW-SYNC]));
if (writes && ad->batch-data-dir == BLK-RW-SYNC)
/*
@@ -1113,8 +1125,8 @@ static int as-dispatch-request(struct request-queue *q, int force)
ad->changed-batch = 1;
}
ad->batch-data-dir = BLK-RW-SYNC;
- rq = rq-entry-fifo(ad->fifo-list[BLK-RW-SYNC].next);
- ad->last-check-fifo[ad->batch-data-dir] = jiffies;
+ rq = rq-entry-fifo(asq->fifo-list[BLK-RW-SYNC].next);
+ asq->last-check-fifo[ad->batch-data-dir] = jiffies;
goto dispatch-request;
}
@@ -1124,7 +1136,7 @@ static int as-dispatch-request(struct request-queue *q, int force)
if (writes) {
dispatch-writes:
- BUG-ON(RB-EMPTY-ROOT(&ad->sort-list[BLK-RW-ASYNC]));
+ BUG-ON(RB-EMPTY-ROOT(&asq->sort-list[BLK-RW-ASYNC]));
if (ad->batch-data-dir == BLK-RW-SYNC) {
ad->changed-batch = 1;
@@ -1137,10 +1149,10 @@ dispatch-writes:
ad->new-batch = 0;
}
ad->batch-data-dir = BLK-RW-ASYNC;
- ad->current-write-count = ad->write-batch-count;
- ad->write-batch-idled = 0;
- rq = rq-entry-fifo(ad->fifo-list[BLK-RW-ASYNC].next);
- ad->last-check-fifo[BLK-RW-ASYNC] = jiffies;
+ asq->current-write-count = asq->write-batch-count;
+ asq->write-batch-idled = 0;
+ rq = rq-entry-fifo(asq->fifo-list[BLK-RW-ASYNC].next);
+ asq->last-check-fifo[BLK-RW-ASYNC] = jiffies;
goto dispatch-request;
}
@@ -1152,9 +1164,9 @@ dispatch-request:
* If a request has expired, service it.
*/
- if (as-fifo-expired(ad, ad->batch-data-dir)) {
+ if (as-fifo-expired(ad, asq, ad->batch-data-dir)) {
fifo-expired:
- rq = rq-entry-fifo(ad->fifo-list[ad->batch-data-dir].next);
+ rq = rq-entry-fifo(asq->fifo-list[ad->batch-data-dir].next);
}
if (ad->changed-batch) {
@@ -1187,6 +1199,7 @@ static void as-add-request(struct request-queue *q, struct request *rq)
{
struct as-data *ad = q->elevator->elevator-data;
int data-dir;
+ struct as-queue *asq = elv-get-sched-queue(q, rq);
RQ-SET-STATE(rq, AS-RQ-NEW);
@@ -1205,7 +1218,7 @@ static void as-add-request(struct request-queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq-set-fifo-time(rq, jiffies + ad->fifo-expire[data-dir]);
- list-add-tail(&rq->queuelist, &ad->fifo-list[data-dir]);
+ list-add-tail(&rq->queuelist, &asq->fifo-list[data-dir]);
as-update-rq(ad, rq); /* keep state machine up to date */
RQ-SET-STATE(rq, AS-RQ-QUEUED);
@@ -1227,31 +1240,20 @@ static void as-deactivate-request(struct request-queue *q, struct request *rq)
atomic-inc(&RQ-IOC(rq)->aic->nr-dispatched);
}
-/*
- * as-queue-empty tells us if there are requests left in the device. It may
- * not be the case that a driver can get the next request even if the queue
- * is not empty - it is used in the block layer to check for plugging and
- * merging opportunities
- */
-static int as-queue-empty(struct request-queue *q)
-{
- struct as-data *ad = q->elevator->elevator-data;
-
- return list-empty(&ad->fifo-list[BLK-RW-ASYNC])
- && list-empty(&ad->fifo-list[BLK-RW-SYNC]);
-}
-
static int
as-merge(struct request-queue *q, struct request **req, struct bio *bio)
{
- struct as-data *ad = q->elevator->elevator-data;
sector-t rb-key = bio->bi-sector + bio-sectors(bio);
struct request * * check for front merge
*/
- }
+/* Called with queue lock held */
+static void *as-alloc-as-queue(struct request-queue *q,
+ struct elevator-queue *eq, gfp-t gfp-mask)
+{
+ struct as-queue *asq;
+ struct as-data *ad = eq->elevator-data;
+
+ asq = kmalloc-node(sizeof(*asq), gfp-mask | + asq->sort-list[BLK-RW-ASYNC] = RB-ROOT;
+ if (ad)
+ asq->write-batch-count = ad->batch-expire[BLK-RW-ASYNC] / 10;
+ else
+ asq->write-batch-count = default-write-batch-expire / 10;
+
+ if (asq->write-batch-count < 2)
+ asq->write-batch-count = 2;
+out:
+ return asq;
+}
+
+static void as-free-as-queue(struct elevator-queue *e, void *sched-queue)
+{
+ struct as-queue *asq = sched-queue;
+
+ BUG-ON(!list-empty(&asq->fifo-list[BLK-RW-SYNC]));
+ BUG-ON(!list-empty(&asq->fifo-list[BLK-RW-ASYNC]));
+ kfree(asq);
+}
+
static void as-exit-queue(struct elevator-queue *e)
{
struct as-data *ad = e->elevator-data;
@@ -1341,9 +1378,6 @@ static void as-exit-queue(struct elevator-queue *e)
del-timer-sync(&ad->antic-timer);
cancel-work-sync(&ad->antic-work);
- BUG-ON(!list-empty(&ad->fifo-list[BLK-RW-SYNC]));
- BUG-ON(!list-empty(&ad->fifo-list[BLK-RW-ASYNC]));
-
put-io-context(ad->io-context);
kfree(ad);
}
@@ -1367,10 +1401,6 @@ static void *as-init-queue(struct request-queue *q)
init-timer(&ad->antic-timer);
INIT-WORK(&ad->antic-work, as-work-handler);
- INIT-LIST-HEAD(&ad->fifo-list[BLK-RW-SYNC]);
- INIT-LIST-HEAD(&ad->fifo-list[BLK-RW-ASYNC]);
- ad->sort-list[BLK-RW-SYNC] = RB-ROOT;
- ad->sort-list[BLK-RW-ASYNC] = RB-ROOT;
ad->fifo-expire[BLK-RW-SYNC] = default-read-expire;
ad->fifo-expire[BLK-RW-ASYNC] = default-write-expire;
ad->antic-expire = default-antic-expire;
@@ -1378,9 +1408,6 @@ static void *as-init-queue(struct request-queue *q)
ad->batch-expire[BLK-RW-ASYNC] = default-write-batch-expire;
ad->current-batch-expires = jiffies + ad->batch-expire[BLK-RW-SYNC];
- ad->write-batch-count = ad->batch-expire[BLK-RW-ASYNC] / 10;
- if (ad->write-batch-count < 2)
- ad->write-batch-count = 2;
return ad;
}
@@ -1478,7 +1505,6 @@ static struct elevator-type iosched-as = {
.elevator-add-req-fn = as-add-request,
.elevator-activate-req-fn = as-activate-request,
.elevator-deactivate-req-fn = as-deactivate-request,
- .elevator-queue-empty-fn = as-queue-empty,
.elevator-completed-req-fn = as-completed-request,
.elevator-former-req-fn = elv-rb-former-request,
.elevator-latter-req-fn = elv-rb-latter-request,
@@ -1486,6 +1512,8 @@ static struct elevator-type iosched-as = {
.elevator-init-fn = as-init-queue,
.elevator-exit-fn = as-exit-queue,
.trim = as-trim,
+ .elevator-alloc-sched-queue-fn = as-alloc-as-queue,
+ .elevator-free-sched-queue-fn = as-free-as-queue,
},
.elevator-attrs = as-attrs,
diff
-struct deadline-data {
- /*
- * run time data
- */
-
+struct deadline-queue {
/*
* requests (deadline-rq s) are present on both sort-list and fifo-list
*/
- struct rb-root sort-list[2];
+ struct rb-root sort-list[2];
struct list-head fifo-list[2];
-
/*
* next in sort order. read, write or both are NULL
*/
struct request *next-rq[2];
unsigned int batching; /* number of sequential requests made */
- sector-t last-sector; /* head position */
unsigned int starved; /* times reads have starved writes */
+};
+struct deadline-data {
+ struct request-queue *q;
+ sector-t last-sector; /* head position */
/*
* settings that change how the i/o scheduler behaves
*/
@@ -56,7 +54,9 @@ static void deadline-move-request(struct deadline-data *, struct request *);
static inline struct rb-root *
deadline-rb-root(struct deadline-data *dd, struct request *rq)
{
- return &dd->sort-list[rq-data-dir(rq)];
+ struct deadline-queue *dq = elv-get-sched-queue(dd->q, rq);
+
+ return &dq->sort-list[rq-data-dir(rq)];
}
/*
@@ -87,9 +87,10 @@ static inline void
deadline-del-rq-rb(struct deadline-data *dd, struct request *rq)
{
const int data-dir = rq-data-dir(rq);
+ struct deadline-queue *dq = elv-get-sched-queue(dd->q, rq);
- if (dd->next-rq[data-dir] == rq)
- dd->next-rq[data-dir] = deadline-latter-request(rq);
+ if (dq->next-rq[data-dir] == rq)
+ dq->next-rq[data-dir] = deadline-latter-request(rq);
elv-rb-del(deadline-rb-root(dd, rq), rq);
}
@@ -102,6 +103,7 @@ deadline-add-request(struct request-queue *q, struct request *rq)
{
struct deadline-data *dd = q->elevator->elevator-data;
const int data-dir = rq-data-dir(rq);
+ struct deadline-queue *dq = elv-get-sched-queue(q, rq);
deadline-add-rq-rb(dd, rq);
@@ -109,7 +111,7 @@ deadline-add-request(struct request-queue *q, struct request *rq)
* set expire time and add to fifo list
*/
rq-set-fifo-time(rq, jiffies + dd->fifo-expire[data-dir]);
- list-add-tail(&rq->queuelist, &dd->fifo-list[data-dir]);
+ list-add-tail(&rq->queuelist, &dq->fifo-list[data-dir]);
}
/*
@@ -129,6 +131,11 @@ deadline-merge(struct request-queue *q, struct request **req, struct bio *bio)
struct deadline-data *dd = q->elevator->elevator-data;
struct request *
/*
* check for front merge
@@ -136,7 +143,7 @@ deadline-merge(struct request-queue *q, struct request **req, struct bio *bio)
if (dd->front-merges) {
sector-t sector = bio->bi-sector + bio-sectors(bio);
- {
const int data-dir = rq-data-dir(rq);
+ struct deadline-queue *dq = elv-get-sched-queue(dd->q, rq);
- dd->next-rq[READ] = NULL;
- dd->next-rq[WRITE] = NULL;
- dd->next-rq[data-dir] = deadline-latter-request(rq);
+ dq->next-rq[READ] = NULL;
+ dq->next-rq[WRITE] = NULL;
+ dq->next-rq[data-dir] = deadline-latter-request(rq);
dd->last-sector = rq-end-sector(rq);
@@ -225,9 +233,9 @@ deadline-move-request(struct deadline-data *dd, struct request *rq)
* deadline-check-fifo returns 0 if there are no expired requests on the fifo,
* 1 otherwise. Requires !list-empty(&dd->fifo-list[data-dir])
*/
-static inline int deadline-check-fifo(struct deadline-data *dd, int ddir)
+static inline int deadline-check-fifo(struct deadline-queue *dq, int ddir)
{
- struct request *rq = rq-entry-fifo(dd->fifo-list[ddir].next);
+ struct request *rq = rq-entry-fifo(dq->fifo-list[ddir].next);
/*
* rq is expired!
@@ -245,20 +253,26 @@ static inline int deadline-check-fifo(struct deadline-data *dd, int ddir)
static int deadline-dispatch-requests(struct request-queue *q, int force)
{
struct deadline-data *dd = q->elevator->elevator-data;
- const int reads = !list-empty(&dd->fifo-list[READ]);
- const int writes = !list-empty(&dd->fifo-list[WRITE]);
+ struct deadline-queue *dq = elv-select-sched-queue(q, force);
+ int reads, writes;
struct request *rq;
int data-dir;
+ if (!dq)
+ return 0;
+
+ reads = !list-empty(&dq->fifo-list[READ]);
+ writes = !list-empty(&dq->fifo-list[WRITE]);
+
/*
* batches are currently reads XOR writes
*/
- if (dd->next-rq[WRITE])
- rq = dd->next-rq[WRITE];
+ if (dq->next-rq[WRITE])
+ rq = dq->next-rq[WRITE];
else
- rq = dd->next-rq[READ];
+ rq = dq->next-rq[READ];
- if (rq && dd->batching < dd->fifo-batch)
+ if (rq && dq->batching < dd->fifo-batch)
/* we have a next request are still entitled to batch */
goto dispatch-request;
@@ -268,9 +282,9 @@ static int deadline-dispatch-requests(struct request-queue *q, int force)
*/
if (reads) {
- BUG-ON(RB-EMPTY-ROOT(&dd->sort-list[READ]));
+ BUG-ON(RB-EMPTY-ROOT(&dq->sort-list[READ]));
- if (writes && (dd->starved++ >= dd->writes-starved))
+ if (writes && (dq->starved++ >= dd->writes-starved))
goto dispatch-writes;
data-dir = READ;
@@ -284,9 +298,9 @@ static int deadline-dispatch-requests(struct request-queue *q, int force)
if (writes) {
dispatch-writes:
- BUG-ON(RB-EMPTY-ROOT(&dd->sort-list[WRITE]));
+ BUG-ON(RB-EMPTY-ROOT(&dq->sort-list[WRITE]));
- dd->starved = 0;
+ dq->starved = 0;
data-dir = WRITE;
@@ -299,48 +313,62 @@ dispatch-find-request:
/*
* we are not running a batch, find best request for selected data-dir
*/
- if (deadline-check-fifo(dd, data-dir) || !dd->next-rq[data-dir]) {
+ if (deadline-check-fifo(dq, data-dir) || !dq->next-rq[data-dir]) {
/*
* A deadline has expired, the last request was in the other
* direction, or we have run out of higher-sectored requests.
* Start again from the request with the earliest expiry time.
*/
- rq = rq-entry-fifo(dd->fifo-list[data-dir].next);
+ rq = rq-entry-fifo(dq->fifo-list[data-dir].next);
} else {
/*
* The last req was the same dir and we have a next request in
* sort order. No expired requests so continue on from here.
*/
- rq = dd->next-rq[data-dir];
+ rq = dq->next-rq[data-dir];
}
- dd->batching = 0;
+ dq->batching = 0;
dispatch-request:
/*
* rq is the selected appropriate request.
*/
- dd->batching++;
+ dq->batching++;
deadline-move-request(dd, rq);
return 1;
}
-static int deadline-queue-empty(struct request-queue *q)
+static void *deadline-alloc-deadline-queue(struct request-queue *q,
+ struct elevator-queue *eq, gfp-t gfp-mask)
{
- struct deadline-data *dd = q->elevator->elevator-data;
+ struct deadline-queue *dq;
- return list-empty(&dd->fifo-list[WRITE])
- && list-empty(&dd->fifo-list[READ]);
+ dq = kmalloc-node(sizeof(*dq), gfp-mask | + dq->sort-list[WRITE] = RB-ROOT;
+out:
+ return dq;
+}
+
+static void deadline-free-deadline-queue(struct elevator-queue *e,
+ void *sched-queue)
+{
+ struct deadline-queue *dq = sched-queue;
+
+ kfree(dq);
}
static void deadline-exit-queue(struct elevator-queue *e)
{
struct deadline-data *dd = e->elevator-data;
- BUG-ON(!list-empty(&dd->fifo-list[READ]));
- BUG-ON(!list-empty(&dd->fifo-list[WRITE]));
-
kfree(dd);
}
@@ -355,10 +383,7 @@ static void *deadline-init-queue(struct request-queue *q)
if (!dd)
return NULL;
- INIT-LIST-HEAD(&dd->fifo-list[READ]);
- INIT-LIST-HEAD(&dd->fifo-list[WRITE]);
- dd->sort-list[READ] = RB-ROOT;
- dd->sort-list[WRITE] = RB-ROOT;
+ dd->q = q;
dd->fifo-expire[READ] = read-expire;
dd->fifo-expire[WRITE] = write-expire;
dd->writes-starved = writes-starved;
@@ -445,13 +470,13 @@ static struct elevator-type iosched-deadline = {
.elevator-merge-req-fn = deadline-merged-requests,
.elevator-dispatch-fn = deadline-dispatch-requests,
.elevator-add-req-fn = deadline-add-request,
- .elevator-queue-empty-fn = deadline-queue-empty,
.elevator-former-req-fn = elv-rb-former-request,
.elevator-latter-req-fn = elv-rb-latter-request,
.elevator-init-fn = deadline-init-queue,
.elevator-exit-fn = deadline-exit-queue,
+ .elevator-alloc-sched-queue-fn = deadline-alloc-deadline-queue,
+ .elevator-free-sched-queue-fn = deadline-free-deadline-queue,
},
-
.elevator-attrs = deadline-attrs,
.elevator-name = "deadline",
.elevator-owner = THIS-MODULE,
diff
-static void *elevator-init-queue(struct request-queue *q,
- struct elevator-queue *eq)
+static void *elevator-init-data(struct request-queue *q,
+ struct elevator-queue *eq)
{
- return eq->ops->elevator-init-fn(q);
+ void *data = NULL;
+
+ if (eq->ops->elevator-init-fn) {
+ data = eq->ops->elevator-init-fn(q);
+ if (data)
+ return data;
+ else
+ return ERR-PTR(-ENOMEM);
+ }
+
+ /* IO scheduler does not instantiate data (noop), it is not an error */
+ return NULL;
+}
+
+static void elevator-free-sched-queue(struct elevator-queue *eq,
+ void *sched-queue)
+{
+ /* Not all io schedulers (cfq) store sched-queue */
+ if (!sched-queue)
+ return;
+ eq->ops->elevator-free-sched-queue-fn(eq, sched-queue);
+}
+
+static void *elevator-alloc-sched-queue(struct request-queue *q,
+ struct elevator-queue *eq)
+{
+ void *sched-queue = NULL;
+
+ if (eq->ops->elevator-alloc-sched-queue-fn) {
+ sched-queue = eq->ops->elevator-alloc-sched-queue-fn(q, eq,
+ GFP-KERNEL);
+ if (!sched-queue)
+ return ERR-PTR(-ENOMEM);
+
+ }
+
+ return sched-queue;
}
static void elevator-attach(struct request-queue *q, struct elevator-queue *eq,
- void *data)
+ void *data, void *sched-queue)
{
q->elevator = eq;
eq->elevator-data = data;
+ eq->sched-queue = sched-queue;
}
static char chosen-elevator[16];
@@ -255,7 +292,7 @@ int elevator-init(struct request-queue *q, char *name)
struct elevator-type *e = NULL;
struct elevator-queue *eq;
int ret = 0;
- void *data;
+ void *data = NULL, *sched-queue = NULL;
INIT-LIST-HEAD(&q->queue-head);
q->last-merge = NULL;
@@ -289,13 +326,21 @@ int elevator-init(struct request-queue *q, char *name)
if (!eq)
return -ENOMEM;
- data = elevator-init-queue(q, eq);
- if (!data) {
+ data = elevator-init-data(q, eq);
+
+ if (IS-ERR(data)) {
+ kobject-put(&eq->kobj);
+ return -ENOMEM;
+ }
+
+ sched-queue = elevator-alloc-sched-queue(q, eq);
+
+ if (IS-ERR(sched-queue)) {
kobject-put(&eq->kobj);
return -ENOMEM;
}
- elevator-attach(q, eq, data);
+ elevator-attach(q, eq, data, sched-queue);
return ret;
}
EXPORT-SYMBOL(elevator-init);
@@ -303,6 +348,7 @@ EXPORT-SYMBOL(elevator-init);
void elevator-exit(struct elevator-queue *e)
{
mutex-lock(&e->sysfs-lock);
+ elevator-free-sched-queue(e, e->sched-queue);
elv-exit-fq-data(e);
if (e->ops->elevator-exit-fn)
e->ops->elevator-exit-fn(e);
@@ -992,7 +1038,7 @@ EXPORT-SYMBOL-GPL(elv-unregister);
static int elevator-switch(struct request-queue *q, struct elevator-type *new-e)
{
struct elevator-queue *old-elevator, *e;
- void *data;
+ void *data = NULL, *sched-queue = NULL;
/*
* Allocate new elevator
@@ -1001,10 +1047,18 @@ static int elevator-switch(struct request-queue *q, struct elevator-type *new-e)
if (!e)
return 0;
- data = elevator-init-queue(q, e);
- if (!data) {
+ data = elevator-init-data(q, e);
+
+ if (IS-ERR(data)) {
kobject-put(&e->kobj);
- return 0;
+ return -ENOMEM;
+ }
+
+ sched-queue = elevator-alloc-sched-queue(q, e);
+
+ if (IS-ERR(sched-queue)) {
+ kobject-put(&e->kobj);
+ return -ENOMEM;
}
/*
@@ -1021,7 +1075,7 @@ static int elevator-switch(struct request-queue *q, struct elevator-type *new-e)
/*
* attach and start new elevator
*/
- elevator-attach(q, e, data);
+ elevator-attach(q, e, data, sched-queue);
spin-unlock-irq(q->queue-lock);
@@ -1136,16 +1190,43 @@ struct request *elv-rb-latter-request(struct request-queue *q,
}
EXPORT-SYMBOL(elv-rb-latter-request);
-/* Get the io scheduler queue pointer. For cfq, it is stored in rq->ioq*/
+/* Get the io scheduler queue pointer. */
void *elv-get-sched-queue(struct request-queue *q, struct request *rq)
{
- return ioq-sched-queue(req-ioq(rq));
+ /*
+ * io scheduler is not using fair queuing. Return sched-queue
+ * pointer stored in elevator-queue. It will be null if io
+ * scheduler never stored anything there to begin with (cfq)
+ */
+ if (!elv-iosched-fair-queuing-enabled(q->elevator))
+ return q->elevator->sched-queue;
+
+ /*
+ * IO scheduler is using fair queuing infrastructure. If io scheduler
+ * has passed a non null rq, retrieve sched-queue pointer from
+ * there. */
+ if (rq)
+ return ioq-sched-queue(req-ioq(rq));
+
+ return NULL;
}
EXPORT-SYMBOL(elv-get-sched-queue);
/* Select an ioscheduler queue to dispatch request from. */
void *elv-select-sched-queue(struct request-queue *q, int force)
{
+ if (!elv-iosched-fair-queuing-enabled(q->elevator))
+ return q->elevator->sched-queue;
+
return ioq-sched-queue(elv-fq-select-ioq(q, force));
}
EXPORT-SYMBOL(elv-select-sched-queue);
+
+/*
+ * Get the io scheduler queue pointer for current task.
+ */
+void *elv-get-sched-queue-current(struct request-queue *q)
+{
+ return q->elevator->sched-queue;
+}
+EXPORT-SYMBOL(elv-get-sched-queue-current);
diff
-struct noop-data {
+struct noop-queue {
struct list-head queue;
};
@@ -19,11 +19,14 @@ static void noop-merged-requests(struct request-queue *q, struct request *rq,
static int noop-dispatch(struct request-queue *q, int force)
{
- struct noop-data *nd = q->elevator->elevator-data;
+ struct noop-queue *nq = elv-select-sched-queue(q, force);
- if (!list-empty(&nd->queue)) {
+ if (!nq)
+ return 0;
+
+ if (!list-empty(&nq->queue)) {
struct request *rq;
- rq = list-entry(nd->queue.next, struct request, queuelist);
+ rq = list-entry(nq->queue.next, struct request, queuelist);
list-del-init(&rq->queuelist);
elv-dispatch-sort(q, rq);
return 1;
@@ -33,24 +36,17 @@ static int noop-dispatch(struct request-queue *q, int force)
static void noop-add-request(struct request-queue *q, struct request *rq)
{
- struct noop-data *nd = q->elevator->elevator-data;
+ struct noop-queue *nq = elv-get-sched-queue(q, rq);
- list-add-tail(&rq->queuelist, &nd->queue);
-}
-
-static int noop-queue-empty(struct request-queue *q)
-{
- struct noop-data *nd = q->elevator->elevator-data;
-
- return list-empty(&nd->queue);
+ list-add-tail(&rq->queuelist, &nq->queue);
}
static struct request *
noop-former-request(struct request-queue *q, struct request *rq)
{
- struct noop-data *nd = q->elevator->elevator-data;
+ struct noop-queue *nq = elv-get-sched-queue(q, rq);
- if (rq->queuelist.prev == &nd->queue)
+ if (rq->queuelist.prev == &nq->queue)
return NULL;
return list-entry(rq->queuelist.prev, struct request, queuelist);
}
@@ -58,30 +54,32 @@ noop-former-request(struct request-queue *q, struct request *rq)
static struct request *
noop-latter-request(struct request-queue *q, struct request *rq)
{
- struct noop-data *nd = q->elevator->elevator-data;
+ struct noop-queue *nq = elv-get-sched-queue(q, rq);
- if (rq->queuelist.next == &nd->queue)
+ if (rq->queuelist.next == &nq->queue)
return NULL;
return list-entry(rq->queuelist.next, struct request, queuelist);
}
-static void *noop-init-queue(struct request-queue *q)
+static void *noop-alloc-noop-queue(struct request-queue *q,
+ struct elevator-queue *eq, gfp-t gfp-mask)
{
- struct noop-data *nd;
+ struct noop-queue *nq;
- nd = kmalloc-node(sizeof(*nd), GFP-KERNEL, q->node);
- if (!nd)
- return NULL;
- INIT-LIST-HEAD(&nd->queue);
- return nd;
+ nq = kmalloc-node(sizeof(*nq), gfp-mask | }
-static void noop-exit-queue(struct elevator-queue *e)
+static void noop-free-noop-queue(struct elevator-queue *e, void *sched-queue)
{
- struct noop-data *nd = e->elevator-data;
+ struct noop-queue *nq = sched-queue;
- BUG-ON(!list-empty(&nd->queue));
- kfree(nd);
+ kfree(nq);
}
static struct elevator-type elevator-noop = {
@@ -89,11 +87,10 @@ static struct elevator-type elevator-noop = {
.elevator-merge-req-fn = noop-merged-requests,
.elevator-dispatch-fn = noop-dispatch,
.elevator-add-req-fn = noop-add-request,
- .elevator-queue-empty-fn = noop-queue-empty,
.elevator-former-req-fn = noop-former-request,
.elevator-latter-req-fn = noop-latter-request,
- .elevator-init-fn = noop-init-queue,
- .elevator-exit-fn = noop-exit-queue,
+ .elevator-alloc-sched-queue-fn = noop-alloc-noop-queue,
+ .elevator-free-sched-queue-fn = noop-free-noop-queue,
},
.elevator-name = "noop",
.elevator-owner = THIS-MODULE,
diff typedef void (elevator-exit-fn) (struct elevator-queue *);
-#ifdef CONFIG-ELV-FAIR-QUEUING
+typedef void* (elevator-alloc-sched-queue-fn) (struct request-queue *q, struct elevator-queue *eq, gfp-t);
typedef void (elevator-free-sched-queue-fn) (struct elevator-queue*, void *);
+#ifdef CONFIG-ELV-FAIR-QUEUING
typedef void (elevator-active-ioq-set-fn) (struct request-queue*, void *, int);
typedef void (elevator-active-ioq-reset-fn) (struct request-queue *, void*);
typedef void (elevator-arm-slice-timer-fn) (struct request-queue*, void*);
@@ -70,8 +71,9 @@ struct elevator-ops
elevator-exit-fn *elevator-exit-fn;
void (*trim)(struct io-context *);
-#ifdef CONFIG-ELV-FAIR-QUEUING
+ elevator-alloc-sched-queue-fn *elevator-alloc-sched-queue-fn;
elevator-free-sched-queue-fn *elevator-free-sched-queue-fn;
+#ifdef CONFIG-ELV-FAIR-QUEUING
elevator-active-ioq-set-fn *elevator-active-ioq-set-fn;
elevator-active-ioq-reset-fn *elevator-active-ioq-reset-fn;
@@ -112,6 +114,7 @@ struct elevator-queue
{
struct elevator-ops *ops;
void *elevator-data;
+ void *sched-queue;
struct kobject kobj;
struct elevator-type *elevator-type;
struct mutex sysfs-lock;
@@ -258,5 +261,6 @@ static inline int elv-iosched-fair-queuing-enabled(struct elevator-queue *e)
#endif /* ELV-IOSCHED-FAIR-QUEUING */
extern void *elv-get-sched-queue(struct request-queue *q, struct request *rq);
extern void *elv-select-sched-queue(struct request-queue *q, int force);
+extern void *elv-get-sched-queue-current(struct request-queue *q);
#endif /* CONFIG-BLOCK */
#endif
PATCH 23/25 - io-controller: Support per cgroup per device weights and io class by Vivek Goyal on
2009-07-02T20:05:37+00:00
This patch enables per-cgroup, per-device weight and ioprio-class handling.
A new cgroup interface file, "policy", is introduced. Use this file to
configure the weight and ioprio-class for each device in a given cgroup.
The original "weight" and "ioprio-class" files are still available. If a
device has no specific configuration, the values in "weight" and
"ioprio-class" are used as the defaults for that device.
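The fallback described above can be sketched in userspace C. This is only an illustrative model, not the patch code: the struct and function names (policy_node, io_cgroup_model, effective_params) are hypothetical, a plain singly linked list stands in for the kernel list, and there is no locking. Identifiers use underscores, as they would in real C source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one per-device rule in a cgroup. */
struct policy_node {
	unsigned int dev_major, dev_minor;
	unsigned int weight;
	unsigned short ioprio_class;
	struct policy_node *next;
};

/* Hypothetical stand-in for the cgroup: defaults plus per-device rules. */
struct io_cgroup_model {
	unsigned int weight;           /* value of the "weight" file */
	unsigned short ioprio_class;   /* value of the "ioprio_class" file */
	struct policy_node *policies;  /* rules written to "policy" */
};

/* Return the rule for the device, or NULL if it has none. */
const struct policy_node *policy_search(const struct io_cgroup_model *cg,
					unsigned int dev_major,
					unsigned int dev_minor)
{
	const struct policy_node *pn;

	for (pn = cg->policies; pn; pn = pn->next)
		if (pn->dev_major == dev_major && pn->dev_minor == dev_minor)
			return pn;
	return NULL;
}

/* Per-device rule wins; otherwise fall back to the cgroup defaults. */
void effective_params(const struct io_cgroup_model *cg,
		      unsigned int dev_major, unsigned int dev_minor,
		      unsigned int *weight, unsigned short *ioprio_class)
{
	const struct policy_node *pn;

	assert(cg && weight && ioprio_class);
	pn = policy_search(cg, dev_major, dev_minor);
	*weight = pn ? pn->weight : cg->weight;
	*ioprio_class = pn ? pn->ioprio_class : cg->ioprio_class;
}
```

With a rule for 8:16 only, a lookup for 8:16 yields the rule's values, while a lookup for 8:0 yields the cgroup defaults.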
Use the following format to write to the new interface.
#echo dev-major:dev-minor weight ioprio-class > /path/to/cgroup/policy
weight=0 removes the policy for the device.
Examples:
Configure weight=300 ioprio-class=2 on /dev/hdb (8:16) in this cgroup
# echo "8:16 300 2" > io.policy
# cat io.policy
dev weight class
8:16 300 2
Configure weight=500 ioprio-class=1 on /dev/hda (8:0) in this cgroup
# echo "8:0 500 1" > io.policy
# cat io.policy
dev weight class
8:0 500 1
8:16 300 2
Remove the policy for /dev/hda in this cgroup
# echo 8:0 0 1 > io.policy
# cat io.policy
dev weight class
8:16 300 2
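The rule format used in the examples above can be checked with a small userspace parser. This is a sketch under stated assumptions, not the kernel parser: it uses sscanf() rather than strsep()/strict_strtoul(), the name parse_policy_rule is hypothetical, and the weight ceiling of 1000 is an assumed placeholder because the patch does not show the value of WEIGHT_MAX. The class bounds 1 (RT) to 3 (IDLE) follow the kernel's ioprio class numbering.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

#define SKETCH_WEIGHT_MAX 1000u  /* assumption: real WEIGHT_MAX not shown */
#define SKETCH_CLASS_RT   1u     /* IOPRIO_CLASS_RT */
#define SKETCH_CLASS_IDLE 3u     /* IOPRIO_CLASS_IDLE */

struct policy_rule {
	unsigned int dev_major, dev_minor;
	unsigned int weight;        /* 0 means "remove the rule" */
	unsigned int ioprio_class;
};

/*
 * Parse "major:minor weight class". Returns 0 on success, -EINVAL on
 * a malformed line, a missing or extra field, or an out-of-range value.
 */
int parse_policy_rule(const char *line, struct policy_rule *out)
{
	unsigned int dev_major, dev_minor, weight, cls;
	char extra;

	assert(line && out);

	/* The trailing " %c" only converts if a fifth token is present. */
	if (sscanf(line, "%u:%u %u %u %c",
		   &dev_major, &dev_minor, &weight, &cls, &extra) != 4)
		return -EINVAL;
	if (weight > SKETCH_WEIGHT_MAX)
		return -EINVAL;
	if (cls < SKETCH_CLASS_RT || cls > SKETCH_CLASS_IDLE)
		return -EINVAL;

	out->dev_major = dev_major;
	out->dev_minor = dev_minor;
	out->weight = weight;
	out->ioprio_class = cls;
	return 0;
}
```

Unlike the kernel code, this sketch does not verify that the device exists or that it is a whole disk rather than a partition.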
Changelog (v1 -> v2)
- Rename some structures
- Use spin-lock-irqsave() and spin-unlock-irqrestore() to avoid
enabling interrupts unconditionally.
- Fix policy setup bug when switching to another io scheduler.
- If a policy is available for a specific device, don't update weight and
io class when writing "weight" and "ioprio-class".
- Fix a bug when parsing policy string.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff #include <linux/biotrack.h>
+#include <linux/genhd.h>
/* Values taken from cfq */
const int elv-slice-sync = HZ / 10;
@@ -1053,12 +1054,31 @@ static void bfq-init-entity(struct io-entity *entity, struct io-group *iog)
entity->sched-data = &iog->sched-data;
}
-static void io-group-init-entity(struct io-cgroup *iocg, struct io-group *iog)
+static struct io-policy-node *policy-search-node(const struct io-cgroup *iocg,
+ dev-t dev);
+
+static void io-group-init-entity(struct io-cgroup *iocg, struct io-group *iog,
+ dev-t dev)
{
struct io-entity *entity = &iog->entity;
+ struct io-policy-node *pn;
+ unsigned long flags;
+
+ spin-lock-irqsave(&iocg->lock, flags);
+ pn = policy-search-node(iocg, dev);
+ if (pn) {
+ entity->weight = pn->weight;
+ entity->new-weight = pn->weight;
+ entity->ioprio-class = pn->ioprio-class;
+ entity->new-ioprio-class = pn->ioprio-class;
+ } else {
+ entity->weight = iocg->weight;
+ entity->new-weight = iocg->weight;
+ entity->ioprio-class = iocg->ioprio-class;
+ entity->new-ioprio-class = iocg->ioprio-class;
+ }
+ spin-unlock-irqrestore(&iocg->lock, flags);
- entity->weight = entity->new-weight = iocg->weight;
- entity->ioprio-class = entity->new-ioprio-class = iocg->ioprio-class;
entity->ioprio-changed = 1;
entity->my-sched-data = &iog->sched-data;
}
@@ -1174,6 +1194,227 @@ io-cgroup-lookup-group(struct io-cgroup *iocg, void *key)
return NULL;
}
+static int io-cgroup-policy-read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq-file *m)
+{
+ struct io-cgroup *iocg;
+ struct io-policy-node *pn;
+
+ iocg = cgroup-to-io-cgroup(cgrp);
+
+ if (list-empty(&iocg->policy-list))
+ goto out;
+
+ seq-printf(m, "dev weight class\n");
+
+ spin-lock-irq(&iocg->lock);
+ list-for-each-entry(pn, &iocg->policy-list, node) {
+ seq-printf(m, "%u:%u %u %hu\n", MAJOR(pn->dev),
+ MINOR(pn->dev), pn->weight, pn->ioprio-class);
+ }
+ spin-unlock-irq(&iocg->lock);
+out:
+ return 0;
+}
+
+static inline void policy-insert-node(struct io-cgroup *iocg,
+ struct io-policy-node *pn)
+{
+ list-add(&pn->node, &iocg->policy-list);
+}
+
+/* Must be called with iocg->lock held */
+static inline void policy-delete-node(struct io-policy-node *pn)
+{
+ list-del(&pn->node);
+}
+
+/* Must be called with iocg->lock held */
+static struct io-policy-node *policy-search-node(const struct io-cgroup *iocg,
+ dev-t dev)
+{
+ struct io-policy-node *pn;
+
+ if (list-empty(&iocg->policy-list))
+ return NULL;
+
+ list-for-each-entry(pn, &iocg->policy-list, node) {
+ if (pn->dev == dev)
+ return pn;
+ }
+
+ return NULL;
+}
+
+static int check-dev-num(dev-t dev)
+{
+ int part = 0;
+ struct gendisk *disk;
+
+ disk = get-gendisk(dev, &part);
+ if (!disk || part)
+ return -ENODEV;
+
+ return 0;
+}
+
+static int policy-parse-and-set(char *buf, struct io-policy-node *newpn)
+{
+ char *s[4], *p, *major-s = NULL, *minor-s = NULL;
+ int ret;
+ unsigned long major, minor, temp;
+ int i = 0;
+ dev-t dev;
+
+ memset(s, 0, sizeof(s));
+ while ((p = strsep(&buf, " ")) != NULL) {
+ if (!*p)
+ continue;
+ s[i++] = p;
+
+ /* Stop parsing once there are more fields than expected */
+ if (i == 4)
+ break;
+ }
+
+ if (i != 3)
+ return -EINVAL;
+
+ p = strsep(&s[0], ":");
+ if (p != NULL)
+ major-s = p;
+ else
+ return -EINVAL;
+
+ minor-s = s[0];
+ if (!minor-s)
+ return -EINVAL;
+
+ ret = strict-strtoul(major-s, 10, &major);
+ if (ret)
+ return -EINVAL;
+
+ ret = strict-strtoul(minor-s, 10, &minor);
+ if (ret)
+ return -EINVAL;
+
+ dev = MKDEV(major, minor);
+
+ ret = check-dev-num(dev);
+ if (ret)
+ return ret;
+
+ newpn->dev = dev;
+
+ if (s[1] == NULL)
+ return -EINVAL;
+
+ ret = strict-strtoul(s[1], 10, &temp);
+ if (ret || temp > WEIGHT-MAX)
+ return -EINVAL;
+
+ newpn->weight = temp;
+
+ if (s[2] == NULL)
+ return -EINVAL;
+
+ ret = strict-strtoul(s[2], 10, &temp);
+ if (ret || temp < IOPRIO-CLASS-RT || temp > IOPRIO-CLASS-IDLE)
+ return -EINVAL;
+ newpn->ioprio-class = temp;
+
+ return 0;
+}
+
+static int io-cgroup-policy-write(struct cgroup *cgrp, struct cftype *cft,
+ const char *buffer)
+{
+ struct io-cgroup *iocg;
+ struct io-policy-node *newpn, *pn;
+ char *buf;
+ int ret = 0;
+ int keep-newpn = 0;
+ struct hlist-node *n;
+ struct io-group *iog;
+
+ buf = kstrdup(buffer, GFP-KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ newpn = kzalloc(sizeof(*newpn), GFP-KERNEL);
+ if (!newpn) {
+ ret = -ENOMEM;
+ goto free-buf;
+ }
+
+ ret = policy-parse-and-set(buf, newpn);
+ if (ret)
+ goto free-newpn;
+
+ if (!cgroup-lock-live-group(cgrp)) {
+ ret = -ENODEV;
+ goto free-newpn;
+ }
+
+ iocg = cgroup-to-io-cgroup(cgrp);
+ spin-lock-irq(&iocg->lock);
+
+ pn = policy-search-node(iocg, newpn->dev);
+ if (!pn) {
+ if (newpn->weight != 0) {
+ policy-insert-node(iocg, newpn);
+ keep-newpn = 1;
+ }
+ goto update-io-group;
+ }
+
+ if (newpn->weight == 0) {
+ /* weight == 0 means deleting a policy */
+ policy-delete-node(pn);
+ goto update-io-group;
+ }
+
+ pn->weight = newpn->weight;
+ pn->ioprio-class = newpn->ioprio-class;
+
+update-io-group:
+ hlist-for-each-entry(iog, n, &iocg->group-data, group-node) {
+ if (iog->dev == newpn->dev) {
+ if (newpn->weight) {
+ iog->entity.new-weight = newpn->weight;
+ iog->entity.new-ioprio-class =
+ newpn->ioprio-class;
+ /*
+ * iog weight and ioprio-class updating
+ * actually happens if ioprio-changed is set.
+ * So ensure ioprio-changed is not set until
+ * new weight and new ioprio-class are updated.
+ */
+ smp-wmb();
+ iog->entity.ioprio-changed = 1;
+ } else {
+ iog->entity.new-weight = iocg->weight;
+ iog->entity.new-ioprio-class =
+ iocg->ioprio-class;
+
+ /* The same as above */
+ smp-wmb();
+ iog->entity.ioprio-changed = 1;
+ }
+ }
+ }
+ spin-unlock-irq(&iocg->lock);
+
+ cgroup-unlock();
+
+free-newpn:
+ if (!keep-newpn)
+ kfree(newpn);
+free-buf:
+ kfree(buf);
+ return ret;
+}
+
#define SHOW-FUNCTION(+ struct io-policy-node *pn;
if (val < (+ if (pn)
+ continue;
iog->entity.new-##+ .name = "policy",
+ .read-seq-string = io-cgroup-policy-read,
+ .write-string = io-cgroup-policy-write,
+ .max-write-len = 256,
+ },
+ {
.name = "weight",
.read-u64 = io-cgroup-weight-read,
.write-u64 = io-cgroup-weight-write,
@@ -1336,6 +1587,7 @@ static struct cgroup-subsys-state *iocg-create(struct cgroup-subsys *subsys,
INIT-HLIST-HEAD(&iocg->group-data);
iocg->weight = IO-DEFAULT-GRP-WEIGHT;
iocg->ioprio-class = IO-DEFAULT-GRP-CLASS;
+ INIT-LIST-HEAD(&iocg->policy-list);
return &iocg->css;
}
@@ -1438,7 +1690,7 @@ io-group-chain-alloc(struct request-queue *q, void *key, struct cgroup *cgroup)
sscanf(dev-name(bdi->dev), "%u:%u", &major, &minor);
iog->dev = MKDEV(major, minor);
- io-group-init-entity(iocg, iog);
+ io-group-init-entity(iocg, iog, iog->dev);
iog->my-entity = &iog->entity;
atomic-set(&iog->ref, 0);
@@ -1904,6 +2156,7 @@ static void iocg-destroy(struct cgroup-subsys *subsys, struct cgroup *cgroup)
struct io-group *iog;
struct elv-fq-data *efqd;
unsigned long uninitialized-var(flags);
+ struct io-policy-node *pn, *pntmp;
/*
* io groups are linked in two lists. One list is maintained
@@ -1943,6 +2196,11 @@ remove-entry:
goto remove-entry;
done:
+ list-for-each-entry-safe(pn, pntmp, &iocg->policy-list, node) {
+ policy-delete-node(pn);
+ kfree(pn);
+ }
+
free-css-id(&io-subsys, &iocg->css);
rcu-read-unlock();
BUG-ON(!hlist-empty(&iocg->group-data));
diff
+struct io-policy-node {
+ struct list-head node;
+ dev-t dev;
+ unsigned int weight;
+ unsigned short ioprio-class;
+};
+
/**
* struct io-cgroup - io cgroup data structure.
* @css: subsystem state for io in the containing cgroup.
@@ -284,6 +291,9 @@ struct io-cgroup {
unsigned int weight;
unsigned short ioprio-class;
+ /* list of io-policy-node */
+ struct list-head policy-list;
+
spinlock-t lock;
struct hlist-head group-data;
};
PATCH 03/25 - io-controller: bfq support of in-class preemption by Vivek Goyal on
2009-07-02T20:05:54+00:00
o Generally preemption is associated with cross-class scheduling: if a
request from the RT class is pending, it will preempt the ongoing BE or
IDLE class request.
o CFQ also does in-class preemptions, such as a sync request queue preempting
the async request queue. In that case the preempting queue appears to gain
share, which is not fair.
o Implement similar functionality in bfq so that we can retain the
existing CFQ behavior.
o This patch creates a bypass path so that a queue can be put at the
front of the service tree (add-front, similar to CFQ), so that it will
be selected next to run. That this queue gains share in the process is a
separate issue.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
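The add-front placement in the hunk below can be modelled in userspace as follows. This is a sketch, not the patch code: the names (entity_model, ts_gt, ts_delta, place_in_front) are hypothetical, and the fixed-point shift of 22 and the signed-difference comparison are assumptions about how bfq-delta() and bfq-gt() behave, since their bodies are not part of this hunk.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SERVICE_SHIFT 22  /* assumption: fixed-point scale */

/* Hypothetical stand-in for the timestamps kept per entity. */
struct entity_model {
	uint64_t start, finish;  /* virtual start and finish times */
	unsigned long budget;    /* service the entity may receive */
	unsigned int weight;
};

/* Wraparound-safe "a is later than b" (assumed bfq-gt() semantics). */
int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Virtual time consumed by 'service' at 'weight' (assumed formula). */
uint64_t ts_delta(unsigned long service, unsigned int weight)
{
	assert(weight != 0);
	return ((uint64_t)service << SKETCH_SERVICE_SHIFT) / weight;
}

/*
 * Place 'e' just ahead of 'next': finish one tick before next's finish,
 * derive start from the budget, and never start later than vtime.
 */
void place_in_front(struct entity_model *e,
		    const struct entity_model *next, uint64_t vtime)
{
	e->finish = next->finish - 1;
	e->start = e->finish - ts_delta(e->budget, e->weight);
	if (ts_gt(e->start, vtime))
		e->start = vtime;
}
```

Because the finish time is forced below that of the entity that would otherwise run next, the preempting entity is selected first regardless of the finish time its own budget and weight would have produced. That is why it gains share.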
-diff */
-static void
st = + */
+ if (add-front) {
+ struct io-entity *next-entity;
+
+ /*
+ * Determine the entity which will be dispatched next
+ * Use sd->next-active once hierarchical patch is applied
+ */
+ next-entity = bfq-lookup-next-entity(sd, 0);
+
+ if (next-entity && next-entity != entity) {
+ struct io-service-tree *new-st;
+ u64 delta;
+
+ new-st = io-entity-service-tree(next-entity);
+
+ /*
+ * At this point, both entities should belong to
+ * same service tree as cross service tree preemption
+ * is automatically taken care by algorithm
+ */
+ BUG-ON(new-st != st);
+ entity->finish = next-entity->finish - 1;
+ delta = bfq-delta(entity->budget, entity->weight);
+ entity->start = entity->finish - delta;
+ if (bfq-gt(entity->start, st->vtime))
+ entity->start = st->vtime;
+ }
+ } else {
+ bfq-calc-finish(entity, entity->budget);
+ }
bfq-active-insert(st, entity);
}
@@ -635,9 +670,9 @@ static void
PATCH 22/25 - io-controller: Per io group bdi congestion interface by Vivek Goyal on
2009-07-02T20:06:19+00:00
o So far there used to be only one pair of request descriptor queues
(one for sync and one for async) per device, and the number of requests
allocated used to decide whether the associated bdi is congested or not.
Now with per io group request descriptor infrastructure, there is a pair
of request descriptor queues per io group per device. So it might happen
that the overall request queue is not congested but the particular io group
a bio belongs to is congested.
Or it could be the other way around: the group is not congested but the
overall queue is congested. This can happen if the user has not properly set
the request descriptor limits for queue and groups.
(q->nr-requests < nr-groups * q->nr-group-requests)
Hence there is a need for a new interface which can query device congestion
status per group. This group is determined by the "struct page" IO will be
done for. If page is null, then group is determined from the current task
context.
o This patch introduces a new set of functions, bdi-*-congested-group(), which
take "struct page" as an additional argument. These functions will call the
block layer and in turn the elevator to find out if the io group the page will
go into is congested or not.
o Currently I have introduced the core functions and migrated most of the users.
But there might still be some left. This is an ongoing TODO item.
o There are some io-get-io-group() related changes which should be pushed into
higher patches. Still testing this patch. Will push these changes up in next
posting.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
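To illustrate the two mismatch cases described above, here is a small
user-space sketch. The struct and function names are hypothetical, and
"congested" is simplified to mean "count reached the limit"; the real
on/off thresholds sit somewhat below the limits.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the tunables discussed in the changelog. */
struct limits {
	unsigned int nr_requests;	/* per-queue limit (q->nr-requests) */
	unsigned int nr_group_requests;	/* per-group limit (q->nr-group-requests) */
	unsigned int nr_groups;		/* number of io groups on the queue */
};

/*
 * True when the groups together may hold more descriptors than the queue
 * allows, i.e. q->nr-requests < nr-groups * q->nr-group-requests.
 */
bool limits_oversubscribed(const struct limits *l)
{
	assert(l->nr_groups > 0);
	return l->nr_requests <
	       (unsigned long long)l->nr_groups * l->nr_group_requests;
}

/* Simplified: congested once the queue-wide count reaches the queue limit. */
bool queue_congested(unsigned int queue_count, const struct limits *l)
{
	return queue_count >= l->nr_requests;
}

/* Simplified: congested once the group's count reaches the group limit. */
bool group_congested(unsigned int group_count, const struct limits *l)
{
	return group_count >= l->nr_group_requests;
}
```

With nr_requests = 128, nr_group_requests = 64 and four groups, three groups
holding 43 descriptors each congest the queue while no single group is
congested. With nr_requests = 512 the opposite can happen: one group holding
64 descriptors is congested while the queue is not.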
-diff
+#ifdef CONFIG-GROUP-IOSCHED
+int blk-queue-io-group-congested(struct backing-dev-info *bdi, int bdi-bits,
+ struct page *page)
+{
+ int ret = 0;
+ struct request-queue *q = bdi->unplug-io-data;
+
+ if (!q || !q->elevator)
+ return bdi-congested(bdi, bdi-bits);
+
+ /* Do we need to hold queue lock? */
+ if (bdi-bits & (1 << BDI-sync-congested))
+ ret |= elv-io-group-congested(q, page, 1);
+
+ if (bdi-bits & (1 << BDI-async-congested))
+ ret |= elv-io-group-congested(q, page, 0);
+
+ return ret;
+}
+#endif
+
/**
* blk-get-backing-dev-info - get the address of a queue's backing-dev-info
* @bdev: device
diff
+/* Set io group congestion on and off thresholds */
+void elv-io-group-congestion-threshold(struct request-queue *q,
+ struct io-group *iog)
+{
+ int nr;
+
+ nr = q->nr-group-requests - (q->nr-group-requests / 8) + 1;
+ if (nr > q->nr-group-requests)
+ nr = q->nr-group-requests;
+ iog->nr-congestion-on = nr;
+
+ nr = q->nr-group-requests - (q->nr-group-requests / 8)
+ - (q->nr-group-requests / 16) - 1;
+ if (nr < 1)
+ nr = 1;
+ iog->nr-congestion-off = nr;
+}
+
+static inline int elv-is-iog-congested(struct request-queue *q,
+ struct io-group *iog, int sync)
+{
+ if (iog->rl.count[sync] >= iog->nr-congestion-on)
+ return 1;
+ return 0;
+}
+
+/* Determine whether the io group the page maps to is congested */
+int elv-io-group-congested(struct request-queue *q, struct page *page, int sync)
+{
+ struct io-group *iog;
+ int ret = 0;
+
+ rcu-read-lock();
+
+ iog = io-get-io-group(q, page, 0);
+
+ if (!iog) {
+ /*
+ * Either cgroup got deleted or this is first request in the
+ * group and associated io group object has not been created
+ * yet. Map it to root group.
+ *
+ * TODO: Fix the case of group not created yet.
+ */
+ iog = q->elevator->efqd.root-group;
+ }
+
+ ret = elv-is-iog-congested(q, iog, sync);
+ rcu-read-unlock();
+ return ret;
+}
+
/*
* Search the io-group for efqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu-read-lock().
@@ -1401,6 +1453,7 @@ io-group-chain-alloc(struct request-queue *q, void *key, struct cgroup *cgroup)
elv-get-iog(iog);
blk-init-request-list(&iog->rl);
+ elv-io-group-congestion-threshold(q, iog);
if (leaf == NULL) {
leaf = iog;
@@ -1680,6 +1733,7 @@ static struct io-group *io-alloc-root-group(struct request-queue *q,
iog->sched-data.service-tree[i] = IO-SERVICE-TREE-INIT;
blk-init-request-list(&iog->rl);
+ elv-io-group-congestion-threshold(q, iog);
iocg = &io-root-cgroup;
spin-lock-irq(&iocg->lock);
@@ -1688,6 +1742,10 @@ static struct io-group *io-alloc-root-group(struct request-queue *q,
iog->iocg-id = css-id(&iocg->css);
spin-unlock-irq(&iocg->lock);
+#ifdef CONFIG-DEBUG-GROUP-IOSCHED
+ io-group-path(iog, iog->path, sizeof(iog->path));
+#endif
+
return iog;
}
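As a worked example of the arithmetic in elv-io-group-congestion-threshold()
above, the following user-space sketch reproduces the same computation on a
plain integer. The struct and function names are hypothetical; only the
formulas are taken from the patch.

```c
#include <assert.h>

/* Hypothetical container for the two per-group thresholds. */
struct thresholds {
	int on;		/* group becomes congested at this many requests */
	int off;	/* intended point at which congestion clears */
};

/* Same arithmetic as the patch, applied to a plain per-group limit. */
struct thresholds group_congestion_thresholds(int nr_group_requests)
{
	struct thresholds t;
	int nr;

	assert(nr_group_requests > 0);

	/* "on": seven eighths of the limit plus one, capped at the limit */
	nr = nr_group_requests - (nr_group_requests / 8) + 1;
	if (nr > nr_group_requests)
		nr = nr_group_requests;
	t.on = nr;

	/* "off": a further sixteenth below that, but never less than 1 */
	nr = nr_group_requests - (nr_group_requests / 8)
		- (nr_group_requests / 16) - 1;
	if (nr < 1)
		nr = 1;
	t.off = nr;

	return t;
}
```

For a per-group limit of 128 this gives on = 113 and off = 103. For very
small limits the clamps take over: a limit of 4 gives on = 4 and off = 3.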
diff
+ /* io group congestion on and off threshold for request descriptors */
+ unsigned int nr-congestion-on;
+ unsigned int nr-congestion-off;
+
/* request list associated with the group */
struct request-list rl;
};
@@ -531,6 +535,8 @@ extern struct io-queue *elv-lookup-ioq-bio(struct request-queue *q,
struct bio *bio);
extern struct request-list *io-group-get-request-list(struct request-queue *q,
struct bio *bio);
+extern int elv-io-group-congested(struct request-queue *q, struct page *page,
+ int sync);
/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
static inline void io-group-set-ioq(struct io-group *iog, struct io-queue *ioq)
diff
-int dm-table-any-congested(struct dm-table *t, int bdi-bits)
+int dm-table-any-congested(struct dm-table *t, int bdi-bits, struct page *page,
+ int group)
{
struct dm-dev-internal *dd;
struct list-head *devices = dm-table-get-devices(t);
@@ -1185,9 +1186,11 @@ int dm-table-any-congested(struct dm-table *t, int bdi-bits)
struct request-queue *q = bdev-get-queue(dd->dm-dev.bdev);
char b[BDEVNAME-SIZE];
- if (likely(q))
- r |= bdi-congested(&q->backing-dev-info, bdi-bits);
- else
+ if (likely(q)) {
+ struct backing-dev-info *bdi = &q->backing-dev-info;
+ r |= group ? bdi-congested-group(bdi, bdi-bits, page)
+ : bdi-congested(bdi, bdi-bits);
+ } else
DMWARN-LIMIT("%s: any-congested: nonexistent device %s",
dm-device-name(t->md),
bdevname(dd->dm-dev.bdev, b));
diff
-static int dm-any-congested(void *congested-data, int bdi-bits)
+static int dm-any-congested(void *congested-data, int bdi-bits,
+ struct page *page, int group)
{
int r = bdi-bits;
struct mapped-device *md = congested-data;
@@ -1625,8 +1626,8 @@ static int dm-any-congested(void *congested-data, int bdi-bits)
r = md->queue->backing-dev-info.state &
bdi-bits;
else
- r = dm-table-any-congested(map, bdi-bits);
-
+ r = dm-table-any-congested(map, bdi-bits, page,
+ group);
dm-table-put(map);
}
}
diff int dm-table-resume-targets(struct dm-table *t);
-int dm-table-any-congested(struct dm-table *t, int bdi-bits);
+int dm-table-any-congested(struct dm-table *t, int bdi-bits, struct page *page,
+ int group);
int dm-table-any-busy-target(struct dm-table *t);
int dm-table-set-type(struct dm-table *t);
unsigned dm-table-get-type(struct dm-table *t);
diff
-static int linear-congested(void *data, int bits)
+static int linear-congested(void *data, int bits, struct page *page, int group)
{
mddev-t *mddev = data;
linear-conf-t *conf;
@@ -113,7 +113,10 @@ static int linear-congested(void *data, int bits)
for (i = 0; i < mddev->raid-disks && !ret ; i++) {
struct request-queue *q = bdev-get-queue(conf->disks[i].rdev->bdev);
- ret |= bdi-congested(&q->backing-dev-info, bits);
+ struct backing-dev-info *bdi = &q->backing-dev-info;
+
+ ret |= group ? bdi-congested-group(bdi, bits, page) :
+ bdi-congested(bdi, bits);
}
rcu-read-unlock();
diff
-static int multipath-congested(void *data, int bits)
+static int multipath-congested(void *data, int bits, struct page *page,
+ int group)
{
mddev-t *mddev = data;
multipath-conf-t *conf = mddev->private;
@@ -203,8 +204,10 @@ static int multipath-congested(void *data, int bits)
mdk-rdev-t *rdev = rcu-dereference(conf->multipaths[i].rdev);
if (rdev && !test-bit(Faulty, &rdev->flags)) {
struct request-queue *q = bdev-get-queue(rdev->bdev);
+ struct backing-dev-info *bdi = &q->backing-dev-info;
- ret |= bdi-congested(&q->backing-dev-info, bits);
+ ret |= group ? bdi-congested-group(bdi, bits, page)
+ : bdi-congested(bdi, bits);
/* Just like multipath-map, we just check the
* first available device
*/
diff
-static int raid0-congested(void *data, int bits)
+static int raid0-congested(void *data, int bits, struct page *page, int group)
{
mddev-t *mddev = data;
raid0-conf-t *conf = mddev->private;
@@ -46,8 +46,10 @@ static int raid0-congested(void *data, int bits)
for (i = 0; i < mddev->raid-disks && !ret ; i++) {
struct request-queue *q = bdev-get-queue(devlist[i]->bdev);
+ struct backing-dev-info *bdi = &q->backing-dev-info;
- ret |= bdi-congested(&q->backing-dev-info, bits);
+ ret |= group ? bdi-congested-group(bdi, bits, page)
+ : bdi-congested(bdi, bits);
}
return ret;
}
diff
-static int raid1-congested(void *data, int bits)
+static int raid1-congested(void *data, int bits, struct page *page, int group)
{
mddev-t *mddev = data;
conf-t *conf = mddev->private;
@@ -581,14 +581,17 @@ static int raid1-congested(void *data, int bits)
mdk-rdev-t *rdev = rcu-dereference(conf->mirrors[i].rdev);
if (rdev && !test-bit(Faulty, &rdev->flags)) {
struct request-queue *q = bdev-get-queue(rdev->bdev);
+ struct backing-dev-info *bdi = &q->backing-dev-info;
/* Note the '|| 1' - when read-balance prefers
* non-congested targets, it can be removed
*/
if ((bits & (1<<BDI-async-congested)) || 1)
- ret |= bdi-congested(&q->backing-dev-info, bits);
+ ret |= group ? bdi-congested-group(bdi, bits,
+ page) : bdi-congested(bdi, bits);
else
- ret &= bdi-congested(&q->backing-dev-info, bits);
+ ret &= group ? bdi-congested-group(bdi, bits,
+ page) : bdi-congested(bdi, bits);
}
}
rcu-read-unlock();
diff
-static int raid10-congested(void *data, int bits)
+static int raid10-congested(void *data, int bits, struct page *page, int group)
{
mddev-t *mddev = data;
conf-t *conf = mddev->private;
@@ -636,8 +636,10 @@ static int raid10-congested(void *data, int bits)
mdk-rdev-t *rdev = rcu-dereference(conf->mirrors[i].rdev);
if (rdev && !test-bit(Faulty, &rdev->flags)) {
struct request-queue *q = bdev-get-queue(rdev->bdev);
+ struct backing-dev-info *bdi = &q->backing-dev-info;
- ret |= bdi-congested(&q->backing-dev-info, bits);
+ ret |= group ? bdi-congested-group(bdi, bits, page)
+ : bdi-congested(bdi, bits);
}
}
rcu-read-unlock();
diff
-static int raid5-congested(void *data, int bits)
+static int raid5-congested(void *data, int bits, struct page *page, int group)
{
mddev-t *mddev = data;
raid5-conf-t *conf = mddev->private;
diff wbc->nr-to-write -= ret;
- if (wbc->nonblocking && bdi-write-congested(bdi))
+ if (wbc->nonblocking && bdi-or-group-write-congested(bdi, page))
wbc->encountered-congestion = 1;
-leave(" = 0");
@@ -491,6 +491,12 @@ static int afs-writepages-region(struct address-space *mapping,
return 0;
}
+ if (wbc->nonblocking && bdi-write-congested-group(bdi, page)) {
+ wbc->encountered-congestion = 1;
+ page-cache-release(page);
+ break;
+ }
+
/* at this point we hold neither mapping->tree-lock nor lock on
* the page itself: the page may be truncated or invalidated
* (changing page->mapping to NULL), or even swizzled back from
diff
-static int btrfs-congested-fn(void *congested-data, int bdi-bits)
+static int btrfs-congested-fn(void *congested-data, int bdi-bits,
+ struct page *page, int group)
{
struct btrfs-fs-info *info = (struct btrfs-fs-info *)congested-data;
int ret = 0;
@@ -1260,7 +1261,8 @@ static int btrfs-congested-fn(void *congested-data, int bdi-bits)
if (!device->bdev)
continue;
bdi = blk-get-backing-dev-info(device->bdev);
- if (bdi && bdi-congested(bdi, bdi-bits)) {
+ if (bdi && (group ? bdi-congested-group(bdi, bdi-bits, page) :
+ bdi-congested(bdi, bdi-bits))) {
ret = 1;
break;
}
diff scanned = 1;
+
+ /*
+ * If the io group page will go into is congested, bail out.
+ */
+ if (wbc->nonblocking
+ && bdi-write-congested-group(bdi, pvec.pages[0])) {
+ wbc->encountered-congestion = 1;
+ done = 1;
+ pagevec-release(&pvec);
+ break;
+ }
+
for (i = 0; i < nr-pages; i++) {
struct page *page = pvec.pages[i];
diff int force-reg = 0;
+ struct page *page;
bdi = blk-get-backing-dev-info(device->bdev);
fs-info = device->dev-root->fs-info;
@@ -276,8 +277,11 @@ loop-lock:
* is now congested. Back off and let other work structs
* run instead
*/
- if (pending && bdi-write-congested(bdi) && batch-run > 32 &&
- fs-info->fs-devices->open-devices > 1) {
+ if (pending)
+ page = bio-iovec-idx(pending, 0)->bv-page;
+
+ if (pending && bdi-or-group-write-congested(bdi, page) &&
+ num-run > 32 && fs-info->fs-devices->open-devices > 1) {
struct io-context *ioc;
ioc = current->io-context;
diff
+ /*
+ * If the io group page will go into is congested, bail out.
+ */
+ if (wbc->nonblocking &&
+ bdi-write-congested-group(bdi, pvec.pages[0])) {
+ wbc->encountered-congestion = 1;
+ done = 1;
+ pagevec-release(&pvec);
+ break;
+ }
+
for (i = 0; i < nr-pages; i++) {
page = pvec.pages[i];
/*
diff bdi = inode->i-mapping->backing-dev-info;
- if (bdi-read-congested(bdi))
+ if (bdi-or-group-read-congested(bdi, NULL))
return;
if (bdi-write-congested(bdi))
return;
diff scanned = 1;
+
+ /*
+ * If the io group the page belongs to is congested, bail out.
+ */
+ if (wbc->nonblocking
+ && bdi-write-congested-group(bdi, pvec.pages[0])) {
+ wbc->encountered-congestion = 1;
+ done = 1;
+ pagevec-release(&pvec);
+ break;
+ }
+
ret = gfs2-write-jdata-pagevec(mapping, wbc, &pvec, nr-pages, end);
if (ret)
done = 1;
diff int err;
+ struct page *page = bio-iovec-idx(bio, 0)->bv-page;
- if (wi->nbio > 0 && bdi-write-congested(wi->bdi)) {
+ if (wi->nbio > 0 && bdi-or-group-write-congested(wi->bdi, page)) {
wait-for-completion(&wi->bio-event);
wi->nbio
bdi = inode->i-mapping->backing-dev-info;
wbc->nr-to-writeindex 1418b91..e95c97e 100644
+ if (bdi-or-group-read-congested(bdi, NULL))
return;
flags |= (XBF-TRYLOCK|XBF-ASYNC|XBF-READ-AHEAD);
diff
-typedef int (congested-fn)(void *, int);
+typedef int (congested-fn)(void *, int, struct page *, int);
enum bdi-stat-item {
BDI-RECLAIMABLE,
@@ -209,7 +209,7 @@ int writeback-in-progress(struct backing-dev-info *bdi);
static inline int bdi-congested(struct backing-dev-info *bdi, int bdi-bits)
{
if (bdi->congested-fn)
- return bdi->congested-fn(bdi->congested-data, bdi-bits);
+ return bdi->congested-fn(bdi->congested-data, bdi-bits, NULL, 0);
return (bdi->state & bdi-bits);
}
@@ -229,6 +229,63 @@ static inline int bdi-rw-congested(struct backing-dev-info *bdi)
(1 << BDI-async-congested));
}
+#ifdef CONFIG-GROUP-IOSCHED
+extern int bdi-congested-group(struct backing-dev-info *bdi, int bdi-bits,
+ struct page *page);
+
+extern int bdi-read-congested-group(struct backing-dev-info *bdi,
+ struct page *page);
+
+extern int bdi-or-group-read-congested(struct backing-dev-info *bdi,
+ struct page *page);
+
+extern int bdi-write-congested-group(struct backing-dev-info *bdi,
+ struct page *page);
+
+extern int bdi-or-group-write-congested(struct backing-dev-info *bdi,
+ struct page *page);
+
+extern int bdi-rw-congested-group(struct backing-dev-info *bdi,
+ struct page *page);
+#else /* CONFIG-GROUP-IOSCHED */
+static inline int bdi-congested-group(struct backing-dev-info *bdi,
+ int bdi-bits, struct page *page)
+{
+ return bdi-congested(bdi, bdi-bits);
+}
+
+static inline int bdi-read-congested-group(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-read-congested(bdi);
+}
+
+static inline int bdi-or-group-read-congested(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-read-congested(bdi);
+}
+
+static inline int bdi-write-congested-group(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-write-congested(bdi);
+}
+
+static inline int bdi-or-group-write-congested(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-write-congested(bdi);
+}
+
+static inline int bdi-rw-congested-group(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-rw-congested(bdi);
+}
+
+#endif /* CONFIG-GROUP-IOSCHED */
+
void clear-bdi-congested(struct backing-dev-info *bdi, int rw);
void set-bdi-congested(struct backing-dev-info *bdi, int rw);
long congestion-wait(int rw, long timeout);
diff
+#ifdef CONFIG-GROUP-IOSCHED
+extern int blk-queue-io-group-congested(struct backing-dev-info *bdi,
+ int bdi-bits, struct page *page);
+#endif
+
extern void blk-start-queue(struct request-queue *q);
extern void blk-stop-queue(struct request-queue *q);
extern void blk-sync-queue(struct request-queue *q);
diff #include <linux/device.h>
+#include "../block/elevator-fq.h"
void default-unplug-io-fn(struct backing-dev-info *bdi, struct page *page)
{
@@ -328,3 +329,64 @@ long congestion-wait(int rw, long timeout)
}
EXPORT-SYMBOL(congestion-wait);
+/*
+ * With group IO scheduling, there are request descriptors per io group per
+ * queue. So generic notion of whether queue is congested or not is not
+ * very accurate. Queue might not be congested but the io group in which
+ * request will go might actually be congested.
+ *
+ * Hence to get the correct idea about congestion level, one should query
+ * the io group congestion status on the queue. Pass in the page information
+ * which can be used to determine the io group of the page and congestion
+ * status can be determined accordingly.
+ *
+ * If page info is not passed, io group is determined from the current task
+ * context.
+ */
+#ifdef CONFIG-GROUP-IOSCHED
+int bdi-congested-group(struct backing-dev-info *bdi, int bdi-bits,
+ struct page *page)
+{
+ if (bdi->congested-fn)
+ return bdi->congested-fn(bdi->congested-data, bdi-bits, page, 1);
+
+ return blk-queue-io-group-congested(bdi, bdi-bits, page);
+}
+EXPORT-SYMBOL(bdi-congested-group);
+
+int bdi-read-congested-group(struct backing-dev-info *bdi, struct page *page)
+{
+ return bdi-congested-group(bdi, 1 << BDI-sync-congested, page);
+}
+EXPORT-SYMBOL(bdi-read-congested-group);
+
+/* Checks if either bdi or associated group is read congested */
+int bdi-or-group-read-congested(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-read-congested(bdi) || bdi-read-congested-group(bdi, page);
+}
+EXPORT-SYMBOL(bdi-or-group-read-congested);
+
+int bdi-write-congested-group(struct backing-dev-info *bdi, struct page *page)
+{
+ return bdi-congested-group(bdi, 1 << BDI-async-congested, page);
+}
+EXPORT-SYMBOL(bdi-write-congested-group);
+
+/* Checks if either bdi or associated group is write congested */
+int bdi-or-group-write-congested(struct backing-dev-info *bdi,
+ struct page *page)
+{
+ return bdi-write-congested(bdi) || bdi-write-congested-group(bdi, page);
+}
+EXPORT-SYMBOL(bdi-or-group-write-congested);
+
+int bdi-rw-congested-group(struct backing-dev-info *bdi, struct page *page)
+{
+ return bdi-congested-group(bdi, (1 << BDI-sync-congested) |
+ (1 << BDI-async-congested), page);
+}
+EXPORT-SYMBOL(bdi-rw-congested-group);
+
+#endif /* CONFIG-GROUP-IOSCHED */
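The dispatch pattern used by the bdi-*-congested-group() helpers and by the
stacked drivers (dm, md, btrfs) can be sketched in user space as follows.
This is only a model under stated assumptions: the struct layout, the bit
values and the "group_state" field (standing in for the per-group lookup done
by blk-queue-io-group-congested()) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

struct page;	/* opaque here; only passed through */

/* Hypothetical bit values for the two congestion directions. */
enum { SYNC_CONGESTED = 1 << 0, ASYNC_CONGESTED = 1 << 1 };

/* New callback shape: the page and a "group" flag are passed down. */
typedef int (*congested_fn_t)(void *data, int bits, struct page *page,
			      int group);

struct bdi {
	congested_fn_t congested_fn;
	void *congested_data;
	int state;		/* device-wide congestion bits */
	int group_state;	/* stand-in for the per-group lookup */
};

/* Device-wide query: stacked devices are asked with group == 0. */
int bdi_congested(struct bdi *bdi, int bits)
{
	if (bdi->congested_fn)
		return bdi->congested_fn(bdi->congested_data, bits, NULL, 0);
	return bdi->state & bits;
}

/* Per-group query: stacked devices are asked with group == 1. */
int bdi_congested_group(struct bdi *bdi, int bits, struct page *page)
{
	if (bdi->congested_fn)
		return bdi->congested_fn(bdi->congested_data, bits, page, 1);
	return bdi->group_state & bits;
}

/* Either the device or the group being congested counts as congested. */
int bdi_or_group_congested(struct bdi *bdi, int bits, struct page *page)
{
	return bdi_congested(bdi, bits) || bdi_congested_group(bdi, bits, page);
}

/* A stacked device forwards the query, and the flag, to its members. */
struct stacked {
	struct bdi *lower[2];
	int nr;
};

int stacked_congested(void *data, int bits, struct page *page, int group)
{
	struct stacked *s = data;
	int i, ret = 0;

	assert(s->nr <= 2);
	for (i = 0; i < s->nr; i++)
		ret |= group ? bdi_congested_group(s->lower[i], bits, page)
			     : bdi_congested(s->lower[i], bits);
	return ret;
}
```

Passing the flag through the callback keeps one congested-fn per driver
instead of adding a second callback for the group case, at the cost of
touching every existing implementation.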
diff
+ /*
+ * If the io group page will go into is congested, bail out.
+ */
+ if (wbc->nonblocking
+ && bdi-write-congested-group(bdi, pvec.pages[0])) {
+ wbc->encountered-congestion = 1;
+ done = 1;
+ pagevec-release(&pvec);
+ break;
+ }
+
for (i = 0; i < nr-pages; i++) {
struct page *page = pvec.pages[i];
diff */
- if (bdi-read-congested(mapping->backing-dev-info))
+ if (bdi-or-group-read-congested(mapping->backing-dev-info, NULL))
return;
/* do read-ahead */
PATCH 10/25 - io-controller: cfq changes to use hierarchical fair queuing code in elevator layer by Vivek Goyal on
2009-07-02T20:06:19+00:00
Make cfq hierarchical.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
+config IOSCHED-CFQ-HIER
+ bool "CFQ Hierarchical Scheduling support"
+ depends on IOSCHED-CFQ && CGROUPS
+ select GROUP-IOSCHED
+ default n
+
PATCH 16/25 - io-controller: noop changes for hierarchical fair queuing by Vivek Goyal on
2009-07-02T20:06:51+00:00
This patch changes noop to use queue scheduling code from elevator layer.
One can go back to old noop by deselecting CONFIG-IOSCHED-NOOP-HIER.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
+config IOSCHED-NOOP-HIER
+ bool "Noop Hierarchical Scheduling support"
+ depends on IOSCHED-NOOP && CGROUPS
+ select ELV-FAIR-QUEUING
+ select GROUP-IOSCHED
+ default n
+
PATCH 21/25 - io-controller: Per cgroup request descriptor support by Vivek Goyal on
2009-07-02T20:06:51+00:00
o Currently a request queue has a fixed number of request descriptors for
sync and async requests. Once the request descriptors are consumed, new
processes are put to sleep and they effectively become serialized. Because
sync and async queues are separate, async requests don't impact sync ones,
but if one is looking for fairness between async requests, that is not
achievable if request queue descriptors become the bottleneck.
o Make request descriptors per io group so that if there is lots of IO
going on in one cgroup, it does not impact the IO of other groups.
o This is just one relatively simple way of doing things. This patch will
probably change after the feedback. Folks have raised concerns that in a
hierarchical setup, a child's request descriptors should be capped by the
parent's request descriptors. Maybe we need to have per cgroup per device
files in cgroups where one can specify the upper limit of request descriptors,
and whenever a cgroup is created one needs to assign a request descriptor
limit, making sure the total sum of the children's request descriptors is not
more than the parent's.
I guess something like the memory controller. Anyway, that would be the next
step. For the time being, we have implemented something simpler as follows.
o This patch implements the per cgroup request descriptors. The request pool per
queue is still common but every group will have its own wait list and its
own count of request descriptors allocated to that group for sync and async
queues. So effectively request-list becomes a per io group property and not a
global request queue feature.
o Currently one can define q->nr-requests to limit request descriptors
allocated for the queue. Now there is another tunable q->nr-group-requests
which controls the request descriptor limit per group. q->nr-requests
supersedes q->nr-group-requests to make sure that if there are lots of groups
present, we don't end up allocating too many request descriptors on the
queue.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
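The two-level accounting described above can be modelled with a short
user-space sketch. It is deliberately simplified: it keeps only the hard caps
(one and a half times each limit) and ignores batching, sync/async separation
and the elevator-switch window. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical queue-wide state: shared pool accounting plus both limits. */
struct queue_data {
	unsigned int count;		/* descriptors allocated on the queue */
	unsigned int nr_requests;	/* queue-wide limit */
	unsigned int nr_group_requests;	/* per-group limit */
};

/* Hypothetical per-group request list: only the group's own count. */
struct group_list {
	unsigned int count;
};

/* Allocation must pass the queue-wide cap first, then the group's cap. */
bool try_alloc(struct queue_data *q, struct group_list *rl)
{
	if (q->count >= 3 * q->nr_requests / 2)
		return false;
	if (rl->count >= 3 * q->nr_group_requests / 2)
		return false;
	q->count++;
	rl->count++;
	return true;
}

/* Freeing undoes both counts; the group count is guarded against underflow. */
void release(struct queue_data *q, struct group_list *rl)
{
	assert(q->count > 0);
	q->count--;
	if (rl->count > 0)
		rl->count--;
}
```

With nr_requests = 8 and nr_group_requests = 4, one busy group is stopped
after 6 descriptors while a second group can still allocate, which is the
isolation between groups this patch is after.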
-diff
-static int blk-init-free-list(struct request-queue *q)
+void blk-init-request-list(struct request-list *rl)
{
- struct request-list *rl = &q->rq;
rl->count[BLK-RW-SYNC] = rl->count[BLK-RW-ASYNC] = 0;
- rl->starved[BLK-RW-SYNC] = rl->starved[BLK-RW-ASYNC] = 0;
- rl->elvpriv = 0;
init-waitqueue-head(&rl->wait[BLK-RW-SYNC]);
init-waitqueue-head(&rl->wait[BLK-RW-ASYNC]);
+}
- rl->rq-pool = mempool-create-node(BLKDEV-MIN-RQ, mempool-alloc-slab,
- mempool-free-slab, request-cachep, q->node);
+static int blk-init-free-list(struct request-queue *q)
+{
+ /*
+ * Initialize the queue request list in case there are non-hierarchical
+ * io schedulers not making use of fair queuing infrastructure.
+ *
+ * For ioschedulers making use of fair queuing infrastructure, request
+ * list is inside the associated group and when that group is
+ * instantiated, it takes care of initializing the request list also.
+ */
+ blk-init-request-list(&q->rq);
+ q->rq-data.rq-pool = mempool-create-node(BLKDEV-MIN-RQ,
+ mempool-alloc-slab, mempool-free-slab,
+ request-cachep, q->node);
- if (!rl->rq-pool)
+ if (!q->rq-data.rq-pool)
return -ENOMEM;
return 0;
@@ -575,6 +585,9 @@ blk-init-queue-node(request-fn-proc *rfn, spinlock-t *lock, int node-id)
return NULL;
}
+ /* init starved waiter wait queue */
+ init-waitqueue-head(&q->rq-data.starved-wait);
+
/*
* if caller didn't supply a lock, they get per-queue locking with
* our embedded lock
@@ -624,14 +637,14 @@ static inline void blk-free-request(struct request-queue *q, struct request *rq)
{
if (rq->cmd-flags & REQ-ELVPRIV)
elv-put-request(q, rq);
- mempool-free(rq, q->rq.rq-pool);
+ mempool-free(rq, q->rq-data.rq-pool);
}
static struct request *
blk-alloc-request(struct request-queue *q, struct bio *bio, int flags, int priv,
gfp-t gfp-mask)
{
- struct request *rq = mempool-alloc(q->rq.rq-pool, gfp-mask);
+ struct request *rq = mempool-alloc(q->rq-data.rq-pool, gfp-mask);
if (!rq)
return NULL;
@@ -642,7 +655,7 @@ blk-alloc-request(struct request-queue *q, struct bio *bio, int flags, int priv,
if (priv) {
if (unlikely(elv-set-request(q, rq, bio, gfp-mask))) {
- mempool-free(rq, q->rq.rq-pool);
+ mempool-free(rq, q->rq-data.rq-pool);
return NULL;
}
rq->cmd-flags |= REQ-ELVPRIV;
@@ -685,18 +698,18 @@ static void ioc-set-batching(struct request-queue *q, struct io-context *ioc)
ioc->last-waited = jiffies;
}
-static void + if (q->rq-data.count[sync] < queue-congestion-off-threshold(q))
blk-clear-queue-congested(q, sync);
- if (rl->count[sync] + 1 <= q->nr-requests) {
+ if (q->rq-data.count[sync] + 1 <= q->nr-requests)
+ blk-clear-queue-full(q, sync);
+
+ if (rl->count[sync] + 1 <= q->nr-group-requests) {
if (waitqueue-active(&rl->wait[sync]))
wake-up(&rl->wait[sync]);
-
- blk-clear-queue-full(q, sync);
}
}
@@ -704,63 +717,133 @@ static void
+static void freed-request(struct request-queue *q, int sync, int priv,
+ struct request-list *rl)
+{
+ /* There is a window during request allocation where request is
+ * mapped to one group but by the time a queue for the group is
+ * allocated, it is possible that original cgroup/io group has been
+ * deleted and now io queue is allocated in a different group (root)
+ * altogether.
+ *
+ * One solution to the problem is that rq should take io group
+ * reference. But it looks too much to do that to solve this issue.
+ * The only side effect of this hard-to-hit issue seems to be that
+ * we will try to decrement the rl->count for a request list which
+ * did not allocate that request. Check for rl->count going less than
+ * zero and do not decrement it if that's the case.
+ */
+
+ if (priv && rl->count[sync] > 0)
+ rl->count[sync]- rl->elvpriv- + q->rq-data.starved+ * any request descriptor but we deny request allocation due to global
+ * limits. In that case one should sleep on global list as on this request
+ * list no wakeup will take place.
+ *
+ * Also sets the request list starved flag if there are no requests pending
+ * in the direction of rq.
+ *
+ * Return 1 + * in same io group, then set the starved flag of
+ * the group request list. Otherwise, we need to
+ * make this process sleep in global starved list
+ * to make sure it will not sleep indefinitely.
+ */
+ if (rl->count[is-sync ^ 1] != 0) {
+ rl->starved[is-sync] = 1;
+ return 1;
+ } else
+ return 0;
+ }
+
+ return 1;
}
/*
* Get a free request, queue-lock must be held.
- * Returns NULL on failure, with queue-lock held.
+ * Returns NULL on failure, with queue-lock held. Also sets the "reason" field
+ * in case of failure. This reason field helps the caller decide whether to sleep
+ * on per group list or global per queue list.
+ * reason = 0 sleep on per group list
+ * reason = 1 sleep on global list
+ *
* Returns !NULL on success, with queue-lock *not held*.
*/
static struct request *get-request(struct request-queue *q, int rw-flags,
- struct bio *bio, gfp-t gfp-mask)
+ struct bio *bio, gfp-t gfp-mask,
+ struct request-list *rl, int *reason)
{
struct request *rq = NULL;
- struct request-list *rl = &q->rq;
struct io-context *ioc = NULL;
const bool is-sync = rw-is-sync(rw-flags) != 0;
int may-queue, priv;
+ int sleep-on-global = 0;
may-queue = elv-may-queue(q, rw-flags);
if (may-queue == ELV-MQUEUE-NO)
goto rq-starved;
- if (rl->count[is-sync]+1 >= queue-congestion-on-threshold(q)) {
- if (rl->count[is-sync]+1 >= q->nr-requests) {
- ioc = current-io-context(GFP-ATOMIC, q->node);
- /*
- * The queue will fill after this allocation, so set
- * it as full, and mark this process as "batching".
- * This process will be allowed to complete a batch of
- * requests, others will be blocked.
- */
- if (!blk-queue-full(q, is-sync)) {
- ioc-set-batching(q, ioc);
- blk-set-queue-full(q, is-sync);
- } else {
- if (may-queue != ELV-MQUEUE-MUST
- && !ioc-batching(q, ioc)) {
- /*
- * The queue is full and the allocating
- * process is not a "batcher", and not
- * exempted by the IO scheduler
- */
- goto out;
- }
+ if (q->rq-data.count[is-sync]+1 >= queue-congestion-on-threshold(q))
+ blk-set-queue-congested(q, is-sync);
+
+ /*
+ * Looks like there is no user of queue full now.
+ * Keeping it for time being.
+ */
+ if (q->rq-data.count[is-sync]+1 >= q->nr-requests)
+ blk-set-queue-full(q, is-sync);
+
+ if (rl->count[is-sync]+1 >= q->nr-group-requests) {
+ ioc = current-io-context(GFP-ATOMIC, q->node);
+ /*
+ * The queue request descriptor group will fill after this
+ * allocation, so set
+ * it as full, and mark this process as "batching".
+ * This process will be allowed to complete a batch of
+ * requests, others will be blocked.
+ */
+ if (rl->count[is-sync] <= q->nr-group-requests)
+ ioc-set-batching(q, ioc);
+ else {
+ if (may-queue != ELV-MQUEUE-MUST
+ && !ioc-batching(q, ioc)) {
+ /*
+ * The queue is full and the allocating
+ * process is not a "batcher", and not
+ * exempted by the IO scheduler
+ */
+ goto out;
}
}
- blk-set-queue-congested(q, is-sync);
}
/*
@@ -768,21 +851,60 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
* limit of requests, otherwise we could have thousands of requests
* allocated with any setting of ->nr-requests
*/
- if (rl->count[is-sync] >= (3 * q->nr-requests / 2))
+
+ if (q->rq-data.count[is-sync] >= (3 * q->nr-requests / 2)) {
+ /*
+ * Queue is too full for allocation. On which request queue
+ * the task should sleep? Generally it should sleep on its
+ * request list but if elevator switch is happening, in that
+ * window, request descriptors are allocated from global
+ * pool and are not accounted against any particular request
+ * list as group is going away.
+ *
+ * So it might happen that request list does not have any
+ * requests allocated at all and if process sleeps on per
+ * group request list, it will not be woken up. In such case,
+ * make it sleep on global starved list.
+ */
+ if (test-bit(QUEUE-FLAG-ELVSWITCH, &q->queue-flags)
+ || !can-sleep-on-request-list(rl, is-sync))
+ sleep-on-global = 1;
+ goto out;
+ }
+
+ /*
+ * Allocation of request is allowed from queue perspective. Now check
+ * from per group request list
+ */
+
+ if (rl->count[is-sync] >= (3 * q->nr-group-requests / 2))
goto out;
- rl->count[is-sync]++;
rl->starved[is-sync] = 0;
+ q->rq-data.count[is-sync]++;
+
priv = !test-bit(QUEUE-FLAG-ELVSWITCH, &q->queue-flags);
- if (priv)
- rl->elvpriv++;
+ if (priv) {
+ q->rq-data.elvpriv++;
+ /*
+ * Account the request to request list only if request is
+ * going to elevator. During elevator switch, there will
+ * be small window where group is going away and new group
+ * will not be allocated till elevator switch is complete.
+ * So till then instead of slowing down the application,
+ * we will continue to allocate request from total common
+ * pool instead of per group limit
+ */
+ rl->count[is-sync]++;
+ }
if (blk-queue-io-stat(q))
rw-flags |= REQ-IO-STAT;
spin-unlock-irq(q->queue-lock);
rq = blk-alloc-request(q, bio, rw-flags, priv, gfp-mask);
+
if (unlikely(!rq)) {
/*
* Allocation failed presumably due to memory. Undo anything
@@ -792,7 +914,7 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
* wait queue, but this is pretty rare.
*/
spin-lock-irq(q->queue-lock);
- freed-request(q, is-sync, priv);
+ freed-request(q, is-sync, priv, rl);
/*
* in the very unlikely event that allocation failed and no
@@ -802,9 +924,8 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
* rq mempool into READ and WRITE
*/
rq-starved:
- if (unlikely(rl->count[is-sync] == 0))
- rl->starved[is-sync] = 1;
-
+ if (!can-sleep-on-request-list(rl, is-sync))
+ sleep-on-global = 1;
goto out;
}
@@ -819,6 +940,8 @@ rq-starved:
trace-block-getrq(q, bio, rw-flags & 1);
out:
+ if (reason && sleep-on-global)
+ *reason = 1;
return rq;
}
@@ -832,16 +955,44 @@ static struct request *get-request-wait(struct request-queue *q, int rw-flags,
struct bio *bio)
{
const bool is-sync = rw-is-sync(rw-flags) != 0;
+ int sleep-on-global = 0;
struct request *rq;
+ struct request-list *rl = blk-get-request-list(q, bio);
+ struct io-group *iog = NULL;
- rq = get-request(q, rw-flags, bio, GFP-NOIO);
+ rq = get-request(q, rw-flags, bio, GFP-NOIO, rl, &sleep-on-global);
while (!rq) {
DEFINE-WAIT(wait);
struct io-context *ioc;
- struct request-list *rl = &q->rq;
- prepare-to-wait-exclusive(&rl->wait[is-sync], &wait,
- TASK-UNINTERRUPTIBLE);
+ if (sleep-on-global) {
+ /*
+ * Task failed allocation and needs to wait and
+ * try again. There are no requests pending from
+ * the io group hence need to sleep on global
+ * wait queue. Most likely the allocation failed
+ * because of memory issues.
+ */
+
+ q->rq-data.starved++;
+ prepare-to-wait-exclusive(&q->rq-data.starved-wait,
+ &wait, TASK-UNINTERRUPTIBLE);
+ } else {
+ /*
+ * We are about to sleep on a request list and we
+ * drop queue lock. After waking up, we will do
+ * finish-wait() on request list and in the mean
+ * time group might be gone. Take a reference to
+ * the group now.
+ */
+ prepare-to-wait-exclusive(&rl->wait[is-sync], &wait,
+ TASK-UNINTERRUPTIBLE);
+#ifdef CONFIG-GROUP-IOSCHED
+ iog = rl-iog(rl);
+ if (iog)
+ elv-get-iog(iog);
+#endif
+ }
trace-block-sleeprq(q, bio, rw-flags & 1);
@@ -859,9 +1010,30 @@ static struct request *get-request-wait(struct request-queue *q, int rw-flags,
ioc-set-batching(q, ioc);
spin-lock-irq(q->queue-lock);
- finish-wait(&rl->wait[is-sync], &wait);
- rq = get-request(q, rw-flags, bio, GFP-NOIO);
+ if (sleep-on-global) {
+ finish-wait(&q->rq-data.starved-wait, &wait);
+ sleep-on-global = 0;
+ } else {
+ finish-wait(&rl->wait[is-sync], &wait);
+#ifdef CONFIG-GROUP-IOSCHED
+ /*
+ * We had taken a reference to the rl/iog.
+ * Put that now
+ */
+ iog = rl-iog(rl);
+ if (iog)
+ elv-put-iog(iog);
+#endif
+ }
+
+ /*
+ * After the sleep, check the rl again in case the cgroup the bio
+ * belonged to is gone and the bio is now mapped to the root group
+ */
+ rl = blk-get-request-list(q, bio);
+ rq = get-request(q, rw-flags, bio, GFP-NOIO, rl,
+ &sleep-on-global);
};
return rq;
@@ -870,14 +1042,16 @@ static struct request *get-request-wait(struct request-queue *q, int rw-flags,
struct request *blk-get-request(struct request-queue *q, int rw, gfp-t gfp-mask)
{
struct request *rq;
+ struct request-list *rl;
BUG-ON(rw != READ && rw != WRITE);
spin-lock-irq(q->queue-lock);
+ rl = blk-get-request-list(q, NULL);
if (gfp-mask & }
@@ -1094,12 +1268,13 @@ void BUG-ON(!hlist-unhashed(&req->hash));
blk-free-request(q, req);
- freed-request(q, is-sync, priv);
+ freed-request(q, is-sync, priv, rl);
}
}
EXPORT-SYMBOL-GPL( */
q->nr-requests = BLKDEV-MAX-RQ;
+ q->nr-group-requests = BLKDEV-MAX-GROUP-RQ;
q->make-request-fn = mfn;
blk-queue-dma-alignment(q, 511);
diff {
- struct request-list *rl = &q->rq;
+ struct request-list *rl;
unsigned long nr;
int ret = queue-var-store(&nr, page, count);
if (nr < BLKDEV-MIN-RQ)
nr = BLKDEV-MIN-RQ;
spin-lock-irq(q->queue-lock);
+ rl = blk-get-request-list(q, NULL);
q->nr-requests = nr;
blk-queue-congestion-threshold(q);
- if (rl->count[BLK-RW-SYNC] >= queue-congestion-on-threshold(q))
+ if (q->rq-data.count[BLK-RW-SYNC] >= queue-congestion-on-threshold(q))
blk-set-queue-congested(q, BLK-RW-SYNC);
- else if (rl->count[BLK-RW-SYNC] < queue-congestion-off-threshold(q))
+ else if (q->rq-data.count[BLK-RW-SYNC] <
+ queue-congestion-off-threshold(q))
blk-clear-queue-congested(q, BLK-RW-SYNC);
- if (rl->count[BLK-RW-ASYNC] >= queue-congestion-on-threshold(q))
+ if (q->rq-data.count[BLK-RW-ASYNC] >= queue-congestion-on-threshold(q))
blk-set-queue-congested(q, BLK-RW-ASYNC);
- else if (rl->count[BLK-RW-ASYNC] < queue-congestion-off-threshold(q))
+ else if (q->rq-data.count[BLK-RW-ASYNC] <
+ queue-congestion-off-threshold(q))
blk-clear-queue-congested(q, BLK-RW-ASYNC);
- if (rl->count[BLK-RW-SYNC] >= q->nr-requests) {
+ if (q->rq-data.count[BLK-RW-SYNC] >= q->nr-requests) {
blk-set-queue-full(q, BLK-RW-SYNC);
- } else if (rl->count[BLK-RW-SYNC]+1 <= q->nr-requests) {
+ } else if (q->rq-data.count[BLK-RW-SYNC]+1 <= q->nr-requests) {
blk-clear-queue-full(q, BLK-RW-SYNC);
wake-up(&rl->wait[BLK-RW-SYNC]);
}
- if (rl->count[BLK-RW-ASYNC] >= q->nr-requests) {
+ if (q->rq-data.count[BLK-RW-ASYNC] >= q->nr-requests) {
blk-set-queue-full(q, BLK-RW-ASYNC);
- } else if (rl->count[BLK-RW-ASYNC]+1 <= q->nr-requests) {
+ } else if (q->rq-data.count[BLK-RW-ASYNC]+1 <= q->nr-requests) {
blk-clear-queue-full(q, BLK-RW-ASYNC);
wake-up(&rl->wait[BLK-RW-ASYNC]);
}
spin-unlock-irq(q->queue-lock);
return ret;
}
+#ifdef CONFIG-GROUP-IOSCHED
+static ssize-t queue-group-requests-show(struct request-queue *q, char *page)
+{
+ return queue-var-show(q->nr-group-requests, (page));
+}
+
+static ssize-t
+queue-group-requests-store(struct request-queue *q, const char *page,
+ size-t count)
+{
+ unsigned long nr;
+ int ret = queue-var-store(&nr, page, count);
+ if (nr < BLKDEV-MIN-RQ)
+ nr = BLKDEV-MIN-RQ;
+
+ spin-lock-irq(q->queue-lock);
+ q->nr-group-requests = nr;
+ spin-unlock-irq(q->queue-lock);
+ return ret;
+}
+#endif
static ssize-t queue-ra-show(struct request-queue *q, char *page)
{
@@ -239,6 +263,14 @@ static struct queue-sysfs-entry queue-requests-entry = {
.store = queue-requests-store,
};
+#ifdef CONFIG-GROUP-IOSCHED
+static struct queue-sysfs-entry queue-group-requests-entry = {
+ .attr = {.name = "nr-group-requests", .mode = S-IRUGO | S-IWUSR },
+ .show = queue-group-requests-show,
+ .store = queue-group-requests-store,
+};
+#endif
+
static struct queue-sysfs-entry queue-ra-entry = {
.attr = {.name = "read-ahead-kb", .mode = S-IRUGO | S-IWUSR },
.show = queue-ra-show,
@@ -313,6 +345,9 @@ static struct queue-sysfs-entry queue-iostats-entry = {
static struct attribute *default-attrs[] = {
&queue-requests-entry.attr,
+#ifdef CONFIG-GROUP-IOSCHED
+ &queue-group-requests-entry.attr,
+#endif
&queue-ra-entry.attr,
&queue-max-hw-sectors-entry.attr,
&queue-max-sectors-entry.attr,
@@ -392,12 +427,11 @@ static void blk-release-queue(struct kobject *kobj)
{
struct request-queue *q =
container-of(kobj, struct request-queue, kobj);
- struct request-list *rl = &q->rq;
blk-sync-queue(q);
- if (rl->rq-pool)
- mempool-destroy(rl->rq-pool);
+ if (q->rq-data.rq-pool)
+ mempool-destroy(q->rq-data.rq-pool);
if (q->queue-tags)
}
+struct request-list *io-group-get-request-list(struct request-queue *q,
+ struct bio *bio)
+{
+ struct io-group *iog;
+
+ iog = io-get-io-group-bio(q, bio, 1);
+ BUG-ON(!iog);
+ return &iog->rl;
+}
+
/*
* Search the io-group for efqd into the hash table (by now only a list)
* of bgrp. Must be called under rcu-read-lock().
@@ -1390,6 +1400,8 @@ io-group-chain-alloc(struct request-queue *q, void *key, struct cgroup *cgroup)
*/
elv-get-iog(iog);
+ blk-init-request-list(&iog->rl);
+
if (leaf == NULL) {
leaf = iog;
prev = leaf;
@@ -1667,6 +1679,8 @@ static struct io-group *io-alloc-root-group(struct request-queue *q,
for (i = 0; i < IO-IOPRIO-CLASSES; i++)
iog->sched-data.service-tree[i] = IO-SERVICE-TREE-INIT;
+ blk-init-request-list(&iog->rl);
+
iocg = &io-root-cgroup;
spin-lock-irq(&iocg->lock);
rcu-assign-pointer(iog->key, key);
diff struct io-queue *ioq;
+
+ /* request list associated with the group */
+ struct request-list rl;
};
/**
@@ -526,6 +529,8 @@ extern void elv-fq-unset-request-ioq(struct request-queue *q,
struct request *rq);
extern struct io-queue *elv-lookup-ioq-bio(struct request-queue *q,
struct bio *bio);
+extern struct request-list *io-group-get-request-list(struct request-queue *q,
+ struct bio *bio);
/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
static inline void io-group-set-ioq(struct io-group *iog, struct io-queue *ioq)
diff elv-drain-elevator(q);
- while (q->rq.elvpriv) {
+ while (q->rq-data.elvpriv) {
- int nrq = q->rq.count[BLK-RW-SYNC] + q->rq.count[BLK-RW-ASYNC]
- - queue-in-flight(q);
+ int nrq = q->rq-data.count[BLK-RW-SYNC] +
+ q->rq-data.count[BLK-RW-ASYNC] -
+ queue-in-flight(q);
if (nrq >= q->unplug-thresh)
#define BLKDEV-MIN-RQ 4
+
+#ifdef CONFIG-GROUP-IOSCHED
+#define BLKDEV-MAX-RQ 512 /* Default maximum for queue */
+#define BLKDEV-MAX-GROUP-RQ 128 /* Default maximum per group*/
+#else
#define BLKDEV-MAX-RQ 128 /* Default maximum */
+/*
+ * This is equivalent to the case of only one group present (root group).
+ * Let it consume all the request descriptors available on the queue.
+ */
+#define BLKDEV-MAX-GROUP-RQ BLKDEV-MAX-RQ /* Default maximum */
+#endif
struct request;
typedef void (rq-end-io-fn)(struct request *, int);
struct request-list {
/*
- * count[], starved[], and wait[] are indexed by
+ * count[], starved and wait[] are indexed by
* BLK-RW-SYNC/BLK-RW-ASYNC
*/
int count[2];
int starved[2];
+ wait-queue-head-t wait[2];
+};
+
+/*
+ * This data structure keeps track of the mempool of requests for the queue
+ * and some overall statistics.
+ */
+struct request-data {
+ /*
+ * Per queue request descriptor count. This is in addition to per
+ * cgroup count
+ */
+ int count[2];
int elvpriv;
mempool-t *rq-pool;
- wait-queue-head-t wait[2];
+ int starved;
+ /*
+ * Global list for starved tasks. A task will be queued here if
+ * it could not allocate request descriptor and the associated
+ * group request list does not have any requests pending.
+ */
+ wait-queue-head-t starved-wait;
};
/*
@@ -355,6 +385,9 @@ struct request-queue
*/
struct request-list rq;
+ /* Contains request pool and other data like starved data */
+ struct request-data rq-data;
+
request-fn-proc *request-fn;
make-request-fn *make-request-fn;
prep-rq-fn *prep-rq-fn;
@@ -416,6 +449,8 @@ struct request-queue
* queue settings
*/
unsigned long nr-requests; /* Max # of requests */
+ /* Max # of per io group requests */
+ unsigned long nr-group-requests;
unsigned int nr-congestion-on;
unsigned int nr-congestion-off;
unsigned int nr-batching;
@@ -795,6 +830,54 @@ extern int scsi-cmd-ioctl(struct request-queue *, struct gendisk *, fmode-t,
extern int sg-scsi-ioctl(struct request-queue *, struct gendisk *, fmode-t,
struct scsi-ioctl-command +#ifdef CONFIG-GROUP-IOSCHED
+ if (!elv-iosched-fair-queuing-enabled(q->elevator))
+ return &q->rq;
+
+ return io-group-get-request-list(q, bio);
+#else
+ return &q->rq;
+#endif
+}
+
+static inline struct request-list *rq-rl(struct request-queue *q,
+ struct request *rq)
+{
+#ifdef CONFIG-GROUP-IOSCHED
+ struct io-group *iog;
+ int priv = rq->cmd-flags & REQ-ELVPRIV;
+
+ if (!elv-iosched-fair-queuing-enabled(q->elevator))
+ return &q->rq;
+
+ BUG-ON(priv && !rq->ioq);
+
+ if (priv)
+ iog = ioq-to-io-group(rq->ioq);
+ else
+ iog = q->elevator->efqd.root-group;
+
+ BUG-ON(!iog);
+ return &iog->rl;
+#else
+ return &q->rq;
+#endif
+}
+
+static inline struct io-group *rl-iog(struct request-list *rl)
+{
+#ifdef CONFIG-GROUP-IOSCHED
+ return container-of(rl, struct io-group, rl);
+#else
+ return NULL;
+#endif
+}
+
/*
* A queue has just exitted congestion. Note this in the global counter of
* congested queues, and wake up anyone who was waiting for requests to be
PATCH 18/25 - io-controller: anticipatory changes for hierarchical fair queuing by Vivek Goyal on
2009-07-02T20:07:22+00:00
This patch changes the anticipatory scheduler to use the queue scheduling code
from the elevator layer. One can go back to the old AS by deselecting
CONFIG-IOSCHED-AS-HIER. Even with CONFIG-IOSCHED-AS-HIER=y, as long as no
other cgroup has been created, AS behavior should remain the same as before.
o AS is a single queue io scheduler, which means there is one AS queue per group.
o The common layer code selects the queue to dispatch from based on fairness,
and then the AS code selects the request within the group.
o AS runs batches of reads and writes within a group. So the common layer runs
timed group queues and, within a group's time, AS runs timed batches of reads
and writes.
o Note: Previously the AS write batch length was adjusted dynamically whenever
a W->R batch data direction switch took place and the first request from the
read batch completed.
Now the write batch length is updated when the last request from the write
batch has finished during the W->R transition.
o AS runs its own anticipation logic to anticipate on reads. The common layer
also anticipates on the group if the think time of the group is within
slice-idle.
o Introduced a few debugging messages in AS.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
+config IOSCHED-AS-HIER
+ bool "Anticipatory Hierarchical Scheduling support"
+ depends on IOSCHED-AS && CGROUPS
+ select ELV-FAIR-QUEUING
+ select GROUP-IOSCHED
+ default n
+
PATCH 19/25 - blkio_cgroup patches from Ryo to track async bios. by Vivek Goyal on
2009-07-02T20:07:22+00:00
o blkio-cgroup patches from Ryo to track async bios.
o Fernando is also working on another IO tracking mechanism. We are not
tied to any particular IO tracking mechanism. This patchset can make use
of whichever mechanism makes it upstream. For the time being, it makes
use of Ryo's posting.
Based on 2.6.30-rc3-git3
Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp>
Signed-off-by: Ryo Tsuruta <ryov@valinux.co.jp>
-diff
+void init-io-context(struct io-context *ioc)
+{
+ atomic-long-set(&ioc->refcount, 1);
+ atomic-set(&ioc->nr-tasks, 1);
+ spin-lock-init(&ioc->lock);
+ ioc->ioprio-changed = 0;
+ ioc->ioprio = 0;
+#ifdef CONFIG-GROUP-IOSCHED
+ ioc->cgroup-changed = 0;
+#endif
+ ioc->last-waited = jiffies; /* doesn't matter... */
+ ioc->nr-batch-requests = 0; /* because this is 0 */
+ ioc->aic = NULL;
+ INIT-RADIX-TREE(&ioc->radix-root, GFP-ATOMIC | struct io-context *ret;
ret = kmem-cache-alloc-node(iocontext-cachep, gfp-flags, node);
- if (ret) {
- atomic-long-set(&ret->refcount, 1);
- atomic-set(&ret->nr-tasks, 1);
- spin-lock-init(&ret->lock);
- ret->ioprio-changed = 0;
- ret->ioprio = 0;
-#ifdef CONFIG-GROUP-IOSCHED
- ret->cgroup-changed = 0;
-#endif
- ret->last-waited = jiffies; /* doesn't matter... */
- ret->nr-batch-requests = 0; /* because this is 0 */
- ret->aic = NULL;
- INIT-RADIX-TREE(&ret->radix-root, GFP-ATOMIC | return ret;
}
diff #include <linux/bio.h>
+#include <linux/biotrack.h>
#include <linux/notifier.h>
#include <linux/cpu.h>
#include <linux/bitops.h>
@@ -668,6 +669,7 @@ static void }
diff #include <linux/buffer-head.h>
+#include <linux/biotrack.h>
#include <linux/rwsem.h>
#include <linux/uio.h>
#include <asm/atomic.h>
@@ -797,6 +798,7 @@ static int do-direct-IO(struct dio *dio)
ret = PTR-ERR(page);
goto out;
}
+ blkio-cgroup-reset-owner(page, current->mm);
while (block-in-page < blocks-per-page) {
unsigned offset-in-page = block-in-page << blkbits;
diff +#include <linux/mm.h>
+#include <linux/page-cgroup.h>
+
+#ifndef -LINUX-BIOTRACK-H
+#define -LINUX-BIOTRACK-H
+
+#ifdef CONFIG-CGROUP-BLKIO
+
+struct io-context;
+struct block-device;
+
+struct blkio-cgroup {
+ struct cgroup-subsys-state css;
+ struct io-context *io-context; /* default io-context */
+/* struct radix-tree-root io-context-root; per device io-context */
+};
+
+/**
+ * + lock-page-cgroup(pc);
+ page-cgroup-set-id(pc, 0);
+ unlock-page-cgroup(pc);
+}
+
+/**
+ * blkio-cgroup-disabled - check whether blkio-cgroup is disabled
+ *
+ * Returns true if disabled, false if not.
+ */
+static inline bool blkio-cgroup-disabled(void)
+{
+ if (blkio-cgroup-subsys.disabled)
+ return true;
+ return false;
+}
+
+extern void blkio-cgroup-set-owner(struct page *page, struct mm-struct *mm);
+extern void blkio-cgroup-reset-owner(struct page *page, struct mm-struct *mm);
+extern void blkio-cgroup-reset-owner-pagedirty(struct page *page,
+ struct mm-struct *mm);
+extern void blkio-cgroup-copy-owner(struct page *page, struct page *opage);
+
+extern struct io-context *get-blkio-cgroup-iocontext(struct bio *bio);
+extern unsigned long get-blkio-cgroup-id(struct bio *bio);
+extern unsigned long get-blkio-cgroup-id-page(struct page *page);
+extern struct cgroup *blkio-cgroup-lookup(int id);
+
+#else /* CONFIG-CGROUP-BLKIO */
+
+struct blkio-cgroup;
+
+static inline void +}
+
+static inline void blkio-cgroup-set-owner(struct page *page, struct mm-struct *mm)
+{
+}
+
+static inline void blkio-cgroup-reset-owner(struct page *page,
+ struct mm-struct *mm)
+{
+}
+
+static inline void blkio-cgroup-reset-owner-pagedirty(struct page *page,
+ struct mm-struct *mm)
+{
+}
+
+static inline void blkio-cgroup-copy-owner(struct page *page, struct page *opage)
+{
+}
+
+static inline struct io-context *get-blkio-cgroup-iocontext(struct bio *bio)
+{
+ return NULL;
+}
+
+static inline unsigned long get-blkio-cgroup-id(struct bio *bio)
+{
+ return 0;
+}
+
+static inline unsigned long get-blkio-cgroup-id-page(struct page *page)
+{
+ return 0;
+}
+
+#endif /* CONFIG-CGROUP-BLKIO */
+
+#endif /* -LINUX-BIOTRACK-H */
diff
+#ifdef CONFIG-CGROUP-BLKIO
+SUBSYS(blkio-cgroup)
+#endif
+
+/* */
+
#ifdef CONFIG-CGROUP-DEVICE
SUBSYS(devices)
#endif
diff struct io-context *alloc-io-context(gfp-t gfp-flags, int node);
+void init-io-context(struct io-context *ioc);
void copy-io-context(struct io-context **pdst, struct io-context **psrc);
#else
static inline void exit-io-context(void)
diff
+extern void struct mem-cgroup;
+static inline void diff struct page *node-mem-map;
-#ifdef CONFIG-CGROUP-MEM-RES-CTLR
+#ifdef CONFIG-CGROUP-PAGE
struct page-cgroup *node-page-cgroup;
#endif
#endif
@@ -956,7 +956,7 @@ struct mem-section {
/* See declaration of similar field in struct zone */
unsigned long *pageblock-flags;
-#ifdef CONFIG-CGROUP-MEM-RES-CTLR
+#ifdef CONFIG-CGROUP-PAGE
/*
* If !SPARSEMEM, pgdat doesn't have page-cgroup pointer. We use
* section. (see memcontrol.h/page-cgroup.h about this.)
diff
-#ifdef CONFIG-CGROUP-MEM-RES-CTLR
+#ifdef CONFIG-CGROUP-PAGE
#include <linux/bit-spinlock.h>
/*
* Page Cgroup can be considered as an extended mem-map.
@@ -12,9 +12,11 @@
*/
struct page-cgroup {
unsigned long flags;
- struct mem-cgroup *mem-cgroup;
struct page *page;
+#ifdef CONFIG-CGROUP-MEM-RES-CTLR
+ struct mem-cgroup *mem-cgroup;
struct list-head lru; /* per cgroup LRU list */
+#endif
};
void struct page-cgroup;
static inline void +/*
+ * use lower 16 bits for flags and reserve the rest for the page tracking id
+ */
+#define PCG-TRACKING-ID-SHIFT (16)
+#define PCG-TRACKING-ID-BITS \
+ (8 * sizeof(unsigned long) - PCG-TRACKING-ID-SHIFT)
+
+/* NOTE: must be called with lock-page-cgroup() held */
+static inline unsigned long page-cgroup-get-id(struct page-cgroup *pc)
+{
+ return pc->flags >> PCG-TRACKING-ID-SHIFT;
+}
+
+/* NOTE: must be called with lock-page-cgroup() held */
+static inline void page-cgroup-set-id(struct page-cgroup *pc, unsigned long id)
+{
+ WARN-ON(id >= (1UL << PCG-TRACKING-ID-BITS));
+ pc->flags &= (1UL << PCG-TRACKING-ID-SHIFT) - 1;
+ pc->flags |= (unsigned long)(id << PCG-TRACKING-ID-SHIFT);
+}
+#endif
#endif
diff
+config CGROUP-BLKIO
+ bool "Block I/O cgroup subsystem"
+ depends on CGROUPS && BLOCK
+ select MM-OWNER
+ help
+ Provides a Resource Controller which makes it possible to track the
+ owner of every block I/O request.
+ The information this subsystem provides can be used from any
+ kind of module such as dm-ioband device mapper modules or
+ the cfq-scheduler.
+
+config CGROUP-PAGE
+ def-bool y
+ depends on CGROUP-MEM-RES-CTLR || CGROUP-BLKIO
+
config MM-OWNER
bool
diff obj-$(CONFIG-QUICKLIST) += quicklist.o
-obj-$(CONFIG-CGROUP-MEM-RES-CTLR) += memcontrol.o page-cgroup.o
+obj-$(CONFIG-CGROUP-MEM-RES-CTLR) += memcontrol.o
+obj-$(CONFIG-CGROUP-PAGE) += page-cgroup.o
+obj-$(CONFIG-CGROUP-BLKIO) += biotrack.o
obj-$(CONFIG-DEBUG-KMEMLEAK) += kmemleak.o
obj-$(CONFIG-DEBUG-KMEMLEAK-TEST) += kmemleak-test.o
diff + *
+ * Copyright (C) VA Linux Systems Japan, 2008-2009
+ * Developed by Hirokazu Takahashi <taka@valinux.co.jp>
+ *
+ * Copyright (C) 2008 Andrea Righi <righi.andrea@gmail.com>
+ * Use part of page-cgroup->flags to store blkio-cgroup ID.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ */
+
+#include <linux/module.h>
+#include <linux/smp.h>
+#include <linux/bit-spinlock.h>
+#include <linux/blkdev.h>
+#include <linux/biotrack.h>
+#include <linux/mm-inline.h>
+
+/*
+ * The block I/O tracking mechanism is implemented on the cgroup memory
+ * controller framework. It helps to find the owner of an I/O request
+ * because every I/O request has a target page and the owner of the page
+ * can be easily determined on the framework.
+ */
+
+/* Return the blkio-cgroup that associates with a cgroup. */
+static inline struct blkio-cgroup *cgroup-blkio(struct cgroup *cgrp)
+{
+ return container-of(cgroup-subsys-state(cgrp, blkio-cgroup-subsys-id),
+ struct blkio-cgroup, css);
+}
+
+/* Return the blkio-cgroup that associates with a process. */
+static inline struct blkio-cgroup *blkio-cgroup-from-task(struct task-struct *p)
+{
+ return container-of(task-subsys-state(p, blkio-cgroup-subsys-id),
+ struct blkio-cgroup, css);
+}
+
+static struct io-context default-blkio-io-context;
+static struct blkio-cgroup default-blkio-cgroup = {
+ .io-context = &default-blkio-io-context,
+};
+
+/**
+ * blkio-cgroup-set-owner() - set the owner ID of a page.
+ * @page: the page we want to tag
+ * @mm: the mm-struct of a page owner
+ *
+ * Make a given page have the blkio-cgroup ID of the owner of this page.
+ */
+void blkio-cgroup-set-owner(struct page *page, struct mm-struct *mm)
+{
+ struct blkio-cgroup *biog;
+ struct page-cgroup *pc;
+ unsigned long id;
+
+ if (blkio-cgroup-disabled())
+ return;
+ pc = lookup-page-cgroup(page);
+ if (unlikely(!pc))
+ return;
+
+ lock-page-cgroup(pc);
+ page-cgroup-set-id(pc, 0); /* 0: default blkio-cgroup id */
+ unlock-page-cgroup(pc);
+ if (!mm)
+ return;
+
+ rcu-read-lock();
+ biog = blkio-cgroup-from-task(rcu-dereference(mm->owner));
+ if (unlikely(!biog)) {
+ rcu-read-unlock();
+ return;
+ }
+ /*
+ * css-get(&biog->css) isn't called to increment the reference
+ * count of this blkio-cgroup "biog" so the css-id might turn
+ * invalid even if this page is still active.
+ * This approach is chosen to minimize the overhead.
+ */
+ id = css-id(&biog->css);
+ rcu-read-unlock();
+ lock-page-cgroup(pc);
+ page-cgroup-set-id(pc, id);
+ unlock-page-cgroup(pc);
+}
+
+/**
+ * blkio-cgroup-reset-owner() - reset the owner ID of a page
+ * @page: the page we want to tag
+ * @mm: the mm-struct of a page owner
+ *
+ * Change the owner of a given page if necessary.
+ */
+void blkio-cgroup-reset-owner(struct page *page, struct mm-struct *mm)
+{
+ blkio-cgroup-set-owner(page, mm);
+}
+
+/**
+ * blkio-cgroup-reset-owner-pagedirty() - reset the owner ID of a pagecache page
+ * @page: the page we want to tag
+ * @mm: the mm-struct of a page owner
+ *
+ * Change the owner of a given page if the page is in the pagecache.
+ */
+void blkio-cgroup-reset-owner-pagedirty(struct page *page, struct mm-struct *mm)
+{
+ if (!page-is-file-cache(page))
+ return;
+ if (current->flags & PF-MEMALLOC)
+ return;
+
+ blkio-cgroup-reset-owner(page, mm);
+}
+
+/**
+ * blkio-cgroup-copy-owner() - copy the owner ID of a page into another page
+ * @npage: the page where we want to copy the owner
+ * @opage: the page from which we want to copy the ID
+ *
+ * Copy the owner ID of @opage into @npage.
+ */
+void blkio-cgroup-copy-owner(struct page *npage, struct page *opage)
+{
+ struct page-cgroup *npc, *opc;
+ unsigned long id;
+
+ if (blkio-cgroup-disabled())
+ return;
+ npc = lookup-page-cgroup(npage);
+ if (unlikely(!npc))
+ return;
+ opc = lookup-page-cgroup(opage);
+ if (unlikely(!opc))
+ return;
+
+ lock-page-cgroup(opc);
+ lock-page-cgroup(npc);
+ id = page-cgroup-get-id(opc);
+ page-cgroup-set-id(npc, id);
+ unlock-page-cgroup(npc);
+ unlock-page-cgroup(opc);
+}
+
+/* Create a new blkio-cgroup. */
+static struct cgroup-subsys-state *
+blkio-cgroup-create(struct cgroup-subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio-cgroup *biog;
+ struct io-context *ioc;
+
+ if (!cgrp->parent) {
+ biog = &default-blkio-cgroup;
+ init-io-context(biog->io-context);
+ /* Increment the reference count so that it is never released. */
+ atomic-long-inc(&biog->io-context->refcount);
+ return &biog->css;
+ }
+
+ biog = kzalloc(sizeof(*biog), GFP-KERNEL);
+ if (!biog)
+ return ERR-PTR(-ENOMEM);
+ ioc = alloc-io-context(GFP-KERNEL, -1);
+ if (!ioc) {
+ kfree(biog);
+ return ERR-PTR(-ENOMEM);
+ }
+ biog->io-context = ioc;
+ return &biog->css;
+}
+
+/* Delete the blkio-cgroup. */
+static void blkio-cgroup-destroy(struct cgroup-subsys *ss, struct cgroup *cgrp)
+{
+ struct blkio-cgroup *biog = cgroup-blkio(cgrp);
+
+ put-io-context(biog->io-context);
+ free-css-id(&blkio-cgroup-subsys, &biog->css);
+ kfree(biog);
+}
+
+/**
+ * get-blkio-cgroup-id() - determine the blkio-cgroup ID
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given bio. A return value zero
+ * means that the page associated with the bio belongs to default-blkio-cgroup.
+ */
+unsigned long get-blkio-cgroup-id(struct bio *bio)
+{
+ struct page-cgroup *pc;
+ struct page *page = bio-iovec-idx(bio, 0)->bv-page;
+ unsigned long id = 0;
+
+ pc = lookup-page-cgroup(page);
+ if (pc) {
+ lock-page-cgroup(pc);
+ id = page-cgroup-get-id(pc);
+ unlock-page-cgroup(pc);
+ }
+ return id;
+}
+
+/**
+ * get-blkio-cgroup-id-page() - determine the blkio-cgroup ID
+ * @page: the &struct page which describes the I/O
+ *
+ * Returns the blkio-cgroup ID of a given page. A return value zero
+ * means that the page associated with the IO belongs to default-blkio-cgroup.
+ */
+unsigned long get-blkio-cgroup-id-page(struct page *page)
+{
+ struct page-cgroup *pc;
+ unsigned long id = 0;
+
+ pc = lookup-page-cgroup(page);
+ if (pc) {
+ lock-page-cgroup(pc);
+ id = page-cgroup-get-id(pc);
+ unlock-page-cgroup(pc);
+ }
+ return id;
+}
+
+/**
+ * get-blkio-cgroup-iocontext() - determine the blkio-cgroup iocontext
+ * @bio: the &struct bio which describes the I/O
+ *
+ * Returns the iocontext of blkio-cgroup that issued a given bio.
+ */
+struct io-context *get-blkio-cgroup-iocontext(struct bio *bio)
+{
+ struct cgroup-subsys-state *css;
+ struct blkio-cgroup *biog;
+ struct io-context *ioc;
+ unsigned long id;
+
+ id = get-blkio-cgroup-id(bio);
+ rcu-read-lock();
+ css = css-lookup(&blkio-cgroup-subsys, id);
+ if (css)
+ biog = container-of(css, struct blkio-cgroup, css);
+ else
+ biog = &default-blkio-cgroup;
+ ioc = biog->io-context; /* default io-context for this cgroup */
+ atomic-long-inc(&ioc->refcount);
+ rcu-read-unlock();
+ return ioc;
+}
+
+/**
+ * blkio-cgroup-lookup() - lookup a cgroup by blkio-cgroup ID
+ * @id: blkio-cgroup ID
+ *
+ * Returns the cgroup associated with the specified ID, or NULL if lookup
+ * fails.
+ *
+ * Note:
+ * This function should be called under rcu-read-lock().
+ */
+struct cgroup *blkio-cgroup-lookup(int id)
+{
+ struct cgroup *cgrp;
+ struct cgroup-subsys-state *css;
+
+ if (blkio-cgroup-disabled())
+ return NULL;
+
+ css = css-lookup(&blkio-cgroup-subsys, id);
+ if (!css)
+ return NULL;
+ cgrp = css->cgroup;
+ return cgrp;
+}
+EXPORT-SYMBOL(get-blkio-cgroup-iocontext);
+EXPORT-SYMBOL(get-blkio-cgroup-id);
+EXPORT-SYMBOL(blkio-cgroup-lookup);
+
+static u64 blkio-id-read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct blkio-cgroup *biog = cgroup-blkio(cgrp);
+ unsigned long id;
+
+ rcu-read-lock();
+ id = css-id(&biog->css);
+ rcu-read-unlock();
+ return (u64)id;
+}
+
+
+static struct cftype blkio-files[] = {
+ {
+ .name = "id",
+ .read-u64 = blkio-id-read,
+ },
+};
+
+static int blkio-cgroup-populate(struct cgroup-subsys *ss, struct cgroup *cgrp)
+{
+ return cgroup-add-files(cgrp, ss, blkio-files,
+ ARRAY-SIZE(blkio-files));
+}
+
+struct cgroup-subsys blkio-cgroup-subsys = {
+ .name = "blkio",
+ .create = blkio-cgroup-create,
+ .destroy = blkio-cgroup-destroy,
+ .populate = blkio-cgroup-populate,
+ .subsys-id = blkio-cgroup-subsys-id,
+ .use-id = 1,
+};
diff #include <asm/tlbflush.h>
+#include <linux/biotrack.h>
#include <trace/events/block.h>
@@ -210,6 +211,7 @@ static void char *vto, *vfrom;
diff #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mm-inline.h> /* for page-is-file-cache() */
#include "internal.h"
@@ -464,6 +465,7 @@ int add-to-page-cache-locked(struct page *page, struct address-space *mapping,
gfp-mask & GFP-RECLAIM-MASK);
if (error)
goto out;
+ blkio-cgroup-set-owner(page, current->mm);
error = radix-tree-preload(gfp-mask & ~ struct mem-cgroup-per-node *nodeinfo[MAX-NUMNODES];
};
+void * The memory controller data structure. The memory controller controls both
* page cache and RSS per cgroup. We would eventually like to provide
diff #include <linux/memcontrol.h>
+#include <linux/biotrack.h>
#include <linux/mmu-notifier.h>
#include <linux/kallsyms.h>
#include <linux/swapops.h>
@@ -2117,6 +2118,7 @@ gotten:
*/
ptep-clear-flush-notify(vma, address, page-table);
page-add-new-anon-rmap(new-page, vma, address);
+ blkio-cgroup-set-owner(new-page, mm);
set-pte-at(mm, address, page-table, entry);
update-mmu-cache(vma, address, entry);
if (old-page) {
@@ -2582,6 +2584,7 @@ static int do-swap-page(struct mm-struct *mm, struct vm-area-struct *vma,
flush-icache-page(vma, page);
set-pte-at(mm, address, page-table, pte);
page-add-anon-rmap(page, vma, address);
+ blkio-cgroup-reset-owner(page, mm);
/* It's better to call commit-charge after rmap is established */
mem-cgroup-commit-charge-swapin(page, ptr);
@@ -2646,6 +2649,7 @@ static int do-anonymous-page(struct mm-struct *mm, struct vm-area-struct *vma,
goto release;
inc-mm-counter(mm, anon-rss);
page-add-new-anon-rmap(page, vma, address);
+ blkio-cgroup-set-owner(page, mm);
set-pte-at(mm, address, page-table, entry);
/* No need to invalidate - it was non-present before */
@@ -2793,6 +2797,7 @@ static int page-add-file-rmap(page);
diff #include <linux/task-io-accounting-ops.h>
+#include <linux/biotrack.h>
#include <linux/blkdev.h>
#include <linux/mpage.h>
#include <linux/rmap.h>
@@ -1244,6 +1245,7 @@ int }
diff #include <linux/swapops.h>
+#include <linux/biotrack.h>
static void + int nid, fail;
- if (mem-cgroup-disabled())
+ if (mem-cgroup-disabled() && blkio-cgroup-disabled())
return;
for-each-online-node(nid) {
@@ -83,12 +84,12 @@ void return;
fail:
printk(KERN-CRIT "allocation of page-cgroup failed.\n");
- printk(KERN-CRIT "please try 'cgroup-disable=memory' boot option\n");
+ printk(KERN-CRIT "please try cgroup-disable=memory,blkio boot options\n");
panic("Out of memory");
}
@@ -245,7 +246,7 @@ void
for (pfn = 0; !fail && pfn < max-pfn; pfn += PAGES-PER-SECTION) {
@@ -260,8 +261,8 @@ void + " if you don't want memory and io cgroups\n");
}
void #include <linux/migrate.h>
#include <linux/page-cgroup.h>
+#include <linux/biotrack.h>
#include <asm/pgtable.h>
@@ -307,6 +308,7 @@ struct page *read-swap-cache-async(swp-entry-t entry, gfp-t gfp-mask,
*/
PATCH 15/25 - io-controller: Prepare elevator layer for single queue schedulers by Vivek Goyal on
2009-07-02T20:07:55+00:00
The elevator layer now supports hierarchical fair queuing. cfq has been
migrated to use it, and it is now time to do the groundwork for noop,
deadline and AS.
noop, deadline and AS do not maintain separate queues for different processes;
each has a single queue. In a hierarchical setup this effectively means one
queue per cgroup, in which requests from all the processes in the cgroup
are queued.
Generally the io scheduler takes care of creating queues. Because there is
only one queue here, the common layer has been modified to take care of queue
creation and some other functionality. This special casing keeps the changes
to noop, deadline and AS to a minimum.
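The split between "scheduler creates its queues" and "common layer creates the single queue" can be sketched in userspace C as below. The two feature bit values are the ones defined later in this patch; the struct and helper names are hypothetical, identifiers are spelled with underscores so the sketch compiles, and the assumption that noop, deadline and AS set both bits comes from the description above rather than from code shown in this patch.

```c
/* Feature bits, mirroring the values this patch adds to elevator.h. */
#define ELV_IOSCHED_NEED_FQ    1
#define ELV_IOSCHED_SINGLE_IOQ 2

/* Hypothetical stand-in for an elevator type; only the feature mask matters. */
struct elv_type_sketch {
	const char *name;
	int features;
};

static int fair_queuing_enabled(const struct elv_type_sketch *t)
{
	return (t->features & ELV_IOSCHED_NEED_FQ) != 0;
}

static int single_ioq(const struct elv_type_sketch *t)
{
	return (t->features & ELV_IOSCHED_SINGLE_IOQ) != 0;
}

/*
 * Returns 1 when the common layer is expected to create the per-group
 * queue at request time, 0 when the io scheduler (or the elevator init
 * path) does it.
 */
static int common_layer_creates_queue(const struct elv_type_sketch *t)
{
	return fair_queuing_enabled(t) && single_ioq(t);
}
```

With this split, a scheduler such as cfq that sets only the fair queuing bit keeps managing its own per-process queues, while a single queue scheduler never has to know about cgroups.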
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff static void *as-alloc-as-queue(struct request-queue *q,
- struct elevator-queue *eq, gfp-t gfp-mask)
+ struct elevator-queue *eq, gfp-t gfp-mask, struct io-queue *ioq)
{
struct as-queue *asq;
struct as-data *ad = eq->elevator-data;
diff static void *deadline-alloc-deadline-queue(struct request-queue *q,
- struct elevator-queue *eq, gfp-t gfp-mask)
+ struct elevator-queue *eq, gfp-t gfp-mask, struct io-queue *ioq)
{
struct deadline-queue *dq;
diff elv-release-ioq(e, &iog->async-idle-queue);
+
+#ifdef CONFIG-GROUP-IOSCHED
+ /* Optimization for io schedulers having single ioq */
+ if (elv-iosched-single-ioq(e))
+ elv-release-ioq(e, &iog->ioq);
+#endif
}
/* Mainly hierarchical grouping code */
@@ -1867,6 +1873,162 @@ int io-group-allow-merge(struct request *rq, struct bio *bio)
return (iog == + * pointer in group data structure and keeps track of it.
+ *
+ * For the io schedulers like cfq, which maintain multiple io queues per
+ * cgroup, and decide the io queue of request based on process, this
+ * function is not invoked.
+ */
+int elv-fq-set-request-ioq(struct request-queue *q, struct request *rq,
+ gfp-t gfp-mask)
+{
+ struct elevator-queue *e = q->elevator;
+ unsigned long flags;
+ struct io-queue *ioq = NULL, *new-ioq = NULL;
+ struct io-group *iog;
+ void *sched-q = NULL, *new-sched-q = NULL;
+
+ if (!elv-iosched-fair-queuing-enabled(e))
+ return 0;
+
+ might-sleep-if(gfp-mask & +
+ /* Get the iosched queue */
+ ioq = iog->ioq;
+ if (!ioq) {
+ /* io queue and sched-queue need to be allocated */
+ BUG-ON(!e->ops->elevator-alloc-sched-queue-fn);
+
+ if (new-ioq) {
+ goto alloc-sched-q;
+ } else if (gfp-mask & + spin-unlock-irq(q->queue-lock);
+ new-ioq = elv-alloc-ioq(q, gfp-mask | + goto queue-fail;
+ }
+
+alloc-sched-q:
+ if (new-sched-q) {
+ ioq = new-ioq;
+ new-ioq = NULL;
+ sched-q = new-sched-q;
+ new-sched-q = NULL;
+ } else if (gfp-mask & + spin-unlock-irq(q->queue-lock);
+ /* Call io scheduler to create scheduler queue */
+ new-sched-q = e->ops->elevator-alloc-sched-queue-fn(q,
+ e, gfp-mask | + if (!sched-q) {
+ elv-free-ioq(ioq);
+ goto queue-fail;
+ }
+ }
+
+ elv-init-ioq(e, ioq, iog, sched-q, IOPRIO-CLASS-BE,
+ IOPRIO-NORM, 1);
+ io-group-set-ioq(iog, ioq);
+ elv-mark-ioq-sync(ioq);
+ elv-get-iog(iog);
+ }
+
+ if (new-sched-q)
+ e->ops->elevator-free-sched-queue-fn(q->elevator, new-sched-q);
+
+ if (new-ioq)
+ elv-free-ioq(new-ioq);
+
+ /* Request reference */
+ elv-get-ioq(ioq);
+ rq->ioq = ioq;
+ spin-unlock-irqrestore(q->queue-lock, flags);
+ return 0;
+
+queue-fail:
+ WARN-ON((gfp-mask & + * Find out the io queue of current task. Optimization for single ioq
+ * per io group io schedulers.
+ */
+struct io-queue *elv-lookup-ioq-current(struct request-queue *q)
+{
+ struct io-group *iog;
+
+ /* Determine the io group and io queue of the bio submitting task */
+ iog = io-get-io-group(q, 0);
+ if (!iog) {
+ /* Maybe the task belongs to a cgroup for which the io group has
+ * not been set up yet. */
+ return NULL;
+ }
+ return iog->ioq;
+}
+
+/*
+ * This request has been serviced. Clean up ioq info and drop the reference.
+ * Again this is called only for single queue per cgroup schedulers (noop,
+ * deadline, AS).
+ */
+void elv-fq-unset-request-ioq(struct request-queue *q, struct request *rq)
+{
+ struct io-queue *ioq = rq->ioq;
+
+ if (!elv-iosched-fair-queuing-enabled(q->elevator))
+ return;
+
+ if (ioq) {
+ rq->ioq = NULL;
+ elv-put-ioq(ioq);
+ }
+}
+
+static inline int is-only-root-group(void)
+{
+ if (list-empty(&io-root-cgroup.css.cgroup->children))
+ return 1;
+
+ return 0;
+}
+
#else /* GROUP-IOSCHED */
static void bfq-init-entity(struct io-entity *entity, struct io-group *iog)
{
@@ -1916,6 +2078,11 @@ struct io-group *io-get-io-group(struct request-queue *q, int create)
return q->elevator->efqd.root-group;
}
EXPORT-SYMBOL(io-get-io-group);
+
+static inline int is-only-root-group(void)
+{
+ return 1;
+}
#endif /* GROUP-IOSCHED */
/* Elevator fair queuing function */
@@ -2206,7 +2373,12 @@ int elv-init-ioq(struct elevator-queue *eq, struct io-queue *ioq,
ioq->efqd = efqd;
elv-ioq-set-ioprio-class(ioq, ioprio-class);
elv-ioq-set-ioprio(ioq, ioprio);
- ioq->pid = current->pid;
+
+ if (elv-iosched-single-ioq(eq))
+ ioq->pid = 0;
+ else
+ ioq->pid = current->pid;
+
ioq->sched-queue = sched-queue;
if (is-sync && !elv-ioq-class-idle(ioq))
elv-mark-ioq-idle-window(ioq);
@@ -2589,6 +2761,14 @@ static int elv-should-preempt(struct request-queue *q, struct io-queue *new-ioq,
struct io-entity *entity, *new-entity;
struct io-group *iog = NULL, *new-iog = NULL;
+ /*
+ * Currently only CFQ has preemption logic. Other schedulers don't
+ * have any notion of preemption across classes or preemption within a
+ * class etc.
+ */
+ if (elv-iosched-single-ioq(eq))
+ return 0;
+
ioq = elv-active-ioq(eq);
if (!ioq)
@@ -2873,6 +3053,17 @@ void *elv-fq-select-ioq(struct request-queue *q, int force)
goto expire;
}
+ /*
+ * If there is only root group present, don't expire the queue for
+ * single queue ioschedulers (noop, deadline, AS). It is unnecessary
+ * overhead.
+ */
+
+ if (is-only-root-group() && elv-iosched-single-ioq(q->elevator)) {
+ elv-log-ioq(efqd, ioq, "select: only root group, no expiry");
+ goto keep-queue;
+ }
+
/* We are waiting for this queue to become busy before it expires.*/
if (efqd->fairness && elv-ioq-wait-busy(ioq)) {
ioq = NULL;
@@ -3112,6 +3303,19 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
}
/*
+ * If there is only root group present, don't expire the queue
+ * for single queue ioschedulers (noop, deadline, AS). It is
+ * unnecessary overhead.
+ */
+
+ if (is-only-root-group() &&
+ elv-iosched-single-ioq(q->elevator)) {
+ elv-log-ioq(efqd, ioq, "select: only root group,"
+ " no expiry");
+ goto done;
+ }
+
+ /*
* If there are no requests waiting in this queue, and
* there are other queues ready to issue requests, AND
* those other queues are issuing requests within our
diff dev-t dev;
+
+ /* Single ioq per group, used for noop, deadline, anticipatory */
+ struct io-queue *ioq;
};
/**
@@ -514,6 +517,21 @@ static inline int update-requeue(struct io-queue *ioq, int requeue)
return requeue;
}
+extern int elv-fq-set-request-ioq(struct request-queue *q, struct request *rq,
+ gfp-t gfp-mask);
+extern void elv-fq-unset-request-ioq(struct request-queue *q,
+ struct request *rq);
+extern struct io-queue *elv-lookup-ioq-current(struct request-queue *q);
+
+/* Sets the single ioq associated with the io group. (noop, deadline, AS) */
+static inline void io-group-set-ioq(struct io-group *iog, struct io-queue *ioq)
+{
+ BUG-ON(!iog);
+ /* io group reference. Will be dropped when group is destroyed. */
+ elv-get-ioq(ioq);
+ iog->ioq = ioq;
+}
+
#else /* !GROUP-IOSCHED */
static inline int io-group-allow-merge(struct request *rq, struct bio *bio)
{
@@ -533,6 +551,26 @@ static inline int update-requeue(struct io-queue *ioq, int requeue)
return requeue;
}
+static inline void io-group-set-ioq(struct io-group *iog, struct io-queue *ioq)
+{
+}
+
+static inline int elv-fq-set-request-ioq(struct request-queue *q,
+ struct request *rq, gfp-t gfp-mask)
+{
+ return 0;
+}
+
+static inline void elv-fq-unset-request-ioq(struct request-queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io-queue *elv-lookup-ioq-current(struct request-queue *q)
+{
+ return NULL;
+}
+
#endif /* GROUP-IOSCHED */
extern ssize-t elv-slice-idle-show(struct elevator-queue *q, char *name);
@@ -642,5 +680,21 @@ static inline int io-group-allow-merge(struct request *rq, struct bio *bio)
{
return 1;
}
+static inline int elv-fq-set-request-ioq(struct request-queue *q,
+ struct request *rq, gfp-t gfp-mask)
+{
+ return 0;
+}
+
+static inline void elv-fq-unset-request-ioq(struct request-queue *q,
+ struct request *rq)
+{
+}
+
+static inline struct io-queue *elv-lookup-ioq-current(struct request-queue *q)
+{
+ return NULL;
+}
+
#endif /* CONFIG-ELV-FAIR-QUEUING */
#endif /* -BFQ-SCHED-H */
diff
+ /*
+ * If fair queuing is enabled, then queue allocation takes place
+ * during set-request() functions when request actually comes
+ * in.
+ */
+ if (elv-iosched-fair-queuing-enabled(eq))
+ return NULL;
+
if (eq->ops->elevator-alloc-sched-queue-fn) {
sched-queue = eq->ops->elevator-alloc-sched-queue-fn(q, eq,
- GFP-KERNEL);
+ GFP-KERNEL, NULL);
if (!sched-queue)
return ERR-PTR(-ENOMEM);
@@ -829,6 +837,13 @@ int elv-set-request(struct request-queue *q, struct request *rq, gfp-t gfp-mask)
{
struct elevator-queue *e = q->elevator;
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv-iosched-single-ioq(e))
+ return elv-fq-set-request-ioq(q, rq, gfp-mask);
+
if (e->ops->elevator-set-req-fn)
return e->ops->elevator-set-req-fn(q, rq, gfp-mask);
@@ -840,6 +855,15 @@ void elv-put-request(struct request-queue *q, struct request *rq)
{
struct elevator-queue *e = q->elevator;
+ /*
+ * Optimization for noop, deadline and AS which maintain only single
+ * ioq per io group
+ */
+ if (elv-iosched-single-ioq(e)) {
+ elv-fq-unset-request-ioq(q, rq);
+ return;
+ }
+
if (e->ops->elevator-put-req-fn)
e->ops->elevator-put-req-fn(rq);
}
@@ -1224,9 +1248,18 @@ EXPORT-SYMBOL(elv-select-sched-queue);
/*
* Get the io scheduler queue pointer for current task.
+ *
+ * If fair queuing is enabled, determine the io group of task and retrieve
+ * the ioq pointer from that. This is used by only single queue ioschedulers
+ * for retrieving the queue associated with the group to decide whether the
+ * new bio can do a front merge or not.
*/
void *elv-get-sched-queue-current(struct request-queue *q)
{
- return q->elevator->sched-queue;
+ /* Fair queuing is not enabled. There is only one queue. */
+ if (!elv-iosched-fair-queuing-enabled(q->elevator))
+ return q->elevator->sched-queue;
+
+ return ioq-sched-queue(elv-lookup-ioq-current(q));
}
EXPORT-SYMBOL(elv-get-sched-queue-current);
diff static void *noop-alloc-noop-queue(struct request-queue *q,
- struct elevator-queue *eq, gfp-t gfp-mask)
+ struct elevator-queue *eq, gfp-t gfp-mask, struct io-queue *ioq)
{
struct noop-queue *nq;
diff typedef void (elevator-exit-fn) (struct elevator-queue *);
-typedef void* (elevator-alloc-sched-queue-fn) (struct request-queue *q, struct elevator-queue *eq, gfp-t);
+typedef void* (elevator-alloc-sched-queue-fn) (struct request-queue *q, struct elevator-queue *eq, gfp-t, struct io-queue *ioq);
typedef void (elevator-free-sched-queue-fn) (struct elevator-queue*, void *);
#ifdef CONFIG-ELV-FAIR-QUEUING
typedef void (elevator-active-ioq-set-fn) (struct request-queue*, void *, int);
@@ -247,17 +247,31 @@ enum {
/* iosched wants to use fair queuing logic of elevator layer */
#define ELV-IOSCHED-NEED-FQ 1
+/* iosched maintains only single ioq per group.*/
+#define ELV-IOSCHED-SINGLE-IOQ 2
+
static inline int elv-iosched-fair-queuing-enabled(struct elevator-queue *e)
{
return (e->elevator-type->elevator-features) & ELV-IOSCHED-NEED-FQ;
}
+static inline int elv-iosched-single-ioq(struct elevator-queue *e)
+{
+ return (e->elevator-type->elevator-features) & ELV-IOSCHED-SINGLE-IOQ;
+}
+
#else /* ELV-IOSCHED-FAIR-QUEUING */
static inline int elv-iosched-fair-queuing-enabled(struct elevator-queue *e)
{
return 0;
}
+
+static inline int elv-iosched-single-ioq(struct elevator-queue *e)
+{
+ return 0;
+}
+
#endif /* ELV-IOSCHED-FAIR-QUEUING */
extern void *elv-get-sched-queue(struct request-queue *q, struct request *rq);
extern void *elv-select-sched-queue(struct request-queue *q, int force);
PATCH 06/25 - io-controller: Modify cfq to make use of flat elevator fair queuing by Vivek Goyal on
2009-07-02T20:08:20+00:00
This patch changes cfq to use fair queuing code from elevator layer.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff config ELV-FAIR-QUEUING
- bool "Elevator Fair Queuing Support"
+ bool
default n
PATCH 02/25 - io-controller: Core of the B-WF2Q+ scheduler by Vivek Goyal on
2009-07-02T20:08:39+00:00
This is the core of the BFQ (B-WF2Q+) scheduler originally implemented by Paolo
and Fabio in the BFQ patches. Since then I have taken the relevant pieces from
BFQ and continued the work on the IO controller. It is not the full patch; I
have just pulled out some bits to show what the core scheduler looks like and
to make it easier to review.
Originally the BFQ code was hierarchical. This patch only shows the
non-hierarchical bits. The hierarchical code comes in later patches.
This code is the base for introducing fair queuing logic into the common
elevator layer so that it can be used by all four IO schedulers. In
later patches, CFQ's weighted round robin scheduler will be replaced with
the B-WF2Q+ scheduler.
Also note that BFQ originally provided fairness in terms of the number of
sectors of IO done by the queue. It has been modified to provide fairness
in terms of disk time (like CFQ, it allocates disk time slices in proportion
to prio/weight).
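As a rough illustration of how service (disk time here) is turned into virtual timestamps, here is a userspace sketch of the timestamp helpers from the patch below. It uses plain 64-bit division in place of the kernel's do_div(), spells identifiers with underscores so that it compiles, and calc_finish() is a hypothetical wrapper that takes the start time and weight directly instead of an entity.

```c
#include <assert.h>
#include <stdint.h>

#define WFQ_SERVICE_SHIFT 22

/* Map an amount of service into the virtual time domain, scaled by weight. */
static uint64_t bfq_delta(unsigned long service, unsigned int weight)
{
	assert(weight != 0);	/* the patch uses BUG_ON() for this */
	return ((uint64_t)service << WFQ_SERVICE_SHIFT) / weight;
}

/* Wraparound-safe "a > b": compare via the sign of the difference. */
static int bfq_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/* Finish time = start time + service mapped into virtual time. */
static uint64_t calc_finish(uint64_t start, unsigned long service,
			    unsigned int weight)
{
	return start + bfq_delta(service, weight);
}
```

For the same amount of service, doubling the weight halves the distance between start and finish, so a heavier entity becomes schedulable again sooner. The signed-difference comparison keeps the ordering correct when the 64-bit virtual time wraps, as long as the two timestamps are less than 2^63 apart.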
B-WF2Q+ is based on WF2Q+, that is described in [2], together with
H-WF2Q+, while the augmented tree used to implement B-WF2Q+ with O(log N)
complexity derives from the one introduced with EEVDF in [3].
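The selection rule that the augmented tree speeds up can be sketched with a plain linear scan: among the entities whose start time is not after the current virtual time, pick the one with the smallest finish time. This is only an O(N) illustration under that assumption, not the tree walk used in the patch; the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced entity: only the two virtual timestamps. */
struct entity_sketch {
	uint64_t start;
	uint64_t finish;
};

/* Wraparound-safe "a > b" on virtual timestamps. */
static int ts_gt(uint64_t a, uint64_t b)
{
	return (int64_t)(a - b) > 0;
}

/*
 * Return the index of the eligible entity (start <= vtime) with the
 * smallest finish time, or -1 if no entity is eligible.
 */
static int pick_next(const struct entity_sketch *e, size_t n, uint64_t vtime)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		if (ts_gt(e[i].start, vtime))
			continue;	/* not eligible yet */
		if (best < 0 || ts_gt(e[best].finish, e[i].finish))
			best = (int)i;
	}
	return best;
}
```

The patch reaches the same answer in O(log N) by keeping, in every node of a tree ordered by finish time, the minimum start time of its subtree, so subtrees with no eligible entity can be skipped. When nothing is eligible, the patch first advances the virtual time to the smallest start time instead of returning nothing.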
[1] P. Valente and F. Checconi, ``High Throughput Disk Scheduling
with Deterministic Guarantees on Bandwidth Distribution,'' to be
published.
http://algo.ing.unimo.it/people/paolo/disk-sched/bfq.pdf
[2] Jon C.R. Bennett and H. Zhang, ``Hierarchical Packet Fair Queueing
Algorithms,'' IEEE/ACM Transactions on Networking, 5(5):675-689,
Oct 1997.
http://www.cs.cmu.edu/~hzhang/papers/TON-97-Oct.ps.gz
[3] I. Stoica and H. Abdel-Wahab, ``Earliest Eligible Virtual Deadline
First: A Flexible and Accurate Mechanism for Proportional Share
Resource Allocation,'' technical report.
http://www.cs.berkeley.edu/~istoica/papers/eevdf-tr-95.pdf
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff + * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+#include "elevator-fq.h"
+
+#define IO-SERVICE-TREE-INIT ((struct io-service-tree)
+ { RB-ROOT, RB-ROOT, NULL, NULL, 0, 0 })
+
+/* Mainly the BFQ scheduling code follows */
+
+/*
+ * Shift for timestamp calculations. This actually limits the maximum
+ * service allowed in one timestamp delta (small shift values increase it),
+ * the maximum total weight that can be used for the queues in the system
+ * (big shift values increase it), and the period of virtual time wraparounds.
+ */
+#define WFQ-SERVICE-SHIFT 22
+
+/**
+ * bfq-gt - compare two timestamps.
+ * @a: first ts.
+ * @b: second ts.
+ *
+ * Return @a > @b, dealing with wrapping correctly.
+ */
+static inline int bfq-gt(u64 a, u64 b)
+{
+ return (s64)(a - b) > 0;
+}
+
+/**
+ * bfq-delta - map service into the virtual time domain.
+ * @service: amount of service.
+ * @weight: scale factor.
+ */
+static inline u64 bfq-delta(unsigned long service, unsigned int weight)
+{
+ u64 d = (u64)service << WFQ-SERVICE-SHIFT;
+
+ do-div(d, weight);
+ return d;
+}
+
+/**
+ * bfq-calc-finish - assign the finish time to an entity.
+ * @entity: the entity to act upon.
+ * @service: the service to be charged to the entity.
+ */
+static inline void bfq-calc-finish(struct io-entity *entity,
+ unsigned long service)
+{
+ BUG-ON(entity->weight == 0);
+
+ entity->finish = entity->start + bfq-delta(service, entity->weight);
+}
+
+static inline struct io-queue *io-entity-to-ioq(struct io-entity *entity)
+{
+ struct io-queue *ioq = NULL;
+
+ BUG-ON(entity == NULL);
+ if (entity->my-sched-data == NULL)
+ ioq = container-of(entity, struct io-queue, entity);
+ return ioq;
+}
+
+/**
+ * io-entity-of - get an entity from a node.
+ * @node: the node field of the entity.
+ *
+ * Convert a node pointer to the relative entity. This is used only
+ * to simplify the logic of some functions and not as the generic
+ * conversion mechanism because, e.g., in the tree walking functions,
+ * the check for a %NULL value would be redundant.
+ */
+static inline struct io-entity *io-entity-of(struct rb-node *node)
+{
+ struct io-entity *entity = NULL;
+
+ if (node != NULL)
+ entity = rb-entry(node, struct io-entity, rb-node);
+
+ return entity;
+}
+
+/**
+ * bfq-remove - remove an entity from a tree.
+ * @root: the tree root.
+ * @entity: the entity to remove.
+ */
+static inline void bfq-remove(struct rb-root *root, struct io-entity *entity)
+{
+ BUG-ON(entity->tree != root);
+
+ entity->tree = NULL;
+ rb-erase(&entity->rb-node, root);
+}
+
+/**
+ * bfq-idle-remove - remove an entity from the idle tree.
+ * @st: the service tree of the owning @entity.
+ * @entity: the entity being removed.
+ */
+static void bfq-idle-remove(struct io-service-tree *st,
+ struct io-entity *entity)
+{
+ struct rb-node *next;
+
+ BUG-ON(entity->tree != &st->idle);
+
+ if (entity == st->first-idle) {
+ next = rb-next(&entity->rb-node);
+ st->first-idle = io-entity-of(next);
+ }
+
+ if (entity == st->last-idle) {
+ next = rb-prev(&entity->rb-node);
+ st->last-idle = io-entity-of(next);
+ }
+
+ bfq-remove(&st->idle, entity);
+}
+
+/**
+ * bfq-insert - generic tree insertion.
+ * @root: tree root.
+ * @entity: entity to insert.
+ *
+ * This is used for the idle and the active tree, since they are both
+ * ordered by finish time.
+ */
+static void bfq-insert(struct rb-root *root, struct io-entity *entity)
+{
+ struct io-entity *entry;
+ struct rb-node **node = &root->rb-node;
+ struct rb-node *parent = NULL;
+
+ BUG-ON(entity->tree != NULL);
+
+ while (*node != NULL) {
+ parent = *node;
+ entry = rb-entry(parent, struct io-entity, rb-node);
+
+ if (bfq-gt(entry->finish, entity->finish))
+ node = &parent->rb-left;
+ else
+ node = &parent->rb-right;
+ }
+
+ rb-link-node(&entity->rb-node, parent, node);
+ rb-insert-color(&entity->rb-node, root);
+
+ entity->tree = root;
+}
+
+/**
+ * bfq-update-min - update the min-start field of an entity.
+ * @entity: the entity to update.
+ * @node: one of its children.
+ *
+ * This function is called when @entity may store an invalid value for
+ * min-start due to updates to the active tree. The function assumes
+ * that the subtree rooted at @node (which may be its left or its right
+ * child) has a valid min-start value.
+ */
+static inline void bfq-update-min(struct io-entity *entity,
+ struct rb-node *node)
+{
+ struct io-entity *child;
+
+ if (node != NULL) {
+ child = rb-entry(node, struct io-entity, rb-node);
+ if (bfq-gt(entity->min-start, child->min-start))
+ entity->min-start = child->min-start;
+ }
+}
+
+/**
+ * bfq-update-active-node - recalculate min-start.
+ * @node: the node to update.
+ *
+ * @node may have changed position or one of its children may have moved,
+ * this function updates its min-start value. The left and right subtrees
+ * are assumed to hold a correct min-start value.
+ */
+static inline void bfq-update-active-node(struct rb-node *node)
+{
+ struct io-entity *entity = rb-entry(node, struct io-entity, rb-node);
+
+ entity->min-start = entity->start;
+ bfq-update-min(entity, node->rb-right);
+ bfq-update-min(entity, node->rb-left);
+}
+
+/**
+ * bfq-update-active-tree - update min-start for the whole active tree.
+ * @node: the starting node.
+ *
+ * @node must be the deepest modified node after an update. This function
+ * updates its min-start using the values held by its children, assuming
+ * that they did not change, and then updates all the nodes that may have
+ * changed in the path to the root. The only nodes that may have changed
+ * are the ones in the path or their siblings.
+ */
+static void bfq-update-active-tree(struct rb-node *node)
+{
+ struct rb-node *parent;
+
+up:
+ bfq-update-active-node(node);
+
+ parent = rb-parent(node);
+ if (parent == NULL)
+ return;
+
+ if (node == parent->rb-left && parent->rb-right != NULL)
+ bfq-update-active-node(parent->rb-right);
+ else if (parent->rb-left != NULL)
+ bfq-update-active-node(parent->rb-left);
+
+ node = parent;
+ goto up;
+}
+
+/**
+ * bfq-active-insert - insert an entity in the active tree of its group/device.
+ * @st: the service tree of the entity.
+ * @entity: the entity being inserted.
+ *
+ * The active tree is ordered by finish time, but an extra key is kept
+ * per each node, containing the minimum value for the start times of
+ * its children (and the node itself), so it's possible to search for
+ * the eligible node with the lowest finish time in logarithmic time.
+ */
+static void bfq-active-insert(struct io-service-tree *st,
+ struct io-entity *entity)
+{
+ struct rb-node *node = &entity->rb-node;
+
+ bfq-insert(&st->active, entity);
+
+ if (node->rb-left != NULL)
+ node = node->rb-left;
+ else if (node->rb-right != NULL)
+ node = node->rb-right;
+
+ bfq-update-active-tree(node);
+}
+
+static void bfq-get-entity(struct io-entity *entity)
+{
+ struct io-queue *ioq = io-entity-to-ioq(entity);
+
+ if (ioq)
+ elv-get-ioq(ioq);
+}
+
+static void bfq-init-entity(struct io-entity *entity, struct io-group *iog)
+{
+ entity->ioprio = entity->new-ioprio;
+ entity->ioprio-class = entity->new-ioprio-class;
+ entity->sched-data = &iog->sched-data;
+}
+
+/**
+ * bfq-find-deepest - find the deepest node that an extraction can modify.
+ * @node: the node being removed.
+ *
+ * Do the first step of an extraction in an rb tree, looking for the
+ * node that will replace @node, and returning the deepest node that
+ * the following modifications to the tree can touch. If @node is the
+ * last node in the tree return %NULL.
+ */
+static struct rb-node *bfq-find-deepest(struct rb-node *node)
+{
+ struct rb-node *deepest;
+
+ if (node->rb-right == NULL && node->rb-left == NULL)
+ deepest = rb-parent(node);
+ else if (node->rb-right == NULL)
+ deepest = node->rb-left;
+ else if (node->rb-left == NULL)
+ deepest = node->rb-right;
+ else {
+ deepest = rb-next(node);
+ if (deepest->rb-right != NULL)
+ deepest = deepest->rb-right;
+ else if (rb-parent(deepest) != node)
+ deepest = rb-parent(deepest);
+ }
+
+ return deepest;
+}
+
+/**
+ * bfq-active-remove - remove an entity from the active tree.
+ * @st: the service-tree containing the tree.
+ * @entity: the entity being removed.
+ */
+static void bfq-active-remove(struct io-service-tree *st,
+ struct io-entity *entity)
+{
+ struct rb-node *node;
+
+ node = bfq-find-deepest(&entity->rb-node);
+ bfq-remove(&st->active, entity);
+
+ if (node != NULL)
+ bfq-update-active-tree(node);
+}
+
+/**
+ * bfq-idle-insert - insert an entity into the idle tree.
+ * @st: the service tree containing the tree.
+ * @entity: the entity to insert.
+ */
+static void bfq-idle-insert(struct io-service-tree *st,
+ struct io-entity *entity)
+{
+ struct io-entity *first-idle = st->first-idle;
+ struct io-entity *last-idle = st->last-idle;
+
+ if (first-idle == NULL || bfq-gt(first-idle->finish, entity->finish))
+ st->first-idle = entity;
+ if (last-idle == NULL || bfq-gt(entity->finish, last-idle->finish))
+ st->last-idle = entity;
+
+ bfq-insert(&st->idle, entity);
+}
+
+/**
+ * bfq-forget-entity - remove an entity from the wfq trees.
+ * @st: the service tree.
+ * @entity: the entity being removed.
+ *
+ * Update the device status and forget everything about @entity, putting
+ * the device reference to it, if it is a queue. Entities belonging to
+ * groups are not refcounted.
+ */
+static void bfq-forget-entity(struct io-service-tree *st,
+ struct io-entity *entity)
+{
+ struct io-queue *ioq = NULL;
+
+ BUG-ON(!entity->on-st);
+ entity->on-st = 0;
+ st->wsum -= entity->weight;
+ ioq = io-entity-to-ioq(entity);
+ if (!ioq)
+ return;
+ elv-put-ioq(ioq);
+}
+
+/**
+ * bfq-put-idle-entity - release the idle tree ref of an entity.
+ * @st: service tree for the entity.
+ * @entity: the entity being released.
+ */
+static void bfq-put-idle-entity(struct io-service-tree *st,
+ struct io-entity *entity)
+{
+ bfq-idle-remove(st, entity);
+ bfq-forget-entity(st, entity);
+}
+
+/**
+ * bfq-forget-idle - update the idle tree if necessary.
+ * @st: the service tree to act upon.
+ *
+ * To preserve the global O(log N) complexity we only remove one entry here;
+ * as the idle tree will not grow indefinitely this can be done safely.
+ */
+static void bfq-forget-idle(struct io-service-tree *st)
+{
+ struct io-entity *first-idle = st->first-idle;
+ struct io-entity *last-idle = st->last-idle;
+
+ if (RB-EMPTY-ROOT(&st->active) && last-idle != NULL &&
+ !bfq-gt(last-idle->finish, st->vtime)) {
+ /*
+ * Active tree is empty. Pull back vtime to finish time of
+ * last idle entity on idle tree.
+ * The rationale seems to be that it reduces the possibility of
+ * vtime wraparound (bfq-gt(V-F) < 0).
+ */
+ st->vtime = last-idle->finish;
+ }
+
+ if (first-idle != NULL && !bfq-gt(first-idle->finish, st->vtime))
+ bfq-put-idle-entity(st, first-idle);
+}
+
+
+static struct io-service-tree *
++ old-st->wsum -= entity->weight;
+ entity->ioprio = entity->new-ioprio;
+ entity->ioprio-class = entity->new-ioprio-class;
+ entity->weight = entity->new-weight;
+ entity->ioprio-changed = 0;
+
+ /*
+ * Also update the scaled budget for ioq. Group will get the
+ * updated budget once ioq is selected to run next.
+ */
+ if (ioq) {
+ struct elv-fq-data *efqd = ioq->efqd;
+ /*
+ * elv-prio-to-slice() is defined in later patches
+ * where a slice length is calculated from the
+ * ioprio of the queue.
+ */
+ entity->budget = elv-prio-to-slice(efqd, ioq);
+ }
+
+ /*
+ * NOTE: here we may be changing the weight too early,
+ * this will cause unfairness. The correct approach
+ * would have required additional complexity to defer
+ * weight changes to the proper time instants (i.e.,
+ * when entity->finish <= old-st->vtime).
+ */
+ new-st = io-entity-service-tree(entity);
+ new-st->wsum += entity->weight;
+
+ if (new-st != old-st)
+ entity->start = new-st->vtime;
+ }
+
+ return new-st;
+}
+
+/**
+ * bfq-update-vtime - update vtime if necessary.
+ * @st: the service tree to act upon.
+ *
+ * If necessary update the service tree vtime to have at least one
+ * eligible entity, skipping to its start time. Assumes that the
+ * active tree of the device is not empty.
+ *
+ * NOTE: this hierarchical implementation updates vtimes quite often,
+ * we may end up with reactivated tasks getting timestamps after a
+ * vtime skip done because we needed a ->first-active entity on some
+ * intermediate node.
+ */
+static void bfq-update-vtime(struct io-service-tree *st)
+{
+ struct io-entity *entry;
+ struct rb-node *node = st->active.rb-node;
+
+ entry = rb-entry(node, struct io-entity, rb-node);
+ if (bfq-gt(entry->min-start, st->vtime)) {
+ st->vtime = entry->min-start;
+ bfq-forget-idle(st);
+ }
+}
+
+/**
+ * bfq-first-active-entity - find the eligible entity with the smallest finish time
+ * @st: the service tree to select from.
+ *
+ * This function searches the first schedulable entity, starting from the
+ * root of the tree and going on the left every time on this side there is
+ * a subtree with at least one eligible (start <= vtime) entity. The path
+ * on the right is followed only if a) the left subtree contains no eligible
+ * entities and b) no eligible entity has been found yet.
+ */
+static struct io-entity *bfq-first-active-entity(struct io-service-tree *st)
+{
+ struct io-entity *entry, *first = NULL;
+ struct rb-node *node = st->active.rb-node;
+
+ while (node != NULL) {
+ entry = rb-entry(node, struct io-entity, rb-node);
+left:
+ if (!bfq-gt(entry->start, st->vtime))
+ first = entry;
+
+ BUG-ON(bfq-gt(entry->min-start, st->vtime));
+
+ if (node->rb-left != NULL) {
+ entry = rb-entry(node->rb-left,
+ struct io-entity, rb-node);
+ if (!bfq-gt(entry->min-start, st->vtime)) {
+ node = node->rb-left;
+ goto left;
+ }
+ }
+ if (first != NULL)
+ break;
+ node = node->rb-right;
+ }
+
+ BUG-ON(first == NULL && !RB-EMPTY-ROOT(&st->active));
+ return first;
+}
+
+/**
+ * +{
+ struct io-entity *entity;
+
+ if (RB-EMPTY-ROOT(&st->active))
+ return NULL;
+
+ bfq-update-vtime(st);
+ entity = bfq-first-active-entity(st);
+ BUG-ON(bfq-gt(entity->start, st->vtime));
+
+ return entity;
+}
+
+/**
+ * bfq-lookup-next-entity - return the first eligible entity in @sd.
+ * @sd: the sched-data.
+ * @extract: if true the returned entity will be also extracted from @sd.
+ *
+ * NOTE: since we cache the next-active entity at each level of the
+ * hierarchy, the complexity of the lookup can be decreased with
+ * absolutely no effort just returning the cached next-active value;
+ * we prefer to do full lookups to test the consistency of the data
+ * structures.
+ */
+static struct io-entity *bfq-lookup-next-entity(struct io-sched-data *sd,
+ int extract)
+{
+ struct io-service-tree *st = sd->service-tree;
+ struct io-entity *entity;
+ int i;
+
+ /*
+ * We should not call lookup when an entity is active, as doing lookup
+ * can result in an erroneous vtime jump.
+ */
+ BUG-ON(sd->active-entity != NULL);
+
+ for (i = 0; i < IO-IOPRIO-CLASSES; i++, st++) {
+ entity = + }
+ }
+
+ return entity;
+}
+
+/**
+ * + * timestamps.
+ */
+static void + /*
+ * If we are requeueing the current entity we have
+ * to take care of not charging to it service it has
+ * not received.
+ */
+ bfq-calc-finish(entity, entity->service);
+ entity->start = entity->finish;
+ sd->active-entity = NULL;
+ } else if (entity->tree == &st->active) {
+ /*
+ * Requeueing an entity due to a change of some
+ * next-active entity below it. We reuse the old
+ * start time.
+ */
+ bfq-active-remove(st, entity);
+ } else if (entity->tree == &st->idle) {
+ /*
+ * Must be on the idle tree, bfq-idle-remove() will
+ * check for that.
+ */
+ bfq-idle-remove(st, entity);
+ entity->start = bfq-gt(st->vtime, entity->finish) ?
+ st->vtime : entity->finish;
+ } else {
+ /*
+ * The finish time of the entity may be invalid, and
+ * it is in the past for sure, otherwise the queue
+ * would have been on the idle tree.
+ */
+ entity->start = st->vtime;
+ st->wsum += entity->weight;
+ bfq-get-entity(entity);
+
+ BUG-ON(entity->on-st);
+ entity->on-st = 1;
+ }
+
+ st = + * @entity: the entity to activate.
+ */
+static void bfq-activate-entity(struct io-entity *entity)
+{
+ + *
+ * Deactivate an entity, independently from its previous state. If the
+ * entity was not on a service tree just return, otherwise if it is on
+ * any scheduler tree, extract it from that tree, and if necessary
+ * and if the caller did not specify @requeue, put it on the idle tree.
+ *
+ */
+static int + if (!entity->on-st)
+ return 0;
+
+ BUG-ON(was-active && entity->tree != NULL);
+
+ if (was-active) {
+ bfq-calc-finish(entity, entity->service);
+ sd->active-entity = NULL;
+ } else if (entity->tree == &st->active)
+ bfq-active-remove(st, entity);
+ else if (entity->tree == &st->idle)
+ bfq-idle-remove(st, entity);
+ else if (entity->tree != NULL)
+ BUG();
+
+ if (!requeue || !bfq-gt(entity->finish, st->vtime))
+ bfq-forget-entity(st, entity);
+ else
+ bfq-idle-insert(st, entity);
+
+ BUG-ON(sd->active-entity == entity);
+
+ return ret;
+}
+
+/**
+ * bfq-deactivate-entity - deactivate an entity.
+ * @entity: the entity to deactivate.
+ * @requeue: true if the entity can be put on the idle tree
+ */
+static void bfq-deactivate-entity(struct io-entity *entity, int requeue)
+{
+ + st = io-entity-service-tree(entity);
+ entity->service += served;
+ BUG-ON(st->wsum == 0);
+ st->vtime += bfq-delta(served, st->wsum);
+ bfq-forget-idle(st);
+}
+
+/**
+ * io-flush-idle-tree - deactivate any entity on the idle tree of @st.
+ * @st: the service tree being flushed.
+ */
+static void io-flush-idle-tree(struct io-service-tree *st)
+{
+ struct io-entity *entity = st->first-idle;
+
+ for (; entity != NULL; entity = st->first-idle)
+ @@ -0,0 +1,172 @@
+/*
+ * elevator fair queuing Layer. Uses B-WF2Q+ hierarchical scheduler for
+ * fair queuing. Data structures and common functions prototypes.
+ *
+ * Based on ideas and code from CFQ:
+ * Copyright (C) 2003 Jens Axboe <axboe@kernel.dk>
+ *
+ * Copyright (C) 2008 Fabio Checconi <fabio@gandalf.sssup.it>
+ * Paolo Valente <paolo.valente@unimore.it>
+ *
+ * Copyright (C) 2009 Vivek Goyal <vgoyal@redhat.com>
+ * Nauman Rafique <nauman@google.com>
+ */
+
+#include <linux/blkdev.h>
+
+#ifndef -BFQ-SCHED-H
+#define -BFQ-SCHED-H
+
+#define IO-IOPRIO-CLASSES 3
+
+struct io-entity;
+struct io-queue;
+
+/**
+ * struct io-service-tree - per ioprio-class service tree.
+ * @active: tree for active entities (i.e., those backlogged).
+ * @idle: tree for idle entities (i.e., those not backlogged, with V <= F-i).
+ * @first-idle: idle entity with minimum F-i.
+ * @last-idle: idle entity with maximum F-i.
+ * @vtime: scheduler virtual time.
+ * @wsum: scheduler weight sum; active and idle entities contribute to it.
+ *
+ * Each service tree represents a B-WF2Q+ scheduler on its own. Each
+ * ioprio-class has its own independent scheduler, and so its own
+ * io-service-tree. All the fields are protected by the queue lock
+ * of the containing efqd.
+ */
+struct io-service-tree {
+ struct rb-root active;
+ struct rb-root idle;
+
+ struct io-entity *first-idle;
+ struct io-entity *last-idle;
+
+ u64 vtime;
+ unsigned int wsum;
+};
+
+/**
+ * struct io-sched-data - multi-class scheduler.
+ * @active-entity: entity under service.
+ * @next-active: head-of-the-line entity in the scheduler.
+ * @service-tree: array of service trees, one per ioprio-class.
+ *
+ * io-sched-data is the basic scheduler queue. It supports three
+ * ioprio-classes, and can be used either as a toplevel queue or as
+ * an intermediate queue on a hierarchical setup.
+ * @next-active points to the active entity of the sched-data service
+ * trees that will be scheduled next.
+ *
+ * The supported ioprio-classes are the same as in CFQ, in descending
+ * priority order, IOPRIO-CLASS-RT, IOPRIO-CLASS-BE, IOPRIO-CLASS-IDLE.
+ * Requests from higher priority queues are served before all the
+ * requests from lower priority queues; among requests of the same
+ * queue requests are served according to B-WF2Q+.
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io-sched-data {
+ struct io-entity *active-entity;
+ struct io-service-tree service-tree[IO-IOPRIO-CLASSES];
+};
+
+/**
+ * struct io-entity - schedulable entity.
+ * @rb-node: service-tree member.
+ * @on-st: flag, true if the entity is on a tree (either the active or
+ * the idle one of its service-tree).
+ * @finish: B-WF2Q+ finish timestamp (aka F-i).
+ * @start: B-WF2Q+ start timestamp (aka S-i).
+ * @tree: tree the entity is enqueued into; %NULL if not on a tree.
+ * @min-start: minimum start time of the (active) subtree rooted at
+ * this entity; used for O(log N) lookups into active trees.
+ * @service: service received during the last round of service.
+ * @budget: budget used to calculate F-i; F-i = S-i + @budget / @weight.
+ * @weight: weight of the queue, calculated as IOPRIO-BE-NR - @ioprio.
+ * @new-weight: when a weight change is requested, the new weight value
+ * @parent: parent entity, for hierarchical scheduling.
+ * @my-sched-data: for non-leaf nodes in the cgroup hierarchy, the
+ * associated scheduler queue, %NULL on leaf nodes.
+ * @sched-data: the scheduler queue this entity belongs to.
+ * @ioprio: the ioprio in use.
+ * @new-ioprio: when an ioprio change is requested, the new ioprio value
+ * @ioprio-class: the ioprio-class in use.
+ * @new-ioprio-class: when an ioprio-class change is requested, the new
+ * ioprio-class value.
+ * @ioprio-changed: flag, true when the user requested an ioprio or
+ * ioprio-class change.
+ *
+ * An io-entity is used to represent either an io-queue (leaf node in the
+ * cgroup hierarchy) or an io-group in the upper level scheduler. Each
+ * entity belongs to the sched-data of the parent group in the cgroup
+ * hierarchy. Non-leaf entities also have their own sched-data, stored
+ * in @my-sched-data.
+ *
+ * Each entity stores independently its priority values; this would allow
+ * different weights on different devices, but this functionality is not
+ * exported to userspace by now. Priorities are updated lazily, first
+ * storing the new values into the new-* fields, then setting the
+ * @ioprio-changed flag. As soon as there is a transition in the entity
+ * state that allows the priority update to take place the effective and
+ * the requested priority values are synchronized.
+ *
+ * The weight value is calculated from the ioprio to export the same
+ * interface as CFQ.
+ *
+ * All the fields are protected by the queue lock of the containing efqd.
+ */
+struct io-entity {
+ struct rb-node rb-node;
+
+ int on-st;
+
+ u64 finish;
+ u64 start;
+
+ struct rb-root *tree;
+
+ u64 min-start;
+
+ unsigned long service, budget;
+ unsigned int weight, new-weight;
+
+ struct io-entity *parent;
+
+ struct io-sched-data *my-sched-data;
+ struct io-sched-data *sched-data;
+
+ unsigned short ioprio, new-ioprio;
+ unsigned short ioprio-class, new-ioprio-class;
+
+ int ioprio-changed;
+};
+
+/*
+ * A common structure representing the io queue where requests are actually
+ * queued.
+ */
+struct io-queue {
+ struct io-entity entity;
+ atomic-t ref;
+
+ /* Pointer to generic elevator fair queuing data structure */
+ struct elv-fq-data *efqd;
+};
+
+struct io-group {
+ struct io-sched-data sched-data;
+};
+
+static inline struct io-service-tree *
+io-entity-service-tree(struct io-entity *entity)
+{
+ struct io-sched-data *sched-data = entity->sched-data;
+ unsigned int idx = entity->ioprio-class - 1;
+
+ BUG-ON(idx >= IO-IOPRIO-CLASSES);
+ BUG-ON(sched-data == NULL);
+
+ return sched-data->service-tree + idx;
+}
+#endif /* -BFQ-SCHED-H */
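The timestamp rule documented for struct io-entity above (F-i = S-i + budget / weight) and the virtual time update seen earlier (vtime advanced by served / wsum) can be illustrated with a small standalone sketch. This is not code from the patch: the sketch_* names are hypothetical, identifiers use underscores, and the fixed-point shift is an assumption made only to keep the integer division from losing all precision.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point shift; the patch's actual scaling is not shown here. */
#define SKETCH_SHIFT 22

/* Hypothetical, cut-down stand-in for struct io-entity. */
struct sketch_entity {
	uint64_t start;      /* S_i */
	uint64_t finish;     /* F_i */
	unsigned int weight;
};

/* Hypothetical, cut-down stand-in for struct io-service-tree. */
struct sketch_tree {
	uint64_t vtime;      /* scheduler virtual time V */
	unsigned int wsum;   /* sum of the weights on this tree */
};

/* Scale an amount of service by a weight: service / weight in fixed point. */
static uint64_t sketch_delta(unsigned long service, unsigned int weight)
{
	assert(weight != 0);
	return ((uint64_t)service << SKETCH_SHIFT) / weight;
}

/* F_i = S_i + service / weight */
static void sketch_calc_finish(struct sketch_entity *e, unsigned long service)
{
	e->finish = e->start + sketch_delta(service, e->weight);
}

/* Serving `served` units advances virtual time by served / wsum. */
static void sketch_served(struct sketch_tree *st, unsigned long served)
{
	st->vtime += sketch_delta(served, st->wsum);
}
```

With equal start times and equal service, an entity with a larger weight ends up with an earlier finish timestamp, which is what lets it be picked sooner on its next activation.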
PATCH 13/25 - io-controller: Wait for requests to complete from last queue before new queue is scheduled by Vivek Goyal on
2009-07-02T20:08:39+00:00
o Currently one can dispatch requests from multiple queues to the disk. This
is true for hardware which supports queuing. So if a disk supports a queue
depth of 31, it is possible that 20 requests are dispatched from queue 1
and then the next queue is scheduled in, which dispatches more requests.
o This multiple queue dispatch introduces issues for accurate accounting of
disk time consumed by a particular queue. For example, if one async queue
is scheduled in, it can dispatch 31 requests to the disk and then it will
be expired and a new sync queue might get scheduled in. These 31 requests
might take a long time to finish but this time is never accounted to the
async queue which dispatched these requests.
o This patch introduces the functionality where we wait for all the requests
to finish from previous queue before next queue is scheduled in. That way
a queue is more accurately accounted for disk time it has consumed. Note
this still does not take care of errors introduced by disk write caching.
o Because the above behavior can result in reduced throughput, it is
enabled only if the user sets the "fairness" tunable to 2 or higher.
o This patch helps in achieving more isolation between reads and buffered
writes in different cgroups. Buffered writes typically utilize the full queue
depth and then expire the queue. On the contrary, sequential reads
typically drive a queue depth of 1. So despite the fact that writes are
using more disk time, it is never accounted to the write queue because we
don't wait for requests to finish after dispatching them. This patch helps
do more accurate accounting of disk time, especially for buffered writes,
hence providing better fairness and better isolation between two cgroups
running read and write workloads.
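As a rough illustration of the rule described above, the check below mirrors the condition this patch adds to the queue selection path: with fairness at 2 or higher, and no forced dispatch, a queue that still has requests outstanding at the device is not expired yet. It is a standalone sketch, not the patch code; struct sketch_ioq and sketch_must_wait() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the io queue. */
struct sketch_ioq {
	unsigned long dispatched; /* requests sent to the device, not yet completed */
};

/*
 * Return true when the current queue must be kept (and nothing new
 * dispatched) until its in-flight requests have completed, so that the
 * disk time they consume is charged to the queue that issued them.
 */
static bool sketch_must_wait(unsigned int fairness, bool force,
			     const struct sketch_ioq *ioq)
{
	return fairness >= 2 && !force && ioq != NULL && ioq->dispatched > 0;
}
```

For example, an async queue that has filled a queue depth of 31 would keep the device to itself until those 31 requests complete, instead of being expired immediately in favor of a sync queue.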
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff EXPORT-SYMBOL(elv-slice-async-store);
-STORE-FUNCTION(elv-fairness-store, &efqd->fairness, 0, 1, 0);
+STORE-FUNCTION(elv-fairness-store, &efqd->fairness, 0, 2, 0);
EXPORT-SYMBOL(elv-fairness-store);
#undef STORE-FUNCTION
@@ -2952,6 +2952,24 @@ void *elv-fq-select-ioq(struct request-queue *q, int force)
}
expire:
+ if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
+ /*
+ * If there are requests dispatched from this queue, don't
+ * dispatch requests from new queue till all the requests from
+ * this queue have completed.
+ *
+ * This helps in attributing right amount of disk time consumed
+ * by a particular queue when hardware allows queuing.
+ *
+ * Set ioq = NULL so that no more requests are dispatched from
+ * this queue.
+ */
+ elv-log-ioq(efqd, ioq, "select: wait for requests to finish"
+ " disp=%lu", ioq->dispatched);
+ ioq = NULL;
+ goto keep-queue;
+ }
+
elv-ioq-slice-expired(q);
new-queue:
ioq = elv-set-active-ioq(q, new-ioq);
@@ -3109,6 +3127,17 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
*/
elv-ioq-arm-slice-timer(q, 1);
} else {
+ /* If fairness >=2 and there are requests
+ * dispatched from this queue, don't dispatch
+ * new requests from a different queue till
+ * all requests from this queue have finished.
+ * This helps in attributing right disk time
+ * to a queue when hardware supports queuing.
+ */
+
+ if (efqd->fairness >= 2 && ioq->dispatched)
+ goto done;
+
/* Expire the queue */
elv-ioq-slice-expired(q);
}
PATCH 24/25 - io-controller: Debug hierarchical IO scheduling by Vivek Goyal on
2009-07-02T20:09:19+00:00
o Little debugging aid for hierarchical IO scheduling.
o Enabled under CONFIG-DEBUG-GROUP-IOSCHED
o Currently it outputs more debug messages in blktrace output, which helps
a great deal in debugging a hierarchical setup. It also creates additional
cgroup interfaces io.disk-queue and io.disk-dequeue to output some more
debugging data.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff submitting thread.
-endmenu
+config DEBUG-GROUP-IOSCHED
+ bool "Debug Hierarchical Scheduling support"
+ depends on CGROUPS && GROUP-IOSCHED
+ default n
+
PATCH 08/25 - io-controller: cgroup related changes for hierarchical group support by Vivek Goyal on
2009-07-02T20:09:19+00:00
o This patch introduces some of the cgroup related code for io controller.
Signed-off-by: Fabio Checconi <fabio@gandalf.sssup.it>
Signed-off-by: Paolo Valente <paolo.valente@unimore.it>
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff ret->ioprio = 0;
+#ifdef CONFIG-GROUP-IOSCHED
+ ret->cgroup-changed = 0;
+#endif
ret->last-waited = jiffies; /* doesn't matter... */
ret->nr-batch-requests = 0; /* because this is 0 */
ret->aic = NULL;
diff
+#define IO-DEFAULT-GRP-WEIGHT 500
+#define IO-DEFAULT-GRP-CLASS IOPRIO-CLASS-BE
+
#define IO-SERVICE-TREE-INIT ((struct io-service-tree)
{ RB-ROOT, RB-ROOT, NULL, NULL, 0, 0 })
@@ -899,6 +902,177 @@ static void io-flush-idle-tree(struct io-service-tree *st)
+ .weight = IO-DEFAULT-GRP-WEIGHT,
+ .ioprio-class = IO-DEFAULT-GRP-CLASS,
+};
+
+static struct io-cgroup *cgroup-to-io-cgroup(struct cgroup *cgroup)
+{
+ return container-of(cgroup-subsys-state(cgroup, io-subsys-id),
+ struct io-cgroup, css);
+}
+
+#define SHOW-FUNCTION(+ if (!cgroup-lock-live-group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup-to-io-cgroup(cgroup);
+ spin-lock-irq(&iocg->lock);
+ ret = iocg->+
+SHOW-FUNCTION(weight);
+SHOW-FUNCTION(ioprio-class);
+#undef SHOW-FUNCTION
+
+#define STORE-FUNCTION(+ struct hlist-node *n;
+
+ if (val < (+
+ spin-lock-irq(&iocg->lock);
+ iocg->+
+ cgroup-unlock();
+
+ return 0;
+}
+
+STORE-FUNCTION(weight, 1, WEIGHT-MAX);
+STORE-FUNCTION(ioprio-class, IOPRIO-CLASS-RT, IOPRIO-CLASS-IDLE);
+#undef STORE-FUNCTION
+
+struct cftype bfqio-files[] = {
+ {
+ .name = "weight",
+ .read-u64 = io-cgroup-weight-read,
+ .write-u64 = io-cgroup-weight-write,
+ },
+ {
+ .name = "ioprio-class",
+ .read-u64 = io-cgroup-ioprio-class-read,
+ .write-u64 = io-cgroup-ioprio-class-write,
+ },
+};
+
+static int iocg-populate(struct cgroup-subsys *subsys, struct cgroup *cgroup)
+{
+ return cgroup-add-files(cgroup, subsys, bfqio-files,
+ ARRAY-SIZE(bfqio-files));
+}
+
+static struct cgroup-subsys-state *iocg-create(struct cgroup-subsys *subsys,
+ struct cgroup *cgroup)
+{
+ struct io-cgroup *iocg;
+
+ if (cgroup->parent != NULL) {
+ iocg = kzalloc(sizeof(*iocg), GFP-KERNEL);
+ if (iocg == NULL)
+ return ERR-PTR(-ENOMEM);
+ } else
+ iocg = &io-root-cgroup;
+
+ spin-lock-init(&iocg->lock);
+ INIT-HLIST-HEAD(&iocg->group-data);
+ iocg->weight = IO-DEFAULT-GRP-WEIGHT;
+ iocg->ioprio-class = IO-DEFAULT-GRP-CLASS;
+
+ return &iocg->css;
+}
+
+/*
+ * We cannot support shared io contexts, as we have no means to support
+ * two tasks with the same ioc in two different groups without major rework
+ * of the main cic/bfqq data structures. By now we allow a task to change
+ * its cgroup only if it's the only owner of its ioc; the drawback of this
+ * behavior is that a group containing a task that forked using CLONE-IO
+ * will not be destroyed until the tasks sharing the ioc die.
+ */
+static int iocg-can-attach(struct cgroup-subsys *subsys, struct cgroup *cgroup,
+ struct task-struct *tsk)
+{
+ struct io-context *ioc;
+ int ret = 0;
+
+ /* task-lock() is needed to avoid races with exit-io-context() */
+ task-lock(tsk);
+ ioc = tsk->io-context;
+ if (ioc != NULL && atomic-read(&ioc->nr-tasks) > 1)
+ /*
+ * ioc == NULL means that the task is either too young or
+ * exiting: if it has still no ioc the ioc can't be shared,
+ * if the task is exiting the attach will fail anyway, no
+ * matter what we return here.
+ */
+ ret = -EINVAL;
+ task-unlock(tsk);
+
+ return ret;
+}
+
+static void iocg-attach(struct cgroup-subsys *subsys, struct cgroup *cgroup,
+ struct cgroup *prev, struct task-struct *tsk)
+{
+ struct io-context *ioc;
+
+ task-lock(tsk);
+ ioc = tsk->io-context;
+ if (ioc != NULL)
+ ioc->cgroup-changed = 1;
+ task-unlock(tsk);
+}
+
+static void iocg-destroy(struct cgroup-subsys *subsys, struct cgroup *cgroup)
+{
+
+ /* Implemented in later patch */
+}
+
+struct cgroup-subsys io-subsys = {
+ .name = "io",
+ .create = iocg-create,
+ .can-attach = iocg-can-attach,
+ .attach = iocg-attach,
+ .destroy = iocg-destroy,
+ .populate = iocg-populate,
+ .subsys-id = io-subsys-id,
+ .use-id = 1,
+};
+#endif /* GROUP-IOSCHED */
/* Elevator fair queuing function */
static inline struct io-queue *elv-active-ioq(struct elevator-queue *e)
{
diff #include <linux/blkdev.h>
+#include <linux/cgroup.h>
#ifndef -BFQ-SCHED-H
#define -BFQ-SCHED-H
#define IO-IOPRIO-CLASSES 3
+#define WEIGHT-MAX 1000
struct io-entity;
struct io-queue;
@@ -88,7 +90,7 @@ struct io-sched-data {
* this entity; used for O(log N) lookups into active trees.
* @service: service received during the last round of service.
* @budget: budget used to calculate F-i; F-i = S-i + @budget / @weight.
- * @weight: weight of the queue, calculated as IOPRIO-BE-NR - @ioprio.
+ * @weight: the weight in use.
* @new-weight: when a weight change is requested, the new weight value
* @parent: parent entity, for hierarchical scheduling.
* @my-sched-data: for non-leaf nodes in the cgroup hierarchy, the
@@ -181,8 +183,10 @@ struct io-queue {
void *sched-queue;
};
+#ifdef CONFIG-GROUP-IOSCHED
struct io-group {
struct io-entity entity;
+ struct hlist-node group-node;
struct io-sched-data sched-data;
struct io-entity *my-entity;
@@ -199,8 +203,45 @@ struct io-group {
* non-RT cfqq in service when this value is non-zero.
*/
unsigned int busy-rt-queues;
+ unsigned short iocg-id;
};
+/**
+ * struct io-cgroup - io cgroup data structure.
+ * @css: subsystem state for io in the containing cgroup.
+ * @weight: cgroup weight.
+ * @ioprio-class: cgroup ioprio-class.
+ * @lock: spinlock that protects @weight, @ioprio-class and @group-data.
+ * @group-data: list containing the io-group belonging to this cgroup.
+ *
+ * @group-data is accessed using RCU, with @lock protecting the updates,
+ * @weight and @ioprio-class are protected by @lock.
+ */
+struct io-cgroup {
+ struct cgroup-subsys-state css;
+
+ unsigned int weight;
+ unsigned short ioprio-class;
+
+ spinlock-t lock;
+ struct hlist-head group-data;
+};
+#else
+struct io-group {
+ struct io-sched-data sched-data;
+
+ /* async-queue and idle-queue are used only for cfq */
+ struct io-queue *async-queue[2][IOPRIO-BE-NR];
+ struct io-queue *async-idle-queue;
+
+ /*
+ * Used to track any pending rt requests so we can pre-empt current
+ * non-RT cfqq in service when this value is non-zero.
+ */
+ unsigned int busy-rt-queues;
+};
+#endif /* CONFIG-GROUP-IOSCHED */
+
struct elv-fq-data {
struct io-group *root-group;
diff /* */
+
+#ifdef CONFIG-GROUP-IOSCHED
+SUBSYS(io)
+#endif
+
+/* */
diff
+#ifdef CONFIG-GROUP-IOSCHED
+ /* If task changes the cgroup, elevator processes it asynchronously */
+ unsigned short cgroup-changed;
+#endif
+
/*
* For request batching
*/
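The STORE-FUNCTION() instantiations in the patch above take a field name plus a minimum and maximum (weight in 1..WEIGHT-MAX, ioprio-class in IOPRIO-CLASS-RT..IOPRIO-CLASS-IDLE). The macro body itself is damaged in this copy, so the following is only a sketch of how such a range-checked setter generator might look. The sketch_* names are hypothetical, returning -EINVAL for out-of-range values is an assumption, and the locking and the propagation to per-device groups done by the real code are left out.

```c
#include <errno.h>

#define SKETCH_WEIGHT_MAX 1000

/* Same numeric values as the kernel's IOPRIO_CLASS_RT/BE/IDLE. */
enum { SKETCH_CLASS_RT = 1, SKETCH_CLASS_BE = 2, SKETCH_CLASS_IDLE = 3 };

/* Hypothetical, cut-down stand-in for struct io-cgroup. */
struct sketch_iocg {
	unsigned int weight;
	unsigned short ioprio_class;
};

/* Generate one range-checked setter per field. */
#define SKETCH_STORE(name, min, max)					\
static int sketch_store_##name(struct sketch_iocg *iocg,		\
			       unsigned long long val)			\
{									\
	if (val < (min) || val > (max))					\
		return -EINVAL;	/* assumed error code */		\
	iocg->name = val;						\
	return 0;							\
}

SKETCH_STORE(weight, 1, SKETCH_WEIGHT_MAX)
SKETCH_STORE(ioprio_class, SKETCH_CLASS_RT, SKETCH_CLASS_IDLE)
```

Generating the setters from one macro keeps the bounds next to the field name, so the valid range of each cgroup file is visible at the point where the handler is instantiated.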
PATCH 17/25 - io-controller: deadline changes for hierarchical fair queuing by Vivek Goyal on
2009-07-02T20:09:45+00:00
This patch changes deadline to use the queue scheduling code from the elevator
layer. One can go back to the old deadline by deselecting
CONFIG-IOSCHED-DEADLINE-HIER.
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
+config IOSCHED-DEADLINE-HIER
+ bool "Deadline Hierarchical Scheduling support"
+ depends on IOSCHED-DEADLINE && CGROUPS
+ select ELV-FAIR-QUEUING
+ select GROUP-IOSCHED
+ default n
+
PATCH 09/25 - io-controller: Common hierarchical fair queuing code in elevator layer by Vivek Goyal on
2009-07-02T20:11:01+00:00
o This patch enables hierarchical fair queuing in common layer. It is
controlled by config option CONFIG-GROUP-IOSCHED.
o Requests keep a reference on ioq and ioq keeps a reference
on groups. For async queues in CFQ, and single ioq in other
schedulers, io-group also keeps a reference on io-queue. This
reference on ioq is dropped when the queue is released
(elv-release-ioq). So the queue can be freed.
When a queue is released, it puts the reference to io-group and the
io-group is released after all the queues are released. Child groups
also take a reference on parent groups, and release it when they are
destroyed.
o Reads of iocg->group-data are not always done under iocg->lock, so all the
operations on that list are still protected by RCU. All modifications to
iocg->group-data should always be done under iocg->lock.
Whenever iocg->lock and queue-lock can both be held, queue-lock should
be held first. This avoids all deadlocks. In order to avoid race
between cgroup deletion and elevator switch the following algorithm is
used:
- Cgroup deletion path holds iocg->lock and removes the iog entry
from the iocg->group-data list. Then it drops iocg->lock, holds
queue-lock and destroys iog. So in this path, we never hold
iocg->lock and queue-lock at the same time. Also, since we
remove iog from iocg->group-data under iocg->lock, we can't
race with elevator switch.
- Elevator switch path does not remove iog from the
iocg->group-data list directly. It first holds iocg->lock and
scans iocg->group-data again to see if iog is still there;
it removes iog only if it finds iog there. Otherwise, cgroup
deletion must have removed it from the list, and cgroup
deletion is responsible for removing iog.
So the path which removes iog from the iocg->group-data list does
the final removal of iog by calling
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
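The ownership rule in the commit message above (whichever path unlinks iog from iocg->group-data also does the final teardown) can be sketched as follows. This is a single-threaded illustration only: the sketch_* names are hypothetical, a plain singly linked list stands in for the RCU hlist, and the places where the patch would hold iocg->lock or queue-lock are marked with comments rather than real locking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io-group. */
struct sketch_iog {
	struct sketch_iog *next;
	bool destroyed;
};

/* Hypothetical stand-in for struct io-cgroup. */
struct sketch_iocg {
	struct sketch_iog *group_data; /* list head; guarded by iocg->lock in the patch */
};

/*
 * Unlink iog if it is still on the list. In the patch this step runs
 * under iocg->lock. Returns true only for the caller that removed it.
 */
static bool sketch_unlink(struct sketch_iocg *iocg, struct sketch_iog *iog)
{
	struct sketch_iog **pp;

	for (pp = &iocg->group_data; *pp != NULL; pp = &(*pp)->next) {
		if (*pp == iog) {
			*pp = iog->next;
			iog->next = NULL;
			return true;
		}
	}
	return false;
}

/*
 * Used by both the cgroup deletion path and the elevator switch path:
 * only the path that won the unlink performs the final teardown
 * (done with queue-lock held in the patch).
 */
static bool sketch_remove_and_destroy(struct sketch_iocg *iocg,
				      struct sketch_iog *iog)
{
	bool removed = sketch_unlink(iocg, iog);

	if (removed) {
		assert(!iog->destroyed); /* teardown must happen exactly once */
		iog->destroyed = true;
	}
	return removed;
}
```

If both paths reach the same group, the second call finds nothing on the list and leaves the group alone, so the teardown runs once no matter which path gets there first.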
-diff cfqq->pid = current->pid;
+ /* ioq reference on iog */
+ elv-get-iog(iog);
cfq-log-cfqq(cfqd, cfqq, "alloced");
}
diff
+static void
+elv-release-ioq(struct elevator-queue *eq, struct io-queue **ioq-ptr);
+
#ifdef CONFIG-GROUP-IOSCHED
#define for-each-entity(entity)
for (; entity != NULL; entity = entity->parent)
@@ -90,6 +93,69 @@ static inline void bfq-check-next-active(struct io-sched-data *sd,
{
BUG-ON(sd->next-active != entity);
}
+
+static inline int iog-deleting(struct io-group *iog)
+{
+ return iog->deleting;
+}
+
+/* Do the two (enqueued) entities belong to the same group ? */
+static inline int
+is-same-group(struct io-entity *entity, struct io-entity *new-entity)
+{
+ if (entity->sched-data == new-entity->sched-data)
+ return 1;
+
+ return 0;
+}
+
+static inline struct io-entity *parent-entity(struct io-entity *entity)
+{
+ return entity->parent;
+}
+
+/* return depth at which a io entity is present in the hierarchy */
+static inline int depth-entity(struct io-entity *entity)
+{
+ int depth = 0;
+
+ for-each-entity(entity)
+ depth++;
+
+ return depth;
+}
+
+static void bfq-find-matching-entity(struct io-entity **entity,
+ struct io-entity **new-entity)
+{
+ int entity-depth, new-entity-depth;
+
+ /*
+ * The preemption test can only be made between sibling entities, i.e.
+ * entities in the same group with a common parent. Walk up the
+ * hierarchy of both entities until we find ancestors that are
+ * siblings under a common parent.
+ */
+
+ /* First walk up until both entities are at same depth */
+ entity-depth = depth-entity(*entity);
+ new-entity-depth = depth-entity(*new-entity);
+
+ while (entity-depth > new-entity-depth) {
+ entity-depth+ }
+
+ while (!is-same-group(*entity, *new-entity)) {
+ *entity = parent-entity(*entity);
+ *new-entity = parent-entity(*new-entity);
+ }
+}
#else /* GROUP-IOSCHED */
#define for-each-entity(entity)
for (; entity != NULL; entity = NULL)
@@ -106,6 +172,17 @@ static inline void bfq-check-next-active(struct io-sched-data *sd,
struct io-entity *entity)
{
}
+
+static inline int iog-deleting(struct io-group *iog)
+{
+ /* In flat mode, root cgroup can't be deleted. */
+ return 0;
+}
+
+static void bfq-find-matching-entity(struct io-entity **entity,
+ struct io-entity **new-entity)
+{
+}
#endif /* GROUP-IOSCHED */
static inline int elv-prio-slice(struct elv-fq-data *efqd, int sync,
@@ -363,13 +440,6 @@ static void bfq-get-entity(struct io-entity *entity)
elv-get-ioq(ioq);
}
-static void bfq-init-entity(struct io-entity *entity, struct io-group *iog)
-{
- entity->ioprio = entity->new-ioprio;
- entity->ioprio-class = entity->new-ioprio-class;
- entity->sched-data = &iog->sched-data;
-}
-
/**
* bfq-find-deepest - find the deepest node that an extraction can modify.
* @node: the node being removed.
@@ -833,8 +903,26 @@ static int + iog = container-of(entity->sched-data, struct io-group, sched-data);
+
+ /*
+ * Hold a reference to entity's iog until we are done. This function
+ * travels the hierarchy and we don't want to free up the group yet
+ * while we are traversing the hierarchy. It is possible that this
+ * group's cgroup has been removed hence cgroup reference is gone.
+ * If this entity was active entity, then its group will not be on
+ * any of the trees and it will be freed up the moment queue is
+ * freed up in +
for-each-entity-safe(entity, parent) {
sd = entity->sched-data;
@@ -852,6 +940,7 @@ static void bfq-deactivate-entity(struct io-entity *entity, int requeue)
* the budgets on the path towards the root
* need to be updated.
*/
+ elv-put-iog(iog);
goto update;
}
@@ -859,11 +948,16 @@ static void bfq-deactivate-entity(struct io-entity *entity, int requeue)
* If we reach there the parent is no more backlogged and
* we want to propagate the dequeue upwards.
*
+ * If entity's group has been marked for deletion, don't
+ * requeue the group in idle tree so that it can be freed.
*/
-
- requeue = 1;
+ return;
update:
@@ -902,8 +996,59 @@ static void io-flush-idle-tree(struct io-service-tree *st)
+io-put-io-group-queues(struct elevator-queue *e, struct io-group *iog)
+{
+ int i, j;
+
+ for (i = 0; i < 2; i++)
+ for (j = 0; j < IOPRIO-BE-NR; j++)
+ elv-release-ioq(e, &iog->async-queue[i][j]);
+
+ /* Free up async idle queue */
+ elv-release-ioq(e, &iog->async-idle-queue);
+}
+
/* Mainly hierarchical grouping code */
#ifdef CONFIG-GROUP-IOSCHED
+static void iocg-destroy(struct cgroup-subsys *subsys, struct cgroup *cgroup);
+
+static void bfq-init-entity(struct io-entity *entity, struct io-group *iog)
+{
+ entity->ioprio = entity->new-ioprio;
+ entity->weight = entity->new-weight;
+ entity->ioprio-class = entity->new-ioprio-class;
+ entity->parent = iog->my-entity;
+ entity->sched-data = &iog->sched-data;
+}
+
+static void io-group-init-entity(struct io-cgroup *iocg, struct io-group *iog)
+{
+ struct io-entity *entity = &iog->entity;
+
+ entity->weight = entity->new-weight = iocg->weight;
+ entity->ioprio-class = entity->new-ioprio-class = iocg->ioprio-class;
+ entity->ioprio-changed = 1;
+ entity->my-sched-data = &iog->sched-data;
+}
+
+static void io-group-set-parent(struct io-group *iog, struct io-group *parent)
+{
+ struct io-entity *entity;
+
+ BUG-ON(parent == NULL);
+ BUG-ON(iog == NULL);
+
+ entity = &iog->entity;
+ entity->parent = parent->my-entity;
+ entity->sched-data = &parent->sched-data;
+ if (entity->parent)
+ /* Child group reference on parent group. */
+ elv-get-iog(parent);
+}
struct io-cgroup io-root-cgroup = {
.weight = IO-DEFAULT-GRP-WEIGHT,
@@ -916,6 +1061,26 @@ static struct io-cgroup *cgroup-to-io-cgroup(struct cgroup *cgroup)
struct io-cgroup, css);
}
+/*
+ * Search the io-group for efqd into the hash table (by now only a list)
+ * of bgrp. Must be called under rcu-read-lock().
+ */
+static struct io-group *
+io-cgroup-lookup-group(struct io-cgroup *iocg, void *key)
+{
+ struct io-group *iog;
+ struct hlist-node *n;
+ void *+
+ return NULL;
+}
+
#define SHOW-FUNCTION(-static void iocg-destroy(struct cgroup-subsys *subsys, struct cgroup *cgroup)
-{
-
- /* Implemented in later patch */
-}
-
struct cgroup-subsys io-subsys = {
.name = "io",
.create = iocg-create,
@@ -1072,7 +1231,599 @@ struct cgroup-subsys io-subsys = {
.subsys-id = io-subsys-id,
.use-id = 1,
};
+
+static inline unsigned int iog-weight(struct io-group *iog)
+{
+ return iog->entity.weight;
+}
+
+/**
+ * io-group-chain-alloc - allocate a chain of groups.
+ * @efqd: queue descriptor.
+ * @cgroup: the leaf cgroup this chain starts from.
+ *
+ * Allocate a chain of groups starting from the one belonging to
+ * @cgroup up to the root cgroup. Stop if a cgroup on the chain
+ * to the root has already an allocated group on @efqd.
+ */
+static struct io-group *
+io-group-chain-alloc(struct request-queue *q, void *key, struct cgroup *cgroup)
+{
+ struct io-cgroup *iocg;
+ struct io-group *iog, *leaf = NULL, *prev = NULL;
+ gfp-t flags = GFP-ATOMIC | + /*
+ * All the cgroups in the path from there to the
+ * root must have a io-group for efqd, so we don't
+ * need any more allocations.
+ */
+ break;
+ }
+
+ iog = kzalloc-node(sizeof(*iog), flags, q->node);
+ if (!iog)
+ goto cleanup;
+
+ iog->iocg-id = css-id(&iocg->css);
+
+ io-group-init-entity(iocg, iog);
+ iog->my-entity = &iog->entity;
+
+ atomic-set(&iog->ref, 0);
+ iog->deleting = 0;
+
+ /*
+ * Take the initial reference that will be released on destroy
+ * This can be thought of a joint reference by cgroup and
+ * elevator which will be dropped by either elevator exit
+ * or cgroup deletion path depending on who is exiting first.
+ */
+ elv-get-iog(iog);
+
+ if (leaf == NULL) {
+ leaf = iog;
+ prev = leaf;
+ } else {
+ io-group-set-parent(prev, iog);
+ /*
+ * Build a list of allocated nodes using the efqd
+ * field, that is still unused and will be initialized
+ * only after the node will be connected.
+ */
+ prev->key = iog;
+ prev = iog;
+ }
+ }
+
+ return leaf;
+
+cleanup:
+ while (leaf != NULL) {
+ prev = leaf;
+ leaf = leaf->key;
+ kfree(prev);
+ }
+
+ return NULL;
+}
+
+/**
+ * io-group-chain-link - link an allocated group chain to a cgroup hierarchy.
+ * @efqd: the queue descriptor.
+ * @cgroup: the leaf cgroup to start from.
+ * @leaf: the leaf group (to be associated to @cgroup).
+ *
+ * Try to link a chain of groups to a cgroup hierarchy, connecting the
+ * nodes bottom-up, so we can be sure that when we find a cgroup in the
+ * hierarchy that already has a group associated to @efqd all the nodes
+ * in the path to the root cgroup have one too.
+ *
+ * On locking: the queue lock protects the hierarchy (there is a hierarchy
+ * per device) while the io-cgroup lock protects the list of groups
+ * belonging to the same cgroup.
+ */
+static void io-group-chain-link(struct request-queue *q, void *key,
+ struct cgroup *cgroup,
+ struct io-group *leaf,
+ struct elv-fq-data *efqd)
+{
+ struct io-cgroup *iocg;
+ struct io-group *iog, *next, *prev = NULL;
+ unsigned long flags;
+
+ assert-spin-locked(q->queue-lock);
+
+ for (; cgroup != NULL && leaf != NULL; cgroup = cgroup->parent) {
+ iocg = cgroup-to-io-cgroup(cgroup);
+ next = leaf->key;
+
+ iog = io-cgroup-lookup-group(iocg, key);
+ BUG-ON(iog != NULL);
+
+ spin-lock-irqsave(&iocg->lock, flags);
+
+ rcu-assign-pointer(leaf->key, key);
+ hlist-add-head-rcu(&leaf->group-node, &iocg->group-data);
+ hlist-add-head(&leaf->elv-data-node, &efqd->group-list);
+
+ spin-unlock-irqrestore(&iocg->lock, flags);
+
+ prev = leaf;
+ leaf = next;
+ }
+
+ BUG-ON(cgroup == NULL && leaf != NULL);
+
+ if (cgroup != NULL && prev != NULL) {
+ iocg = cgroup-to-io-cgroup(cgroup);
+ iog = io-cgroup-lookup-group(iocg, key);
+ io-group-set-parent(prev, iog);
+ }
+}
+
+/**
+ * io-find-alloc-group - return the group associated to @efqd in @cgroup.
+ * @fqd: queue descriptor.
+ * @cgroup: cgroup being searched for.
+ * @create: if set to 1, create the io group if it has not been created yet.
+ *
+ * Return a group associated to @fqd in @cgroup, allocating one if
+ * necessary. When a group is returned all the cgroups in the path
+ * to the root have a group associated to @efqd.
+ *
+ * If the allocation fails, return the root group: this breaks guarantees
+ * but is a safe fallback. If this loss becomes a problem it can be
+ * mitigated using the equivalent weight (given by the product of the
+ * weights of the groups in the path from @group to the root) in the
+ * root scheduler.
+ *
+ * We allocate all the missing nodes in the path from the leaf cgroup
+ * to the root and we connect the nodes only after all the allocations
+ * have been successful.
+ */
+static struct io-group *io-find-alloc-group(struct request-queue *q,
+ struct cgroup *cgroup, struct elv-fq-data *efqd,
+ int create)
+{
+ struct io-cgroup *iocg = cgroup-to-io-cgroup(cgroup);
+ struct io-group *iog = NULL;
+ /* Note: Use efqd as key */
+ void *key = efqd;
+
+ /*
+ * Take a reference to css object. Don't want to map a bio to
+ * a group if it has been marked for deletion
+ */
+
+ if (!css-tryget(&iocg->css))
+ return iog;
+
+ iog = io-cgroup-lookup-group(iocg, key);
+ if (iog != NULL || !create)
+ goto end;
+
+ iog = io-group-chain-alloc(q, key, cgroup);
+ if (iog != NULL)
+ io-group-chain-link(q, key, cgroup, iog, efqd);
+
+end:
+ css-put(&iocg->css);
+ return iog;
+}
+
+/*
+ * Search for the io group current task belongs to. If create=1, then also
+ * create the io group if it is not already there.
+ *
+ * Note: This function should be called with queue lock held. It returns
+ * a pointer to io group without taking any reference. That group will
+ * be around as long as queue lock is not dropped (as group reclaim code
+ * needs to get hold of queue lock). So if somebody needs to use group
+ * pointer even after dropping queue lock, take a reference to the group
+ * before dropping queue lock.
+ */
+struct io-group *io-get-io-group(struct request-queue *q, int create)
+{
+ struct cgroup *cgroup;
+ struct io-group *iog;
+ struct elv-fq-data *efqd = &q->elevator->efqd;
+
+ assert-spin-locked(q->queue-lock);
+
+ rcu-read-lock();
+ cgroup = task-cgroup(current, io-subsys-id);
+ iog = io-find-alloc-group(q, cgroup, efqd, create);
+ if (!iog) {
+ if (create)
+ iog = efqd->root-group;
+ else
+ /*
+ * bio merge functions doing lookup don't want to
+ * map bio to root group by default
+ */
+ iog = NULL;
+ }
+ rcu-read-unlock();
+ return iog;
+}
+EXPORT-SYMBOL(io-get-io-group);
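
The create/fallback decision made by io-get-io-group() can be pictured with a small user-space sketch. All names below (toy_group, toy_efqd, toy_get_group) are hypothetical and spelled with underscores so the sketch compiles; it models only the choice between the found group, the root group and NULL, and assumes nothing about locking, RCU or the cgroup hierarchy.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io-group. */
struct toy_group {
	int id;
};

/* Hypothetical stand-in for struct elv-fq-data: only the root group. */
struct toy_efqd {
	struct toy_group *root_group;
};

/*
 * "found" models the result of io-find-alloc-group(): the group that
 * lookup or allocation produced, or NULL on a miss or a failed allocation.
 */
static struct toy_group *toy_get_group(struct toy_efqd *efqd,
				       struct toy_group *found, int create)
{
	assert(efqd != NULL);

	if (found)
		return found;

	/* create != 0: fall back to the root group; create == 0: report the miss. */
	return create ? efqd->root_group : NULL;
}
```

With create == 0 a miss stays a miss, which is what the bio merge path wants: a bio whose group does not exist yet should not be matched against the root group.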
+
+static void io-free-root-group(struct elevator-queue *e)
+{
+ struct io-cgroup *iocg = &io-root-cgroup;
+ struct elv-fq-data *efqd = &e->efqd;
+ struct io-group *iog = efqd->root-group;
+ struct io-service-tree *st;
+ int i;
+
+ BUG-ON(!iog);
+ spin-lock-irq(&iocg->lock);
+ hlist-del-rcu(&iog->group-node);
+ spin-unlock-irq(&iocg->lock);
+
+ for (i = 0; i < IO-IOPRIO-CLASSES; i++) {
+ st = iog->sched-data.service-tree + i;
+ io-flush-idle-tree(st);
+ }
+
+ io-put-io-group-queues(e, iog);
+ elv-put-iog(iog);
+}
+
+static struct io-group *io-alloc-root-group(struct request-queue *q,
+ struct elevator-queue *e, void *key)
+{
+ struct io-group *iog;
+ struct io-cgroup *iocg;
+ int i;
+
+ iog = kmalloc-node(sizeof(*iog), GFP-KERNEL | + iog->sched-data.service-tree[i] = IO-SERVICE-TREE-INIT;
+
+ iocg = &io-root-cgroup;
+ spin-lock-irq(&iocg->lock);
+ rcu-assign-pointer(iog->key, key);
+ hlist-add-head-rcu(&iog->group-node, &iocg->group-data);
+ iog->iocg-id = css-id(&iocg->css);
+ spin-unlock-irq(&iocg->lock);
+
+ return iog;
+}
+
+static void io-group-free-rcu(struct rcu-head *head)
+{
+ struct io-group *iog;
+
+ iog = container-of(head, struct io-group, rcu-head);
+ kfree(iog);
+}
+
+/*
+ * This cleanup function does the last bit of things to destroy cgroup.
+ * It should only get called after io-destroy-group has been invoked.
+ */
+static void io-group-cleanup(struct io-group *iog)
+{
+ struct io-service-tree *st;
+ struct io-entity *entity = iog->my-entity;
+ int i;
+
+ for (i = 0; i < IO-IOPRIO-CLASSES; i++) {
+ st = iog->sched-data.service-tree + i;
+
+ BUG-ON(!RB-EMPTY-ROOT(&st->active));
+ BUG-ON(!RB-EMPTY-ROOT(&st->idle));
+ BUG-ON(st->wsum != 0);
+ }
+
+ BUG-ON(iog->sched-data.next-active != NULL);
+ BUG-ON(iog->sched-data.active-entity != NULL);
+ BUG-ON(entity != NULL && entity->tree != NULL);
+
+ /*
+ * Wait for any rcu readers to exit before freeing up the group.
+ * Primarily useful when io-get-io-group() is called without queue
+ * lock to access some group data from bdi-congested-group() path.
+ */
+ call-rcu(&iog->rcu-head, io-group-free-rcu);
+}
+
+void elv-put-iog(struct io-group *iog)
+{
+ struct io-group *parent = NULL;
+ struct io-entity *entity;
+
+ BUG-ON(!iog);
+
+ entity = iog->my-entity;
+
+ BUG-ON(atomic-read(&iog->ref) <= 0);
+ if (!atomic-dec-and-test(&iog->ref))
+ return;
+
+ if (entity)
+ parent = container-of(iog->my-entity->parent,
+ struct io-group, entity);
+
+ io-group-cleanup(iog);
+
+ if (parent)
+ elv-put-iog(parent);
+}
+EXPORT-SYMBOL(elv-put-iog);
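
The recursive put in elv-put-iog() can be illustrated in user space. This is a minimal sketch, assuming that each child group holds one reference on its parent (which is what the recursive put implies, but the linking code is not shown here). The names toy_group, toy_group_init and toy_put_group are hypothetical, and a "freed" flag stands in for the RCU-deferred kfree().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io-group: a refcount and a parent link. */
struct toy_group {
	int ref;
	int freed;                /* set instead of freeing memory */
	struct toy_group *parent; /* NULL for the root group */
};

/* Assumption of this sketch: a child pins its parent with one reference. */
static void toy_group_init(struct toy_group *g, struct toy_group *parent)
{
	g->ref = 1;
	g->freed = 0;
	g->parent = parent;
	if (parent)
		parent->ref++;
}

/* Drop one reference; the last put releases the group, then the parent's pin. */
static void toy_put_group(struct toy_group *g)
{
	struct toy_group *parent;

	assert(g != NULL && g->ref > 0);
	if (--g->ref)
		return;

	/* Read the parent before releasing g, as elv-put-iog() does. */
	parent = g->parent;
	g->freed = 1;

	if (parent)
		toy_put_group(parent);
}
```

The point of the ordering is that a parent can never be released while a child still exists, regardless of the order in which the owners drop their own references.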
+
+/*
+ * Check whether a given group has any active entities on any of its
+ * service trees.
+ */
+static inline int io-group-has-active-entities(struct io-group *iog)
+{
+ int i;
+ struct io-service-tree *st;
+
+ for (i = 0; i < IO-IOPRIO-CLASSES; i++) {
+ st = iog->sched-data.service-tree + i;
+ if (!RB-EMPTY-ROOT(&st->active))
+ return 1;
+ }
+
+ /*
+ * Also check there are no active entities being served which are
+ * not on active tree
+ */
+
+ if (iog->sched-data.active-entity)
+ return 1;
+
+ return 0;
+}
+
+/*
+ * After the group is destroyed, no new sync IO should come to the group.
+ * It might still have pending IOs in some busy queues. It should be able to
+ * send those IOs down to the disk. The async IOs (due to dirty page writeback)
+ * would go in the root group queues after this, as the group does not exist
+ * anymore.
+ */
+static void +
+ /*
+ * Mark io group for deletion so that no new entry goes in
+ * idle tree. Any active queue will be removed from active
+ * tree and not put in to idle tree.
+ */
+ iog->deleting = 1;
+
+ /* We flush idle tree now, and don't put things in there any more. */
+ for (i = 0; i < IO-IOPRIO-CLASSES; i++) {
+ st = iog->sched-data.service-tree + i;
+
+ io-flush-idle-tree(st);
+ }
+
+ eq = container-of(efqd, struct elevator-queue, efqd);
+ hlist-del(&iog->elv-data-node);
+ io-put-io-group-queues(eq, iog);
+
+ /*
+ * We can come here either through cgroup deletion path or through
+ * elevator exit path. If we come here through cgroup deletion path
+ * check if io group has any active entities or not. If not, then
+ * deactivate this io group to make sure it is removed from idle
+ * tree it might have been on. If this group was on idle tree, then
+ * this probably will be the last reference and group will be
+ * freed upon putting the reference down.
+ */
+
+ if (!io-group-has-active-entities(iog)) {
+ /*
+ * io group does not have any active entities. Because this
+ * group has been decoupled from io-cgroup list and this
+ * cgroup is being deleted, this group should not receive
+ * any new IO. Hence it should be safe to deactivate this
+ * io group and remove from the scheduling tree.
+ */
+ + elv-put-iog(iog);
+}
+
+static void iocg-destroy(struct cgroup-subsys *subsys, struct cgroup *cgroup)
+{
+ struct io-cgroup *iocg = cgroup-to-io-cgroup(cgroup);
+ struct io-group *iog;
+ struct elv-fq-data *efqd;
+ unsigned long uninitialized-var(flags);
+
+ /*
+ * io groups are linked in two lists. One list is maintained
+ * in elevator (efqd->group-list) and other is maintained
+ * per cgroup structure (iocg->group-data).
+ *
+ * While a cgroup is being deleted, elevator also might be
+ * exiting and both might try to cleanup the same io group
+ * so we need to be a little careful.
+ *
+ * (iocg->group-data) is protected by iocg->lock. To avoid deadlock,
+ * we can't hold the queue lock while holding iocg->lock. So we first
+ * remove iog from iocg->group-data under iocg->lock. Whoever removes
+ * iog from iocg->group-data should call + spin-lock-irqsave(&iocg->lock, flags);
+
+ if (hlist-empty(&iocg->group-data)) {
+ spin-unlock-irqrestore(&iocg->lock, flags);
+ goto done;
+ }
+ iog = hlist-entry(iocg->group-data.first, struct io-group,
+ group-node);
+ efqd = rcu-dereference(iog->key);
+ hlist-del-rcu(&iog->group-node);
+ iog->iocg-id = 0;
+ spin-unlock-irqrestore(&iocg->lock, flags);
+
+ spin-lock-irqsave(efqd->queue->queue-lock, flags);
+ + BUG-ON(!hlist-empty(&iocg->group-data));
+ kfree(iocg);
+}
+
+/*
+ * This function checks if iog is still in iocg->group-data, and removes it.
+ * If iog is not in that list, then cgroup destroy path has removed it, and
+ * we do not need to remove it.
+ */
+static void io-group-check-and-destroy(struct elv-fq-data *efqd,
+ struct io-group *iog)
+{
+ struct io-cgroup *iocg;
+ unsigned long flags;
+ struct cgroup-subsys-state *css;
+
+ rcu-read-lock();
+
+ css = css-lookup(&io-subsys, iog->iocg-id);
+
+ if (!css)
+ goto out;
+
+ iocg = container-of(css, struct io-cgroup, css);
+
+ spin-lock-irqsave(&iocg->lock, flags);
+
+ if (iog->iocg-id) {
+ hlist-del-rcu(&iog->group-node);
+ +
+static void io-disconnect-groups(struct elevator-queue *e)
+{
+ struct hlist-node *pos, *n;
+ struct io-group *iog;
+ struct elv-fq-data *efqd = &e->efqd;
+
+ hlist-for-each-entry-safe(iog, pos, n, &efqd->group-list,
+ elv-data-node) {
+ io-group-check-and-destroy(efqd, iog);
+ }
+}
+
+/*
+ * If the bio submitting task and rq don't belong to the same io-group, the
+ * bio can't be merged.
+ */
+int io-group-allow-merge(struct request *rq, struct bio *bio)
+{
+ struct request-queue *q = rq->q;
+ struct io-queue *ioq = rq->ioq;
+ struct io-group *iog, *+ if (!iog) {
+ /* Maybe the task belongs to a different cgroup for which io
+ * group has not been setup yet. */
+ return 0;
+ }
+
+ /* Determine the io group of the ioq, rq belongs to*/
+ + entity->ioprio = entity->new-ioprio;
+ entity->weight = entity->new-weight;
+ entity->ioprio-class = entity->new-ioprio-class;
+ entity->sched-data = &iog->sched-data;
+}
+
+static inline void io-disconnect-groups(struct elevator-queue *e) {}
+static inline unsigned int iog-weight(struct io-group *iog) { return 0; }
+
+static struct io-group *io-alloc-root-group(struct request-queue *q,
+ struct elevator-queue *e, void *key)
+{
+ struct io-group *iog;
+ int i;
+
+ iog = kmalloc-node(sizeof(*iog), GFP-KERNEL | + return iog;
+}
+
+static void io-free-root-group(struct elevator-queue *e)
+{
+ struct io-group *iog = e->efqd.root-group;
+ struct io-service-tree *st;
+ int i;
+
+ for (i = 0; i < IO-IOPRIO-CLASSES; i++) {
+ st = iog->sched-data.service-tree + i;
+ io-flush-idle-tree(st);
+ }
+
+ io-put-io-group-queues(e, iog);
+ kfree(iog);
+}
+
+struct io-group *io-get-io-group(struct request-queue *q, int create)
+{
+ /* In flat mode, there is only root group */
+ return q->elevator->efqd.root-group;
+}
+EXPORT-SYMBOL(io-get-io-group);
#endif /* GROUP-IOSCHED */
+
/* Elevator fair queuing function */
static inline struct io-queue *elv-active-ioq(struct elevator-queue *e)
{
@@ -1375,10 +2126,14 @@ void elv-put-ioq(struct io-queue *ioq)
struct elv-fq-data *efqd = ioq->efqd;
struct elevator-queue *e = container-of(efqd, struct elevator-queue,
efqd);
+ struct io-group *iog;
BUG-ON(atomic-read(&ioq->ref) <= 0);
if (!atomic-dec-and-test(&ioq->ref))
return;
+
+ iog = ioq-to-io-group(ioq);
+
BUG-ON(ioq->nr-queued);
BUG-ON(ioq->entity.tree != NULL);
BUG-ON(elv-ioq-busy(ioq));
@@ -1390,10 +2145,11 @@ void elv-put-ioq(struct io-queue *ioq)
e->ops->elevator-free-sched-queue-fn(e, ioq->sched-queue);
elv-log-ioq(efqd, ioq, "put-queue");
elv-free-ioq(ioq);
+ elv-put-iog(iog);
}
EXPORT-SYMBOL(elv-put-ioq);
-void elv-release-ioq(struct elevator-queue *e, struct io-queue **ioq-ptr)
+static void elv-release-ioq(struct elevator-queue *e, struct io-queue **ioq-ptr)
{
struct io-queue *ioq = *ioq-ptr;
@@ -1485,8 +2241,12 @@ static void + elv-log-ioq(efqd, ioq, "set-active, busy=%d ioprio=%d"
+ " weight=%u group-weight=%u",
+ efqd->busy-queues,
+ ioq->entity.ioprio, ioq->entity.weight,
+ iog-weight(iog));
ioq->slice-end = 0;
elv-clear-ioq-wait-request(ioq);
@@ -1548,6 +2308,7 @@ static void elv-activate-ioq(struct io-queue *ioq, int add-front)
static void elv-deactivate-ioq(struct elv-fq-data *efqd, struct io-queue *ioq,
int requeue)
{
+ requeue = update-requeue(ioq, requeue);
bfq-deactivate-entity(&ioq->entity, requeue);
}
@@ -1725,6 +2486,7 @@ static int elv-should-preempt(struct request-queue *q, struct io-queue *new-ioq,
struct io-queue *ioq;
struct elevator-queue *eq = q->elevator;
struct io-entity *entity, *new-entity;
+ struct io-group *iog = NULL, *new-iog = NULL;
ioq = elv-active-ioq(eq);
@@ -1735,6 +2497,13 @@ static int elv-should-preempt(struct request-queue *q, struct io-queue *new-ioq,
new-entity = &new-ioq->entity;
/*
+ * In hierarchical setup, one needs to traverse up the hierarchy
+ * till both the queues are children of same parent to make a
+ * decision whether to do the preemption or not.
+ */
+ bfq-find-matching-entity(&entity, &new-entity);
+
+ /*
* Allow an RT request to pre-empt an ongoing non-RT cfqq timeslice.
*/
@@ -1750,9 +2519,17 @@ static int elv-should-preempt(struct request-queue *q, struct io-queue *new-ioq,
return 1;
/*
- * Check with io scheduler if it has additional criterion based on
- * which it wants to preempt existing queue.
+ * If both the queues belong to same group, check with io scheduler
+ * if it has additional criterion based on which it wants to
+ * preempt existing queue.
*/
+ iog = ioq-to-io-group(ioq);
+ new-iog = ioq-to-io-group(new-ioq);
+
+ if (iog != new-iog)
+ return 0;
+
+
if (eq->ops->elevator-should-preempt-fn)
return eq->ops->elevator-should-preempt-fn(q,
ioq-sched-queue(new-ioq), rq);
@@ -2171,15 +2948,6 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
elv-schedule-dispatch(q);
}
-struct io-group *io-get-io-group(struct request-queue *q)
-{
- struct elv-fq-data *efqd = &q->elevator->efqd;
-
- /* In flat mode, there is only root group */
- return efqd->root-group;
-}
-EXPORT-SYMBOL(io-get-io-group);
-
void *io-group-async-queue-prio(struct io-group *iog, int ioprio-class,
int ioprio)
{
@@ -2230,53 +2998,6 @@ void io-group-set-async-queue(struct io-group *iog, int ioprio-class,
}
EXPORT-SYMBOL(io-group-set-async-queue);
-/*
- * Release all the io group references to its async queues.
- */
-static void
-io-put-io-group-queues(struct elevator-queue *e, struct io-group *iog)
-{
- int i, j;
-
- for (i = 0; i < 2; i++)
- for (j = 0; j < IOPRIO-BE-NR; j++)
- elv-release-ioq(e, &iog->async-queue[i][j]);
-
- /* Free up async idle queue */
- elv-release-ioq(e, &iog->async-idle-queue);
-}
-
-static struct io-group *io-alloc-root-group(struct request-queue *q,
- struct elevator-queue *e, void *key)
-{
- struct io-group *iog;
- int i;
-
- iog = kmalloc-node(sizeof(*iog), GFP-KERNEL | - return iog;
-}
-
-static void io-free-root-group(struct elevator-queue *e)
-{
- struct io-group *iog = e->efqd.root-group;
- struct io-service-tree *st;
- int i;
-
- for (i = 0; i < IO-IOPRIO-CLASSES; i++) {
- st = iog->sched-data.service-tree + i;
- io-flush-idle-tree(st);
- }
-
- io-put-io-group-queues(e, iog);
- kfree(iog);
-}
-
static void elv-slab-kill(void)
{
/*
@@ -2320,6 +3041,7 @@ int elv-init-fq-data(struct request-queue *q, struct elevator-queue *e)
efqd->idle-slice-timer.data = (unsigned long) efqd;
INIT-WORK(&efqd->unplug-work, elv-kick-queue);
+ INIT-HLIST-HEAD(&efqd->group-list);
efqd->elv-slice[0] = elv-slice-async;
efqd->elv-slice[1] = elv-slice-sync;
@@ -2339,12 +3061,23 @@ int elv-init-fq-data(struct request-queue *q, struct elevator-queue *e)
void elv-exit-fq-data(struct elevator-queue *e)
{
struct elv-fq-data *efqd = &e->efqd;
+ struct request-queue *q = efqd->queue;
if (!elv-iosched-fair-queuing-enabled(e))
return;
elv-shutdown-timer-wq(e);
+ spin-lock-irq(q->queue-lock);
+ /* This should drop all the io group references of async queues */
+ io-disconnect-groups(e);
+ spin-unlock-irq(q->queue-lock);
+
+ elv-shutdown-timer-wq(e);
+
+ /* Wait for iog->key accessors to exit their grace periods. */
+ synchronize-rcu();
+
BUG-ON(timer-pending(&efqd->idle-slice-timer));
io-free-root-group(e);
}
diff #ifdef CONFIG-GROUP-IOSCHED
+/**
+ * struct io-group - per (device, cgroup) data structure.
+ * @entity: schedulable entity to insert into the parent group sched-data.
+ * @sched-data: own sched-data, to contain child entities (they may be
+ * both io-queues and io-groups).
+ * @group-node: node to be inserted into the io-cgroup->group-data
+ * list of the containing cgroup's io-cgroup.
+ * @elv-data-node: node to be inserted into the @efqd->group-list list
+ * of the groups active on the same device; used for cleanup.
+ * @async-queue: array of async queues for all the tasks belonging to
+ * the group, one queue per ioprio value per ioprio-class,
+ * except for the idle class that has only one queue.
+ * @async-idle-queue: async queue for the idle class (ioprio is ignored).
+ * @my-entity: pointer to @entity, %NULL for the toplevel group; used
+ * to avoid too many special cases during group creation/migration.
+ *
+ * Each (device, cgroup) pair has its own io-group, i.e., for each cgroup
+ * there is a set of io-groups, each one collecting the lower-level
+ * entities belonging to the group that are acting on the same device.
+ *
+ * Locking works as follows:
+ * o @group-node is protected by the io-cgroup lock, and is accessed
+ * via RCU from its readers.
+ * o @efqd is protected by the queue lock, RCU is used to access it
+ * from the readers.
+ * o All the other fields are protected by the @efqd queue lock.
+ */
struct io-group {
struct io-entity entity;
+ struct hlist-node elv-data-node;
struct hlist-node group-node;
struct io-sched-data sched-data;
+ atomic-t ref;
struct io-entity *my-entity;
/*
+ * A cgroup has multiple io-groups, one for each request queue.
+ * to find io group belonging to a particular queue, elv-fq-data
+ * pointer is stored as a key.
+ */
+ void *key;
+
+ /*
* async queue for each priority case for RT and BE class.
* Used only for cfq.
*/
@@ -198,11 +234,15 @@ struct io-group {
struct io-queue *async-queue[2][IOPRIO-BE-NR];
struct io-queue *async-idle-queue;
+ struct rcu-head rcu-head;
+
/*
* Used to track any pending rt requests so we can pre-empt current
* non-RT cfqq in service when this value is non-zero.
*/
unsigned int busy-rt-queues;
+
+ int deleting;
unsigned short iocg-id;
};
@@ -245,6 +285,9 @@ struct io-group {
struct elv-fq-data {
struct io-group *root-group;
+ /* List of io groups hanging on this elevator */
+ struct hlist-head group-list;
+
struct request-queue *queue;
unsigned int busy-queues;
@@ -407,7 +450,7 @@ static inline void elv-ioq-set-ioprio-class(struct io-queue *ioq,
static inline unsigned int bfq-ioprio-to-weight(int ioprio)
{
WARN-ON(ioprio < 0 || ioprio >= IOPRIO-BE-NR);
- return IOPRIO-BE-NR - ioprio;
+ return ((IOPRIO-BE-NR - ioprio) * WEIGHT-MAX)/IOPRIO-BE-NR;
}
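
A worked example of the new ioprio-to-weight mapping, as a user-space sketch. IOPRIO-BE-NR is 8 in mainline kernels; the value of WEIGHT-MAX is not shown in this excerpt, so 1000 below is an assumption chosen only to make the arithmetic concrete. The TOY_ names are hypothetical and use underscores so the code compiles.

```c
#include <assert.h>

#define TOY_IOPRIO_BE_NR 8    /* number of best-effort priority levels */
#define TOY_WEIGHT_MAX   1000 /* assumed value, not taken from the patch */

/* Same arithmetic as the new bfq-ioprio-to-weight(): multiply, then divide. */
static unsigned int toy_ioprio_to_weight(int ioprio)
{
	assert(ioprio >= 0 && ioprio < TOY_IOPRIO_BE_NR);
	return ((TOY_IOPRIO_BE_NR - ioprio) * TOY_WEIGHT_MAX) / TOY_IOPRIO_BE_NR;
}
```

Under these assumptions ioprio 0 maps to 1000, ioprio 4 to 500 and ioprio 7 to 125, so priorities are spread linearly over the weight range instead of over 1..8 as before.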
static inline void elv-ioq-set-ioprio(struct io-queue *ioq, int ioprio)
@@ -430,6 +473,46 @@ static inline struct io-group *ioq-to-io-group(struct io-queue *ioq)
sched-data);
}
+#ifdef CONFIG-GROUP-IOSCHED
+extern int io-group-allow-merge(struct request *rq, struct bio *bio);
+extern void elv-put-iog(struct io-group *iog);
+
+static inline void elv-get-iog(struct io-group *iog)
+{
+ atomic-inc(&iog->ref);
+}
+
+static inline int update-requeue(struct io-queue *ioq, int requeue)
+{
+ struct io-group *iog = ioq-to-io-group(ioq);
+
+ if (iog->deleting == 1)
+ return 0;
+
+ return requeue;
+}
+
+#else /* !GROUP-IOSCHED */
+static inline int io-group-allow-merge(struct request *rq, struct bio *bio)
+{
+ return 1;
+}
+
+static inline void elv-get-iog(struct io-group *iog)
+{
+}
+
+static inline void elv-put-iog(struct io-group *iog)
+{
+}
+
+static inline int update-requeue(struct io-queue *ioq, int requeue)
+{
+ return requeue;
+}
+
+#endif /* GROUP-IOSCHED */
+
extern ssize-t elv-slice-idle-show(struct elevator-queue *q, char *name);
extern ssize-t elv-slice-idle-store(struct elevator-queue *q, const char *name,
size-t count);
@@ -477,7 +560,7 @@ extern void *io-group-async-queue-prio(struct io-group *iog, int ioprio-class,
int ioprio);
extern void io-group-set-async-queue(struct io-group *iog, int ioprio-class,
int ioprio, struct io-queue *ioq);
-extern struct io-group *io-get-io-group(struct request-queue *q);
+extern struct io-group *io-get-io-group(struct request-queue *q, int create);
extern int elv-nr-busy-ioq(struct elevator-queue *e);
extern struct io-queue *elv-alloc-ioq(struct request-queue *q, gfp-t gfp-mask);
extern void elv-free-ioq(struct io-queue *ioq);
@@ -528,5 +611,11 @@ static inline void *elv-fq-select-ioq(struct request-queue *q, int force)
{
return NULL;
}
+
+static inline int io-group-allow-merge(struct request *rq, struct bio *bio)
+
+{
+ return 1;
+}
#endif /* CONFIG-ELV-FAIR-QUEUING */
#endif /* -BFQ-SCHED-H */
diff
+ /* If rq and bio belong to different groups, don't allow merging */
+ if (!io-group-allow-merge(rq, bio))
+ return 0;
+
if (!elv-iosched-allow-merge(rq, bio))
return 0;
PATCH 11/25 - io-controller: Export disk time used and nr sectors dispatched through cgroups by Vivek Goyal on
2009-07-02T20:11:08+00:00
o This patch exports some statistics through cgroup interface. Two of the
statistics currently exported are actual disk time assigned to the cgroup
and actual number of sectors dispatched to disk on behalf of this cgroup.
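
A minimal sketch of how the two counters accumulate, assuming a parent-linked entity hierarchy like the one for-each-entity() walks. The names toy_entity and toy_entity_served are hypothetical; vtime updates and service trees are left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io-entity. */
struct toy_entity {
	struct toy_entity *parent;          /* NULL at the top of the hierarchy */
	unsigned long total_service;        /* disk time charged so far */
	unsigned long total_sector_service; /* sectors charged so far */
};

/* Charge one slice to the entity and to every ancestor above it. */
static void toy_entity_served(struct toy_entity *entity,
			      unsigned long served, unsigned long nr_sectors)
{
	assert(entity != NULL);

	for (; entity != NULL; entity = entity->parent) {
		entity->total_service += served;
		entity->total_sector_service += nr_sectors;
	}
}
```

Because every charge propagates upwards, the totals of a group entity are the sums over all queues and child groups below it, which is what the per-cgroup files report per device.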
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff #include <linux/blktrace-api.h>
+#include <linux/seq-file.h>
/* Values taken from cfq */
const int elv-slice-sync = HZ / 10;
@@ -971,13 +972,16 @@ update:
}
}
-static void entity-served(struct io-entity *entity, unsigned long served)
+void entity-served(struct io-entity *entity, unsigned long served,
+ unsigned long nr-sectors)
{
struct io-service-tree *st;
for-each-entity(entity) {
st = io-entity-service-tree(entity);
entity->service += served;
+ entity->total-service += served;
+ entity->total-sector-service += nr-sectors;
BUG-ON(st->wsum == 0);
st->vtime += bfq-delta(served, st->wsum);
bfq-forget-idle(st);
@@ -1140,6 +1144,66 @@ STORE-FUNCTION(weight, 1, WEIGHT-MAX);
STORE-FUNCTION(ioprio-class, IOPRIO-CLASS-RT, IOPRIO-CLASS-IDLE);
#undef STORE-FUNCTION
+static int io-cgroup-disk-time-read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq-file *m)
+{
+ struct io-cgroup *iocg;
+ struct io-group *iog;
+ struct hlist-node *n;
+
+ if (!cgroup-lock-live-group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup-to-io-cgroup(cgroup);
+
+ rcu-read-lock();
+ hlist-for-each-entry-rcu(iog, n, &iocg->group-data, group-node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (iog->key) {
+ seq-printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total-service);
+ }
+ }
+ rcu-read-unlock();
+ cgroup-unlock();
+
+ return 0;
+}
+
+static int io-cgroup-disk-sectors-read(struct cgroup *cgroup,
+ struct cftype *cftype, struct seq-file *m)
+{
+ struct io-cgroup *iocg;
+ struct io-group *iog;
+ struct hlist-node *n;
+
+ if (!cgroup-lock-live-group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup-to-io-cgroup(cgroup);
+
+ rcu-read-lock();
+ hlist-for-each-entry-rcu(iog, n, &iocg->group-data, group-node) {
+ /*
+ * There might be groups which are not functional and
+ * waiting to be reclaimed upon cgroup deletion.
+ */
+ if (iog->key) {
+ seq-printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ MINOR(iog->dev),
+ iog->entity.total-sector-service);
+ }
+ }
+ rcu-read-unlock();
+ cgroup-unlock();
+
+ return 0;
+}
+
struct cftype bfqio-files[] = {
{
.name = "weight",
@@ -1151,6 +1215,14 @@ struct cftype bfqio-files[] = {
.read-u64 = io-cgroup-ioprio-class-read,
.write-u64 = io-cgroup-ioprio-class-write,
},
+ {
+ .name = "disk-time",
+ .read-seq-string = io-cgroup-disk-time-read,
+ },
+ {
+ .name = "disk-sectors",
+ .read-seq-string = io-cgroup-disk-sectors-read,
+ },
};
static int iocg-populate(struct cgroup-subsys *subsys, struct cgroup *cgroup)
@@ -1252,6 +1324,8 @@ io-group-chain-alloc(struct request-queue *q, void *key, struct cgroup *cgroup)
struct io-cgroup *iocg;
struct io-group *iog, *leaf = NULL, *prev = NULL;
gfp-t flags = GFP-ATOMIC |
iog->iocg-id = css-id(&iocg->css);
+ sscanf(dev-name(bdi->dev), "%u:%u", &major, &minor);
+ iog->dev = MKDEV(major, minor);
+
io-group-init-entity(iocg, iog);
iog->my-entity = &iog->entity;
@@ -1873,7 +1950,7 @@ EXPORT-SYMBOL(elv-get-slice-idle);
static void elv-ioq-served(struct io-queue *ioq, unsigned long served)
{
- entity-served(&ioq->entity, served);
+ entity-served(&ioq->entity, served, ioq->nr-sectors);
}
/* Tells whether ioq is queued in root group or not */
diff int ioprio-changed;
+
+ /*
+ * Keep track of total service received by this entity. Keep the
+ * stats both for time slices and number of sectors dispatched
+ */
+ unsigned long total-service;
+ unsigned long total-sector-service;
};
/*
@@ -244,6 +251,9 @@ struct io-group {
int deleting;
unsigned short iocg-id;
+
+ /* The device MKDEV(major, minor), this group has been created for */
+ dev-t dev;
};
/**
PATCH 20/25 - io-controller: map async requests to appropriate cgroup by Vivek Goyal on
2009-07-02T20:11:50+00:00
o So far we were assuming that a bio/rq belongs to the task that is submitting
it. This does not hold for async writes. This patch makes use of the
blkio-cgroup patches to attribute async writes to the right group instead
of to the task submitting the bio.
o For sync requests, we continue to assume that io belongs to the task
submitting it. Only in case of async requests, we make use of io tracking
patches to track the owner cgroup.
o So far cfq always caches the async queue pointer. With async requests no
longer necessarily tied to the submitting task's io context, caching the
pointer will not help for async queues. This patch introduces a new config
option CONFIG-TRACK-ASYNC-CONTEXT. If this option is not set, cfq retains
the old behavior where the async queue pointer is cached in task context. If
it is set, the async queue pointer is not cached and we use the bio
tracking patches to determine the group a bio belongs to and then map it to
the async queue of that group.
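
The sync/async attribution rule described above can be sketched as follows. This is only an illustration: toy_bio, toy_group and toy_bio_group are hypothetical names, and the fallback to the submitter's group when no owner has been recorded is an assumption of the sketch, not something stated in the description.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io-group. */
struct toy_group {
	int id;
};

/* Hypothetical view of a bio: who submits it and who owns the data. */
struct toy_bio {
	int is_sync;
	struct toy_group *submitter_group; /* group of the submitting task */
	struct toy_group *owner_group;     /* group recorded by bio tracking, may be NULL */
};

/*
 * Sync IO is charged to the submitter. Async IO is charged to the tracked
 * owner when tracking is enabled and an owner is known; otherwise the
 * sketch falls back to the submitter (assumed behaviour).
 */
static struct toy_group *toy_bio_group(const struct toy_bio *bio,
				       int track_async_context)
{
	assert(bio != NULL);

	if (bio->is_sync || !track_async_context || !bio->owner_group)
		return bio->submitter_group;

	return bio->owner_group;
}
```

The track_async_context argument plays the role of CONFIG-TRACK-ASYNC-CONTEXT: with it off, every bio is attributed to the submitting task, as before.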
Signed-off-by: Nauman Rafique <nauman@google.com>
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff
+config TRACK-ASYNC-CONTEXT
+ bool "Determine async request context from bio"
+ depends on GROUP-IOSCHED
+ select CGROUP-BLKIO
+ default n
+
Re: PATCH 13/25 - io-controller: Wait for requests to complete from last queue before new queue is scheduled by Nauman Rafique on
2009-07-02T20:12:01+00:00
On Thu, Jul 2, 2009 at 1:01 PM, Vivek Goyal<vgoyal@redhat.com> wrote:
> o Currently one can dispatch requests from multiple queues to the disk. This
> is true for hardware which supports queuing. So if a disk supports a queue
> depth of 31, it is possible that 20 requests are dispatched from queue 1
> and then the next queue is scheduled in, which dispatches more requests.
>
> o This multiple queue dispatch introduces issues for accurate accounting of
> disk time consumed by a particular queue. For example, if one async queue
> is scheduled in, it can dispatch 31 requests to the disk and then it will
> be expired and a new sync queue might get scheduled in. These 31 requests
> might take a long time to finish but this time is never accounted to the
> async queue which dispatched these requests.
>
> o This patch introduces the functionality where we wait for all the requests
> to finish from previous queue before next queue is scheduled in. That way
> a queue is more accurately accounted for disk time it has consumed. Note
> this still does not take care of errors introduced by disk write caching.
>
> o Because above behavior can result in reduced throughput, this behavior will
> be enabled only if user sets "fairness" tunable to 2 or higher.
Vivek,
Did you collect any numbers for the impact on throughput from this
patch? It seems like with this change, we can even support NCQ.
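
For readers following the thread, the rule under discussion can be reduced to one predicate. This is a user-space sketch with hypothetical names (toy_ioq, toy_may_switch_queue); it mirrors the condition in the quoted hunk below and nothing else, and the 0..2 range check reflects the bounds passed to the fairness store function in that hunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct io-queue. */
struct toy_ioq {
	unsigned long dispatched; /* requests sent to the disk, not yet completed */
};

/*
 * Returns 1 if the scheduler may switch to another queue now, 0 if it
 * should first wait for this queue's in-flight requests to complete.
 */
static int toy_may_switch_queue(const struct toy_ioq *ioq,
				unsigned int fairness, int force)
{
	assert(fairness <= 2); /* the tunable accepts 0..2 in this patch */

	if (ioq == NULL) /* nothing active, nothing to wait for */
		return 1;
	if (fairness >= 2 && !force && ioq->dispatched)
		return 0;
	return 1;
}
```

With fairness below 2 the predicate is always true, so the throughput cost of draining the device queue is only paid by users who opt in.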
>
> o This patch helps in achieving more isolation between reads and buffered
> writes in different cgroups. buffered writes typically utilize full queue
> depth and then expire the queue. On the contrary, sequential reads
> typically drive a queue depth of 1. So despite the fact that writes are
> using more disk time it is never accounted to write queue because we don't
> wait for requests to finish after dispatching these. This patch helps
> do more accurate accounting of disk time, especially for buffered writes
> hence providing better fairness hence better isolation between two cgroups
> running read and write workloads.
>
> Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
> > +++ b/block/elevator-fq.c
> @@ -2038,7 +2038,7 @@ STORE-FUNCTION(elv-slice-sync-store, &efqd->elv-slice[1], 1, UINT-MAX, 1);
> EXPORT-SYMBOL(elv-slice-sync-store);
> STORE-FUNCTION(elv-slice-async-store, &efqd->elv-slice[0], 1, UINT-MAX, 1);
> EXPORT-SYMBOL(elv-slice-async-store);
> -STORE-FUNCTION(elv-fairness-store, &efqd->fairness, 0, 1, 0);
> +STORE-FUNCTION(elv-fairness-store, &efqd->fairness, 0, 2, 0);
> EXPORT-SYMBOL(elv-fairness-store);
> #undef STORE-FUNCTION
>
> @@ -2952,6 +2952,24 @@ void *elv-fq-select-ioq(struct request-queue *q, int force)
> }
>
> expire:
> + if (efqd->fairness >= 2 && !force && ioq && ioq->dispatched) {
> + /*
> + * If there are request dispatched from this queue, don't
> + * dispatch requests from new queue till all the requests from
> + * this queue have completed.
> + *
> + * This helps in attributing right amount of disk time consumed
> + * by a particular queue when hardware allows queuing.
> + *
> + * Set ioq = NULL so that no more requests are dispatched from
> + * this queue.
> + */
> + elv-log-ioq(efqd, ioq, "select: wait for requests to finish"
> + " disp=%lu", ioq->dispatched);
> + ioq = NULL;
> + goto keep-queue;
> + }
> +
> elv-ioq-slice-expired(q);
> new-queue:
> ioq = elv-set-active-ioq(q, new-ioq);
> @@ -3109,6 +3127,17 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
> */
> elv-ioq-arm-slice-timer(q, 1);
> } else {
> + /* If fairness >=2 and there are requests
> + * dispatched from this queue, don't dispatch
> + * new requests from a different queue till
> + * all requests from this queue have finished.
> + * This helps in attributing right disk time
> + * to a queue when hardware supports queuing.
> + */
> +
> + if (efqd->fairness >= 2 && ioq->dispatched)
> + goto done;
> +
> /* Expire the queue */
> elv-ioq-slice-expired(q);
> }
PATCH 12/25 - io-controller: idle for sometime on sync queue before expiring it by Vivek Goyal on
2009-07-02T20:12:18+00:00
o When a sync queue expires, in many cases it might be empty and then
it will be deleted from the active tree. This will lead to a scenario
where out of two competing queues, only one is on the tree and when a
new queue is selected, vtime jump takes place and we don't see services
provided in proportion to weight.
o In general this is a fundamental problem with fairness of sync queues
where queues are not continuously backlogged. Idling looks like the
only solution to make sure such queues can get a decent amount
of disk bandwidth in the face of competition from continuously backlogged
queues. But excessive idling has the potential to reduce performance on SSDs
and disks with command queuing.
o This patch experiments with waiting for next request to come before a
queue is expired after it has consumed its time slice. This can ensure
more accurate fairness numbers in some cases.
o Introduced a tunable "fairness". If set, io-controller will put more
focus on getting fairness right than getting throughput right.
o When writes are being done on a file opened with O-SYNC, the ioscheduler sees
synchronous write requests with the noidle flag set. But in fact we are
seeing a continuous stream of writes within 1ms or so. Hence it makes sense
to wait on these writes. For the time being, to achieve fairness for O-SYNC
writes, continue to idle even if the last request was a sync write and the
noidle flag was set. (Only done if "fairness" is set.) The right fix is
probably to make sure that in the O-SYNC path, requests are not marked with
the noidle flag.
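
The idea in the bullets above can be reduced to a small decision function. This is a simplified sketch with hypothetical names (toy_action, toy_slice_used); the real patch uses a wait-busy flag and the idle slice timer, and the exact conditions there may differ from the ones assumed here.

```c
#include <assert.h>

enum toy_action {
	TOY_EXPIRE,   /* expire the queue right away */
	TOY_WAIT_BUSY /* arm the idle timer and wait for the next request */
};

/*
 * Called when a queue has used up its slice. An empty sync queue would
 * leave the active tree on expiry, so with "fairness" set we wait briefly
 * for it to become busy again instead.
 */
static enum toy_action toy_slice_used(int is_sync, unsigned int nr_queued,
				      unsigned int fairness)
{
	if (fairness && is_sync && nr_queued == 0)
		return TOY_WAIT_BUSY;
	return TOY_EXPIRE;
}
```

A queue that still has requests queued is expired as usual, since it stays backlogged and keeps its place in the tree without any idling.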
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
-diff ELV-ATTR(slice-async),
+ ELV-ATTR(fairness),
@@ -424,6 +424,7 @@ static void bfq-active-insert(struct io-service-tree *st,
struct rb-node *node = &entity->rb-node;
bfq-insert(&st->active, entity);
+ entity->sched-data->nr-active++;
if (node->rb-left != NULL)
node = node->rb-left;
@@ -483,6 +484,7 @@ static void bfq-active-remove(struct io-service-tree *st,
node = bfq-find-deepest(&entity->rb-node);
bfq-remove(&st->active, entity);
+ entity->sched-data->nr-active
+/*
+ * Returns the number of active entities a particular io group has. This
+ * includes number of active entities on service tree as well as the active
+ * entity which is being served currently, if any.
+ */
+
+static inline int elv-iog-nr-active(struct io-group *iog)
+{
+ struct io-sched-data *sd = &iog->sched-data;
+
+ if (sd->active-entity)
+ return sd->nr-active + 1;
+ else
+ return sd->nr-active;
+}
static struct io-service-tree *
#undef SHOW-FUNCTION
#define STORE-FUNCTION( #undef STORE-FUNCTION
void elv-schedule-dispatch(struct request-queue *q)
@@ -2142,7 +2163,7 @@ static void elv-ioq-update-idle-window(struct elevator-queue *eq,
* io scheduler if it wants to disable idling based on additional
* considerations like seek pattern.
*/
- if (enable-idle) {
+ if (enable-idle && !efqd->fairness) {
if (eq->ops->elevator-update-idle-window-fn)
enable-idle = eq->ops->elevator-update-idle-window-fn(
eq, ioq->sched-queue, rq);
@@ -2328,6 +2349,7 @@ static void del-timer(&efqd->idle-slice-timer);
@@ -2483,10 +2505,12 @@ void
elv-clear-ioq-wait-request(ioq);
+ elv-clear-ioq-wait-busy(ioq);
+ elv-clear-ioq-wait-busy-done(ioq);
/*
* if ioq->slice-end = 0, that means a queue was expired before first
@@ -2659,7 +2683,7 @@ void elv-ioq-request-add(struct request-queue *q, struct request *rq)
* has other work pending, don't risk delaying until the
* idle timer unplug to continue working.
*/
- if (elv-ioq-wait-request(ioq)) {
+ if (elv-ioq-wait-request(ioq) && !elv-ioq-wait-busy(ioq)) {
if (blk-rq-bytes(rq) > PAGE-CACHE-SIZE ||
efqd->busy-queues > 1) {
del-timer(&efqd->idle-slice-timer);
@@ -2667,6 +2691,18 @@ void elv-ioq-request-add(struct request-queue *q, struct request *rq)
}
elv-mark-ioq-must-dispatch(ioq);
}
+
+ /*
+ * If we were waiting for a request on this queue, wait is
+ * done. Schedule the next dispatch
+ */
+ if (elv-ioq-wait-busy(ioq)) {
+ del-timer(&efqd->idle-slice-timer);
+ elv-clear-ioq-wait-busy(ioq);
+ elv-mark-ioq-wait-busy-done(ioq);
+ elv-clear-ioq-must-dispatch(ioq);
+ elv-schedule-dispatch(q);
+ }
} else if (elv-should-preempt(q, ioq, rq)) {
/*
* not the active queue - expire current slice if it is
@@ -2694,6 +2730,9 @@ static void elv-idle-slice-timer(unsigned long data)
if (ioq) {
+ if (elv-ioq-wait-busy(ioq))
+ goto expire;
+
/*
* We saw a request before the queue expired, let it through
*/
@@ -2727,7 +2766,7 @@ out-cont:
spin-unlock-irqrestore(q->queue-lock, flags);
}
-static void elv-ioq-arm-slice-timer(struct request-queue *q)
+static void elv-ioq-arm-slice-timer(struct request-queue *q, int wait-for-busy)
{
struct elv-fq-data *efqd = &q->elevator->efqd;
struct io-queue *ioq = elv-active-ioq(q->elevator);
@@ -2740,26 +2779,38 @@ static void elv-ioq-arm-slice-timer(struct request-queue *q)
* for devices that support queuing, otherwise we still have a problem
* with sync vs async workloads.
*/
- if (blk-queue-nonrot(q) && efqd->hw-tag)
+ if (blk-queue-nonrot(q) && efqd->hw-tag && !efqd->fairness)
return;
/*
- * still requests with the driver, don't idle
+ * idle is disabled, either manually or by past process history
*/
- if (efqd->rq-in-driver)
+ if (!efqd->elv-slice-idle || !elv-ioq-idle-window(ioq))
return;
/*
- * idle is disabled, either manually or by past process history
+ * This queue has consumed its time slice. We are waiting only for
+ * it to become busy before we select next queue for dispatch.
*/
- if (!efqd->elv-slice-idle || !elv-ioq-idle-window(ioq))
+ if (wait-for-busy) {
+ elv-mark-ioq-wait-busy(ioq);
+ sl = efqd->elv-slice-idle;
+ mod-timer(&efqd->idle-slice-timer, jiffies + sl);
+ elv-log-ioq(efqd, ioq, "arm idle: %lu wait busy=1", sl);
+ return;
+ }
+
+ /*
+ * still requests with the driver, don't idle
+ */
+ if (efqd->rq-in-driver && !efqd->fairness)
return;
/*
* maybe the iosched has its own idling logic. In that case the io
* scheduler will take care of arming the timer, if need be.
*/
- if (q->elevator->ops->elevator-arm-slice-timer-fn) {
+ if (q->elevator->ops->elevator-arm-slice-timer-fn && !efqd->fairness) {
q->elevator->ops->elevator-arm-slice-timer-fn(q,
ioq->sched-queue);
} else {
@@ -2822,11 +2873,38 @@ void *elv-fq-select-ioq(struct request-queue *q, int force)
goto expire;
}
+ /* We are waiting for this queue to become busy before it expires.*/
+ if (efqd->fairness && elv-ioq-wait-busy(ioq)) {
+ ioq = NULL;
+ goto keep-queue;
+ }
+
/*
* The active queue has run out of time, expire it and select new.
*/
- if (elv-ioq-slice-used(ioq) && !elv-ioq-must-dispatch(ioq))
- goto expire;
+ if (elv-ioq-slice-used(ioq) && !elv-ioq-must-dispatch(ioq)) {
+ /*
+ * Queue has used up its slice. Wait busy is not on otherwise
+ * we wouldn't have been here. There is a chance that after
+ * slice expiry no request from the queue completed hence
+ * wait busy timer could not be turned on. If that's the case
+ * don't expire the queue yet. Next request completion from
+ * the queue will arm the wait busy timer.
+ *
+ * Don't wait if this group has other active queues. This
+ * makes sure that we don't lose fairness at the group level,
+ * while in the root group we will not see cfq
+ * regressions.
+ */
+ if (elv-ioq-sync(ioq) && !ioq->nr-queued
+ && elv-ioq-nr-dispatched(ioq)
+ && (elv-iog-nr-active(ioq-to-io-group(ioq)) <= 1)
+ && !elv-ioq-wait-busy-done(ioq)) {
+ ioq = NULL;
+ goto keep-queue;
+ } else
+ goto expire;
+ }
/*
* If we have a RT cfqq waiting, then we pre-empt the current non-rt
@@ -2977,11 +3055,13 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
const int sync = rq-is-sync(rq);
struct io-queue *ioq;
struct elv-fq-data *efqd = &q->elevator->efqd;
+ struct io-group *iog;
if (!elv-iosched-fair-queuing-enabled(q->elevator))
return;
ioq = rq->ioq;
+ iog = ioq-to-io-group(ioq);
elv-log-ioq(efqd, ioq, "complete");
@@ -3007,6 +3087,12 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
elv-ioq-set-prio-slice(q, ioq);
elv-clear-ioq-slice-new(ioq);
}
+
+ if (elv-ioq-class-idle(ioq)) {
+ elv-ioq-slice-expired(q);
+ goto done;
+ }
+
/*
* If there are no requests waiting in this queue, and
* there are other queues ready to issue requests, AND
@@ -3014,13 +3100,24 @@ void elv-ioq-completed-request(struct request-queue *q, struct request *rq)
* mean seek distance, give them a chance to run instead
* of idling.
*/
- if (elv-ioq-slice-used(ioq) || elv-ioq-class-idle(ioq))
- elv-ioq-slice-expired(q);
- else if (!ioq->nr-queued && !elv-close-cooperator(q, ioq, 1)
- && sync && !rq-noidle(rq))
- elv-ioq-arm-slice-timer(q);
+ if (elv-ioq-slice-used(ioq)) {
+ if (sync && !ioq->nr-queued
+ && (elv-iog-nr-active(iog) <= 1)) {
+ /*
+ * Idle for one extra period in hierarchical
+ * setup
+ */
+ elv-ioq-arm-slice-timer(q, 1);
+ } else {
+ /* Expire the queue */
+ elv-ioq-slice-expired(q);
+ }
+ } else if (!ioq->nr-queued && !elv-close-cooperator(q, ioq, 1)
+ && sync && (!rq-noidle(rq) || efqd->fairness))
+ elv-ioq-arm-slice-timer(q, 0);
}
+done:
if (!efqd->rq-in-driver)
elv-schedule-dispatch(q);
}
@@ -3125,6 +3222,8 @@ int elv-init-fq-data(struct request-queue *q, struct elevator-queue *e)
efqd->elv-slice-idle = elv-slice-idle;
efqd->hw-tag = 1;
+ /* For the time being keep fairness enabled by default */
+ efqd->fairness = 1;
return 0;
}
diff struct io-entity *next-active;
+ int nr-active;
struct io-service-tree service-tree[IO-IOPRIO-CLASSES];
};
@@ -337,6 +338,13 @@ struct elv-fq-data {
unsigned long long rate-sampling-start; /* sampling window start, in jiffies */
/* number of sectors finished io during current sampling window */
unsigned long rate-sectors-current;
+
+ /*
+ * If set to 1, disables many of the optimizations done to boost
+ * throughput and focuses more on providing fairness for sync
+ * queues.
+ */
+ unsigned int fairness;
};
/* Logging facilities. */
@@ -358,6 +366,8 @@ enum elv-queue-state-flags {
ELV-QUEUE-FLAG-wait-request, /* waiting for a request */
ELV-QUEUE-FLAG-must-dispatch, /* must be allowed a dispatch */
ELV-QUEUE-FLAG-slice-new, /* no requests dispatched in slice */
+ ELV-QUEUE-FLAG-wait-busy, /* wait for this queue to get busy */
+ ELV-QUEUE-FLAG-wait-busy-done, /* Have already waited on this queue*/
};
#define ELV-IO-QUEUE-FLAG-FNS(name)
@@ -380,6 +390,8 @@ ELV-IO-QUEUE-FLAG-FNS(wait-request)
ELV-IO-QUEUE-FLAG-FNS(must-dispatch)
ELV-IO-QUEUE-FLAG-FNS(idle-window)
ELV-IO-QUEUE-FLAG-FNS(slice-new)
+ELV-IO-QUEUE-FLAG-FNS(wait-busy)
+ELV-IO-QUEUE-FLAG-FNS(wait-busy-done)
static inline struct io-service-tree *
io-entity-service-tree(struct io-entity *entity)
@@ -532,6 +544,9 @@ extern ssize-t elv-slice-sync-store(struct elevator-queue *q, const char *name,
extern ssize-t elv-slice-async-show(struct elevator-queue *q, char *name);
extern ssize-t elv-slice-async-store(struct elevator-queue *q, const char *name,
size-t count);
+extern ssize-t elv-fairness-show(struct elevator-queue *q, char *name);
+extern ssize-t elv-fairness-store(struct elevator-queue *q, const char *name,
+ size-t count);
/* Functions used by elevator.c */
extern int elv-init-fq-data(struct request-queue *q, struct elevator-queue *e);
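
The decision that the patch above makes in elv-fq-select-ioq() once a queue has used up its slice can be restated as a small standalone sketch. The struct and function names below are hypothetical; group_nr_active stands in for the value returned by elv-iog-nr-active() for the queue's group.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the queue state consulted at slice expiry. */
struct ioq_state {
    bool sync;           /* queue carries synchronous IO */
    int nr_queued;       /* requests still queued in the io queue */
    int nr_dispatched;   /* requests dispatched but not yet completed */
    int group_nr_active; /* active entities in the queue's group */
    bool wait_busy_done; /* the queue has already been waited on once */
};

enum slice_action { KEEP_QUEUE, EXPIRE_QUEUE };

/*
 * Keep the queue only while a completion can still arm the wait-busy
 * timer: sync queue, nothing queued, something in flight, no other
 * active entity in the group, and no earlier wait on this queue.
 */
static enum slice_action on_slice_used(const struct ioq_state *s)
{
    if (s->sync && s->nr_queued == 0 && s->nr_dispatched > 0 &&
        s->group_nr_active <= 1 && !s->wait_busy_done)
        return KEEP_QUEUE;
    return EXPIRE_QUEUE;
}
```

A second active entity in the group, or a completed earlier wait, makes the sketch expire the queue immediately.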
Re: PATCH 09/25 - io-controller: Common hierarchical fair queuing code in elevator layer by Gui Jianfeng on
2009-07-06T02:47:16+00:00
Vivek Goyal wrote:
...
> +static struct io-group *
> +io-group-chain-alloc(struct request-queue *q, void *key, struct cgroup *cgroup)
> +{
> + struct io-cgroup *iocg;
> + struct io-group *iog, *leaf = NULL, *prev = NULL;
> + gfp-t flags = GFP-ATOMIC | > + /*
> + * All the cgroups in the path from there to the
> + * root must have a io-group for efqd, so we don't
> + * need any more allocations.
> + */
> + break;
> + }
> +
> + iog = kzalloc-node(sizeof(*iog), flags, q->node);
> + if (!iog)
> + goto cleanup;
> +
> + iog->iocg-id = css-id(&iocg->css);
Hi Vivek,
IMHO, the io-cgroup id serves only to keep track of the corresponding iocg.
So why not just store the iocg pointer in io-group and get rid of this complexity?
I'd like to post a patch to make this change. What's your opinion?
Re: PATCH 11/25 - io-controller: Export disk time used and nr sectors dispatched through cgroups by Gui Jianfeng on
2009-07-08T02:17:54+00:00
Vivek Goyal wrote:
...
>
> +static int io-cgroup-disk-time-read(struct cgroup *cgroup,
> + struct cftype *cftype, struct seq-file *m)
> +{
> + struct io-cgroup *iocg;
> + struct io-group *iog;
> + struct hlist-node *n;
> +
> + if (!cgroup-lock-live-group(cgroup))
> + return -ENODEV;
> +
> + iocg = cgroup-to-io-cgroup(cgroup);
> +
> + rcu-read-lock();
> + hlist-for-each-entry-rcu(iog, n, &iocg->group-data, group-node) {
> + /*
> + * There might be groups which are not functional and
> + * waiting to be reclaimed upon cgroup deletion.
> + */
> + if (iog->key) {
> + seq-printf(m, "%u %u %lu\n", MAJOR(iog->dev),
> + MINOR(iog->dev),
> + iog->entity.total-service);
Hi Vivek,
This patch makes the output of the io.disk-* files conform to the format used
by io.policy, printing the device as major:minor.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
-diff if (iog->key) {
- seq-printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ seq-printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
MINOR(iog->dev),
iog->entity.total-service);
}
@@ -1661,7 +1661,7 @@ static int io-cgroup-disk-sectors-read(struct cgroup *cgroup,
* waiting to be reclaimed upon cgroup deletion.
*/
if (iog->key) {
- seq-printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ seq-printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
MINOR(iog->dev),
iog->entity.total-sector-service);
}
@@ -1692,7 +1692,7 @@ static int io-cgroup-disk-queue-read(struct cgroup *cgroup,
* waiting to be reclaimed upon cgroup deletion.
*/
if (iog->key) {
- seq-printf(m, "%u %u %lu %lu\n", MAJOR(iog->dev),
+ seq-printf(m, "%u:%u %lu %lu\n", MAJOR(iog->dev),
MINOR(iog->dev), iog->queue,
iog->queue-duration);
}
@@ -1722,7 +1722,7 @@ static int io-cgroup-disk-dequeue-read(struct cgroup *cgroup,
* waiting to be reclaimed upon cgroup deletion.
*/
if (iog->key) {
- seq-printf(m, "%u %u %lu\n", MAJOR(iog->dev),
+ seq-printf(m, "%u:%u %lu\n", MAJOR(iog->dev),
MINOR(iog->dev), iog->dequeue);
}
}
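
For illustration, the new "major:minor value" line format can be reproduced in userspace. The sketch below is an assumption-laden stand-in: the SKETCH-prefixed macros mirror the kernel's 20-bit minor split but are defined locally, and format_disk_stat() is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins that mirror the kernel's dev_t split (20 minor bits). */
#define SKETCH_MINORBITS 20
#define SKETCH_MAJOR(dev) ((unsigned int)((dev) >> SKETCH_MINORBITS))
#define SKETCH_MINOR(dev) ((unsigned int)((dev) & ((1U << SKETCH_MINORBITS) - 1)))
#define SKETCH_MKDEV(ma, mi) (((unsigned int)(ma) << SKETCH_MINORBITS) | (mi))

/* Formats one statistics line the way the patched seq-printf calls do. */
static int format_disk_stat(char *buf, size_t len, unsigned int dev,
                            unsigned long value)
{
    return snprintf(buf, len, "%u:%u %lu\n",
                    SKETCH_MAJOR(dev), SKETCH_MINOR(dev), value);
}
```

A parser that already handles the device field of io.policy can then reuse the same "%u:%u" scan for the io.disk-* files.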
Re: PATCH 21/25 - io-controller: Per cgroup request descriptor support by Gui Jianfeng on
2009-07-08T03:28:26+00:00
Vivek Goyal wrote:
...
> }
> +#ifdef CONFIG-GROUP-IOSCHED
> +static ssize-t queue-group-requests-show(struct request-queue *q, char *page)
> +{
> + return queue-var-show(q->nr-group-requests, (page));
> +}
> +
> +static ssize-t
> +queue-group-requests-store(struct request-queue *q, const char *page,
> + size-t count)
> +{
> + unsigned long nr;
> + int ret = queue-var-store(&nr, page, count);
> + if (nr < BLKDEV-MIN-RQ)
> + nr = BLKDEV-MIN-RQ;
> +
> + spin-lock-irq(q->queue-lock);
> + q->nr-group-requests = nr;
> + spin-unlock-irq(q->queue-lock);
> + return ret;
> +}
> +#endif
Hi Vivek,
Do we need to update the congestion thresholds for allocated io groups?
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
-diff
+extern void elv-io-group-congestion-threshold(struct request-queue *q,
+ struct io-group *iog);
+
static ssize-t
queue-group-requests-store(struct request-queue *q, const char *page,
size-t count)
{
+ struct hlist-node *n;
+ struct io-group *iog;
+ struct elv-fq-data *efqd;
unsigned long nr;
int ret = queue-var-store(&nr, page, count);
+
if (nr < BLKDEV-MIN-RQ)
nr = BLKDEV-MIN-RQ;
spin-lock-irq(q->queue-lock);
+
q->nr-group-requests = nr;
+
+ efqd = &q->elevator->efqd;
+
+ hlist-for-each-entry(iog, n, &efqd->group-list, elv-data-node) {
+ elv-io-group-congestion-threshold(q, iog);
+ }
+
spin-unlock-irq(q->queue-lock);
return ret;
}
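
The helper called in the loop above derives two thresholds from the per-group limit. As a rough sketch of that arithmetic, following the form elv-io-group-congestion-threshold() takes in this thread: the names below are hypothetical, and signed long arithmetic is used here to sidestep the mixed signed/unsigned comparison in the original.

```c
#include <assert.h>

struct thresholds {
    int on;  /* count at which the group is marked congested */
    int off; /* count at which the congested mark is cleared */
};

/* Computes on/off thresholds from a per-group request limit. */
static struct thresholds group_congestion_thresholds(unsigned long limit)
{
    struct thresholds t;
    long nr;

    /* "On" sits one above 7/8 of the limit, capped at the limit itself. */
    nr = (long)limit - (long)(limit / 8) + 1;
    if (nr > (long)limit)
        nr = (long)limit;
    t.on = (int)nr;

    /* "Off" sits a further 1/16 lower, minus one, but never below 1. */
    nr = (long)limit - (long)(limit / 8) - (long)(limit / 16) - 1;
    if (nr < 1)
        nr = 1;
    t.off = (int)nr;

    return t;
}
```

For a limit of 128 this gives on = 113 and off = 103, so a group must drain by ten requests before it stops being reported as congested.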
Re: RFC - IO scheduler based IO controller V6 by Balbir Singh on
2009-07-08T03:56:43+00:00
* Vivek Goyal <vgoyal@redhat.com> [2009-07-02 16:01:32]:
>
> Hi All,
>
> Here is the V6 of the IO controller patches generated on top of 2.6.31-rc1.
>
> Previous versions of the patches was posted here.
>
> (V1) http://lkml.org/lkml/2009/3/11/486
> (V2) http://lkml.org/lkml/2009/5/5/275
> (V3) http://lkml.org/lkml/2009/5/26/472
> (V4) http://lkml.org/lkml/2009/6/8/580
> (V5) http://lkml.org/lkml/2009/6/19/279
>
> This patchset is still work in progress but I want to keep on getting the
> snapshot of my tree out at regular intervals to get the feedback hence V6.
>
Hi, Vivek,
I was able to compile and boot a 2.6.31-rc1 kernel with this patchset.
I have a request: could you fold all the patches into one
consolidated patch and make it available somewhere (it makes testing
easier), maybe in a git tree?
I did some quick tests with some io benchmarks and found that, in a simple
scenario, the scheduler worked as expected, except that it took
very long. I'll investigate further and report back.
PATCH - io-controller: implement per group request allocation limitation by Gui Jianfeng on
2009-07-10T01:57:27+00:00
Hi Vivek,
This patch exports a cgroup-based interface for per-group request limits
and removes the global one. We can now use this interface to set
different request allocation limits for different groups.
Signed-off-by: Gui Jianfeng <guijianfeng@cn.fujitsu.com>
-diff {
+ struct io-group *iog;
+ unsigned long nr-group-requests;
+
if (q->rq-data.count[sync] < queue-congestion-off-threshold(q))
blk-clear-queue-congested(q, sync);
if (q->rq-data.count[sync] + 1 <= q->nr-requests)
blk-clear-queue-full(q, sync);
- if (rl->count[sync] + 1 <= q->nr-group-requests) {
+ iog = rl-iog(rl);
+
+ nr-group-requests = get-group-requests(q, iog);
+
+ if (nr-group-requests && rl->count[sync] + 1 <= nr-group-requests) {
if (waitqueue-active(&rl->wait[sync]))
wake-up(&rl->wait[sync]);
}
@@ -828,6 +835,8 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
const bool is-sync = rw-is-sync(rw-flags) != 0;
int may-queue, priv;
int sleep-on-global = 0;
+ struct io-group *iog;
+ unsigned long nr-group-requests;
may-queue = elv-may-queue(q, rw-flags);
if (may-queue == ELV-MQUEUE-NO)
@@ -843,7 +852,12 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
if (q->rq-data.count[is-sync]+1 >= q->nr-requests)
blk-set-queue-full(q, is-sync);
- if (rl->count[is-sync]+1 >= q->nr-group-requests) {
+ iog = rl-iog(rl);
+
+ nr-group-requests = get-group-requests(q, iog);
+
+ if (nr-group-requests &&
+ rl->count[is-sync]+1 >= nr-group-requests) {
ioc = current-io-context(GFP-ATOMIC, q->node);
/*
* The queue request descriptor group will fill after this
@@ -852,7 +866,7 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
* This process will be allowed to complete a batch of
* requests, others will be blocked.
*/
- if (rl->count[is-sync] <= q->nr-group-requests)
+ if (rl->count[is-sync] <= nr-group-requests)
ioc-set-batching(q, ioc);
else {
if (may-queue != ELV-MQUEUE-MUST
@@ -898,7 +912,8 @@ static struct request *get-request(struct request-queue *q, int rw-flags,
* from per group request list
*/
- if (rl->count[is-sync] >= (3 * q->nr-group-requests / 2))
+ if (nr-group-requests &&
+ rl->count[is-sync] >= (3 * nr-group-requests / 2))
goto out;
rl->starved[is-sync] = 0;
diff q->nr-requests = BLKDEV-MAX-RQ;
- q->nr-group-requests = BLKDEV-MAX-GROUP-RQ;
q->make-request-fn = mfn;
blk-queue-dma-alignment(q, 511);
diff #ifdef CONFIG-GROUP-IOSCHED
-static ssize-t queue-group-requests-show(struct request-queue *q, char *page)
-{
- return queue-var-show(q->nr-group-requests, (page));
-}
-
extern void elv-io-group-congestion-threshold(struct request-queue *q,
struct io-group *iog);
-
-static ssize-t
-queue-group-requests-store(struct request-queue *q, const char *page,
- size-t count)
-{
- struct hlist-node *n;
- struct io-group *iog;
- struct elv-fq-data *efqd;
- unsigned long nr;
- int ret = queue-var-store(&nr, page, count);
-
- if (nr < BLKDEV-MIN-RQ)
- nr = BLKDEV-MIN-RQ;
-
- spin-lock-irq(q->queue-lock);
-
- q->nr-group-requests = nr;
-
- efqd = &q->elevator->efqd;
-
- hlist-for-each-entry(iog, n, &efqd->group-list, elv-data-node) {
- elv-io-group-congestion-threshold(q, iog);
- }
-
- spin-unlock-irq(q->queue-lock);
- return ret;
-}
#endif
static ssize-t queue-ra-show(struct request-queue *q, char *page)
@@ -278,14 +246,6 @@ static struct queue-sysfs-entry queue-requests-entry = {
.store = queue-requests-store,
};
-#ifdef CONFIG-GROUP-IOSCHED
-static struct queue-sysfs-entry queue-group-requests-entry = {
- .attr = {.name = "nr-group-requests", .mode = S-IRUGO | S-IWUSR },
- .show = queue-group-requests-show,
- .store = queue-group-requests-store,
-};
-#endif
-
static struct queue-sysfs-entry queue-ra-entry = {
.attr = {.name = "read-ahead-kb", .mode = S-IRUGO | S-IWUSR },
.show = queue-ra-show,
@@ -360,9 +320,6 @@ static struct queue-sysfs-entry queue-iostats-entry = {
static struct attribute *default-attrs[] = {
&queue-requests-entry.attr,
-#ifdef CONFIG-GROUP-IOSCHED
- &queue-group-requests-entry.attr,
-#endif
&queue-ra-entry.attr,
&queue-max-hw-sectors-entry.attr,
&queue-max-sectors-entry.attr,
diff
+unsigned short get-group-requests(struct request-queue *q,
+ struct io-group *iog)
+{
+ struct cgroup-subsys-state *css;
+ struct io-cgroup *iocg;
+ unsigned long nr-group-requests;
+
+ if (!iog)
+ return q->nr-requests;
+
+ rcu-read-lock();
+
+ if (!iog->iocg-id) {
+ nr-group-requests = 0;
+ goto out;
+ }
+
+ css = css-lookup(&io-subsys, iog->iocg-id);
+ if (!css) {
+ nr-group-requests = 0;
+ goto out;
+ }
+
+ iocg = container-of(css, struct io-cgroup, css);
+ nr-group-requests = iocg->nr-group-requests;
+out:
+ rcu-read-unlock();
+ return nr-group-requests;
+}
static struct io-entity *bfq-lookup-next-entity(struct io-sched-data *sd,
int extract);
@@ -1257,14 +1286,17 @@ void elv-io-group-congestion-threshold(struct request-queue *q,
struct io-group *iog)
{
int nr;
+ unsigned long nr-group-requests;
- nr = q->nr-group-requests - (q->nr-group-requests / 8) + 1;
- if (nr > q->nr-group-requests)
- nr = q->nr-group-requests;
+ nr-group-requests = get-group-requests(q, iog);
+
+ nr = nr-group-requests - (nr-group-requests / 8) + 1;
+ if (nr > nr-group-requests)
+ nr = nr-group-requests;
iog->nr-congestion-on = nr;
- nr = q->nr-group-requests - (q->nr-group-requests / 8)
- - (q->nr-group-requests / 16) - 1;
+ nr = nr-group-requests - (nr-group-requests / 8)
+ - (nr-group-requests / 16) - 1;
if (nr < 1)
nr = 1;
iog->nr-congestion-off = nr;
@@ -1283,6 +1315,7 @@ int elv-io-group-congested(struct request-queue *q, struct page *page, int sync)
{
struct io-group *iog;
int ret = 0;
+ unsigned long nr-group-requests;
rcu-read-lock();
@@ -1300,10 +1333,11 @@ int elv-io-group-congested(struct request-queue *q, struct page *page, int sync)
}
ret = elv-is-iog-congested(q, iog, sync);
+ nr-group-requests = get-group-requests(q, iog);
if (ret)
elv-log-iog(&q->elevator->efqd, iog, "iog congested=%d sync=%d"
" rl.count[sync]=%d nr-group-requests=%d",
- ret, sync, iog->rl.count[sync], q->nr-group-requests);
+ ret, sync, iog->rl.count[sync], nr-group-requests);
rcu-read-unlock();
return ret;
}
@@ -1549,6 +1583,48 @@ free-buf:
return ret;
}
+static u64 io-cgroup-nr-requests-read(struct cgroup *cgroup,
+ struct cftype *cftype)
+{
+ struct io-cgroup *iocg;
+ u64 ret;
+
+ if (!cgroup-lock-live-group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup-to-io-cgroup(cgroup);
+ spin-lock-irq(&iocg->lock);
+ ret = iocg->nr-group-requests;
+ spin-unlock-irq(&iocg->lock);
+
+ cgroup-unlock();
+
+ return ret;
+}
+
+static int io-cgroup-nr-requests-write(struct cgroup *cgroup,
+ struct cftype *cftype,
+ u64 val)
+{
+ struct io-cgroup *iocg;
+
+ if (val < BLKDEV-MIN-RQ)
+ val = BLKDEV-MIN-RQ;
+
+ if (!cgroup-lock-live-group(cgroup))
+ return -ENODEV;
+
+ iocg = cgroup-to-io-cgroup(cgroup);
+
+ spin-lock-irq(&iocg->lock);
+ iocg->nr-group-requests = (unsigned long)val;
+ spin-unlock-irq(&iocg->lock);
+
+ cgroup-unlock();
+
+ return 0;
+}
+
#define SHOW-FUNCTION(+ .name = "nr-group-requests",
+ .read-u64 = io-cgroup-nr-requests-read,
+ .write-u64 = io-cgroup-nr-requests-write,
+ },
+ {
.name = "policy",
.read-seq-string = io-cgroup-policy-read,
.write-string = io-cgroup-policy-write,
@@ -1790,6 +1871,7 @@ static struct cgroup-subsys-state *iocg-create(struct cgroup-subsys *subsys,
spin-lock-init(&iocg->lock);
INIT-HLIST-HEAD(&iocg->group-data);
+ iocg->nr-group-requests = BLKDEV-MAX-GROUP-RQ;
iocg->weight = IO-DEFAULT-GRP-WEIGHT;
iocg->ioprio-class = IO-DEFAULT-GRP-CLASS;
INIT-LIST-HEAD(&iocg->policy-list);
diff
+ unsigned long nr-group-requests;
/* list of io-policy-node */
struct list-head policy-list;
@@ -386,6 +387,9 @@ struct elv-fq-data {
unsigned int fairness;
};
+extern unsigned short get-group-requests(struct request-queue *q,
+ struct io-group *iog);
+
/* Logging facilities. */
#ifdef CONFIG-DEBUG-GROUP-IOSCHED
#define elv-log-ioq(efqd, ioq, fmt, args...)
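
The way get-request() uses the per-group limit after this patch can be approximated by a small decision function. This is a simplified sketch under stated assumptions: enum admit and group_admit() are hypothetical, the ELV-MQUEUE-MUST override is ignored, and "batcher" stands for a process that was already granted batching.

```c
#include <assert.h>
#include <stdbool.h>

enum admit { ADMIT, ADMIT_BATCHING, DENY };

/*
 * count: requests currently allocated to the group for this direction.
 * limit: per-group limit; zero skips the per-group checks, as the
 *        "nr-group-requests &&" guards in the patch do.
 */
static enum admit group_admit(unsigned long count, unsigned long limit,
                              bool batcher)
{
    if (limit == 0)
        return ADMIT;
    /* Past the limit, only a process that is already batching continues. */
    if (count + 1 >= limit && count > limit && !batcher)
        return DENY;
    /* Hard cap at one and a half times the limit, batcher or not. */
    if (count >= 3 * limit / 2)
        return DENY;
    /* The allocation that fills the group makes the caller a batcher. */
    if (count + 1 >= limit && count <= limit)
        return ADMIT_BATCHING;
    return ADMIT;
}
```

With a limit of 8, the eighth allocation turns the caller into a batcher, non-batchers are refused once the count exceeds 8, and everyone is refused at 12.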
Re: PATCH 21/25 - io-controller: Per cgroup request descriptor support by Gui Jianfeng on
2009-07-21T05:39:05+00:00
Vivek Goyal wrote:
> o Currently a request queue has got fixed number of request descriptors for
> sync and async requests. Once the request descriptors are consumed, new
> processes are put to sleep and they effectively become serialized. Because
> sync and async queues are separate, async requests don't impact sync ones
> but if one is looking for fairness between async requests, that is not
> achievable if request queue descriptors become bottleneck.
>
> o Make request descriptor's per io group so that if there is lots of IO
> going on in one cgroup, it does not impact the IO of other group.
>
> o This is just one relatively simple way of doing things. This patch will
> probably change after the feedback. Folks have raised concerns that in
> hierarchical setup, child's request descriptors should be capped by parent's
> request descriptors. Maybe we need to have per cgroup per device files
> in cgroups where one can specify the upper limit of request descriptors
> and whenever a cgroup is created one needs to assign request descriptor
> limit making sure total sum of child's request descriptor is not more than
> of parent.
>
> I guess something like memory controller. Anyway, that would be the next
> step. For the time being, we have implemented something simpler as follows.
>
> o This patch implements the per cgroup request descriptors. request pool per
> queue is still common but every group will have its own wait list and its
> own count of request descriptors allocated to that group for sync and async
> queues. So effectively request-list becomes per io group property and not a
> global request queue feature.
>
> o Currently one can define q->nr-requests to limit request descriptors
> allocated for the queue. Now there is another tunable q->nr-group-requests
> which controls the request descriptor limit per group. q->nr-requests
> supersedes q->nr-group-requests to make sure if there are lots of groups
> present, we don't end up allocating too many request descriptors on the
> queue.
>
Hi Vivek,
To prevent q->nr-requests from becoming the bottleneck when allocating
requests, could we update nr-requests accordingly whenever a cgroup is
created or removed?
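
One possible reading of this suggestion is to keep the queue-wide limit equal to the sum of the per-group limits and recompute it on every cgroup creation or removal. The sketch below only illustrates that reading: queue_limit_from_groups() and its floor parameter are hypothetical and do not come from any posted patch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Recomputes a queue-wide request limit as the sum of the per-group
 * limits, never dropping below a minimum (for example BLKDEV-MIN-RQ).
 */
static unsigned long queue_limit_from_groups(const unsigned long *group_limits,
                                             size_t n, unsigned long floor)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += group_limits[i];

    return sum < floor ? floor : sum;
}
```

Calling this whenever a group is added or removed would keep the global limit from capping groups that are each still within their own limit.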
Re: PATCH 21/25 - io-controller: Per cgroup request descriptor support by Nauman Rafique on
2009-07-21T05:55:51+00:00
On Mon, Jul 20, 2009 at 10:37 PM, Gui Jianfeng <guijianfeng@cn.fujitsu.com> wrote:
> Vivek Goyal wrote:
>> o Currently a request queue has got fixed number of request descriptors for
>> sync and async requests. Once the request descriptors are consumed, new
>> processes are put to sleep and they effectively become serialized. Because
>> sync and async queues are separate, async requests don't impact sync ones
>> but if one is looking for fairness between async requests, that is not
>> achievable if request queue descriptors become bottleneck.
>>
>> o Make request descriptor's per io group so that if there is lots of IO
>> going on in one cgroup, it does not impact the IO of other group.
>>
>> o This is just one relatively simple way of doing things. This patch will
>> probably change after the feedback. Folks have raised concerns that in
>> hierarchical setup, child's request descriptors should be capped by parent's
>> request descriptors. Maybe we need to have per cgroup per device files
>> in cgroups where one can specify the upper limit of request descriptors
>> and whenever a cgroup is created one needs to assign request descriptor
>> limit making sure total sum of child's request descriptor is not more than
>> of parent.
>>
>> I guess something like memory controller. Anyway, that would be the next
>> step. For the time being, we have implemented something simpler as follows.
>>
>> o This patch implements the per cgroup request descriptors. request pool per
>> queue is still common but every group will have its own wait list and its
>> own count of request descriptors allocated to that group for sync and async
>> queues. So effectively request-list becomes per io group property and not a
>> global request queue feature.
>>
>> o Currently one can define q->nr-requests to limit request descriptors
>> allocated for the queue. Now there is another tunable q->nr-group-requests
>> which controls the request descriptor limit per group. q->nr-requests
>> supersedes q->nr-group-requests to make sure if there are lots of groups
>> present, we don't end up allocating too many request descriptors on the
>> queue.
>>
>
> Hi Vivek,
>
> To prevent q->nr-requests from becoming the bottleneck when allocating
> requests, could we update nr-requests accordingly whenever a cgroup is
> created or removed?
Vivek,
I agree with Gui here. In fact, it does not make much sense to keep
the nr-requests limit if we already have per cgroup limit in place.
This change also simplifies code quite a bit, as we can get rid of all
that sleep-on-global logic.
Re: RFC - IO scheduler based IO controller V6 by Gui Jianfeng on
2009-07-27T02:12:20+00:00