After synchrounous lumpy reclaim, the page_list is guaranteed to not have
active pages as page activation in shrink_page_list() disables lumpy
reclaim. Remove the dead code.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_page_list() can decide to give up reclaiming a page under a
number of conditions such as
1. trylock_page() failure
2. page is unevictable
3. zone reclaim and page is mapped
4. PageWriteback() is true
5. page is swapbacked and swap is full
6. add_to_swap() failure
7. page is dirty and gfpmask don't have GFP_IO, GFP_FS
8. page is pinned
9. IO queue is congested
10. pageout() start IO, but not finished
With lumpy reclaim, failures result in entering synchronous lumpy reclaim
but this can be unnecessary. In cases (2), (3), (5), (6), (7) and (8),
there is no point retrying. This patch causes lumpy reclaim to abort when
it is known it will fail.
Case (9) is more interesting. current behavior is,
1. start shrink_page_list(async)
2. found queue_congested()
3. skip pageout write
4. still start shrink_page_list(sync)
5. wait on a lot of pages
6. again, found queue_congested()
7. give up pageout write again
So, it's useless time wasting. However, just skipping page reclaim is
also notgood as x86 allocating a huge page needs 512 pages for example.
It can have more dirty pages than queue congestion threshold (~=128).
After this patch, pageout() behaves as follows;
- If order > PAGE_ALLOC_COSTLY_ORDER
Ignore queue congestion always.
- If order <= PAGE_ALLOC_COSTLY_ORDER
skip write page and disable lumpy reclaim.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
congestion_wait() means "wait until queue congestion is cleared".
However, synchronous lumpy reclaim does not need this congestion_wait() as
shrink_page_list(PAGEOUT_IO_SYNC) uses wait_on_page_writeback() and it
provides the necessary waiting.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
There have been numerous reports of stalls that pointed at the problem
being somewhere in the VM. There are multiple roots to the problems which
means dealing with any of the root problems in isolation is tricky to
justify on their own and they would still need integration testing. This
patch series puts together two different patch sets which in combination
should tackle some of the root causes of latency problems being reported.
Patch 1 adds a tracepoint for shrink_inactive_list. For this series, the
most important results is being able to calculate the scanning/reclaim
ratio as a measure of the amount of work being done by page reclaim.
Patch 2 accounts for time spent in congestion_wait.
Patches 3-6 were originally developed by Kosaki Motohiro but reworked for
this series. It has been noted that lumpy reclaim is far too aggressive
and trashes the system somewhat. As SLUB uses high-order allocations, a
large cost incurred by lumpy reclaim will be noticeable. It was also
reported during transparent hugepage support testing that lumpy reclaim
was trashing the system and these patches should mitigate that problem
without disabling lumpy reclaim.
Patch 7 adds wait_iff_congested() and replaces some callers of
congestion_wait(). wait_iff_congested() only sleeps if there is a BDI
that is currently congested. Patch 8 notes that any BDI being congested
is not necessarily a problem because there could be multiple BDIs of
varying speeds and numberous zones. It attempts to track when a zone
being reclaimed contains many pages backed by a congested BDI and if so,
reclaimers wait on the congestion queue.
I ran a number of tests with monitoring on X86, X86-64 and PPC64. Each
machine had 3G of RAM and the CPUs were
X86: Intel P4 2-core
X86-64: AMD Phenom 4-core
PPC64: PPC970MP
Each used a single disk and the onboard IO controller. Dirty ratio was
left at 20. I'm just going to report for X86-64 and PPC64 in a vague
attempt to keep this report short. Four kernels were tested each based on
v2.6.36-rc4
traceonly-v2r2: Patches 1 and 2 to instrument vmscan reclaims and congestion_wait
lowlumpy-v2r3: Patches 1-6 to test if lumpy reclaim is better
waitcongest-v2r3: Patches 1-7 to only wait on congestion
waitwriteback-v2r4: Patches 1-8 to detect when a zone is congested
nocongest-v1r5: Patches 1-3 for testing wait_iff_congestion
nodirect-v1r5: Patches 1-10 to disable filesystem writeback for better IO
The tests run were as follows
kernbench
compile-based benchmark. Smoke test performance
sysbench
OLTP read-only benchmark. Will be re-run in the future as read-write
micro-mapped-file-stream
This is a micro-benchmark from Johannes Weiner that accesses a
large sparse-file through mmap(). It was configured to run in only
single-CPU mode but can be indicative of how well page reclaim
identifies suitable pages.
stress-highalloc
Tries to allocate huge pages under heavy load.
kernbench, iozone and sysbench did not report any performance regression
on any machine. sysbench did pressure the system lightly and there was
reclaim activity but there were no difference of major interest between
the kernels.
X86-64 micro-mapped-file-stream
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
pgalloc_dma 1639.00 ( 0.00%) 667.00 (-145.73%) 1167.00 ( -40.45%) 578.00 (-183.56%)
pgalloc_dma32 2842410.00 ( 0.00%) 2842626.00 ( 0.01%) 2843043.00 ( 0.02%) 2843014.00 ( 0.02%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 729.00 ( 0.00%) 85.00 (-757.65%) 609.00 ( -19.70%) 125.00 (-483.20%)
pgsteal_dma32 2338721.00 ( 0.00%) 2447354.00 ( 4.44%) 2429536.00 ( 3.74%) 2436772.00 ( 4.02%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 1469.00 ( 0.00%) 532.00 (-176.13%) 1078.00 ( -36.27%) 220.00 (-567.73%)
pgscan_kswapd_dma32 4597713.00 ( 0.00%) 4503597.00 ( -2.09%) 4295673.00 ( -7.03%) 3891686.00 ( -18.14%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 71.00 ( 0.00%) 134.00 ( 47.01%) 243.00 ( 70.78%) 352.00 ( 79.83%)
pgscan_direct_dma32 305820.00 ( 0.00%) 280204.00 ( -9.14%) 600518.00 ( 49.07%) 957485.00 ( 68.06%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 16296.00 ( 0.00%) 21254.00 ( 23.33%) 18447.00 ( 11.66%) 20067.00 ( 18.79%)
allocstall 443.00 ( 0.00%) 273.00 ( -62.27%) 513.00 ( 13.65%) 1568.00 ( 71.75%)
These are based on the raw figures taken from /proc/vmstat. It's a rough
measure of reclaim activity. Note that allocstall counts are higher
because we are entering direct reclaim more often as a result of not
sleeping in congestion. In itself, it's not necessarily a bad thing.
It's easier to get a view of what happened from the vmscan tracepoint
report.
FTrace Reclaim Statistics: vmscan
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3 waitwriteback-v2r4
Direct reclaims 443 273 513 1568
Direct reclaim pages scanned 305968 280402 600825 957933
Direct reclaim pages reclaimed 43503 19005 30327 117191
Direct reclaim write file async I/O 0 0 0 0
Direct reclaim write anon async I/O 0 3 4 12
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 187649 132338 191695 267701
Kswapd wakeups 3 1 4 1
Kswapd pages scanned 4599269 4454162 4296815 3891906
Kswapd pages reclaimed 2295947 2428434 2399818 2319706
Kswapd reclaim write file async I/O 1 0 1 1
Kswapd reclaim write anon async I/O 59 187 41 222
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4.34 2.52 6.63 2.96
Time kswapd awake (seconds) 11.15 10.25 11.01 10.19
Total pages scanned 4905237 4734564 4897640 4849839
Total pages reclaimed 2339450 2447439 2430145 2436897
%age total pages scanned/reclaimed 47.69% 51.69% 49.62% 50.25%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 29.23% 19.02% 38.48% 20.25%
Percentage Time kswapd Awake 78.58% 78.85% 76.83% 79.86%
What is interesting here for nocongest in particular is that while direct
reclaim scans more pages, the overall number of pages scanned remains the
same and the ratio of pages scanned to pages reclaimed is more or less the
same. In other words, while we are sleeping less, reclaim is not doing
more work and as direct reclaim and kswapd is awake for less time, it
would appear to be doing less work.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 87 196 64 0
Direct time congest waited 4604ms 4732ms 5420ms 0ms
Direct full congest waited 72 145 53 0
Direct number conditional waited 0 0 324 1315
Direct time conditional waited 0ms 0ms 0ms 0ms
Direct full conditional waited 0 0 0 0
KSwapd number congest waited 20 10 15 7
KSwapd time congest waited 1264ms 536ms 884ms 284ms
KSwapd full congest waited 10 4 6 2
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 8 seconds asleep in direct reclaim and no time at
all asleep with the patches.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 10.51 10.73 10.6 11.66
Total Elapsed Time (seconds) 14.19 13.00 14.33 12.76
Overall, the tests completed faster. It is interesting to note that backing off further
when a zone is congested and not just a BDI was more efficient overall.
PPC64 micro-mapped-file-stream
pgalloc_dma 3024660.00 ( 0.00%) 3027185.00 ( 0.08%) 3025845.00 ( 0.04%) 3026281.00 ( 0.05%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 2508073.00 ( 0.00%) 2565351.00 ( 2.23%) 2463577.00 ( -1.81%) 2532263.00 ( 0.96%)
pgsteal_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_kswapd_dma 4601307.00 ( 0.00%) 4128076.00 ( -11.46%) 3912317.00 ( -17.61%) 3377165.00 ( -36.25%)
pgscan_kswapd_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgscan_direct_dma 629825.00 ( 0.00%) 971622.00 ( 35.18%) 1063938.00 ( 40.80%) 1711935.00 ( 63.21%)
pgscan_direct_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pageoutrun 27776.00 ( 0.00%) 20458.00 ( -35.77%) 18763.00 ( -48.04%) 18157.00 ( -52.98%)
allocstall 977.00 ( 0.00%) 2751.00 ( 64.49%) 2098.00 ( 53.43%) 5136.00 ( 80.98%)
Similar trends to x86-64. allocstalls are up but it's not necessarily bad.
FTrace Reclaim Statistics: vmscan
Direct reclaims 977 2709 2098 5136
Direct reclaim pages scanned 629825 963814 1063938 1711935
Direct reclaim pages reclaimed 75550 242538 150904 387647
Direct reclaim write file async I/O 0 0 0 2
Direct reclaim write anon async I/O 0 10 0 4
Direct reclaim write file sync I/O 0 0 0 0
Direct reclaim write anon sync I/O 0 0 0 0
Wake kswapd requests 392119 1201712 571935 571921
Kswapd wakeups 3 2 3 3
Kswapd pages scanned 4601307 4128076 3912317 3377165
Kswapd pages reclaimed 2432523 2318797 2312673 2144616
Kswapd reclaim write file async I/O 20 1 1 1
Kswapd reclaim write anon async I/O 57 132 11 121
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 6.19 7.30 13.04 10.88
Time kswapd awake (seconds) 21.73 26.51 25.55 23.90
Total pages scanned 5231132 5091890 4976255 5089100
Total pages reclaimed 2508073 2561335 2463577 2532263
%age total pages scanned/reclaimed 47.95% 50.30% 49.51% 49.76%
%age total pages scanned/written 0.00% 0.00% 0.00% 0.00%
%age file pages scanned/written 0.00% 0.00% 0.00% 0.00%
Percentage Time Spent Direct Reclaim 18.89% 20.65% 32.65% 27.65%
Percentage Time kswapd Awake 72.39% 80.68% 78.21% 77.40%
Again, a similar trend that the congestion_wait changes mean that direct
reclaim scans more pages but the overall number of pages scanned while
slightly reduced, are very similar. The ratio of scanning/reclaimed
remains roughly similar. The downside is that kswapd and direct reclaim
was awake longer and for a larger percentage of the overall workload.
It's possible there were big differences in the amount of time spent
reclaiming slab pages between the different kernels which is plausible
considering that the micro tests runs after fsmark and sysbench.
Trace Reclaim Statistics: congestion_wait
Direct number congest waited 845 1312 104 0
Direct time congest waited 19416ms 26560ms 7544ms 0ms
Direct full congest waited 745 1105 72 0
Direct number conditional waited 0 0 1322 2935
Direct time conditional waited 0ms 0ms 12ms 312ms
Direct full conditional waited 0 0 0 3
KSwapd number congest waited 39 102 75 63
KSwapd time congest waited 2484ms 6760ms 5756ms 3716ms
KSwapd full congest waited 20 48 46 25
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
The vanilla kernel spent 20 seconds asleep in direct reclaim and only
312ms asleep with the patches. The time kswapd spent congest waited was
also reduced by a large factor.
MMTests Statistics: duration
ser/Sys Time Running Test (seconds) 26.58 28.05 26.9 28.47
Total Elapsed Time (seconds) 30.02 32.86 32.67 30.88
With all patches applies, the completion times are very similar.
X86-64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 82.00 ( 0.00%) 84.00 ( 2.00%) 85.00 ( 3.00%) 85.00 ( 3.00%)
Pass 2 90.00 ( 0.00%) 87.00 (-3.00%) 88.00 (-2.00%) 89.00 (-1.00%)
At Rest 92.00 ( 0.00%) 90.00 (-2.00%) 90.00 (-2.00%) 91.00 (-1.00%)
Success figures across the board are broadly similar.
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 1045 944 886 887
Direct reclaim pages scanned 135091 119604 109382 101019
Direct reclaim pages reclaimed 88599 47535 47863 46671
Direct reclaim write file async I/O 494 283 465 280
Direct reclaim write anon async I/O 29357 13710 16656 13462
Direct reclaim write file sync I/O 154 2 2 3
Direct reclaim write anon sync I/O 14594 571 509 561
Wake kswapd requests 7491 933 872 892
Kswapd wakeups 814 778 731 780
Kswapd pages scanned 7290822 15341158 11916436 13703442
Kswapd pages reclaimed 3587336 3142496 3094392 3187151
Kswapd reclaim write file async I/O 91975 32317 28022 29628
Kswapd reclaim write anon async I/O 1992022 789307 829745 849769
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 4588.93 2467.16 2495.41 2547.07
Time kswapd awake (seconds) 2497.66 1020.16 1098.06 1176.82
Total pages scanned 7425913 15460762 12025818 13804461
Total pages reclaimed 3675935 3190031 3142255 3233822
%age total pages scanned/reclaimed 49.50% 20.63% 26.13% 23.43%
%age total pages scanned/written 28.66% 5.41% 7.28% 6.47%
%age file pages scanned/written 1.25% 0.21% 0.24% 0.22%
Percentage Time Spent Direct Reclaim 57.33% 42.15% 42.41% 42.99%
Percentage Time kswapd Awake 43.56% 27.87% 29.76% 31.25%
Scanned/reclaimed ratios again look good with big improvements in
efficiency. The Scanned/written ratios also look much improved. With a
better scanned/written ration, there is an expectation that IO would be
more efficient and indeed, the time spent in direct reclaim is much
reduced by the full series and kswapd spends a little less time awake.
Overall, indications here are that allocations were happening much faster
and this can be seen with a graph of the latency figures as the
allocations were taking place
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-hydra-mean.ps
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 1333 204 169 4
Direct time congest waited 78896ms 8288ms 7260ms 200ms
Direct full congest waited 756 92 69 2
Direct number conditional waited 0 0 26 186
Direct time conditional waited 0ms 0ms 0ms 2504ms
Direct full conditional waited 0 0 0 25
KSwapd number congest waited 4 395 227 282
KSwapd time congest waited 384ms 25136ms 10508ms 18380ms
KSwapd full congest waited 3 232 98 176
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
KSwapd full conditional waited 318 0 312 9
Overall, the time spent speeping is reduced. kswapd is still hitting
congestion_wait() but that is because there are callers remaining where it
wasn't clear in advance if they should be changed to wait_iff_congested()
or not. Overall the sleep imes are reduced though - from 79ish seconds to
about 19.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 3415.43 3386.65 3388.39 3377.5
Total Elapsed Time (seconds) 5733.48 3660.33 3689.41 3765.39
With the full series, the time to complete the tests are reduced by 30%
PPC64 STRESS-HIGHALLOC
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Pass 1 17.00 ( 0.00%) 34.00 (17.00%) 38.00 (21.00%) 43.00 (26.00%)
Pass 2 25.00 ( 0.00%) 37.00 (12.00%) 42.00 (17.00%) 46.00 (21.00%)
At Rest 49.00 ( 0.00%) 43.00 (-6.00%) 45.00 (-4.00%) 51.00 ( 2.00%)
Success rates there are *way* up particularly considering that the 16MB
huge pages on PPC64 mean that it's always much harder to allocate them.
FTrace Reclaim Statistics: vmscan
stress-highalloc stress-highalloc stress-highalloc stress-highalloc
traceonly-v2r2 lowlumpy-v2r3 waitcongest-v2r3waitwriteback-v2r4
Direct reclaims 499 505 564 509
Direct reclaim pages scanned 223478 41898 51818 45605
Direct reclaim pages reclaimed 137730 21148 27161 23455
Direct reclaim write file async I/O 399 136 162 136
Direct reclaim write anon async I/O 46977 2865 4686 3998
Direct reclaim write file sync I/O 29 0 1 3
Direct reclaim write anon sync I/O 31023 159 237 239
Wake kswapd requests 420 351 360 326
Kswapd wakeups 185 294 249 277
Kswapd pages scanned 15703488 16392500 17821724 17598737
Kswapd pages reclaimed 5808466 2908858 3139386 3145435
Kswapd reclaim write file async I/O 159938 18400 18717 13473
Kswapd reclaim write anon async I/O 3467554 228957 322799 234278
Kswapd reclaim write file sync I/O 0 0 0 0
Kswapd reclaim write anon sync I/O 0 0 0 0
Time stalled direct reclaim (seconds) 9665.35 1707.81 2374.32 1871.23
Time kswapd awake (seconds) 9401.21 1367.86 1951.75 1328.88
Total pages scanned 15926966 16434398 17873542 17644342
Total pages reclaimed 5946196 2930006 3166547 3168890
%age total pages scanned/reclaimed 37.33% 17.83% 17.72% 17.96%
%age total pages scanned/written 23.27% 1.52% 1.94% 1.43%
%age file pages scanned/written 1.01% 0.11% 0.11% 0.08%
Percentage Time Spent Direct Reclaim 44.55% 35.10% 41.42% 36.91%
Percentage Time kswapd Awake 86.71% 43.58% 52.67% 41.14%
While the scanning rates are slightly up, the scanned/reclaimed and
scanned/written figures are much improved. The time spent in direct
reclaim and with kswapd are massively reduced, mostly by the lowlumpy
patches.
FTrace Reclaim Statistics: congestion_wait
Direct number congest waited 725 303 126 3
Direct time congest waited 45524ms 9180ms 5936ms 300ms
Direct full congest waited 487 190 52 3
Direct number conditional waited 0 0 200 301
Direct time conditional waited 0ms 0ms 0ms 1904ms
Direct full conditional waited 0 0 0 19
KSwapd number congest waited 0 2 23 4
KSwapd time congest waited 0ms 200ms 420ms 404ms
KSwapd full congest waited 0 2 2 4
KSwapd number conditional waited 0 0 0 0
KSwapd time conditional waited 0ms 0ms 0ms 0ms
KSwapd full conditional waited 0 0 0 0
Not as dramatic a story here but the time spent asleep is reduced and we
can still see what wait_iff_congested is going to sleep when necessary.
MMTests Statistics: duration
User/Sys Time Running Test (seconds) 12028.09 3157.17 3357.79 3199.16
Total Elapsed Time (seconds) 10842.07 3138.72 3705.54 3229.85
The time to complete this test goes way down. With the full series, we
are allocating over twice the number of huge pages in 30% of the time and
there is a corresponding impact on the allocation latency graph available
at.
http://www.csn.ul.ie/~mel/postings/vmscanreduce-20101509/highalloc-interlatency-powyah-mean.ps
This patch:
Add a trace event for shrink_inactive_list() and updates the sample
postprocessing script appropriately. It can be used to determine how many
pages were reclaimed and for non-lumpy reclaim where exactly the pages
were reclaimed from.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
`priority' cannot be negative here. And the comment is obsolete.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Ying Han reported that backing aging of anon pages in no swap system
causes unnecessary TLB flush.
When I sent a patch(69c8548175), I wanted this patch but Rik pointed out
and allowed aging of anon pages to give a chance to promote from inactive
to active LRU.
It has a two problem.
1) non-swap system
Never make sense to age anon pages.
2) swap configured but still doesn't swapon
It doesn't make sense to age anon pages until swap-on time. But it's
arguable. If we have aged anon pages by swapon, VM have moved anon pages
from active to inactive. And in the time swapon by admin, the VM can't
reclaim hot pages so we can protect hot pages swapout.
But let's think about it. When does swap-on happen? It depends on admin.
we can't expect it. Nonetheless, we have done aging of anon pages to
protect hot pages swapout. It means we lost run time overhead when below
high watermark but gain hot page swap-[in/out] overhead when VM decide
swapout. Is it true? Let's think more detail. We don't promote anon
pages in case of non-swap system. So even though VM does aging of anon
pages, the pages would be in inactive LRU for a long time. It means many
of pages in there would mark access bit again. So access bit hot/code
separation would be pointless.
This patch prevents unnecessary anon pages demotion in not-yet-swapon and
non-configured swap system. Even, in non-configuared swap system
inactive_anon_is_low can be compiled out.
It could make side effect that hot anon pages could swap out when admin
does swap on. But I think sooner or later it would be steady state. So
it's not a big problem.
We could lose someting but gain more thing(TLB flush and unnecessary
function call to demote anon pages).
Signed-off-by: Ying Han <yinghan@google.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Non-NUMA systems do never create these files anyway, since they are only
created by driver subsystem when NUMA is configured.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@holoscopio.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This removes more dead code that was somehow missed by commit 0d99519efe
(writeback: remove unused nonblocking and congestion checks). There are
no behavior change except for the removal of two entries from one of the
ext4 tracing interface.
The nonblocking checks in ->writepages are no longer used because the
flusher now prefer to block on get_request_wait() than to skip inodes on
IO congestion. The latter will lead to more seeky IO.
The nonblocking checks in ->writepage are no longer used because it's
redundant with the WB_SYNC_NONE check.
We no long set ->nonblocking in VM page out and page migration, because
a) it's effectively redundant with WB_SYNC_NONE in current code
b) it's old semantic of "Don't get stuck on request queues" is mis-behavior:
that would skip some dirty inodes on congestion and page out others, which
is unfair in terms of LRU age.
Inspired by Christoph Hellwig. Thanks!
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: David Howells <dhowells@redhat.com>
Cc: Sage Weil <sage@newdream.net>
Cc: Steve French <sfrench@samba.org>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
M. Vefa Bicakci reported 2.6.35 kernel hang up when hibernation on his
32bit 3GB mem machine.
(https://bugzilla.kernel.org/show_bug.cgi?id=16771). Also he bisected
the regression to
commit bb21c7ce18
Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri Jun 4 14:15:05 2010 -0700
vmscan: fix do_try_to_free_pages() return value when priority==0 reclaim failure
At first impression, this seemed very strange because the above commit
only chenged function return value and hibernate_preallocate_memory()
ignore return value of shrink_all_memory(). But it's related.
Now, page allocation from hibernation code may enter infinite loop if the
system has highmem. The reasons are that vmscan don't care enough OOM
case when oom_killer_disabled.
The problem sequence is following as.
1. hibernation
2. oom_disable
3. alloc_pages
4. do_try_to_free_pages
if (scanning_global_lru(sc) && !all_unreclaimable)
return 1;
If kswapd is not freozen, it would set zone->all_unreclaimable to 1 and
then shrink_zones maybe return true(ie, all_unreclaimable is true). So at
last, alloc_pages could go to _nopage_. If it is, it should have no
problem.
This patch adds all_unreclaimable check to protect in direct reclaim path,
too. It can care of hibernation OOM case and help bailout
all_unreclaimable case slightly.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reported-by: M. Vefa Bicakci <bicave@superonline.com>
Reported-by: <caiqian@redhat.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: <caiqian@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
mem_cgroup_soft_limit_reclaim() has zone, nid and zid argument. but nid
and zid can be calculated from zone. So remove it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently mem_cgroup_shrink_node_zone() call shrink_zone() directly. thus
it doesn't need to initialize sc.nodemask because shrink_zone() doesn't
use it at all.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, mem_cgroup_shrink_node_zone() initialize sc.nr_to_reclaim as 0.
It mean shrink_zone() only scan 32 pages and immediately return even if
it doesn't reclaim any pages.
This patch fixes it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Fix "system goes unresponsive under memory pressure and lots of
dirty/writeback pages" bug.
http://lkml.org/lkml/2010/4/4/86
In the above thread, Andreas Mohr described that
Invoking any command locked up for minutes (note that I'm
talking about attempted additional I/O to the _other_,
_unaffected_ main system HDD - such as loading some shell
binaries -, NOT the external SSD18M!!).
This happens when the two conditions are both meet:
- under memory pressure
- writing heavily to a slow device
OOM also happens in Andreas' system. The OOM trace shows that 3 processes
are stuck in wait_on_page_writeback() in the direct reclaim path. One in
do_fork() and the other two in unix_stream_sendmsg(). They are blocked on
this condition:
(sc->order && priority < DEF_PRIORITY - 2)
which was introduced in commit 78dc583d (vmscan: low order lumpy reclaim
also should use PAGEOUT_IO_SYNC) one year ago. That condition may be too
permissive. In Andreas' case, 512MB/1024 = 512KB. If the direct reclaim
for the order-1 fork() allocation runs into a range of 512KB
hard-to-reclaim LRU pages, it will be stalled.
It's a severe problem in three ways.
Firstly, it can easily happen in daily desktop usage. vmscan priority can
easily go below (DEF_PRIORITY - 2) on _local_ memory pressure. Even if
the system has 50% globally reclaimable pages, it still has good
opportunity to have 0.1% sized hard-to-reclaim ranges. For example, a
simple dd can easily create a big range (up to 20%) of dirty pages in the
LRU lists. And order-1 to order-3 allocations are more than common with
SLUB. Try "grep -v '1 :' /proc/slabinfo" to get the list of high order
slab caches. For example, the order-1 radix_tree_node slab cache may
stall applications at swap-in time; the order-3 inode cache on most
filesystems may stall applications when trying to read some file; the
order-2 proc_inode_cache may stall applications when trying to open a
/proc file.
Secondly, once triggered, it will stall unrelated processes (not doing IO
at all) in the system. This "one slow USB device stalls the whole system"
avalanching effect is very bad.
Thirdly, once stalled, the stall time could be intolerable long for the
users. When there are 20MB queued writeback pages and USB 1.1 is writing
them in 1MB/s, wait_on_page_writeback() will stuck for up to 20 seconds.
Not to mention it may be called multiple times.
So raise the bar to only enable PAGEOUT_IO_SYNC when priority goes below
DEF_PRIORITY/3, or 6.25% LRU size. As the default dirty throttle ratio is
20%, it will hardly be triggered by pure dirty pages. We'd better treat
PAGEOUT_IO_SYNC as some last resort workaround -- its stall time is so
uncomfortably long (easily goes beyond 1s).
The bar is only raised for (order < PAGE_ALLOC_COSTLY_ORDER) allocations,
which are easy to satisfy in 1TB memory boxes. So, although 6.25% of
memory could be an awful lot of pages to scan on a system with 1TB of
memory, it won't really have to busy scan that much.
Andreas tested an older version of this patch and reported that it mostly
fixed his problem. Mel Gorman helped improve it and KOSAKI Motohiro will
fix it further in the next patch.
Reported-by: Andreas Mohr <andi@lisas.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Memcg also need to trace reclaim progress as direct reclaim. This patch
add it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Presently shrink_slab() has the following scanning equation.
lru_scanned max_pass
basic_scan_objects = 4 x ------------- x -----------------------------
lru_pages shrinker->seeks (default:2)
scan_objects = min(basic_scan_objects, max_pass * 2)
If we pass very small value as lru_pages instead real number of lru pages,
shrink_slab() drop much objects rather than necessary. And now,
__zone_reclaim() pass 'order' as lru_pages by mistake. That produces a
bad result.
For example, if we receive very low memory pressure (scan = 32, order =
0), shrink_slab() via zone_reclaim() always drop _all_ icache/dcache
objects. (see above equation, very small lru_pages make very big
scan_objects result).
This patch fixes it.
[akpm@linux-foundation.org: fix layout, typos]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Rik van Riel pointed out reading reclaim_stat should be protected
lru_lock, otherwise vmscan might sweep 2x much pages.
This fault was introduced by
commit 4f98a2fee8
Author: Rik van Riel <riel@redhat.com>
Date: Sat Oct 18 20:26:32 2008 -0700
vmscan: split LRU lists into anon & file sets
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
'slab_reclaimable' and 'nr_pages' are unsigned. Subtraction is unsafe
because negative results would be misinterpreted.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When shrink_inactive_list() isolates pages, it updates a number of
counters using temporary variables to gather them. These consume stack
and it's in the main path that calls ->writepage(). This patch moves the
accounting updates outside of the main path to reduce stack usage.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_page_list() sets up a pagevec to release pages as according as they
are free. It uses significant amounts of stack on the pagevec. This
patch adds pages to be freed via pagevec to a linked list which is then
freed en-masse at the end. This avoids using stack in the main path that
potentially calls writepage().
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
uses significant amounts of stack doing this. This patch splits
shrink_inactive_list() to take the stack usage out of the main path so
that callers to writepage() do not contain an unused pagevec on the stack.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Remove temporary variable that is only used once and does not help clarify
code.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, max_scan of shrink_inactive_list() is always passed less than
SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it. This
patch also help stack diet.
detail
- remove "while (nr_scanned < max_scan)" loop
- remove nr_freed (now, we use nr_reclaimed directly)
- remove nr_scan (now, we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead
using SWAP_CLUSTER_MAX
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Since 2.6.28 zone->prev_priority is unused. Then it can be removed
safely. It reduce stack usage slightly.
Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
can be integrate again, it's useful. but four (or more) times trying
haven't got good performance number. Thus I give up such approach.
The rest of this changelog is notes on prev_priority and why it existed in
the first place and why it might be not necessary any more. This information
is based heavily on discussions between Andrew Morton, Rik van Riel and
Kosaki Motohiro who is heavily quotes from.
Historically prev_priority was important because it determined when the VM
would start unmapping PTE pages. i.e. there are no balances of note within
the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there
is a potential risk of unnecessarily increasing minor faults as a large
amount of read activity of use-once pages could push mapped pages to the
end of the LRU and get unmapped.
There is no proof this is still a problem but currently it is not considered
to be. Active files are not deactivated if the active file list is smaller
than the inactive list reducing the liklihood that file-mapped pages are
being pushed off the LRU and referenced executable pages are kept on the
active list to avoid them getting pushed out by read activity.
Even if it is a problem, prev_priority prev_priority wouldn't works
nowadays. First of all, current vmscan still a lot of UP centric code. it
expose some weakness on some dozens CPUs machine. I think we need more and
more improvement.
The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
and per-task-pressure a bit. example, prev_priority try to boost priority to
other concurrent priority. but if the another task have mempolicy restriction,
it is unnecessary, but also makes wrong big latency and exceeding reclaim.
per-task based priority + prev_priority adjustment make the emulation of
per-system pressure. but it have two issue 1) too rough and brutal emulation
2) we need per-zone pressure, not per-system.
Another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
prev_priority can't solve such multithreads workload issue. In other word,
prev_priority concept assume the sysmtem don't have lots threads."
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add a trace event for when page reclaim queues a page for IO and records
whether it is synchronous or asynchronous. Excessive synchronous IO for a
process can result in noticeable stalls during direct reclaim. Excessive
IO from page reclaim may indicate that the system is seriously under
provisioned for the amount of dirty pages that exist.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add an event for when pages are isolated en-masse from the LRU lists.
This event augments the information available on LRU traffic and can be
used to evaluate lumpy reclaim.
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Add two trace events for kswapd waking up and going asleep for the
purposes of tracking kswapd activity and two trace events for direct
reclaim beginning and ending. The information can be used to work out how
much time a process or the system is spending on the reclamation of pages
and in the case of direct reclaim, how many pages were reclaimed for that
process. High frequency triggering of these events could point to memory
pressure problems.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Larry Woodman <lwoodman@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_zones() need relatively long time and lru_pages can change
dramatically during shrink_zones(). So lru_pages should be recalculated
for each priority.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Swap token don't works when zone reclaim is enabled since it was born.
Because __zone_reclaim() always call disable_swap_token() unconditionally.
This kill swap token feature completely. As far as I know, nobody want to
that. Remove it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
We need lock_page_nosync() here because we have no reference to the
mapping when taking the page lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Greg Thelen reported recent Johannes's stack diet patch makes kernel hang.
His test is following.
mount -t cgroup none /cgroups -o memory
mkdir /cgroups/cg1
echo $$ > /cgroups/cg1/tasks
dd bs=1024 count=1024 if=/dev/null of=/data/foo
echo $$ > /cgroups/tasks
echo 1 > /cgroups/cg1/memory.force_empty
Actually, This OOM hard to try logic have been corrupted since following
two years old patch.
commit a41f24ea9f
Author: Nishanth Aravamudan <nacc@us.ibm.com>
Date: Tue Apr 29 00:58:25 2008 -0700
page allocator: smarter retry of costly-order allocations
Original intention was "return success if the system have shrinkable zones
though priority==0 reclaim was failure". But the above patch changed to
"return nr_reclaimed if .....". Oh, That forgot nr_reclaimed may be 0 if
priority==0 reclaim failure.
And Johannes's patch 0aeb2339e5 ("vmscan: remove all_unreclaimable scan
control") made it more corrupt. Originally, priority==0 reclaim failure
on memcg return 0, but this patch changed to return 1. It totally
confused memcg.
This patch fixes it completely.
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Greg Thelen <gthelen@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For now, we have global isolation vs. memory control group isolation, do
not allow the reclaim entry function to set an arbitrary page isolation
callback, we do not need that flexibility.
And since we already pass around the group descriptor for the memory
control group isolation case, just use it to decide which one of the two
isolator functions to use.
The decisions can be merged into nearby branches, so no extra cost there.
In fact, we save the indirect calls.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This scan control is abused to communicate a return value from
shrink_zones(). Write this idiomatically and remove the knob.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If vmscan is under lumpy reclaim mode, it have to ignore referenced bit
for making contenious free pages. but current page_check_references()
doesn't.
Fix it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
get_scan_ratio() calculates percentage and if the percentage is < 1%, it
will round percentage down to 0% and cause we completely ignore scanning
anon/file pages to reclaim memory even the total anon/file pages are very
big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned. In this way, we should get several
scanned pages for < 1% percent.
This has some benefits:
1. increase our calculation precision
2. making our scan more smoothly. Without this, if percent[x] is
underflow, shrink_zone() doesn't scan any pages and suddenly it scans
all pages when priority is zero. With this, even priority isn't zero,
shrink_zone() gets chance to scan some pages.
Note, this patch doesn't really change logics, but just increase
precision. For system with a lot of memory, this might slightly changes
behavior. For example, in a sequential file read workload, without the
patch, we don't swap any anon pages. With it, if anon memory size is
bigger than 16G, we will see one anon page swapped. The 16G is calculated
as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap is assumed to be 1024
which is common in this workload. So the impact sounds not a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Before applying this patch, cpuset updates task->mems_allowed and
mempolicy by setting all new bits in the nodemask first, and clearing all
old unallowed bits later. But in the way, the allocator may find that
there is no node to alloc memory.
The reason is that cpuset rebinds the task's mempolicy, it cleans the
nodes which the allocater can alloc pages on, for example:
(mpol: mempolicy)
task1 task1's mpol task2
alloc page 1
alloc on node0? NO 1
1 change mems from 1 to 0
1 rebind task1's mpol
0-1 set new bits
0 clear disallowed bits
alloc on node1? NO 0
...
can't alloc page
goto oom
This patch fixes this problem by expanding the nodes range first(set newly
allowed bits) and shrink it lazily(clear newly disallowed bits). So we
use a variable to tell the write-side task that read-side task is reading
nodemask, and the write-side task clears newly disallowed nodes after
read-side task ends the current memory allocation.
[akpm@linux-foundation.org: fix spello]
Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Paul Menage <menage@google.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Ravikiran Thirumalai <kiran@scalex86.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Shaohua Li reported his tmpfs streaming I/O test can lead to make oom.
The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there
are 6 copies of kernel source and the test does kbuild for each copy. His
investigation shows the test has a lot of rotated anon pages and quite few
file pages, so get_scan_ratio calculates percent[0] (i.e. scanning
percent for anon) to be zero. Actually the percent[0] shoule be a big
value, but our calculation round it to zero.
Although before commit 84b18490 ("vmscan: get_scan_ratio() cleanup") , we
have the same problem too. But the old logic can rescue percent[0]==0
case only when priority==0. It had hided the real issue. I didn't think
merely streaming io can makes percent[0]==0 && priority==0 situation. but
I was wrong.
So, definitely we have to fix such tmpfs streaming io issue. but anyway I
revert the regression commit at first.
This reverts commit 84b18490d1.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reported-by: Shaohua Li <shaohua.li@intel.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
percpu.h is included by sched.h and module.h and thus ends up being
included when building most .c files. percpu.h includes slab.h which
in turn includes gfp.h making everything defined by the two files
universally available and complicating inclusion dependencies.
percpu.h -> slab.h dependency is about to be removed. Prepare for
this change by updating users of gfp and slab facilities include those
headers directly instead of assuming availability. As this conversion
needs to touch large number of source files, the following script is
used as the basis of conversion.
http://userweb.kernel.org/~tj/misc/slabh-sweep.py
The script does the followings.
* Scan files for gfp and slab usages and update includes such that
only the necessary includes are there. ie. if only gfp is used,
gfp.h, if slab is used, slab.h.
* When the script inserts a new include, it looks at the include
blocks and try to put the new include such that its order conforms
to its surrounding. It's put in the include block which contains
core kernel includes, in the same order that the rest are ordered -
alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
doesn't seem to be any matching order.
* If the script can't find a place to put a new include (mostly
because the file doesn't have fitting include block), it prints out
an error message indicating which .h file needs to be added to the
file.
The conversion was done in the following steps.
1. The initial automatic conversion of all .c files updated slightly
over 4000 files, deleting around 700 includes and adding ~480 gfp.h
and ~3000 slab.h inclusions. The script emitted errors for ~400
files.
2. Each error was manually checked. Some didn't need the inclusion,
some needed manual addition while adding it to implementation .h or
embedding .c file was more appropriate for others. This step added
inclusions to around 150 files.
3. The script was run again and the output was compared to the edits
from #2 to make sure no file was left behind.
4. Several build tests were done and a couple of problems were fixed.
e.g. lib/decompress_*.c used malloc/free() wrappers around slab
APIs requiring slab.h to be added manually.
5. The script was run on all .h files but without automatically
editing them as sprinkling gfp.h and slab.h inclusions around .h
files could easily lead to inclusion dependency hell. Most gfp.h
inclusion directives were ignored as stuff from gfp.h was usually
wildly available and often used in preprocessor macros. Each
slab.h inclusion directive was examined and added manually as
necessary.
6. percpu.h was updated not to include slab.h.
7. Build test were done on the following configurations and failures
were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
distributed build env didn't work with gcov compiles) and a few
more options had to be turned off depending on archs to make things
build (like ipr on powerpc/64 which failed due to missing writeq).
* x86 and x86_64 UP and SMP allmodconfig and a custom test config.
* powerpc and powerpc64 SMP allmodconfig
* sparc and sparc64 SMP allmodconfig
* ia64 SMP allmodconfig
* s390 SMP allmodconfig
* alpha SMP allmodconfig
* um on x86_64 SMP allmodconfig
8. percpu.h modifications were reverted so that it could be applied as
a separate patch and serve as bisection point.
Given the fact that I had only a couple of failures from tests on step
6, I'm fairly confident about the coverage of this conversion patch.
If there is a breakage, it's likely to be something in one of the arch
headers which should be easily discoverable easily on most builds of
the specific arch.
Signed-off-by: Tejun Heo <tj@kernel.org>
Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
The VM currently assumes that an inactive, mapped and referenced file page
is in use and promotes it to the active list.
However, every mapped file page starts out like this and thus a problem
arises when workloads create a stream of such pages that are used only for
a short time. By flooding the active list with those pages, the VM
quickly gets into trouble finding eligible reclaim canditates. The result
is long allocation latencies and eviction of the wrong pages.
This patch reuses the PG_referenced page flag (used for unmapped file
pages) to implement a usage detection that scales with the speed of LRU
list cycling (i.e. memory pressure).
If the scanner encounters those pages, the flag is set and the page cycled
again on the inactive list. Only if it returns with another page table
reference it is activated. Otherwise it is reclaimed as 'not recently
used cache'.
This effectively changes the minimum lifetime of a used-once mapped file
page from a full memory cycle to an inactive list cycle, which allows it
to occur in linear streams without affecting the stable working set of the
system.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_mapping_inuse() is a historic predicate function for pages that are
about to be reclaimed or deactivated.
According to it, a page is in use when it is mapped into page tables OR
part of swap cache OR backing an mmapped file.
This function is used in combination with page_referenced(), which checks
for young bits in ptes and the page descriptor itself for the
PG_referenced bit. Thus, checking for unmapped swap cache pages is
meaningless as PG_referenced is not set for anonymous pages and unmapped
pages do not have young ptes. The test makes no difference.
Protecting file pages that are not by themselves mapped but are part of a
mapped file is also a historic leftover for short-lived things like the
exec() code in libc. However, the VM now does reference accounting and
activation of pages at unmap time and thus the special treatment on
reclaim is obsolete.
This patch drops page_mapping_inuse() and switches the two callsites to
use page_mapped() directly.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).
Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again. When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress. Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.
This patch makes the VM be more careful about activating mapped file
pages in the first place. The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.
This test resembles a hashing rtorrent process. Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again. While this happens, every 5 seconds a process is launched and
its execution time taken:
python2.4 -c 'import pydoc'
old: max=2.31s mean=1.26s (0.34)
new: max=1.25s mean=0.32s (0.32)
find /etc -type f
old: max=2.52s mean=1.44s (0.43)
new: max=1.92s mean=0.12s (0.17)
vim -c ':quit'
old: max=6.14s mean=4.03s (0.49)
new: max=3.48s mean=2.41s (0.25)
mplayer --help
old: max=8.08s mean=5.74s (1.02)
new: max=3.79s mean=1.32s (0.81)
overall hash time (stdev):
old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)
I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list. The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.
The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.
I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved. The test would mmap a file of 20MB, then
1. touch all its pages to fault them in
2. force one full scan cycle on the inactive file LRU
-- old: mapping pages activated
-- new: mapping pages inactive
3. touch the mapping pages again
-- old and new: fault exceptions to set the young bits
4. force another full scan cycle on the inactive file LRU
5. touch the mapping pages one last time
-- new: fault exceptions to set the young bits
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec). 13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that gets actually
activated. Note:
1. File mapping the size of one third of main memory, _completely_
in active use across memory pressure - i.e., most pages referenced
within one LRU cycle. This should be rare to non-existant,
especially on such embedded setups.
2. Many huge activation batches. Those batches only occur when the
working set fluctuates. If it changes completely between every full
LRU cycle, you have problematic reclaim overhead anyway.
3. Access of activated pages at maximum speed: sequential loads from
every single page without doing anything in between. In reality,
the extra faults will get distributed between actual operations on
the data.
So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.
Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.
Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.
Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.
Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.
This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit e815af95 ("change all_unreclaimable zone member to flags") changed
all_unreclaimable member to bit flag. But it had an undesireble side
effect. free_one_page() is one of most hot path in linux kernel and
increasing atomic ops in it can reduce kernel performance a bit.
Thus, this patch revert such commit partially. at least
all_unreclaimable shouldn't share memory word with other zone flags.
[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Huang Shijie <shijie8@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim
context annotation. But it didn't annotate zone reclaim. This patch do
it.
The point is, commit cf40bd16fd annotate __alloc_pages_direct_reclaim but
zone-reclaim doesn't use __alloc_pages_direct_reclaim.
current call graph is
__alloc_pages_nodemask
get_page_from_freelist
zone_reclaim()
__alloc_pages_slowpath
__alloc_pages_direct_reclaim
try_to_free_pages
Actually, if zone_reclaim_mode=1, VM never call
__alloc_pages_direct_reclaim in usual VM pressure.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The get_scan_ratio() should have all scan-ratio related calculations.
Thus, this patch move some calculation into get_scan_ratio.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Kswapd checks that zone has sufficient pages free via zone_watermark_ok().
If any zone doesn't have enough pages, we set all_zones_ok to zero.
!all_zone_ok makes kswapd retry rather than sleeping.
I think the watermark check before shrink_zone() is pointless. Only after
kswapd has tried to shrink the zone is the check meaningful.
Move the check to after the call to shrink_zone().
[akpm@linux-foundation.org: fix comment, layout]
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit f50de2d3 (vmscan: have kswapd sleep for a short interval and double
check it should be asleep) can cause kswapd to enter an infinite loop if
running on a single-CPU system. If all zones are unreclaimble,
sleeping_prematurely return 1 and kswapd will call balance_pgdat() again.
but it's totally meaningless, balance_pgdat() doesn't anything against
unreclaimable zone!
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Reported-by: Will Newton <will.newton@gmail.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Tested-by: Will Newton <will.newton@gmail.com>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In AIM7 runs, recent kernels start swapping out anonymous pages well
before they should. This is due to shrink_list falling through to
shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we
really wanted to do is pre-age some anonymous pages to give them extra
time to be referenced while on the inactive list.
The obvious fix is to make sure that shrink_list does not fall through to
scanning/reclaiming inactive pages when we called it to scan one of the
active lists.
This change should be safe because the loop in shrink_zone ensures that we
will still shrink the anon and file inactive lists whenever we should.
[kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()]
Reported-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tomasz Chmielewski <mangoo@wpkg.org>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Now, All caller of reclaim use swap_cluster_max as SWAP_CLUSTER_MAX.
Then, we can remove it perfectly.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
In old days, we didn't have sc.nr_to_reclaim and it brought
sc.swap_cluster_max misuse.
huge sc.swap_cluster_max might makes unnecessary OOM risk and no
performance benefit.
Now, we can stop its insane thing.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
shrink_all_zone() was introduced by commit d6277db4ab (swsusp: rework
memory shrinker) for hibernate performance improvement. and
sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
memory for suspend).
commit a06fe4d307 said
Without the patch:
Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
Freed 88563 pages in 14719 jiffies = 23.50 MB/s
Freed 205734 pages in 32389 jiffies = 24.81 MB/s
With the patch:
Freed 68252 pages in 496 jiffies = 537.52 MB/s
Freed 116464 pages in 569 jiffies = 798.54 MB/s
Freed 209699 pages in 705 jiffies = 1161.89 MB/s
At that time, their patch was pretty worth. However, Modern Hardware
trend and recent VM improvement broke its worth. From several reason, I
think we should remove shrink_all_zones() at all.
detail:
1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle
at no i/o congestion.
but current shrink_zone() is sane, not slow.
2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works
fine on numa system.
example)
System has 4GB memory and each node have 2GB. and hibernate need 1GB.
optimal)
steal 500MB from each node.
shrink_all_zones)
steal 1GB from node-0.
Oh, Cache balancing logic was broken. ;)
Unfortunately, Desktop system moved ahead NUMA at nowadays.
(Side note, if hibernate require 2GB, shrink_all_zones() never success
on above machine)
3) if the node has several I/O flighting pages, shrink_all_zones() makes
pretty bad result.
schenario) hibernate need 1GB
1) shrink_all_zones() try to reclaim 1GB from Node-0
2) but it only reclaimed 990MB
3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1
4) it reclaimed 990MB
Oh, well. it reclaimed twice much than required.
In the other hand, current shrink_zone() has sane baling out logic.
then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk.
4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only
shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal.
Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages().
it bring good reviewability and debuggability, and solve above problems.
side note: Reclaim logic unificication makes two good side effect.
- Fix recursive reclaim bug on shrink_all_memory().
it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock.
- Now, shrink_all_memory() got lockdep awareness. it bring good debuggability.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently, sc.scap_cluster_max has double meanings.
1) reclaim batch size as isolate_lru_pages()'s argument
2) reclaim baling out thresolds
The two meanings pretty unrelated. Thus, Let's separate it.
this patch doesn't change any behavior.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If reclaim fails to make sufficient progress, the priority is raised.
Once the priority is higher, kswapd starts waiting on congestion.
However, if the zone is below the min watermark then kswapd needs to
continue working without delay as there is a danger of an increased rate
of GFP_ATOMIC allocation failure.
This patch changes the conditions under which kswapd waits on congestion
by only going to sleep if the min watermarks are being met.
[mel@csn.ul.ie: add stats to track how relevant the logic is]
[mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
After kswapd balances all zones in a pgdat, it goes to sleep. In the
event of no IO congestion, kswapd can go to sleep very shortly after the
high watermark was reached. If there are a constant stream of allocations
from parallel processes, it can mean that kswapd went to sleep too quickly
and the high watermark is not being maintained for sufficient length time.
This patch makes kswapd go to sleep as a two-stage process. It first
tries to sleep for HZ/10. If it is woken up by another process or the
high watermark is no longer met, it's considered a premature sleep and
kswapd continues work. Otherwise it goes fully to sleep.
This adds more counters to distinguish between fast and slow breaches of
watermarks. A "fast" premature sleep is one where the low watermark was
hit in a very short time after kswapd going to sleep. A "slow" premature
sleep indicates that the high watermark was breached after a very short
interval.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Frans Pop <elendil@planet.nl>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap
cleanups") removed generic_file_write() in filemap. Change the comment in
vmscan pageout() to __generic_file_aio_write().
Signed-off-by: Vincent Li <macli@brc.ubc.ca>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
there are no present pages left.
In such a situation, kswapd must also be stopped since it has nothing left
to do.
Signed-off-by: David Rientjes <rientjes@google.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rafael J. Wysocki <rjw@sisk.pl>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Adam Litke <agl@us.ibm.com>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Eric Whitney <eric.whitney@hp.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Isolators putting a page back to the LRU do not hold the page lock, and if
the page is mlocked, another thread might munlock it concurrently.
Expecting this, the putback code re-checks the evictability of a page when
it just moved it to the unevictable list in order to correct its decision.
The problem, however, is that ordering is not garuanteed between setting
PG_lru when moving the page to the list and checking PG_mlocked
afterwards:
#0: #1
spin_lock()
if (TestClearPageMlocked())
if (PageLRU())
move to evictable list
SetPageLRU()
spin_unlock()
if (!PageMlocked())
move to evictable list
The PageMlocked() check may get reordered before SetPageLRU() in #0,
resulting in #0 not moving the still mlocked page, and in #1 failing to
isolate and move the page as well. The page is now stranded on the
unevictable list.
The race condition is very unlikely. The consequence currently is one
page falling off the reclaim grid and eventually getting freed with
PG_unevictable set, which triggers a warning in the page allocator.
TestClearPageMlocked() in #1 already provides full memory barrier
semantics.
This patch adds an explicit full barrier to force ordering between
SetPageLRU() and PageMlocked() so that either one of the competitors
rescues the page.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
It is possible to have !Anon but SwapBacked pages, and some apps could
create huge number of such pages with MAP_SHARED|MAP_ANONYMOUS. These
pages go into the ANON lru list, and hence shall not be protected: we only
care mapped executable files. Failing to do so may trigger OOM.
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
commit 8aa7e847d (Fix congestion_wait() sync/async vs read/write
confusion) replace WRITE with BLK_RW_ASYNC. Unfortunately, concurrent mm
development made the unchanged place accidentally.
This patch fixes it too.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Jens Axboe <jens.axboe@oracle.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'writeback' of git://git.kernel.dk/linux-2.6-block:
writeback: writeback_inodes_sb() should use bdi_start_writeback()
writeback: don't delay inodes redirtied by a fast dirtier
writeback: make the super_block pinning more efficient
writeback: don't resort for a single super_block in move_expired_inodes()
writeback: move inodes from one super_block together
writeback: get rid to incorrect references to pdflush in comments
writeback: improve readability of the wb_writeback() continue/break logic
writeback: cleanup writeback_single_inode()
writeback: kupdate writeback shall not stop when more io is possible
writeback: stop background writeback when below background threshold
writeback: balance_dirty_pages() shall write more than dirtied pages
fs: Fix busyloop in wb_writeback()
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
HWPOISON: Enable error_remove_page on btrfs
HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
HWPOISON: Add madvise() based injector for hardware poisoned pages v4
HWPOISON: Enable error_remove_page for NFS
HWPOISON: Enable .remove_error_page for migration aware file systems
HWPOISON: The high level memory error handler in the VM v7
HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
HWPOISON: shmem: call set_page_dirty() with locked page
HWPOISON: Define a new error_remove_page address space op for async truncation
HWPOISON: Add invalidate_inode_page
HWPOISON: Refactor truncate to allow direct truncating of page v2
HWPOISON: check and isolate corrupted free pages v2
HWPOISON: Handle hardware poisoned pages in try_to_unmap
HWPOISON: Use bitmask/action code for try_to_unmap behaviour
HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
HWPOISON: Add poison check to page fault handling
HWPOISON: Add basic support for poisoned pages in fault handler v3
HWPOISON: Add new SIGBUS error codes for hardware poison signals
HWPOISON: Add support for poison swap entries v2
HWPOISON: Export some rmap vma locking to outside world
...
It's unused.
It isn't needed -- read or write flag is already passed and sysctl
shouldn't care about the rest.
It _was_ used in two places at arch/frv for some reason.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: David Howells <dhowells@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: James Morris <jmorris@namei.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Implement reclaim from groups over their soft limit
Permit reclaim from memory cgroups on contention (via the direct reclaim
path).
memory cgroup soft limit reclaim finds the group that exceeds its soft
limit by the largest number of pages and reclaims pages from it and then
reinserts the cgroup into its correct place in the rbtree.
Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off. The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targetted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.
[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
which case shrink_list() _still_ calls isolate_pages() with the much
larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
rate by up to 32 times.
For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
So when shrink_zone() expects to scan 4 pages in the active/inactive list,
the active list will be scanned 4 pages, while the inactive list will be
(over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
the balance between the two lists.
It can further impact the scan of anon active list, due to the anon
active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():
inactive anon list over scanned => inactive_anon_is_low() == TRUE
=> shrink_active_list()
=> active anon list over scanned
So the end result may be
- anon inactive => over scanned
- anon active => over scanned (maybe not as much)
- file inactive => over scanned
- file active => under scanned (relatively)
The accesses to nr_saved_scan are not lock protected and so not 100%
accurate, however we can tolerate small errors and the resulted small
imbalanced scan rates between zones.
Cc: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The name `zone_nr_pages' can be mis-read as zone's (total) number pages,
but it actually returns zone's LRU list number pages.
Signed-off-by: Vincent Li <macli@brc.ubc.ca>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Enlighten the reader of this code about what reference count makes a page
cache page freeable.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Make page_has_private() return a true boolean value and remove the double
negations from the two callsites using it for arithmetic.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
page_is_file_cache() has been used for both boolean checks and LRU
arithmetic, which was always a bit weird.
Now that page_lru_base_type() exists for LRU arithmetic, make
page_is_file_cache() a real predicate function and adjust the
boolean-using callsites to drop those pesky double negations.
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Instead of abusing page_is_file_cache() for LRU list index arithmetic, add
another helper with a more appropriate name and convert the non-boolean
users of page_is_file_cache() accordingly.
This new helper gives the LRU base type a page is supposed to live on,
inactive anon or inactive file.
[hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also]
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The pages in the list passed move_active_pages_to_lru() are already
touched by shrink_active_list(). IOW the prefetch in
move_active_pages_to_lru() don't populate any cache. it's pointless.
This patch remove it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The page_lru() already evaluate PageActive() and PageSwapBacked(). We
don't need to re-evaluate it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The move_active_pages_to_lru() function is called under irq disabled and
ClearPageActive() doesn't need irq disabling.
Then, this patch move it into shrink_active_list().
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The VM already avoids attempting to reclaim anon pages in various places,
But it doesn't avoid it for lumpy reclaim.
It shuffles lru list unnecessary so that it is pointless.
[akpm@linux-foundation.org: cleanup]
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
global_lru_pages() / zone_lru_pages() can be used in two ways:
- to estimate max reclaimable pages in determine_dirtyable_memory()
- to calculate the slab scan ratio
When swap is full or not present, the anon lru lists are not reclaimable
and also won't be scanned. So the anon pages shall not be counted in both
usage scenarios. Also rename to _reclaimable_pages: now they are counting
the possibly reclaimable lru pages.
It can greatly (and correctly) increase the slab scan rate under high
memory pressure (when most file pages have been reclaimed and swap is
full/absent), thus reduce false OOM kills.
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Jesse Barnes <jbarnes@virtuousgeek.org>
Cc: David Howells <dhowells@redhat.com>
Cc: "Li, Ming Chun" <macli@brc.ubc.ca>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
When way too many processes go into direct reclaim, it is possible for all
of the pages to be taken off the LRU. One result of this is that the next
process in the page reclaim code thinks there are no reclaimable pages
left and triggers an out of memory kill.
One solution to this problem is to never let so many processes into the
page reclaim path that the entire LRU is emptied. Limiting the system to
only having half of each inactive list isolated for reclaim should be
safe.
Signed-off-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If the system is running a heavy load of processes then concurrent reclaim
can isolate a large number of pages from the LRU. /proc/vmstat and the
output generated for an OOM do not show how many pages were isolated.
This has been observed during process fork bomb testing (mstctl11 in LTP).
This patch shows the information about isolated pages.
Reproduced via:
-----------------------
% ./hackbench 140 process 1000
=> OOM occur
active_anon:146 inactive_anon:0 isolated_anon:49245
active_file:79 inactive_file:18 isolated_file:113
unevictable:0 dirty:0 writeback:0 unstable:0 buffer:39
free:370 slab_reclaimable:309 slab_unreclaimable:5492
mapped:53 shmem:15 pagetables:28140 bounce:0
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
If sc->isolate_pages() return 0, we don't need to call shrink_page_list().
In past days, shrink_inactive_list() handled it properly.
But commit fb8d14e1 (three years ago commit!) breaked it. current
shrink_inactive_list() always call shrink_page_list() although
isolate_pages() return 0.
This patch restore proper return value check.
Requirements:
o "nr_taken == 0" condition should stay before calling shrink_page_list().
o "nr_taken == 0" condition should stay after nr_scan related statistics
modification.
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Currently the pgmoved variable has two meanings. It causes harder
reviewing. This patch separates it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
try_to_unmap currently has multiple modi (migration, munlock, normal unmap)
which are selected by magic flag variables. The logic is not very straight
forward, because each of these flag change multiple behaviours (e.g.
migration turns off aging, not only sets up migration ptes etc.)
Also the different flags interact in magic ways.
A later patch in this series adds another mode to try_to_unmap, so
this becomes quickly unmanageable.
Replace the different flags with a action code (migration, munlock, munmap)
and some additional flags as modifiers (ignore mlock, ignore aging).
This makes the logic more straight forward and allows easier extension
to new behaviours. Change all the caller to declare what they want to
do.
This patch is supposed to be a nop in behaviour. If anyone can prove
it is not that would be a bug.
Cc: Lee.Schermerhorn@hp.com
Cc: npiggin@suse.de
Signed-off-by: Andi Kleen <ak@linux.intel.com>
This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also seems a LOT more smooth in
vmstat:
r b swpd free buff cache si so bi bo in cs us sy id wa
0 1 0 608848 2652 375372 0 0 0 71024 604 24 1 10 48 42
0 1 0 549644 2712 433736 0 0 0 60692 505 27 1 8 48 44
1 0 0 476928 2784 505192 0 0 4 29540 553 24 0 9 53 37
0 1 0 457972 2808 524008 0 0 0 54876 331 16 0 4 38 58
0 1 0 366128 2928 614284 0 0 4 92168 710 58 0 13 53 34
0 1 0 295092 3000 684140 0 0 0 62924 572 23 0 9 53 37
0 1 0 236592 3064 741704 0 0 4 58256 523 17 0 8 48 44
0 1 0 165608 3132 811464 0 0 0 57460 560 21 0 8 54 38
0 1 0 102952 3200 873164 0 0 4 74748 540 29 1 10 48 41
0 1 0 48604 3252 926472 0 0 0 53248 469 29 0 7 47 45
where vanilla tends to fluctuate a lot in the creation phase:
r b swpd free buff cache si so bi bo in cs us sy id wa
1 1 0 678716 5792 303380 0 0 0 74064 565 50 1 11 52 36
1 0 0 662488 5864 319396 0 0 4 352 302 329 0 2 47 51
0 1 0 599312 5924 381468 0 0 0 78164 516 55 0 9 51 40
0 1 0 519952 6008 459516 0 0 4 78156 622 56 1 11 52 37
1 1 0 436640 6092 541632 0 0 0 82244 622 54 0 11 48 41
0 1 0 436640 6092 541660 0 0 0 8 152 39 0 0 51 49
0 1 0 332224 6200 644252 0 0 4 102800 728 46 1 13 49 36
1 0 0 274492 6260 701056 0 0 4 12328 459 49 0 7 50 43
0 1 0 211220 6324 763356 0 0 0 106940 515 37 1 10 51 39
1 0 0 160412 6376 813468 0 0 0 8224 415 43 0 6 49 45
1 1 0 85980 6452 886556 0 0 4 113516 575 39 1 11 54 34
0 2 0 85968 6452 886620 0 0 0 1640 158 211 0 0 46 54
A 10 disk test with btrfs performs 26% faster with per-bdi flushing. A
SSD based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as does random mmap'ed
writes.
A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
An mlocked page might lose the isolatation race. This causes the page to
clear PG_mlocked while it remains in a VM_LOCKED vma. This means it can
be put onto the [in]active list. We can rescue it by using try_to_unmap()
in shrink_page_list().
But now, As Wu Fengguang pointed out, vmscan has a bug. If the page has
PG_referenced, it can't reach try_to_unmap() in shrink_page_list() but is
put into the active list. If the page is referenced repeatedly, it can
remain on the [in]active list without being moving to the unevictable
list.
This patch fixes it.
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <<kosaki.motohiro@jp.fujitsu.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 1faa16d228 accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The isolated page is "cursor_page" not "page".
This could cause LRU list corruption under memory pressure, caught by
CONFIG_DEBUG_LIST.
Reported-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Try to fix memcg's lru rotation sanity: make memcg use the same logic as
the global LRU does.
Now, at __isolate_lru_page() retruns -EBUSY, the page is rotated to the
tail of LRU in global LRU's isolate LRU pages. But in memcg, it's not
handled. This makes memcg do the same behavior as global LRU and rotate
LRU in the page is busy.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
At lumpy reclaim, a page failed to be taken by __isolate_lru_page() can be
pushed back to "src" list by list_move(). But the page may not be from
"src" list. This pushes the page back to wrong LRU. And list_move()
itself is unnecessary because the page is not on top of LRU. Then, leave
it as it is if __isolate_lru_page() fails.
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim. On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but it is
possible that the heuristic will fail and the CPU gets tied up scanning
uselessly. Detecting the situation requires some guesswork and
experimentation so this patch adds a counter "zreclaim_failed" to
/proc/vmstat. If during high CPU utilisation this counter is increasing
rapidly, then the resolution to the problem may be to set
/proc/sys/vm/zone_reclaim_mode to 0.
[akpm@linux-foundation.org: name things consistently]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim. On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met. The problem is that zone_reclaim() failing at all means the
zone gets marked full.
This can cause situations where a zone is usable, but is being skipped
because it has been considered full. Take a situation where a large tmpfs
mount is occuping a large percentage of memory overall. The pages do not
get cleaned or reclaimed by zone_reclaim(), but the zone gets marked full
and the zonelist cache considers them not worth trying in the future.
This patch makes zone_reclaim() return more fine-grained information about
what occured when zone_reclaim() failued. The zone only gets marked full
if it really is unreclaimable. If it's a case that the scan did not occur
or if enough pages were not reclaimed with the limited reclaim_mode, then
the zone is simply skipped.
There is a side-effect to this patch. Currently, if zone_reclaim()
successfully reclaimed SWAP_CLUSTER_MAX, an allocation attempt would go
ahead. With this patch applied, zone watermarks are rechecked after
zone_reclaim() does some work.
This bug was introduced by commit 9276b1bc96
("memory page_alloc zonelist caching speedup") way back in 2.6.19 when the
zonelist_cache was introduced. It was not intended that zone_reclaim()
aggressively consider the zone to be full when it failed as full direct
reclaim can still be an option. Due to the age of the bug, it should be
considered a -stable candidate.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
A bug was brought to my attention against a distro kernel but it affects
mainline and I believe problems like this have been reported in various
guises on the mailing lists although I don't have specific examples at the
moment.
The reported problem was that malloc() stalled for a long time (minutes in
some cases) if a large tmpfs mount was occupying a large percentage of
memory overall. The pages did not get cleaned or reclaimed by
zone_reclaim() because the zone_reclaim_mode was unsuitable, but the lists
are uselessly scanned frequencly making the CPU spin at near 100%.
This patchset intends to address that bug and bring the behaviour of
zone_reclaim() more in line with expectations which were noticed during
investigation. It is based on top of mmotm and takes advantage of
Kosaki's work with respect to zone_reclaim().
Patch 1 fixes the heuristics that zone_reclaim() uses to determine if the
scan should go ahead. The broken heuristic is what was causing the
malloc() stall as it uselessly scanned the LRU constantly. Currently,
zone_reclaim is assuming zone_reclaim_mode is 1 and historically it
could not deal with tmpfs pages at all. This fixes up the heuristic so
that an unnecessary scan is more likely to be correctly avoided.
Patch 2 notes that zone_reclaim() returning a failure automatically means
the zone is marked full. This is not always true. It could have
failed because the GFP mask or zone_reclaim_mode were unsuitable.
Patch 3 introduces a counter zreclaim_failed that will increment each
time the zone_reclaim scan-avoidance heuristics fail. If that
counter is rapidly increasing, then zone_reclaim_mode should be
set to 0 as a temporarily resolution and a bug reported because
the scan-avoidance heuristic is still broken.
This patch:
On NUMA machines, the administrator can configure zone_reclaim_mode that
is a more targetted form of direct reclaim. On machines with large NUMA
distances for example, a zone_reclaim_mode defaults to 1 meaning that
clean unmapped pages will be reclaimed if the zone watermarks are not
being met.
There is a heuristic that determines if the scan is worthwhile but the
problem is that the heuristic is not being properly applied and is
basically assuming zone_reclaim_mode is 1 if it is enabled. The lack of
proper detection can manfiest as high CPU usage as the LRU list is scanned
uselessly.
Historically, once enabled it was depending on NR_FILE_PAGES which may
include swapcache pages that the reclaim_mode cannot deal with. Patch
vmscan-change-the-number-of-the-unmapped-files-in-zone-reclaim.patch by
Kosaki Motohiro noted that zone_page_state(zone, NR_FILE_PAGES) included
pages that were not file-backed such as swapcache and made a calculation
based on the inactive, active and mapped files. This is far superior when
zone_reclaim==1 but if RECLAIM_SWAP is set, then NR_FILE_PAGES is a
reasonable starting figure.
This patch alters how zone_reclaim() works out how many pages it might be
able to reclaim given the current reclaim_mode. If RECLAIM_SWAP is set in
the reclaim_mode it will either consider NR_FILE_PAGES as potential
candidates or else use NR_{IN}ACTIVE}_PAGES-NR_FILE_MAPPED to discount
swapcache and other non-file-backed pages. If RECLAIM_WRITE is not set,
then NR_FILE_DIRTY number of pages are not candidates. If RECLAIM_SWAP is
not set, then NR_FILE_MAPPED are not.
[kosaki.motohiro@jp.fujitsu.com: Estimate unmapped pages minus tmpfs pages]
[fengguang.wu@intel.com: Fix underflow problem in Kosaki's estimate]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Rik van Riel <riel@redhat.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Commit 2e2e425989 ("vmscan,memcg:
reintroduce sc->may_swap) add may_swap flag and handle it at
get_scan_ratio().
But the result of get_scan_ratio() is ignored when priority == 0, so anon
lru is scanned even if may_swap == 0 or nr_swap_pages == 0. IMHO, this is
not an expected behavior.
As for memcg especially, because of this behavior many and many pages are
swapped-out just in vain when oom is invoked by mem+swap limit.
This patch is for handling may_swap flag more strictly.
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
The "move pages to active list" and "move pages to inactive list" code
blocks are mostly identical and can be served by a function.
Thanks to Andrew Morton for pointing this out.
Note that buffer_heads_over_limit check will also be carried out for
re-activated pages, which is slightly different from pre-2.6.28 kernels.
Also, Rik's "vmscan: evict use-once pages first" patch could totally stop
scans of active file list when memory pressure is low. So the net effect
could be, the number of buffer heads is now more likely to grow large.
However that's fine according to Johannes' comments:
I don't think that this could be harmful. We just preserve the buffer
mappings of what we consider the working set and with low memory
pressure, as you say, this set is not big.
As to stripping of reactivated pages: the only pages we re-activate
for now are those VM_EXEC mapped ones. Since we don't expect IO from
or to these pages, removing the buffer mappings in case they grow too
large should be okay, I guess.
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Protect referenced PROT_EXEC mapped pages from being deactivated.
PROT_EXEC(or its internal presentation VM_EXEC) pages normally belong to some
currently running executables and their linked libraries, they shall really be
cached aggressively to provide good user experiences.
Thanks to Johannes Weiner for the advice to reuse the VMA walk in
page_referenced() to get the PROT_EXEC bit.
[more details]
( The consequences of this patch will have to be discussed together with
Rik van Riel's recent patch "vmscan: evict use-once pages first". )
( Some of the good points and insights are taken into this changelog.
Thanks to all the involved people for the great LKML discussions. )
the problem
===========
For a typical desktop, the most precious working set is composed of
*actively accessed*
(1) memory mapped executables
(2) and their anonymous pages
(3) and other files
(4) and the dcache/icache/.. slabs
while the least important data are
(5) infrequently used or use-once files
For a typical desktop, one major problem is busty and large amount of (5)
use-once files flushing out the working set.
Inside the working set, (4) dcache/icache have already been too sticky ;-)
So we only have to care (2) anonymous and (1)(3) file pages.
anonymous pages
===============
Anonymous pages are effectively immune to the streaming IO attack, because we
now have separate file/anon LRU lists. When the use-once files crowd into the
file LRU, the list's "quality" is significantly lowered. Therefore the scan
balance policy in get_scan_ratio() will choose to scan the (low quality) file
LRU much more frequently than the anon LRU.
file pages
==========
Rik proposed to *not* scan the active file LRU when the inactive list grows
larger than active list. This guarantees that when there are use-once streaming
IO, and the working set is not too large(so that active_size < inactive_size),
the active file LRU will *not* be scanned at all. So the not-too-large working
set can be well protected.
But there are also situations where the file working set is a bit large so that
(active_size >= inactive_size), or the streaming IOs are not purely use-once.
In these cases, the active list will be scanned slowly. Because the current
shrink_active_list() policy is to deactivate active pages regardless of their
referenced bits. The deactivated pages become susceptible to the streaming IO
attack: the inactive list could be scanned fast (500MB / 50MBps = 10s) so that
the deactivated pages don't have enough time to get re-referenced. Because a
user tend to switch between windows in intervals from seconds to minutes.
This patch holds mapped executable pages in the active list as long as they
are referenced during each full scan of the active list. Because the active
list is normally scanned much slower, they get longer grace time (eg. 100s)
for further references, which better matches the pace of user operations.
Therefore this patch greatly prolongs the in-cache time of executable code,
when there are moderate memory pressures.
before patch: guaranteed to be cached if reference intervals < I
after patch: guaranteed to be cached if reference intervals < I+A
(except when randomly reclaimed by the lumpy reclaim)
where
A = time to fully scan the active file LRU
I = time to fully scan the inactive file LRU
Note that normally A >> I.
side effects
============
This patch is safe in general, it restores the pre-2.6.28 mmap() behavior
but in a much smaller and well targeted scope.
One may worry about some one to abuse the PROT_EXEC heuristic. But as
Andrew Morton stated, there are other tricks to getting that sort of boost.
Another concern is the PROT_EXEC mapped pages growing large in rare cases,
and therefore hurting reclaim efficiency. But a sane application targeted for
large audience will never use PROT_EXEC for data mappings. If some home made
application tries to abuse that bit, it shall be aware of the consequences.
If it is abused to scale of 2/3 total memory, it gains nothing but overheads.
benchmarks
==========
1) memory tight desktop
1.1) brief summary
- clock time and major faults are reduced by 50%;
- pswpin numbers are reduced to ~1/3.
That means X desktop responsiveness is doubled under high memory/swap pressure.
1.2) test scenario
- nfsroot gnome desktop with 512M physical memory
- run some programs, and switch between the existing windows
after starting each new program.
1.3) progress timing (seconds)
before after programs
0.02 0.02 N xeyes
0.75 0.76 N firefox
2.02 1.88 N nautilus
3.36 3.17 N nautilus --browser
5.26 4.89 N gthumb
7.12 6.47 N gedit
9.22 8.16 N xpdf /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
13.58 12.55 N xterm
15.87 14.57 N mlterm
18.63 17.06 N gnome-terminal
21.16 18.90 N urxvt
26.24 23.48 N gnome-system-monitor
28.72 26.52 N gnome-help
32.15 29.65 N gnome-dictionary
39.66 36.12 N /usr/games/sol
43.16 39.27 N /usr/games/gnometris
48.65 42.56 N /usr/games/gnect
53.31 47.03 N /usr/games/gtali
58.60 52.05 N /usr/games/iagno
65.77 55.42 N /usr/games/gnotravex
70.76 61.47 N /usr/games/mahjongg
76.15 67.11 N /usr/games/gnome-sudoku
86.32 75.15 N /usr/games/glines
92.21 79.70 N /usr/games/glchess
103.79 88.48 N /usr/games/gnomine
113.84 96.51 N /usr/games/gnotski
124.40 102.19 N /usr/games/gnibbles
137.41 114.93 N /usr/games/gnobots2
155.53 125.02 N /usr/games/blackjack
179.85 135.11 N /usr/games/same-gnome
224.49 154.50 N /usr/bin/gnome-window-properties
248.44 162.09 N /usr/bin/gnome-default-applications-properties
282.62 173.29 N /usr/bin/gnome-at-properties
323.72 188.21 N /usr/bin/gnome-typing-monitor
363.99 199.93 N /usr/bin/gnome-at-visual
394.21 206.95 N /usr/bin/gnome-sound-properties
435.14 224.49 N /usr/bin/gnome-at-mobility
463.05 234.11 N /usr/bin/gnome-keybinding-properties
503.75 248.59 N /usr/bin/gnome-about-me
554.00 276.27 N /usr/bin/gnome-display-properties
615.48 304.39 N /usr/bin/gnome-network-preferences
693.03 342.01 N /usr/bin/gnome-mouse-properties
759.90 388.58 N /usr/bin/gnome-appearance-properties
937.90 508.47 N /usr/bin/gnome-control-center
1109.75 587.57 N /usr/bin/gnome-keyboard-properties
1399.05 758.16 N : oocalc
1524.64 830.03 N : oodraw
1684.31 900.03 N : ooimpress
1874.04 993.91 N : oomath
2115.12 1081.89 N : ooweb
2369.02 1161.99 N : oowriter
Note that the last ": oo*" commands are actually commented out.
1.4) vmstat numbers (some relevant ones are marked with *)
before after
nr_free_pages 1293 3898
nr_inactive_anon 59956 53460
nr_active_anon 26815 30026
nr_inactive_file 2657 3218
nr_active_file 2019 2806
nr_unevictable 4 4
nr_mlock 4 4
nr_anon_pages 26706 27859
*nr_mapped 3542 4469
nr_file_pages 72232 67681
nr_dirty 1 0
nr_writeback 123 19
nr_slab_reclaimable 3375 3534
nr_slab_unreclaimable 11405 10665
nr_page_table_pages 8106 7864
nr_unstable 0 0
nr_bounce 0 0
*nr_vmscan_write 394776 230839
nr_writeback_temp 0 0
numa_hit 6843353 3318676
numa_miss 0 0
numa_foreign 0 0
numa_interleave 1719 1719
numa_local 6843353 3318676
numa_other 0 0
*pgpgin 5954683 2057175
*pgpgout 1578276 922744
*pswpin 1486615 512238
*pswpout 394568 230685
pgalloc_dma 277432 56602
pgalloc_dma32 6769477 3310348
pgalloc_normal 0 0
pgalloc_movable 0 0
pgfree 7048396 3371118
pgactivate 2036343 1471492
pgdeactivate 2189691 1612829
pgfault 3702176 3100702
*pgmajfault 452116 201343
pgrefill_dma 12185 7127
pgrefill_dma32 334384 653703
pgrefill_normal 0 0
pgrefill_movable 0 0
pgsteal_dma 74214 22179
pgsteal_dma32 3334164 1638029
pgsteal_normal 0 0
pgsteal_movable 0 0
pgscan_kswapd_dma 1081421 1216199
pgscan_kswapd_dma32 58979118 46002810
pgscan_kswapd_normal 0 0
pgscan_kswapd_movable 0 0
pgscan_direct_dma 2015438 1086109
pgscan_direct_dma32 55787823 36101597
pgscan_direct_normal 0 0
pgscan_direct_movable 0 0
pginodesteal 3461 7281
slabs_scanned 564864 527616
kswapd_steal 2889797 1448082
kswapd_inodesteal 14827 14835
pageoutrun 43459 21562
allocstall 9653 4032
pgrotated 384216 228631
1.5) free numbers at the end of the tests
before patch:
total used free shared buffers cached
Mem: 474 467 7 0 0 236
-/+ buffers/cache: 230 243
Swap: 1023 418 605
after patch:
total used free shared buffers cached
Mem: 474 457 16 0 0 236
-/+ buffers/cache: 221 253
Swap: 1023 404 619
2) memory flushing in a file server
2.1) brief summary
The number of major faults from 50 to 3 during 10% cache hot reads.
That means this patch successfully stops major faults when the active file
list is slowly scanned when there are partially cache hot streaming IO.
2.2) test scenario
Do 100000 pread(size=110 pages, offset=(i*100) pages), where 10% of the
pages will be activated:
for i in `seq 0 100 10000000`; do echo $i 110; done > pattern-hot-10
iotrace.rb --load pattern-hot-10 --play /b/sparse
vmmon nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree
and monitor /proc/vmstat during the time. The test box has 2G memory.
I carried out tests on fresh booted console as well as X desktop, and
fetched the vmstat numbers on
(1) begin: shortly after the big read IO starts;
(2) end: just before the big read IO stops;
(3) restore: the big read IO stops and the zsh working set restored
(4) restore X: after IO, switch back and forth between the urxvt and firefox
windows to restore their working set.
2.3) console mode results
nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree
2.6.29 VM_EXEC protection ON:
begin: 2481 2237 8694 630 0 574299
end: 275 231976 233914 633 776271 20933042
restore: 370 232154 234524 691 777183 20958453
2.6.29 VM_EXEC protection ON (second run):
begin: 2434 2237 8493 629 0 574195
end: 284 231970 233536 632 771918 20896129
restore: 399 232218 234789 690 774526 20957909
2.6.30-rc4-mm VM_EXEC protection OFF:
begin: 2479 2344 9659 210 0 579643
end: 284 232010 234142 260 772776 20917184
restore: 379 232159 234371 301 774888 20967849
The above console numbers show that
- The startup pgmajfault of 2.6.30-rc4-mm is merely 1/3 that of 2.6.29.
I'd attribute that improvement to the mmap readahead improvements :-)
- The pgmajfault increment during the file copy is 633-630=3 vs 260-210=50.
That's a huge improvement - which means with the VM_EXEC protection logic,
active mmap pages is pretty safe even under partially cache hot streaming IO.
- when active:inactive file lru size reaches 1:1, their scan rates is 1:20.8
under 10% cache hot IO. (computed with formula Dpgdeactivate:Dpgfree)
That roughly means the active mmap pages get 20.8 more chances to get
re-referenced to stay in memory.
- The absolute nr_mapped drops considerably to 1/9 during the big IO, and the
dropped pages are mostly inactive ones. The patch has almost no impact in
this aspect, that means it won't unnecessarily increase memory pressure.
(In contrast, your 20% mmap protection ratio will keep them all, and
therefore eliminate the extra 41 major faults to restore working set
of zsh etc.)
The iotrace.rb read throughput is
151.194384MB/s 284.198252s 100001x 450560b --load pattern-hot-10 --play /b/sparse
which means the inactive list is rotated at the speed of 250MB/s,
so a full scan of which takes about 3.5 seconds, while a full scan
of active file list takes about 77 seconds.
2.4) X mode results
We can reach roughly the same conclusions for X desktop:
nr_mapped nr_active_file nr_inactive_file pgmajfault pgdeactivate pgfree
2.6.30-rc4-mm VM_EXEC protection ON:
begin: 9740 8920 64075 561 0 678360
end: 768 218254 220029 565 798953 21057006
restore: 857 218543 220987 606 799462 21075710
restore X: 2414 218560 225344 797 799462 21080795
2.6.30-rc4-mm VM_EXEC protection OFF:
begin: 9368 5035 26389 554 0 633391
end: 770 218449 221230 661 646472 17832500
restore: 1113 218466 220978 710 649881 17905235
restore X: 2687 218650 225484 947 802700 21083584
- the absolute nr_mapped drops considerably (to 1/13 of the original size)
during the streaming IO.
- the delta of pgmajfault is 3 vs 107 during IO, or 236 vs 393
during the whole process.
Cc: Elladan <elladan@eskimo.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Collect vma->vm_flags of the VMAs that actually referenced the page.
This is preparing for more informed reclaim heuristics, eg. to protect
executable file pages more aggressively. For now only the VM_EXEC bit
will be used by the caller.
Thanks to Johannes, Peter and Minchan for all the good tips.
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>