Original development tree for the Linux kernel GTP module; now long in mainline.

1123 lines
30 KiB

ext4: improve extent cache shrink mechanism to avoid burning CPU time

Now we maintain a proper in-order LRU list in ext4 to reclaim entries from
the extent status tree when we are under heavy memory pressure. To keep
this order, a spin lock is used to protect the list, but this lock burns a
lot of CPU time. We can use the following steps to trigger it:

% cd /dev/shm
% dd if=/dev/zero of=ext4-img bs=1M count=2k
% mkfs.ext4 ext4-img
% mount -t ext4 -o loop ext4-img /mnt
% cd /mnt
% for ((i=0;i<160;i++)); do truncate -s 64g $i; done
% for ((i=0;i<160;i++)); do cp $i /dev/null &; done
% perf record -a -g
% perf report

This commit tries to fix this problem. A new member called i_touch_when is
added into ext4_inode_info to record the last access time of an inode.
Meanwhile we no longer need to keep the LRU list in proper order at all
times, which avoids burning CPU time. When we try to reclaim some entries
from the extent status tree, we use list_sort() to get a proper in-order
list and then traverse it to discard some entries. In ext4_sb_info, we use
s_es_last_sorted to record the last time this list was sorted. When we
traverse the list, we skip inodes that are newer than this time and move
them to the tail of the LRU list. When the head of the list is newer than
s_es_last_sorted, we sort the LRU list again.

In this commit, we break the loop if s_extent_cache_cnt == 0 because that
means that all extents in the extent status tree have been reclaimed.
Meanwhile this commit changes the prototype of
ext4_es_{un}register_shrinker() to save a local variable in these
functions.

Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
8 years ago
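For context, the lazy-sort reclaim the commit message describes can be sketched as below. This is a minimal illustration, not the exact patch: i_touch_when, s_es_last_sorted and list_sort() are named in the commit, while the comparator name es_lru_cmp(), the list head s_es_lru and the i_es_lru linkage are assumptions for illustration; the comparator signature matches the list_sort() of that kernel era.

/* Minimal sketch of the lazy LRU re-sort (assumed names noted above). */
#include <linux/list_sort.h>
#include <linux/jiffies.h>

static int es_lru_cmp(void *priv, struct list_head *a, struct list_head *b)
{
        struct ext4_inode_info *eia, *eib;

        eia = list_entry(a, struct ext4_inode_info, i_es_lru);
        eib = list_entry(b, struct ext4_inode_info, i_es_lru);

        if (eia->i_touch_when == eib->i_touch_when)
                return 0;
        /* inodes touched earlier sort towards the head of the list */
        return time_after(eia->i_touch_when, eib->i_touch_when) ? 1 : -1;
}

static void es_lru_resort(struct ext4_sb_info *sbi)
{
        /* only the shrinker sorts; ordinary accesses merely stamp
         * i_touch_when, so no spin lock is burned on every access */
        list_sort(NULL, &sbi->s_es_lru, es_lru_cmp);
        sbi->s_es_last_sorted = jiffies;
}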
fs: convert fs shrinkers to new scan/count API

Convert the filesystem shrinkers to use the new API, and standardise some
of the behaviours of the shrinkers at the same time. For example,
nr_to_scan means the number of objects to scan, not the number of objects
to free.

I refactored the CIFS idmap shrinker a little - it really needs to be
broken up into a shrinker per tree and keep an item count with the tree
root so that we don't need to walk the tree every time the shrinker needs
to count the number of objects in the tree (i.e. all the time under
memory pressure).

[glommer@openvz.org: fixes for ext4, ubifs, nfs, cifs and glock. Fixes
are needed mainly due to new code merged in the tree]
[assorted fixes folded in]

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Glauber Costa <glommer@openvz.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
Acked-by: Jan Kara <jack@suse.cz>
Acked-by: Steven Whitehouse <swhiteho@redhat.com>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: "Theodore Ts'o" <tytso@mit.edu>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arve Hjønnevåg <arve@android.com>
Cc: Carlos Maiolino <cmaiolino@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Chuck Lever <chuck.lever@oracle.com>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Rientjes <rientjes@google.com>
Cc: Gleb Natapov <gleb@redhat.com>
Cc: Greg Thelen <gthelen@google.com>
Cc: J. Bruce Fields <bfields@redhat.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Stultz <john.stultz@linaro.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kent Overstreet <koverstreet@google.com>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Thomas Hellstrom <thellstrom@vmware.com>
Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
8 years ago
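The scan/count split this series introduces replaces the old single-callback shrink() interface: count_objects() gives a cheap estimate of reclaimable objects and scan_objects() does the actual freeing. A minimal sketch of the new API follows; the my_cache_* helpers are hypothetical stand-ins for a real object cache.

#include <linux/shrinker.h>

static unsigned long my_cache_count(struct shrinker *shrink,
                                    struct shrink_control *sc)
{
        /* return a cheap estimate of how many objects could be freed */
        return my_cache_nr_objects();           /* hypothetical helper */
}

static unsigned long my_cache_scan(struct shrinker *shrink,
                                   struct shrink_control *sc)
{
        /* free up to sc->nr_to_scan objects; report how many were freed */
        return my_cache_free_objects(sc->nr_to_scan);   /* hypothetical */
}

static struct shrinker my_cache_shrinker = {
        .count_objects  = my_cache_count,
        .scan_objects   = my_cache_scan,
        .seeks          = DEFAULT_SEEKS,
};

/* at init time: register_shrinker(&my_cache_shrinker); */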
/*
 * fs/ext4/extents_status.c
 *
 * Written by Yongqiang Yang <xiaoqiangnk@gmail.com>
 * Modified by
 *      Allison Henderson <achender@linux.vnet.ibm.com>
 *      Hugh Dickins <hughd@google.com>
 *      Zheng Liu <wenqing.lz@taobao.com>
 *
 * Ext4 extents status tree core functions.
 */
#include <linux/rbtree.h>
#include <linux/list_sort.h>
#include "ext4.h"
#include "extents_status.h"
#include <trace/events/ext4.h>
/*
 * According to previous discussion in the Ext4 Developer Workshop, we
 * will introduce a new structure called the io tree to track all extent
 * status in order to solve some problems that we have met
 * (e.g. reservation space warning), and provide extent-level locking.
 * The delay extent tree is the first step to achieve this goal.  It was
 * originally built by Yongqiang Yang.  At that time it was called the
 * delay extent tree, whose goal was only to track delayed extents in
 * memory to simplify the implementation of fiemap and bigalloc, and to
 * introduce lseek SEEK_DATA/SEEK_HOLE support.  That is why it is still
 * called the delay extent tree in the first commit.  But to better
 * describe what it does, it has been renamed to the extent status tree.
 *
 * Step1:
 * Currently the first step has been done.  All delayed extents are
 * tracked in the tree.  It maintains the delayed extent when a delayed
 * allocation is issued, and when the delayed extent is written out or
 * invalidated.  Therefore the implementations of fiemap and bigalloc
 * are simplified, and SEEK_DATA/SEEK_HOLE are introduced.
 *
 * The following comment describes the implementation of the extent
 * status tree and future work.
 *
 * Step2:
 * In this step all extent status is tracked by the extent status tree.
 * Thus, we can first try to look up a block mapping in this tree before
 * finding it in the extent tree.  Hence, the single extent cache can be
 * removed because the extent status tree can do a better job.  Extents
 * in the status tree are loaded on demand.  Therefore, the extent
 * status tree may not contain all of the extents in a file.  Meanwhile
 * we define a shrinker to reclaim memory from the extent status tree
 * because a fragmented extent tree will make the status tree cost too
 * much memory.  Written/unwritten/hole extents in the tree will be
 * reclaimed by this shrinker when we are under high memory pressure.
 * Delayed extents will not be reclaimed because fiemap, bigalloc, and
 * seek_data/hole need them.
 */
/*
 * Extent status tree implementation for ext4.
 *
 *
 * ==========================================================================
 * The extent status tree tracks all extent status.
 *
 * 1. Why do we need to implement an extent status tree?
 *
 * Without the extent status tree, ext4 identifies a delayed extent by
 * looking up the page cache; this has several deficiencies - complicated,
 * buggy, and inefficient code.
 *
 * FIEMAP, SEEK_HOLE/DATA, bigalloc, and writeout all need to know if a
 * block or a range of blocks belongs to a delayed extent.
 *
 * Let us have a look at how they work without the extent status tree.
 *
 * --   FIEMAP
 *      FIEMAP looks up the page cache to identify delayed allocations
 *      from holes.
 *
 * --   SEEK_HOLE/DATA
 *      SEEK_HOLE/DATA has the same problem as FIEMAP.
 *
 * --   bigalloc
 *      bigalloc looks up the page cache to figure out if a block is
 *      already under delayed allocation or not, to determine whether
 *      quota reservation is needed for the cluster.
 *
 * --   writeout
 *      Writeout looks up the whole page cache to see if a buffer is
 *      mapped.  If there are not very many delayed buffers, this is
 *      time consuming.
 *
 * With the extent status tree implementation, FIEMAP, SEEK_HOLE/DATA,
 * bigalloc and writeout can figure out if a block or a range of
 * blocks is under delayed allocation (i.e. belongs to a delayed extent)
 * or not by searching the extent tree.
 *
 *
 * ==========================================================================
 * 2. Ext4 extent status tree implementation
 *
 * --   extent
 *      An extent is a range of blocks which are contiguous logically and
 *      physically.  Unlike an extent in the extent tree, this extent in
 *      ext4 is an in-memory struct; there is no corresponding on-disk
 *      data.  There is no limit on the length of an extent, so an extent
 *      can contain as many blocks as are contiguous logically and
 *      physically.
 *
 * --   extent status tree
 *      Every inode has an extent status tree and all allocated blocks
 *      are added to the tree with different statuses.  The extents in
 *      the tree are ordered by logical block number.
 *
 * --   operations on an extent status tree
 *      There are three important operations on a delayed extent tree:
 *      finding the next extent, adding an extent (a range of blocks),
 *      and removing an extent.
 *
 * --   race on an extent status tree
 *      The extent status tree is protected by inode->i_es_lock.
 *
 * --   memory consumption
 *      A fragmented extent tree will make the extent status tree cost
 *      too much memory.  Hence, we will reclaim written/unwritten/hole
 *      extents from the tree under heavy memory pressure.
 *
 *
 * ==========================================================================
 * 3. Performance analysis
 *
 * --   overhead
 *      1. There is a cached extent for write access, so if writes are
 *      not very random, adding space operations run in O(1) time.
 *
 * --   gain
 *      2. Code is much simpler, more readable, more maintainable and
 *      more efficient.
 *
 *
 * ==========================================================================
 * 4. TODO list
 *
 *   -- Refactor delayed space reservation
 *
 *   -- Extent-level locking
 */
static struct kmem_cache *ext4_es_cachep;

static int __es_insert_extent(struct inode *inode, struct extent_status *newes);
static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
                              ext4_lblk_t end);
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
                                       int nr_to_scan);
static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
                            struct ext4_inode_info *locked_ei);

int __init ext4_init_es(void)
{
        ext4_es_cachep = kmem_cache_create("ext4_extent_status",
                                           sizeof(struct extent_status),
                                           0, (SLAB_RECLAIM_ACCOUNT), NULL);
        if (ext4_es_cachep == NULL)
                return -ENOMEM;
        return 0;
}

void ext4_exit_es(void)
{
        if (ext4_es_cachep)
                kmem_cache_destroy(ext4_es_cachep);
}

void ext4_es_init_tree(struct ext4_es_tree *tree)
{
        tree->root = RB_ROOT;
        tree->cache_es = NULL;
}
#ifdef ES_DEBUG__
static void ext4_es_print_tree(struct inode *inode)
{
        struct ext4_es_tree *tree;
        struct rb_node *node;

        printk(KERN_DEBUG "status extents for inode %lu:", inode->i_ino);
        tree = &EXT4_I(inode)->i_es_tree;
        node = rb_first(&tree->root);
        while (node) {
                struct extent_status *es;
                es = rb_entry(node, struct extent_status, rb_node);
                printk(KERN_DEBUG " [%u/%u) %llu %llx",
                       es->es_lblk, es->es_len,
                       ext4_es_pblock(es), ext4_es_status(es));
                node = rb_next(node);
        }
        printk(KERN_DEBUG "\n");
}
#else
#define ext4_es_print_tree(inode)
#endif

static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
{
        BUG_ON(es->es_lblk + es->es_len < es->es_lblk);
        return es->es_lblk + es->es_len - 1;
}
/*
 * Search through the tree for a delayed extent with a given offset.  If
 * it can't be found, try to find the next extent.
 */
static struct extent_status *__es_tree_search(struct rb_root *root,
                                              ext4_lblk_t lblk)
{
        struct rb_node *node = root->rb_node;
        struct extent_status *es = NULL;

        while (node) {
                es = rb_entry(node, struct extent_status, rb_node);
                if (lblk < es->es_lblk)
                        node = node->rb_left;
                else if (lblk > ext4_es_end(es))
                        node = node->rb_right;
                else
                        return es;
        }

        if (es && lblk < es->es_lblk)
                return es;

        if (es && lblk > ext4_es_end(es)) {
                node = rb_next(&es->rb_node);
                return node ? rb_entry(node, struct extent_status, rb_node) :
                              NULL;
        }

        return NULL;
}
/*
 * ext4_es_find_delayed_extent_range: find the 1st delayed extent covering
 * @es->lblk if it exists, otherwise, the next extent after @es->lblk.
 *
 * @inode: the inode which owns delayed extents
 * @lblk: the offset where we start to search
 * @end: the offset where we stop to search
 * @es: delayed extent that we found
 */
void ext4_es_find_delayed_extent_range(struct inode *inode,
                                       ext4_lblk_t lblk, ext4_lblk_t end,
                                       struct extent_status *es)
{
        struct ext4_es_tree *tree = NULL;
        struct extent_status *es1 = NULL;
        struct rb_node *node;

        BUG_ON(es == NULL);
        BUG_ON(end < lblk);
        trace_ext4_es_find_delayed_extent_range_enter(inode, lblk);

        read_lock(&EXT4_I(inode)->i_es_lock);
        tree = &EXT4_I(inode)->i_es_tree;

        /* look for the extent in the cache first */
        es->es_lblk = es->es_len = es->es_pblk = 0;
        if (tree->cache_es) {
                es1 = tree->cache_es;
                if (in_range(lblk, es1->es_lblk, es1->es_len)) {
                        es_debug("%u cached by [%u/%u) %llu %x\n",
                                 lblk, es1->es_lblk, es1->es_len,
                                 ext4_es_pblock(es1), ext4_es_status(es1));
                        goto out;
                }
        }

        es1 = __es_tree_search(&tree->root, lblk);

out:
        if (es1 && !ext4_es_is_delayed(es1)) {
                while ((node = rb_next(&es1->rb_node)) != NULL) {
                        es1 = rb_entry(node, struct extent_status, rb_node);
                        if (es1->es_lblk > end) {
                                es1 = NULL;
                                break;
                        }
                        if (ext4_es_is_delayed(es1))
                                break;
                }
        }

        if (es1 && ext4_es_is_delayed(es1)) {
                tree->cache_es = es1;
                es->es_lblk = es1->es_lblk;
                es->es_len = es1->es_len;
                es->es_pblk = es1->es_pblk;
        }

        read_unlock(&EXT4_I(inode)->i_es_lock);
        trace_ext4_es_find_delayed_extent_range_exit(inode, es);
}
static struct extent_status *
ext4_es_alloc_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t len,
                     ext4_fsblk_t pblk)
{
        struct extent_status *es;

        es = kmem_cache_alloc(ext4_es_cachep, GFP_ATOMIC);
        if (es == NULL)
                return NULL;
        es->es_lblk = lblk;
        es->es_len = len;
        es->es_pblk = pblk;

        /*
         * We don't count delayed extents because we never try to reclaim
         * them.
         */
        if (!ext4_es_is_delayed(es)) {
                EXT4_I(inode)->i_es_lru_nr++;
                percpu_counter_inc(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
        }

        return es;
}

static void ext4_es_free_extent(struct inode *inode, struct extent_status *es)
{
        /* Decrease the LRU counter when this es is not delayed */
        if (!ext4_es_is_delayed(es)) {
                BUG_ON(EXT4_I(inode)->i_es_lru_nr == 0);
                EXT4_I(inode)->i_es_lru_nr--;
                percpu_counter_dec(&EXT4_SB(inode->i_sb)->s_extent_cache_cnt);
        }

        kmem_cache_free(ext4_es_cachep, es);
}
/*
 * Check whether or not two extents can be merged.
 * Conditions:
 * - logical block numbers are contiguous
 * - physical block numbers are contiguous
 * - status is equal
 */
static int ext4_es_can_be_merged(struct extent_status *es1,
                                 struct extent_status *es2)
{
        if (ext4_es_status(es1) != ext4_es_status(es2))
                return 0;

        if (((__u64) es1->es_len) + es2->es_len > 0xFFFFFFFFULL)
                return 0;

        if (((__u64) es1->es_lblk) + es1->es_len != es2->es_lblk)
                return 0;

        if ((ext4_es_is_written(es1) || ext4_es_is_unwritten(es1)) &&
            (ext4_es_pblock(es1) + es1->es_len == ext4_es_pblock(es2)))
                return 1;

        if (ext4_es_is_hole(es1))
                return 1;

        /* we need to check that a delayed extent is without the
         * unwritten status */
        if (ext4_es_is_delayed(es1) && !ext4_es_is_unwritten(es1))
                return 1;

        return 0;
}
static struct extent_status *
ext4_es_try_to_merge_left(struct inode *inode, struct extent_status *es)
{
	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
	struct extent_status *es1;
	struct rb_node *node;

	node = rb_prev(&es->rb_node);
	if (!node)
		return es;

	es1 = rb_entry(node, struct extent_status, rb_node);
	if (ext4_es_can_be_merged(es1, es)) {
		es1->es_len += es->es_len;
		rb_erase(&es->rb_node, &tree->root);
		ext4_es_free_extent(inode, es);
		es = es1;
	}

	return es;
}
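
/*
 * Try to merge @es with the extent immediately to its right in the tree.
 * On a successful merge the right neighbour is absorbed into @es and freed.
 */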
static struct extent_status *
ext4_es_try_to_merge_right(struct inode *inode, struct extent_status *es)
{
	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
	struct extent_status *es1;
	struct rb_node *node;

	node = rb_next(&es->rb_node);
	if (!node)
		return es;

	es1 = rb_entry(node, struct extent_status, rb_node);
	if (ext4_es_can_be_merged(es, es1)) {
		es->es_len += es1->es_len;
		rb_erase(node, &tree->root);
		ext4_es_free_extent(inode, es1);
	}

	return es;
}

#ifdef ES_AGGRESSIVE_TEST
#include "ext4_extents.h"	/* Needed when ES_AGGRESSIVE_TEST is defined */
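
/*
 * Self-check for extent-mapped inodes: compare the status of the range
 * being inserted against the on-disk extent tree and warn about any
 * mismatch.  Only compiled in when ES_AGGRESSIVE_TEST is defined.
 */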
static void ext4_es_insert_extent_ext_check(struct inode *inode,
					    struct extent_status *es)
{
	struct ext4_ext_path *path = NULL;
	struct ext4_extent *ex;
	ext4_lblk_t ee_block;
	ext4_fsblk_t ee_start;
	unsigned short ee_len;
	int depth, ee_status, es_status;

	path = ext4_ext_find_extent(inode, es->es_lblk, NULL, EXT4_EX_NOCACHE);
	if (IS_ERR(path))
		return;

	depth = ext_depth(inode);
	ex = path[depth].p_ext;

	if (ex) {
		ee_block = le32_to_cpu(ex->ee_block);
		ee_start = ext4_ext_pblock(ex);
		ee_len = ext4_ext_get_actual_len(ex);

		ee_status = ext4_ext_is_uninitialized(ex) ? 1 : 0;
		es_status = ext4_es_is_unwritten(es) ? 1 : 0;

		/*
		 * Make sure ex and es do not overlap when we try to insert
		 * a delayed/hole extent.
		 */
		if (!ext4_es_is_written(es) && !ext4_es_is_unwritten(es)) {
			if (in_range(es->es_lblk, ee_block, ee_len)) {
				pr_warn("ES insert assertion failed for "
					"inode: %lu we can find an extent "
					"at block [%d/%d/%llu/%c], but we "
					"want to add a delayed/hole extent "
					"[%d/%d/%llu/%llx]\n",
					inode->i_ino, ee_block, ee_len,
					ee_start, ee_status ? 'u' : 'w',
					es->es_lblk, es->es_len,
					ext4_es_pblock(es), ext4_es_status(es));
			}
			goto out;
		}

		/*
		 * We don't check ee_block == es->es_lblk, etc. because es
		 * might be part of a whole extent, or vice versa.
		 */
		if (es->es_lblk < ee_block ||
		    ext4_es_pblock(es) != ee_start + es->es_lblk - ee_block) {
			pr_warn("ES insert assertion failed for inode: %lu "
				"ex_status [%d/%d/%llu/%c] != "
				"es_status [%d/%d/%llu/%c]\n", inode->i_ino,
				ee_block, ee_len, ee_start,
				ee_status ? 'u' : 'w', es->es_lblk, es->es_len,
				ext4_es_pblock(es), es_status ? 'u' : 'w');
			goto out;
		}

		if (ee_status ^ es_status) {
			pr_warn("ES insert assertion failed for inode: %lu "
				"ex_status [%d/%d/%llu/%c] != "
				"es_status [%d/%d/%llu/%c]\n", inode->i_ino,
				ee_block, ee_len, ee_start,
				ee_status ? 'u' : 'w', es->es_lblk, es->es_len,
				ext4_es_pblock(es), es_status ? 'u' : 'w');
		}
	} else {
		/*
		 * We can't find an extent on disk.  So we need to make sure
		 * that we don't want to add a written/unwritten extent.
		 */
		if (!ext4_es_is_delayed(es) && !ext4_es_is_hole(es)) {
			pr_warn("ES insert assertion failed for inode: %lu "
				"can't find an extent at block %d but we want "
				"to add a written/unwritten extent "
				"[%d/%d/%llu/%llx]\n", inode->i_ino,
				es->es_lblk, es->es_lblk, es->es_len,
				ext4_es_pblock(es), ext4_es_status(es));
		}
	}
out:
	if (path) {
		ext4_ext_drop_refs(path);
		kfree(path);
	}
}
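
/*
 * Self-check for indirect-mapped inodes: look the range up through the
 * block-mapping path and warn if the status being inserted contradicts
 * what the mapping reports.
 */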
static void ext4_es_insert_extent_ind_check(struct inode *inode,
					    struct extent_status *es)
{
	struct ext4_map_blocks map;
	int retval;

	/*
	 * Here we call ext4_ind_map_blocks to look up a block mapping
	 * because the 'Indirect' structure is defined in indirect.c, so we
	 * can't access the direct/indirect tree from outside it, and
	 * defining this function in indirect.c would be too intrusive.
	 */
	map.m_lblk = es->es_lblk;
	map.m_len = es->es_len;

	retval = ext4_ind_map_blocks(NULL, inode, &map, 0);
	if (retval > 0) {
		if (ext4_es_is_delayed(es) || ext4_es_is_hole(es)) {
			/*
			 * We want to add a delayed/hole extent but this
			 * block has been allocated.
			 */
			pr_warn("ES insert assertion failed for inode: %lu "
				"We can find blocks but we want to add a "
				"delayed/hole extent [%d/%d/%llu/%llx]\n",
				inode->i_ino, es->es_lblk, es->es_len,
				ext4_es_pblock(es), ext4_es_status(es));
			return;
		} else if (ext4_es_is_written(es)) {
			if (retval != es->es_len) {
				pr_warn("ES insert assertion failed for "
					"inode: %lu retval %d != es_len %d\n",
					inode->i_ino, retval, es->es_len);
				return;
			}
			if (map.m_pblk != ext4_es_pblock(es)) {
				pr_warn("ES insert assertion failed for "
					"inode: %lu m_pblk %llu != "
					"es_pblk %llu\n",
					inode->i_ino, map.m_pblk,
					ext4_es_pblock(es));
				return;
			}
		} else {
			/*
			 * We don't need to check the unwritten case because
			 * an indirect-based file doesn't have unwritten
			 * extents.
			 */
			BUG_ON(1);
		}
	} else if (retval == 0) {
		if (ext4_es_is_written(es)) {
			pr_warn("ES insert assertion failed for inode: %lu "
				"We can't find the block but we want to add "
				"a written extent [%d/%d/%llu/%llx]\n",
				inode->i_ino, es->es_lblk, es->es_len,
				ext4_es_pblock(es), ext4_es_status(es));
			return;
		}
	}
}

static inline void ext4_es_insert_extent_check(struct inode *inode,
					       struct extent_status *es)
{
	/*
	 * We don't need to worry about races because the caller holds
	 * i_data_sem.
	 */
	BUG_ON(!rwsem_is_locked(&EXT4_I(inode)->i_data_sem));
	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
		ext4_es_insert_extent_ext_check(inode, es);
	else
		ext4_es_insert_extent_ind_check(inode, es);
}
#else
static inline void ext4_es_insert_extent_check(struct inode *inode,
					       struct extent_status *es)
{
}
#endif
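
/*
 * Insert @newes into the status tree, merging it into an existing
 * neighbour when the two ranges are contiguous and have equal status.
 * The caller must hold i_es_lock for writing.
 */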
static int __es_insert_extent(struct inode *inode, struct extent_status *newes)
{
	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
	struct rb_node **p = &tree->root.rb_node;
	struct rb_node *parent = NULL;
	struct extent_status *es;

	while (*p) {
		parent = *p;
		es = rb_entry(parent, struct extent_status, rb_node);

		if (newes->es_lblk < es->es_lblk) {
			if (ext4_es_can_be_merged(newes, es)) {
				/*
				 * Here we can modify es_lblk directly
				 * because the two ranges don't overlap.
				 */
				es->es_lblk = newes->es_lblk;
				es->es_len += newes->es_len;
				if (ext4_es_is_written(es) ||
				    ext4_es_is_unwritten(es))
					ext4_es_store_pblock(es,
							     newes->es_pblk);
				es = ext4_es_try_to_merge_left(inode, es);
				goto out;
			}
			p = &(*p)->rb_left;
		} else if (newes->es_lblk > ext4_es_end(es)) {
			if (ext4_es_can_be_merged(es, newes)) {
				es->es_len += newes->es_len;
				es = ext4_es_try_to_merge_right(inode, es);
				goto out;
			}
			p = &(*p)->rb_right;
		} else {
			BUG_ON(1);
			return -EINVAL;
		}
	}

	es = ext4_es_alloc_extent(inode, newes->es_lblk, newes->es_len,
				  newes->es_pblk);
	if (!es)
		return -ENOMEM;
	rb_link_node(&es->rb_node, parent, p);
	rb_insert_color(&es->rb_node, &tree->root);

out:
	tree->cache_es = es;
	return 0;
}

/*
 * ext4_es_insert_extent() adds information to an inode's extent
 * status tree.
 *
 * Return 0 on success, error code on failure.
 */
int ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
			  ext4_lblk_t len, ext4_fsblk_t pblk,
			  unsigned int status)
{
	struct extent_status newes;
	ext4_lblk_t end = lblk + len - 1;
	int err = 0;

	es_debug("add [%u/%u) %llu %x to extent status tree of inode %lu\n",
		 lblk, len, pblk, status, inode->i_ino);

	if (!len)
		return 0;

	BUG_ON(end < lblk);

	newes.es_lblk = lblk;
	newes.es_len = len;
	ext4_es_store_pblock(&newes, pblk);
	ext4_es_store_status(&newes, status);
	trace_ext4_es_insert_extent(inode, &newes);

	ext4_es_insert_extent_check(inode, &newes);

	write_lock(&EXT4_I(inode)->i_es_lock);
	err = __es_remove_extent(inode, lblk, end);
	if (err != 0)
		goto error;
retry:
	err = __es_insert_extent(inode, &newes);
	if (err == -ENOMEM && __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
					       EXT4_I(inode)))
		goto retry;
	if (err == -ENOMEM && !ext4_es_is_delayed(&newes))
		err = 0;

error:
	write_unlock(&EXT4_I(inode)->i_es_lock);

	ext4_es_print_tree(inode);

	return err;
}

/*
 * ext4_es_cache_extent() inserts information into the extent status
 * tree if and only if there isn't information about the range in
 * question already.
 */
void ext4_es_cache_extent(struct inode *inode, ext4_lblk_t lblk,
			  ext4_lblk_t len, ext4_fsblk_t pblk,
			  unsigned int status)
{
	struct extent_status *es;
	struct extent_status newes;
	ext4_lblk_t end = lblk + len - 1;

	newes.es_lblk = lblk;
	newes.es_len = len;
	ext4_es_store_pblock(&newes, pblk);
	ext4_es_store_status(&newes, status);
	trace_ext4_es_cache_extent(inode, &newes);

	if (!len)
		return;

	BUG_ON(end < lblk);

	write_lock(&EXT4_I(inode)->i_es_lock);

	es = __es_tree_search(&EXT4_I(inode)->i_es_tree.root, lblk);
	if (!es || es->es_lblk > end)
		__es_insert_extent(inode, &newes);
	write_unlock(&EXT4_I(inode)->i_es_lock);
}

/*
 * ext4_es_lookup_extent() looks up an extent in the extent status tree.
 *
 * ext4_es_lookup_extent is called by ext4_map_blocks/ext4_da_map_blocks.
 *
 * Return: 1 if found, 0 if not
 */
int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk,
			  struct extent_status *es)
{
	struct ext4_es_tree *tree;
	struct extent_status *es1 = NULL;
	struct rb_node *node;
	int found = 0;

	trace_ext4_es_lookup_extent_enter(inode, lblk);
	es_debug("lookup extent in block %u\n", lblk);

	tree = &EXT4_I(inode)->i_es_tree;
	read_lock(&EXT4_I(inode)->i_es_lock);

	/* find extent in cache first */
	es->es_lblk = es->es_len = es->es_pblk = 0;
	if (tree->cache_es) {
		es1 = tree->cache_es;
		if (in_range(lblk, es1->es_lblk, es1->es_len)) {
			es_debug("%u cached by [%u/%u)\n",
				 lblk, es1->es_lblk, es1->es_len);
			found = 1;
			goto out;
		}
	}

	node = tree->root.rb_node;
	while (node) {
		es1 = rb_entry(node, struct extent_status, rb_node);
		if (lblk < es1->es_lblk)
			node = node->rb_left;
		else if (lblk > ext4_es_end(es1))
			node = node->rb_right;
		else {
			found = 1;
			break;
		}
	}

out:
	if (found) {
		BUG_ON(!es1);
		es->es_lblk = es1->es_lblk;
		es->es_len = es1->es_len;
		es->es_pblk = es1->es_pblk;
	}

	read_unlock(&EXT4_I(inode)->i_es_lock);

	trace_ext4_es_lookup_extent_exit(inode, es, found);
	return found;
}
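
/*
 * Remove the range [lblk, end] from the status tree, shrinking or
 * splitting any extents that straddle the boundaries.  Splitting needs a
 * new tree node, so this can fail with -ENOMEM once the shrinker has been
 * given a chance to free one.  Caller must hold i_es_lock for writing.
 */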
static int __es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
			      ext4_lblk_t end)
{
	struct ext4_es_tree *tree = &EXT4_I(inode)->i_es_tree;
	struct rb_node *node;
	struct extent_status *es;
	struct extent_status orig_es;
	ext4_lblk_t len1, len2;
	ext4_fsblk_t block;
	int err;

retry:
	err = 0;
	es = __es_tree_search(&tree->root, lblk);
	if (!es)
		goto out;
	if (es->es_lblk > end)
		goto out;

	/* Simply invalidate cache_es. */
	tree->cache_es = NULL;

	orig_es.es_lblk = es->es_lblk;
	orig_es.es_len = es->es_len;
	orig_es.es_pblk = es->es_pblk;

	len1 = lblk > es->es_lblk ? lblk - es->es_lblk : 0;
	len2 = ext4_es_end(es) > end ? ext4_es_end(es) - end : 0;
	if (len1 > 0)
		es->es_len = len1;
	if (len2 > 0) {
		if (len1 > 0) {
			struct extent_status newes;

			newes.es_lblk = end + 1;
			newes.es_len = len2;
			if (ext4_es_is_written(&orig_es) ||
			    ext4_es_is_unwritten(&orig_es)) {
				block = ext4_es_pblock(&orig_es) +
					orig_es.es_len - len2;
				ext4_es_store_pblock(&newes, block);
			}
			ext4_es_store_status(&newes, ext4_es_status(&orig_es));
			err = __es_insert_extent(inode, &newes);
			if (err) {
				es->es_lblk = orig_es.es_lblk;
				es->es_len = orig_es.es_len;
				if ((err == -ENOMEM) &&
				    __ext4_es_shrink(EXT4_SB(inode->i_sb), 1,
						     EXT4_I(inode)))
					goto retry;
				goto out;
			}
		} else {
			es->es_lblk = end + 1;
			es->es_len = len2;
			if (ext4_es_is_written(es) ||
			    ext4_es_is_unwritten(es)) {
				block = orig_es.es_pblk + orig_es.es_len - len2;
				ext4_es_store_pblock(es, block);
			}
		}
		goto out;
	}

	if (len1 > 0) {
		node = rb_next(&es->rb_node);
		if (node)
			es = rb_entry(node, struct extent_status, rb_node);
		else
			es = NULL;
	}

	while (es && ext4_es_end(es) <= end) {
		node = rb_next(&es->rb_node);
		rb_erase(&es->rb_node, &tree->root);
		ext4_es_free_extent(inode, es);
		if (!node) {
			es = NULL;
			break;
		}
		es = rb_entry(node, struct extent_status, rb_node);
	}

	if (es && es->es_lblk < end + 1) {
		ext4_lblk_t orig_len = es->es_len;

		len1 = ext4_es_end(es) - end;
		es->es_lblk = end + 1;
		es->es_len = len1;
		if (ext4_es_is_written(es) || ext4_es_is_unwritten(es)) {
			block = es->es_pblk + orig_len - len1;
			ext4_es_store_pblock(es, block);
		}
	}

out:
	return err;
}

/*
 * ext4_es_remove_extent() removes a block range from the extent status tree.
 *
 * Return 0 on success, error code on failure.
 */
int ext4_es_remove_extent(struct inode *inode, ext4_lblk_t lblk,
			  ext4_lblk_t len)
{
	ext4_lblk_t end;
	int err = 0;

	trace_ext4_es_remove_extent(inode, lblk, len);
	es_debug("remove [%u/%u) from extent status tree of inode %lu\n",
		 lblk, len, inode->i_ino);

	if (!len)
		return err;

	end = lblk + len - 1;
	BUG_ON(end < lblk);

	write_lock(&EXT4_I(inode)->i_es_lock);
	err = __es_remove_extent(inode, lblk, end);
	write_unlock(&EXT4_I(inode)->i_es_lock);
	ext4_es_print_tree(inode);
	return err;
}
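
/*
 * list_sort() comparator: sort precached inodes towards the tail of the
 * LRU list so they are reclaimed last, and order the rest by
 * i_touch_when, least recently used first.
 */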
static int ext4_inode_touch_time_cmp(void *priv, struct list_head *a,
				     struct list_head *b)
{
	struct ext4_inode_info *eia, *eib;
	eia = list_entry(a, struct ext4_inode_info, i_es_lru);
	eib = list_entry(b, struct ext4_inode_info, i_es_lru);

	if (ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
	    !ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
		return 1;
	if (!ext4_test_inode_state(&eia->vfs_inode, EXT4_STATE_EXT_PRECACHED) &&
	    ext4_test_inode_state(&eib->vfs_inode, EXT4_STATE_EXT_PRECACHED))
		return -1;
	if (eia->i_touch_when == eib->i_touch_when)
		return 0;
	if (time_after(eia->i_touch_when, eib->i_touch_when))
		return 1;
	else
		return -1;
}
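
/*
 * Core of the shrinker: walk the (lazily sorted) LRU list and reclaim up
 * to @nr_to_scan reclaimable extents.  @locked_ei, if non-NULL, is an
 * inode whose i_es_lock the caller already holds; it is skipped during
 * the walk and only reclaimed from directly as a last resort.
 */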
static int __ext4_es_shrink(struct ext4_sb_info *sbi, int nr_to_scan,
			    struct ext4_inode_info *locked_ei)
{
	struct ext4_inode_info *ei;
	struct list_head *cur, *tmp;
	LIST_HEAD(skipped);
	int nr_shrunk = 0;
	int retried = 0, skip_precached = 1, nr_skipped = 0;

	spin_lock(&sbi->s_es_lru_lock);

retry:
	list_for_each_safe(cur, tmp, &sbi->s_es_lru) {
		int shrunk;

		/*
		 * If we have already reclaimed all extents from the extent
		 * status tree, just stop the loop immediately.
		 */
		if (percpu_counter_read_positive(&sbi->s_extent_cache_cnt) == 0)
			break;

		ei = list_entry(cur, struct ext4_inode_info, i_es_lru);

		/*
		 * Skip any inode that was touched after the last sort.
		 * Normally we try hard to avoid shrinking precached
		 * inodes, but we will as a last resort.
		 */
		if ((sbi->s_es_last_sorted < ei->i_touch_when) ||
		    (skip_precached && ext4_test_inode_state(&ei->vfs_inode,
						EXT4_STATE_EXT_PRECACHED))) {
			nr_skipped++;
			list_move_tail(cur, &skipped);
			continue;
		}

		if (ei->i_es_lru_nr == 0 || ei == locked_ei)
			continue;

		write_lock(&ei->i_es_lock);
		shrunk = __es_try_to_reclaim_extents(ei, nr_to_scan);
		if (ei->i_es_lru_nr == 0)
			list_del_init(&ei->i_es_lru);
		write_unlock(&ei->i_es_lock);

		nr_shrunk += shrunk;
		nr_to_scan -= shrunk;
		if (nr_to_scan == 0)
			break;
	}

	/* Move the newer inodes into the tail of the LRU list. */
	list_splice_tail(&skipped, &sbi->s_es_lru);
	INIT_LIST_HEAD(&skipped);

	/*
	 * If we skipped any inodes, and we weren't able to make any
	 * forward progress, sort the list and try again.
	 */
	if ((nr_shrunk == 0) && nr_skipped && !retried) {
		retried++;
		list_sort(NULL, &sbi->s_es_lru, ext4_inode_touch_time_cmp);
		sbi->s_es_last_sorted = jiffies;
		ei = list_first_entry(&sbi->s_es_lru, struct ext4_inode_info,
				      i_es_lru);
		/*
		 * If there are no non-precached inodes left on the
		 * list, start releasing precached extents.
		 */
		if (ext4_test_inode_state(&ei->vfs_inode,
					  EXT4_STATE_EXT_PRECACHED))
			skip_precached = 0;
		goto retry;
	}

	spin_unlock(&sbi->s_es_lru_lock);

	if (locked_ei && nr_shrunk == 0)
		nr_shrunk = __es_try_to_reclaim_extents(locked_ei, nr_to_scan);

	return nr_shrunk;
}
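
/*
 * Shrinker callbacks: ->count_objects reports how many reclaimable
 * extents are cached, ->scan_objects actually reclaims them.
 */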
static unsigned long ext4_es_count(struct shrinker *shrink,
				   struct shrink_control *sc)
{
	unsigned long nr;
	struct ext4_sb_info *sbi;

	sbi = container_of(shrink, struct ext4_sb_info, s_es_shrinker);
	nr = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
	trace_ext4_es_shrink_enter(sbi->s_sb, sc->nr_to_scan, nr);
	return nr;
}

static unsigned long ext4_es_scan(struct shrinker *shrink,
				  struct shrink_control *sc)
{
	struct ext4_sb_info *sbi = container_of(shrink,
					struct ext4_sb_info, s_es_shrinker);
	int nr_to_scan = sc->nr_to_scan;
	int ret, nr_shrunk;

	ret = percpu_counter_read_positive(&sbi->s_extent_cache_cnt);
	trace_ext4_es_shrink_enter(sbi->s_sb, nr_to_scan, ret);

	if (!nr_to_scan)
		return ret;

	nr_shrunk = __ext4_es_shrink(sbi, nr_to_scan, NULL);

	trace_ext4_es_shrink_exit(sbi->s_sb, nr_shrunk, ret);
	return nr_shrunk;
}
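
/*
 * Hook the extent status cache into the MM shrinker infrastructure at
 * mount time, and unhook it again at umount time.
 */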
void ext4_es_register_shrinker(struct ext4_sb_info *sbi)
{
	INIT_LIST_HEAD(&sbi->s_es_lru);
	spin_lock_init(&sbi->s_es_lru_lock);
	sbi->s_es_last_sorted = 0;
	sbi->s_es_shrinker.scan_objects = ext4_es_scan;
	sbi->s_es_shrinker.count_objects = ext4_es_count;
	sbi->s_es_shrinker.seeks = DEFAULT_SEEKS;
	register_shrinker(&sbi->s_es_shrinker);
}

void ext4_es_unregister_shrinker(struct ext4_sb_info *sbi)
{
	unregister_shrinker(&sbi->s_es_shrinker);
}
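
/*
 * ext4_es_lru_add() refreshes an inode's i_touch_when timestamp and adds
 * it to the per-sb LRU list if it isn't already there;
 * ext4_es_lru_del() removes it again.
 */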
void ext4_es_lru_add(struct inode *inode)
{
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

	ei->i_touch_when = jiffies;

	if (!list_empty(&ei->i_es_lru))
		return;

	spin_lock(&sbi->s_es_lru_lock);
	if (list_empty(&ei->i_es_lru))
		list_add_tail(&ei->i_es_lru, &sbi->s_es_lru);
	spin_unlock(&sbi->s_es_lru_lock);
}

void ext4_es_lru_del(struct inode *inode)
{
	struct ext4_inode_info *ei = EXT4_I(inode);
	struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

	spin_lock(&sbi->s_es_lru_lock);
	if (!list_empty(&ei->i_es_lru))
		list_del_init(&ei->i_es_lru);
	spin_unlock(&sbi->s_es_lru_lock);
}
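
/*
 * Walk one inode's status tree and free up to @nr_to_scan non-delayed
 * extents.  Caller must hold ei->i_es_lock for writing.  Returns the
 * number of extents reclaimed.
 */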
static int __es_try_to_reclaim_extents(struct ext4_inode_info *ei,
				       int nr_to_scan)
{
	struct inode *inode = &ei->vfs_inode;
	struct ext4_es_tree *tree = &ei->i_es_tree;
	struct rb_node *node;
	struct extent_status *es;
	unsigned long nr_shrunk = 0;
	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (ei->i_es_lru_nr == 0)
		return 0;

	if (ext4_test_inode_state(inode, EXT4_STATE_EXT_PRECACHED) &&
	    __ratelimit(&_rs))
		ext4_warning(inode->i_sb, "forced shrink of precached extents");

	node = rb_first(&tree->root);
	while (node != NULL) {
		es = rb_entry(node, struct extent_status, rb_node);
		node = rb_next(&es->rb_node);
		/*
		 * We can't reclaim delayed extents from the status tree
		 * because fiemap, bigalloc, and seek_data/hole need to
		 * use them.
		 */
		if (!ext4_es_is_delayed(es)) {
			rb_erase(&es->rb_node, &tree->root);
			ext4_es_free_extent(inode, es);
			nr_shrunk++;
			if (--nr_to_scan == 0)
				break;
		}
	}
	tree->cache_es = NULL;
	return nr_shrunk;
}