original development tree for Linux kernel GTP module; now long in mainline.
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

1837 lines
44 KiB

implement in-kernel gendisk events handling Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
11 years ago
block: fix __blkdev_get and add_disk race condition The following situation might occur: __blkdev_get: add_disk: register_disk() get_gendisk() disk_block_events() disk->ev == NULL disk_add_events() __disk_unblock_events() disk->ev != NULL --ev->block Then we unblock events, when they are suppose to be blocked. This can trigger events related block/genhd.c warnings, but also can crash in sd_check_events() or other places. I'm able to reproduce crashes with the following scripts (with connected usb dongle as sdb disk). <snip> DEV=/dev/sdb ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue function stop_me() { for i in `jobs -p` ; do kill $i 2> /dev/null ; done exit } trap stop_me SIGHUP SIGINT SIGTERM for ((i = 0; i < 10; i++)) ; do while true; do fdisk -l $DEV 2>&1 > /dev/null ; done & done while true ; do echo 1 > $ENABLE sleep 1 echo 0 > $ENABLE done </snip> I use the script to verify patch fixing oops in sd_revalidate_disk http://marc.info/?l=linux-scsi&m=132935572512352&w=2 Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in sd_revalidate_disk" or this one, script easily crash kernel within a few seconds. With both patches applied I do not observe crash. Unfortunately after some time (dozen of minutes), script will hung in: [ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20 [ 1563.906437] [<c04532d5>] msleep+0x15/0x20 [ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0 [ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170 [ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60 [ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0 [ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60 [ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0 [ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage] [ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage] Anyway I think this patch is some step forward. As drawback, I do not teardown on sysfs file create error, because I do not know how to nullify disk->ev (since it can be used). However add_disk error handling practically does not exist too, and things will work without this sysfs file, except events will not be exported to user space. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 years ago
implement in-kernel gendisk events handling Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
11 years ago
block: fix __blkdev_get and add_disk race condition The following situation might occur: __blkdev_get: add_disk: register_disk() get_gendisk() disk_block_events() disk->ev == NULL disk_add_events() __disk_unblock_events() disk->ev != NULL --ev->block Then we unblock events, when they are suppose to be blocked. This can trigger events related block/genhd.c warnings, but also can crash in sd_check_events() or other places. I'm able to reproduce crashes with the following scripts (with connected usb dongle as sdb disk). <snip> DEV=/dev/sdb ENABLE=/sys/bus/usb/devices/1-2/bConfigurationValue function stop_me() { for i in `jobs -p` ; do kill $i 2> /dev/null ; done exit } trap stop_me SIGHUP SIGINT SIGTERM for ((i = 0; i < 10; i++)) ; do while true; do fdisk -l $DEV 2>&1 > /dev/null ; done & done while true ; do echo 1 > $ENABLE sleep 1 echo 0 > $ENABLE done </snip> I use the script to verify patch fixing oops in sd_revalidate_disk http://marc.info/?l=linux-scsi&m=132935572512352&w=2 Without Jun'ichi Nomura patch titled "Fix NULL pointer dereference in sd_revalidate_disk" or this one, script easily crash kernel within a few seconds. With both patches applied I do not observe crash. Unfortunately after some time (dozen of minutes), script will hung in: [ 1563.906432] [<c08354f5>] schedule_timeout_uninterruptible+0x15/0x20 [ 1563.906437] [<c04532d5>] msleep+0x15/0x20 [ 1563.906443] [<c05d60b2>] blk_drain_queue+0x32/0xd0 [ 1563.906447] [<c05d6e00>] blk_cleanup_queue+0xd0/0x170 [ 1563.906454] [<c06d278f>] scsi_free_queue+0x3f/0x60 [ 1563.906459] [<c06d7e6e>] __scsi_remove_device+0x6e/0xb0 [ 1563.906463] [<c06d4aff>] scsi_forget_host+0x4f/0x60 [ 1563.906468] [<c06cd84a>] scsi_remove_host+0x5a/0xf0 [ 1563.906482] [<f7f030fb>] quiesce_and_remove_host+0x5b/0xa0 [usb_storage] [ 1563.906490] [<f7f03203>] usb_stor_disconnect+0x13/0x20 [usb_storage] Anyway I think this patch is some step forward. As drawback, I do not teardown on sysfs file create error, because I do not know how to nullify disk->ev (since it can be used). However add_disk error handling practically does not exist too, and things will work without this sysfs file, except events will not be exported to user space. Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 years ago
block: make gendisk hold a reference to its queue The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 years ago
block: make gendisk hold a reference to its queue The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 years ago
implement in-kernel gendisk events handling Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
11 years ago
implement in-kernel gendisk events handling Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
11 years ago
block: fix ext_devt_idr handling While adding and removing a lot of disks disks and partitions this sometimes shows up: WARNING: at fs/sysfs/dir.c:512 sysfs_add_one+0xc9/0x130() (Not tainted) Hardware name: sysfs: cannot create duplicate filename '/dev/block/259:751' Modules linked in: raid1 autofs4 bnx2fc cnic uio fcoe libfcoe libfc 8021q scsi_transport_fc scsi_tgt garp stp llc sunrpc cpufreq_ondemand powernow_k8 freq_table mperf ipv6 dm_mirror dm_region_hash dm_log power_meter microcode dcdbas serio_raw amd64_edac_mod edac_core edac_mce_amd i2c_piix4 i2c_core k10temp bnx2 sg ixgbe dca mdio ext4 mbcache jbd2 dm_round_robin sr_mod cdrom sd_mod crc_t10dif ata_generic pata_acpi pata_atiixp ahci mptsas mptscsih mptbase scsi_transport_sas dm_multipath dm_mod [last unloaded: scsi_wait_scan] Pid: 44103, comm: async/16 Not tainted 2.6.32-195.el6.x86_64 #1 Call Trace: warn_slowpath_common+0x87/0xc0 warn_slowpath_fmt+0x46/0x50 sysfs_add_one+0xc9/0x130 sysfs_do_create_link+0x12b/0x170 sysfs_create_link+0x13/0x20 device_add+0x317/0x650 idr_get_new+0x13/0x50 add_partition+0x21c/0x390 rescan_partitions+0x32b/0x470 sd_open+0x81/0x1f0 [sd_mod] __blkdev_get+0x1b6/0x3c0 blkdev_get+0x10/0x20 register_disk+0x155/0x170 add_disk+0xa6/0x160 sd_probe_async+0x13b/0x210 [sd_mod] add_wait_queue+0x46/0x60 async_thread+0x102/0x250 default_wake_function+0x0/0x20 async_thread+0x0/0x250 kthread+0x96/0xa0 child_rip+0xa/0x20 kthread+0x0/0xa0 child_rip+0x0/0x20 This most likely happens because dev_t is freed while the number is still used and idr_get_new() is not protected on every use. The fix adds a mutex where it wasn't before and moves the dev_t free function so it is called after device del. Signed-off-by: Tomas Henzl <thenzl@redhat.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
9 years ago
block: fix boot failure with CONFIG_DEBUG_BLOCK_EXT_DEVT=y and nash We run into system boot failure with kernel 2.6.28-rc. We found it on a couple of machines, including T61 notebook, nehalem machine, and another HPC NX6325 notebook. All the machines use FedoraCore 8 or FedoraCore 9. With kernel prior to 2.6.28-rc, system boot doesn't fail. I debug it and locate the root cause. Pls. see http://bugzilla.kernel.org/show_bug.cgi?id=11899 https://bugzilla.redhat.com/show_bug.cgi?id=471517 As a matter of fact, there are 2 bugs. 1)root=/dev/sda1, system boot randomly fails. Mostly, boot for 5 times and fails once. nash has a bug. Some of its functions misuse return value 0. Sometimes, 0 means timeout and no uevent available. Sometimes, 0 means nash gets an uevent, but the uevent isn't block-related (for exmaple, usb). If by coincidence, kernel tells nash that uevents are available, but kernel also set timeout, nash might stops collecting other uevents in queue if current uevent isn't block-related. I work out a patch for nash to fix it. http://bugzilla.kernel.org/attachment.cgi?id=18858 2) root=LABEL=/, system always can't boot. initrd init reports switchroot fails. Here is an executation branch of nash when booting: (1) nash read /sys/block/sda/dev; Assume major is 8 (on my desktop) (2) nash query /proc/devices with the major number; It found line "8 sd"; (3) nash use 'sd' to search its own probe table to find device (DISK) type for the device and add it to its own list; (4) Later on, it probes all devices in its list to get filesystem labels; scsi register "8 sd" always. When major is 259, nash fails to find the device(DISK) type. I enables CONFIG_DEBUG_BLOCK_EXT_DEVT=y when compiling kernel, so 259 is picked up for device /dev/sda1, which causes nash to fail to find device (DISK) type. To fixing issue 2), I create a patch for nash and another patch for kernel. http://bugzilla.kernel.org/attachment.cgi?id=18859 http://bugzilla.kernel.org/attachment.cgi?id=18837 Below is the patch for kernel 2.6.28-rc4. It registers blkext, a new block device in proc/devices. With 2 patches on nash and 1 patch on kernel, I boot my machines for dozens of times without failure. Signed-off-by Zhang Yanmin <yanmin.zhang@linux.intel.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
13 years ago
implement in-kernel gendisk events handling Currently, media presence polling for removeable block devices is done from userland. There are several issues with this. * Polling is done by periodically opening the device. For SCSI devices, the command sequence generated by such action involves a few different commands including TEST_UNIT_READY. This behavior, while perfectly legal, is different from Windows which only issues single command, GET_EVENT_STATUS_NOTIFICATION. Unfortunately, some ATAPI devices lock up after being periodically queried such command sequences. * There is no reliable and unintrusive way for a userland program to tell whether the target device is safe for media presence polling. For example, polling for media presence during an on-going burning session can make it fail. The polling program can avoid this by opening the device with O_EXCL but then it risks making a valid exclusive user of the device fail w/ -EBUSY. * Userland polling is unnecessarily heavy and in-kernel implementation is lighter and better coordinated (workqueue, timer slack). This patch implements framework for in-kernel disk event handling, which includes media presence polling. * bdops->check_events() is added, which supercedes ->media_changed(). It should check whether there's any pending event and return if so. Currently, two events are defined - DISK_EVENT_MEDIA_CHANGE and DISK_EVENT_EJECT_REQUEST. ->check_events() is guaranteed not to be called parallelly. * gendisk->events and ->async_events are added. These should be initialized by block driver before passing the device to add_disk(). The former contains the mask of all supported events and the latter the mask of all events which the device can report without polling. /sys/block/*/events[_async] export these to userland. * Kernel parameter block.events_dfl_poll_msecs controls the system polling interval (default is 0 which means disable) and /sys/block/*/events_poll_msecs control polling intervals for individual devices (default is -1 meaning use system setting). Note that if a device can report all supported events asynchronously and its polling interval isn't explicitly set, the device won't be polled regardless of the system polling interval. * If a device is opened exclusively with write access, event checking is automatically disabled until all write exclusive accesses are released. * There are event 'clearing' events. For example, both of currently defined events are cleared after the device has been successfully opened. This information is passed to ->check_events() callback using @clearing argument as a hint. * Event checking is always performed from system_nrt_wq and timer slack is set to 25% for polling. * Nothing changes for drivers which implement ->media_changed() but not ->check_events(). Going forward, all drivers will be converted to ->check_events() and ->media_change() will be dropped. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
11 years ago
block: make gendisk hold a reference to its queue The following command sequence triggers an oops. # mount /dev/sdb1 /mnt # echo 1 > /sys/class/scsi_device/0\:0\:1\:0/device/delete # umount /mnt general protection fault: 0000 [#1] PREEMPT SMP CPU 2 Modules linked in: Pid: 791, comm: umount Not tainted 3.1.0-rc3-work+ #8 Bochs Bochs RIP: 0010:[<ffffffff810d0879>] [<ffffffff810d0879>] __lock_acquire+0x389/0x1d60 ... Call Trace: [<ffffffff810d2845>] lock_acquire+0x95/0x140 [<ffffffff81aed87b>] _raw_spin_lock+0x3b/0x50 [<ffffffff811573bc>] bdi_lock_two+0x5c/0x70 [<ffffffff811c2f6c>] bdev_inode_switch_bdi+0x4c/0xf0 [<ffffffff811c3fcb>] __blkdev_put+0x11b/0x1d0 [<ffffffff811c4010>] __blkdev_put+0x160/0x1d0 [<ffffffff811c40df>] blkdev_put+0x5f/0x190 [<ffffffff8118f18d>] kill_block_super+0x4d/0x80 [<ffffffff8118f4a5>] deactivate_locked_super+0x45/0x70 [<ffffffff8119003a>] deactivate_super+0x4a/0x70 [<ffffffff811ac4ad>] mntput_no_expire+0xed/0x130 [<ffffffff811acf2e>] sys_umount+0x7e/0x3a0 [<ffffffff81aeeeab>] system_call_fastpath+0x16/0x1b This is because bdev holds on to disk but disk doesn't pin the associated queue. If a SCSI device is removed while the device is still open, the sdev puts the base reference to the queue on release. When the bdev is finally released, the associated queue is already gone along with the bdi and bdev_inode_switch_bdi() ends up dereferencing already freed bdi. Even if it were not for this bug, disk not holding onto the associated queue is very unusual and error-prone. Fix it by making add_disk() take an extra reference to its queue and put it on disk_release() and ensuring that disk and its fops owner are put in that order after all accesses to the disk and queue are complete. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: stable@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
10 years ago