113 lines
4.8 KiB
Diff
113 lines
4.8 KiB
Diff
From: Andrea Arcangeli <aarcange@redhat.com>
|
|
Subject: thp: avoid atomic64_read in pmd_read_atomic for 32bit PAE
|
|
|
|
In the x86 32bit PAE CONFIG_TRANSPARENT_HUGEPAGE=y case while holding the
|
|
mmap_sem for reading, cmpxchg8b cannot be used to read pmd contents under
|
|
Xen.
|
|
|
|
So instead of dealing only with "consistent" pmdvals in
|
|
pmd_none_or_trans_huge_or_clear_bad() (which would be conceptually
|
|
simpler) we let pmd_none_or_trans_huge_or_clear_bad() deal with pmdvals
|
|
where the low 32bit and high 32bit could be inconsistent (to avoid having
|
|
to use cmpxchg8b).
|
|
|
|
The only guarantee we get from pmd_read_atomic is that if the low part of
|
|
the pmd was found null, the high part will be null too (so the pmd will be
|
|
considered unstable). And if the low part of the pmd is found "stable"
|
|
later, then it means the whole pmd was read atomically (because after a
|
|
pmd is stable, neither MADV_DONTNEED nor page faults can alter it anymore,
|
|
and we read the high part after the low part).
|
|
|
|
In the 32bit PAE x86 case, it is enough to read the low part of the pmdval
|
|
atomically to declare the pmd as "stable" and that's true for THP and no
|
|
THP, furthermore in the THP case we also have a barrier() that will
|
|
prevent any inconsistent pmdvals to be cached by a later re-read of the
|
|
*pmd.
|
|
|
|
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
|
|
Cc: Jonathan Nieder <jrnieder@gmail.com>
|
|
Cc: Ulrich Obergfell <uobergfe@redhat.com>
|
|
Cc: Mel Gorman <mgorman@suse.de>
|
|
Cc: Hugh Dickins <hughd@google.com>
|
|
Cc: Larry Woodman <lwoodman@redhat.com>
|
|
Cc: Petr Matousek <pmatouse@redhat.com>
|
|
Cc: Rik van Riel <riel@redhat.com>
|
|
Cc: Jan Beulich <jbeulich@suse.com>
|
|
Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
|
|
Cc: <stable@vger.kernel.org>
|
|
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
---
|
|
|
|
arch/x86/include/asm/pgtable-3level.h | 30 +++++++++++++-----------
|
|
include/asm-generic/pgtable.h | 10 ++++++++
|
|
2 files changed, 27 insertions(+), 13 deletions(-)
|
|
|
|
--- a/arch/x86/include/asm/pgtable-3level.h
|
|
+++ b/arch/x86/include/asm/pgtable-3level.h
|
|
@@ -47,16 +47,26 @@
|
|
* they can run pmd_offset_map_lock or pmd_trans_huge or other pmd
|
|
* operations.
|
|
*
|
|
- * Without THP if the mmap_sem is hold for reading, the
|
|
- * pmd can only transition from null to not null while pmd_read_atomic runs.
|
|
- * So there's no need of literally reading it atomically.
|
|
+ * Without THP if the mmap_sem is hold for reading, the pmd can only
|
|
+ * transition from null to not null while pmd_read_atomic runs. So
|
|
+ * we can always return atomic pmd values with this function.
|
|
*
|
|
* With THP if the mmap_sem is hold for reading, the pmd can become
|
|
- * THP or null or point to a pte (and in turn become "stable") at any
|
|
- * time under pmd_read_atomic, so it's mandatory to read it atomically
|
|
- * with cmpxchg8b.
|
|
+ * trans_huge or none or point to a pte (and in turn become "stable")
|
|
+ * at any time under pmd_read_atomic. We could read it really
|
|
+ * atomically here with a atomic64_read for the THP enabled case (and
|
|
+ * it would be a whole lot simpler), but to avoid using cmpxchg8b we
|
|
+ * only return an atomic pmdval if the low part of the pmdval is later
|
|
+ * found stable (i.e. pointing to a pte). And we're returning a none
|
|
+ * pmdval if the low part of the pmd is none. In some cases the high
|
|
+ * and low part of the pmdval returned may not be consistent if THP is
|
|
+ * enabled (the low part may point to previously mapped hugepage,
|
|
+ * while the high part may point to a more recently mapped hugepage),
|
|
+ * but pmd_none_or_trans_huge_or_clear_bad() only needs the low part
|
|
+ * of the pmd to be read atomically to decide if the pmd is unstable
|
|
+ * or not, with the only exception of when the low part of the pmd is
|
|
+ * zero in which case we return a none pmd.
|
|
*/
|
|
-#ifndef CONFIG_TRANSPARENT_HUGEPAGE
|
|
static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
|
|
{
|
|
pmdval_t ret;
|
|
@@ -74,12 +84,6 @@
|
|
|
|
return (pmd_t) { ret };
|
|
}
|
|
-#else /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
-static inline pmd_t pmd_read_atomic(pmd_t *pmdp)
|
|
-{
|
|
- return (pmd_t) { atomic64_read((atomic64_t *)pmdp) };
|
|
-}
|
|
-#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
|
|
|
|
static inline void native_set_pte_atomic(pte_t *ptep, pte_t pte)
|
|
{
|
|
--- a/include/asm-generic/pgtable.h
|
|
+++ b/include/asm-generic/pgtable.h
|
|
@@ -484,6 +484,16 @@
|
|
/*
|
|
* The barrier will stabilize the pmdval in a register or on
|
|
* the stack so that it will stop changing under the code.
|
|
+ *
|
|
+ * When CONFIG_TRANSPARENT_HUGEPAGE=y on x86 32bit PAE,
|
|
+ * pmd_read_atomic is allowed to return a not atomic pmdval
|
|
+ * (for example pointing to an hugepage that has never been
|
|
+ * mapped in the pmd). The below checks will only care about
|
|
+ * the low part of the pmd with 32bit PAE x86 anyway, with the
|
|
+ * exception of pmd_none(). So the important thing is that if
|
|
+ * the low part of the pmd is found null, the high part will
|
|
+ * be also null or the pmd_none() check below would be
|
|
+ * confused.
|
|
*/
|
|
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
|
|
barrier();
|