Commit 6234346e authored by Stephen Rothwell's avatar Stephen Rothwell

Merge branch 'akpm-current/current'

parents b8282eee 85714e5d
......@@ -98,3 +98,35 @@ Description:
The backing_dev file is read-write and set up backing
device for zram to write incompressible pages.
For using, user should enable CONFIG_ZRAM_WRITEBACK.
What: /sys/block/zram<id>/idle
Date: November 2018
Contact: Minchan Kim <minchan@kernel.org>
Description:
idle file is write-only and mark zram slot as idle.
If system has mounted debugfs, user can see which slots
are idle via /sys/kernel/debug/zram/zram<id>/block_state
What: /sys/block/zram<id>/writeback
Date: November 2018
Contact: Minchan Kim <minchan@kernel.org>
Description:
The writeback file is write-only and trigger idle and/or
huge page writeback to backing device.
What: /sys/block/zram<id>/bd_stat
Date: November 2018
Contact: Minchan Kim <minchan@kernel.org>
Description:
The bd_stat file is read-only and represents backing device's
statistics (bd_count, bd_reads, bd_writes) in a format
similar to block layer statistics file format.
What: /sys/block/zram<id>/writeback_limit
Date: November 2018
Contact: Minchan Kim <minchan@kernel.org>
Description:
The writeback_limit file is read-write and specifies the maximum
amount of writeback ZRAM can do. The limit could be changed
in run time and "0" means disable the limit.
No limit is the initial state.
......@@ -1803,6 +1803,9 @@
ip= [IP_PNP]
See Documentation/filesystems/nfs/nfsroot.txt.
ipcmni_extend [KNL] Extend the maximum number of unique System V
IPC identifiers from 32,768 to 8,388,608.
irqaffinity= [SMP] Set the default irq affinity mask
The argument is a cpu list, as described above.
......@@ -3088,6 +3091,14 @@
timeout < 0: reboot immediately
Format: <timeout>
panic_print= Bitmask for printing system info when panic happens.
User can chose combination of the following bits:
bit 0: print all tasks info
bit 1: print system memory info
bit 2: print timer info
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
panic_on_warn panic() instead of WARN(). Useful to cause kdump
on a WARN().
......
......@@ -164,11 +164,14 @@ reset WO trigger device reset
mem_used_max WO reset the `mem_used_max' counter (see later)
mem_limit WO specifies the maximum amount of memory ZRAM can use
to store the compressed data
writeback_limit WO specifies the maximum amount of write IO zram can
write out to backing device as 4KB unit
max_comp_streams RW the number of possible concurrent compress operations
comp_algorithm RW show and change the compression algorithm
compact WO trigger memory compaction
debug_stat RO this file is used for zram debugging purposes
backing_dev RW set up backend storage for zram to write out
idle WO mark allocated slot as idle
User space is advised to use the following files to read the device statistics.
......@@ -220,6 +223,17 @@ line of text and contains the following stats separated by whitespace:
pages_compacted the number of pages freed during compaction
huge_pages the number of incompressible pages
File /sys/block/zram<id>/bd_stat
The stat file represents device's backing device statistics. It consists of
a single line of text and contains the following stats separated by whitespace:
bd_count size of data written in backing device.
Unit: 4K bytes
bd_reads the number of reads from backing device
Unit: 4K bytes
bd_writes the number of writes to backing device
Unit: 4K bytes
9) Deactivate:
swapoff /dev/zram0
umount /dev/zram1
......@@ -237,11 +251,60 @@ line of text and contains the following stats separated by whitespace:
= writeback
With incompressible pages, there is no memory saving with zram.
Instead, with CONFIG_ZRAM_WRITEBACK, zram can write incompressible page
With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page
to backing storage rather than keeping it in memory.
User should set up backing device via /sys/block/zramX/backing_dev
before disksize setting.
To use the feature, admin should set up backing device via
"echo /dev/sda5 > /sys/block/zramX/backing_dev"
before disksize setting. It supports only partition at this moment.
If admin want to use incompressible page writeback, they could do via
"echo huge > /sys/block/zramX/write"
To use idle page writeback, first, user need to declare zram pages
as idle.
"echo all > /sys/block/zramX/idle"
From now on, any pages on zram are idle pages. The idle mark
will be removed until someone request access of the block.
IOW, unless there is access request, those pages are still idle pages.
Admin can request writeback of those idle pages at right timing via
"echo idle > /sys/block/zramX/writeback"
With the command, zram writeback idle pages from memory to the storage.
If there are lots of write IO with flash device, potentially, it has
flash wearout problem so that admin needs to design write limitation
to guarantee storage health for entire product life.
To overcome the concern, zram supports "writeback_limit".
The "writeback_limit"'s default value is 0 so that it doesn't limit
any writeback. If admin want to measure writeback count in a certain
period, he could know it via /sys/block/zram0/bd_stat's 3rd column.
If admin want to limit writeback as per-day 400M, he could do it
like below.
MB_SHIFT=20
4K_SHIFT=12
echo $((400<<MB_SHIFT>>4K_SHIFT)) > \
/sys/block/zram0/writeback_limit.
If admin want to allow further write again, he could do it like below
echo 0 > /sys/block/zram0/writeback_limit
If admin want to see remaining writeback budget since he set,
cat /sys/block/zram0/writeback_limit
The writeback_limit count will reset whenever you reset zram(e.g.,
system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of
writeback happened until you reset the zram to allocate extra writeback
budget in next setting is user's job.
= memory tracking
......@@ -251,16 +314,17 @@ pages of the process with*pagemap.
If you enable the feature, you could see block state via
/sys/kernel/debug/zram/zram0/block_state". The output is as follows,
300 75.033841 .wh
301 63.806904 s..
302 63.806919 ..h
300 75.033841 .wh.
301 63.806904 s...
302 63.806919 ..hi
First column is zram's block index.
Second column is access time since the system was booted
Third column is state of the block.
(s: same page
w: written page to backing store
h: huge page)
h: huge page
i: idle page)
First line of above example says 300th block is accessed at 75.033841sec
and the block's state is huge so it is written back to the backing
......
This diff is collapsed.
......@@ -189,6 +189,7 @@ read the file /proc/PID/status:
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
......@@ -265,6 +266,8 @@ Table 1-2: Contents of the status files (as of 4.19)
HugetlbPages size of hugetlb memory portions
CoreDumping process's memory is currently being dumped
(killing the process may lead to a corrupted core)
THP_enabled process is allowed to use THP (returns 0 when
PR_SET_THP_DISABLE is set on the process
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread
......@@ -436,6 +439,7 @@ SwapPss: 0 kB
KernelPageSize: 4 kB
MMUPageSize: 4 kB
Locked: 0 kB
THPeligible: 0
VmFlags: rd ex mr mw me dw
the first of these lines shows the same information as is displayed for the
......@@ -473,6 +477,8 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
"THPeligible" indicates whether the mapping is eligible for THP pages - 1 if
true, 0 otherwise.
"VmFlags" field deserves a separate description. This member represents the kernel
flags associated with the particular virtual memory area in two letter encoded
......@@ -502,12 +508,19 @@ manner. The codes are the following:
sd - soft-dirty flag
mm - mixed map area
hg - huge page advise flag
nh - no-huge page advise flag
nh - no-huge page advise flag [*]
mg - mergable advise flag
[*] A process mapping may be advised to not be backed by transparent hugepages
by either madvise(MADV_NOHUGEPAGE) or prctl(PR_SET_THP_DISABLE). See
Documentation/admin-guide/mm/transhuge.rst for system-wide and process
mapping policies.
Note that there is no guarantee that every flag and associated mnemonic will
be present in all further kernel releases. Things get changed, the flags may
be vanished or the reverse -- new added.
be vanished or the reverse -- new added. Interpretation of their meaning
might change in future as well. So each consumer of these flags has to
follow each specific kernel version for the exact semantic.
This file is only present if the CONFIG_MMU kernel configuration option is
enabled.
......
......@@ -443,6 +443,9 @@ In function prototypes, include parameter names with their data types.
Although this is not required by the C language, it is preferred in Linux
because it is a simple way to add valuable information for the reader.
Do not use the `extern' keyword with function prototypes as this makes
lines longer and isn't strictly necessary.
7) Centralized exiting of functions
-----------------------------------
......
......@@ -60,6 +60,7 @@ show up in /proc/sys/kernel:
- panic_on_stackoverflow
- panic_on_unrecovered_nmi
- panic_on_warn
- panic_print
- panic_on_rcu_stall
- perf_cpu_time_max_percent
- perf_event_paranoid
......@@ -654,6 +655,22 @@ a kernel rebuild when attempting to kdump at the location of a WARN().
==============================================================
panic_print:
Bitmask for printing system info when panic happens. User can chose
combination of the following bits:
bit 0: print all tasks info
bit 1: print system memory info
bit 2: print timer info
bit 3: print locks info if CONFIG_LOCKDEP is on
bit 4: print ftrace buffer
So for example to print tasks and memory info on panic, user can:
echo 3 > /proc/sys/kernel/panic_print
==============================================================
panic_on_rcu_stall:
When set to 1, calls panic() after RCU stall detection messages. This
......
......@@ -63,6 +63,7 @@ Currently, these files are in /proc/sys/vm:
- swappiness
- user_reserve_kbytes
- vfs_cache_pressure
- watermark_boost_factor
- watermark_scale_factor
- zone_reclaim_mode
......@@ -856,6 +857,26 @@ ten times more freeable objects than there are.
=============================================================
watermark_boost_factor:
This factor controls the level of reclaim when memory is being fragmented.
It defines the percentage of the high watermark of a zone that will be
reclaimed if pages of different mobility are being mixed within pageblocks.
The intent is that compaction has less work to do in the future and to
increase the success rate of future high-order allocations such as SLUB
allocations, THP and hugetlbfs pages.
To make it sensible with respect to the watermark_scale_factor parameter,
the unit is in fractions of 10,000. The default value of 15,000 means
that up to 150% of the high watermark will be reclaimed in the event of
a pageblock being mixed due to fragmentation. The level of reclaim is
determined by the number of fragmentation events that occurred in the
recent past. If this value is smaller than a pageblock then a pageblocks
worth of pages will be reclaimed (e.g. 2MB on 64-bit x86). A boost factor
of 0 will disable the feature.
=============================================================
watermark_scale_factor:
This factor controls the aggressiveness of kswapd. It defines the
......
......@@ -391,9 +391,9 @@ static inline unsigned long __fls(unsigned long x)
return fls64(x) - 1;
}
static inline int fls(int x)
static inline int fls(unsigned int x)
{
return fls64((unsigned int) x);
return fls64(x);
}
/*
......
......@@ -278,7 +278,7 @@ static inline __attribute__ ((const)) int clz(unsigned int x)
return res;
}
static inline int constant_fls(int x)
static inline int constant_fls(unsigned int x)
{
int r = 32;
......@@ -312,7 +312,7 @@ static inline int constant_fls(int x)
* @result: [1-32]
* fls(1) = 1, fls(0x80000000) = 32, fls(0) = 0
*/
static inline __attribute__ ((const)) int fls(unsigned long x)
static inline __attribute__ ((const)) int fls(unsigned int x)
{
if (__builtin_constant_p(x))
return constant_fls(x);
......
......@@ -17,6 +17,8 @@
#ifndef __ASSEMBLY__
#include <linux/personality.h> /* For READ_IMPLIES_EXEC */
#ifndef CONFIG_MMU
#include <asm/page-nommu.h>
......
......@@ -111,6 +111,7 @@ config ARM64
select HAVE_ARCH_JUMP_LABEL
select HAVE_ARCH_JUMP_LABEL_RELATIVE
select HAVE_ARCH_KASAN if !(ARM64_16K_PAGES && ARM64_VA_BITS_48)
select HAVE_ARCH_KASAN_SW_TAGS if HAVE_ARCH_KASAN
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
......
......@@ -101,10 +101,19 @@ else
TEXT_OFFSET := 0x00080000
endif
ifeq ($(CONFIG_KASAN_SW_TAGS), y)
KASAN_SHADOW_SCALE_SHIFT := 4
else
KASAN_SHADOW_SCALE_SHIFT := 3
endif
KBUILD_CFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
KBUILD_CPPFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
KBUILD_AFLAGS += -DKASAN_SHADOW_SCALE_SHIFT=$(KASAN_SHADOW_SCALE_SHIFT)
# KASAN_SHADOW_OFFSET = VA_START + (1 << (VA_BITS - KASAN_SHADOW_SCALE_SHIFT))
# - (1 << (64 - KASAN_SHADOW_SCALE_SHIFT))
# in 32-bit arithmetic
KASAN_SHADOW_SCALE_SHIFT := 3
KASAN_SHADOW_OFFSET := $(shell printf "0x%08x00000000\n" $$(( \
(0xffffffff & (-1 << ($(CONFIG_ARM64_VA_BITS) - 32))) \
+ (1 << ($(CONFIG_ARM64_VA_BITS) - 32 - $(KASAN_SHADOW_SCALE_SHIFT))) \
......
......@@ -16,10 +16,12 @@
* 0x400: for dynamic BRK instruction
* 0x401: for compile time BRK instruction
* 0x800: kernel-mode BUG() and WARN() traps
* 0x9xx: tag-based KASAN trap (allowed values 0x900 - 0x9ff)
*/
#define FAULT_BRK_IMM 0x100
#define KGDB_DYN_DBG_BRK_IMM 0x400
#define KGDB_COMPILED_DBG_BRK_IMM 0x401
#define BUG_BRK_IMM 0x800
#define KASAN_BRK_IMM 0x900
#endif
......@@ -4,12 +4,16 @@
#ifndef __ASSEMBLY__
#ifdef CONFIG_KASAN
#include <linux/linkage.h>
#include <asm/memory.h>
#include <asm/pgtable-types.h>
#define arch_kasan_set_tag(addr, tag) __tag_set(addr, tag)
#define arch_kasan_reset_tag(addr) __tag_reset(addr)
#define arch_kasan_get_tag(addr) __tag_get(addr)
#ifdef CONFIG_KASAN
/*
* KASAN_SHADOW_START: beginning of the kernel virtual addresses.
* KASAN_SHADOW_END: KASAN_SHADOW_START + 1/N of kernel virtual addresses,
......
......@@ -74,13 +74,12 @@
#endif
/*
* KASAN requires 1/8th of the kernel virtual address space for the shadow
* region. KASAN can bloat the stack significantly, so double the (minimum)
* stack size when KASAN is in use, and then double it again if KASAN_EXTRA is
* on.
* Generic and tag-based KASAN require 1/8th and 1/16th of the kernel virtual
* address space for the shadow region respectively. They can bloat the stack
* significantly, so double the (minimum) stack size when they are in use,
* and then double it again if KASAN_EXTRA is on.
*/
#ifdef CONFIG_KASAN
#define KASAN_SHADOW_SCALE_SHIFT 3
#define KASAN_SHADOW_SIZE (UL(1) << (VA_BITS - KASAN_SHADOW_SCALE_SHIFT))
#ifdef CONFIG_KASAN_EXTRA
#define KASAN_THREAD_SHIFT 2
......@@ -212,6 +211,26 @@ extern u64 vabits_user;
*/
#define PHYS_PFN_OFFSET (PHYS_OFFSET >> PAGE_SHIFT)
/*
* When dealing with data aborts, watchpoints, or instruction traps we may end
* up with a tagged userland pointer. Clear the tag to get a sane pointer to
* pass on to access_ok(), for instance.