Skip to content
  • Michal Hocko's avatar
    proc, oom: drop bogus task_lock and mm check · d49fbf76
    Michal Hocko authored
    Series "Handle oom bypass more gracefully", V5
    
    The following 10 patches should put some order to very rare cases of mm
    shared between processes and make the paths which bypass the oom killer
    oom reapable and therefore much more reliable finally.  Even though mm
    shared outside of thread group is rare (either vforked tasks for a short
    period, use_mm by kernel threads or exotic thread model of
    clone(CLONE_VM) without CLONE_SIGHAND) it is better to cover them.  Not
    only it makes the current oom killer logic quite hard to follow and
    reason about it can lead to weird corner cases.  E.g.  it is possible to
    select an oom victim which shares the mm with unkillable process or
    bypass the oom killer even when other processes sharing the mm are still
    alive and other weird cases.
    
    Patch 1 drops bogus task_lock and mm check from oom_{score_}adj_write.
    This can be considered a bug fix with a low impact as nobody has noticed
    for years.
    
    Patch 2 drops sighand lock because it is not needed anymore as pointed
    by Oleg.
    
    Patch 3 is a clean up of oom_score_adj handling and a preparatory work
    for later patches.
    
    Patch 4 enforces oom_adj_score to be consistent between processes
    sharing the mm to behave consistently with the regular thread groups.
    This can be considered a user visible behavior change because one thread
    group updating oom_score_adj will affect others which share the same mm
    via clone(CLONE_VM).  I argue that this should be acceptable because we
    already have the same behavior for threads in the same thread group and
    sharing the mm without signal struct is just a different model of
    threading.  This is probably the most controversial part of the series,
    I would like to find some consensus here.  There were some suggestions
    to hook some counter/oom_score_adj into the mm_struct but I feel that
    this is not necessary right now and we can rely on proc handler +
    oom_kill_process to DTRT.  I can be convinced otherwise but I strongly
    think that whatever we do the userspace has to have a way to see the
    current oom priority as consistently as possible.
    
    Patch 5 makes sure that no vforked task is selected if it is sharing the
    mm with oom unkillable task.
    
    Patch 6 ensures that all user tasks sharing the mm are killed which in
    turn makes sure that all oom victims are oom reapable.
    
    Patch 7 guarantees that task_will_free_mem will always imply reapable
    bypass of the oom killer.
    
    Patch 8 is new in this version and it addresses an issue pointed out by
    0-day OOM report where an oom victim was reaped several times.
    
    Patch 9 puts an upper bound on how many times oom_reaper tries to reap a
    task and hides it from the oom killer to move on when no progress can be
    made.  This will give an upper bound to how long an oom_reapable task
    can block the oom killer from selecting another victim if the oom_reaper
    is not able to reap the victim.
    
    Patch 10 tries to plug the (hopefully) last hole when we can still lock
    up when the oom victim is shared with oom unkillable tasks (kthreads and
    global init).  We just try to be best effort in that case and rather
    fallback to kill something else than risk a lockup.
    
    This patch (of 10):
    
    Both oom_adj_write and oom_score_adj_write are using task_lock, check for
    task->mm and fail if it is NULL.  This is not needed because the
    oom_score_adj is per signal struct so we do not need mm at all.  The code
    has been introduced by 3d5992d2 ("oom: add per-mm oom disable count")
    but we do not do per-mm oom disable since c9f01245 ("oom: remove
    oom_disable_count").
    
    The task->mm check is even not correct because the current thread might
    have exited but the thread group might be still alive - e.g.  thread group
    leader would lead that echo $VAL > /proc/pid/oom_score_adj would always
    fail with EINVAL while /proc/pid/task/$other_tid/oom_score_adj would
    succeed.  This is unexpected at best.
    
    Remove the lock along with the check to fix the unexpected behavior and
    also because there is not real need for the lock in the first place.
    
    Link: http://lkml.kernel.org/r/1466426628-15074-2-git-send-email-mhocko@kernel.org
    
    
    Signed-off-by: default avatarMichal Hocko <mhocko@suse.com>
    Reviewed-by: default avatarVladimir Davydov <vdavydov@virtuozzo.com>
    Acked-by: default avatarOleg Nesterov <oleg@redhat.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    d49fbf76