    mm: numa: do not trap faults on shared data section pages. · 859d4adc
    Henry Willard authored
    Workloads consisting of a large number of processes running the same
    program with a very large shared data segment may experience performance
    problems when NUMA balancing attempts to migrate the shared COW pages.
    This manifests itself as many processes or tasks in
    TASK_UNINTERRUPTIBLE state waiting for the shared pages to be migrated.
    
    The program listed below simulates these conditions.  Run with 288
    processes on a 144-core, 8-socket machine, it gives these results:
    
    Average throughput     Average throughput     Average throughput
    with numa_balancing=0  with numa_balancing=1  with numa_balancing=1
                           without the patch      with the patch
    ---------------------  ---------------------  ---------------------
    2118782                2021534                2107979
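
    To put the three figures on one scale, the sketch below expresses each
    numa_balancing=1 result as a percentage below the numa_balancing=0
    baseline.  The function name percent_below() is made up for this
    illustration and is not part of the patch.  With the numbers above it
    works out to roughly 4.6% without the patch and 0.5% with it.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which `measured` falls short of `baseline`. */
double percent_below(double baseline, double measured)
{
	assert(baseline > 0.0);
	return (baseline - measured) / baseline * 100.0;
}

/* Prints both slowdowns relative to the numa_balancing=0 run. */
void print_slowdowns(void)
{
	const double off = 2118782.0;          /* numa_balancing=0 */
	const double on_unpatched = 2021534.0; /* numa_balancing=1, without the patch */
	const double on_patched = 2107979.0;   /* numa_balancing=1, with the patch */

	printf("without the patch: %.2f%% below baseline\n",
	       percent_below(off, on_unpatched));
	printf("with the patch:    %.2f%% below baseline\n",
	       percent_below(off, on_patched));
}
```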
    
    With this patch applied, complex production environments show less
    variability and fewer poorly performing outliers, along with fewer
    processes waiting on NUMA page migration.  In some cases, %iowait drops
    from 16%-26% to 0.
    
      // SPDX-License-Identifier: GPL-2.0
      /*
       * Copyright (c) 2017 Oracle and/or its affiliates. All rights reserved.
       */
      #include <sys/time.h>
      #include <sys/mman.h>
      #include <sys/wait.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>
    
      int a[1000000] = {13};
    
      int  main(int argc, const char **argv)
      {
    	int n = 0;
    	int i;
    	pid_t pid;
    	int stat;
    	int *count_array;
    	int cpu_count = 288;
    	long total = 0;
    
    	struct timeval t1, t2 = {(argc > 1 ? atoi(argv[1]) : 10), 0};
    
    	if (argc > 2)
    		cpu_count = atoi(argv[2]);
    
    	count_array = mmap(NULL, cpu_count * sizeof(int),
    			   (PROT_READ|PROT_WRITE),
    			   (MAP_SHARED|MAP_ANONYMOUS), -1, 0);

    	if (count_array == MAP_FAILED) {
    		perror("mmap");
    		return 1;
    	}
    
    	for (i = 0; i < cpu_count; ++i) {
    		pid = fork();
    		if (pid <= 0)
    			break;
    		if ((i & 0xf) == 0)
    			usleep(2);
    	}
    
    	if (pid != 0) {
    		if (i == 0) {
    			perror("fork");
    			return 1;
    		}
    
    		for (;;) {
    			pid = wait(&stat);
    			if (pid < 0)
    				break;
    		}
    
    		for (i = 0; i < cpu_count; ++i)
    			total += count_array[i];
    
    		printf("Total %ld\n", total);
    		munmap(count_array, cpu_count * sizeof(int));
    		return 0;
    	}
    
    	gettimeofday(&t1, 0);
    	timeradd(&t1, &t2, &t1);
    	while (timercmp(&t2, &t1, <)) {
    		int b = 0;
    		int j;
    
    		for (j = 0; j < 1000000; j++)
    			b += a[j];
    		gettimeofday(&t2, 0);
    		n++;
    	}
    	count_array[i] = n;
    	return 0;
      }
    
    This patch changes change_pte_range() to skip shared copy-on-write pages
    when called from change_prot_numa().
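
    The test described above can be modelled in userspace as follows.  This
    is only a sketch under stated assumptions: a mapping is treated as
    copy-on-write when it is private but may be written, and a page counts
    as shared when more than one page table maps it.  The MODEL_* constants,
    struct model_page and both function names are hypothetical stand-ins,
    not kernel definitions, and the check in the patch itself may differ in
    detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's VM_* flag bits. */
#define MODEL_VM_SHARED   0x08u
#define MODEL_VM_MAYWRITE 0x20u

struct model_page {
	int mapcount;	/* number of page tables that map this page */
};

/* A mapping is copy-on-write if it is private but may be written. */
bool model_is_cow_mapping(unsigned int vm_flags)
{
	return (vm_flags & (MODEL_VM_SHARED | MODEL_VM_MAYWRITE)) ==
	       MODEL_VM_MAYWRITE;
}

/* True if the NUMA hinting scan should leave this page's PTE alone. */
bool model_skip_numa_hint(unsigned int vm_flags, const struct model_page *page)
{
	assert(page != NULL && page->mapcount >= 1);
	return model_is_cow_mapping(vm_flags) && page->mapcount != 1;
}
```

    In this model a data-section page still shared by all 288 forked
    children is skipped, while a page that one child has already written to
    (and so owns privately, with a map count of 1) remains eligible for
    NUMA hinting faults.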
    
    NOTE: change_prot_numa() is nominally called from task_numa_work() and
    queue_pages_test_walk().  task_numa_work() is the auto NUMA balancing
    path, and queue_pages_test_walk() is part of explicit NUMA policy
    management.  However, queue_pages_test_walk() only calls
    change_prot_numa() when MPOL_MF_LAZY is specified and currently that is
    not allowed, so change_prot_numa() is only called from auto NUMA
    balancing.
    
    In the case of explicit NUMA policy management, shared pages are not
    migrated unless MPOL_MF_MOVE_ALL is specified, and MPOL_MF_MOVE_ALL
    depends on CAP_SYS_NICE.  Currently, there is no way to pass information
    about MPOL_MF_MOVE_ALL to change_pte_range().  This will have to be fixed
    if MPOL_MF_LAZY is enabled and MPOL_MF_MOVE_ALL is to be honored in lazy
    migration mode.
    
    task_numa_work() skips the read-only VMAs of programs and shared
    libraries.
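
    The following userspace sketch illustrates why that existing filter does
    not cover the data section.  It assumes the filter amounts to "file
    backed and readable but not writable"; struct model_vma, the MODEL_*
    constants and model_vma_skipped() are hypothetical names used only for
    illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's VM_READ and VM_WRITE bits. */
#define MODEL_VM_READ  0x1u
#define MODEL_VM_WRITE 0x2u

struct model_vma {
	bool file_backed;	/* mapping is backed by a file */
	unsigned int vm_flags;	/* MODEL_VM_* permission bits */
};

/* True if the scan would skip this VMA as a read-only file mapping. */
bool model_vma_skipped(const struct model_vma *vma)
{
	assert(vma != NULL);
	return vma->file_backed &&
	       (vma->vm_flags & (MODEL_VM_READ | MODEL_VM_WRITE)) ==
	       MODEL_VM_READ;
}
```

    Under this assumption a program's text mapping (file backed, read-only)
    is skipped, but its initialized data section (file backed, readable and
    writable) is not, which is why the large array in the test program is
    still scanned without this patch.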
    
    Link: http://lkml.kernel.org/r/1516751617-7369-1-git-send-email-henry.willard@oracle.com
    
    
    Signed-off-by: Henry Willard <henry.willard@oracle.com>
    Reviewed-by: Håkon Bugge <haakon.bugge@oracle.com>
    Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
    Acked-by: Mel Gorman <mgorman@suse.de>
    Cc: Kate Stewart <kstewart@linuxfoundation.org>
    Cc: Zi Yan <zi.yan@cs.rutgers.edu>
    Cc: Philippe Ombredanne <pombredanne@nexb.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: "Jérôme Glisse" <jglisse@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>