A NUMA-aware VM Balancer Using Group Scheduling for OpenNebula: Part 1
I recently had the joy of building out a Hadoop cluster (Cloudera Distribution with Cloudera Manager) for internal development at work. The process was quite tedious because a large number of machines had to be configured by hand. This seemed like a good application to move to a cloud infrastructure, both to improve cluster deployment times and to provide an on-demand resource for this workload (see Cloudera Hadoop Distribution and Manager). OpenNebula was the perfect choice for the compute-oriented cloud architecture because resources can be added to or removed from the Hadoop environment on demand. Because Hadoop is a fairly memory-intensive workload, we wanted to improve memory throughput on the VMs, and group scheduling showed some promise for improving VM CPU and memory placement.
About our Infrastructure
We are working with Dell C6145 servers, which are 8-way NUMA systems based on quad-socket, 12-core AMD Magny-Cours processors. An interesting thing about these systems is that even though they are quad socket, each socket contains 2 NUMA nodes, so they have 8 NUMA domains! We wanted to see whether group scheduling could improve performance on these boxes by compartmentalizing VMs so that memory accesses between NUMA domains are minimized and L2/L3 cache hit rates improve.
The Linux NUMA-aware scheduler already does a great job. However, we wanted to see if there was a quick and easy way to allocate resources on these NUMA machines that reduces non-local memory access, improves memory throughput, and in turn helps memory-sensitive workloads like Hadoop. A cpuset is a combination of memory and CPUs configured as a single scheduling domain. Libvirt, the control API used to manage KVM, has some capabilities to map vCPUs to real CPUs and even to configure them in a virtual NUMA configuration mimicking the host it's running on. However, we found it very cumbersome to use because each VM has to be hand-tuned to get any advantage. It also defeats the OpenNebula paradigm of rapid template-based provisioning.
Implementation
Alex Tsariounov wrote a very user-friendly program called cpuset that does all the heavy lifting of moving processes from one cpuset to another. The source is available from the Google Code repository or from the Ubuntu 12.04+ repositories.
http://code.google.com/p/cpuset/
I wrote a Python wrapper script that builds on cpuset and adds the following features:
- Creates cpusets based on the numactl --hardware output
- Maps CPUs and their memory domains into their respective cpusets
- Places KVM virtual machines built using libvirt into cpusets using a balancing policy
- Rebalances VMs based on a balancing policy
- Runs once, then exits, so that system admins can control when and how much balancing to do
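The first two features in the list above can be sketched roughly as follows. This is a minimal illustration, not the actual vm-balancer.py code: the function names, the sample text, and the choice to build command lists rather than run them are all assumptions made here. It relies on the "node N cpus: ..." lines that numactl --hardware prints and on the --cpu, --mem and --set options of cset set.

```python
import re


def parse_numa_cpus(numactl_output):
    """Map NUMA node id -> sorted CPU ids, from `numactl --hardware` text."""
    nodes = {}
    for line in numactl_output.splitlines():
        match = re.match(r"^node\s+(\d+)\s+cpus:\s*(.*)$", line)
        if match:
            cpus = sorted(int(c) for c in match.group(2).split())
            nodes[int(match.group(1))] = cpus
    return nodes


def to_cpuspec(cpus):
    """Compress a CPU list into a range spec, e.g. [0, 1, 2, 4] -> '0-2,4'."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current run
        else:
            ranges.append([cpu, cpu])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)


def plan_cpusets(nodes, prefix="VMBS"):
    """Pair each node's CPUs with its own memory node: (name, cpuspec, memspec)."""
    return [
        (f"{prefix}{node}", to_cpuspec(cpus), str(node))
        for node, cpus in sorted(nodes.items())
        if cpus
    ]


def cset_create_command(name, cpuspec, memspec):
    """Build (but do not run) the cset invocation that would create one set."""
    return ["cset", "set", f"--cpu={cpuspec}", f"--mem={memspec}", f"--set={name}"]


# Hypothetical two-node excerpt, only to exercise the parser.
SAMPLE = """available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 16384 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 16384 MB
"""

if __name__ == "__main__":
    for name, cpuspec, memspec in plan_cpusets(parse_numa_cpus(SAMPLE)):
        print(" ".join(cset_create_command(name, cpuspec, memspec)))
```

Returning the commands as lists keeps the sketch side-effect free; a real tool would hand them to subprocess and would need root privileges to create the sets.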
Implementation – Example
Cpuset without NUMA configuration: this is the state of most systems before group scheduling is configured. The CPUs column lists the CPUs in that particular scheduling domain, and the MEMs column does the same for memory nodes. This system has 48 cores (0-47) and 8 NUMA nodes (0-7).
root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   490    1 /
      libvirt      ***** n   ***** n     0    1 /libvirt
Cpuset after a vm-balancer.py run with no VMs running. Notice how the CPUs and memory domains have been paired up.
root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   489    9 /
      libvirt      ***** n   ***** n     0    1 /libvirt
        VMBS7      42-47 n       7 n     0    0 /VMBS7
        VMBS6      36-41 n       6 n     0    0 /VMBS6
        VMBS5      30-35 n       5 n     0    0 /VMBS5
        VMBS4      24-29 n       4 n     0    0 /VMBS4
        VMBS3      18-23 n       3 n     0    0 /VMBS3
        VMBS2      12-17 n       2 n     0    0 /VMBS2
        VMBS1       6-11 n       1 n     0    0 /VMBS1
        VMBS0        0-5 n       0 n     0    0 /VMBS0
VM balancer and cset in action, moving 8 newly created KVM processes and their threads (vCPUs and I/O threads), each to its own NUMA node:
root@clyde:~# ./vm-balancer.py
Found cset at /usr/bin/cset
Found numactl at /usr/bin/numactl
Found virsh at /usr/bin/virsh
cset: --> created cpuset "VMBS0"
cset: --> created cpuset "VMBS1"
cset: --> created cpuset "VMBS2"
cset: --> created cpuset "VMBS3"
cset: --> created cpuset "VMBS4"
cset: --> created cpuset "VMBS5"
cset: --> created cpuset "VMBS6"
cset: --> created cpuset "VMBS7"
cset: moving following pidspec: 47737,47763,47762,47765,49299
cset: moving 5 userspace tasks to /VMBS0
[==================================================]%
cset: done
cset: moving following pidspec: 46200,46203,46204,46207
cset: moving 4 userspace tasks to /VMBS1
[==================================================]%
cset: done
cset: moving following pidspec: 45213,45210,45215,45214
cset: moving 4 userspace tasks to /VMBS2
[==================================================]%
cset: done
cset: moving following pidspec: 45709,45710,45711,45705
cset: moving 4 userspace tasks to /VMBS3
[==================================================]%
cset: done
cset: moving following pidspec: 46719,46718,46717,46714
cset: moving 4 userspace tasks to /VMBS4
[==================================================]%
cset: done
cset: moving following pidspec: 47306,47262,49078,47246,47278
cset: moving 5 userspace tasks to /VMBS5
[==================================================]%
cset: done
cset: moving following pidspec: 48247,48258,48252,48274
cset: moving 4 userspace tasks to /VMBS6
[==================================================]%
cset: done
cset: moving following pidspec: 48743,48748,48749,48746
cset: moving 4 userspace tasks to /VMBS7
[==================================================]%
cset: done
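Each pidspec in this run is a comma-separated list of the task IDs belonging to one KVM process. A rough sketch of how such a list could be gathered and turned into a cset proc call is shown here. It is an assumption about one workable approach, not the script's actual code: the helper names are hypothetical, and it reads thread IDs from the /proc/<pid>/task directory that Linux exposes for every process.

```python
import os


def list_thread_ids(pid, proc_root="/proc"):
    """Return all task (thread) ids of a process, read from <proc_root>/<pid>/task."""
    task_dir = os.path.join(proc_root, str(pid), "task")
    return sorted(int(name) for name in os.listdir(task_dir) if name.isdigit())


def cset_move_command(tids, cpuset):
    """Build (but do not run) a `cset proc --move` call for the given task ids."""
    pidspec = ",".join(str(tid) for tid in tids)
    return ["cset", "proc", "--move", f"--pid={pidspec}", f"--toset={cpuset}"]


if __name__ == "__main__":
    # Demonstrate on the current process; a real run would use a KVM process id.
    tids = list_thread_ids(os.getpid())
    print(" ".join(cset_move_command(tids, "VMBS0")))
```

Because threads can be created between listing and moving, such as a short-lived I/O thread, a real tool would have to tolerate task IDs that appear or vanish between runs.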
Cpuset after the VMs are balanced into their respective NUMA domains. Note that there are 3 vCPUs per VM plus 1 parent process; the VM that has 5 threads is actually running a short-lived I/O thread.
root@clyde:~# cset set
cset:
         Name       CPUs-X    MEMs-X Tasks Subs Path
 ------------ ---------- - ------- - ----- ---- ----------
         root       0-47 y     0-7 y   500    9 /
        VMBS7      42-47 n       7 n     5    0 /VMBS7
        VMBS6      36-41 n       6 n     4    0 /VMBS6
        VMBS5      30-35 n       5 n     4    0 /VMBS5
        VMBS4      24-29 n       4 n     4    0 /VMBS4
        VMBS3      18-23 n       3 n     4    0 /VMBS3
        VMBS2      12-17 n       2 n     4    0 /VMBS2
        VMBS1       6-11 n       1 n     4    0 /VMBS1
        VMBS0        0-5 n       0 n     4    0 /VMBS0
      libvirt      ***** n   ***** n     0    1 /libvirt
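The result above, one VM per NUMA domain, is what a simple least-loaded placement would produce. The balancing policy that vm-balancer.py really uses is not described in this part, so the following is only a hypothetical sketch of such a policy: the function name, the VM names, and the tie-breaking rule are assumptions.

```python
def place_vms(vm_names, cpuset_load):
    """Assign each VM to the cpuset currently holding the fewest VMs.

    cpuset_load maps cpuset name -> number of VMs already placed there.
    Ties are broken by cpuset name, so an empty host fills VMBS0 first.
    (Plain string comparison is enough for VMBS0..VMBS7; hosts with ten
    or more sets would want a numeric sort key instead.)
    """
    load = dict(cpuset_load)  # work on a copy; do not mutate the caller's dict
    placement = {}
    for vm in vm_names:
        target = min(load, key=lambda name: (load[name], name))
        placement[vm] = target
        load[target] += 1
    return placement


if __name__ == "__main__":
    empty_host = {f"VMBS{n}": 0 for n in range(8)}
    vms = [f"vm{n}" for n in range(8)]  # hypothetical VM names
    for vm, cpuset in place_vms(vms, empty_host).items():
        print(vm, "->", cpuset)
```

Counting VMs per set is the crudest possible load measure; weighting by vCPU count or memory size would be a natural refinement when VMs are not all the same shape.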
The Python script can be downloaded from http://code.google.com/p/vm-balancer-numa/downloads/list for now.
Note: this was inspired by work presented at the KVM Forum 2011 by Andrew Theurer. We even used the same benchmark to test our configuration.
http://www.linux-kvm.org/wiki/images/5/53/2011-forum-Improving-out-of-box-performance-v1.4.pdf
In the next part I will explain how the vm-balancer.py script works and its limitations.
