Empirical power consumption characterization and energy aware scheduling in data centers

Energy-efficient management is key to reducing operational cost and environmental impact in modern data centers. Energy management and renewable energy utilization are strategies to optimize energy consumption in high-performance computing. In any case, understanding the power consumption behavior of physical servers in data centers is fundamental to implementing energy-aware policies effectively. These policies should deal with possible performance degradation of applications to ensure quality of service. This manuscript presents an empirical evaluation of power consumption for scientific computing applications in multicore systems. Three types of applications are studied, in single and combined executions on Intel and AMD servers, to evaluate the overall power consumption of each application. The main results indicate that power consumption behavior depends strongly on the type of application. Additional performance analysis shows that the server load that yields the best energy efficiency depends on the type of the applications, with efficiency decreasing in heavily loaded situations. These results allow formulating models to characterize applications according to power consumption, efficiency, and resource sharing, which provide useful information for resource management and scheduling policies. Several scheduling strategies are evaluated using the proposed energy model over realistic scientific computing workloads. Results confirm that strategies that maximize host utilization provide the best energy efficiency.


Introduction
Data centers are key infrastructures for developing and executing industrial and scientific applications. In the last decade, data centers have become highly popular for providing storage, computing power, hosting, middleware software, and other information technology services, available to researchers with ubiquitous access [1]. Energy efficiency of data centers has become one of the main concerns in recent years, having a significant impact on monetary cost, environment, and guarantees for service-level agreements (SLA) [2].
The main sources of power consumption in data centers are the computational resources and the cooling system [3]. Regarding power consumption of the computational resources, several techniques for hardware and software optimization can be applied to improve energy efficiency.
For example, software characterization techniques [4] are applied to determine features that are useful to analyze software behavior. This behavior analysis is, in turn, an input for analyzing and improving power consumption [5].
Energy consumption models are useful synthetic tools for analyzing issues related to energy efficiency of computing infrastructures, e.g., by predicting the power consumption for a given workload. Information resulting from the analysis is often used by decision-makers to make technical, monetary, or environmental decisions regarding the computing infrastructure. Energy models are based on the relation between power consumption and resource utilization and can be classified as hardware- or software-centric [2].
Models that follow complex approaches, i.e., modeling the contribution of every hardware component and/or using machine learning methods for prediction, are accurate, but they demand significant design and implementation effort. On the other hand, simple models, i.e., those based only on overall server consumption and CPU utilization, demand significantly lower design and implementation effort and can produce fairly accurate results. In the reviewed related work, the use of simple models has been predominant, especially until 2013 (see, for example, Dayarathna et al. [2] and Kurowski et al. [6]).
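As an illustration of such a simple model, the widely used linear relation between CPU utilization and overall server power can be sketched as follows (the function name and the idle/peak values are illustrative assumptions, not parameters taken from the cited works):

```python
def linear_power_model(utilization, p_idle, p_max):
    """Estimate server power (W) from CPU utilization in [0, 1].

    Simple model: idle power plus a share of the dynamic range
    (p_max - p_idle) proportional to the utilization level.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be in [0, 1]")
    return p_idle + (p_max - p_idle) * utilization

# Illustrative values: a host idling at 180 W and peaking at 320 W
print(linear_power_model(0.5, p_idle=180.0, p_max=320.0))  # 250.0
```

Despite ignoring memory, disk, and workload type, such models often track the overall consumption reasonably well, which explains their predominance in the early literature.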
Simulations are widely utilized for evaluating performance and energy models, scheduling techniques, and other features of the execution of scientific workloads on real infrastructures [7]. Using simulations, identical scenarios can be executed several times in order to perform statistical analysis. Moreover, simulators avoid the direct utilization of expensive hardware and allow a significant reduction in the execution time of the experiments [8]. Due to the aforementioned reasons, cloud simulation is a key component of the research in the area. Many cloud simulators with different characteristics and specializations have been proposed in the literature [9]. CloudSim is a simulation toolkit widely used in the literature. This tool allows modeling and simulating cloud computing infrastructures, provisioning environments, and applications [10]. In this work, a custom version of CloudSim [11] is extended by including new energy models, built considering the empirical power consumption evaluation, and different allocation policies.
In this line of work, this research focuses on the characterization of power consumption for applications over modern multicore hardware used in scientific computing platforms. Such characterization is useful for studying and understanding energy efficiency in data centers and for designing energy-efficient scheduling strategies. Three synthetic benchmarks are studied over two physical servers from a real High Performance Computing (HPC) platform, registering their power consumption with a power meter device. Furthermore, the experimental analysis studies the power consumption of different applications sharing a computing resource via simultaneous execution. The proposed study is highly relevant for modern data centers and HPC infrastructures that execute many applications, with an associated impact on both the energy efficiency and the quality of service (QoS) offered to the users of the platform. Furthermore, new energy models are built by applying polynomial regression on the empirical data. The new models are applied in simulations of data center operation for predicting the power consumption, and several scheduling strategies are evaluated in simulations performed considering real servers and scientific computing workloads.
The initial questions that guided the research include the following: What is the relationship between the server load and its power consumption? Does power consumption differ when tasks are intensive in different computing resources? In that case, is it possible to save energy by executing tasks on the same server? What types of tasks is it convenient to execute together? To what extent is it possible to combine tasks without losing energy efficiency due to performance degradation caused by resource usage conflicts? What is the behavior of the power consumption at the critical utilization level (about 100% of server capacity)?
This research focuses on answering the aforementioned questions using an empirical approach. The experiments are designed to cover the domain of the problem and gather relevant information to process it in later stages using appropriate computational techniques. The main contributions of this work are:
1. A cutting-edge review of related works regarding power characterization, modeling, and energy-aware scheduling in cloud computing and supercomputing systems, together with a thorough review of available cloud simulation tools, considering the advantages and disadvantages of each one.
2. An empirical study of the three computing resources that most contribute to power consumption (CPU, memory, and disk). The study considers isolated and combined resource utilization, and different levels of server load. Two high-end multicore servers (AMD and Intel architectures) are analyzed.
3. An empirical study of performance degradation in multicore servers regarding the server load and the computing resource type.
4. Several energy models built from experimental power consumption data using computational intelligence techniques. Each model considers AMD and Intel architectures. Also, relevant metrics to assess the quality of each model are presented.
5. An energy evaluation, through simulations, of six scheduling strategies, considering realistic workloads and the developed models.
6. Research following the reproducible/replicable paradigm, using a Jupyter Notebook that shows in a clear and understandable manner the data processing from raw data. The reproducible/replicable paradigm allows reducing errors and adding new experimental data quickly. Also, the results and claims can be verified or extended by other researchers.
This manuscript summarizes the research developed in the thesis "Empirical characterization and modeling of power consumption and energy aware scheduling in data centers" (Universidad de la República, Uruguay, 2019). Part of the thesis content was previously published in a journal article [12]. This manuscript includes new scientific content not previously published: i) two energy models (for AMD and Intel hosts) that consider three types of computing resources, ii) a comparative analysis between the power consumption of CPU-bound tasks and memory-bound tasks, iii) an analysis of the improvements of scheduling strategies with respect to business-as-usual planning, iv) four unpublished graphical analyses (Figures 5, 18, 19, and 20), and v) three unpublished numeric reports (Tables 1, 18, and 19).
The manuscript is organized as follows. Section 2 reviews related works on energy characterization in multicore systems and simulators for energy-aware data centers. Section 3 presents the road map of the research: the different stages, their inputs, and their outcomes are introduced, and the proposed methodology for energy characterization, the benchmarks, and the physical setup for the experiments are described. The experimental evaluation of power consumption and performance of the different benchmarks is reported and discussed in Section 4. Section 5 describes the details of the power consumption and performance models, built using polynomial regression. The quality of these models is assessed and compared using relevant metrics. In addition, several scheduling strategies for data centers, based on well-known heuristics, are presented; the strategies were compared according to their energy efficiency through a simulation tool, using realistic workloads. Finally, the conclusions and main lines for future work are formulated in Section 6.

Related Work
This section reviews the related literature in two parts. The first part presents works related to the characterization and modeling of power consumption of servers and to energy-aware scheduling. The second part presents a comprehensive review of the state of the art regarding cloud simulation tools.
Regarding power characterization, modeling, and scheduling, Iturriaga et al. [13] studied the multiobjective optimization problem of optimizing power consumption and execution time in heterogeneous computing systems, considering uncertainty. Specific versions of well-known heuristics were proposed for scheduling on realistic scenarios, applying the power consumption model defined by Nesmachnow et al. [14] and considering only CPU-bound workloads. A model for uncertainty in power consumption was determined through empirical evaluations using three CPU-bound benchmarks. Regarding scheduling results, online heuristics computed better schedules than offline approaches. Results also confirmed that uncertainty has a significant impact on the accuracy of the scheduling algorithms. The power consumption behavior of CPU-bound benchmarks shown by Iturriaga et al. [13] is consistent with the one reported in our research. Moreover, we propose a fully empirical power consumption characterization that also considers two additional types of benchmarks: memory-bound and disk-bound.
Srikantaiah et al. [15] studied workload consolidation strategies for energy optimization in cloud computing systems. An empirical study of the relationship between power consumption, performance, and resource utilization was presented. The experiments were executed on four physical servers connected to a power meter to track the power consumption, and resource utilization was monitored using the Xperf toolkit. Only two resources were considered in the study: processor and disk. The performance degraded for high levels of disk utilization, and variations in CPU usage did not result in significant performance variations. Energy results were presented in terms of power consumption per transaction, resource utilization, and performance degradation. Results also showed that power consumption per transaction is more sensitive to CPU utilization than to disk utilization. The authors proposed a heuristic method to solve a modified bin packing problem in which the servers are bins and the computing resources are bin dimensions. Results reported for small scenarios showed that the power consumption of the solutions computed by the heuristic is within about 5% of the optimal solution. The tolerance for performance degradation was 20%.
Du Bois et al. [16] presented a framework for creating workloads with specific features, applied to compare energy efficiency in commercial systems. CPU-bound, memory-bound, and disk-bound benchmarks were executed on a power monitoring setup composed of an oscilloscope connected to the host and a logging machine to persist the data. Two commercial systems were studied: a high-end system with AMD processors and a low-end system with Intel processors. Benchmarks were executed independently, isolating the power consumption of each resource. Results confirmed that energy efficiency depends on the workload type. Comparatively, the high-end system had better results for the CPU-bound workload, the low-end system was better for the disk-bound workload, and both had similar efficiency for the memory-bound workload. Our work complements this approach by including a study of the power consumption behavior when executing different types of tasks simultaneously on specific architectures for high performance computing.
Feng et al. [17] evaluated the energy efficiency of a high-end distributed system, with a focus on scientific workloads. The authors proposed a power monitoring setup that allows isolating the power consumption of CPU, memory, and disk. The experimental analysis studied single-node executions and distributed executions. In the single-node experiments, results of executing a memory-bound benchmark showed that the total power consumption is distributed as follows: 35% corresponds to the CPU, 16% to physical memory, and 7% to disk. The rest is consumed by the power supply, fans, network, and other components. The idle state represented 66% of the total power consumption. In the distributed experiments, benchmarks that are intensive in more than one computing resource were studied. Results showed that energy efficiency increased with the number of nodes used for execution.
Kurowski et al. [6] presented a data center simulator that allows specifying various energy models and management policies. Three types of theoretical energy models are proposed: i) the static approach, which considers a unique power value per processing unit; ii) the dynamic approach, which considers power levels representing the usage of the processing unit; and iii) the application-specific approach, which considers the usage of application resources to determine the power consumption. Simulation results were compared with empirical measurements over real hardware to validate the theoretical energy models in arbitrary scenarios. All models obtained accurate results (the error was less than 10% with respect to empirical measurements), and the dynamic approach was the most precise. Langer et al. [18] studied the energy efficiency of low-voltage operation in manycore chips. Two scientific applications were considered for benchmarking over a multicore simulator. The performance model considered for a chip was S = a_k(f_i) + b_k, where S is the number of instructions per cycle, f_i is the frequency of the i-th core, and the coefficients a_k and b_k depend on k, the number of cores in the chip. A similar model is used for power consumption. On 25 different chips, an optimization method based on integer linear programming achieved 26% energy savings with respect to the power consumption of the fastest configuration.
There are different opinions in the related literature about the importance of network power consumption. On the one hand, Feng et al. [17] measured the power consumption of each computing resource in an isolated manner using different scientific computing benchmarks, and the network turned out to be in fourth place in terms of power consumption, behind CPU, memory, and disk. Moreover, a recent survey by Dayarathna et al. [2] presented a graphic distribution of power usage by component, based on results from the analysis of a Google data center by Barroso et al. [19], where network power consumption is behind that of the other computing resources. A comparison between two Intel hosts (with Xeon and Atom processors), based on results from Malladi et al. [20], showed that network power consumption is lower than CPU, memory, and disk consumption. On the other hand, Totoni et al. [21] presented a runtime power management approach that saves energy by turning on/off unused or underutilized links during the execution of applications. The authors justified the relevance of the study, regarding the power consumption of the system, based on the use of low-frequency manycore processors in computing systems and the complexity of modern network design. However, the study was not specifically aimed at analyzing scientific workloads.
Regarding cloud simulation, Kurowski et al. [6] proposed DCworms, an event-driven data center simulator written in Java, which allows defining performance models, energy models, and scheduling policies. DCworms is built using GSSIM, a Grid simulator by Bak et al. [8]. Among other features, GSSIM allows simulating a variety of entities and distributed architectures. Regarding power consumption, GSSIM provides several energy models classified as static, resource-load, and application-specific. In the third type of model, the user can specify energy profiles that allow modeling the type of application and the type of resource. Energy information is logged for analyzing the simulation. Calheiros et al. [10] introduced CloudSim, a cloud simulator developed in Java. Among the main advantages of CloudSim are the virtualization layer, which allows defining an abstraction of the different execution environments of the applications in virtual machines, and the support provided for implementing federated clouds, which allows modeling large distributed systems. The simulator provides functions for implementing custom energy models. In addition, new scheduling policies can be included, for instance the strategies oriented to optimizing the power consumption developed in our work. The total power consumption of a simulation can be used to compare the efficiency of scheduling policies. Kliazovich et al. [22] presented GreenCloud, a simulator oriented to energy-aware data centers. The GreenCloud design allows measuring the power consumption of each component of the infrastructure (host, switch, etc.) in a detailed manner. A set of energy efficiency strategies is provided, including DVFS and dynamic shutdown of the computing components. These GreenCloud characteristics allow implementing fine-grained energy-aware scheduling strategies.
Nuñez et al. [23] introduced iCanCloud, a cloud simulator that allows reproducing Amazon EC2 instance types. It also allows customizing the VM brokering, due to the flexible design of its hypervisor model. Hosts can be single-core or multicore, and the software is able to simulate long time periods (years). Regarding storage, local devices, NFS, and parallel storage can be simulated.
Several works in the literature have focused on modeling and characterizing the power consumption of scientific applications. However, to the best of our knowledge, there is no empirical research focused on the inter-relationship between power consumption and CPU, memory, and disk utilization. Also, there is no experimental analysis of critical levels of resource utilization (close to 100%) and their impact on power consumption and performance. This work contributes to this line of research, proposing an empirical analysis of both issues mentioned above. In turn, the review allows identifying several desirable features of the available cloud simulators that are useful for the analysis reported in this research, including support for virtualization, single-core and multicore hosts, and distributed systems, the energy and performance models available, the scheduling policies available, ease of use, and flexible customization. CloudSim includes many of the aforementioned useful characteristics and also has detailed documentation. For these reasons, we selected CloudSim as the simulator to use in the power consumption evaluation reported in this work.

Methodology for power consumption evaluation
This section describes the proposed methodology for power consumption evaluation, the benchmarks and architectures studied, the power evaluation setup, and the design of the experiments.

Overview of the proposed methodology
The design of the experiments was oriented to characterize the power consumption of the most important computing resources from the point of view of energy efficiency: CPU, memory, and disk [19,16,17,13,20]. The importance of network power consumption has increased in recent years, mainly in modern data centers that offer distributed services, which heavily rely on network communications [24,25]. However, network power consumption was not included in the characterization for two main reasons: i) scientific computing applications on multicore architectures, on which this research is centered, do not necessarily make heavy use of the network; in these applications, the multithreaded paradigm allows solving complex scientific computing problems without using the network extensively, since they mainly apply shared-memory communication methods; and ii) the complexity of the experiments, since measuring network power consumption implies considering several components, such as switches, network interfaces, and the network topology. Furthermore, the relevance of network power consumption is subject to debate, as acknowledged in the review of related literature in Section 2.
The main goal of the analysis was to study the holistic behavior of the power consumption of a host, considering the following correlated elements: the utilization level of each resource, the total utilization level of the host, and the types of resources involved. Consequently, experiments were designed to execute synthetic benchmarks that make intensive utilization of different computing resources, which allows capturing the features of different scientific computing applications in multicore computers. Benchmarks were executed both in isolation and combined, at different levels of resource utilization. Analyzing the power consumption of hosts at levels close to maximum utilization (100%) has received little consideration in the related bibliography, so it is included as one of the main issues to study. Thus, all types of experiments performed consider the critical level of utilization.
Since the downside of reducing energy consumption is the degradation of system performance, it is necessary to complement the characterization of energy consumption with performance experiments. In addition to the relationship between system load and power consumption, the relationship between system load and execution time is studied. This way, it is possible to make trade-off decisions according to the context demands, such as energy prices, service levels, etc. In the performance experiments, the same benchmarks were executed with a fixed computational effort as the stopping criterion, instead of the wall-clock time considered in the power consumption experiments. For example, the CPU-intensive benchmark stops when the loop counter used to find prime numbers exceeds 20,000.

Benchmarks for power consumption characterization
Simple benchmarks (i.e., benchmarks that are intensive in a single computing resource) were used for the analysis, in line with several articles reviewed in the literature on power consumption evaluation. For example, Iturriaga et al. [13] considered a power consumption evaluation of LINPACK [26], a well-known CPU-intensive benchmark used for ranking supercomputers in the TOP500 list. The same benchmark was used by Kurowski et al. [6], together with Abinit [27], a software package that calculates properties of materials within density functional theory (CPU- or memory-intensive depending on its parameter configuration), NAMD [28], a CPU-intensive software for biomolecular system simulation, and CPUBURN [29], a software tool for stressing the CPU. The simple-benchmark approach was followed since the analysis is oriented to characterizing power consumption with respect to the utilization of each computing resource, both in isolation and when combining the utilization of several resources. For this reason, specific programs with intensive utilization of each of the three studied computing resources (CPU, memory, disk) were needed. The chosen synthetic benchmarks allow isolating the utilization of each resource in order to perform the characterization.
A set of benchmarks included in the Sysbench toolkit [30] were used in the analysis and characterization of power consumption. Sysbench is a cross-platform software written in C that provides CPU, memory, and disk intensive benchmarks for performance evaluation. The components used in the experiments include a CPU-bound benchmark, a memory-bound benchmark, and a disk-bound benchmark. The main details of each benchmark are described next.
CPU-bound benchmark. The CPU-bound benchmark is an algorithm that calculates π(n) (the prime-counting function) using a backtracking method. The algorithm in the benchmark includes loops, square root, and modulo operations, as described in Algorithm 1.

Algorithm 1: CPU-bound benchmark program (a loop that tests candidate numbers with modulo operations, bounded by the square root, and increments a counter for each prime found).
Memory-bound benchmark. The memory-bound benchmark is a program that executes write operations in memory, as described in Algorithm 2, where the buf variable is an array of integers. The cells of the array are overwritten with the value of tmp until the last position of the array, i.e., the value of the end variable.
Disk-bound benchmark. The disk-bound benchmark is a program that reads/writes content in files. Read or write requests are generated randomly and executed until a given number of requests (MaxReqs) is reached, as described in Algorithm 3.
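The behavior of the three benchmark types described above can be sketched with minimal Python analogues (illustrative re-implementations, not the actual Sysbench C code; all limits, sizes, and names are arbitrary assumptions):

```python
import os
import random

def cpu_bound(limit):
    """Count primes up to limit by trial division (loops, sqrt bound, modulo)."""
    count = 0
    for n in range(2, limit + 1):
        is_prime = n == 2 or n % 2 != 0
        c = 3
        while is_prime and c * c <= n:   # test divisors up to sqrt(n)
            if n % c == 0:
                is_prime = False
            c += 2
        if is_prime:
            count += 1
    return count

def memory_bound(size, tmp=7):
    """Overwrite every cell of a buffer with the value tmp, up to the end."""
    buf = [0] * size
    for i in range(len(buf)):
        buf[i] = tmp
    return buf

def disk_bound(path, max_reqs, block=4096):
    """Issue random read/write requests on a file until MaxReqs is reached."""
    with open(path, "r+b") as f:
        for _ in range(max_reqs):
            f.seek(random.randrange(0, block * 4))
            if random.random() < 0.5:
                f.read(block)
            else:
                f.write(os.urandom(block))
```

Each sketch stresses a single resource, which is the property that makes the isolated and combined power measurements interpretable.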

Multicore hosts and power monitoring setup
Experiments were performed on high-end servers from Cluster FING, the HPC platform of Universidad de la República, Uruguay [31]. Two hosts were chosen according to their features and availability: an HP ProLiant DL385 G7 server (2 AMD Opteron 6172 CPUs, 12 cores each, 72 GB RAM) and an HP ProLiant DL380 G9 server (2 Intel Xeon E5-2643v3 CPUs, 12 cores each, 128 GB RAM). The considered hardware testbed covers the two most important architectures for high-end computers and high performance computing data centers nowadays. Table 1 presents the specification of both hosts.
To measure the power consumption of a server, two approaches are found in the literature: software-based metering (i.e., in-band) and hardware-based metering (i.e., out-of-band) [2,32]. Software-based meters estimate the power consumption of some server components by consulting internal counters [33]. An example of a software-based meter is likwid [34], which reads the Running Average Power Limit (RAPL) counter of Intel architectures. On the one hand, software-based approaches have low cost and high scalability; on the other hand, the power measurement is an estimate and is partial, since the CPU counters do not account for the consumption of server components such as fans, the motherboard, and others. Besides, counters and sensors are not available in many high-end server models, so software-based approaches are restricted to one group of servers. Hardware-based approaches [35,36] (the server is connected to an external power meter, which is connected to the power outlet) have the advantage that the power measurement is independent of the server model. Also, since the total energy consumption is considered, the measurements are accurate.
On the downside, hardware-based approaches have a greater economic cost than software-based measurement and cannot be scaled easily, since including a new host for power monitoring involves buying and installing new power meters. In addition, installing power meters is not possible if physical access to the IT equipment is restricted.
Because this research focuses on the overall power consumption of the server and studies its holistic behavior, the use of external power meters is more appropriate than software-based meters. Figure 1 presents the power monitoring setup applied in this work. The presented setup allows reducing measurement noise, since the processing that is not related to the benchmark executions (polling, log writing, etc.) is executed on an external machine. The applied setup is similar to the one used in related works [13,16,15]. Benchmarks were executed on a host connected to the power source via a Power Distribution Unit (PDU) to register the instantaneous power consumption. The PDU used is a CyberPower PDU20SWHVIEC8FNET model, which allows accessing the power consumption indicators through a web service. On a secondary machine, a polling daemon logged the data for post-processing.

Design of experiments
Following the general remarks on the key steps in the design of experiments suggested by [37], the design of the experiments in this work is a consequence of the initial questions about the factors that affect the power consumption of multicore hosts (see Section 1). In the experiments, the average power consumption (PC) of each host was computed considering the measures obtained in 20 independent executions of each benchmark (in the single benchmark experiments) or combination of benchmarks (in the combined benchmark experiments), to obtain statistically significant values. The average and standard deviation of PC are reported for each benchmark in the experimental analysis. The average idle power consumption (IC), i.e., the average consumption of the host when it is not executing any workload, is computed considering the measures of 20 independent executions of a program with null operations. Both PC and IC are used to compute the effective consumption (EC) for a given benchmark or workload as EC = PC − IC.
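The computation of PC, IC, and EC described above can be sketched as follows (the helper name and the short sample lists are illustrative; in the experiments each list holds the 20 measured values):

```python
from statistics import mean, stdev

def effective_consumption(pc_samples, ic_samples):
    """Average power (PC), idle power (IC), and effective power EC = PC - IC.

    Each argument is a list of power measurements (W) from the independent
    executions of a benchmark and of the null (idle) program, respectively.
    """
    pc = mean(pc_samples)
    ic = mean(ic_samples)
    return {"PC": pc, "PC_std": stdev(pc_samples), "IC": ic, "EC": pc - ic}
```

For instance, measurements averaging 200 W under load and 57 W idle yield EC = 143 W, matching the worked example given later for the Intel host.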
Following an incremental approach to the experimental design that facilitates understanding, the experiments are described in three stages: single experiments, combined experiments, and performance experiments. The stages are described in the next paragraphs.
First stage: single benchmarks. In a first set of experiments, benchmarks were evaluated isolated (i.e., independently from each other, analyzing only one resource). Utilization level (UL) is defined as the percentage of processors being used regarding the total number of processors in the host.
Each instance of the memory-bound benchmark was configured to use (100/N) percent of the available memory, where N is the number of processors of the host. This configuration allows using 100% of the memory in full utilization mode. Each instance of the disk-bound benchmark was configured to use 4 GB of disk space in the AMD experiments and 2 GB in the Intel experiments. These disk sizes were chosen taking into account the available disk capacity in each host. Instances were executed and monitored for 60 seconds. This duration ensures the steady state of the benchmark execution, avoiding biasing the measurements with the initialization stages. Third stage: performance evaluation. Finally, the impact on performance was analyzed. One of the most relevant metrics in performance analysis of applications is the makespan, defined as the time spent from the moment the first task in a batch or bag of tasks begins execution to the moment the last task is completed [38]. In the performance evaluation experiments, the makespan when executing multiple applications at the same time is compared with the makespan of single executions. In all cases, the average makespan values computed over 20 independent executions of each benchmark or combination of benchmarks are reported, in order to provide statistical significance.
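The makespan metric used in the performance experiments can be computed from per-task timestamps as in the following sketch (the (start, finish) tuple representation is an assumption for illustration):

```python
def makespan(tasks):
    """Makespan of a bag of tasks: time from the first start to the last finish.

    tasks is a list of (start_time, finish_time) pairs, in seconds.
    """
    if not tasks:
        return 0.0
    first_start = min(start for start, _ in tasks)
    last_finish = max(finish for _, finish in tasks)
    return last_finish - first_start

# Three tasks starting at 0, 2, and 5 s and finishing at 10, 8, and 12 s
print(makespan([(0, 10), (2, 8), (5, 12)]))  # 12
```

Comparing this value for combined executions against single executions quantifies the performance degradation caused by resource sharing.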
Summarizing, the design of experiments aims at studying the factors that affect the power consumption of servers, such as server load and the type of computing resources used by tasks, among others. The independent variable is the UL, i.e., the percentage of processors used with respect to the total number of processors on the server. In combined experiments, UL is a vector where each entry represents the percentage of processors being used by each type of benchmark. In power experiments, the dependent variable is the power consumption of the server; in turn, in performance experiments the dependent variable is the completion time (makespan) of the bag-of-tasks job.

Power and performance evaluation results
This section presents the results obtained in the experiments that studied power consumption. Details of the idle power consumption evaluation are presented in Subsection 4.1. Subsection 4.2 shows and discusses the results of the power consumption evaluation considering benchmarks executed independently. Then, the results of combined benchmark executions are presented in Subsection 4.3.

Idle power consumption evaluation
The average idle power consumption in both the AMD and Intel hosts was calculated by performing 20 independent executions of a null program, i.e., a program that executes a sleep function for 60 seconds. The number of independent executions was chosen to obtain results with statistical validity. The average values obtained for idle power consumption (± standard deviation) were 183.4±1.3 W for the AMD host and 57.0±0.9 W for the Intel host. These average values are considered as the idle power consumption of the corresponding host in all the experiments reported in this section. For example, if the measured overall consumption of one experiment on the Intel host was 200 W, the effective power consumption of the experiment was calculated as 200 W minus 57 W, that is, 143 W.

Results of single benchmark executions
CPU-bound benchmark. Table 2 reports PC and EC values for the CPU-bound benchmark on the Intel and AMD hosts. The PC is reported as the average plus/minus the standard deviation over the independent executions of the experiment. The PC results indicate that the AMD host demands more power than the Intel host for the CPU-bound workload. The PC difference between hosts is in the interval [68 W, 73 W], which means that the AMD host consumes approximately 60% more power than the Intel host. However, it is not possible to draw conclusions about the energy efficiency of the hosts without a performance analysis.
According to the peak PC reported in Table 2 (264.1 W on AMD and 194.6 W on Intel, both occurring at UL 100%), the IC represents 69% of the maximum power consumption on AMD and 29% on Intel. Figure 3 presents a comparison of EC values on both hosts, which shows an average EC difference of 56 W between the Intel and AMD hosts for all ULs. However, this difference is not relevant to evaluate which host consumes more power, because the absolute value of EC is not representative of the overall power consumption of the host. In the graphic analysis, the relevant indicator is the variation of EC with respect to the UL, and also the difference in EC variation between hosts (i.e., a comparison of their derivatives). The almost linear behavior of EC in Figure 3 indicates that EC is proportional to the UL. Furthermore, the curves for Intel and AMD are almost parallel, indicating that the UL of CPU-bound applications has a similar behavior on both hosts. EC on the Intel host shows a remarkable increase when moving from 0 to 12.5% UL, which is not observed for the other ULs. This increase is explained by the dynamic handling of chip power, waking up components or increasing their voltage according to usage demand, which is notorious on the Intel architecture. At the critical UL (100%), no different behavior regarding power consumption is observed, which allows concluding that, for CPU-bound workloads, resource conflict does not imply an increase in power consumption.
Memory-bound benchmark. Table 3 reports the PC and EC values for the memory-bound benchmark on the Intel and AMD hosts. The comparison between the PC values of the memory experiments and the CPU experiments (presented above) allows concluding that memory use has an impact on power consumption, because at the same UL and host, the power consumption of the memory experiments is greater than that of the CPU experiments. As seen in the CPU experiments, the AMD host consumes more power than the Intel host. The biggest difference in PC (89.6 W) occurs at the minimum UL (12%) and the lowest (45.7 W) at 100%, indicating that Intel has an efficient management of the energy related to memory usage. Figure 4 presents a graphic comparison of the EC values on both hosts. Results show a significant increase of EC with regard to the CPU-bound executions for all ULs (104% for the AMD host and 36% for the Intel host, on average). A logarithmic behavior is observed for both PC and EC, which does not occur in the CPU-bound case. This behavior may be mainly due to the bottleneck in the access to the main memory, which reduces CPU usage. No significant increase is detected at high/critical ULs, possibly because of effective management of resource contention by the operating system when solving conflicts over access to shared resources.
In order to analyze the PC difference between the CPU- and memory-bound workloads, Figure 5 presents ∆PC, defined as the additional percentage of power that the memory experiments consume with respect to the CPU experiments (i.e., ∆PC = (PC_MEM − PC_CPU) × 100/PC_CPU). Since the CPU-bound benchmark is designed to consume more CPU cycles than the memory-bound benchmark, ∆PC is related to the power consumption of the memory usage. The graphic comparison of ∆PC allows observing a remarkable increase of ∆PC on Intel at medium and high ULs, when compared with low ULs of the same host. On AMD, there is no notorious ∆PC increment. This difference between hosts suggests that the power management on the Intel architecture is better than on the AMD architecture, since the power consumption increases according to the percentage of used memory (the percentage of the system memory used by the benchmark is equal to the UL; for example, 50% of the host total memory is used at UL 50%).

Disk-bound benchmark. Table 4 reports PC and EC values for the disk-bound benchmark. The PC is almost constant and notoriously lower than in the CPU and memory experiments. Figure 6 presents a comparison of EC values on both hosts. The maximum EC variation through ULs is 4 W on Intel and 2 W on AMD. The reported low power variations indicate that disk usage has a low impact on power consumption in comparison with CPU and memory.

CPU- and memory-bound benchmarks. Table 5 reports PC and EC for the simultaneous execution of the CPU- and memory-bound benchmarks on the AMD host. The results show that the peak PC on AMD is 312.7 W and it occurs at UL (25%, 75%). This peak is the maximum PC registered on the AMD host considering all experiments in this research, indicating that the IC on AMD (183.4 W) represents 59% of the total PC of the host. Symbol ↑ indicates that the EC of the combined benchmarks is higher than the sum of the ECs of each benchmark executed independently, i.e., the combined execution is less efficient than the independent execution.
Symbol ↓ indicates the opposite, that is, the combined execution is more efficient. Symbol = indicates that the values are equal, considering a threshold of 1 W of difference. Column ∆EC reports the difference between the EC of the combined execution and the sum of the ECs of the independent executions (in W). The ∆EC values reported in Table 5 show that the combined execution on the AMD host allows reducing the EC compared to independent executions. At ULs with high memory use, the independent executions present lower EC values than the combined ones. The 3D graph in Figure 7 presents the EC values of the combined execution of CPU and memory on the AMD host. The color scale of the graph allows observing that the dark sectors, which correspond to a high EC, are accentuated near the memory axis. Table 6 reports PC and EC for the simultaneous execution of the CPU- and memory-bound benchmarks on the Intel host. The peak PC (255.3 W) corresponds to UL (50%, 50%); thus, the IC of the Intel host represents 22% of the overall PC. This low percentage is due to the Intel host reducing the IC when the host is not used (at UL almost 0%), while at other ULs the consumption increases (as observed and discussed in Section 3, a significant increment of PC occurs at the first UL, which does not occur at later ULs). The combined executions of the CPU and memory benchmarks on the Intel host achieve the best results with regard to EC among the pairwise combined experiments. In this case, a significant gain (around 47 W) is observed for all ULs. The highest gain occurs when CPU use and memory use are simultaneously low. The 3D graph in Figure 8 reports EC values of the combined execution of CPU and memory on the Intel host. The graph allows observing the difference in derivative between UL 0 and the other ULs, and the dark sectors near the memory axis. Results of the combined CPU and memory benchmarks show that, for the Intel host, combined executions reduce EC compared to single executions.
On the AMD host, however, combined executions of these types of benchmarks do not show such a notable improvement when compared with single executions.
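The ↑/↓/= classification and the ∆EC column can be sketched as follows (function and variable names are ours; the 1 W threshold is the one stated in the text):

```python
def compare_combined(ec_combined, ec_singles, threshold_w=1.0):
    """Return (delta_ec, symbol): delta_ec = EC of the combined execution minus
    the sum of the ECs of the independent executions. '=' within the 1 W
    threshold, '↑' if combining is less efficient, '↓' if it is more efficient."""
    delta = ec_combined - sum(ec_singles)
    if abs(delta) <= threshold_w:
        return delta, "="
    return delta, "↑" if delta > 0 else "↓"

# e.g., combined 150 W vs. singles 100 W + 60 W -> (-10, '↓'): combining saves energy
```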
CPU- and disk-bound benchmarks. Table 7 reports PC and EC values for the simultaneous execution of the CPU- and disk-bound benchmarks on the AMD host. ∆EC values indicate that the combined execution shows no significant gain or loss with respect to the independent execution. The 3D graph in Figure 9 presents EC values on the AMD host for different combinations of CPU and disk ULs. The graph shows that EC values only change significantly in the direction of the CPU axis, which indicates that the disk-bound load has no significant impact on EC. Table 8 reports PC and EC values for the simultaneous execution of the CPU- and disk-bound benchmarks on the Intel host. Results show an average gain of 10 W for the combination with respect to the independent execution. The maximum gain is 13.06 W and it occurs at UL (25%, 75%). The 3D graph in Figure 10 presents EC values on the Intel host for different combinations of CPU and disk ULs. A result similar to the one obtained on AMD is observed, as EC changes significantly only in the CPU axis direction. The aforementioned remarkable EC increase when moving from 0 to the next UL is notorious in the CPU axis direction, but not in the disk axis direction. These results indicate that the increment is related to CPU use.

Memory- and disk-bound benchmarks. Table 9 reports PC and EC values for the simultaneous execution of the memory- and disk-bound benchmarks on the AMD host. The combined executions consume less energy at ULs (50%, 25%) and (50%, 50%), i.e., when the memory-bound load is 50%. At the other ULs, there are no significant improvements in power consumption. The 3D graph in Figure 11 presents EC values on the AMD host for the memory- and disk-bound benchmarks executing in combination. The graph allows observing the same behavior noted in the analysis of Figure 10 regarding the low increment of EC in the disk axis direction, which confirms the almost negligible power consumption of the disk-bound benchmark.
Table 10 reports PC and EC values for the simultaneous execution of the memory- and disk-bound benchmarks on the Intel host. The results show that single executions consume less EC than combined ones, except for those experiments with a high load of the memory-bound benchmark. The 3D graph in Figure 12 presents EC values on the Intel host for the memory- and disk-bound benchmarks executing in combination. A result similar to the one obtained on AMD is observed, where the disk-bound benchmark has no impact on EC.

CPU-, memory-, and disk-bound benchmarks. Table 11 reports the PC and EC values obtained when executing the CPU-, memory-, and disk-bound benchmarks combined, on the AMD and Intel hosts. Table 11 shows that the combined execution on the AMD host has a higher EC compared to the independent executions at ULs (25%, 50%, 25%) and (50%, 25%, 25%). This behavior indicates that the three-way combined execution is not efficient for high loads of the CPU-bound and memory-bound benchmarks. However, on the Intel host, combined executions reduce EC compared to independent executions for all ULs. For independent executions on the Intel host, the difference in PC between consecutive ULs is larger at the first UL (12.5%). Because of this difference, combined executions achieve better efficiency with respect to independent executions on the Intel host.

Performance evaluation
This subsection analyzes the performance evaluation experiments performed for each benchmark. Table 12 reports the makespan of the CPU-bound benchmark and Figure 13 presents a graphic comparison for both hosts. Results show that increasing the UL does not significantly impact the completion time, due to the absence of resource competition. However, both hosts present a slight degradation at UL 100%, possibly due to conflicts with operating system processes. Table 13 reports the makespan of the memory-bound benchmark and Figure 14 presents a graphic comparison for both hosts. Performance degrades on AMD: there is a gap of 400 seconds between the lowest and the highest UL. For Intel, the difference is only 48 seconds. The difference in gaps is possibly explained by the specific memory features of each host, such as cache size and transfer speed: the Intel host processor has a significantly larger L3 cache than the AMD host processor (20 MB vs. 12 MB) and a faster bus clock speed (4,800 MHz vs. 3,200 MHz). Table 14 reports the makespan of the disk-bound benchmark and Figure 15 presents a graphic comparison for both hosts. The disk-bound case presents a significant performance degradation when increasing the UL, compared with the other benchmarks, because concurrency has a bigger impact on disk access than on memory or CPU access.

Energy efficiency analysis
This subsection analyzes the energy efficiency from the collected measurements. The energy efficiency metric (eff), defined in Equation 1, allows comparing the PC results for different ULs and hosts, while taking into account the execution time of an application. For comparing the energy efficiency of each host, the metric focuses on PC because it represents the real power consumption of each host. The lower the metric value, the higher the energy efficiency of the host (lower power consumption and lower makespan).

eff = (PC × makespan) / (number of instances × 3600)    (1)

Results from the study indicate that the CPU-bound benchmark is more efficient at high ULs, the memory-bound benchmark is more efficient at medium ULs, and the disk-bound benchmark is more efficient at low ULs. These results hold for both hosts. Overall, the Intel host is more efficient than the AMD host for all ULs and all types of benchmarks. This observation is coherent with the reported comparison of these processors [39], where the Intel host processor is presented as 10 times more efficient than the AMD host processor. However, our energy analysis is more comprehensive, because it considers more components of each host, not only the processor. Finally, the critical UL (100%) is less efficient than the high-medium UL (87.5%) in all cases, except for the disk-bound benchmark executions. For disk-bound benchmarks there are only small variations in the PC values between ULs; thus, the proposed efficiency metric improves when the number of instances increases. Table 15 reports the average energy efficiency for all ULs and both hosts. The most efficient UL for a given host and type of benchmark is presented in bold.
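Equation 1 can be written directly as a small function (a sketch; the division by 3600 converts watt-seconds to watt-hours):

```python
def efficiency(pc_w, makespan_s, n_instances):
    """Energy-efficiency metric (eff) of Equation 1, in watt-hours per
    benchmark instance; lower values indicate better efficiency."""
    return pc_w * makespan_s / (n_instances * 3600)

# A host drawing 200 W for one hour while running two instances:
eff = efficiency(200, 3600, 2)  # 100.0 Wh per instance
```

Note that, at (almost) constant PC and makespan, adding instances lowers the metric, which is why the weakly affected disk-bound workload keeps improving as the number of instances grows.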

Concluding remarks
The empirical study was aimed at extracting the characteristics of the power consumption of multicore hosts and their relationship with the type of workload, considering all the components involved holistically. Single executions allowed concluding that EC varies in different ways depending on the computing resource considered.
In particular, EC is directly proportional to the UL for the CPU-bound workload. For the memory-bound workload, the single executions showed a deceleration in EC variation as the UL increases. The combined execution of different types of benchmarks indicated that EC is reduced with respect to single executions through task consolidation, taking advantage of the optimum UL (regarding EC) of computing resources. Results of the combined experiments with two benchmarks showed a remarkable gain on the Intel host, where the difference in EC reached 57.65 W at UL (25%, 25%). This gain represents 32% of the EC of the single executions. The three-way combined executions presented an EC gain on the Intel host for all ULs with respect to single executions. However, on the AMD host, the EC is equal at (25%, 25%, 25%) and at (25%, 25%, 50%), and it is greater than the EC of single executions at (50%, 25%, 25%) and at (25%, 50%, 25%). On the one hand, the performance analysis indicated that the CPU-bound workload showed no degradation as the UL increases. On the other hand, results showed performance degradation for the memory-bound and disk-bound workloads, especially on the AMD host.
The energy efficiency study, which combines the results of the power consumption and performance experiments, showed that the optimum UL depends on the type of workload: the CPU-bound workload presented the best efficiency at high UL, the memory-bound workload at medium UL, and the disk-bound workload at low UL. The difference between host efficiencies observed in the results was consistent with the published specifications of the hosts. Numerical results of the presented power characterization can be used to build power consumption models. In addition, it is possible to consider the conclusions of the analysis as a guide to develop energy-aware scheduling strategies based on heuristics. The next section presents the proposed power consumption models, several scheduling strategies, and the simulations performed for their evaluation.

Energy model and simulation
This section presents the utilization of the power and performance characterization results to build a wide variety of models, which are used in simulations to evaluate energy-aware scheduling strategies.

Energy model
According to the literature reviewed, power consumption models can be classified into three categories: static, where the consumption of the host is a constant value; dynamic, where the value of the power consumption depends on the utilization level of the host; and application-specific, where the consumption is associated with the characteristics of each application [6]. Models with higher complexity can be built considering the competition for computing resources between applications running at the same time on a host [40].
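The three categories can be illustrated with a toy sketch (all constants below are illustrative placeholders, not the fitted values of this work):

```python
IDLE_W = 57.0  # assumed idle consumption of a host (W)

def static_pc():
    """Static model: host consumption is a constant value."""
    return 200.0

def dynamic_pc(ul):
    """Dynamic model: consumption depends only on the utilization level ul in [0, 1]."""
    return IDLE_W + 140.0 * ul

def app_specific_pc(ul, kind):
    """Application-specific model: each workload class has its own profile."""
    slope_w_per_pct = {"cpu": 0.83, "mem": 1.37, "disk": 0.05}
    return IDLE_W + slope_w_per_pct[kind] * (ul * 100)
```

The models used in this work belong to the third category, further refined per workload class as described below.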
The proposed power consumption models are versions of application-specific models. They are based on partitioning the applications into equivalence classes according to their bounding computing resource (two applications are of the same class if they are intensive in the same computing resource). The identified variables that affect the energy consumption are: the computing resource, the load of the computing resource, and the total load of the host. These variables are synthesized in the definition of UL, expressed as a vector (x_1, x_2, ..., x_n). The proposed models seek the form of a function f: [0,1]^n → R, f(x_1, x_2, ..., x_n). The models built from independent executions focus on the effective consumption (EC) because it contains more synthetic information than the overall power consumption (PC), since PC is just EC plus a constant.
To build the energy models, experimental data were adjusted by polynomial regression [41]. Four functions were studied: linear, piece-wise, quadratic, and cubic. Two intervals were considered for the piece-wise model: [0,50] and [50,100]. The complete data processing and modeling is available in a Jupyter Notebook at https://www.fing.edu.uy/~jmurana/msc/. The analysis follows the paradigm of reproducible research [42]. All raw data, processing tools, and processed results are provided for verification by researchers. Complete results and design decisions are explained and the analysis process is clearly documented; thus, the published Jupyter Notebook allows reproducing the complete data processing and modeling, making it easy to correct errors as well as to extend the research by considering new experimental data and/or new models, without significant effort. This is an important contribution of our research, since the reproducible/replicable approach is not often found in related research areas [43].
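As an illustration of the fitting step (a pure-Python least-squares line; the published notebook uses standard regression tooling), using synthetic EC data with slope 0.83:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Piece-wise variant on [0,50] and [50,100]: fit each interval separately.
ul = [0, 12.5, 25, 37.5, 50, 62.5, 75, 87.5, 100]
ec = [0.83 * x for x in ul]          # synthetic, perfectly linear EC values
slope, intercept = fit_line(ul, ec)  # slope ≈ 0.83, intercept ≈ 0
```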
Two relevant metrics to assess the quality of statistical models were applied to analyze the results: the coefficient of determination (R-squared, R²) and the adjusted R-squared (adjusted R²). Both metrics evaluate the capability of the studied model to forecast future values in a given temporal series. Adjusted R² is an extension of R² proposed to avoid spurious increases when using a larger number of independent variables [44].
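Both metrics are straightforward to compute from the residuals; a minimal sketch:

```python
def r2(y, y_hat):
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

def adjusted_r2(y, y_hat, p):
    """Adjusted R^2 for p independent variables: penalizes extra regressors,
    avoiding the spurious increase of plain R^2."""
    n = len(y)
    return 1 - (1 - r2(y, y_hat)) * (n - 1) / (n - p - 1)
```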
Graphs in Figure 16 present the four EC models for the CPU-bound workload, on the AMD and Intel hosts. Graph legends show the equations for each curve, where x is the independent variable, i.e., UL. The ancillary variable in the piece-wise model is zero when x < 50 and x − 50 otherwise. A linear behavior of the models is observed. Besides, the quadratic and cubic models have non-linear coefficients close to zero, which also indicates linear dependency. Both hosts have a similar rate of increase. For example, in the linear models, the slope is 0.8 on the AMD host and 0.827 on the Intel host. This similarity indicates that the proportional dependency between UL and EC is independent of the architecture, for CPU-bound workloads. Figure 17 presents the EC models for memory-bound workloads on the AMD and Intel hosts. The comparison of the slopes of the linear models shows that the EC on Intel increases faster than on AMD, by a factor of 1.6 (1.368/0.846). The piece-wise graph for AMD shows a noticeable decrease of the first derivative at UL 50%, possibly due to the degradation of performance caused by the waits in the access to memory, which implies less use of memory and, consequently, less energy consumption. Tables 16 and 17 indicate that the cubic model provides the best fit for CPU on the AMD host and for memory on both hosts. The piece-wise model provides the best values for the CPU experiments on the Intel host (values close to 1 mean a better fit). According to the obtained results, the aforementioned cubic and piece-wise models were considered for the simulations. Equation 2 presents the PC model built by applying the linear combination of CPU, memory, and disk workloads for the AMD host, based on the best independent models. Since the independent models correspond to the EC modeling, the host IC is added in the PC models. In turn, Equation 3 presents the PC model built by applying the linear combination for the Intel host for CPU, memory, and disk workloads.
The ancillary variable associated with α is zero when α < 50 and equals α − 50 otherwise. In both equations, α is the percentage of CPU-bound workload, β is the percentage of memory-bound workload, and γ is the percentage of disk-bound workload. The linear combination models assume that CPU-bound, memory-bound, and disk-bound jobs arrive at the host (to be processed) with equal probability (0.33).
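The structure of Equations 2 and 3 can be sketched as follows; all coefficients below are illustrative placeholders (the fitted values are in the published notebook). Only the piece-wise shape of the CPU term, the concave memory term, and the constant IC offset follow the text:

```python
def pc_model(alpha, beta, gamma, ic=183.4):
    """PC = IC + EC_cpu(alpha) + EC_mem(beta) + EC_disk(gamma), where alpha,
    beta, gamma are percentages (0..100) of CPU-, memory- and disk-bound load.
    Coefficients are illustrative placeholders, not the fitted ones."""
    alpha_anc = max(alpha - 50, 0)                # ancillary piece-wise variable
    ec_cpu = 0.80 * alpha + 0.10 * alpha_anc      # piece-wise linear CPU term
    ec_mem = 1.30 * beta - 0.004 * beta ** 2      # concave (decelerating) memory term
    ec_disk = 0.02 * gamma                        # near-flat disk term
    return ic + ec_cpu + ec_mem + ec_disk

pc_model(0, 0, 0)  # idle host: 183.4 W
```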
The proposed energy models were implemented in the version of CloudSim developed by CICESE. CloudSim handles the following main entities: hosts, which correspond to the physical hosts; virtual machines (VMs), which are the processing units assigned to the hosts; and jobs, which correspond to the workload. Jobs are assigned to the VMs to be processed.
Two Java classes were modified in CloudSim: i) PowerModelJobType, located in package cicese.cloudbus.cloudsim.power.models, which is used by CloudSim for scheduling tasks; and ii) PowerModelJobType, located in package cicese.cloudbus.cloudsim.util.power, which is applied for computing the total power consumption in a post-processing stage, using the simulation results. In each Java class, the method getEnergy(double α, double β) was extended to include the main features of each energy model developed in this research. This method returns the current power consumption of a VM that executes a task. The method getEnergy(double α, double β) was overridden to include the empirical coefficients from the linear interpolation and the corresponding value of IC for each host.

Schedulers evaluation
This section describes the simulations performed for comparing different scheduling strategies regarding energy efficiency. The proposed model is used for calculating the power consumption for each scheduler.

Simulation details
Two different power consumption models were considered, modeling multicore hosts with AMD and Intel architectures (eight cores each), which correspond to versions of the linear combination models presented in Equations 2 and 3. Since the simulation only considered two types of workload (CPU-bound and memory-bound), the term of variable γ (disk-bound workload) was eliminated from the equations.
Thirty different workloads were considered in the evaluation for each architecture; thus, a total of 60 simulations were performed (without considering the independent executions of the stochastic algorithms). In order to focus the simulations on the energy consumption study, some simplifications were considered: i) the physical hosts within a same simulation are identical; ii) only one VM is created per physical host, which can execute as many tasks as cores available in the host; and iii) each job has only one task, so the terms job and task both denote the atomic workload unit to schedule. These simplifications are realistic in the context of scientific computing infrastructures, for example Cluster FING [31].
During a simulation, N independent jobs arrive at different times (considering realistic distributions; see Subsection 5.2.2 for details) and the scheduler must deliver each job to one VM from the M active VMs. Each VM can execute up to eight jobs at the same time and a job requires only one core to execute (1/8 of the VM capacity). When all active VMs are full, a new VM is turned on and is considered active from that moment. It is assumed that a VM always has enough memory, bandwidth, and disk to execute any job.
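Under these assumptions, the allocation loop reduces to the following sketch (class and function names are ours, for illustration):

```python
class VM:
    CORES = 8
    def __init__(self):
        self.used = 0            # cores currently busy
    def is_capable(self):
        return self.used < VM.CORES
    def assign(self):
        self.used += 1           # each job takes exactly one core

def dispatch(active_vms, pick):
    """On-line dispatch: pick() selects among the capable active VMs; a new VM
    (i.e., a newly powered-on host) is added only when all active VMs are full."""
    capable = [vm for vm in active_vms if vm.is_capable()]
    if not capable:
        vm = VM()
        active_vms.append(vm)
    else:
        vm = pick(capable)
    vm.assign()

# First Fit as the picker: nine one-core jobs fill one VM and start a second.
vms = []
for _ in range(9):
    dispatch(vms, pick=lambda capable: capable[0])
```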

Workloads
The workloads used for the experimental evaluation are based on traces of real applications taken from the HPC Parallel Workloads Archive, executed on parallel supercomputers, clusters, and grids. The workloads include traces from DAS2-University of Amsterdam, DAS2-Delft University of Technology, DAS2-Utrecht University, DAS2-Leiden University, Royal Institute of Technology (KTH), DAS2-Vrije University Amsterdam, High Performance Computing Center North (HPC2N), Cornell Theory Center (CTC), and Los Alamos National Laboratory (LANL). A detailed study of the workloads was presented by [45]. The information included in these workloads allows evaluating allocation strategies over real scenarios and applications.
Workloads are specified in a format that extends the Standard Workload Format (swf). The extended format, introduced by [11], adds two new fields: codec utilization and job type. Fields of the extended swf format used in this work include:
• Job id : is the identifier of the job.
• Submitted time: is the arrival time of the job, in seconds. A job cannot begin execution before its submitted time.
• Job length: is a measure of the resources needed to complete the job, expressed in MIPS. When instantiated on a specific host, the duration of the job is given by the quotient of the job length and the processing power of the core (i.e., a job with a length of 800 MIPS executing on a core of 100 MIPS will execute for eight seconds).
• Codec utilization: represents the percentage of a VM defined over the host that is requested to be used by a job. In the context of this research, codec represents the number of cores required by a job.
• Job type: this field allows specifying the type of the job (CPU-bound, memory-bound, disk-bound, or other relevant type). In the experiments performed in this article, type 0 represents a CPU-bound job and type 1 represents a memory-bound job.
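A minimal parser for the five fields listed above could look as follows (the column order and the ';' comment convention are assumptions for illustration; real swf traces carry additional fields):

```python
from typing import NamedTuple, List

class Job(NamedTuple):
    job_id: int
    submit_time: int     # seconds; execution cannot start earlier
    length_mips: float   # job length in MIPS
    codec_util: int      # number of cores requested
    job_type: int        # 0 = CPU-bound, 1 = memory-bound

def parse_extended_swf(lines) -> List[Job]:
    jobs = []
    for line in lines:
        if not line.strip() or line.startswith(";"):  # skip blanks and comments
            continue
        f = line.split()
        jobs.append(Job(int(f[0]), int(f[1]), float(f[2]), int(f[3]), int(f[4])))
    return jobs

# Duration on a core of `mips` MIPS is length_mips / mips, e.g. 800 / 100 = 8 s.
```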

Scheduling heuristics
The scheduling strategies were compared in experiments performed considering the two power consumption models over the 30 different workloads studied (W01, W02, . . ., W30). Each workload accounts for one week of operation of a real HPC platform, as described in Subsection 5.2.2. The computing platform simulated in the experiments is composed of 150 multicore hosts with eight cores each. An on-line scheduling approach is applied. On-line scheduling is a model in which scheduling decisions are taken immediately after a job arrives or is released [46]. Although they are based on local and often sub-optimal decisions, on-line schedulers provide some advantages over static scheduling methods, including a more realistic model of the user-system interaction and the need for less information to perform the resource assignment. In the experiments, the destination VM is chosen by the cloud scheduler by applying a specific strategy when a new job arrives to the system. Figure 18 illustrates the scheduling approach used in the simulations. The proposed scheduling strategies hold a list of active VMs (the active list), i.e., VMs that are currently executing jobs. Adding a VM to the active list corresponds to turning on a physical host. A VM is capable of executing a job if the number of free processors of the VM is greater than (or equal to) the number of processors that the job requires. Strategies are implemented using well-known heuristics for scheduling problems. The proposed heuristics are oriented to fulfill the following goals: minimize the host utilization, maximize the host utilization, balance the utilization, and minimize the power consumption of each host.
The studied heuristics include: • Random (RD). The purpose of the heuristic is to assign the jobs randomly. The destination VM for an arriving job is selected randomly from the list of active VMs, applying a uniform distribution.
• Round Robin (RR). The purpose of this heuristic is to distribute the jobs equitably among the VMs, in a rational order [38,47]. The destination VM for an arriving job is selected in circular order. VMs are considered ordered by VM identifier and, when the last active VM is assigned, the VM with identifier 0 is assigned again. Algorithm 4 explains the Round Robin strategy, where vm_idx is the index of the VM where the last job was assigned and is updated as vm_idx = (vm_idx + 1) mod size(active_vm_list), repeating until the job is assigned or the search returns to the first index.
• First Fit (FF) assigns jobs to the first possible VM [48]. The destination VM for an arriving job is the first VM in the active list with enough free resources (in this work, cores) as requested by the arriving job. The active list is considered to be ordered by VM identifier, ascending.

Algorithm 4 Round Robin Strategy
• Minimum Energy (mE) assigns jobs to the possible VM that consumes the least power, according to the current assignment. The destination VM for an arriving job is the VM whose corresponding host has the lowest energy consumption. Algorithm 5 shows the mE strategy.
• Minimum Utilization (mU) assigns jobs to the possible VM that has the most resources available (in this work, available cores). The destination VM for an arriving job is the VM of the active list with the lowest percentage of utilization.
• Maximum Utilization (MU) assigns jobs to the possible VM that has the fewest resources available (in this work, available cores). The destination VM for an arriving job is the VM of the active list with the highest percentage of utilization.

Algorithm 5 Minimum Energy Strategy
1: for each vm in active_vm_list do
2:   if is_capable(vm, arrived_job) then
3:     if is_not_assigned(arrived_job) then
4:       assign(vm, arrived_job)
5:     else
6:       if current_power_of_host(vm) < current_power_of_host(assigned_vm(arrived_job)) then
7:         assign(vm, arrived_job)
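The selection rules of the heuristics above can be summarized as picker functions over the capable VMs (the utilization and host-power accessors are assumptions; names are ours):

```python
import random

def first_fit(capable):                  # FF: first capable VM by identifier
    return capable[0]

def min_utilization(capable, util):      # mU: VM with the most free cores
    return min(capable, key=util)

def max_utilization(capable, util):      # MU: VM with the fewest free cores
    return max(capable, key=util)

def min_energy(capable, host_power):     # mE: VM whose host draws the least power
    return min(capable, key=host_power)

def random_pick(capable):                # RD: uniform random choice
    return random.choice(capable)
```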

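The selection rules above can be sketched in Python. This is an illustrative sketch, not the simulator used in the paper: names such as VM, fits, and the select_* functions are hypothetical.

```python
import random

class VM:
    """Minimal VM abstraction (hypothetical): total cores and cores in use."""
    def __init__(self, vm_id, cores):
        self.vm_id = vm_id
        self.cores = cores
        self.used = 0

    def fits(self, job_cores):
        return self.cores - self.used >= job_cores

    def utilization(self):
        return self.used / self.cores

def select_rd(vms, job_cores):
    # Random (RD): uniform choice among active VMs with enough free cores.
    candidates = [vm for vm in vms if vm.fits(job_cores)]
    return random.choice(candidates) if candidates else None

def select_rr(vms, job_cores, last_idx):
    # Round Robin (RR): circular scan starting after the last assigned VM.
    # Returns the index so the caller can remember it for the next job.
    n = len(vms)
    for step in range(1, n + 1):
        idx = (last_idx + step) % n
        if vms[idx].fits(job_cores):
            return idx
    return None

def select_ff(vms, job_cores):
    # First Fit (FF): first VM, ordered by ascending identifier, that fits.
    for vm in sorted(vms, key=lambda v: v.vm_id):
        if vm.fits(job_cores):
            return vm
    return None

def select_util(vms, job_cores, maximize=False):
    # mU / MU: feasible VM with the lowest (mU) or highest (MU) utilization.
    candidates = [vm for vm in vms if vm.fits(job_cores)]
    if not candidates:
        return None
    pick = max if maximize else min
    return pick(candidates, key=lambda v: v.utilization())
```

The mE rule follows the same pattern as select_util, replacing the utilization key with the current power of the VM's host.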
Simulation Results
Tables 18 and 19 report the power consumption values for each scheduling strategy using the AMD and Intel energy models, respectively, in W/s (×10^8). All scheduling strategies are deterministic except RD; for RD, the mean and standard deviation of power consumption over 20 independent simulations per workload are reported. Results in Tables 18 and 19 indicate that the strategies that maximize host utilization (i.e., MU and FF) achieved the best energy efficiency. On the AMD host, FF achieves the best results in 6 scenarios, MU achieves the best result in 18 scenarios, and both strategies tie in 6 scenarios. On the Intel host, FF achieves the best results in 5 scenarios, MU achieves the best result in 15 scenarios, and both strategies tie in 10 scenarios.
Conversely, the lowest efficiency is achieved when utilization is minimized (mU). Strategy mE, which is oriented to minimizing energy consumption, also achieved poor results. Since both the utilization model and the energy model are linear, minimizing energy corresponds to minimizing utilization, which explains the results for mE. Strategies that balance utilization (RR, RD) achieved intermediate results with respect to the other reported strategies. A relevant analysis regarding energy efficiency is the comparison of the best schedulers against strategies that apply a Business-as-Usual approach, i.e., when energy efficiency is not considered in the job-to-resources allocation. RR is one of the most widely used scheduling strategies, and it is included as the default scheduler in popular resource managers such as Maui and Condor [49], which are widely used on multiple computational platforms such as clusters, data centers, and distributed platforms.
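The equivalence between mE and mU under a linear model can be made concrete with a minimal sketch. The coefficients below are illustrative, not the fitted values from the paper: since power grows monotonically with utilization, the host with the lowest current power is always the host with the lowest utilization.

```python
def host_power(utilization, p_idle=100.0, p_max=200.0):
    """Linear power model (illustrative coefficients): power in watts
    grows linearly from p_idle at 0% utilization to p_max at 100%."""
    return p_idle + (p_max - p_idle) * utilization

# Monotonicity: ranking hosts by power is identical to ranking them by
# utilization, so a minimum-energy choice equals a minimum-utilization choice.
```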
The percent improvements over the RR strategy (Δ_RR) of each proposed scheduling strategy are also reported in Tables 18 and 19. Results indicate that FF improves over RR by up to 19.83% on AMD and 15.66% on Intel. In turn, MU improves over RR by up to 18.53% on AMD and up to 15.06% on Intel. Considering average results, FF improves over RR by 10.63% on AMD and 8.07% on Intel, while MU improves over RR by 11.15% on AMD and 8.05% on Intel.
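The improvement metric can be stated precisely: for a strategy with total consumption E_s and a Round Robin consumption E_RR, the percent improvement is Δ_RR = 100 · (E_RR − E_s) / E_RR. A minimal sketch (the numbers in the example are illustrative, not values from the tables):

```python
def delta_rr(e_strategy, e_rr):
    """Percent improvement of a strategy over Round Robin.
    Positive values mean the strategy consumed less energy than RR."""
    return 100.0 * (e_rr - e_strategy) / e_rr

# e.g., a strategy consuming 90 units against 100 for RR improves by 10%.
```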
For all workloads studied, the AMD host consumed more energy than the Intel host. Moreover, if a performance model is added (with the characteristics studied in subsection 4.4), the difference is expected to be even greater. The improvements of each scheduling strategy over RR are graphically presented in Figure 19 for the AMD host and in Figure 20 for the Intel host.
Overall, the reported results indicate that energy models built from power characterization and empirical data are important tools for the operation and management of HPC and data center infrastructures. Energy models allow evaluating energy-aware scheduling strategies with the main goal of reducing power consumption in large computing platforms.

This work presented an empirical analysis of the energy consumption of synthetic benchmarks in high-end multi-core servers. In addition, power consumption models were constructed and used in simulations to evaluate several energy-aware scheduling strategies. In the last decade, data centers have become a fundamental part of computing technology, since they are the backbone of the cloud, which is widely used by industrial and scientific applications worldwide. Furthermore, data centers are large consumers of energy, which poses a great challenge for managing energy consumption consciously, in order to reduce environmental impact and monetary costs. The first step towards achieving energy efficiency in data centers is to know the energy consumption of their different components, in particular the physical hosts. In this work, an exhaustive study of the inter-relationship between the power consumption of the main computing resources at different ULs was carried out over AMD and Intel architectures. Furthermore, all experiments were performed applying the reproducible research paradigm, which is not often found in this research area; all data and processing tools are publicly available for verification and extension by researchers. The experimental methodology consisted of executing synthetic benchmarks on high-end hosts connected to a PDU, considering different ULs and combinations of benchmarks, with the goal of characterizing the power consumption of each computing resource (CPU, memory, and disk).
The operations performed by the benchmarks include mathematical functions and reads/writes of main memory and disk. The study was complemented with performance experiments. A total of 144 experiments were performed: 96 evaluating power consumption and 48 evaluating performance, with 20 independent executions each. Experimental results showed that, in single executions, CPU utilization has a linear relation with power consumption. Memory utilization has a significant impact on power consumption compared to CPU: up to 157% more EC on AMD and 46% more EC on Intel. On the other hand, disk utilization presented low EC variation across all ULs. Combined executions are able to reduce EC with respect to independent executions when CPU and disk run together, taking advantage of the optimum UL of each computing resource. The results of two combined executions showed a remarkable energy reduction on the Intel host (32% of the EC of single executions). The efficiency analysis showed that different benchmarks performed most efficiently at different ULs: CPU at high ULs, memory at medium ULs, and disk at low ULs. The critical UL (100%) showed worse efficiency than the high-medium UL (87.5%), except for disk.
Statistical tools were applied to build a realistic power consumption model for multicores. The models provide a good fit to the empirical data according to relevant metrics for statistical models, and they are useful tools to assess the capabilities of scheduling strategies regarding energy efficiency. To evaluate energy-aware scheduling strategies, a total of 60 different simulations were performed, considering linear energy models for the AMD and Intel architectures and 30 realistic workloads based on traces from the HPC Parallel Workloads Archive, executed on real computing infrastructures. The simulation results showed that strategies which maximize host utilization achieve better results, notably improving over traditional Business-as-Usual schedulers.