When it comes to choosing hard drives, there are several measures of performance that may influence your decision.
An important parameter for overall drive performance is revolutions per minute (abbreviated rpm, RPM, r/min, or r·min−1); this is the speed at which the disk platters spin. Higher rpm means data passes the read/write heads of the drive faster, so in general, the higher the rpm of the drive, the faster data is accessed or stored on the platter. Drive vendors normally state the rpm of a drive as part of the name. HP SATA drives currently spin at 5400 or 7200 rpm, while HP SAS drives come in 7200, 10,000, or 15,000 rpm speeds.
IOPS (input/output operations per second) is a common set of benchmarks for hard disks, measuring the number of operations per second for each of these access patterns:
- Sequential read
- Sequential write
- Random read
- Random write
Seek time is the time it takes to move the read/write head to the right place on the platter. Related to seek time is rotational delay or latency: the time required for the addressed area of the disk to rotate to the position where the read/write head can access it. The transfer time is the time it takes to actually read or write data from/to the platter. All three of these, plus spin-up time (the time needed to accelerate the disk to operating speed), determine the disk access time of a drive.
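As a rough illustration, average rotational latency is the time for half a revolution at the drive's spin speed. The sketch below computes it for two common spin speeds; the seek and transfer times are illustrative assumptions, not vendor specifications:

```python
# Sketch: estimating average rotational latency and disk access time.
# Seek and transfer times below are illustrative assumptions, not vendor specs.

def avg_rotational_latency_ms(rpm):
    """Average latency = time for half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm

for rpm, seek_ms, transfer_ms in [(7200, 8.5, 0.5), (15000, 3.5, 0.5)]:
    latency = avg_rotational_latency_ms(rpm)
    access = seek_ms + latency + transfer_ms
    print(f"{rpm:>6} rpm: avg latency {latency:.2f} ms, access ~{access:.2f} ms")
```

Note how the jump from 7200 to 15,000 rpm roughly halves the rotational latency, which is one reason high-rpm drives feel faster for random access.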
One other common term used when talking about drive performance is data transfer rate. This is often measured in bits per second and refers to the average rate at which data is moved from the drive to the storage controller. Often when people talk about data transfer rate, they are actually talking about system bandwidth or throughput, which is the rate at which data travels from the drive to other components of the server/system. The throughput of the system depends on the data transfer rates of the drive, the controller, and the system BIOS and chipset - any one of these can be a bottleneck.
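The bottleneck idea can be sketched as taking the minimum of the component rates along the data path. The rates below are made-up illustrative numbers, not measurements of any real hardware:

```python
# Sketch: end-to-end throughput is limited by the slowest component in the path.
# The rates below are made-up illustrative values, not real measurements.

def system_throughput(component_rates_mbps):
    """Effective throughput is bounded by the slowest link in the path."""
    return min(component_rates_mbps.values())

rates = {"drive": 120, "controller": 300, "chipset": 250}  # MB/s, assumed
bottleneck = min(rates, key=rates.get)
print(f"throughput ~{system_throughput(rates)} MB/s, limited by the {bottleneck}")
```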
So, is rpm the most important measure of drive performance for your application/environment, or do you normally pay more attention to the other measures?
One of the complexities in benchmarking applications is defining the server configuration adequately. The purpose of a benchmark is to provide guidance - to demonstrate how to obtain a given performance from a specific application workload. This guidance is not useful if the performance cannot be reproduced. To make benchmark results reproducible, it is necessary to define the server configuration, specifying everything that can affect performance. From a hardware point of view, this is not difficult - specify the model numbers of every component in each server, then specify the model numbers of the storage and networking components. Some of this information is available on-line, such as processor model numbers. Other information, such as the model or speed of memory DIMMs, is usually not accessible on-line, but this data is important. For example, some current x86 servers have an option of 667MHz or 800MHz DIMMs, and this choice can affect application performance considerably.
Identifying the software components can be difficult, since you need to know which components affect the performance of the workload.
And the most obscure configuration area is firmware - in some cases, versions of firmware have a big impact on workload performance. It is rarely necessary to document the firmware version of the server, but it is a good idea to document firmware versions of networking components.
Next, it is important to know how the quantities of specific components affect performance. Performance varies with the number of disks internal to the servers, the number of controllers connecting the server to external disks, the number and topology of network switches, etc.
One important variable is the number of memory DIMMs. The number of DIMMs affects performance in two ways: the total amount of memory on the server, and the memory performance. It is useful to run the workload using the maximum number of DIMMs, then repeat the benchmark using half as many DIMMs. Memory is expensive, and it is very useful to know how workload performance varies with memory configuration.
Given that it is important to measure power usage and correlate it to application performance, how do you measure the power?
We use 2 different methods - one for rack-mounted servers and another for blade servers. The rack-mounted servers do not provide power meters, so we bought a power meter. We plug the server into the power meter, so we are measuring the total power used. Then, with a simple PC interface, we allow the application user on the server to obtain continuous power data which is easy to correlate with the applications.
This is easy for the users, but it requires planning, logistics, and some work by our system managers to connect the meter to the right server at the right time.
We often want to measure the power of a cluster running one HPC application in parallel, and it is usually sufficient to measure the power of any one server in the cluster running the application.
It is easier to measure power on an HP blade enclosure, since the enclosure contains power measurement capability and provides this data in a usable way. The available data includes the total enclosure power and also the power used by each blade server and each fan in the enclosure. We integrated this information with the Platform Computing LSF job scheduler. Now, users of our blade servers submit their jobs via LSF and automatically receive their power usage data as part of the job.
Next week, I expect to post a message from the SC08 conference.
Until a couple of years ago, when we referred to performance measurement of an application, we meant the amount of time that it took the job to run vs. the specific resources it used - number of cores, number of servers if you are using a cluster, the specific characteristics of the server cores and memory and other server specs, plus IO/storage resources and specs.
Basically, we only measured one thing: the elapsed time of the job. Then, using the resources and specs, we computed lots of things - throughput efficiency, parallel scalability and efficiency, performance per core or per server, IO metrics, and so on.
Now, we make an additional measurement - power utilization, which we correlate in time with the execution of an application. We want to know the average power used during the execution of a single job, and we also look at the variation of power during a job, and the maximum power used.
Of course, lots of people measure power used by computers. But, since most of these people are system managers or system designers, they don't have a reason to correlate power with specific applications and compute jobs. They want to know the average and peak power used to run their overall workload, so they can plan for current and future power requirements. This is important work, but it does not give them the ability to optimize their workload.
If you measure the average power used during the execution of one compute job, and you multiply that power by the elapsed time of the job, you have Application Energy - the electrical energy used to run that specific job. This is a very convenient quantity, since it gives you a single number that relates power usage to compute jobs. You can use Application Energy to optimize your workload, just as you use elapsed time.
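The definition above is simple to compute: average power during the job multiplied by the job's elapsed time. A minimal sketch, using hypothetical power and time values:

```python
# Sketch: Application Energy = average power during the job x elapsed time.
# The power and time values below are hypothetical, for illustration only.

def application_energy_wh(avg_power_watts, elapsed_seconds):
    """Electrical energy (watt-hours) used to run one compute job."""
    return avg_power_watts * elapsed_seconds / 3600.0

# A job drawing 400 W on average for 2 hours uses 800 Wh:
energy = application_energy_wh(400, 2 * 3600)
print(f"Application Energy: {energy:.0f} Wh")
```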
A couple of examples:
1. You can measure application energy for a given set of applications on two or more different server models, and then select the more energy-efficient model. You can use this App Energy comparison together with elapsed time comparison, and then make speed vs. energy tradeoffs. If a job runs 30% faster but consumes 50% more Application Energy on Server A than it does on Server B, which is a better choice for your requirements?
2. We are also using Application Energy to determine the most efficient way to run applications which run in parallel on a cluster of servers - a common way to run HPC codes. For one common HPC application, we ran the same job at 3 levels of parallelization and compared the elapsed times and Application Energy. We showed that the job used only 4% more Application Energy running 32-way-parallel (on 32 cores) vs. 16-way-parallel. But the job used 20% more Application Energy running 64-way-parallel vs. 32-way-parallel. In other words, there is very little energy cost in using 32 cores and returning the results to the user much faster vs. using 16 cores. But there is a substantial energy cost to use 64 cores, which returns the results even faster.
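These tradeoffs can be expressed numerically. The sketch below plugs in the relative factors quoted above (30% faster, 50% more energy, and so on); the absolute baseline times and energies are hypothetical, and the 2x speedup assumed for doubling the core count is an illustrative assumption, not a measured result:

```python
# Sketch: speed vs. energy tradeoffs between configurations.
# Baseline values below are hypothetical; the relative factors (30% faster,
# 50% more energy, 4% more energy) are the figures quoted in the text.

def compare(name_a, time_a, energy_a, name_b, time_b, energy_b):
    """Report how much faster A is, and how much more energy it uses, vs. B."""
    speedup = time_b / time_a
    energy_pct = (energy_a / energy_b - 1) * 100
    print(f"{name_a} vs {name_b}: {speedup:.2f}x faster, "
          f"{energy_pct:+.0f}% Application Energy")
    return speedup, energy_pct

# Example 1: Server A runs 30% faster but uses 50% more energy than Server B.
compare("Server A", 100 / 1.3, 1.5, "Server B", 100, 1.0)
# Example 2: 32-way vs 16-way -- assume 2x faster, with 4% more energy.
compare("32-way", 50, 1.04, "16-way", 100, 1.00)
```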
Does anyone find this interesting, or agree (or disagree) with this approach?
In some situations, it is useful to not use some of the cores on a server. Since most processors do not have sufficient memory BW to support a memory BW-intensive code running on all cores, such codes do not "scale" perfectly. There are 2 common ways to define scaling - serial-job throughput workload (multiple serial jobs), and a single parallel code workload.
If scaling is perfect, then an 8-core server can run 8 copies of a serial job in the same time as one serial job.
For a highly scalable application: if scaling is perfect, then an 8-core server runs an 8-way-parallel job 8 times faster than a serial job.
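The two definitions above can be sketched as simple ratios; the timings below are hypothetical, chosen to show the perfect-scaling case:

```python
# Sketch: the two common definitions of scaling, with hypothetical timings.

def throughput_scaling(one_job_time, n_jobs_time):
    """Serial-job throughput: perfect = 1.0 when n copies take no longer than one."""
    return one_job_time / n_jobs_time

def parallel_speedup(serial_time, parallel_time):
    """Single parallel job: perfect speedup on n cores = n."""
    return serial_time / parallel_time

# Perfect throughput scaling: 8 copies finish in the same time as 1 copy.
print(throughput_scaling(100.0, 100.0))   # 1.0
# Perfect parallel scaling: an 8-way-parallel job runs 8x faster than serial.
print(parallel_speedup(100.0, 12.5))      # 8.0
```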
Most HPC jobs cannot scale perfectly, so this is an issue. But in most cases, the server can run more total work (jobs per day) using all of its cores than it can if some cores are unused. So why would we consider leaving some cores idle? The primary reason is the cost of running licensed applications. Many HPC applications are licensed on a per-core basis, although the cost may not be linear with the number of cores. It is useful to compare the per-core job performance to the per-core license cost to determine the best performance-to-cost operating point.
Given that it may be useful to use a subset of the cores, doing so correctly is difficult. You need to know the architecture of your server. Here is an example of an HP ProLiant server containing two Intel Xeon Harpertown quad-core processors.
-Each processor has a separate connection to the memory system.
-Each processor has 4 cores. Each pair of cores shares a data cache. The 4 cores share the processor's memory BW.
If you draw a picture of this, you will see that not all combinations of cores are equal in terms of cache size and memory BW resources.
Let's say we want to run a workload consisting of one parallel job, using only 4 cores in the server. The best performance is obtained using 2 cores on each processor, selecting cores which do not share a cache (so that each core has full use of an entire shared cache). The next-best performance uses 2 cores per processor, selecting cores which share a cache. The 3rd-best performance uses 3 cores on one processor and one core on the other processor. The worst performance is obtained using 4 cores on one processor.
If we want to run a single parallel job on only 2 cores in the server, there are three possible choices. The best performance uses 1 core per processor. The next-best performance uses 2 cores on one processor, selecting cores which do not share a cache. Third best uses 2 cores on one processor, selecting cores which share a cache.
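On Linux, placement choices like these can be made by pinning processes to specific cores, for example with `os.sched_setaffinity`. The sketch below assumes a hypothetical layout; the mapping of logical core IDs to sockets and shared caches varies by system, and should be checked (for example with `lscpu`) before choosing a set:

```python
# Sketch: pinning the current process to a chosen set of cores on Linux.
# The core numbering below is an assumption -- the mapping of logical core
# IDs to sockets and shared caches varies by system (check with lscpu).
import os

# Hypothetical layout: cores 0-3 on socket 0, cores 4-7 on socket 1,
# with cache-sharing pairs (0,1), (2,3), (4,5), (6,7).
BEST_4 = {0, 2, 4, 6}    # 2 cores per socket, no shared caches: best case
WORST_4 = {0, 1, 2, 3}   # all 4 cores on one socket: worst case

avail = os.sched_getaffinity(0)        # cores this process is allowed to use
chosen = (BEST_4 & avail) or avail     # fall back if those cores are unavailable
os.sched_setaffinity(0, chosen)        # pid 0 = the current process
print("running on cores:", sorted(os.sched_getaffinity(0)))
```

Job schedulers and MPI launchers typically offer equivalent binding options, which is usually the more practical way to control placement for a parallel job.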
How big is the performance difference based on these choices? The answer depends on the application, the specific input data, and other factors. But here is an example, for a moderately parallel HPC application. Using one of its standard performance benchmarks, the code runs 4 times faster using all cores (8-way-parallel) than using 1 core (serial).
Running 4-way-parallel using the above choices of 4 cores, the performance varies from 3.4 (best case) to 2.9 (next-best case) to 2.4 (third-best case) to 2.3 (worst case) times faster than serial.
Running 2-way-parallel using the above choices of 2 cores, the performance is either 2.0, 1.8, or 1.5 times faster than serial.
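Using the speedup figures quoted above, the per-core efficiency of each placement works out as follows:

```python
# Parallel efficiency (speedup / cores) for the placements described above,
# using the speedup figures quoted in the text.
speedups = {
    "8-way, all cores":       (8, 4.0),
    "4-way, best placement":  (4, 3.4),
    "4-way, next-best":       (4, 2.9),
    "4-way, third-best":      (4, 2.4),
    "4-way, worst placement": (4, 2.3),
    "2-way, best placement":  (2, 2.0),
    "2-way, next-best":       (2, 1.8),
    "2-way, third-best":      (2, 1.5),
}
for name, (cores, speedup) in speedups.items():
    print(f"{name:<24} efficiency {speedup / cores:.0%}")
```

The spread from 85% down to 58% efficiency at 4 cores shows how much performance hinges on which cores are selected, not just how many.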
Clearly we can lose a lot of performance if we do not select the cores carefully!
This analysis would be different for different processors or different server configurations.