PERFORMANCE ANALYSIS OF INTEL CORE 2 DUO PROCESSOR

A Thesis
Submitted to the Graduate Faculty of the
Louisiana State University and
Agricultural and Mechanical College
in partial fulfillment of the
requirements for the degree of
Master of Science in Electrical Engineering
in
The Department of Electrical and Computer Engineering

By
Tribuvan Kumar Prakash
Bachelor of Engineering in Electronics and Communication Engineering,
Visveswaraiah Technological University, Karnataka, 2004
August 2007

Acknowledgements

I would like to express my gratitude to my advisor, Dr. Lu Peng, for his guidance and constant motivation towards the completion of this thesis. His technical advice and suggestions helped me overcome hurdles, kept me enthusiastic, and made this work a wonderful learning experience. I would like to thank my committee members, Dr. David Koppelman and Dr. Suresh Rai, for taking time out of their busy schedules to be part of my committee and for their valuable feedback.

I would like to thank the faculty members, and Shirley and Tonya of the Department of Electrical Engineering, for all their support and for making my study at Louisiana State University a pleasant experience. I would like to thank my parents and sister, without whom I would not have made it to this point. I would like to thank my friends Srinath Sitaraman and Balachandran Ramadas for their help while collecting data. I would also like to thank my roommates and friends here at LSU and back home for all the love and unending support.

Table of Contents

List of Tables
List of Figures
Abstract
1. Introduction
   1.1 Overview
   1.2 Architecture of Intel Core 2 Duo
2. Performance Analysis of SPEC CPU Benchmarks Running on Intel's Core 2 Duo Processor
   2.1 Overview
   2.2 Methodology
   2.3 Measurement Results
      2.3.1 IPC and Instruction Profile
      2.3.2 L1 D-Cache Misses
      2.3.3 L2 Cache Misses
      2.3.4 Branch Misprediction
3. Performance Comparison of Dual Core Processors Using Microbenchmarks
   3.1 Overview
   3.2 Architecture of Dual-Core Processors
      3.2.1 Intel Pentium D 830
      3.2.2 AMD Athlon 64X2
      3.2.3 Processor Comparison
   3.3 Methodology
   3.4 Memory Bandwidth and Latency Measurements
4. Performance Comparison of Dual Core Processors Using Multiprogrammed and Multithreaded Benchmarks
   4.1 Overview
   4.2 Methodology
   4.3 Multiprogrammed Workload Measurements
   4.4 Multithreaded Program Behavior
5. Related Work
6. Conclusion
References
Vita

List of Tables

Table 1.1 Specification of Intel Core 2 Duo machine
Table 2.1 SPEC CPU2006 Integer Benchmarks
Table 2.2 SPEC CPU2006 Floating Point Benchmarks
Table 3.1 Specifications of the selected processors
Table 3.2 Memory operations from lmbench
Table 3.3 Kernel operations of the STREAM and STREAM2 benchmarks
Table 4.1 Input parameters of the selected multithreaded workloads

List of Figures

Figure 1-1 Block Diagram of Intel Core 2 Duo Processor
Figure 1-2 Block Diagram of Intel Core Micro-architecture's IP Prefetcher
Figure 2-1 IPC of SPEC Benchmarks
Figure 2-2 Instruction Profile of SPEC Benchmarks
Figure 2-3 L1-D Cache Misses per 1000 instructions of SPEC Benchmarks
Figure 2-4 Sample Code of MCF Benchmark
Figure 2-5 L2 Cache Misses per 1000 instructions of SPEC Benchmarks
Figure 2-6 Sample Code of LBM Benchmark
Figure 2-7 Branch Mispredicted per 1000 Instructions of SPEC Benchmarks
Figure 3-1 Block Diagram of Pentium D Processor
Figure 3-2 Block Diagram of AMD Athlon 64X2 Processor
Figure 3-3 Memory bandwidth collected from the lmbench suite (1 or 2 copies)
Figure 3-4 Memory load latency collected from the lmbench suite (1 or 2 copies)
Figure 3-5 Memory bandwidth and latency collected from the STREAM and STREAM2 benchmarks (1 or 2 copies)
Figure 4-1 SPEC CPU2000 and CPU2006 benchmarks execution time
Figure 4-2 Multi-programmed speedup of mixed SPEC CPU2000/2006 benchmarks
Figure 4-3 (a) Execution time for 1-thread version of selected multithreaded programs
Figure 4-4 Throughput of SPECjbb2005 running with 1 to 8 warehouses

Abstract

With the emergence of thread level parallelism as a more efficient method of improving processor performance, Chip Multiprocessor (CMP) technology is being more widely used in developing processor architectures.

Also, the widening gap between CPU and memory speed has drawn the interest of researchers in understanding the performance of memory hierarchy architectures. As part of this research, performance characteristic studies were carried out on the Intel Core 2 Duo, a dual-core power-efficient processor, using a variety of new-generation benchmarks. This study provides a detailed analysis of the memory hierarchy performance and the performance scalability between single and dual core processors. The behavior of SPEC CPU2006 benchmarks running on the Intel Core 2 Duo processor is also explained. Lastly, the overall execution time and throughput measured using both multi-programmed and multi-threaded workloads on the Intel Core 2 Duo processor are reported and compared to those of the Intel Pentium D and AMD Athlon 64X2 processors. Results showed that the Intel Core 2 Duo had the best performance for a variety of workloads due to its advanced micro-architectural features such as the shared L2 cache, fast cache-to-cache communication and smart memory access.

1. Introduction

1.1 Overview

This thesis analyzes the performance characteristics of the major architectural features employed in the Intel Core 2 Duo E6400 processor running at 2.13 GHz [15].

The Intel Core 2 Duo is a high-performance and power-efficient dual-core Chip-Multiprocessor (CMP). A CMP embeds multiple processor cores into a single die to exploit thread-level parallelism and achieve a higher overall chip-level Instructions-Per-Cycle (IPC) [4] [14] [15] [21]. In a multi-core, multithreaded processor chip, thread-level parallelism combined with increased clock frequency exerts a higher demand for on-chip and off-chip memory bandwidth, causing longer average memory access delays. Researchers have therefore shown great interest in understanding the underlying causes of these bottlenecks.

The advances in circuit integration technology and the inevitability of thread-level parallelism over instruction-level parallelism for performance efficiency have made Chip-Multiprocessor (CMP), or multi-core, technology the mainstream in CPU design. As processor architectures have evolved over time, the benchmarks used to measure the performance of these high-performance processors have also continued to evolve. Many single-threaded and multi-threaded benchmarks have been defined and developed to stress the processor units to their maximum limits.

The Standard Performance Evaluation Corporation (SPEC) is a non-profit organization that has been developing benchmarks to meet the requirements of these evolving processor architectures for nearly a decade. SPEC CPU2006 is a single-threaded, compute-intensive benchmark suite developed by SPEC using the C, C++ and FORTRAN programming languages. To understand the performance of multi-core processors completely, it is equally important to understand their behavior while running multi-threaded applications. SPECjbb2005, BioPerf and SPLASH-2 are among the most popularly used multithreaded benchmarks for this purpose, while lmbench provides micro-benchmarks for memory measurements.

This thesis focuses mainly on the workload characteristics, memory system behavior and multi-thread interaction of the benchmarks. It also reports performance measurements on the Intel Core 2 Duo E6400 at 2.13 GHz [15] and compares the results with the Intel Pentium D 830 at 3.0 GHz [19] and the AMD Athlon 64X2 4400+ at 2.2 GHz [2]. In contrast to existing performance evaluations [13] [26] [27] that usually provide overall execution time and throughput, this work emphasizes memory hierarchy performance.

It reports the measured memory access latency and bandwidth as well as cache-to-cache communication delays. It also examines the performance scalability between single and dual cores on the three tested processors. Summarized below are a few interesting findings based on the experiments conducted as part of this research. SPEC CPU2006 running on the Core 2 Duo exerts less pressure on the L1 cache than the SPEC CPU2000 benchmarks; however, the CPU2006 benchmarks have larger data sets and longer execution times, resulting in comparatively higher stress on the L2 cache.

The cache-to-cache latency of the Core 2 Duo was measured to be 33ns. The Core 2 Duo has high memory bandwidth and low latency as a result of on-chip access to the other core's L1 cache and the presence of aggressive memory dependence predictors, and its shared L2 generates less off-chip traffic than the other two processors. Due to its shared L2 cache, the execution times of all single-threaded workloads on the Core 2 Duo are fast, ranging from 56 to 1500 seconds. The average multi-programmed speedups for the CPU2006 and CPU2000 benchmarks were measured at 1.76 and 1.7 respectively, which is lower than the ideal speedup of 2. The Core 2 Duo's speedups are constrained because a single program running alone can use the entire shared L2 cache, which raises the single-core baseline.

1.2 Architecture of Intel Core 2 Duo

The Intel Core 2 Duo E6400 (Figure 1.1) processor supports CMP and belongs to Intel's mobile core family. It is implemented by placing two Intel Core micro-architecture cores on a single die. The design of the Intel Core 2 Duo E6400 is chosen to maximize performance and minimize power consumption [18]. It emphasizes cache efficiency rather than high clock frequency in order to achieve high power efficiency.

Although it is clocked at a lower rate than most of its competitors, its shorter pipeline and wider issue width compensate with a higher IPC. In addition, the Core 2 Duo processor has more ALU units [13]. The five main features of the Intel Core 2 Duo contributing to its high performance are:
• Intel's Wide Dynamic Execution
• Intel's Advanced Digital Media Boost
• Intel's Intelligent Power Capability
• Intel's Advanced Smart Cache
• Intel's Smart Memory Access

The Core 2 Duo employs Intel's Advanced Smart Cache, a shared L2 cache that increases the effective on-chip cache capacity. Upon a miss in a core's L1 cache, the shared L2 and the L1 of the other core are looked up in parallel before the request is sent to memory [18]. A cache block located in the other core's L1 cache can therefore be fetched without off-chip traffic. Both the memory controller and the FSB are still located off-chip. The off-chip memory controller can adapt to new DRAM technology at the cost of longer memory access latency. The Intel Advanced Smart Cache provides a peak transfer rate of 96 GB/sec (at a 3 GHz frequency) [17].

Figure 1-1 Block Diagram of Intel Core 2 Duo Processor

The Core 2 Duo employs aggressive memory dependence predictors for memory disambiguation.

A load instruction is allowed to execute before an earlier store instruction with an unknown address. The processor also implements macro-fusion, which combines certain pairs of x86 instructions into a single micro-operation. Another important technique to alleviate the cache miss penalty is data prefetching. According to the hardware specifications, the Intel Core 2 Duo includes a stride prefetcher on its L1 data cache [17] and a next-line prefetcher on its L2 cache [13]. The Intel Core micro-architecture includes, in each processing core, two prefetchers for the Level 1 data cache and the traditional prefetcher for the Level 1 instruction cache.

In addition, it includes two prefetchers associated with the Level 2 cache and shared between the cores. In total, there are eight prefetchers per dual-core processor [17]. The L2 prefetcher is triggered after detecting two consecutive line requests. The stride prefetcher on the L1 cache is also known as the Instruction Pointer-based (IP) prefetcher to the Level 1 data cache (Figure 1.2). The IP prefetcher builds a history for each load using the load's instruction pointer and keeps it in an IP history array.

The address of the next load is predicted using a constant stride calculated from the entries in the history array [17]. Each history array entry consists of the following fields:
• 12 untranslated bits of the last demand address
• 13 bits of last stride data (12 bits of positive or negative stride with the 13th bit as the sign)
• 2 bits for a history state machine
• 6 bits of the last prefetched address, used to avoid redundant prefetch requests

Figure 1-2 Block Diagram of Intel Core Micro-architecture's IP Prefetcher

The IP prefetcher then generates a prefetch request to the L1 cache for the predicted address.

The prefetch request enters a FIFO and waits for its turn. When the request is processed, a lookup for that line is performed in the L1 cache and the fill buffer unit. If the prefetch hits either the L1 cache or the fill buffer, the request is dropped; otherwise a read request for the corresponding line is sent to the L2 cache.
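To make the mechanism more concrete, the following C sketch models one IP-history entry and the stride-prediction step described above. It is an illustrative reconstruction based on the description in [17], not Intel's actual hardware; the field widths follow the list above, and names such as ip_entry and predict_next are invented for the example.

    #include <stdint.h>

    typedef struct {
        uint16_t last_addr;    /* 12 untranslated bits of the last demand address */
        int16_t  last_stride;  /* 13-bit stride: 12-bit magnitude plus sign       */
        uint8_t  state;        /* 2-bit history state machine (not modeled here)  */
        uint8_t  last_pref;    /* 6 bits of the last prefetched address           */
    } ip_entry;

    /* On a new demand load from a given instruction pointer, update its entry
       and return a predicted prefetch address, or 0 if no prefetch is issued. */
    static uint32_t predict_next(ip_entry *e, uint32_t demand_addr)
    {
        uint16_t addr12 = demand_addr & 0xFFF;              /* keep 12 bits    */
        int16_t  stride = (int16_t)(addr12 - e->last_addr);
        uint32_t predicted = 0;

        if (stride != 0 && stride == e->last_stride) {       /* constant stride */
            predicted = demand_addr + stride;
            if ((predicted & 0x3F) == e->last_pref)          /* already issued  */
                predicted = 0;
            else
                e->last_pref = predicted & 0x3F;
        }
        e->last_stride = stride;
        e->last_addr   = addr12;
        return predicted;
    }

In the real prefetcher the returned address would be pushed into the FIFO described above and checked against the L1 cache and fill buffers before being forwarded to the L2 cache.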

Other important features include support for new SIMD instructions, called Supplemental Streaming SIMD Extensions 3 (SSSE3), coupled with better power-saving technologies. Table 1.1 specifies the configuration of the Intel Core 2 Duo machine used for the experiments. It has separate 32 KB L1 instruction and data caches per core. A 2MB L2 cache is shared by the two cores. Both L1 and L2 caches are 8-way set associative and have 64-byte lines.

Table 1.1 Specification of Intel Core 2 Duo machine
CPU: Intel Core 2 Duo E6400 (2 x 2.13 GHz)
Technology: 65nm
Transistors: 291 million
Hyperthreading: No
L1 Cache: Code and Data: 32 KB x 2, 8-way, 64-byte cache line size, write-back
L2 Cache: 2MB shared cache (2MB x 1), 8-way, 64-byte line size, non-inclusive with L1 cache
Memory: 2GB (1GB x 2) DDR2 533MHz
FSB: 1066MHz data rate, 64-bit
FSB bandwidth: 8.5 GB/s
HD Interface: SATA 375MB/s

The remainder of this work is organized as follows. Chapter 2 analyzes the SPEC CPU2006 benchmarks using a variety of performance results obtained from the Intel(R) VTune(TM) Performance Analyzer 8.0.1 and compares them with the SPEC CPU2000 benchmarks. Chapter 3 compares the memory latency and hierarchy of three dual-core processors using micro-benchmarks. Chapter 4 discusses the performance measurement results for the three dual-core processors using single-threaded, multi-programmed and multithreaded workloads.

Chapter 5 describes related work. Finally, Chapter 6 presents a brief conclusion.

2. Performance Analysis of SPEC CPU Benchmarks Running on Intel's Core 2 Duo Processor

2.1 Overview

With the evolution of processor architectures over time, the benchmarks that were used to measure the performance of these processors are no longer as useful as they once were, because they cannot stress the new architectures to their maximum capacity in terms of clock cycles, cache, main memory and I/O bandwidth. Hence new and improved benchmarks need to be developed and used. SPEC CPU2006 is one such benchmark suite; it contains intensive workloads based on real applications and is the successor of the SPEC CPU2000 suite. This chapter presents a detailed analysis of the SPEC CPU2006 benchmarks running on the Core 2 Duo processor discussed earlier and emphasizes their workload characteristics and memory system behavior. The CPU2006 and CPU2000 benchmarks are also compared with respect to performance bottlenecks by using the VTune performance analyzer over the entire program execution.

2.2 Methodology

SPEC CPU2006 has 29 benchmarks: 12 integer and 17 floating point programs. For our experiments, all the integer programs and a subset of 10 floating point programs were considered. The details of these benchmark programs are shown in Tables 2.1 and 2.2. All experiments were run on a system with the 32-bit Windows XP SP2 operating system and the Intel Core 2 Duo processor described in Chapter 1. The Intel(R) VTune(TM) Performance Analyzer 8.0.1 was used to analyze all benchmarks over their complete run time [20]. At a given time, the analyzer can measure only a certain number of events, depending on the configuration; hence, several complete runs were made to measure all the events. All benchmarks were compiled using Microsoft Visual C/C++ 2005 and the Intel(R) FORTRAN Compiler 9.1. We used the fastest-speed compilation flags, i.e., for the Microsoft VC++ compiler we set "-O2".

Table 2.1 SPEC CPU2006 Integer Benchmarks
Astar (C++): Path-Finding Algorithm
Bzip2 (C): Compression
Gcc (C): C Compiler
Gobmk (C): Artificial Intelligence: go
H264ref (C): Video Compression
Hmmer (C): Search Gene Sequence
Libquantum (C): Physics: Quantum Computing
Mcf (C): Combinatorial Optimization
Omnetpp (C++): Discrete Event Simulation
Perlbench (C): PERL Programming Language
Sjeng (C): Artificial Intelligence: Chess
Xalancbmk (C++): XML Processing

Table 2.2 SPEC CPU2006 Floating Point Benchmarks
Bwaves (Fortran): Fluid Dynamics
Gamess (Fortran): Quantum Chemistry
Milc (C): Physics: Quantum Chromodynamics
Gromacs (C/Fortran): Biochemistry/Molecular Dynamics
CactusADM (C/Fortran): Physics/General Relativity
Leslie3d (Fortran): Fluid Dynamics
Soplex (C++): Linear Programming, Optimization
GemsFDTD (Fortran): Computational Electromagnetics
Lbm (C): Fluid Dynamics
Sphinx3 (C): Speech Recognition

2.3 Measurement Results

2.3.1 IPC and Instruction Profile

Figure 2.1(a) and Figure 2.1(b) show the IPC of the CPU2006 and CPU2000 benchmarks respectively. The average IPCs for the CPU2006 and CPU2000 benchmarks were measured at 1.06 and 0.85 respectively. From the figures it can be observed that mcf, omnetpp and lbm have low IPC among the CPU2006 benchmarks, while mcf, art and swim have low IPC among the CPU2000 benchmarks.
It is interesting to understand the causes of the performance bottlenecks in these benchmarks, and to do so their instruction profiles were analyzed. Figure 2.2(a) and Figure 2.2(b) show the instruction profiles of CPU2006 and CPU2000 respectively. It is evident from the figures that a very high percentage of the retired instructions are loads and stores.

CPU2006 benchmarks like h264ref, hmmer, bwaves, leslie3d and GemsFDTD have a comparatively high percentage of loads, while astar, bzip2, gcc, gobmk, libquantum, mcf, omnetpp, perlbench, sjeng, xalancbmk and gamess have a high percentage of branch instructions. In contrast, CPU2000 benchmarks like gap, parser, vortex, applu, equake, fma3d, mgrid and swim have a comparatively high percentage of loads, while almost all of the integer programs have a high percentage of branch instructions.

Figure 2-1 (a) IPC of SPEC CPU2006 Benchmarks; (b) IPC of SPEC CPU2000 Benchmarks

However, a higher percentage of retired load and store instructions, or a higher percentage of branches, does not by itself indicate a performance bottleneck. For example, h264ref and perlbench have a high percentage of load, store and branch instructions, but they also have comparatively high IPC. Similarly, among the CPU2000 benchmarks, crafty, parser and perl have a high percentage of load, store and branch instructions and still have good IPC.

To get a better understanding of the bottlenecks of these benchmarks, L1 cache misses, L2 cache misses and mispredicted branch instructions were measured and analyzed. The higher the measured rates, the greater the bottleneck produced by the respective benchmark.

Figure 2-2 (a) Instruction Profile of SPEC CPU2006 Benchmarks; (b) Instruction Profile of SPEC CPU2000 Benchmarks

2.3.2 L1 D-Cache Misses

Figure 2.3(a) and 2.3(b) show the L1 cache misses per 1000 instructions of the CPU2006 and CPU2000 benchmarks. The results show that there is no significant improvement in CPU2006 over CPU2000 with respect to stressing the L1 cache. The average L1-D cache misses per 1000 instructions for the CPU2006 and CPU2000 benchmark sets under consideration were found to be approximately 24 and 27.8 respectively. The mcf benchmark has the highest L1 cache misses per 1000 instructions in both the CPU2000 and CPU2006 suites, which is one of the significant reasons for its low IPC.

Figure 2-3 (a) L1-D Cache Misses per 1000 instructions of SPEC CPU2006 Benchmarks; (b) L1-D Cache Misses per 1000 instructions of SPEC CPU2000 Benchmarks

Mcf is a memory-intensive integer benchmark written in C. Code analysis using the Intel(R) VTune(TM) Performance Analyzer 8.0.1 shows that the key functions responsible for stressing the various processor units are primal_bea_mpp and refresh_potential. Primal_bea_mpp (72.6%) and refresh_potential (12.8%) together are responsible for 85% of the overall L1 data cache miss events. A code sample of the primal_bea_mpp function is shown in Figure 2.4. The function traverses an array of pointers (of type arc_t) to a set of structures; for each structure traversed, it optimizes the routines used for massive communication. In the code under consideration, the pointer chasing in line 6 is responsible for more than 50% of the overall L1-D cache misses for the whole program.
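Since Figure 2-4 is reproduced as an image, the sketch below shows, in hedged form, the kind of pointer-chasing loop the mcf source performs in primal_bea_mpp: a strided walk over an array of arc structures in which each visited arc dereferences its tail and head node pointers. Types and names are simplified from the SPEC 429.mcf source and should be treated as illustrative rather than a faithful copy.

    /* Simplified sketch of the mcf pointer-chasing pattern. */
    typedef struct node { long potential; /* ... */ } node_t;
    typedef struct arc  { long cost; node_t *tail, *head; long ident; /* ... */ } arc_t;

    long scan_arcs(arc_t *arcs, arc_t *stop, long nr_group, long basic)
    {
        long candidates = 0;
        for (arc_t *arc = arcs; arc < stop; arc += nr_group) {   /* strided walk  */
            if (arc->ident > basic) {
                /* Two dependent pointer dereferences per arc: the tail and head
                   nodes live far apart in memory, so most of these loads miss
                   in the L1 data cache. */
                long red_cost = arc->cost
                              - arc->tail->potential
                              + arc->head->potential;
                if (red_cost < 0)
                    candidates++;
            }
        }
        return candidates;
    }

Because the next arc and the referenced nodes rarely share a cache line, neither the stride prefetcher nor the L2 next-line prefetcher can hide these misses, which matches the measured behavior.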

A similar result for mcf in CPU2000 was also found in previous work [11]. Apart from mcf, lbm has a comparatively significant L1 cache miss rate in CPU2006, and mcf, art and swim have comparatively significant L1 cache miss rates in CPU2000.

Figure 2-4 Sample Code of MCF Benchmark

2.3.3 L2 Cache Misses

Figure 2.5(a) and 2.5(b) show the L2 cache misses per 1000 instructions of the CPU2006 and CPU2000 SPEC benchmarks respectively. The average L2 cache misses per 1000 instructions for the CPU2006 and CPU2000 benchmarks under consideration were found to be 4.4 and approximately 2 respectively. Lbm has the highest L2 cache misses, which accounts for its low IPC. Lbm (Lattice Boltzmann Method) is a floating point benchmark written in C. It is used in the field of fluid dynamics to simulate the behavior of fluids in 3D. Lbm accesses memory in two steps: i) a streaming step, in which values are derived from neighboring cells, and ii) a linear memory access to read the cell values (collide-stream) and write the values back to the cells (stream-collide) [9].

Figure 2-5 (a) L2 Cache Misses per 1000 instructions of SPEC CPU2006 Benchmarks; (b) L2 Cache Misses per 1000 instructions of SPEC CPU2000 Benchmarks

Code analysis reveals that the LBM_performStreamCollide function, used to write the values to the cells, is responsible for 99.8% of the overall L2 cache miss events. A code sample of this function is shown in Figure 2.6. A macro, TEST_FLAG_SWEEP, is responsible for 21% of the overall L2 cache misses; its definition is shown in Figure 2.6(b). The pointer *MAGIC_CAST dynamically accesses over 400MB of data, which is much larger than the available L2 cache size (2MB), resulting in very high L2 cache misses. Hence it can be concluded that lbm has a very large data footprint, which results in high stress on the L2 cache.
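Figure 2-6 is likewise an image, so the sketch below gives a much-simplified, hedged picture of the access pattern behind LBM_performStreamCollide: every cell update first reads a flag from the large grid (analogous in spirit to TEST_FLAG_SWEEP) and then streams values to neighboring cells, so the sweep touches a working set far larger than the 2MB L2. The constants, macro and arithmetic are illustrative only; the real 470.lbm kernel updates 19 distribution values per cell and uses different neighbor offsets.

    #define QS      20                  /* values stored per lattice cell         */
    #define NCELLS  (100 * 100 * 130)   /* ~208MB per grid of doubles; two grids
                                           together exceed 400MB, as noted above  */

    /* Flag test analogous to TEST_FLAG_SWEEP: a dependent read from the grid
       for every cell visited (flags stored as 0.0 / 1.0 in the last slot).   */
    #define IS_OBSTACLE(grid, i) ((grid)[(i) + QS - 1] != 0.0)

    void stream_collide(const double *src, double *dst)
    {
        /* Skip the first and last cell so the neighbour writes stay in bounds. */
        for (long i = QS; i < (long)(NCELLS - 1) * QS; i += QS) {
            if (IS_OBSTACLE(src, i))
                continue;
            for (int q = 0; q < QS - 1; q++) {
                double v  = src[i + q];                     /* read local value   */
                long   nb = (q % 2) ? (i + QS) : (i - QS);  /* a neighbouring cell */
                dst[nb + q] = 0.98 * v;                     /* placeholder update  */
            }
        }
    }

Even in this toy form, each outer iteration touches several cache lines in two multi-hundred-megabyte arrays, so the L2 cache is continuously refilled during the sweep.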

For mcf, primal_bea_mpp (33.4%) and refresh_potential (20.2%) are the two major functions responsible for L2 cache misses; intensive pointer chasing is again the cause.

Figure 2-6 Sample Code of LBM Benchmark

2.3.4 Branch Misprediction

Figure 2.7(a) and 2.7(b) show the branches mispredicted per 1000 instructions for the CPU2006 and CPU2000 SPEC benchmarks. The CPU2006 benchmarks have comparatively higher branch misprediction than the CPU2000 benchmarks, and almost all of the floating point benchmarks under consideration have comparatively negligible branch misprediction.

The average branches mispredicted per 1000 instructions for the CPU2006 and CPU2000 integer benchmarks were measured as 4.2 and 4.0 respectively, and the corresponding averages for the CPU2006 and CPU2000 floating point benchmarks were measured as 0.4 and 0.08 respectively. We also measured L1 DTLB misses for SPEC CPU2006. Only a few programs have L1 DTLB miss rates equal to or larger than 1%: astar (1%), mcf (6%), omnetpp (1%) and cactusADM (2%). Some programs have a very small L1 DTLB miss rate; for example, the miss rates for hmmer and gromacs are about 3x10^-5 and 6.2x10^-5 respectively. Other interesting results include hmmer and h264ref, which have a very high percentage of loads and stores but negligible L1 and L2 cache misses per 1000 instructions. This is likely because hmmer and h264ref exhibit high locality in their data sets, which favors the hardware prefetcher.

Figure 2-7 (a) Branch Mispredicted per 1000 Instructions of SPEC CPU2006 Benchmarks; (b) Branch Mispredicted per 1000 Instructions of SPEC CPU2000 Benchmarks

Thus, from the results analyzed so far, we can conclude that the CPU2006 benchmarks have larger data sets and require longer execution times than their predecessor CPU2000 benchmarks.

3. Performance Comparison of Dual Core Processors Using Microbenchmarks

3.1 Overview

In this chapter, performance measurement results of three dual-core desktop processors are analyzed and compared: the Intel Core 2 Duo E6400 at 2.13 GHz [15], the Intel Pentium D 830 at 3.0 GHz [19] and the AMD Athlon 64X2 4400+ at 2.2 GHz [2]. The results in this chapter emphasize mainly the memory hierarchy and the cache-to-cache communication delays of the three processors under consideration. There are several key design choices in the memory subsystems of the three processors.

All three have private L1 caches, with different sizes. At the next level, the Intel Core 2 Duo processor adopts a shared L2 cache design, called Intel Advanced Smart Cache, for the dual cores [17]. The shared L2 approach provides a larger effective cache capacity by eliminating data replication. It also permits natural sharing of cache space among the cores: when only one core is active, the entire shared L2 can be allocated to that single active core. However, the downsides of the shared L2 cache are a longer hit latency and possible contention for the shared cache resources.

Both the Intel Pentium D and the AMD Athlon 64X2 have a private L2 cache for each core, enabling fast L2 accesses but preventing any capacity sharing between the two cores. The shared L2 cache in the Core 2 Duo eliminates on-chip L2-level cache coherence; furthermore, coherence of the two cores' L1 caches is resolved internally within the chip, allowing fast access to the L1 cache of the other core. The Pentium D uses an off-chip Front-Side Bus (FSB) for inter-core communication. The Pentium D is basically a technology remap of the Pentium 4 Symmetric Multiprocessor (SMP) and requires accessing the FSB to maintain cache coherence.

The AMD Athlon 64X2 uses Hyper-Transport interconnect technology for faster inter-chip communication. Given the additional ownership state in the Athlon 64X2, cache coherence between the two cores can be maintained without off-chip traffic. In addition, the Athlon 64X2 has an on-die memory controller to reduce memory access latency. To examine memory bandwidth and latency, we used lmbench [33], a suite of memory measurement benchmarks. Lmbench attempts to measure the most commonly found performance bottlenecks in a wide range of system applications.

These bottlenecks can be identified, isolated, and reproduced in a set of small micro-benchmarks, which measure system latency and bandwidth of data movement among the processor, memory, network, file system, and disk. In addition, we also ran STREAM [24] and STREAM2 [25], recreated using lmbench's timing harness. They are kernel benchmarks measuring memory bandwidth and latency during several common vector operations, such as vector addition and vector copy. We also used a small lockless program [29] to measure the cache-to-cache latency of the three processors.

The lockless program records the duration of a ping-pong procedure in which a small token bounces between two caches, yielding the average cache-to-cache latency.

3.2 Architecture of Dual-Core Processors

3.2.1 Intel Pentium D 830

The Pentium D 830 (Figure 3.1) glues two Pentium 4 cores together and connects them with the memory controller through the north-bridge. The off-chip memory controller provides flexibility to support the newest DRAM at the cost of longer memory access latency. The MESI coherence protocol from the Pentium SMP is adopted in the Pentium D; it requires a memory update in order to change a modified block to shared. The system interconnect for the processors remains the Front-Side Bus (FSB). To accommodate the memory update, the FSB is located off-chip, which increases the latency of maintaining cache coherence. The Pentium D's hardware prefetcher allows stride-based prefetches beyond adjacent lines; in addition, it attempts to trigger multiple prefetches to stay 256 bytes ahead of the current data access locations [16]. The advanced prefetching in the Pentium D enables more overlapping of cache misses.

Figure 3-1 Block Diagram of Pentium D Processor

3.2.2 AMD Athlon 64X2

The Athlon 64X2 (Figure 3.2) is designed specifically for multiple cores on a single chip. Similar to the Pentium D processor, it employs private L2 caches. However, both L2 caches share a system request queue, which connects to an on-die memory controller and a Hyper-Transport link. The Hyper-Transport removes system bottlenecks by reducing the number of buses required in a system; it provides significantly more bandwidth than current PCI technology [3]. The system request queue serves as an internal interconnect between the two cores without involving an external bus. The Athlon 64X2 processor employs the MOESI protocol, which adds an "Ownership" state to enable blocks to be shared by both cores without the need to keep the memory copy updated. The Athlon 64X2 has a next-line hardware prefetcher; however, accessing data in increments larger than 64 bytes may fail to trigger it [5].

Figure 3-2 Block Diagram of AMD Athlon 64X2 Processor

3.2.3 Processor Comparison

Table 3.1 lists the specifications of the three processors examined in this work. There are no Hyper-Threading settings on any of these processors. The Intel Core 2 Duo E6400 has separate 32 KB L1 instruction and data caches per core. A 2MB L2 cache is shared by the two cores.

Both L1 and L2 caches are 8-way set associative and have 64-byte lines. The Pentium D processor has a trace cache which stores 12 Kuops; it is also equipped with a write-through, 8-way, 16KB L1 data cache and a private 8-way 1MB L2 cache per core. The Athlon 64X2 processor's L1 data and instruction caches are 2-way 64KB, with a private 16-way 1MB L2 cache for each core. The Athlon 64X2's L1 and L2 caches in each core are exclusive. All three machines have the same total L2 cache size and memory size. The Core 2 Duo and the Pentium D are equipped with DDR2 DRAM using advanced memory controllers in their chipsets. The Athlon 64X2 has an on-die DDR memory controller.

All three machines have 2GB of memory. The FSB of the Core 2 Duo is clocked at 1066MHz with bandwidth up to 8.5GB/s. The FSB of the Pentium D operates at 800MHz and provides up to 6.4GB/s of bandwidth. The Athlon 64X2 has a 2GHz I/O Hyper-Transport with bandwidth up to 8GB/s. The hard drive interface bandwidths of the three machines are 375MB/s, 150MB/s and 300MB/s respectively. Because our experiments are all in-memory benchmarks, differences in hard drives should have little impact.

Table 3.1 Specifications of the selected processors (Intel Core 2 Duo E6400; Intel Pentium D 830; AMD Athlon 64X2 4400+)
CPU: Core 2 Duo E6400 (2 x 2.13GHz); Pentium D 830 (2 x 3.00GHz); Athlon 64X2 4400+ (2 x 2.2GHz)
Technology: 65nm; 90nm; 90nm
Transistors: 291 million; 230 million; 230 million
Hyperthreading: No; No; No
L1 Cache: Code and Data: 32KB x 2, 8-way, 64-byte line, write-back; Trace cache: 12Kuops x 2, Data: 16KB x 2, 8-way, 64-byte line, write-through; Code and Data: 64KB x 2, 2-way, 64-byte line, write-back
L2 Cache: 2MB shared (2MB x 1), 8-way, 64-byte line, non-inclusive with L1; 2MB private (1MB x 2), 8-way, 64-byte line, inclusive with L1; 2MB private (1MB x 2), 16-way, 64-byte line, exclusive with L1
Memory: 2GB (1GB x 2) DDR2 533MHz; 2GB (512MB x 4) DDR2 533MHz; 2GB (1GB x 2) DDR 400MHz
FSB: 1066MHz data rate, 64-bit; 800MHz data rate, 64-bit; HyperTransport, 16-bit up/down, 2GHz data rate (up+down)
FSB bandwidth: 8.5GB/s; 6.4GB/s; 8GB/s
HD Interface: SATA 375MB/s; SATA 150MB/s; SATA 300MB/s

3.3 Methodology

We installed SUSE Linux 10.1 with kernel 2.6.16-smp on all three machines. We used the maximum level of GCC optimization to compile the C/C++ benchmarks of lmbench and the lockless program. We used the lmbench suite running on the three machines to measure the bandwidth and latency of the memory hierarchy. Lmbench attempts to measure performance bottlenecks in a wide range of system applications. These bottlenecks have been identified, isolated, and reproduced in a set of small micro-benchmarks, which measure system latency and bandwidth of data movement among the processor, memory, network, file system, and disk.

Table 3.2 Memory operations from lmbench
Libc bcopy unaligned: measures how fast the processor can copy data blocks using the system call bcopy() when data segments are not aligned with pages
Libc bcopy aligned: measures how fast the processor can copy data blocks using the system call bcopy() when data segments are aligned with pages
Memory bzero: measures how fast the processor can reset memory blocks using the system call bzero()
Unrolled bcopy unaligned: measures how fast the system can copy data blocks without using bcopy(), when data segments are not aligned with pages
Memory read: measures the time to read every 4-byte word from memory
Memory write: measures the time to write every 4-byte word to memory

In our experiments, we focus on the memory subsystem and measure memory bandwidth and latency with various operations [33]. Table 3.2 lists the operations used to test memory bandwidth and their meanings. We ran variable-stride accesses to obtain the average memory read latency. In addition, we ran multiple copies of lmbench, one on each core, to test the memory hierarchy. We also ran STREAM [24] and STREAM2 [25], recreated using lmbench's timing harness. They are simple vector kernel benchmarks measuring memory bandwidth; each suite has four common vector operations, as listed in Table 3.3. Average memory latencies for these operations are also reported.

Table 3.3 Kernel operations of the STREAM and STREAM2 benchmarks
STREAM copy: c[i] = a[i]
STREAM scale: b[i] = scalar * c[i]
STREAM add: c[i] = a[i] + b[i]
STREAM triad: a[i] = b[i] + scalar * c[i]
STREAM2 fill: a[i] = q
STREAM2 copy: a[i] = b[i]
STREAM2 daxpy: a[i] = a[i] + q * b[i]
STREAM2 sum: sum = sum + a[i]

We measured the cache-to-cache latency using a small lockless program [29]. It does not employ expensive read-modify-write atomic instructions; instead, it maintains a lockless counter for each thread.

The C code of each thread is as follows:

    *pPong = 0;
    for (i = 0; i < NITER; ++i) {
        while (*pPing < i);   /* spin until the peer's counter advances */
        *pPong = i + 1;       /* then publish our own update            */
    }

Each thread increments its own counter pPong and keeps reading the peer's counter by checking pPing. The counter pPong is placed in a different cache line from the counter pPing. A counter pPong can be incremented only after the thread observes the update of the peer's counter. This generates heavy read-write sharing between the two cores and produces a ping-pong procedure between the two caches. The average cache-to-cache latency is measured by repeating this procedure.

3.4 Memory Bandwidth and Latency Measurements

We used the lockless program described in Section 3.3 to measure the dual-core cache-to-cache latency. The average cache-to-cache latencies of the Core 2 Duo, Pentium D, and Athlon 64X2 are 33ns, 133ns and 68ns respectively. The Core 2 Duo resolves L1 cache coherence within the chip and enables the fastest cache-to-cache transfer. The Pentium D requires the external FSB for cache-to-cache transfers. The Athlon 64X2's on-chip system request interface and the MOESI protocol permit fast cache-to-cache communication.
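For reference, the sketch below shows one way the ping-pong loop above could be wired into a complete two-thread measurement. It is a hedged reconstruction, not the lockless program of [29]: pthreads, clock_gettime, the global counters and the division by two line transfers per iteration are all assumptions made for illustration.

    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define NITER 1000000
    /* Keep the two counters on separate cache lines, as the lockless program does.
       Plain volatile loads/stores are used instead of atomics for simplicity. */
    static volatile long ping __attribute__((aligned(64)));
    static volatile long pong __attribute__((aligned(64)));

    static void *ponger(void *arg)
    {
        (void)arg;
        pong = 0;
        for (long i = 0; i < NITER; ++i) {
            while (ping < i) ;        /* wait for the peer's update */
            pong = i + 1;             /* publish our own update     */
        }
        return NULL;
    }

    int main(void)
    {
        struct timespec t0, t1;
        pthread_t t;

        ping = 0;
        pthread_create(&t, NULL, ponger, NULL);

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < NITER; ++i) {
            while (pong < i + 1) ;    /* wait for the ponger        */
            ping = i + 1;             /* hand the token back        */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        pthread_join(t, NULL);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Each iteration bounces roughly two cache lines between the cores. */
        printf("avg cache-to-cache latency ~ %.1f ns\n", ns / (2.0 * NITER));
        return 0;
    }

Compiled with gcc -O2 -pthread, a sketch like this reproduces the qualitative ordering reported above: the shared-L2 Core 2 Duo is fastest, the FSB-based Pentium D slowest.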

We ran the bandwidth and latency test programs from the lmbench suite. Figure 3.3 shows the memory bandwidth for several lmbench operations. Figures 3.3(a), 3.3(c) and 3.3(e) present data collected while running one copy of lmbench on each of the three machines. Several observations can be made: (1) In general, the Core 2 Duo and Athlon 64X2 have better bandwidth than the Pentium D. The only exception is that the Pentium D shows the best memory read bandwidth when the array size is less than 1MB. The shared cache of the Core 2 Duo has a longer access latency, though it provides a larger effective capacity.

For the Athlon 64X2, because the equipped DRAM has lower bandwidth, its memory read bandwidth is lower than that of the Pentium D when the memory bus is not saturated. The memory read bandwidth of the three machines drops when the array size exceeds 32KB, 16KB and 64KB respectively; these points reflect the sizes of their L1 caches. When the array size exceeds 2MB, 1MB and 1MB for the respective systems, another drop can be seen, reflecting their L2 cache sizes. (2) The memory bzero operation shows different behavior: when the array size is larger than the L1 data cache size, i.e. 32KB for the Core 2 Duo and 64KB for the Athlon 64X2, the memory bandwidth drops sharply. This is not true for the Pentium D. The L1 caches of the Core 2 Duo and Athlon 64X2 employ a write-back policy, while the L1 cache of the Pentium D uses a write-through policy. When the array size is smaller than the L1 data cache size, the write-back policy updates the L2 cache less frequently than the write-through policy, leading to higher bandwidth. However, when the array size is larger than the L1 data cache size, the write-back policy no longer has any advantage, as indicated by the sharp decline in bandwidth.

Figure 3-3 Memory bandwidth collected from the lmbench suite (1 or 2 copies)

(3) For the Athlon 64X2, libc bcopy unaligned and libc bcopy aligned show a big difference, while alignment does not make much difference for the Core 2 Duo and Pentium D. 'Aligned' here means that the memory segments are aligned to the page boundary; the bcopy operation can be optimized if the segments are page aligned.

When the array size is larger than its L2 cache size 2MB, Athlon 64X2 provides almost doubled bandwidth for two-copy lmbench memory read operation compared with its one-copy counterpart. Athlon 64X2 benefits from its on-die memory controller and separate I/O Hyper-Transport. Intel Core 2 Duo and Pentium D processors suffer FSB bandwidth saturation when the array size exceeds the L2 capacity. We tested memory load latency for multiple sizes of stride access and random access for all the three machines. Figure 3. 4(a), 3. 4 (c) and 3. 4 (e) depict the memory load latency lines of the three machines running with one copy of lmbench.

Several observations can be made: (1) For Core 2 Duo, latencies for all configurations jump after the array size is larger than 2 MB while for Pentium D and Athlon 64X2 latencies for all 26 the configurations jump after the array size is larger than 1MB. This relates to the L2 cache sizes of the measured machines. (2) As described in Section 2, when hardware Intel Core 2 Duo-Memory Load Latency-1 copy 150 stride-16 stride-64 120 stride-256 stride-512 stride-32 stride-128 120 150 stride-16 stride-64 stride-256 stride-32 stride-128 stride-512 Intel Core 2 Duo-M ory Load Latency-2 copies em Latency (ns) stride-1024 90

Latency (ns) 90 stride-1024 60 60 30 30 3. 5 128 224 128 224 0. 01 0. 02 0. 04 0. 08 0. 14 0. 25 0. 47 0. 88 1. 63 112 208 384 704 1280 Array Size (MB) Array Size (MB) (a) Intel Pentium D-Memory Load Latency-1 copy 150 150 (b) Intel Pentium D-Memory Load Latency-2 copies stride-16 stride-64 120 stride-256 stride-1024 90 stride-32 stride-128 stride-512 stride-16 stride-64 stride-32 stride-128 stride-512 120 stride-256 stride-1024 Latency (ns) 90 Latency (ns) 60 60 30 30 0 0 0. 01 0. 02 0. 03 0. 04 0. 06 0. 09 0. 14 0. 22 0. 34 0. 5 0. 81 1. 25 1. 88 3 4. 5 1. 25 2 0. 01 0. 02 0. 03 0. 05 0. 09 0. 16 0. 25 0. 44 . 75 5. 5 0 3. 5 0 2 0. 01 0. 02 0. 03 0. 05 0. 09 0. 16 0. 25 0. 44 0. 75 1. 25 6 10 16 28 48 80 128 224 384 640 1024 Array Size (MB) Array Size (MB) (c) AMD-Memory Load Latency-1 copy 150 stride-16 stride-64 120 stride-256 stride-1024 90 stride-32 stride-128 stride-512 120 150 stride-16 stride-64 stride-256 (d) AMD-Memory Load Latency-2 copies stride-32 stride-128 stride-512 Latency (ns) Latency (ns) stride-1024 90 60 60 30 30 0 2 0. 01 0. 02 0. 03 0. 05 0. 09 0. 16 0. 25 0. 44 0. 75 1. 25 6 10 16 28 48 80 1024 10 16 28 48 80 128 224 384 0. 01 0. 02 0. 03 0. 05 0. 09 0. 16 0. 25 0. 44 0. 75 1. 25 640 3. 5

Array Size (MB) Array Size (MB) (e) (f) Figure 3-4 Memory load latency collected from the lmbench suite (1 or 2 copies) 27 384 3. 5 0 0 2 6 0 7 11 16 26 40 60 96 144 224 352 384 6 10 16 28 48 10 18 32 60 80 0 3 0 0 0 prefetchers on all machines work, the memory bus bottleneck will not be reflected. When the stride size is equal to 128 bytes, Pentium D still benefits partially from its hardware prefetcher but the L2 prefetchers of Core 2 Duo and Athlon 64X2 is not triggered. This leads to better performance for Pentium D. (3) When the stride size is large than 128 bytes, all hardware prefetchers don’t take effect.

Multiple L2 cache misses then put pressure on the memory buses, and the Athlon 64X2's on-die memory controller and separate I/O Hyper-Transport show their advantage. The Pentium D's memory latency shows a large jump for these accesses, while the Athlon 64X2's latency remains almost unchanged. We increased the pressure on the memory hierarchy by running two copies of lmbench simultaneously. Figures 3.4(b), 3.4(d) and 3.4(f) show the memory latencies with two lmbench copies. We found that the Core 2 Duo and Athlon 64X2 show only a slight increase in latency for stride sizes larger than 128 bytes, while the Pentium D's latencies in those situations increase considerably.

The Core 2 Duo benefits from its shared cache, which generates less external traffic, and from its faster FSB, while the Athlon 64X2 takes advantage of its on-chip memory controller and separate I/O Hyper-Transport. The Pentium D's latencies, however, jump because of memory bus saturation. We also ran the STREAM and STREAM2 benchmarks implemented in lmbench to measure the memory bandwidth and latency of the eight kernel operations. Figure 3.5(a) shows the memory bandwidth of the STREAM and STREAM2 operations when running a single copy of each operation. We made two observations.

First, the add operation in the STREAM suite shows much higher bandwidth than the other operations. After examining the related assembly code, we found that the add operation is a loop of c[i] = a[i] + b[i], which can easily take advantage of packed SSE2 operations. Other operations such as copy and fill do not use SSE2 instructions and therefore do not show much difference. Triad and daxpy have a longer delay and lower bandwidth per step because of the multiplication. The performance of the sum operation is hurt by its loop-carried dependence: s += a[i].
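The two loops below restate this first observation in illustrative C (not the STREAM source): the add kernel has independent iterations that the compiler can turn into packed SSE2 loads, adds and stores, while sum carries a dependence through the accumulator that serializes the additions unless the compiler re-associates them.

    #define N (1 << 20)

    /* STREAM-style add: independent element-wise work, easy to vectorize. */
    void stream_add(double *c, const double *a, const double *b)
    {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }

    /* STREAM2-style sum: each addition needs the previous result, so the loop
       is limited by floating-point add latency rather than memory bandwidth. */
    double stream2_sum(const double *a)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += a[i];               /* loop-carried dependence on s */
        return s;
    }

This difference in available instruction-level parallelism, rather than the memory system alone, explains why add reports a much higher bandwidth than sum.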

Second, the Intel Core 2 Duo shows the best bandwidth for all operations because of its L1 data prefetchers and faster Front-Side Bus.

Figure 3-5 Memory bandwidth and latency collected from the STREAM and STREAM2 benchmarks (1 or 2 copies)

Figure 3.5(b) depicts the memory bandwidth when running two copies of each operation in STREAM/STREAM2, one on each core. From this figure, we can see that the Core 2 Duo and Athlon 64X2 have better bandwidth than the Pentium D.

This is because the Pentium D's FSB is saturated when running two copies of each operation. The Athlon 64X2 benefits from its on-die memory controller and separate Hyper-Transport for I/O, although its main memory DDR bandwidth is lower than that of the Pentium D. The Core 2 Duo benefits from its L1 data prefetchers and faster FSB. Figures 3.5(c) and 3.5(d) show the memory latencies for the three machines. Similar to the bandwidth figures, the memory latencies of the Core 2 Duo and Pentium D are shorter than that of the Athlon 64X2 when a single copy of the STREAM/STREAM2 benchmark is running.

Apparently, the shorter latency of the on-die memory controller does not pay off in comparison with an off-die controller paired with better DRAM technology. However, when running the two-copy version, the memory latency of the Pentium D is higher than that of the other two.

4. Performance Comparison of Dual Core Processors Using Multiprogrammed and Multithreaded Benchmarks

4.1 Overview

This chapter compares the performance measurement results of the three dual-core desktop processors described in Chapter 3, the Intel Core 2 Duo E6400 at 2.13 GHz [15], the Intel Pentium D 830 at 3.0 GHz [19] and the AMD Athlon 64X2 4400+ at 2.2 GHz [2], using multi-programmed and multi-threaded benchmarks. To evaluate the architectural performance, a mixture of single-threaded and multiprogrammed benchmarks is used. A set of single-threaded workloads is run on the three systems to determine the dual-core speedups over a single core. For single-thread programs, we experiment with a subset of mixed SPEC CPU2000 and SPEC CPU2006 benchmarks [31]. To examine the scalability of single and dual cores, we run a set of single- and multi-threaded workloads on the three systems.

For multi-threaded workloads, we select blastp and hmmpfam from the BioPerf suite [6], SPECjbb2005 [32], and a subset of SPLASH2 [22].

4.2 Methodology

Similar to the methodology used in Chapter 3, we used SUSE Linux 10.1 with kernel 2.6.16-smp on all three machines for all experiments in this chapter. We used the maximum level of GCC optimization to compile all the C/C++ benchmarks, including SPEC CPU2000, SPEC CPU2006, SPLASH2, and blastp and hmmpfam from BioPerf. SPECjbb2005 was compiled using Sun JDK 1.5.0. For multiprogrammed workloads, the cross-product of mixed SPEC CPU2000/2006 benchmarks was run on the three machines to examine the dual-core speedups over a single core. All the SPEC CPU2000/2006 programs were run with their respective ref inputs. In our experiments, when two programs were run together, we guaranteed that each program was repeated at least four times. The shorter program may run more than four iterations until the longer program completes its four full iterations. We discarded the results obtained in the first run and used the average execution time and other metrics from the remaining three runs to determine the speedups.

We calculated the dual-core speedup for multiprogrammed workloads similarly to the method used in [25]. First, each program's running time was collected individually and taken as the base runtime. Second, the average execution time of each workload when run simultaneously with another was recorded. The dual-core speedup of each workload is then the ratio of its average run time when run individually (single core) to its average run time when run together (dual core). Finally, we add the speedups of the two programs run together to obtain the dual-core speedup.
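Written out as a small helper (names are illustrative, not code from the thesis), the calculation is:

    /* Dual-core speedup for a pair of co-scheduled programs, as described above.
       t_alone[i] is program i's solo run time; t_mixed[i] is its average run
       time when co-scheduled with the other program. */
    double dual_core_speedup(const double t_alone[2], const double t_mixed[2])
    {
        double s0 = t_alone[0] / t_mixed[0];   /* per-program speedup */
        double s1 = t_alone[1] / t_mixed[1];
        return s0 + s1;                        /* combined dual-core speedup */
    }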

For example, if the speedups of the two programs are 0.8 and 0.9 when run simultaneously, the resulting dual-core speedup is 1.7. We used the same procedure for homogeneous multi-threaded workloads, including blastp and hmmpfam from the BioPerf suite, a subset of SPLASH2, and SPECjbb2005. The BioPerf suite contains emerging bioinformatics programs, SPLASH2 is a widely used scientific workload suite, and SPECjbb2005 is a Java-based business database program. Table 4.1 lists the input parameters of the multithreaded workloads used.

We ran each of these workloads long enough to compensate for the overheads of the sequential portions of the workloads.

Table 4.1 Input parameters of the selected multithreaded workloads
blastp: Swissprot database, large input
hmmpfam: large input
barnes: 1048576 bodies
fmm: 524288 particles
ocean-continuous: 2050 x 2050 grid
fft: 2^24 total complex data points transformed
lu-continuous: 4096 x 4096 node matrix
lu-non-continuous: 4096 x 4096 node matrix
radix: 134217728 keys to sort
SPECjbb2005: default ramp-up time 30s, measurement time 240s, from 1 to 8 warehouses

4.3 Multiprogrammed Workload Measurements

We measured the execution times of a subset of the SPEC CPU2000 and CPU2006 benchmarks running on the three systems. In Figures 4.1(a) and 4.1(c), the Core 2 Duo processor runs fastest for almost all workloads, especially for the memory intensive workloads art and mcf. The Core 2 Duo has a wider pipeline, more functional units, and a shared L2 cache that provides a bigger cache for a single thread. The Athlon 64X2 shows the best performance for ammp, whose working set is large, resulting in a large number of L2 cache misses on all three machines.

The Athlon 64X2 benefits from its faster on-chip memory controller. Figures 4.1(b) and 4.1(d) depict the average execution time of each workload when mixed with another program from the same suite. Execution time increases for every workload. For the memory-bound programs art, mcf and ammp the increase is large, while CPU-bound workloads such as crafty, mesa, perl and sjeng show only a small increase.

Figure 4-1 SPEC CPU2000 and CPU2006 benchmarks execution time

The multi-programmed speedups of the cross-product of mixed SPEC CPU2000 and CPU2006 programs for the three machines are given in Figure 4.2, where C2D, PNT and ATH denote the measured Core 2 Duo, Pentium D, and Athlon 64X2 respectively. From Figure 4.2, we can see that the Athlon 64X2 achieves the best speedup of 2.0 for all the workloads. Crafty, eon, mesa in CPU2000 and perl in