Introduction

Parallel computing is the concurrent use of several computational resources, such as CPUs, to solve computational problems (Knowledge Base, 2010; Reschke, 2004). These problems are broken into separate parts and instructions that are executed and solved simultaneously by multiple CPUs (Barney, 2010).

Modern parallel computers typically execute multiple instructions on multiple data streams, and they divide the workload through domain decomposition. “Master nodes implicitly synchronize the computation and communication among processes and high level languages are used” (Karniadakis et al, 2003, p.61). However, modern parallel computer architecture is becoming increasingly complicated, and users are considering a move from general-purpose CPUs towards more specialised processors of a heterogeneous architectural nature (Brito Alves et al, 2009).

In this essay, I will critically engage with graphics processing units (GPUs), which are highly efficient at manipulating computer graphics and at processing large amounts of data simultaneously. For suitable algorithmic problems, their specialised design makes them more efficient than general-purpose CPUs. GPUs give a whole new meaning to parallel computing today because of their dedicated functions. I will examine this case through critical analysis, argumentation and engagement, to give a comprehensive understanding of modern parallel computing.

Graphics Processing Units

A graphics processing unit is a multi-core processor that was introduced to the scientific computing community on 31 August 1999 (Brito Alves et al, 2009; Nvidia Corporation, 2011). The ‘processor’ is the basic processing component, executing the instructions that drive various devices, and a grouping of processors is known as a ‘multiprocessor’ (Paolini, 2009). A single GPU contains hundreds of these processor cores, giving a system access to many cores at the same time (Brito Alves et al, 2009).

Modern GPUs have evolved from machines that simply rendered graphics into immensely parallel general-purpose processors. “Recently, they exceeded 1 TeraFLOPS, outrunning the computational power of commodity CPUs by two orders of magnitude” (Diamantaras et al, 2010, p.83).

GPUs now support concurrent floating-point computations throughout their shaders and programmable pipelines, so their functionality has become far more broadly applicable than before (Offerman, 2010). On a conventional processor, it is the control flow that holds the position of prominence, determining how an algorithm processes data and variables. A modern GPU, on the other hand, provides a stream processing model that executes the kind of concurrent calculations traditionally used in “High Performance Computing (HPC), industrial, finance, engineering programs and in High performance technical computing (HPTC)” (Offerman, 2010, p.32-33).

GPU manufacturers, however, did not spot the opportunity until consumers began to exploit these new capabilities. Only after HPC programs had been run on game consoles and graphics adapters did the manufacturers begin to extend their existing product lines with GPGPU (general-purpose computing on graphics processing units) solutions (Offerman, 2010). For instance, according to Offerman (2010), “some users deployed a stack of PlayStation 3 systems to do their parallel calculations. Today, IBM offers the Cell processor that is specifically designed for this game console as a parallel computing blade” (p.33).

The processing model of GPGPUs, also known as general-purpose GPUs, is massively parallel, but it relies heavily on “off-chip video memory” (Halfhill, 2008, p.3) to operate on large data sets. Distinct threads must interact with one another through this off-chip memory, and as the frequency of memory accesses increases, performance tends to become limited (Halfhill, 2008). Graphics processor manufacturers were slow to adopt the GPGPU trend, judging by the sales of high-end systems for HPC and HPTC (Offerman, 2010). Offerman (2010) states, “The double precision floating-point operations have been introduced over the last years, but performance in that area is still lacking. The same goes for access to memory” (p.34).

Despite these disadvantages, both nVidia and ATI now provide product lines targeted at GPGPU programmes (Offerman, 2010). ATI’s portfolio consists of its Stream products, whereas nVidia offers the Tesla cards “based on their GeForce 8 GPUs” (Offerman, 2010, p.34), which can be programmed using the Compute Unified Device Architecture (CUDA), discussed further on in this essay. GPGPU computing revolves around large data structures and matrices “where super-fast and in parallel relatively small computations are performed on the individual elements” (Offerman, 2010, p.33). This is why a graphics processor has far more local memory than a traditional CPU, and it is what makes a GPU particularly suitable for large parallel applications today (Offerman, 2010).

A GPU uses a ‘Single Program, Multiple Data’ (SPMD) architecture to specialise in intensely parallel calculations. A large share of its transistors is devoted to data processing rather than to caching data (Alerstam et al, 2008). GPUs today are highly data-parallel processors, used to deliver substantially high “floating point arithmetic throughput” (Alerstam et al, 2008, p.060504-1) for problems that can be solved with the SPMD model. “On a GPU, the SPMD model works by launching thousands of threads running the same program called the kernel working on different data. The ability of the GPU to rapidly switch between threads in combination with the high number of threads ensures the hardware is busy at all times” (Alerstam et al, 2008, p.060504-1). This capability efficiently hides memory latency, and GPU performance is further improved by the multiple levels of high-bandwidth memory available in the latest GPUs (Alerstam et al, 2008).
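
As an illustration of this SPMD style, the following minimal sketch (the kernel and variable names are my own and not taken from the cited sources) shows a CUDA kernel in which thousands of threads all run the same code, each selecting a different element of the data from its block and thread indices:

    // Minimal SPMD sketch: every thread executes the same kernel,
    // but each works on a different element of the arrays.
    __global__ void vectorAdd(const float *a, const float *b, float *c, int n)
    {
        // Derive a unique global index from the block and thread identifiers.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                     // guard threads that fall beyond the data
            c[i] = a[i] + b[i];
    }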

“Nvidia revolutionized the GPGPU and accelerated the computing world in 2006-2007 by introducing its new massively parallel “CUDA” architecture. The CUDA architecture consists of 100s of processor cores that operate together to crunch through the data set in the application” (Nvidia Corporation, 2011, p.1). The CUDA GPU programming framework from Nvidia enables the development of parallel applications through an extension of C, known as “C for CUDA” (Diamantaras et al, 2010).

Nvidia’s CUDA is a software platform for massively parallel high-performance computing on the firm’s own powerful GPUs (Halfhill, 2008). CUDA is a scalable parallel programming model. “The CUDA programming model has the SPMD software style, in which a programmer writes a program for one thread that is instanced and executed by many threads in parallel on the multiple processors of the GPU” (Patterson et al, 2009, p.A-5). The CUDA model treats the graphics device as a discrete co-processor to the CPU. CUDA programs, as mentioned before, “are based on the C programming language with certain extensions to utilize the parallelism of the GPU. These extensions also provide very fast implementations of standard mathematical functions such as trigonometric functions, floating point divisions, logarithms, etc.” (Alerstam et al, 2008, p.060504-2). Calculations on the GPU are initiated by kernel functions, which are essentially C functions executed in N parallel threads. Semantically, the threads are arranged into one-, two- or three-dimensional sets of up to 512 threads, known as ‘blocks’. Each block is scheduled to run independently on a multiprocessor, and blocks may be executed concurrently or sequentially, in any order, depending on the resources of the system. However, this scalable design comes at the cost of limited communication amongst threads (Diamantaras et al, 2010).
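
A hedged sketch of how such a kernel might be configured and launched from the host is given below; the block size of 256 threads (within the 512-thread limit mentioned above), the array size and all names are illustrative assumptions rather than details from the cited sources:

    #include <cuda_runtime.h>

    // Illustrative kernel using one of CUDA's fast device math functions
    // (__sinf is a hardware-accelerated single-precision sine).
    __global__ void applySine(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] = __sinf(data[i]);
    }

    int main(void)
    {
        const int n = 1 << 20;
        float *d_data;
        cudaMalloc(&d_data, n * sizeof(float));   // device allocation
        // (initialisation and host-to-device copies omitted for brevity)

        // Threads are grouped into blocks; blocks form a grid covering all n elements.
        dim3 threadsPerBlock(256);
        dim3 blocksPerGrid((n + threadsPerBlock.x - 1) / threadsPerBlock.x);
        applySine<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);

        cudaDeviceSynchronize();                  // wait for the kernel to finish
        cudaFree(d_data);
        return 0;
    }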

To run multiple threads simultaneously, the multiprocessors employ a ‘Single Instruction, Multiple Threads’ (SIMT) architecture. The SIMT units in the multiprocessors schedule and execute groups of 32 parallel threads in lockstep, and efficiency is maximised when every thread follows the same instruction path. Memory accesses from multiple threads can be coalesced into a single memory transaction, provided that successive threads access data from the same segment of memory (Diamantaras et al, 2010). Diamantaras, Duch and Iliadis (2010) argue further, “Following such specific access patterns can dramatically improve the memory utilization and is essential for optimizing the performance of an application” (p.83). Even so, the CUDA framework is best suited to applications with a high ratio of arithmetic operations to memory accesses (Diamantaras et al, 2010).
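
The effect of these access patterns can be sketched with two hypothetical kernels (the names and the stride parameter are illustrative, not from the cited sources): in the first, consecutive threads of a warp read consecutive addresses, so the reads can be coalesced into a few memory transactions; in the second, neighbouring threads touch addresses far apart, so each warp generates many separate transactions:

    // Coalesced: thread k of a warp reads element k of a contiguous segment.
    __global__ void copyCoalesced(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = in[i];
    }

    // Strided: neighbouring threads read addresses 'stride' elements apart,
    // so the accesses cannot be combined and memory throughput drops.
    __global__ void copyStrided(const float *in, float *out, int n, int stride)
    {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        if (i < n)
            out[i] = in[i];
    }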

The CUDA model, like the GPGPU model, is massively parallel. However, it divides data sets into relatively small blocks that are stored in on-chip memories, which several thread processors can then share. Storing data locally reduces the need to access off-chip memory and thereby improves performance. From time to time a thread does need to access off-chip memory, for instance to load off-chip data into local memory, but such accesses in CUDA normally do not stall the thread processors. Instead, the stalled threads enter an inactive queue and are replaced by other threads that are ready for execution. As soon as the delayed data becomes available, the waiting threads enter another queue signalling that they are ready to go. Groups of threads then take turns executing in a round-robin style, ensuring that each thread gets execution time without stalling the others (Halfhill, 2008).
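
The following sketch illustrates this staging of data in on-chip memory, assuming a block-wise summation as the workload (the kernel name, block size and use of per-block partial sums are my own illustrative choices): each block copies its slice of the input into fast shared memory once, and the threads of the block then cooperate on that tile without further off-chip reads:

    #define BLOCK_SIZE 256   // assumed to equal the number of threads per block

    __global__ void blockSum(const float *in, float *blockSums, int n)
    {
        __shared__ float tile[BLOCK_SIZE];            // on-chip, shared by the block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // one off-chip read per thread
        __syncthreads();                              // wait until the tile is filled

        // Tree reduction carried out entirely in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            blockSums[blockIdx.x] = tile[0];          // one off-chip write per block
    }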

A prominent characteristic of the CUDA model is that programmers do not write explicitly threaded code; a hardware thread manager handles the threading automatically (Halfhill, 2008). “Automatic thread management is vital when multithreading scales to thousands of threads—as with Nvidia’s GeForce 8 GPUs, which can manage as many as 12,288 concurrent threads” (Halfhill, 2008, p.3). Even though the threads are lightweight, meaning that each operates on a relatively small piece of data, they are fully fledged threads: every thread has its own stack, register file, program counter and local memory. Each thread processor has 1,024 registers, each 32 bits wide, implemented in static random access memory (SRAM) rather than latches. The GPU preserves the state of inactive threads and restores it when they become active again. As Halfhill (2008) states, “Instructions from multiple threads can share a thread processor’s instruction pipeline at the same time, and the processors can switch their attention among these threads in a single clock cycle. All this run-time thread management is transparent to the programmer” (p.3).

By removing the burden of explicit thread management, Nvidia simplifies the programming model and eliminates an entire class of potential bugs. In principle, the CUDA model rules out deadlocks amongst threads, a deadlock being a mutual blockage in which several threads contend for data in such a way that none of them can proceed. The danger of deadlocks is that they can lie undetected in otherwise well-behaved code for decades (Halfhill, 2008). CUDA avoids such deadlocks regardless of the number of threads running. An application programming interface (API) function named ‘syncthreads’ provides explicit barrier synchronization. “Calling syncthreads at the outer level of a section of code invokes a compiler-intrinsic function that translates into a single instruction for the GPU” (Halfhill, 2008, p.4). This instruction prevents threads from operating on data that other threads are still using.
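
In C for CUDA the barrier appears as the intrinsic __syncthreads(). The short sketch below (a block-level array reversal of my own devising, not an example from the cited sources) shows the typical pattern: without the barrier, a thread might read a shared-memory element that another thread has not yet written:

    #define TILE 512   // assumed to equal the number of threads in the block

    __global__ void reverseInBlock(float *data)
    {
        __shared__ float tmp[TILE];

        int t = threadIdx.x;
        tmp[t] = data[t];              // every thread writes one element
        __syncthreads();               // barrier: all writes finish before any read

        data[t] = tmp[TILE - 1 - t];   // safe: this element was written by another thread
    }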

At the point where graphics processing intersects with parallel computing, a modern paradigm for graphics called ‘visual computing’ has emerged. It replaces broad segments of the “traditional sequential hardware graphics pipeline model” (Patterson et al, 2009, p.A-4) with programmable geometry, vertex and pixel components. Patterson and Hennessy (2009) argue, “Visual computing in a modern GPU combines graphics processing and parallel computing in novel ways that permit new graphic algorithms to be implemented, and open the doors to entirely new parallel processing applications on pervasive high-performance GPUs” (p.A-4).

Even though GPUs are arguably the most parallel and most potent processors in a typical computer, they are not the only processors present. CPUs have become multicore and will in the near future become manycore. They remain the primary sequential processors and complement the hugely parallel manycore GPUs; together, these two types of processor form heterogeneous multiprocessor systems (Patterson et al, 2009).

GPUs have evolved into scalable parallel processors, developing in function from Video Graphics Array (VGA) controllers of limited capability into programmable parallel processors. The evolution has continued by adding programmable components to the API-based graphics pipeline and by making the remaining hardware pipeline stages more programmable and less specialised. Eventually it made sense to unify the distinct programmable pipelines into one merged array of many programmable processors. In the “GeForce 8-series generation of GPUs” (Patterson et al, 2009, p.A-5), pixel, vertex and geometry processing all run on the same type of processor. This unification enables impressive scalability: larger systems benefit from more processor cores, since each processing function can use the whole processor array, while smaller systems can be built with fewer processors, since every function can run on the same processors (Patterson et al, 2009, p.A-5).

However, one lesson that can be learnt from GPUs and graphics software is that an API need not expose the concurrency to programmers directly (Asanovic et al, 2006). “OpenGL, for instance, allows the programmer to describe a set of “vertex shader” operations in Cg (a specialized language for describing such operations) that are applied to every polygon in the scene without having to consider how many hardware fragment processors or vertex processors are available in the hardware implementation of the GPU” (Asanovic et al, 2006, p.13).

The uniformity and scalability of the processor array gives rise to new programming models for GPUs. The substantial floating-point power embedded in the GPU’s processor array makes it possible to solve non-graphics problems as well (Patterson et al, 2009). As Patterson and Hennessy (2009) say, “Given the large degree of parallelism and the range of scalability of the processor array for graphics applications, the programming model for more general computing must express the massive parallelism directly but allow for scalable execution” (p.A-5).

The latest GPU hardware provides numerous floating-point units that run simultaneously on Single Instruction, Multiple Data (SIMD) vectors. These units also operate on scalar data types, so a GPU can execute scalar operations concurrently as well, providing heterogeneous parallelism (Fritz, 2009). “Various generations of Intel Pentiums and Power PCs only feature up to three 4-way SIMD vector processing units” (Fritz, 2009, p.2). This means that a GPU can exploit SIMD parallelism both within a single component and across a great number of components, whereas SIMD CPUs exploit only the element-wise parallelism (Fritz, 2009).

Intel expects manycore processors to support tens to thousands of threads. After experimenting with hyper-threading and dual-core technologies, CPU manufacturers have now undeniably entered the multicore era, and in the not-so-distant future general-purpose processors will contain not just tens or hundreds but thousands of cores (Offerman, 2010).

According to nVidia, however, such processors already exist: graphics processors contain tens to hundreds of cores and support thousands of threads. GPUs are currently available as separate chips on graphics cards and motherboards, and application programmers are gradually making use of them (Offerman, 2010). “For specific problems they have found mappings onto these graphics engines that result in speedups by two orders of magnitude” (Offerman, 2010, p.1). Graphics processor manufacturers have become fully aware of this opportunity and are working to position their products as more than just graphics processors (Offerman, 2010).

From a programmer’s point of view, a CPU offers a multi-threading model that tolerates many control-flow instructions, whereas a GPU offers a stream processing model that puts a “large performance penalty on control flow changes... Currently, new programming languages and extensions to current languages are developed, supporting both explicit and implicit parallelism” (Offerman, 2010, p.1).

General-purpose processing on a GPU is commonly known as ‘stream computing’, which emphasises parallelism to achieve high performance. As Paolini (2009) puts it, “Beyond simple multithreaded programming, stream computing represents a logical extreme, where a massive number of threads work concurrently toward a common goal” (p.49). However, even though a GPU contains many processor cores, these are mostly not general-purpose CPU cores, so the cores are limited in what they can do. For instance, a modern nVidia GPU consists of several multiprocessors, each containing several SIMD stream processors, but the memory structure of the architecture is complex. The GPU’s multiple memory types are categorised according to their scope: “Registers serve individual processors; shared memory, constant cache, and texture cache serve multiprocessors; and global device memory serves all cores” (Brito Alves et al, 2009, p.785; Richardson and Gray, 2008). Memory at the multiprocessor level has low latency and allows threads in the same block to communicate with each other, whereas the global device memory, which can also be accessed from the CPU, has much higher latency. “The problem must be highly parallel so that the program can break it into enough threads to keep the individual processors busy” (Wolfe, 2008, p.785).
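
These scopes map directly onto declarations in C for CUDA. The sketch below (with invented names, an assumed block size of 256 threads, and texture memory omitted for brevity) shows where each kind of variable lives:

    __constant__ float coeff[16];          // constant memory: cached and read-only,
                                           // visible to every multiprocessor

    __global__ void memoryScopes(const float *globalIn, float *globalOut, int n)
    {
        __shared__ float tile[256];        // shared memory: on-chip, one copy per block

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (i < n) ? globalIn[i] : 0.0f;   // global memory: serves all cores
        __syncthreads();

        float r = tile[threadIdx.x] * coeff[threadIdx.x % 16];   // r is held in a register,
                                                                 // private to this thread
        if (i < n)
            globalOut[i] = r;
    }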

GPU architecture today holds great potential for scientific computing and can provide effective parallel solutions to linear systems (Brito Alves et al, 2009).

Conclusion

Based on the facts and arguments presented above, it can be concluded that GPUs are indeed highly parallel structures that use specialised architectures to execute extremely parallel calculations. Great emphasis was placed on CUDA because it plays a vital role in the functioning of GPUs; as a parallel computing architecture, CUDA is the computing engine of the GPU.

Professor Dongarra (2011), the Director of the Innovative Computing Laboratory of The University of Tennessee, states that “GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on multi-core systems. Future computing architectures will be hybrid systems with parallel-core GPUs working in tandem with multi-core CPUs.” (p.1).

“Graphics processors deliver their highest performance at successive, relatively simple, massively parallel operations on large matrixes, using as few as possible control flow instructions” (Offerman, 2010, p.34). Modern GPUs have architectures built to emphasise the execution of many concurrent threads. This distinctive GPGPU approach to solving computational problems gives GPUs the edge in carrying out parallel computing more effectively and comprehensively, making them one of the most parallel structures in the computing world today.

Bibliography

Alerstam, E., Svensson, T. and Andersson-Engels, S. (2008) Parallel computing with graphics processing units for high-speed Monte Carlo simulation of photon migration. ONLINE: [Accessed 28 March 2011]

Asanovic, K., Bodik, R., Catanzaro, C.B., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W. and Yelick, K.A. (2006) The Landscape of Parallel Computing Research: A View from Berkeley. Electrical Engineering and Computer Sciences, University of California at Berkeley. [Accessed 31 March 2011]

Barney, B. (2010) Introduction to Parallel Computing. Lawrence Livermore National Laboratory. ONLINE: [Accessed 30 March 2011]

Brito Alves, R.M., Oller Nascimento, C.A. and Biscaia Jr., E.C. (2009) 10th International Symposium on Process Systems Engineering. Computer-Aided Chemical Engineering, 27. ONLINE: [Accessed 31 March 2011]

Diamantaras, K., Duch, W. and Iliadis, L.S. (2010) Artificial Neural Networks – ICANN 2010, Part III. Springer-Verlag, Berlin Heidelberg. ONLINE: [Accessed 29 March 2011]

Dongarra, J. (2011) Director of the Innovative Computing Laboratory, The University of Tennessee. Nvidia Corporation (2011) ONLINE: [Accessed 4 April 2011]

Fritz, N. (2009) SIMD Code Generation in Data-Parallel Programming. ONLINE: [Accessed 1 April 2011]

Halfhill, T.R. (2008) Parallel Processing with CUDA: Nvidia’s High-Performance Computing Platform Uses Massive Multithreading. Microprocessor Report: The Insider’s Guide to Microprocessor Hardware. ONLINE: [Accessed 29 March 2011]

Karniadakis, G. and Kirby, R.M. (2003) Parallel Scientific Computing in C++ and MPI: A Seamless Approach to Parallel Algorithms and Their Implementation. Cambridge University Press. ONLINE: [Accessed 1 April 2011]

Knowledge Base (2010) What are parallel computing, grid computing and supercomputing? University Information Technology Services, Indiana University. ONLINE: [Accessed 28 March 2011]

Nvidia Corporation (2011) ONLINE: [Accessed 1 April 2011]

Offerman, A. (2010) Modern Commodity Hardware for Parallel Computing, and Opportunities for Artificial Intelligence. Leiden University. ONLINE: [Accessed 29 March 2011]

Paolini, A.L. (2009) A real-time super resolution implementation using modern graphics processing units. University of Delaware. ONLINE: [Accessed 30 March 2011]

Patterson, D.A. and Hennessy, J.L. (2009) Computer Organization and Design: The Hardware/Software Interface, 4th Edition. ONLINE: [Accessed 29 March 2011]

Reschke, J. (2004) Parallel Computing (Presentation)