RISC & Pipelining
What is RISC Architecture?
* RISC stands for Reduced Instruction Set Computer.
* An instruction set is the set of instructions from which machine language programs are built to carry out computable tasks.

History
* In the early days, mainframes consumed a lot of resources for their operations.
* Because of this, in 1980 David Patterson at the University of California, Berkeley introduced the RISC concept.
* It used fewer instructions with simple constructs, which gave faster execution and lower memory usage by the CPU.
* It took approximately a year to design and fabricate RISC I in silicon.
* In 1983, Berkeley RISC II was produced.
* It was with RISC II that the RISC idea was opened to industry.
* In later years RISC ideas were incorporated into Intel processors.
* After some years, the two kinds of instruction set began to converge: RISC designs started incorporating more complex instructions, and CISC designs started to reduce the complexity of theirs.
* By the mid-1990s some RISC processors had become more complex than CISC ones.
* Today the difference between RISC and CISC is blurred.

Characteristics and Comparisons
* As mentioned, the difference between RISC and CISC is disappearing, but these were the initial differences between the two:
RISC | CISC
Fewer instructions | More instructions (100-250)
More registers, hence more on-chip memory (faster) | Fewer registers
Operations are done within the registers of the CPU | Operations can be done outside the CPU, e.g. in memory
Fixed-length instruction format, hence easily decoded | Variable-length instruction format
Instructions execute in one clock cycle, hence simpler instructions | Instructions execute in multiple clock cycles
Hard-wired control, hence faster | Microprogrammed control
Fewer addressing modes | A variety of addressing modes

Addressing modes: register direct, immediate addressing, absolute addressing.
Give examples of one set of instructions for a particular operation, and of instruction formats.
http://www-cs-faculty.stanford.edu/~eroberts/courses/soco/projects/2000-01/risc/risccisc/

Advantages and Disadvantages
* Speed of instruction execution is improved.
* Processors reach the market sooner, since fewer instructions take less time to design and fabricate.
* Chip size is smaller because fewer transistors are needed.
* Power consumption is lower, and hence less heat is dissipated.
* Cost is lower because fewer transistors are used.
* Because the instructions have a fixed length, memory is not used efficiently.
* For complex operations, the number of instructions will be larger.
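The "fixed-length instruction format, hence easily decoded" entry in the comparison can be made concrete with a small sketch. The 16-bit layout and the opcode numbers below are assumptions invented for this illustration; they do not correspond to any real processor.

```python
# Hypothetical 16-bit fixed-length format (an assumption for illustration):
#   bits 15-12: opcode, bits 11-8: destination register,
#   bits 7-4: source register 1, bits 3-0: source register 2
OPCODES = {0x1: "ADD", 0x2: "SUB", 0x3: "LOAD", 0x4: "STORE"}

def decode(word):
    """Decode one instruction using only constant shifts and masks."""
    opcode = (word >> 12) & 0xF
    rd = (word >> 8) & 0xF
    rs1 = (word >> 4) & 0xF
    rs2 = word & 0xF
    return OPCODES[opcode], rd, rs1, rs2

def decode_program(words):
    # Every instruction has the same size, so instruction i is simply words[i];
    # no earlier instruction must be decoded to find where the next one starts.
    return [decode(w) for w in words]

print(decode_program([0x1312, 0x2645]))
# [('ADD', 3, 1, 2), ('SUB', 6, 4, 5)]
```

The same sketch also hints at the memory disadvantage listed above: every instruction occupies the full 16 bits, even one that does not need all four fields.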
Pipelining
The origin of pipelining is thought to be in the early 1940s. The processor has specialised units for executing each stage of the instruction cycle, so several instructions are processed concurrently, each in a different stage. It is like an assembly line.

Time Step (clock) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Instruction 1 | IF | ID | OF | OE | OS | | |
Instruction 2 | | IF | ID | OF | OE | OS | |
Instruction 3 | | | IF | ID | OF | OE | OS |
Instruction 4 | | | | IF | ID | OF | OE | OS

IF = instruction fetch, ID = instruction decode, OF = operand fetch, OE = operand execute, OS = operand store.

Pipelining is used to increase the speed of the processor by overlapping the stages of the instruction cycle for successive instructions. It improves the instruction execution bandwidth. Each instruction takes 5 clock cycles to complete.
When pipelining is used, the first instruction still takes 5 clock cycles, but each subsequent instruction finishes 1 clock cycle after the previous one.

Types of Pipelining
There are various types of pipelining. These include the arithmetic pipeline, the instruction pipeline, superpipelining, superscalar execution and, possibly, vector processing.

Arithmetic pipeline: Used for scientific problems that involve operations such as floating point operations and fixed point multiplications. These operations are divided into segments (sub-operations), which can be performed concurrently on successive operands, leading to faster execution.
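The overlap described for the five-stage pipeline, and equally for the segments of an arithmetic pipeline, can be checked with a little arithmetic. The sketch below assumes an ideal pipeline: every stage takes exactly one clock and there are no stalls. The function names are made up for this example.

```python
def cycles_without_pipelining(n_instructions, n_stages=5):
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages

def cycles_with_pipelining(n_instructions, n_stages=5):
    """The first instruction fills the pipeline; each later one finishes 1 clock after."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

for n in (1, 4, 100):
    plain = cycles_without_pipelining(n)
    piped = cycles_with_pipelining(n)
    print(n, plain, piped, round(plain / piped, 2))
# 1 5 5 1.0
# 4 20 8 2.5
# 100 500 104 4.81
```

For 4 instructions the result is 8 clocks, and as the number of instructions grows the speedup approaches the number of stages (5 here) without ever reaching it.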
Instruction pipeline: This is the general form of pipelining explained above.

Pipeline Hazards
Data Dependency: Occurs when two or more instructions in the pipeline share the same data resource, that is, when an instruction tries to access or modify data that is being used by another instruction. In the following, instruction i comes before instruction j in the program. There are three types of data dependency:
RAW (Read After Write): Instruction j reads the data before instruction i has written it. This means that the value read is too old.
WAR (Write After Read): Instruction j writes the data before instruction i has read it. This means that the value read is too new.
WAW (Write After Write): Instruction j writes the data before instruction i has written it. This means that a wrong value is stored.

Solutions: Data Dependency
* Stall the pipeline: The data dependency is detected and the subsequent instructions are not allowed to enter the pipeline until it is resolved. Special hardware is needed to detect the data dependency, and a time delay is caused.
* Flush the pipeline: When a data dependency occurs, all other instructions are removed from the pipeline. This also causes a time delay.
* Delayed load: No Operation (NOP) instructions are inserted between data-dependent instructions. This is done by the compiler and it avoids the hazard caused by the data dependency.

Without a NOP:
Clock Cycle | 1 | 2 | 3 | 4 | 5 | 6
1. Load R1 | IF | OE | OS | | |
2. Load R2 | | IF | OE | OS | |
3. Add R1 + R2 | | | IF | OE | OS |
4. Store R3 | | | | IF | OE | OS

With a NOP inserted:
Clock Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7
1. Load R1 | IF | OE | OS | | | |
2. Load R2 | | IF | OE | OS | | |
3. NOP | | | IF | OE | OS | |
4. Add R1 + R2 | | | | IF | OE | OS |
5. Store R3 | | | | | IF | OE | OS

Branch Dependency: This happens when an instruction in the pipeline branches to another instruction.
Since the instructions that follow the branch have already entered the pipeline when the branch is taken, they have to be discarded, and a branch penalty occurs.

Solutions: Branch Dependency
1. Branch prediction: The outcome of a branch is predicted, and instructions are fetched into the pipeline accordingly.
2. Branch target buffer: A small cache of the target addresses of previously executed branches, which lets the fetch stage be redirected to the target without waiting for the branch to execute.
3. Delayed branch: The compiler detects branch dependencies and rearranges the code so that the branch penalty is avoided. No Operation (NOP) instructions can also be used.

1. LOAD MEM R1
2. INCREMENT R2
3. ADD R3 R3 + R4
4. SUB R6 R6-R5
5. BRA X

Clock Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
1. Load | IF | OE | OS | | | | | |
2. Increment | | IF | OE | OS | | | | |
3. Add | | | IF | OE | OS | | | |
4. Subtract | | | | IF | OE | OS | | |
5. Branch to X | | | | | IF | OE | OS | |
6. Next instructions | | | | | | | IF | OE | OS

Adding NOP instructions:
Clock Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
1. Load | IF | OE | OS | | | | | |
2. Increment | | IF | OE | OS | | | | |
3. Add | | | IF | OE | OS | | | |
4. Subtract | | | | IF | OE | OS | | |
5. Branch to X | | | | | IF | OE | OS | |
6. NOP | | | | | | IF | OE | OS |
7. Instructions in X | | | | | | | IF | OE | OS
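The NOP insertion used above, for both delayed load and delayed branch, can be sketched as a tiny compiler pass. This is only an illustration under simplifying assumptions: the `Instr` record, the `insert_nops` function and the two rules (one NOP after a load whose result is needed immediately, one NOP after every branch) are invented for this sketch and are not taken from a real compiler.

```python
from collections import namedtuple

# Hypothetical instruction record: operation, destination register, source registers.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])
NOP = Instr("NOP", None, ())

def insert_nops(program):
    """Return a copy of the program with NOPs added by two simple rules."""
    out = []
    for instr in program:
        prev = out[-1] if out else None
        # Delayed load: the previous instruction is a LOAD whose result is needed now.
        if prev is not None and prev.op == "LOAD" and prev.dest in instr.srcs:
            out.append(NOP)
        out.append(instr)
        # Delayed branch: fill the slot after every branch with a NOP.
        if instr.op == "BRA":
            out.append(NOP)
    return out

load_example = [
    Instr("LOAD", "R1", ()),
    Instr("LOAD", "R2", ()),
    Instr("ADD", "R3", ("R1", "R2")),
    Instr("STORE", None, ("R3",)),
]
branch_example = [
    Instr("LOAD", "R1", ()),
    Instr("INCREMENT", "R2", ("R2",)),
    Instr("ADD", "R3", ("R3", "R4")),
    Instr("SUB", "R6", ("R6", "R5")),
    Instr("BRA", None, ()),
]

print([i.op for i in insert_nops(load_example)])
# ['LOAD', 'LOAD', 'NOP', 'ADD', 'STORE']
print([i.op for i in insert_nops(branch_example)])
# ['LOAD', 'INCREMENT', 'ADD', 'SUB', 'BRA', 'NOP']
```

A NOP does no useful work, so each one inserted costs a clock cycle; that is why a compiler would prefer to fill the slot with an independent instruction when one is available.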
Rearranging the instructions:
Clock Cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
1. Load | IF | OE | OS | | | | |
2. Increment | | IF | OE | OS | | | |
3. Branch to X | | | IF | OE | OS | | |
4. Add | | | | IF | OE | OS | |
5. Subtract | | | | | IF | OE | OS |
6. Instructions in X | | | | | | IF | OE | OS

The original Intel Pentium 4 processors have 20-stage pipelines. Today, pipelining circuits of this kind are embedded in most microprocessors.

Superscalar execution: This is a form of parallelism combined with pipelining. The processor has redundant execution units, which provide the parallelism.
Superscalar: 1984, Star Technologies, Roger Chen
Time Step (clock) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
Instruction 1 | IF | ID | OF | OE | OS | | |
Instruction 2 | IF | ID | OF | OE | OS | | |
Instruction 3 | | IF | ID | OF | OE | OS | |
Instruction 4 | | IF | ID | OF | OE | OS | |
Instruction 5 | | | IF | ID | OF | OE | OS |
Instruction 6 | | | IF | ID | OF | OE | OS |
Instruction 7 | | | | IF | ID | OF | OE | OS
Instruction 8 | | | | IF | ID | OF | OE | OS

Superpipelining: This is the implementation of longer pipelines, that is, pipelines with more stages. It is mainly useful when some stages in the pipeline take longer than others. The longest stage determines the clock cycle, so if the long stages can be broken down into shorter stages, the clock cycle time can be reduced.
This reduces the time wasted in the shorter stages, which becomes significant when a large number of instructions is executed. Superpipelining is simple because it does not need additional hardware, as superscalar execution does. However, superpipelining has more side effects because the number of stages in the pipeline is increased: a data or branch dependency causes a longer delay.

Vector Processing
Vector Processors: 1970s
Vector processors pipeline the data as well, not just the instructions. For example, if many numbers need to be added together, such as 10 pairs of numbers, a normal processor adds one pair at a time.
This means the same sequence of instruction fetching and decoding has to be carried out 10 times. In vector processing, since the data is also pipelined, the instruction fetch and decode occur only once and the 10 pairs of numbers (operands) are fetched together. Thus the time to process the instructions is reduced significantly.

C(1:10) = A(1:10) + B(1:10)

Vector processors are mainly used in specialised applications such as long-range weather forecasting, artificial intelligence systems and image processing.
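The difference between the two ways of computing C(1:10) = A(1:10) + B(1:10) can be modelled by counting instruction fetch/decode steps. This is a toy model, not a description of real vector hardware; the function names and the counter are assumptions made for the illustration.

```python
def scalar_add(a, b):
    """Add the pairs one at a time: one instruction fetch/decode per pair."""
    fetch_decodes = 0
    c = []
    for x, y in zip(a, b):
        fetch_decodes += 1  # the ADD instruction is fetched and decoded again
        c.append(x + y)
    return c, fetch_decodes

def vector_add(a, b):
    """Model a single vector ADD: one fetch/decode, operands streamed through."""
    fetch_decodes = 1
    c = [x + y for x, y in zip(a, b)]
    return c, fetch_decodes

a = list(range(1, 11))   # A(1:10)
b = list(range(11, 21))  # B(1:10)

c_scalar, n_scalar = scalar_add(a, b)
c_vector, n_vector = vector_add(a, b)
print(c_scalar == c_vector, n_scalar, n_vector)
# True 10 1
```

Both versions produce the same sums; only the number of times the instruction has to be fetched and decoded differs.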
Analysis of the performance limitations of the conventional CISC-style architectures of the period quickly showed that operations on vectors and matrices were among the most demanding CPU-bound numerical problems.

RISC Pipelining: RISC has simple instructions. This simplicity is used to reduce the number of stages in the instruction pipeline. For example, a separate Instruction Decode stage is not necessary because the instruction encoding in a RISC architecture is simple. Operands are all held in registers, hence there is no need to fetch them from memory.
This reduces the number of stages further. Therefore, with a RISC architecture, the stages in the pipeline are instruction fetch, operand execute and operand store. Because the instructions are of fixed length, each stage in the RISC pipeline can be executed in one clock cycle.

Questions
1. Is vector processing a type of pipelining?
2. RISC and pipelining

The simplest way to examine the advantages and disadvantages of RISC architecture is by contrasting it with its predecessor: CISC (Complex Instruction Set Computer) architecture.

Multiplying Two Numbers in Memory
Consider the storage scheme of a generic computer. The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4. The execution unit is responsible for carrying out all computations. However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F). Let’s say we want to find the product of two numbers, one stored in location 2:3 and another stored in location 5:2, and then store the product back in location 2:3.

The CISC Approach
The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible. This is achieved by building processor hardware that is capable of understanding and executing a series of operations. For this particular task, a CISC processor would come prepared with a specific instruction (we’ll call it “MULT”). When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register. Thus, the entire task of multiplying two numbers can be completed with one instruction:

MULT 2:3, 5:2
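The work hidden inside the single “MULT” instruction can be spelled out in a small model of the generic computer. The operand values 6 and 7 are made up for the example, and the final write-back to location 2:3 is an assumption made so that the result matches the stated goal of storing the product back in 2:3.

```python
# Toy model of the generic computer: memory addressed by (row, column),
# six registers A to F. The operand values are made up for the example.
memory = {(2, 3): 6, (5, 2): 7}
registers = dict.fromkeys("ABCDEF", 0)

def mult(dst, src):
    """Model of the complex MULT instruction: one instruction, several internal steps."""
    registers["A"] = memory[dst]                      # step 1: load the first operand
    registers["B"] = memory[src]                      # step 2: load the second operand
    registers["A"] = registers["A"] * registers["B"]  # step 3: multiply in the execution unit
    memory[dst] = registers["A"]                      # step 4: write the product back (assumed)

mult((2, 3), (5, 2))   # MULT 2:3, 5:2
print(memory[(2, 3)])
# 42
```

The programmer writes one line, but the hardware still has to perform every step in the function body.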
MULT is what is known as a “complex instruction.” It operates directly on the computer’s memory banks and does not require the programmer to explicitly call any loading or storing functions. It closely resembles a command in a higher level language. For instance, if we let “a” represent the value of 2:3 and “b” represent the value of 5:2, then this command is identical to the C statement “a = a * b.” One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly.
Because the length of the code is relatively short, very little RAM is required to store instructions. The emphasis is put on building complex instructions directly into the hardware.

The RISC Approach
RISC processors only use simple instructions that can be executed within one clock cycle. Thus, the “MULT” command described above could be divided into three separate commands: “LOAD,” which moves data from the memory bank to a register, “PROD,” which finds the product of two operands located within the registers, and “STORE,” which moves data from a register to the memory banks.
In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly:

LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A

At first, this may seem like a much less efficient way of completing the operation. Because there are more lines of code, more RAM is needed to store the assembly level instructions. The compiler must also perform more work to convert a high-level language statement into code of this form.

CISC | RISC
Emphasis on hardware | Emphasis on software
Includes multi-clock complex instructions | Single-clock, reduced instructions only
Memory-to-memory: “LOAD” and “STORE” incorporated in instructions | Register-to-register: “LOAD” and “STORE” are independent instructions
Small code sizes, high cycles per second | Low cycles per second, large code sizes
Transistors used for storing complex instructions | Spends more transistors on memory registers

However, the RISC strategy also brings some very important advantages. Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle “MULT” command.
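The four-line program can be run on a toy interpreter to see how the clock cycles add up. The interpreter, the operand values 6 and 7, and the rule that every instruction costs exactly one clock are assumptions made for this sketch.

```python
def run(program, memory):
    """Execute LOAD/PROD/STORE instructions, charging one clock cycle for each."""
    registers = {}
    cycles = 0
    for op, x, y in program:
        if op == "LOAD":        # LOAD reg, addr
            registers[x] = memory[y]
        elif op == "PROD":      # PROD reg1, reg2: reg1 = reg1 * reg2
            registers[x] = registers[x] * registers[y]
        elif op == "STORE":     # STORE addr, reg
            memory[x] = registers[y]
        else:
            raise ValueError(f"unknown instruction {op}")
        cycles += 1             # assumption: every instruction takes one clock
    return registers, cycles

memory = {(2, 3): 6, (5, 2): 7}   # made-up operand values
program = [
    ("LOAD", "A", (2, 3)),
    ("LOAD", "B", (5, 2)),
    ("PROD", "A", "B"),
    ("STORE", (2, 3), "A"),
]

registers, cycles = run(program, memory)
print(memory[(2, 3)], cycles)
# 42 4
```

Under these assumptions the program finishes in 4 clocks, which is the same total as a “MULT” that needs one clock for each of its four internal steps.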
These RISC “reduced instructions” require fewer transistors of hardware space than the complex instructions, leaving more room for general purpose registers. Because all of the instructions execute in a uniform amount of time (i.e. one clock), pipelining is possible. Separating the “LOAD” and “STORE” instructions actually reduces the amount of work that the computer must perform. After a CISC-style “MULT” command is executed, the processor automatically erases the registers. If one of the operands needs to be used for another computation, the processor must re-load the data from the memory bank into a register.
In RISC, the operand will remain in the register until another value is loaded in its place.

The Performance Equation
The following equation is commonly used for expressing a computer’s performance ability:

time/program = time/cycle × cycles/instruction × instructions/program

The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction. RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.

RISC Roadblocks
Despite the advantages of RISC based processing, RISC chips took over a decade to gain a foothold in the commercial world. This was largely due to a lack of software support.
Although Apple’s Power Macintosh line featured RISC-based chips and Windows NT was RISC compatible, Windows 3.1 and Windows 95 were designed with CISC processors in mind. Many companies were unwilling to take a chance with the emerging RISC technology. Without commercial interest, processor developers were unable to manufacture RISC chips in large enough volumes to make their price competitive. Another major setback was the presence of Intel. Although their CISC chips were becoming increasingly unwieldy and difficult to develop, Intel had the resources to plow through development and produce powerful processors.
Although RISC chips might surpass Intel’s efforts in specific areas, the differences were not great enough to persuade buyers to change technologies.

The Overall RISC Advantage
Today, the Intel x86 is arguably the only chip which retains CISC architecture. This is primarily due to advancements in other areas of computer technology. The price of RAM has decreased dramatically. In 1977, 1MB of DRAM cost about $5,000. By 1994, the same amount of memory cost only $6 (when adjusted for inflation). Compiler technology has also become more sophisticated, so that the RISC use of RAM and emphasis on software has become ideal.