F. He, K. Ganchev, S. Finney
CS23/E22 Lab Assignment 2
Friday, February 23rd, 2001

ABSTRACT

In this lab we investigated the memory hierarchy on a small number of
architectures. To analyze the differences between architectures, and to
test the strengths and weaknesses of a given architecture with regard to
memory, we wrote programs to stress the cache hierarchy and the
superscalar execution units.

DESCRIPTION

Our programs inspect various properties of the architecture on three
hardware platforms. Those selected were the UltraSPARC IIi
(*.cs.swarthmore.edu), the Mac PowerPC 750 (uriel.eclipsed.net), and the
i686 (celeborn.engin.swarthmore.edu). First we wrote a simple program to
determine the endianness of a given platform. Next, we inspected the
cache structure of the UltraSPARC IIi and the Mac PPC 750, and wrote
programs to test the characteristics of each machine's memory hierarchy.
We then examined a processor's capability for doing large amounts of
integer and floating-point arithmetic at various precisions. Finally, we
experimented with altering that code to demonstrate the effect of a
superscalar architecture.

TASK 1 -- ENDIANNESS [src: task1.endian.c]

endian.c determines whether an architecture is big- or little-endian.
Here are the results when endian was run on three different
architectures found on campus:

computer                        architecture    endianness
------------------------------------------------------------
rosemary.cs.swarthmore.edu      UltraSPARC II   big
celeborn.engin.swarthmore.edu   i686            little
uriel.eclipsed.net              PowerPC 750     big

TASK 2 -- CACHE STRUCTURE

We investigated the cache structure of two different computers on
campus, with the following results:

(uriel.eclipsed.net) Mac PowerPC 750
    64k L1 cache (split 32k d / 32k i)
    8-way set-associative
    32-byte lines

(allspice.cs.swarthmore.edu) UltraSPARC IIi
    32k L1 cache (split 16k d / 16k i)
    direct-mapped (d), 2-way set-associative (i)
    32-byte lines
    (note that allspice has two of these CPUs)

TASK 3 -- Cache Performance [src: task3.*.c]

We tested cache performance by running both cache-friendly and
cache-unfriendly code. The cases tested are:

-- sequential access of a cache-sized block.
-- access to the first byte of each cache line.
-- forced misses on every access.
-- forced misses on a direct-mapped cache but not on a 2-way
   set-associative cache.

The first table contains results obtained using "timeb" within the
program; the second and third tables use the time utility, because the
MacPPC 750 machine did not have the libraries needed to time the code.
The null row contains the times for a program that performs none of the
steps we are trying to measure. This is done in an attempt to eliminate
that source of error.
UltraSPARC IIi (using timeb)
bs1:    51.518  51.631  52.525  51.361
bs2:    51.009  51.200  50.889  50.827
stress: 51.453  51.667  51.391  51.508
2wn:    52.165  52.178  52.057  52.186
null:   45.467  45.500  45.420  45.432

MacPPC 750 (using time utility)
bs1     25.240  25.274  25.265  25.267
bs2     25.913  25.960  25.947  25.945
stress  25.064  25.069  25.038  25.076
2wn     88.563  88.603  88.606  88.612
null    21.233  21.261  21.261  21.276

Averages:

UltraSPARC IIi
bs1:    51.759
bs2:    50.981
stress: 51.505
2wn:    52.147
null:   45.455

MacPPC 750 (using time utility)
bs1     25.262
bs2     25.941
stress  25.062
2wn     88.596
null    21.258

Our program design followed the lab description fairly closely: a block
of memory is allocated, and, since we know the cache structures under
investigation, it was easy to load particular blocks of memory into the
cache and force misses or hits. For details and results, see the answer
to Question 1.

TASK 4 -- Different Forms Of Calculation [src: task4.*.c]

Calculations were performed using double-precision floating-point
numbers, single-precision floating-point numbers, 32-bit integers, and
8-bit integers, to expose performance differences.

(average of tests on allspice.cs.swarthmore.edu -- UltraSPARC II)
test program              time (using cmdline time)
---------------------------------------------------------
task4.32-bit              real 0.17  user 0.16  sys 0.01
task4.8-bit               real 0.18  user 0.17  sys 0.01
task4.single-precision    real 0.43  user 0.41  sys 0.01
task4.double-precision    real 0.67  user 0.66  sys 0.00

(average of tests on uriel.eclipsed.net -- PowerPC 750)
test program              time (using cmdline time)
---------------------------------------------------------
task4.32-bit              real 0.13  user 0.12  sys 0.03
task4.8-bit               real 0.13  user 0.07  sys 0.02
task4.double-precision    real 0.23  user 0.19  sys 0.02
task4.single-precision    real 0.28  user 0.17  sys 0.01

Note that the 32-bit program actually has a slightly faster execution
time than the 8-bit program.
That the 32-bit program edges out the 8-bit program is an interesting
peculiarity of the UltraSPARC II architecture.

TASK 5 -- Effects of Superscalar Architecture [src: task5.*.c]

The benefits of a superscalar architecture are most obvious in the data
from the examples using integer arithmetic. As the frequency of
dependencies in the code rose, so did the time required to complete the
program. Because so many operations were dependent on the results of
previous operations, instructions could not be fed to the pipeline in
parallel as quickly. In short, a deep pipeline offers little advantage
if it cannot be kept full.

(results from testing on rosemary.cs.swarthmore.edu -- UltraSPARC II)
test program              time (using cmdline time)
----------------------------------------------------------------------
baseline1.int             real  0.1   user 0.1  sys  0.0
baseline2.int             real  0.2   user 0.2  sys  0.0
dependency2.int           real  0.4   user 0.4  sys  0.0
dependency3.int           real  0.2   user 0.2  sys  0.0
baseline1.float           real 54.2   user 9.8  sys 29.3
baseline2.float           real 36.6   user 7.1  sys 19.4
dependency2.float         real  0.6   user 0.5  sys  0.0
dependency3.float         real 36.4   user 7.3  sys 19.2

(results from testing on uriel.eclipsed.net -- PowerPC 750)
test program              time (using cmdline time)
----------------------------------------------------------------------
baseline1.int             real 0.16  user 0.11  sys 0.03
baseline2.int             real 0.21  user 0.14  sys 0.03
dependency2.int           real 0.41  user 0.37  sys 0.01
dependency3.int           real 0.25  user 0.14  sys 0.01
baseline1.float           real 0.51  user 0.43  sys 0.01
baseline2.float           real 0.57  user 0.51  sys 0.01
dependency2.float         real 0.76  user 0.74  sys 0.03
dependency3.float         real 0.52  user 0.46  sys 0.02

Troubling, though, were the results from the floating-point-intensive
examples. Here, time seemed to decrease as more and more dependencies
arose. Particularly puzzling are the results of 'dependency2.float' on
the UltraSPARC II, which was expected to be the slowest program but was
by far the fastest.
Doubts arose over the correctness of the code, so we performed the same
test with the same code on a different platform, which yielded the
originally expected results. Seeing as all the programs are essentially
the same set of commands with nothing but varying assignments, we are
at a loss to explain why this is so. If we wanted to track down this
discrepancy, the next step would probably be to port the code to raw
assembly language. This would give us complete control over the
generated instructions and would eliminate any suspicion of compiler
optimization.

Q1: Describe how you implemented the cache stress programs and try to
quantify the effects of having a cache on nice code.

The cache stress programs are actually fairly simple. We force an
access to memory by causing an assignment operation to be performed. We
control for the time taken by the assignment and loop overhead using a
null program, which executes the same loops without the extra memory
accesses we are trying to measure. The base 1 and 2 tests ("base" has
been abbreviated to "bs" in the tables) are as described in the lab
instructions. The cache stress is achieved by accessing an entire block
of memory four times larger than the cache. A LIFO cache replacement
policy would defeat this approach, but no architecture we know of
implements one, probably because it is uncommon and it is difficult to
decide when to switch back to FIFO to flush the cache. (Some database
management systems do implement LIFO replacement for main memory,
though.) The 2-way nice test accesses the first half of each of two
cache-sized blocks of memory in order. To estimate the speedup, we look
at the average execution times (below). There does not appear to be a
huge difference in execution times; the 2-way nice result on the MacPPC
seems to be an anomaly.
UltraSPARC IIi
bs1    - null =  6.304
bs2    - null =  5.526
stress - null =  6.050
2wn    - null =  6.692

MacPPC 750 (using time utility)
bs1    - null =  4.004
bs2    - null =  4.683
stress - null =  3.804
2wn    - null = 67.338

Q2: Describe how you implemented the superscalar test programs and try
to quantify the effects of having a superscalar architecture.

To test the superscalar architecture we wrote a for loop that performs
assignments on 8 variables. bs1 has no dependencies; bs2 assigns to
each variable the variable 4 before it, so that no more than 3
operations can occur simultaneously. dep3 does something similar,
except with a shift of only 3, so that at most 2 operations can occur
simultaneously. In dep2, every operation depends on the previous one
(except the first, of course). Without dependencies, execution took a
third of the time it took with them. The closer together the
dependencies were, the greater the slowdown, down to the one-third
speed with every line dependent upon the previous one.

Q3: Discuss how what you learned in this lab might affect your
programming style in the future.

Using the knowledge acquired in this lab, perhaps in the future we will
take into consideration the impact of programming for different
architectures. For example, the way a machine lays out and caches
memory affects the speed of referencing multi-dimensional arrays, which
we imagine has great impact on graphics-intensive software, AI perhaps,
and also database management. Different machines also have different
performance capabilities when it comes to raw integer and
floating-point number crunching. The specifics of the cache
architecture, too, might play a role for anyone trying to refine their
use of memory and processor time.
This is already apparent in the refinement of compilers, which
automatically correct and optimize the code being compiled. If we did
not use a compiler with such capabilities, this would definitely be
something to take into account. In addition, branch-prediction
abilities (or the lack thereof) will also affect our style if extreme
efficiency or memory control is ever a serious issue in our coding or
compiling.

Q4: Describe any lab extensions or further comparisons you decided to
make to test computer performance.

As part of the lab extension, we compared processing time between
consistent, easily predictable conditional branches (consistently
choosing one branch of an if statement) and random ones (branching
determined by a random number generator). We observed marginal
differences in execution time when each branch consists of only a small
number of operations (up to 8 primitive operations on the UltraSPARC).
If, however, each branch performs more operations (over 8 on the
UltraSPARC), we observed a significant difference, and the execution
time approaches the theoretical ratio of 1:2, i.e. the processor
predicts wrongly half the time. We speculate that the larger branch
bodies exhaust the capacity of the processor's superscalar
architecture. We also performed most of the tests in this lab on
multiple architectures, most frequently the Mac PowerPC 750 and the
UltraSPARC IIi.

PROBLEMS -- And How We Solved Them

As is the case with any cooperative effort involving more than one
person, many of our problems arose because we all counted on the others
to be doing most of the work. Towards the deadline, however, Sean took
a heroic first step; soon the motivated group was working in overdrive,
and the lab was successfully completed. Other, more technical problems
were present too. The programming itself wasn't too difficult, but care
had to be taken to fool the compiler.
As one member of our lab group put it, "when you want the computer to
do something real useless, it's hard to get it right". Also, testing
for superscalar execution, cache behavior, pipelining, etc., requires
different program designs. For the most part we solved these problems
by Platonic-style dialectics amongst ourselves (via email, of course),
or plain old hard thinking. Another problem worth mentioning is how we
timed our programs. We first used the UNIX time utility, which counts
not only the loops in the programs (the time we want) but also other
overhead such as variable initialization (time we do not want). Hence
we switched to the C library function ftime, which gives timing
accurate to the millisecond.

TESTING PROGRAMS -- If Our Programs Worked or Not and How We Tested
Them

All of our programs worked, if not upon first compilation. This is
confirmed by our testing, which produced results that follow
theoretical predictions within experimental error. An example is the
cache-stressing program: we predicted that code that forces a cache
miss on every access would run measurably differently from cache-nice
code on the same machine, and this indeed turned out to be the case.
More challenging problems were encountered in the lab extensions. One
extension that we failed to complete was testing the L2 cache on the
UltraSPARC: either because of faulty program design, or because the
difference is immaterial, we could not detect a significant difference
between L2-hit code and L2-miss code.

WHAT WE'VE LEARNED

We learned much about various cache architectures and their strengths
and weaknesses. Writing code to reveal those strengths and weaknesses
ensured that we understood the underlying cache designs. The four goals
of this lab were achieved. We now have first-hand knowledge of several
computer architectures (at least at the cache level), and can
comfortably converse in the language of computer system architecture,
confident that we know what we are talking about.
We saw the effects of modern computer architecture concepts most
clearly by running concrete tests on various machines and getting
quantifiable results on computer performance in the form of running
time.