F. He, K. Ganchev, S. Finney
CS23/E22 Lab Assignment 2
Friday, February 23rd, 2001

ABSTRACT

In this lab we investigated the memory hierarchy on a small number of
architectures. To analyze the differences between architectures, and to
test the strengths and weaknesses of a given architecture with regard to
memory, we wrote programs to stress the cache hierarchy and the
superscalar execution units.

DESCRIPTION

Our programs inspect various properties of the architecture on three
hardware platforms. Those selected were the UltraSPARC IIi
(*.cs.swarthmore.edu), the Mac PowerPC 750 (uriel.eclipsed.net), and the
i686 (celeborn.engin.swarthmore.edu). First we wrote a simple program to
determine the endianness of a given platform. Next, we inspected the
cache structure of the UltraSPARC IIi and the Mac PPC 750, and wrote
programs to test the characteristics of each machine's memory hierarchy.
We then examined a processor's capability for doing large amounts of
integer and floating-point arithmetic at various precisions. Finally, we
experimented with altering that code to demonstrate the effect of a
superscalar architecture.

TASK 1 -- ENDIANNESS [src: task1.endian.c]

endian.c determines whether an architecture is big- or little-endian.
Here are the results when endian was run on three different
architectures found on campus:

computer                        architecture    endianness
------------------------------------------------------------
rosemary.cs.swarthmore.edu      UltraSPARC II   big
celeborn.engin.swarthmore.edu   i686            little
uriel.eclipsed.net              PowerPC 750     big

TASK 2 -- CACHE STRUCTURE

We investigated the cache structure of two different computers on
campus, with the following results:

(uriel.eclipsed.net) Mac PowerPC 750
    64k L1 cache (split 32k d / 32k i)
    8-way set-associative
    32-byte lines

(allspice.cs.swarthmore.edu) UltraSPARC IIi
    32k L1 cache (split 16k d / 16k i)
    direct-mapped (d), 2-way set-associative (i)
    32-byte lines
    (note that allspice has two of these CPUs)

TASK 3 -- Cache Performance [src: task3.*.c]

We tested cache performance by running both cache-friendly and
cache-unfriendly code. The cases tested are:

-- sequential access of a cache-sized block.
-- access to the first byte of each cache line.
-- forced misses on every access.
-- forced misses on a direct-mapped cache but not on a 2-way
   set-associative cache.

The first table contains results obtained using "timeb" within the
program; the second and third tables use the time utility, because the
MacPPC 750 machine did not have the libraries needed to time the code.
The null row contains the times for a program that performs none of the
steps we are trying to measure. This is done in an attempt to eliminate
that source of error.
UltraSPARC IIi (using timeb)
bs1:    51.518  51.631  52.525  51.361
bs2:    51.009  51.200  50.889  50.827
stress: 51.453  51.667  51.391  51.508
2wn:    52.165  52.178  52.057  52.186
null:   45.467  45.500  45.420  45.432

MacPPC 750 (using time utility)
bs1     25.240  25.274  25.265  25.267
bs2     25.913  25.960  25.947  25.945
stress  25.064  25.069  25.038  25.076
2wn     88.563  88.603  88.606  88.612
null    21.233  21.261  21.261  21.276

Averages:

UltraSPARC IIi
bs1:    51.759
bs2:    50.981
stress: 51.505
2wn:    52.147
null:   45.455

MacPPC 750 (using time utility)
bs1     25.262
bs2     25.941
stress  25.062
2wn     88.596
null    21.258

Our program design followed the lab description fairly closely: a block
of memory is allocated, and, since we know the cache structures under
investigation, it was easy to load particular blocks of memory into the
cache and force misses or hits. For details and results, see the answer
to Question 1.

TASK 4 -- Different Forms Of Calculation [src: task4.*.c]

Calculations were performed using double-precision floating-point
numbers, single-precision floating-point numbers, 32-bit integers, and
8-bit integers, to expose performance differences.

(average of tests on allspice.cs.swarthmore.edu -- UltraSPARC II)
test program              time (using cmdline time)
---------------------------------------------------------
task4.32-bit              real 0.17  user 0.16  sys 0.01
task4.8-bit               real 0.18  user 0.17  sys 0.01
task4.single-precision    real 0.43  user 0.41  sys 0.01
task4.double-precision    real 0.67  user 0.66  sys 0.00

(average of tests on uriel.eclipsed.net -- PowerPC 750)
test program              time (using cmdline time)
---------------------------------------------------------
task4.32-bit              real 0.13  user 0.12  sys 0.03
task4.8-bit               real 0.13  user 0.07  sys 0.02
task4.double-precision    real 0.23  user 0.19  sys 0.02
task4.single-precision    real 0.28  user 0.17  sys 0.01

Note that the 32-bit program actually has a slightly faster execution
time than the 8-bit program.
That the 32-bit program edges out the 8-bit program is an interesting
peculiarity of the UltraSPARC II architecture.

TASK 5 -- Effects of Superscalar Architecture [src: task5.*.c]

The benefits of a superscalar architecture are most obvious in the data
from the examples using integer arithmetic. As the frequency of
dependencies in the code rose, so did the time required to complete the
program. Because so many operations were dependent on the results of
previous operations, instructions could not be fed to the pipeline in
parallel as quickly. In short, a deep pipeline offers little advantage
if it cannot be kept full.

(results from testing on rosemary.cs.swarthmore.edu -- UltraSPARC II)
test program              time (using cmdline time)
----------------------------------------------------------------------
baseline1.int             real  0.1   user 0.1  sys  0.0
baseline2.int             real  0.2   user 0.2  sys  0.0
dependency2.int           real  0.4   user 0.4  sys  0.0
dependency3.int           real  0.2   user 0.2  sys  0.0
baseline1.float           real 54.2   user 9.8  sys 29.3
baseline2.float           real 36.6   user 7.1  sys 19.4
dependency2.float         real  0.6   user 0.5  sys  0.0
dependency3.float         real 36.4   user 7.3  sys 19.2

(results from testing on uriel.eclipsed.net -- PowerPC 750)
test program              time (using cmdline time)
----------------------------------------------------------------------
baseline1.int             real 0.16  user 0.11  sys 0.03
baseline2.int             real 0.21  user 0.14  sys 0.03
dependency2.int           real 0.41  user 0.37  sys 0.01
dependency3.int           real 0.25  user 0.14  sys 0.01
baseline1.float           real 0.51  user 0.43  sys 0.01
baseline2.float           real 0.57  user 0.51  sys 0.01
dependency2.float         real 0.76  user 0.74  sys 0.03
dependency3.float         real 0.52  user 0.46  sys 0.02

Troubling, though, were the results from the floating-point-intensive
examples. Here, time seemed to decrease as more and more dependencies
arose. Particularly puzzling are the results of 'dependency2.float' on
the UltraSPARC II, which was expected to be the slowest program but was
by far the fastest.
Doubts arose over the correctness of the code, so we performed the same
test with the same code on a different platform, which yielded the
originally expected results. Seeing as all the programs are essentially
the same set of commands with nothing but varying assignments, we are
at a loss to explain why this is so. If we wanted to track down this
discrepancy, the next step would probably be to port the code to raw
assembly language. This would give us complete control over the
generated instructions and would eliminate any suspicion of compiler
optimization.

Q1: Describe how you implemented the cache stress programs and try to
quantify the effects of having a cache on nice code.

The cache stress programs are actually fairly simple. We force an
access to memory by causing an assignment operation to be performed. We
control for the time taken by the assignment and loop overhead using a
null program, which executes the same loops without the extra memory
accesses we are trying to measure. The base 1 and 2 tests ("base" has
been abbreviated to "bs" in the tables) are as described in the lab
instructions. The cache stress is achieved by accessing an entire block
of memory four times larger than the cache. A LIFO cache replacement
policy would defeat this approach, but no architecture we know of
implements one, probably because it is uncommon and it is difficult to
decide when to switch back to FIFO to flush the cache. (Some database
management systems do implement LIFO replacement for main memory,
though.) The 2-way nice test accesses the first half of each of two
cache-sized blocks of memory in order. To estimate the speedup, we look
at the average execution times (below). There does not appear to be a
huge difference in execution times; the 2-way nice result on the MacPPC
seems to be an anomaly.
UltraSPARC IIi
bs1    - null =  6.304
bs2    - null =  5.526
stress - null =  6.050
2wn    - null =  6.692

MacPPC 750 (using time utility)
bs1    - null =  4.004
bs2    - null =  4.683
stress - null =  3.804
2wn    - null = 67.338

Q2: Describe how you implemented the superscalar test programs and try
to quantify the effects of having a superscalar architecture.

To test the superscalar architecture we wrote a for loop that performs
assignments on 8 variables. bs1 has no dependencies; bs2 assigns to
each variable the variable 4 before it, so that no more than 3
operations can occur simultaneously. dep3 does something similar,
except with a shift of only 3, so that at most 2 operations can occur
simultaneously. In dep2, every operation depends on the previous one
(except the first, of course). Without dependencies, execution took a
third of the time it took with them. The closer together the
dependencies were, the greater the slowdown, down to the one-third
speed with every line dependent upon the previous one.

Q3: Discuss how what you learned in this lab might affect your
programming style in the future.

Using the knowledge acquired in this lab, perhaps in the future we will
take into consideration the impact of programming for different
architectures. For example, the way a machine lays out and caches
memory affects the speed of referencing multi-dimensional arrays, which
we imagine has great impact on graphics-intensive software, AI perhaps,
and also database management. Different machines also have different
performance capabilities when it comes to raw integer and
floating-point number crunching. The specifics of the cache
architecture, too, might play a role for anyone trying to refine their
use of memory and processor time.
This is already apparent in the refinement of compilers, which
automatically correct and optimize the code being compiled. If we did
not use a compiler with such capabilities, this would definitely be
something to take into account. In addition, branch-prediction
abilities (or the lack thereof) will also affect our style if extreme
efficiency or memory control is ever a serious issue in our coding or
compiling.

Q4: Describe any lab extensions or further comparisons you decided to
make to test computer performance.

As part of the lab extension, we compared processing time between
consistent, easily predictable conditional branches (consistently
choosing one branch of an if statement) and random ones (branching
determined by a random number generator). We observed marginal
differences in execution time when each branch consists of only a small
number of operations (up to 8 primitive operations on the UltraSPARC).
If, however, each branch performs more operations (over 8 on the
UltraSPARC), we observed a significant difference, and the execution
time approaches the theoretical ratio of 1:2, i.e. the processor
predicts wrongly half the time. We speculate that the larger branch
bodies exhaust the capacity of the processor's superscalar
architecture. We also performed most of the tests in this lab on
multiple architectures, most frequently the Mac PowerPC 750 and the
UltraSPARC IIi.

PROBLEMS -- And How We Solved Them

As is the case with any cooperative effort involving more than one
person, many of our problems arose because we all counted on the others
to be doing most of the work. Towards the deadline, however, Sean took
a heroic first step; soon the motivated group was working in overdrive,
and the lab was successfully completed. Other, more technical problems
were present too. The programming itself wasn't too difficult, but care
had to be taken to fool the compiler.
As one member of our lab group put it, "when you want the computer to
do something real useless, it's hard to get it right". Also, testing
for superscalar execution, cache behavior, pipelining, etc., requires
different program designs. For the most part we solved these problems
by Platonic-style dialectics amongst ourselves (via email, of course),
or plain old hard thinking. Another problem worth mentioning is how we
timed our programs. We first used the UNIX time utility, which counts
not only the loops in the programs (the time we want) but also other
overhead such as variable initialization (time we do not want). Hence
we switched to the C library function ftime, which gives timing
accurate to the millisecond.

TESTING PROGRAMS -- If Our Programs Worked or Not and How We Tested
Them

All of our programs worked, if not upon first compilation. This is
confirmed by our testing, which produced results that follow
theoretical predictions within experimental error. An example is the
cache-stressing program: we predicted that code that forces a cache
miss on every access would run measurably differently from cache-nice
code on the same machine, and this indeed turned out to be the case.
More challenging problems were encountered in the lab extensions. One
extension that we failed to complete was testing the L2 cache on the
UltraSPARC: either because of faulty program design, or because the
difference is immaterial, we could not detect a significant difference
between L2-hit code and L2-miss code.

WHAT WE'VE LEARNED

We learned much about various cache architectures and their strengths
and weaknesses. Writing code to reveal those strengths and weaknesses
ensured that we understood the underlying cache designs. The four goals
of this lab were achieved. We now have first-hand knowledge of several
computer architectures (at least at the cache level), and can
comfortably converse in the language of computer system architecture,
confident that we know what we are talking about.
We saw the effects of modern computer architecture concepts most
clearly by running concrete tests on various machines and getting
quantifiable results on computer performance in the form of running
time.