1. Due Date
Due by 11:59 pm, Wednesday, October 9, 2024
This lab should be done with your Lab 4 partner, listed here: Lab 4 partners
Our guidelines for working with partners: working with partners, etiquette and expectations
2. Lab Overview and Goals
In this lab, you will implement a C program that performs basic statistics on climate change data. To do so, your program will use C pointers and arrays to dynamically allocate space for the set of variable length input files.
2.1. Lab Goals
-
Gain experience using pointers and dynamic memory allocation (malloc) in C.
-
Practice using gdb and valgrind to debug programs.
3. Starting Point Code
3.1. Getting Your Lab Repo
Both you and your partner should clone your Lab 4 repo into your cs31/Labs subdirectory:
-
Get your Lab 4 ssh-URL from the CS31 git org. The repository to clone is named Lab4-userID1-userID2, where the two user IDs are yours and your Lab 4 partner's.
-
cd into your cs31/Labs subdirectory:
$ cd ~/cs31/Labs
$ pwd
-
Clone your repo:
$ git clone [the ssh url to your repo]
$ cd Lab4-userID1-userID2
There are more detailed instructions about getting your lab repo in the "Getting Lab Starting Point Code" section of the Using Git for CS31 Labs page. To make changes, follow the directions in the "Sharing Code with your Lab Partner" section of the Using Git for CS31 Labs page.
3.2. Starting Point files
$ ls
Makefile large.txt readfile.h stats.c
README.md larger.txt readfile.c small.txt
-
Makefile: A Makefile simplifies the process of compiling your program. You do not need to edit this file. We’ll look at Makefiles in more detail later in the course. If you are interested, take a look at Section 7 for more information about make and makefiles.
-
readfile.h and readfile.c: a library for file I/O. This is the same library used in Lab 2. Your program will make calls to functions in this library to read values from an input file. The instructions for using this library are explained below. Do not modify any code in these two files.
-
stats.c: starting point code for the C stats program. It contains some code to get the file name from the command line argument and the start of the get_values function that you will complete. Your solution should be implemented in this file.
-
Climate change input files: These files contain surface temperature readings recording rising temperatures at the granularity of each country, measured from 1961 to 2022. You can find more information about this dataset here.
-
small.txt: Surface temperature change readings from the Maldives.
-
large.txt: Surface temperature change readings from Finland.
-
largest.txt: Surface temperature data from the five largest emitters: China, United States, Russia, Japan, and India.
3.3. Reading from a File
Below are some code snippets for file I/O in C, using C’s stream interface for file I/O (fopen, fclose, fscanf, fprintf, etc.) that is part of the C standard I/O library. You can find more extensive documentation about file I/O in C here: File I/O in C.
To use file I/O, you need to include the stdio.h header file:
#include <stdio.h>
Now declare a pointer that you will use to access a file; its type is FILE *. A file pointer is not a pointer like pointers to other C types: dereferencing it doesn’t make any sense.
FILE * fileptr; // declare a file pointer
To open a file, use the fopen() function. The first argument is the filename (a char array/string) and the second is the mode you want to open the file in. Here we’re only reading from the file, so use "r". Check the man page for fopen() to see what it returns on success and failure. Make sure that you were able to successfully open the file and store the address in fileptr.
// Note: filename is a character array (string) storing the file's name;
// exit() is declared in stdlib.h
fileptr = fopen(filename, "r"); // open the file in read mode
if (fileptr == NULL) { // check if there was an error opening the file
    printf("Error: failed to open file: %s\n", filename);
    exit(1);
}
To read in some values, use fscanf(). It’s like scanf() but for files. It takes in three arguments: (1) a FILE * that points to the file that you opened and will read from, (2) a format string specifier (e.g., "%d" for an integer or "%f" for a float), and (3) a pointer (or address) to the location where you want to store the value you read in. To read in more than one value, you will need to call fscanf() multiple times in a loop.
Look at the man page for fscanf() for more information. It returns the number of items read in, EOF (a macro in C) if you reach the end of the file, or 0 if the input does not match the format string, which we treat as an error.
float read_in;
int ret; // holds the return value of fscanf
ret = fscanf(fileptr, "%f", &read_in); // read in a float and store it in read_in
if (ret == 0) { // check if there was an error
    printf("Improper file format.\n");
    exit(1);
}
if (ret == EOF) { // check if end of file is reached
    printf("End of file reached.\n");
}
When you’re done reading from the file, you should close it using the fclose() function. It takes in a FILE * as an argument and returns 0 on success and a non-zero value on error.
ret = fclose(fileptr); // close the file
if (ret != 0) { // check if there was an error closing the file
    printf("Error: failed to close file: %s\n", filename);
    exit(1);
}
Now, modify your code in stats.c to be able to read in the 19 values from small.txt. How would you change your code to be able to read in the 62 values from large.txt? How about reading in values dynamically without knowing the file size ahead of time?
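As a starting point for thinking about that last question, here is a minimal sketch of calling fscanf() in a loop until the file runs out. The helper name count_values is hypothetical (it is not part of the starting point code), and it only counts the values instead of storing them:

```c
#include <stdio.h>

/* Hypothetical helper (not in the starting point code): counts how many
 * float values a file contains by calling fscanf in a loop.
 * Returns the count, or -1 if the file cannot be opened or a token
 * in the file is not a valid float. */
int count_values(const char *filename) {
    FILE *fileptr = fopen(filename, "r");
    if (fileptr == NULL) {
        return -1;
    }

    int count = 0;
    float value;
    int ret;
    /* fscanf returns 1 each time it successfully reads one float */
    while ((ret = fscanf(fileptr, "%f", &value)) == 1) {
        count++;
    }

    fclose(fileptr);
    /* ret is EOF after a clean end of file, 0 after a malformed token */
    return (ret == EOF) ? count : -1;
}
```

The loop stops as soon as fscanf() returns anything other than 1, so the same code handles a file of any length. Storing the values, not just counting them, is where dynamic allocation comes in.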
4. Dynamic Memory Allocation
4.1. Computing & Climate Change
Climate Change and Big Data
Computers and smart devices contribute significantly to greenhouse gas emissions. You can read more about the staggering impact of computation on climate change here. A significant research effort in computer science is Green Computing, where researchers try to measure and lower our carbon footprint, advance the use of renewable sources of energy, and re-think the design and lifecycle of computer farms, personal computers and smart devices. In this lab, we will read in climate change data, spanning 1961 to 2022, and perform basic statistics on this dataset. In the era of "Big Data", the code we write must run on data sets ranging from 10 data values to 10 million data values! This lab is an example of using dynamic memory allocation, which allows your program to accommodate such vastly varying input data files without recompilation. While we are exploring writing basic statistics in the context of climate change data, thinking about memory, code efficiency, and speed of execution is an increasingly necessary skill when working with varied and large data sets.
In this lab, we will learn how to read in variable length input files using dynamic memory allocation with C arrays. We will also run basic statistics on real-world data, using top-down design and modular functions.
4.2. Compiling and Running
You will implement your code in stats.c.
-
Your program takes a single command line argument, the input text file (floats, one per line), and computes and prints out a set of statistics about the data:
$ ./stats small.txt
Results:
-------
num values: 19
min: -0.318
max: 1.555
mean: 0.529
median: 0.573
std dev: 0.556
unused array capacity: 1
Run make to compile the stats program. Make compiles your solution in the stats.c file, compiles and links in the readfile.o library, and links in the C math library (-lm). The math library has a sqrt function, which you will need for the standard deviation calculations.
$ make
Then run with some input file as a command line argument, for example:
$ ./stats small.txt
$ ./stats large.txt
4.3. Sample Output
The following shows an example of what a run of a complete stats
program looks like for a particular input file specified at the
command line:
$ ./stats small.txt
Results:
num values: 19
min: -0.318
max: 1.555
mean: 0.529
median: 0.573
std dev: 0.556
unused array capacity: 1
$ ./stats large.txt
Results:
num values: 62
min: -1.801
max: 3.317
mean: 0.794
median: 1.121
std dev: 1.158
unused array capacity: 18
Remember to test your implementation on multiple input files, and create your own to test certain cases.
4.4. Program Control Flow
When run, your program should do the following:
-
Make a call to get_values, passing in:
-
the name of the input file containing the data values
-
a pointer holding the address of an int variable to store the size of the array (number of values read in)
-
a second pointer holding the address of an int variable to store the total array capacity. See the Requirements section for details on how to allocate the array that this function returns.
-
-
The get_values function will return:
-
an array of float values that stores the values read in from the file, or,
-
NULL on error (e.g., malloc() fails or the file cannot be opened).
At this step you could add a debugging call to a printArray function to print out the values you read in to check that they are correct (remove this output from your final submission if you do).
-
-
Sort the array of values (see the Tips section for hints on re-using your sorting function from Lab 2).
-
Compute the min, max, mean (average), median, and standard deviation of the set of values (see notes below about median).
-
Print out the statistical results, plus information about the number of values in the data set and the amount of unused capacity in the array storing the values.
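The first step relies on a function "returning" extra results through pointer parameters. Here is a minimal sketch of that pattern, using a hypothetical stand-in function make_demo_array (not part of the starting point code) that fills in three made-up values instead of reading a file:

```c
#include <stdlib.h>

/* Hypothetical stand-in for get_values: hands back a heap-allocated
 * array and reports its size and capacity through pointer parameters.
 * Returns NULL if malloc fails. */
float *make_demo_array(int *size, int *capacity) {
    float *values = malloc(20 * sizeof(float));
    if (values == NULL) {
        return NULL;        /* the caller must check for NULL */
    }
    values[0] = 1.5f;       /* made-up data for illustration only */
    values[1] = -0.25f;
    values[2] = 3.0f;
    *size = 3;              /* written into the caller's int variable */
    *capacity = 20;         /* written into the caller's int variable */
    return values;
}
```

A caller would declare two int variables, pass their addresses (for example, &size and &capacity), check the returned pointer against NULL, and call free on the array once it is no longer needed.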
4.5. Statistics To Compute
The statistics your program should compute on the set of values are the following:
-
num values: total number of values in the data set.
-
min: the smallest value in the data set.
-
max: the largest value in the data set.
-
mean: the average of the set of values. For example, if the set is: 5, 6, 4, 2, 7, the mean is 4.8 (24.0/5).
-
median: the middle value in the set of values. For example, if the set of values read in is: 5, 6, 4, 2, 7, the median value is 5 (2 and 4 are smaller and 6 and 7 are larger). If the total number of values is even, just use the (total/2) value as the median. For example, for the set of 4 values (2, 5, 6, 7), the median value is 6 because 4/2 is 2 and the value in position 2 of the sorted ordering of these values is 6 (2 is in position 0, 5 in position 1, 6 in position 2, 7 in position 3).
-
std dev: the standard deviation, given by the following formula:
\[s=\sqrt{\frac{1}{N-1} \sum_{i=1}^N(x_i - \overline{x})^2}\]
where \(N\) is the number of data values, \(x_i\) is the \(i\)th data value, and \(\overline{x}\) is the mean of the values.
-
unused array capacity: the number of elements in your array's total capacity that are not used to store this set of values.
For increased precision, use the C double type to store and compute the mean and the square root. The C math library has a function to compute square root: double sqrt(double val). You can pass sqrt a float value and C will automatically convert the float value to a double.
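The definitions above can be sketched in C roughly as follows. The function names are hypothetical, and the last function deliberately stops at the sample variance (the quantity under the square root in the formula) so that the sketch does not depend on the math library; passing its result to sqrt would give the standard deviation:

```c
/* Mean of size values, accumulated in a double for precision.
 * Assumes size >= 1. */
double compute_mean(const float *values, int size) {
    double sum = 0.0;
    for (int i = 0; i < size; i++) {
        sum += values[i];
    }
    return sum / size;
}

/* Median under this lab's rule: the element at index size/2 of an
 * array that has ALREADY been sorted. Assumes size >= 1. */
float compute_median(const float *sorted, int size) {
    return sorted[size / 2];
}

/* Sample variance: sum of squared deviations from the mean, divided
 * by N-1. Assumes size >= 2. */
double compute_variance(const float *values, int size) {
    double mean = compute_mean(values, size);
    double sum_sq = 0.0;
    for (int i = 0; i < size; i++) {
        double diff = values[i] - mean;
        sum_sq += diff * diff;
    }
    return sum_sq / (size - 1);
}
```

For the set 5, 6, 4, 2, 7, the mean is 4.8, the squared deviations sum to 14.8, the sample variance is 14.8/4 = 3.7, and the standard deviation is the square root of 3.7, approximately 1.924.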
4.6. Lab Requirements
For full credit, your solution should meet the following requirements:
-
Design:
-
Implement your solution in stats.c, which includes some starting point code.
-
Your code should be commented, modular, robust, and use meaningful variable and function names. This includes having a top-level comment describing your program, and every function should include a brief description of its behavior. Look at the C code style guide for examples of complete comments, tips on how not to wrap lines, good indenting styles and suggestions for meaningful variable and function names in your program. You may not use any global variables for this assignment.
-
It should be evident that you applied top-down design when constructing your submission (e.g., there are multiple functions, each with a specific, documented role). You should have at least 4 function-worthy functions.
-
Remove the "TODO" comment (if you want to, you can keep the comment itself without the TODO in front if it helps you explain or lay out your code).
-
-
Functionality:
-
The get_values function: takes the name of a file containing input values, reads in values from the file into a dynamically allocated array, and returns the dynamically allocated array to the caller. The array’s size and capacity are "returned" to the caller through pass-by-pointer parameters:
float *get_values(int *size, int *capacity, char *filename);
-
The array of values must be dynamically allocated on the heap by calling malloc. You must start out allocating an array of 20 float values. As you read in values into the current array, if you run out of capacity:
-
Call malloc to allocate space for a new array that is twice the size of the current full one.
-
Copy values from the old full array to the new array (and make the new array the current one).
-
Free the space allocated for the old array by calling free.
-
There are other ways to do this type of alloc and re-alloc in C. However, this is the way we want you to do it for this assignment: make sure you start out with a dynamically allocated array of 20 floats, then each time it fills up, allocate a new array of twice the current size, copy values from the old to new, and free the old.
-
When all of the data values have been read in from the file, the function should return the filled, dynamically allocated array to the caller (the function’s return type is float *). The array’s size and total capacity are "returned" to the caller through the pass-by-pointer parameters size and capacity.
-
After reading in the values from the file, your program should sort the array (and note that your program should be able to easily compute min, max and median on a sorted array of values). See the Tips section about how to re-use your sort function from Lab 2.
-
-
Output and Testing:
-
When run, your program’s output should look like the output shown below from a run of a working program. To make grading easier, your output must match the example as closely as possible.
-
You should not assume that we will test your code with the sample input files that have been provided.
-
For full credit, your program must be free of valgrind errors. Do not forget to close the file you opened! If you don’t, valgrind will likely report a memory leak.
$ ./stats large.txt
Results:
num values: 62
min: -1.801
max: 3.317
mean: 0.794
median: 1.121
std dev: 1.158
unused array capacity: 18
Note: just like you can use \n in the printf format string to insert a new line, you can use \t to insert a tab character for pretty formatting. Also, you can specify formatting of each placeholder in the printf string. For example, use %10.3f to specify printing a float/double value in a field with a width of 10 and 3 places after the decimal point. Section 2.8 of the textbook has more information about formatting printf placeholders.
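To see what \t and %10.3f do together, here is a small sketch. The helper format_stat is hypothetical, and it uses snprintf (which formats into a character buffer) only so that the result is easy to inspect; the same format string works with printf. The exact spacing your output needs is determined by the sample output, not by this sketch:

```c
#include <stdio.h>

/* Hypothetical helper: writes "label<TAB>value" into buf, with the
 * value right-aligned in a field of width 10 and 3 digits after the
 * decimal point. Returns what snprintf returns (the formatted length). */
int format_stat(char *buf, size_t len, const char *label, double value) {
    return snprintf(buf, len, "%s\t%10.3f", label, value);
}
```

With the label "mean:" and the value 0.529, the buffer holds "mean:", a tab, five spaces of padding, and then 0.529, because 0.529 takes up five of the ten characters in the field.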
4.7. Tips and Hints
-
Before starting to write code, use top-down design to break your program into manageable functionality.
-
Write get_values without the re-allocation and copying steps first (data with <20 values).
-
Then, go back and add in code for larger input files, requiring: (a) malloc-ing more space, (b) copying old values to the new larger space and (c) freeing up the old space (this step is really important!)
-
-
Sort function: You can use your sort function from Lab 2 in this program! First, copy your sort function from Lab 2. Then, change the array parameter to a pointer to float. You can now use the same sort function for your dynamically allocated array!
// change the prototype and function definition of your sorting function
// so that its first parameter is float *values
void sort(float *values, int size);
-
For debugging, you can copy over your printArray function from your Lab 2 solution. Note: you likely don’t want to do this with large files. Be sure to remove or comment out calls to printArray in your submitted program if you do this.
-
See the Lab 2 lab assignment for documentation about using the readfile library (and also look at the readfile.h comments for how to use its functions).
-
Take a look at the textbook, weekly lab code and in-class exercises to remind yourself about malloc, free, pointer variables, dereferencing pointer variables, dynamically allocated arrays, and passing pointers to functions.
-
Use Ctrl-C to kill a running program stuck in an infinite loop.
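The malloc, copy, and free steps from the second tip can be sketched as one small function. This is one possible shape under stated assumptions, not the required design: the name grow_array and its parameters are hypothetical, and how you fold this step into get_values is up to you.

```c
#include <stdlib.h>

/* Hypothetical helper: replaces a full array with one twice as large.
 * old_values holds size values and has room for *capacity floats.
 * On success, frees old_values, doubles *capacity, and returns the
 * new array. On malloc failure, returns NULL and leaves old_values
 * and *capacity untouched so the caller can clean up. */
float *grow_array(float *old_values, int size, int *capacity) {
    int new_capacity = 2 * (*capacity);
    float *new_values = malloc(new_capacity * sizeof(float));
    if (new_values == NULL) {
        return NULL;
    }
    for (int i = 0; i < size; i++) {   /* (b) copy old values over */
        new_values[i] = old_values[i];
    }
    free(old_values);                  /* (c) release the old space */
    *capacity = new_capacity;
    return new_values;
}
```

Starting from a capacity of 20, one call gives 40 and a second gives 80, which matches the unused capacity of 18 reported for the 62 values in large.txt.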
5. Survey
Once you have submitted the final version of your entire lab, you should submit the required Lab 4 Questionnaire (each lab partner must do this). Note that the survey should be turned in after your lab is turned in, therefore the deadline for the survey is deliberately left vague. You should submit it, but if it’s a bit later than the deadline for the actual lab (even by a day or two), that’s completely fine.
6. Submitting your Lab
Please remove any debugging output prior to submitting.
To submit your code, commit your changes locally using git add and git commit. Then run git push while in your lab directory. Only one partner needs to run the final git push, but make sure both partners have pulled and merged each other's changes.
Also, it is good practice to run make clean before doing a git add and commit: you do not want to add to the repo any files that are built by gcc (e.g. executable files). Included in your lab git repo is a .gitignore file telling git to ignore these files, so you likely won’t add these types of files by accident. However, if you have other gcc generated binaries in your repo, please be careful about this.
Here are the commands to submit your solution in the stats.c file (from your or your partner’s ~/cs31/Labs/Lab4-userID1-userID2 subdirectory):
$ make clean
$ git add stats.c
$ git commit -m "correct and well commented Lab4 solution"
$ git push
Verify that the results appear (e.g., by viewing the repository on CS31-f24). You will receive deductions for submitting code that does not run or repos with merge conflicts. Also note that the time stamp of your final submission is used to verify you submitted by the due date, or by the number of late days that you used on this lab, so please do not update your repo after you submit your final version for grading.
If you have difficulty pushing your changes, see the "Troubleshooting" and "can’t push" sections at the end of the Using Git for CS31 Labs page. And for more information and help with using git, see the git help page.
7. Handy References
General Lab Resources
-
Class EdSTEM page for questions and answers about lab assignments