System calls are how userspace programs interact with the kernel. The general principle behind how they work is described below.
Every system call has a system call number which is known by both userspace and the kernel. For example, both know that system call number 10 is open(), system call number 11 is read(), etc.
The Application Binary Interface (ABI) is very similar to an API, but rather than describing an interface between pieces of software it describes the interface at the hardware level. The ABI will define which register the system call number should be put in so the kernel can find it when it is asked to do the system call.
System calls are no good without arguments; for example, open() needs to tell the kernel exactly what file to open. Once again the ABI will define which registers arguments should be put into for the system call.
To actually perform the system call, there needs to be some way to communicate to the kernel that we wish to make a system call. All architectures define an instruction, usually called break or something similar, that signals to the hardware that we wish to make a system call.
Specifically, this instruction will tell the hardware to modify the instruction pointer to point to the kernel's system call handler (when the operating system sets itself up, it tells the hardware where its system call handler lives). So once userspace executes the break instruction, it has lost control of the program and passed it over to the kernel.
The rest of the operation is fairly straightforward. The kernel looks in the predefined register for the system call number, and looks it up in a table to see which function it should call. This function is called, does what it needs to do, and places its return value into another register defined by the ABI as the return register.
The final step is for the kernel to jump back to the userspace program, so it can continue from where it left off. The userspace program gets the data it needs from the return register, and continues happily on its way!
Although the details of the process can get quite hairy, this is basically all there is to a system call.
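The whole round trip can be modelled in ordinary C. What follows is only a toy sketch: the "registers" are a struct, the break instruction is a plain function call, and every name in it (toy_regs, toy_break, toy_syscall and the TOY_NR_ numbers) is made up for illustration rather than taken from any real kernel.

```c
#include <stddef.h>

/* Hypothetical system call numbers, echoing the example numbers above. */
enum { TOY_NR_OPEN = 10, TOY_NR_READ = 11, TOY_NR_MAX = 12 };

/* A pretend register file; the toy "ABI" fixes what each slot means. */
struct toy_regs {
    long nr;       /* system call number */
    long args[3];  /* system call arguments */
    long ret;      /* return register */
};

typedef long (*toy_handler)(const long *args);

/* Pretend kernel functions: open hands back a descriptor, read a byte count. */
static long toy_open(const long *args) { (void) args; return 3; }
static long toy_read(const long *args) { return args[2]; }

/* The "kernel's" system call table, indexed by system call number. */
static const toy_handler toy_table[TOY_NR_MAX] = {
    [TOY_NR_OPEN] = toy_open,
    [TOY_NR_READ] = toy_read,
};

/* Stands in for the break instruction: control passes to the "kernel". */
static void toy_break(struct toy_regs *regs)
{
    if (regs->nr < 0 || regs->nr >= TOY_NR_MAX || toy_table[regs->nr] == NULL) {
        regs->ret = -1;  /* no such system call */
        return;
    }
    regs->ret = toy_table[regs->nr](regs->args);
}

/* Userspace side: load the registers, trap, then read the return register. */
static long toy_syscall(long nr, long a0, long a1, long a2)
{
    struct toy_regs regs = { .nr = nr, .args = { a0, a1, a2 }, .ret = 0 };
    toy_break(&regs);
    return regs.ret;
}
```

Calling toy_syscall(TOY_NR_READ, 0, 0, 64) walks the same path as the description: number and arguments go into agreed places, the table picks the handler, and the result comes back in the agreed return slot.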
Although you can do all of the above by hand for each system call, system libraries usually do most of the work for you. The standard library that deals with system calls on UNIX-like systems is libc; we will learn more about its roles in future weeks.
As the system libraries usually deal with making system calls for you, we need to do some low level hacking to illustrate exactly how system calls work.
We will illustrate how probably the most simple system call, getpid(), works. This call takes no arguments and returns the ID of the currently running program (or process; we'll look more at the process in later weeks).
```c
#include <stdio.h>

/* for syscall() */
#include <sys/syscall.h>
#include <unistd.h>

/* system call numbers */
#include <asm/unistd.h>

void function(void)
{
	int pid;

	/* ask the kernel directly for our process ID */
	pid = syscall(__NR_getpid);
}
```
We start by writing a small C program which we can use to illustrate the mechanism behind system calls. The first thing to note is that there is a syscall function provided by the system libraries for directly making system calls. This provides an easy way for programmers to directly make system calls without having to know the exact assembly language routines for making the call on their hardware. So why do we use getpid() at all? Firstly, it is much clearer to use a symbolic function name in your code. However, more importantly, getpid() may work in very different ways on different systems. For example, on Linux the getpid() call can be cached, so if it is run twice the system library will not take the penalty of having to make an entire system call to find out the same information again.
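The caching idea can be sketched as below. This is a minimal illustration and not how any particular system library actually does it; the names toy_cached_getpid, toy_pid_cache and toy_syscalls_made are invented for the example, and it assumes a Linux system where syscall() and SYS_getpid are available. A real cache would also need to be reset in the child after fork(), which this sketch ignores.

```c
#define _GNU_SOURCE          /* for the syscall() declaration */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t toy_pid_cache;     /* 0 means "not looked up yet" */
static int toy_syscalls_made;   /* counts real trips into the kernel */

pid_t toy_cached_getpid(void)
{
    if (toy_pid_cache == 0) {
        /* First call: pay for a real system call and remember the answer. */
        toy_pid_cache = (pid_t) syscall(SYS_getpid);
        toy_syscalls_made++;
    }
    /* Later calls: answer from memory without entering the kernel. */
    return toy_pid_cache;
}
```

However many times toy_cached_getpid() is called, the counter only ever records one trip into the kernel.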
By convention under Linux, system call numbers are defined in the asm/unistd.h file from the kernel source. Being in the asm subdirectory, this is different for each architecture Linux runs on. Again by convention, system call numbers are given a #define name consisting of __NR_ followed by the name of the call. Thus you can see our code will be making the getpid system call, storing the value in pid.
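One way to convince yourself that the number really selects the call is to compare the raw call against the library wrapper. This is a small sketch assuming a Linux system with the usual C library headers; the helper names raw_getpid and getpid_number are made up for the example. The actual value of __NR_getpid depends on your architecture, so we make no claim here about what it is.

```c
#define _GNU_SOURCE          /* for the syscall() declaration */
#include <sys/syscall.h>     /* pulls in the __NR_ and SYS_ constants */
#include <unistd.h>

/* Make the getpid system call by number, bypassing the getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(__NR_getpid);
}

/* Expose the architecture-specific number so it can be inspected. */
long getpid_number(void)
{
    return __NR_getpid;
}
```

On the same process, raw_getpid() and getpid() should agree, since both end up asking the kernel the same question.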
We will have a look at how several architectures implement this code under the hood. We're going to look at real code, so things can get quite hairy. But stick with it -- this is exactly how your system works!
PowerPC is a RISC architecture common in older Apple computers, and the core of devices such as the Xbox 360.
```c
/* On powerpc a system call basically clobbers the same registers like a
 * function call, with the exception of LR (which is needed for the
 * "sc; bnslr" sequence) and CR (where only CR0.SO is clobbered to signal
 * an error return status).
 */

#define __syscall_nr(nr, type, name, args...)                   \
    unsigned long __sc_ret, __sc_err;                           \
    {                                                           \
        register unsigned long __sc_0 __asm__ ("r0");           \
        register unsigned long __sc_3 __asm__ ("r3");           \
        register unsigned long __sc_4 __asm__ ("r4");           \
        register unsigned long __sc_5 __asm__ ("r5");           \
        register unsigned long __sc_6 __asm__ ("r6");           \
        register unsigned long __sc_7 __asm__ ("r7");           \
                                                                \
        __sc_loadargs_##nr(name, args);                         \
        __asm__ __volatile__                                    \
            ("sc \n\t"                                          \
             "mfcr %0 "                                         \
            : "=&r" (__sc_0),                                   \
              "=&r" (__sc_3), "=&r" (__sc_4),                   \
              "=&r" (__sc_5), "=&r" (__sc_6),                   \
              "=&r" (__sc_7)                                    \
            : __sc_asm_input_##nr                               \
            : "cr0", "ctr", "memory",                           \
              "r8", "r9", "r10","r11", "r12");                  \
        __sc_ret = __sc_3;                                      \
        __sc_err = __sc_0;                                      \
    }                                                           \
    if (__sc_err & 0x10000000)                                  \
    {                                                           \
        errno = __sc_ret;                                       \
        __sc_ret = -1;                                          \
    }                                                           \
    return (type) __sc_ret

#define __sc_loadargs_0(name, dummy...)                         \
    __sc_0 = __NR_##name
#define __sc_loadargs_1(name, arg1)                             \
    __sc_loadargs_0(name);                                      \
    __sc_3 = (unsigned long) (arg1)
#define __sc_loadargs_2(name, arg1, arg2)                       \
    __sc_loadargs_1(name, arg1);                                \
    __sc_4 = (unsigned long) (arg2)
#define __sc_loadargs_3(name, arg1, arg2, arg3)                 \
    __sc_loadargs_2(name, arg1, arg2);                          \
    __sc_5 = (unsigned long) (arg3)
#define __sc_loadargs_4(name, arg1, arg2, arg3, arg4)           \
    __sc_loadargs_3(name, arg1, arg2, arg3);                    \
    __sc_6 = (unsigned long) (arg4)
#define __sc_loadargs_5(name, arg1, arg2, arg3, arg4, arg5)     \
    __sc_loadargs_4(name, arg1, arg2, arg3, arg4);              \
    __sc_7 = (unsigned long) (arg5)

#define __sc_asm_input_0 "0" (__sc_0)
#define __sc_asm_input_1 __sc_asm_input_0, "1" (__sc_3)
#define __sc_asm_input_2 __sc_asm_input_1, "2" (__sc_4)
#define __sc_asm_input_3 __sc_asm_input_2, "3" (__sc_5)
#define __sc_asm_input_4 __sc_asm_input_3, "4" (__sc_6)
#define __sc_asm_input_5 __sc_asm_input_4, "5" (__sc_7)

#define _syscall0(type,name)                                    \
type name(void)                                                 \
{                                                               \
    __syscall_nr(0, type, name);                                \
}

#define _syscall1(type,name,type1,arg1)                         \
type name(type1 arg1)                                           \
{                                                               \
    __syscall_nr(1, type, name, arg1);                          \
}

#define _syscall2(type,name,type1,arg1,type2,arg2)              \
type name(type1 arg1, type2 arg2)                               \
{                                                               \
    __syscall_nr(2, type, name, arg1, arg2);                    \
}

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3)   \
type name(type1 arg1, type2 arg2, type3 arg3)                   \
{                                                               \
    __syscall_nr(3, type, name, arg1, arg2, arg3);              \
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4)       \
{                                                               \
    __syscall_nr(4, type, name, arg1, arg2, arg3, arg4);        \
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4,type5,arg5) \
type name(type1 arg1, type2 arg2, type3 arg3, type4 arg4, type5 arg5) \
{                                                               \
    __syscall_nr(5, type, name, arg1, arg2, arg3, arg4, arg5);  \
}
```
This code snippet from the kernel header file asm/unistd.h shows how we can implement system calls on PowerPC. It looks very complicated, but it can be broken down step by step.
Firstly, jump to the end of the example where the _syscallN macros are defined. You can see there are many macros, each one taking progressively one more argument. We'll concentrate on the most simple version, _syscall0, to start with. It only takes two arguments: the return type of the system call (e.g. a C int or char, etc.) and the name of the system call. For getpid this would be done as _syscall0(int,getpid).
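To get a feel for macros that stamp out whole functions, here is a portable imitation. It is a sketch under stated assumptions, not the kernel's code: TOY_SYSCALL0 and TOY_SYSCALL1 are hypothetical names, and instead of inline assembly they lean on the C library's syscall() function and the SYS_ constants from sys/syscall.h, so it assumes a Linux system.

```c
#define _GNU_SOURCE          /* for the syscall() declaration */
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Each use of one of these macros defines a complete wrapper function.
 * SYS_##name pastes the call name onto the SYS_ prefix, which yields the
 * system call number constant for that call. */
#define TOY_SYSCALL0(type, name)                \
type toy_##name(void)                           \
{                                               \
    return (type) syscall(SYS_##name);          \
}

#define TOY_SYSCALL1(type, name, type1, arg1)   \
type toy_##name(type1 arg1)                     \
{                                               \
    return (type) syscall(SYS_##name, arg1);    \
}

TOY_SYSCALL0(pid_t, getpid)          /* defines: pid_t toy_getpid(void) */
TOY_SYSCALL1(int, close, int, fd)    /* defines: int toy_close(int fd)  */
```

After the two macro invocations at the bottom, toy_getpid() and toy_close() exist as ordinary functions, just as _syscall0(int,getpid) leaves behind a function called getpid.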
Easy so far! We now have to start pulling apart the __syscall_nr macro. This is not dissimilar to where we were before: we take the number of arguments as the first parameter, then the type, the name and the actual arguments.
The first step is declaring some names for registers. What this essentially does is say that __sc_0 refers to r0 (i.e. register 0). The compiler will usually use registers how it wants, so it is important we give it constraints so that it doesn't decide to go using registers we need in some ad-hoc manner.
We then call __sc_loadargs with the interesting ## operator. That is just the preprocessor's paste operator, which glues together the text on either side of it once nr has been replaced by its value. Thus for our example the call becomes __sc_loadargs_0(name, args).
__sc_loadargs_0, which we can see below, sets __sc_0 to be the system call number; notice the paste operator again with the __NR_ prefix we talked about, and remember that __sc_0 is the variable name that refers to a specific register (r0).
So, all this tricky looking code actually does is put the system call number in register 0! Following the code through, we can see that the other macros will place the system call arguments into r3 through r7 (you can only have a maximum of 5 arguments to your system call).
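The pasting and cascading can be tried out on any machine if we swap the real registers for an array. This is a toy sketch: toy_reg, toy_load, the toy_loadargs_ macros and the TOY_NR_ numbers are all invented for the illustration, and it relies on the compiler accepting an empty variable argument list, as GCC and Clang do.

```c
/* Toy register file; the index stands for the register number (r0, r3, r4). */
static unsigned long toy_reg[8];

/* Hypothetical system call numbers, for the demonstration only. */
#define TOY_NR_getpid 20
#define TOY_NR_write  4

/* Each macro loads one more "register", then defers to the one below it. */
#define toy_loadargs_0(name, ...)               \
    toy_reg[0] = TOY_NR_##name
#define toy_loadargs_1(name, arg1)              \
    toy_loadargs_0(name);                       \
    toy_reg[3] = (unsigned long) (arg1)
#define toy_loadargs_2(name, arg1, arg2)        \
    toy_loadargs_1(name, arg1);                 \
    toy_reg[4] = (unsigned long) (arg2)

/* nr is pasted onto the macro name, selecting which loader is used. */
#define toy_load(nr, name, ...)                 \
    do { toy_loadargs_##nr(name, __VA_ARGS__); } while (0)
```

Writing toy_load(2, write, 1, 4096) selects toy_loadargs_2, which in turn runs toy_loadargs_1 and toy_loadargs_0, so the number lands in slot 0 and the two arguments in slots 3 and 4.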
Now we are ready to tackle the __asm__ section. What we have here is called inline assembly because it is assembler code mixed right in with source code. The exact syntax is a little too complicated to go into right here, but we can point out the important parts.
Just ignore the __volatile__ bit for now; it is telling the compiler that this code is unpredictable so it shouldn't try and be clever with it. Again we'll start at the end and work backwards. All the stuff after the colons is a way of communicating to the compiler about what the inline assembly is doing to the CPU registers: the first group lists the outputs, the second the inputs, and the third the registers (and memory) that the code overwrites. The compiler needs to know so that it doesn't try using any of these registers in ways that might cause a crash.
But the interesting part is the two assembly statements in the first argument. The one that does all the work is the sc instruction. That's all you need to do to make your system call!
So what happens when this instruction is executed? Well, the processor is interrupted and knows to transfer control to a specific piece of code set up at system boot time to handle interrupts. There are many interrupts; system calls are just one. This code will then look in register 0 to find the system call number; it then looks up a table and finds the right function to jump to in order to handle that system call. This function receives its arguments in registers 3 - 7.
So, what happens once the system call handler runs and completes? Control returns to the next instruction after the sc, in this case mfcr, which stands for "move from condition register". This instruction copies the processor's condition register into a general purpose register, here the one named by output operand %0, which is __sc_0 (r0). As the comment at the top of the code notes, the kernel signals an error return by setting the summary overflow bit of condition register field 0 (CR0.SO); copying the condition register into r0 lets the C code that follows test that bit.
Well, we're almost done! The only thing left is to return the value from the system call. We see that __sc_ret is set from r3 and __sc_err is set from r0. This is interesting; what are these two values all about?
One is the return value, and one is the error value. Why do we need two variables? System calls can fail, just like any other function. The problem is that a system call can return any possible value; we cannot say "a negative value indicates failure" since a negative value might be perfectly acceptable for some particular system call.
So our system call function, before returning, ensures its result is in register r3 and flags any failure in the condition register, which the mfcr instruction copies into r0. We check the error value to see if bit 0x10000000 (the CR0.SO bit) is set; this indicates that the call failed and that r3 holds an error number rather than a result. If so, we set the global errno value to that error number (this is the standard variable for getting error information on call failure) and set the return value to be -1. Of course, if a valid result is received we return it directly.
So our calling function should check that the return value is not -1; if it is, it can check errno to find the exact reason why the call failed.
And that is an entire system call on a PowerPC!
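The tail end of __syscall_nr, the part that turns the two register values into a C return value, can be modelled as a plain function. This is only a sketch of the convention visible in the macro above; the name toy_ppc_return and the constant TOY_CR0_SO are made up, and the two parameters stand in for the values that would really arrive in r3 and r0.

```c
#include <errno.h>

/* The bit tested by the macro: summary overflow in condition register field 0. */
#define TOY_CR0_SO 0x10000000UL

/* r3: the value the kernel left in r3.
 * cr: the condition register as copied out after the sc instruction. */
long toy_ppc_return(unsigned long r3, unsigned long cr)
{
    unsigned long ret = r3;

    if (cr & TOY_CR0_SO) {
        /* Failure: r3 carries the error number, the caller sees -1. */
        errno = (int) ret;
        ret = (unsigned long) -1;
    }
    return (long) ret;
}
```

Because the failure flag travels separately from the value, a result that happens to be negative passes through untouched so long as the flag is clear.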
Below we have the same interface as implemented for the x86 processor.
```c
/* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */

#define __syscall_return(type, res) \
do { \
	if ((unsigned long)(res) >= (unsigned long)(-125)) { \
		errno = -(res); \
		res = -1; \
	} \
	return (type) (res); \
} while (0)

/* XXX - _foo needs to be __foo, while __NR_bar could be _NR_bar. */
#define _syscall0(type,name) \
type name(void) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name)); \
__syscall_return(type,__res); \
}

#define _syscall1(type,name,type1,arg1) \
type name(type1 arg1) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1))); \
__syscall_return(type,__res); \
}

#define _syscall2(type,name,type1,arg1,type2,arg2) \
type name(type1 arg1,type2 arg2) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2))); \
__syscall_return(type,__res); \
}

#define _syscall3(type,name,type1,arg1,type2,arg2,type3,arg3) \
type name(type1 arg1,type2 arg2,type3 arg3) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
		  "d" ((long)(arg3))); \
__syscall_return(type,__res); \
}

#define _syscall4(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4) \
type name (type1 arg1, type2 arg2, type3 arg3, type4 arg4) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3)),"S" ((long)(arg4))); \
__syscall_return(type,__res); \
}

#define _syscall5(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5) \
{ \
long __res; \
__asm__ volatile ("int $0x80" \
	: "=a" (__res) \
	: "0" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5))); \
__syscall_return(type,__res); \
}

#define _syscall6(type,name,type1,arg1,type2,arg2,type3,arg3,type4,arg4, \
	  type5,arg5,type6,arg6) \
type name (type1 arg1,type2 arg2,type3 arg3,type4 arg4,type5 arg5,type6 arg6) \
{ \
long __res; \
__asm__ volatile ("push %%ebp ; movl %%eax,%%ebp ; movl %1,%%eax ; int $0x80 ; pop %%ebp" \
	: "=a" (__res) \
	: "i" (__NR_##name),"b" ((long)(arg1)),"c" ((long)(arg2)), \
	  "d" ((long)(arg3)),"S" ((long)(arg4)),"D" ((long)(arg5)), \
	  "0" ((long)(arg6))); \
__syscall_return(type,__res); \
}
```
The x86 architecture is very different from the PowerPC that we looked at previously. The x86 is classed as a CISC processor as opposed to the RISC PowerPC, and has dramatically fewer registers.
Start by looking at the most simple _syscall0 macro. It simply executes the int instruction with a value of 0x80. This instruction makes the CPU raise interrupt 0x80, which will jump to code that handles system calls in the kernel.
We can start inspecting how to pass arguments with the longer macros. Notice how the PowerPC implementation cascaded macros downwards, adding one argument at a time. This implementation has slightly more copied code, but is a little easier to follow.
x86 register names are based around letters, rather than the numerical register names of PowerPC. We can see from the zero argument macro that only the A register gets loaded; from this we can tell that the system call number is expected in the EAX register. As we start loading registers in the other macros you can see the short names of the registers in the arguments to the __asm__ call: "b" for EBX, "c" for ECX, "d" for EDX, "S" for ESI and "D" for EDI.
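If you want to try a hand-rolled call yourself, the sketch below follows the shape of _syscall0 for getpid. Some caveats: asm_getpid is a made-up name; the 32-bit branch is the only one that matches the code discussed here; the 64-bit x86 branch is an addition of ours that uses that architecture's separate syscall instruction, and any other architecture simply falls back to the C library's syscall() function. It assumes Linux and a GCC-compatible compiler.

```c
#define _GNU_SOURCE          /* for the syscall() declaration */
#include <sys/syscall.h>
#include <unistd.h>

long asm_getpid(void)
{
    long res;
#if defined(__i386__)
    /* Same shape as _syscall0: "0" places the number in the register
     * used by output operand 0, which "=a" says is EAX. */
    __asm__ volatile ("int $0x80"
                      : "=a" (res)
                      : "0" ((long) SYS_getpid)
                      : "memory");
#elif defined(__x86_64__)
    /* 64-bit x86 enters the kernel with the syscall instruction, which
     * overwrites RCX and R11, so both are listed as clobbered. */
    __asm__ volatile ("syscall"
                      : "=a" (res)
                      : "0" ((long) SYS_getpid)
                      : "rcx", "r11", "memory");
#else
    /* Elsewhere, let the system library do the work. */
    res = syscall(SYS_getpid);
#endif
    return res;
}
```

Whichever branch is compiled in, the value returned should match what getpid() reports for the same process.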
We see something a little more interesting in _syscall6, the macro taking 6 arguments. Notice the push and pop instructions? These work with the stack on x86, "pushing" a value onto the top of the stack in memory, and "popping" the value from the stack back into a register. Thus in the case of having six arguments we need to store the value of the ebp register in memory, put our argument in it (the mov instruction), make our system call and then restore the original value into ebp. Here you can see the disadvantage of not having enough registers; stores to memory are expensive so the more you can avoid them, the better.
Another thing you might notice is that there is nothing like the mfcr instruction that followed sc on the PowerPC. There, the kernel reported failure through a flag in the condition register, which had to be copied into a general purpose register before it could be tested. On x86 the kernel reports both results and errors through the single value it leaves in EAX, so no extra instruction is needed after int $0x80.
The only thing left to contrast is the return value. On the PowerPC we had two registers with return values from the kernel, one with the value and one indicating whether an error occurred. However on x86 we only have one return value that is passed into __syscall_return. That macro casts the return value to unsigned long and compares it to an (architecture and kernel dependent) range of negative values that might represent error codes (note that the errno value is positive, so the negative result from the kernel is negated). However, this means that system calls cannot return small negative values, since they are indistinguishable from error codes. Some system calls that have this requirement, such as getpriority(), add an offset to their return value to force it to always be positive; it is up to userspace to realise this and subtract this constant value to get back to the "real" value.
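Both ideas fit in a few lines of ordinary C. This is a sketch rather than real library code: toy_syscall_return repeats the range check from the __syscall_return macro above as a function, while toy_encode_priority and toy_decode_priority are hypothetical helpers whose offset of 20 and "20 minus the value" form are assumptions modelled on the translation described in the Linux getpriority(2) manual page.

```c
#include <errno.h>

/* Function form of the range check in the i386 __syscall_return macro:
 * results from -1 down to -125 are treated as negated error numbers. */
long toy_syscall_return(long res)
{
    if ((unsigned long) res >= (unsigned long) -125) {
        errno = (int) -res;
        res = -1;
    }
    return res;
}

/* Assumed offset scheme for a call whose real answer may be negative. */
#define TOY_PRIO_OFFSET 20

/* "Kernel" side: map a priority in -20..19 onto 40..1, always positive. */
long toy_encode_priority(long prio) { return TOY_PRIO_OFFSET - prio; }

/* Userspace side: undo the mapping to recover the real priority. */
long toy_decode_priority(long raw) { return TOY_PRIO_OFFSET - raw; }
```

A priority of -5 would be mistaken for an error if returned directly, but encoded as 25 it passes the range check untouched and decodes back to -5 in userspace.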