Friday, June 09, 2006

Conflation of Pointer and Array Types

A common source of confusion for new C programmers is the way C conflates pointers and arrays. I often find myself thinking about the dynamic semantics of the language when I'm thinking deeply about passing arrays to functions. Typically, you can tell an experienced programmer that C always passes arrays by reference, never by value, and they won't go wrong. (Strictly speaking, the array is converted to a pointer to its first element and that pointer is passed by value, but the effect is the same.)
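To make the "by reference" rule of thumb concrete, here is a small sketch in C. The names (`zero_first`, `struct wrapped` and so on) are made up for illustration. It contrasts a bare array parameter, which the callee can use to modify the caller's array, with an array wrapped in a struct, which really is copied.

```c
#include <assert.h>
#include <stddef.h>

#define N 4

/* Hypothetical wrapper: an array inside a struct is copied with the struct. */
struct wrapped { int values[N]; };

/* The parameter is adjusted to `int *`, so this writes to the caller's array. */
void zero_first(int values[N])
{
    assert(values != NULL);
    values[0] = 0;
}

/* The struct is passed by value, so only the callee's copy is changed. */
int zero_first_wrapped(struct wrapped w)
{
    w.values[0] = 0;
    return w.values[0];
}

/* Returns the caller's first element after the call: 0, it was overwritten. */
int first_after_array_call(void)
{
    int a[N] = {1, 2, 3, 4};
    zero_first(a);
    return a[0];
}

/* Returns the caller's first element after the call: 1, it was untouched. */
int first_after_struct_call(void)
{
    struct wrapped w = {{1, 2, 3, 4}};
    (void)zero_first_wrapped(w);
    return w.values[0];
}
```

Calling `first_after_array_call()` gives 0 and `first_after_struct_call()` gives 1, which is the whole difference between the two passing conventions.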

Not all languages are like this, so in Boomerang we try to represent pointers and arrays as separate, non-conflated types. In our type system an array is a type used to describe those bytes in memory that contain a finite number of objects of a particular base type. Similarly, a pointer is a type used to describe those bytes in memory that contain an address which, if followed, will reveal a single object of a particular base type.

As such, it is necessary to say some things explicitly in the Boomerang type system that are typically implied by the C type system. For example, a C string is often written in the C type system as char *, or "a pointer to a character". This is clearly a misnomer. The only string that is accurately described by this type is the empty string: C strings must be zero terminated, so if you're only pointing to a single character then that character must be zero. In the Boomerang type system we would write a C string as char[] *, or "a pointer to an (unspecified length) array of characters".
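C itself can spell something close to the Boomerang type, although almost nobody does. The sketch below is only an illustration, with function names invented for the purpose: it computes a string length once through the usual const char * and once through const char (*)[], a pointer to an array of unspecified length. It relies on C's rule that a pointer to a sized array is compatible with a pointer to an unsized array of the same element type, so it is C only and will not compile as older C++.

```c
#include <assert.h>
#include <stddef.h>

/* C style: the type only promises one char; the reader assumes more follow. */
size_t length_via_char_ptr(const char *s)
{
    size_t n = 0;
    assert(s != NULL);
    while (s[n] != '\0')
        n++;
    return n;
}

/* Closer to Boomerang's char[] *: the pointer must be dereferenced to get
   the array before the array can be indexed. */
size_t length_via_array_ptr(const char (*s)[])
{
    size_t n = 0;
    assert(s != NULL);
    while ((*s)[n] != '\0')
        n++;
    return n;
}

/* Returns 1 when both spellings measure the same 3-character string. */
int lengths_agree(void)
{
    const char text[] = "foo";
    return length_via_char_ptr(text) == 3
        && length_via_array_ptr(&text) == 3;
}
```

Note the extra dereference in `(*s)[n]`: that is exactly the step the C convention lets you leave out.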

When I wrote Boomerang's header file parser I had the choice of which type system to use. Should I assume the signature file was using the C type system and silently translate the contents into the Boomerang type system? Or should I allow the user to specify types exactly as they would appear if we were to call Type::print on the resultant object? I tried both and found that 99% of the time the C type and the Boomerang type were the same, so the signature file expects the types to be in Boomerang format. This means that occasionally you will see something weird in a Boomerang signature file. For example:

typedef struct {
    int token;
    const char* name;
    OptionValueType type;
    ValueUnion value;
    Bool found;
} OptionInfoRec;
typedef OptionInfoRec *OptionInfoPtr;
typedef OptionInfoRec OptionInfoRecs[];

typedef OptionInfoRecs *AvailableOptionsFunc(int chipid, int bustype);

What's that function pointer type returning? It's a pointer to an array of OptionInfoRec structs. In C we'd just use OptionInfoPtr, because we assume the programmer knows that an AvailableOptionsFunc will return more than just one OptionInfoRec. In fact, an AvailableOptionsFunc is supposed to return a pointer to a null-terminated array of OptionInfoRec structs, where "null-terminated" means most of the members of the last OptionInfoRec are zero. It's pretty hard to define a sensible type for that, but C programmers work with types like that all the time, so we have to try.
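To show what working with such a "null-terminated" array looks like, here is a rough sketch. Everything beyond the OptionInfoRec layout is an assumption for illustration: the stand-in definitions of OptionValueType, ValueUnion and Bool, the sample table, and in particular the choice of a NULL name as the end marker (the real convention may test different members).

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in definitions; the real ones are not shown in the signature file. */
typedef int OptionValueType;
typedef union { long num; const char *str; } ValueUnion;
typedef int Bool;

typedef struct {
    int token;
    const char *name;
    OptionValueType type;
    ValueUnion value;
    Bool found;
} OptionInfoRec;

/* Assumption: the terminating record is the first one with a NULL name. */
size_t count_options(const OptionInfoRec *options)
{
    size_t n = 0;
    assert(options != NULL);
    while (options[n].name != NULL)
        n++;
    return n;
}

/* Hypothetical table: two real entries followed by the terminator. */
static const OptionInfoRec example_options[] = {
    {  0, "NoAccel",  0, { 0 }, 0 },
    {  1, "SWCursor", 0, { 0 }, 0 },
    { -1, NULL,       0, { 0 }, 0 },
};

/* Counts the entries before the terminator in the table above: 2. */
size_t count_example_options(void)
{
    return count_options(example_options);
}
```

Nothing in the type of `options` says where the array ends; the length only exists as a convention about the contents, which is why it is so hard to capture in a type.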

This also means that in the code generator for C we have to recognise when certain operations are not needed. For example, this assignment:

344 *32* m[local6][local1].name := "foo"

would cause code like this to be emitted:

OptionInfoRecs *local6;
int local1;
(*local6)[local1].name = "foo";

which is correct, but is not very pretty. We're not using the full syntactic sugar of the language. This is much better:

OptionInfoRec *local6;
int local1;
local6[local1].name = "foo";

and is probably how the programmer wrote it in the first place.
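The two forms really are interchangeable in C, which is what lets the code generator pick the prettier one. The following sketch checks that both spellings store to the same element. It uses a cut-down OptionInfoRec with only two members, and the function names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced record for illustration; the real struct has more members. */
typedef struct {
    int token;
    const char *name;
} OptionInfoRec;
typedef OptionInfoRec OptionInfoRecs[];

/* The form that follows the Boomerang type: pointer to array, then index. */
void set_name_via_array_ptr(OptionInfoRecs *recs, int i, const char *name)
{
    assert(recs != NULL && i >= 0);
    (*recs)[i].name = name;
}

/* The idiomatic C form: pointer to the first element, indexed directly. */
void set_name_via_elem_ptr(OptionInfoRec *recs, int i, const char *name)
{
    assert(recs != NULL && i >= 0);
    recs[i].name = name;
}

/* Returns 1 when both forms write the same slot of the same table. */
int both_forms_write_same_slot(void)
{
    OptionInfoRec table[3] = {{0, NULL}, {1, NULL}, {2, NULL}};
    const char *foo = "foo";

    set_name_via_array_ptr(&table, 1, foo);
    const char *first = table[1].name;

    table[1].name = NULL;
    set_name_via_elem_ptr(table, 1, foo);

    return first == foo && table[1].name == foo;
}
```

The only difference at the call site is `&table` versus `table`; both are the same address with different types.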
