=================================== ==== Reference Manual for the ===== == SAFE-C Programming Language == =================================== This document describes Samuro's SAFE-C programming language for compiler developers. Summmary: the SAFE-C programming language is a modern and powerful version of 'C'. It is very fast (because users want fast response times), and very secure (because users want safe and reliable applications). The language is small and easy to learn in order to gain wide acceptance among programmers; it can be used for a broad range of applications ranging from any kind of PC software, to web servers, or even embedded device controllers. =========================================================================== Belgium, March 2011, final version 1.41, contact: marcsamu@hotmail.com =========================================================================== 0. Why a new language ? 1. Lexical Elements 2. Data Types 3. Declarations 4. Names 5. Primaries 6. Expressions 7. Statements 8. Packages 9. Generics 10. Compilation Units 11. Implementation Issues 12. Libraries Appendix A : International support B : Unicode conversion for source files =========================================================================== 0. Why a new language ? ======================= The C language -------------- Although developed back in 1972, the C language is still in wide use today, 38 years later. It is probably the most successful programming language of all times. Its qualities (speed, generality, portability, small size, concise syntax) are widely recognized, and many derived languages (Java, C++, C#) use it as model. C is especially loved by programmers for its permissive nature : it is low-level and allows basically all constructs that machine code would, therefore exploiting the computer's resources to a maximum. C will not be replaced in the foreseable future. It is the base language in which all major operating systems (Windows, Unix) and all large commercial programs are written. Due to its generality and availability, C is also used by many programmers for building applications that don't really need its low-level features. C Drawbacks ----------- And this is where the main problem with C arises : C applications often 'crash' because the language rules are so permissive. Application crashes have catastrophic effects for the reputation of a product : not only do they cause user dissatisfaction, but they can potentially be exploited by hackers to gain access to the underlying computer. Basically, these crashes have only two causes : 1) array bounds are not checked, and 2) pointers are not checked. Fixing these causes is possible, however it is far more complex than it seems and requires a major redesign of many language constructs to compensate for the loss of unrestricted array and pointer operations. When modifying the language, there are basically two risks : a) restricting the language too much and make it unusable in practice, like standard Pascal. b) adding heavy runtime checks, that make programs too slow to be attractive. Derived languages ----------------- In recent years, new languages derived from C were born, with mixed success : C++ basically followed the C way and renounced providing security for the sake of efficiency, Java and C# on the other hand have chosen security and sacrified efficiency. None of these derived languages provides both security and efficiency. All of them furthermore loaded new features onto the language, like: exceptions, overloading, classes, constructors, derived types and garbage collection, all of this at the expense of runtime efficiency. For example, exceptions hinder the compiler to do many optimizations because language constructs cannot be re-ordered. Overloading is a compile-time feature, but like exceptions, it makes programs more complex to understand. Java, C#, and most object-oriented languages use a memory architecture in which all objects are created dynamically on the heap. This has negative effects both on the syntax (the 'new' operator appears very frequently) and on the runtime (cleaning up the heap requires a garbage collection routine to halt programs regularly for a few milliseconds). The usefulness of all these new features is questionnable : - very large programs can be built without them. - programs without them might be simpler to understand and thus cheaper to develop and maintain. - executables without them will certainly be smaller and faster. So, these derived languages can't be the way to follow, let's start again from basic C. Safety ------ The most desirable new feature to add to 'C' seems memory safety. Basically, memory safety means that one part of a program must no longer have the possibility of accessing another part's memory outside the language rules. A language that does not provide this is likely to produce programs that randomly crash with obscure illegal memory access errors. These kind of errors are so costly to examine and so hard to reproduce in large multi- thread programs that they often never get corrected; however today's computer users will no longer accept unreliable software. All language constructs must thus be made 'watertight', so that one small programmer's mistake hidden in a large application cannot possibly cause such an obscure crash. More specifically : - array bounds must always be checked : optimization can move many of these checks outside loops easily as the language does not support exceptions; moreover, a good support for constants and string slices makes the use of many array indexes unnecessary. - pointers must be completely secured by a tombstone and reference counting algorithm. Furthermore, they should not be used for all objects (like many modern languages do) but only where they are necessary. - strong type checking must not leave any loopholes; for example functions with a varying number of parameters must be tightly checked. Efficiency ---------- After safety, the second priority is speed, because that's what the users want. All features that made C so fast will be provided. Other features -------------- Other language features include : Compile-time features --------------------- Making uninitialized variables illegal by tracking program flow is a strong requirement. Data abstraction is supported through opaque and generic types, however these concepts are kept very simple to ease human understanding and allow efficient implementations (example: constructors are not supported). Safe replacements for unrestricted array and pointer operations --------------------------------------------------------------- Array slices offer a safe replacement for C's unchecked memcpy, especially with most types being convertible to byte arrays. Open structs are a new type of parametrizable struct offering a fast and secure alternative to pointer casts. Portability ----------- Precise size specifications of data types (and not minimum size requirements as in standard C), and "packed" data structures, allow better portability than in C for binary I/O of files and network. The size of intern data types like pointers are hidden to make programs portable on both small (32-bit pointers) and large (64-bit pointers) memory architectures. Support for multiple CPUs ------------------------- As fundamental physics limits are almost reached when manufacturing CPUs, single CPU's won't run a lot faster in the future. However, we can expect to have a lot of them in a single computer. This means that we need new software architectures based on cooperative threads. The langage must provide an easy way to start a new thread; libraries must provide simple synchronization primitives. International Character Set --------------------------- Modern applications run in an international environment, especially on the internet, so supporting Unicode characters is a must. Libraries must offer conversions from, and to, UTF-8, UTF-16 and UTF-32. Streamlined syntax ------------------ Error-prone syntax features from C were removed, in particular preprocessor macros (they are replaced by a much better support for constants and inline functions). Array and pointer notations were streamlined. Rejected features ----------------- Overflow checks in arithmetic operators are not included in the language because C programmers are used to overflows and regularly use them with a purpose. Such checks would also hinder some optimizations. The only arithmetic check we keep (as in C) is the division by zero check in integer division and remainder operators. =========================================================================== 1. Lexical Elements =================== 1.1 Source File --------------- The language supports source files in 8-bit ANSI, as well as in UTF-8 and UTF-16 Unicode encodings. The first step of the compilation consists in converting source files to UTF-16 characters (this is explained in appendix B). Each source line ends with a new line character of value 10 (except maybe the last line). White space characters (character values 1 to 32) are ignored between lexical elements, they are used to delimit them. No other characters are allowed between lexical elements. (note: there is no maximum source line length, a source text need not contain any new line characters) 1.2 Comments ------------ Single-line comments begin with "//" and continue to the end of the line. Multi-line comments begin with "/*" and end with "*/". Multi-line comments do not nest. The compiler might generate a warning if the characters /* are found in a comment. A multi-line comment must be closed before reaching the end of the source file, otherwise a compilation error occurs. Example: // this is a single-line comment /* this is a multi-line comment */ 1.3 Identifiers --------------- identifier ::= underscore_or_letter { underscore_or_letter_or_digit } underscore_or_letter ::= any of: - _ (underscore) - A to Z - a to z underscore_or_letter_or_digit ::= any of: - underscore_or_letter - 0 to 9 Example: total_amount MAX buffer2 Lower and upper case letters are distinct. Identifiers longer than 128 characters cause a compilation error. 1.4 Integer Literals -------------------- integer_literal ::= digit_sequence ["L"] | "0x" hex_digit_sequence ["L"] | "0b" bin_digit_sequence ["L"] digit_sequence ::= digit {["_"] digit} hex_digit_sequence ::= hex_digit {["_"] hex_digit} bin_digit_sequence ::= bin_digit {["_"] bin_digit} digit ::= any of: - 0 to 9 hex_digit ::= any of: - 0 to 9 - A to F - a to f bin_digit ::= any of: - 0 to 1 Example: 64 // compatible with any signed or unsigned integer 65_000_000 // underscores for improved readability 0xFF // hexadecimal value 0b1111_0000 // binary value 176L // long note: like Ada, underscores are allowed in integer literals for improved readability. An integer literal's value must be in range 0 to 9_223_372_036_854_775_807 (2^63-1), otherwise a compilation error occurs. The most negative integer value (-2^63) is not representable by an integer literal; the expression long'min can be used instead. If the literal has a suffix L it is of type long. Otherwise, its type will be decided later according to the integer promotion rules. note: C's octal numbers are not supported. The compiler should give a "C-non-compatibility" warning for any integer literal beginning with a zero digit followed by another digit. 1.5 Floating-Point Literals --------------------------- floating_point_literal ::= decimal_floating_point_literal | hexadecimal_floating_point_literal decimal_floating_point_literal ::= digit_sequence "." digit_sequence [("e"|"E") ["+"|"-"] digit_sequence] ['F'|'f'|'D'|'d'] hexadecimal_floating_point_literal ::= "0x" hex_digit_sequence "." hex_digit_sequence [("p"|"P") ["+"|"-"] digit_sequence] ['F'|'f'|'D'|'d'] Example: 1.2f 3.023_223E+2d 0x80.0p-1 A floating-point literal can have the following types : - if a suffix f or F is provided, it has type 'float', - if a suffix d or D is provided, it has type 'double'. - if no suffix is provided, its type will be decided later according to the float promotion rules. A hexadecimal floating-point literal is used for precise machine representations. The mantissa (hex_digit_sequence) is specified in base 16, the exponent (digit_sequence after 'p') is specified in base 10. The literal's value equals mantissa*(2**exponent). A compilation error occurs if the floating-point literal is too large to be representable in the given type, or if is too small and underflows to zero. 1.6 Character Literals ---------------------- character_literal ::= "'" c_char "'" (type char) "L'" c_char "'" (type wchar) c_char ::= any of: - escape_sequence - any character >= 32 except escape character 0x005C. escape_sequence: simple_escape_sequence hex_escape_sequence simple-escape-sequence: any of: \' \" \\ \0 \a \b \f \n \r \t \v \' 0x0027 Single quote \" 0x0022 Double quote \\ 0x005C Backslash \0 0x0000 Null \a 0x0007 Alert \b 0x0008 Backspace \f 0x000C Form feed \n 0x000A New line \r 0x000D Carriage return \t 0x0009 Horizontal tab \v 0x000B Vertical tab hex_escape_sequence: \x hex_digit hex_digit (for type char) \x hex_digit hex_digit hex_digit hex_digit (for type wchar) Example: 'a' 'é' L'\x00FF' Characters literals are constant values of type char (or wchar in case of prefix 'L'). (note: a character literal begins with a quote at position N, followed by a backslash at position N+1 or a quote at position N+2; otherwise it's considered as a quote, not a character literal) 1.7 String Literals ------------------- string_literal ::= '"' { s_char } '"' (for type string) 'L"' { s_char } '"' (for type wstring) s_char: any of: - escape_sequence - any character >= 32 except escape character 0x005C, except double quote character 0x0022. Example: "A" "Hello\n" "Hello" + " World" L"Hallöchen" L"\x1234" String literals are constant values of type string (or wstring in case of prefix 'L'). String literals longer than 128 characters (after replacement of escape sequences and removal of prefix and both double quote characters) cause a compilation error. Note however that longer constant string literals can be built using the '+' concatenation operator. 1.8 Keywords ------------ The following keywords are reserved and cannot be used as identifiers : #begin #define #elif #else #end #endif #error #if #warning Lnul _asm abort assert body bool break byte case char clear const continue default double else end enum false float for free from generic if inline int int1 int2 int4 int8 long new nul null object out package packed public ref return run short sleep string struct switch tiny true typedef _unused uint uint1 uint2 uint4 union use ushort void volatile wchar while wstring note: "all" and "as" are not keywords. 1.9 Delimiters -------------- The following character sequences have special meanings as delimiters : # (starts directive) { } ( ) [ ] ' : , ; ? ~ . .. + += ++ - -= -- -> * *= / /= // /* (start comment) % %= ! != = == => & && &= | || |= ^ ^= < <= << <<= > >= >> >>= note: usually the longest character sequence is prefered over shorter ones, but to prevent a syntax problem in qualified expressions, we add the rule : "a runtime_type followed by a character literal '(' or '{' is in fact followed by a single quote." Example: char '('a') means: char ' ('a') and not: char '(' a') exception: L'A' string(1) '{'a'} means: string(1) ' {'a'} and not: string(1) '{' a'} note: the token ^= and *= only appear in statements, excluding declarations within block statements; in any other context they mean ^ = and * =. 1.10 Preprocessing directives ----------------------------- Compilation symbols are identifiers that have a constant boolean value (true or false). Compilation symbols are only used within preprocessing directives, they are not visible within other language constructs. (note: compilation symbols don't cause source text replacement as in C). They are two kinds of compilation symbols : a) global symbols - they are defined in the make configuration file (mk.cfg); - they must be written all in UPPER CASE; - they are global for all the project. b) local symbols - they are defined using the preprocessing directive #define - they must be written all in lower case; - they are local to the source file where they are defined. - they never propagate to other source files. - they cannot be defined more than once in a given source file. Evaluation of a compilation symbol yields an error if the symbol has not been defined. (reason: to avoid, as in C, that such symbols get wrongly evaluated as a result of typing mistakes) A compilation symbol must contain at least one letter a-z. (reason: otherwise global and local symbols cannot be distinguished) During compilation, the set of global symbols used by each source file, together with their boolean value, is stored in the object file. During rebuild, this set is compared with the current symbols and values defined in the make file. Any source file that used symbols not present in the make file, or that used symbols that had a different boolean value, is invalidated and needs recompilation. note: precompiled libraries are not impacted by a change of compilation symbols in the make file because recompilation never occurs. Example: #define debug 0 // turn off debugging for this source file #define alpha true #define omega false #define as_defined (alpha || omega) #if alpha // do this #endif #if !as_defined #error no as defined #endif #if TEMENOS // global symbol in upper case is defined in mk.cfg if (i) { #else if (!i) { #endif treatment_A (); treatment_B (); } Preprocessor directives always appear alone on a single source line. They always start with the character "#" (except for possible leading white space). Multi-line comments /* */ are allowed on the line after "#", provided that they begin and end on the same line. Finally, a trailing single-line comment "//" is allowed. preprocessor_directive ::= define_directive | if_directive | elif_directive | else_directive | endif_directive | error_directive | warning_directive | begin_unsafe_directive | end_unsafe_directive define_directive ::= "#define" identifier p_expression if_directive ::= "#if" p_expression elif_directive ::= "#elif" p_expression else_directive ::= "#else" endif_directive ::= "#endif" error_directive ::= "#error" any text til end of line warning_directive ::= "#warning" any text til end of line begin_unsafe_directive ::= "#begin" "unsafe" end_unsafe_directive ::= "#end" "unsafe" p_expression ::= p_unary | p_primary {"||" p_unary} | p_primary {"&&" p_unary} | p_primary {"^" p_unary} p_unary ::= ["!"] p_primary p_primary ::= compilation_symbol_identifier | "false" | "true" | "0" (same as false) | "1" (same as true) | "(" p_expression ")" Conditional compilation ----------------------- #if p_expression #elif p_expression #else #endif Conditional compilation directives are used to comment out parts of source code. They can be nested up to a maximum of 16 levels. Example: #if 0 // remove this part of source code // ... #if true // ... #endif #endif Skipped lines are thrown away until a line is found starting with one of : "#if", "#elif", "#else" or "#endif", with possible leading white space. Skipped lines are not converted in tokens or checked for syntax, comments are not recognized, which means that, in skipped lines : - non-closed character literals, string literals or comments will not cause errors; - #define, #error or #warning directives are not evaluated and never cause syntax errors. - "#if", "#elif", "#else" and "#endif" token appearing in skipped lines are counted, but the characters appearing after them on the source line are not evaluated : #if and #endif pairs must simply match. Any number of #elif and finally at most one #else are allowed between #if and #endif lines. Conditional compilation can be applied on from/use clauses, i.e. a different set of files might be imported depending on the defined compilation symbols. Example: #if WINDOWS from std use io; #elif UNIX from std use io2; #endif Error reporting --------------- #error message that extends until the end of line #warning message that extends until the end of line Error messages cause a compilation error, warning messages are purely informative. The text after an error or warning directive is taken 'as is', i.e. token or comments are not recognized. 1.11 Unsafe directive --------------------- The directives "#begin unsafe" and "#end unsafe" enclose unsafe source code. Example: #begin unsafe char* p; #end unsafe Both directives must occur as pairs and must not be nested. Regions of unsafe source code are necessary to write libraries that interface with the operating system. Within those unsafe regions, the following operations are possible : - declaring and calling extern and callback functions. - declaring and using unsafe pointers. A compilation error occurs if such unsafe operations occur outside an unsafe region. Compiled units containing unsafe code might be restricted in an execution environment, for example, they might be disallowed when loaded from the internet. 1.12. Style Rules ----------------- The compiler should generate style recommendation notes, for example : - matching { } pairs should appear either on the same column or on the same line. - matching "if" and "else" should start on the same column. - all cases of a switch statement should start on the same column. =========================================================================== 2. Data Types ============= 2.1 integer 2.2 enumeration 2.3 floating-point 2.4 array, open array, jagged array 2.5 struct, open struct, jagged struct 2.6 union 2.7 pointer 2.8 incomplete 2.9 function pointer 2.10 opaque 2.11 object 2.12 generic 2.13 unsafe pointer 2.1 Integer Types ----------------- Signed Integer types -------------------- tiny 1 byte (-128 to +127) short 2 bytes (-32_768 to +32_767) int 4 bytes (–2_147_483_648 to +2_147_483_647) long 8 bytes (–9_223_372_036_854_775_808 to +9_223_372_036_854_775_807) (18-19 digits) Unsigned Integer types ---------------------- byte 1 byte (0 to 255) ushort 2 bytes (0 to 65_535) uint 4 bytes (0 to 4_294_967_295) Synonyms -------- int1 1 byte (-128 to +127) int2 2 bytes (-32_768 to +32_767) int4 4 bytes (–2_147_483_648 to +2_147_483_647) int8 8 bytes (–9_223_372_036_854_775_808 to +9_223_372_036_854_775_807) (18-19 digits) uint1 1 byte (0 to 255) uint2 2 bytes (0 to 65_535) uint4 4 bytes (0 to 4_294_967_295) (design notes: . there is no 8-byte unsigned type because it's not very useful and because the largest type for conversions is then ambiguous. . there is no fixed-point type because money types can be simulated with a structure and need not be fast. Different types of roundings and magnitudes are better implemented by a package. Finally, time durations can also be expressed by a float type, see statement sleep) Attributes 'min 'max -------------------- The attributes 'min and 'max return the minimum and maximum values of an integer type. N'min : minimum value (most negative) N'max : maximum value (most positive) N denotes an integer type or variable. Example: printf ("the maximum value of an int is : %d\n", int'max); 2.2 Enumeration types --------------------- The following enumeration types are predefined : bool 1 byte (0 to 1) char 1 byte (0 to 255) wchar 2 bytes (0 to 65535) The type 'wchar' stores a symbol from the Unicode character set which covers all spoken languages worldwide (see Appendix A for more details). enum bool (uint1) {false, true}; enum char (uint1) {nul, .., 'A', 'B', .. }; enum wchar (uint2) {Lnul, .., L'A', L'B', .. }; false and true are the literals of enumeration type bool; nul and Lnul are the first literals of the types char and wchar, respectively. (design note: false, true, nul and Lnul are reserved words so they cannot be redeclared) The values of enumeration literals cannot be specified explicitely : the first enumeration literal has always value 0, the second one has value 1, the third one has value 2, etc. (design note: . if explicit values are needed, the user should declare integer constants instead of using an enumeration type; . not allowing holes allows simple implementation of switch statements and open struct size tables) The number of literals of an enumeration type must not exceed the capacity of the base type. Attributes 'first 'last ----------------------- The attributes 'first and 'last return the first and last enumeration literals of an enumeration type. N'first : first enumeration literal N'last : last enumeration literal N denotes an enumeration type or variable. Example: enum Color {RED, GREEN, BLUE}; Color color; color'first // RED color'last // BLUE Color'first + 1 // GREEN Color'last - 1 // GREEN Color'first + 2 // BLUE Color'first + 3 // ERROR: compilation error : overflow Color'last - 3 // ERROR: compilation error : overflow Attribute 'string ----------------- The attribute 'string converts an enumeration value into its literal string. E'string : string representing the enumeration literal E E denotes a value of any enumeration type (except char and wchar). A compile-time or runtime error occurs if E is larger than the last enumeration literal. Example: color = RED; printf ("color : %s\n", color'string); (implementation note: to implement the 'string attribute, an intern string table is created for each declared enumeration type : Pseudo-example: const string[] table = {"RED", "GREEN", "BLUE"}; Unused tables are removed at link-time. ) Trap representations -------------------- At runtime, an enumeration variable can contain a value larger than the last enumeration literal. Example: int i = 3; bool b = (bool)i; This can happen for example: - as a result of a conversion or a "+", "-", "++", "--" operator, - in an union field evaluation, - after an I/O read operation. When such overflows occur in constants at compile-time, they cause a compilation error, however such overflows are not detected at runtime : treatment occurs as if the enumeration type was replaced by the base type. Representation -------------- By default, an enumeration type is mapped to the base type uint4. However, the syntax permits the specification of another base type which must be one of : uint1, uint2, uint4 (or synonyms: byte, ushort, uint). Example: enum Color (byte) {RED, GREEN, BLUE}; Note: small types such as uint1 or uint2 can overflow if the number of literals equals the maximum number of values of the base type; consider the following code : { char ch; for (ch=ch'first; ch<=ch'last; ch++) put (ch); } This code will loop forever because the statement ch++ will overflow and the value will wrap to zero, hence it will always be smaller than or equal to ch'last; to avoid this for small values, the default enumeration base type is uint4. Another advantage of choosing uint4 is that when adding enumeration literals, the size of the type will not change unexpectantly, which would modify an extern interface if the type is used for input/ouput. End Note. 2.3 Floating-point types ------------------------ float 4 bytes (1.5E-45 to 3.4E+38, 7 digits precision) double 8 bytes (5.0E-324 to 1.7E+308, 15-16 digits precision) Synonyms -------- float4 4 bytes (same as float) float8 8 bytes (same as double) (design note: a fast 4-byte float type is needed for 3D applications) 2.4 Array types --------------- An array E[L] is defined by an element type E and a length L. Example: char buffer[80]; // an array of 80 char int t1[10], t2[10]; // two arrays of 10 int char[80] buffer2; // an array of 80 char int[10] t3, t4; // two arrays of 10 int char screen1[25][80]; // a screen of 25 lines of 80 char char[80] screen2[25]; // same char[80][25] screen3; // same The length L of an array must have type int or uint and its value must be between 0 and 2_147_483_647 (int'max). Furthermore, an array's size cannot exceed int'max bytes. attribute 'length or 'length(N) ------------------------------- The attribute 'length returns an int value indicating the array length. The optional constant N specifies the array dimension (default N is 1). Example: buffer'length equals 80 t1'length equals 10 screen1'length equals 25 screen1'length(2) equals 80 using arrays ------------ Array constants and variables always have a constant length. Arrays of non-constant length can be created with the allocator 'new'. Aliases to arrays can be created by declaring references and parameters. Finally, array types can be combined with other array, struct, union and pointer types to create further types. Example: const char[5] welcome = "Hello"; // constant array of length 5 char[64] student; // variable of length 64 char[80]^ line = new char[80]; // heap object typedef char[10] NAME; // array type void print (char[20] title) // parameter { ref char[3] prefix = title[0:3]; // reference } array length matching --------------------- Array lengths must match exactly in assignments, function call parameters, aggregates, qualified expressions, initial expression of objects, or default expression of parameters. Example: char[5] name1 = "ABCDE"; // OK char[5] name2 = "AB"; // ERROR: array length (5 != 2) name2 = {'A', 'B', 'C'}; // ERROR: array length (5 != 3) void print (char[4] title); // parameter print ("ABCD"); // OK print ("DO"); // ERROR: array length (2 != 4) Open Arrays ----------- An open array is an incomplete form of array where the length is not specified : typedef char[] string; // string is an open array of char This can also be written as : typedef char string[]; Open types can be used to declare constants, parameters and references. However, they cannot be used to declare variables or heap objects without a length specification. Example: const string welcome = "Hello"; // constant string student; // ERROR: missing length string(64) student; // OK string^ line = new string; // ERROR: missing length string^ line = new string(len); // OK string^ line = new char[len]; // OK void print (string title) // parameter { ref string prefix = title[0:3]; // reference (first 3 chars) } A function having an open array parameter (like 'print' above) can be called by passing an array of char of any length. Internally, the function will receive an additional hidden length parameter. Similarly, an open array reference will keep an intern variable with the actual array length (unless the length is constant). Jagged Arrays ------------- A jagged array is an array where the element type is an open or jagged type. Each array element can have a different length. Example: const string table[4] = {"This", "is", "an", "example"}; Jagged arrays are always constant, thus jagged array types are allowed for declaring constants, but they can also appear as parameters of mode 'in', or references to these; they are not allowed for declaring variables, parameters of mode 'ref' or 'out' (except type object[]), or to allocate heap objects. Note that the attribute table'length(2) is not allowed in the example above because all array elements have a different length. Note that string(2) and string[2] are not the same : the first is an array of two chars, the second is an array of two strings. Open Jagged Arrays ------------------ An array can be open and jagged at the same time. Example: const string table[] = {"This", "is", "an", "example"}; Restrictions on open and jagged arrays -------------------------------------- The following restrictions apply to the use of open and jagged types : typedef declaration : open and jagged types are allowed. array elements : open and jagged types are allowed, but then the array type becomes a jagged type. struct fields : open and jagged types are allowed, but then the struct type becomes a jagged type. packed struct fields : open and jagged types are NOT allowed. union fields : open and jagged types are NOT allowed. constant : open and jagged types are allowed. variable : open and jagged types are NOT allowed. reference : open and jagged types are allowed. parameter of mode 'in' : open and jagged types are allowed. parameters of mode 'ref' and 'out' : open types are allowed, jagged types are not allowed, except type object[]. designated type of a pointer : open types are allowed, jagged types are not allowed. 'new' allocator : open types are allowed only with a length specification or initial value, jagged types are not allowed. Storage of constant arrays -------------------------- constant area ------------- Each constant (for example a character or string literal) is stored in the 'constant area' and is defined by : - a unique serial number for the constant; - a size; - a sequence of bytes; - a type (which cannot be an open type but can be a jagged type) which must match the size. length equivalence ------------------ When the array type has no length specification, then the length is taken from its constant value. In effect, this means that the following declaration : const string str = "123"; // ('1', '2', '3') is implicitely converted into : const string(3) str = "123"; matching constant ----------------- When the constant is identical to another one, it refers to the same constant (same serial number in 'constant area'). Example: const string(3) str2 = str; constructed constants --------------------- Constants constructed from aggregates of other constants, or from array elements, slices or struct fields, are stored by creating a new constant and taking copies from parts of previous constants. Example: const string(3)[2] str3 = {str, // ('1', '2', '3', str}; // '1', '2', '3') const string(2) str4 = str3[0][0:2]; // ('1', '2') storage of jagged arrays ------------------------ A constant jagged array is an array where the element type is an open array. Such a constant contains 'jag references' to the inner arrays. A 'jag reference' contains two fields : - a reference to the inner array/struct (a 4-or-8-byte address) - the inner array's length or struct's discriminant (an int4) A 4-byte filler can follow if 8-byte addresses are used. Example: const string[2] str5 = {str, // (&str, int 3, str}; // &str, int 3) const string[] str6 = {str, // (&str, int 3, str}; // &str, int 3) At creation, the compiler stores the constant's unique serial number in the jag reference's address. This will later be replaced by the actual address of the inner constant. constant types -------------- Constants in the 'constant area' can have the following types : - integer - enumeration - floating-point - array, jagged array - open array (is always converted into an array, see length equivalence) - struct (packed or not), jagged struct - open struct (packed or not) (is always converted into a struct) - pointer (only null constant, either 4 or 8 bytes) - function pointer (only null constant, either 4 or 8 bytes) - unsafe pointer (only null constant, either 4 or 8 bytes) Constants cannot have the following types : - union - incomplete - opaque/limited - generic (note: discriminant values of jagged open struct types are always stored as 4-byte headers, even if the enumeration type has another size) (implementation note: the creation of constants assumes that: - the address size (4 or 8 bytes) is fixed at compile-time by an option, - the byte ordering (LSB or MSB) is fixed for integer, enumeration, floating-point and address types; however byte ordering can still be changed at link-time because the type of each constant is known) (design note: constants can produce large executable files because they are stored in the constant memory segment and not built at runtime. A good support for constants is an important part of the language) loader ------ As each constant has a defined type, the following validity checks can be performed when loading a compiled unit : - constants have a correct type and size. - jagged arrays have a correct length, and size. - jagged structs have a correct discriminant value, and size. - 'jagged references' have a valid serial number and content. - pointers, unsafe pointers and function pointers have value null. linking constants ----------------- The linker's job is to cleanup the constant area by : 1- removing unreferenced constants; 2- merging identical constants. Constants can be referenced by : - pcodes - jag references from other constants Constants from non-jagged types can be merged easily by sorting them by size, then by searching if shorter sequences occur within longer ones. If found, the shorter sequences can be removed and replaced by references inside the longer sequences. This method can however be overriden for the purpose of better alignment in memory. Merging of jagged constants is complicated by the fact that addresses might be relocated later on. It is generally not worth the effort merging them. Constants --------- - an expression is constant if it contains only the following primaries: integer_literal, floating_point_literal, character_literal, string_literal, nul, Lnul, null, enumeration literal, constant object, aggregate containing only constants, parenthesized expression/qualified expression/conversion of a constant. It must not contain the unary address-of operator "&". - array elements and array slices of constant arrays are constants, provided all expressions are constant; struct fields of constants are constants. (note: for jagged types, evaluation of such constants can require following a jag reference chain from constant to constant) - discriminants of open struct objects are constant if the object has a constant constraint. - the following attributes always return constants : 'min, 'max, 'first, 'last. - the attribute 'length returns a constant if its prefix is a constant or a variable, or a non-open array parameter, heap object or reference; it is non-constant for a parameter, heap object or reference of an open array type. The attribute 'length(N) with N>1 always returns a constant. - the attribute 'size is constant if the prefix is an object of a non-open type, or denotes a non-open type. - the attributes 'byte and 'string return constants for a constant prefix. - the operator "+" on constant char/wchar/string/wstring returns a constant. String Variables ---------------- Important: ========== String variables have an OPTIONAL trailing nul character which is ONLY present if the string variable is not full. (design note: this concept is taken from Modula-2; it is an important difference from C where the trailing nul character is always expected but might be missing, with the risk of causing a buffer overflow) Example: from std use strings; void main() { string(4) name; strcpy (out name, "Luc"); // will copy "Luc" with trailing nul strcpy (out name, "Marc"); // will copy "Marc" (no nul) strcpy (out name, "Jacques"); // will cause a runtime error } In the above example, strcpy receives the length of both arrays and can perform the necessary checks. // the compilation unit 'strings' contains functions to handle string // variables with optional trailing nul : strcpy(out dest,src); : copy a string strcat(ref dest,src); : append a string to another len = strlen(s) : returns active length of string cmp = strcmp(s1,s2) : compare two strings cmp = stricmp(s1,s2) : compare two strings with insensitive case pos = strchr(s,c) : find a char in a string pos = strstr(s1,s2) : find a string s2 in a string s1 ... note: strcpy and strcat cause a runtime error if the target buffer is too small. note: similar functions exist for the wchar/wstring types: wstrcpy, wstrcat, wstrlen, wstrcmp, wstricmp, wstrchr, wstrstr. 2.5. Struct types ----------------- A struct type defines a collection of fields. Example: struct NODE { char[20] name; int count; byte[1024] free_text; NODE^ prev, next; } A struct's size cannot exceed int'max bytes. Open Structs ------------ Open struct types have a parameter called 'discriminant' whose value decides which struct fields exist. Example: enum TypeShape {POINT, SQUARE, CIRCLE, TRIANGLE}; struct Shape (TypeShape kind) { int x, y; switch (kind) { case POINT: null; case SQUARE: int side; case CIRCLE: int radius; case TRIANGLE: int base, height; } } The discriminant 'kind' is a parameter used to create the open struct type 'Shape'. A discriminant must have an enumeration type. Exactly one discriminant and one switch must be present in an open struct type. Open structs can be used to declare constants, parameters and references. They can be used to declare variables and heap objects provided a discriminant value is specified. Example: // the struct type Square defines the fields x, y and side. typedef Shape(SQUARE) Square; const Square square = {x => 10, y => 10, side => 5}; Shape sh; // ERROR: missing discriminant Shape(SQUARE) sq; // OK void put_square (ref Square s); // pass address void put_shape (ref Shape s); // pass address + discriminant Shape^ p1 = new Shape; // ERROR: missing discriminant Shape^ p2 = new Shape(k); // OK (k has type TypeShape) note: discriminants never change after object creation; discriminants are values, not objects. (implementation note: for each open struct type, the compiler creates a intern constant look-up table with the sizes of the open struct type for each possible discriminant value. This table is used in : assignment statement, new allocator, 'byte attribute) Jagged structs -------------- A jagged struct is a struct or open struct type containing fields of an open or jagged type. Jagged struct types are only allowed for declaring constants, parameters of mode 'in', or references to these; they are not allowed for declaring variables, parameters of mode 'ref' or 'out', or to allocate heap objects. Example: struct info { int code; string title; // field has open type Shape shape; // field has open type } const info msg = {10, // (int 10, "square", // &"square", int 6, Square'{0,0,10}}; // &square_0_0_10, int 1) Keyword 'packed' ---------------- The keyword 'packed' can be prefixed to a struct or open struct declaration. Example: struct Foo1 // size = 8 bytes, alignement at 4 { float f; char c; } packed struct Foo2 // size = 5 bytes, alignement at 1 { float f; char c; } Use of the keyword 'packed' has several important effects : 1) the compiler will not insert alignment gaps or trailing bytes in the struct. 2) packed types are convertible to the type "byte[]", so they can be passed from and to the outside world (network, files, ..) using many i/o functions. 3) pointers and function pointers fields are forbidden in packed structs. (note that unsafe pointer fields are allowed in packed structs) (reasons: . pointers must always be aligned so that access to them is atomic, otherwise they could get corrupted if several threads access them. . packed types are used for i/o, and pointers must never be read-in from extern sources) Thanks to the packing, the intern representation is precisely defined (except for MSB/LSB byte ordering). A packed struct can only contain fields of packed types. Enumeration, integer, floating-point, union and unsafe pointer types are always packed. Array types are packed if the element type is packed, non-open and non-jagged. Struct types are packed if they feature the keyword "packed". All their fields must then have packed, non-open and non-jagged types. Pointer, function pointer, opaque, incomplete, generic and jagged types are never packed. discriminant equivalence ------------------------ When a constant declaration of an open struct type has no specified discriminant, then the discriminant is taken from its constant value. In effect, this means that the following declaration : const Shape sh = Square'{0,0,10}; is implicitely converted into : const Shape(SQUARE) sh = Square'{0,0,10}; matching constant ----------------- When the constant is identical to another one, it refers to the same constant (same serial number in 'constant area'). Example: const Shape sh2 = sh; constructed constants --------------------- Constants constructed from aggregates of other constants, or from array elements, slices or struct fields are stored by creating a new constant and taking copies from parts of previous constants. Example: const Shape(SQUARE)[2] sh3 = {sh, // (int 0,0,10, sh}; // int 0,0,10) jagged structs -------------- A constant jagged struct is a struct where at least one field is an open type. Such a constant contains 'jag references' to the inner open type. Example: const Shape[2] sh4 = {sh, // (&sh, int SQUARE, sh}; // &sh, int SQUARE) 2.6 Union types --------------- Example: union U { int a; char b; } Union types are always packed implicitely (however the keyword 'packed' is not allowed) Unions can only contain fields of packed, non-open and non-jagged types, otherwise a compilation error occurs. (note: as a consequence: . unions can be converted to byte[]; . unions cannot contain pointer or function pointer fields, however they can contain unsafe pointers) There are no aggregates or constants for union types. Union types have all their fields mapped to the same memory area. The size of the union type is computed by taking the largest size of all its fields. 2.7. Pointer types ------------------ Pointers are used to access anonymous objects allocated on the heap; pointers cannot access global or local declared variables. A pointer is declared by using the symbol ^ (pronounced caret). Example: Shape^ prev, next; // two pointers to Shape Shape prev^, next^; // same Shape table[10]^; // table of 10 pointers to Shape Shape^ table[10]; // same Shape^[10] table; // same string^ name = new string (80); strcpy (out name^, "Hello"); free name; The accessed type cannot be a jagged type. Pointers to "open type" and pointers to "non-open type" are not compatible because heap objects of an open type have an additional intern header field. Example: string^ p1; string(10)^ p2; p1 = p2; // ERROR: pointers are not compatible (implementation note: pointers do not contain the address of the heap object but that of an intermediate Tombstone structure. See chapter 11.2 Implementation of pointers and heap objects) 2.8 Incomplete types -------------------- Incomplete types are used to declare types having mutual references. An incomplete type name can only be used : - in a pointer type definition (or unsafe pointer definition) as the designated type, - as the type of a formal parameter in a typedef declaration. Example: typedef node2; // incomplete type struct node1 { int count; node1^ up; // pointer to struct type itself node2^ left, right; // uses the incomplete type } struct node2 { int count; node1^ left, right; } A type with the same name as the incomplete type must be declared later in the same declarative region, possibly in another source file. It is called the full type. The full type cannot be a jagged, incomplete, opaque or generic type. It can be an open type. After the full type is declared, all occurences of the incomplete type at places where the full type is visible refer to the full type. 2.9. Function Pointers ---------------------- Example: void treat_node (Shape s); // function declaration typedef void TREAT (Shape s); // function pointer type void treatment () { TREAT treat; // function pointer variable treat = null; treat = treat_node; // parameter modes and types must match if (treat != null) treat (s); } 2.10. Opaque types ------------------ Opaque types are used to declare structs whose inner fields are hidden. Opaque types can only be declared in .h units or in package declarations. For each opaque type, a corresponding full struct type with the same name and all fields must be declared later in the corresponding .c unit or package body. Example: ---------------------------------------------- // drawing.h struct DRAW_CONTEXT; // opaque type void init (out DRAW_CONTEXT d); void circle (ref DRAW_CONTEXT d, int x, int y, int radius); ---------------------------------------------- // drawing.c struct DRAW_CONTEXT // full struct type { int x, y, dx, dy; IMAGE^ image; } public void init (out DRAW_CONTEXT d) { // .. } public void circle (ref DRAW_CONTEXT d, int x, int y, int radius) { // .. } ---------------------------------------------- limited types ------------- Opaque types have a special property : they are 'limited' outside their declarative region. This means that it is not allowed to take copies of such objects, for example using an assignment statement. Example: use drawing; void main() { DRAW_CONTEXT a, b; init (out a); b = a; // ERROR : assignment not allowed for limited types } The 'limited' property is propagated recursively, meaning that array or struct types having an element or field of a limited type are themselves limited. (note: pointers to a limited type are not limited). (note: the 'limited' property is also dynamic, in the sense that any such limited array or struct type becomes non-limited at places where all its elements or fields have become non-limited) Designers of opaque types can be sure that each occurence of an opaque object is unique and that users cannot take copies. (reason: the designer of a opaque type may want to provide an explicit function to clone an opaque object that does not simply copy inner pointers but that replicates referenced structures. Another reason could be that it is faster to copy only the active part of a stack or varying-length string. Finally, the designer may simply want to forbid copies by not providing such a function). The full struct type of an opaque type can be a packed or a non-packed struct type; it is not allowed to be an open or jagged struct type; it is allowed to have fields of another limited type, however a compilation error occurs (possibly at link-time) if the later replacing of the opaque type by its actual full type creates circular dependencies. Inside their declarative region, between the opaque type and the full struct type declaration, opaque types are not limited, meaning they have the same properties as struct types, except of course that the inner fields are not available. (note: constants and aggregates of opaque types are thus not available). However, outside the .h unit or the package that declared the opaque type (and that includes regions between the package declaration and the package body), limited types have the following constraints : - assignment to an object of a limited type is not allowed. - an object declaration with a limited type must not have a default expression (thus constants are not allowed). - function declarations or bodies that have parameters of mode 'in' of a limited type must not have default expressions for these parameters. - an allocator cannot have an initial value of a limited type, neither by an expression nor an aggregate (note: this would create a copy). - aggregates of a limited type are not allowed. note: the clear statement is allowed on a limited object. (reason: this is equivalent to a heap object of an opaque type that is initialized with zeroes by an allocator) After the full struct type is declared, all occurences of the opaque type at places where the full type is visible refer to the full type. (design note: a run-time implementation of opaque and generic types was abandonned due to the complex nature of computing the type of safe pointers at runtime; delaying the size of such types until link-time is must cleaner and produces faster executables as the sizes and offsets are constant) (design note: opaque parameters are always passed by address, never by copy, to avoid Ada's problems; this is why they are restricted to struct types). 2.11. Object type ----------------- The type 'object' is predefined as : typedef byte[] object; // open array of byte The jagged type 'object[]' is used to declare functions like sprintf() which take a variable number of parameters. Example: int sprintf (out string buffer, string format, object[] arg); int sscanf (string buffer, string format, out object[] arg); int trace (string format, object[] arg); note: usually, jagged types are not allowed for parameters of modes 'out' and 'ref'; however an exceptional rule allows this for type 'object[]'. A parameter of type 'object[]' can for example be used in the following contexts : arg'length // as prefix of the 'length attribute, // to query the number of actual parameters. arg[i] // as prefix of an array element : the result has type // 'byte[]' and is convertible to any packed type. func (arg) // as an actual parameter in a function call (implementation notes: . object[] is a temporary array of (address,size) built before the function call; each array element contains the address and the size of an actual parameter (+ a 4-byte filler in case of 8-byte addresses). The table has the same structure as a constant jagged array. . the function call passes then the length of the array and its address; . values of simple types are boxed, i.e: stored in temporary variables and their address and size are stored in the array) 2.12. Generic type ------------------ Generic types are used to parametrize the types of an algorithm, see the chapter on generics. 2.13. Unsafe pointer type ------------------------- Unsafe pointers are similar to C pointers. Unsafe pointers are not allowed to access jagged types. Example: int i, p*; p = &i; i = *p; i = p[0]; p++; (note: unlike C pointer declarations, the * symbol appears as suffix, not prefix) Arrays, structs or unions containing unsafe pointers are themselves unsafe. Using unsafe types is only allowed within unsafe regions. (see 1.11) =========================================================================== 3. Declarations =============== declarations ::= {declaration} declaration ::= enumeration_declaration | typedef_declaration | incomplete_declaration | struct_declaration | opaque_declaration | union_declaration | object_declaration | reference_declaration 3.0. Entity ----------- A declaration creates an entity and associates an identifier with it. The following kinds of entities exist : - integer type (predefined: int1, int2, int4, int8, uint1, uint2, uint4) - floating-point type (predefined: float4, float8) - enumeration type (predefined: bool, char, wchar) - (packed) struct type - (packed) open struct type - discriminant - struct field (varying part) - union type - union field - typedef (renaming, open array, array constraint, discriminant constraint, pointer, function pointer, unsafe pointer) (predefined: object, string, wstring) - incomplete type (and its full type) - opaque type (and its full type) - generic type - generic function declaration - enumeration literal - constant object - variable (global and local) - reference - parameter - function declaration - function body - package declaration (generic or non-generic) - package body - compilation unit declaration (a .h source file) - compilation unit body (a .c source file) An identifier must be unique within a declarative region, i.e. it is not allowed to declare two entities with the same identifier in the same declarative region, except if they logically form a single entity like a declaration together with its body, or an opaque or incomplete type with its full type. Furthermore, when declaring an entity within a block statement, the identifier must also be unique in all surrounding block statements and in the function body region. (reason: this reduces the risk of identifier misinterpretation in deeply nested blocks) Example: void main() { int a, b; { char a; // ERROR : identifier must be unique within region char c; } { char c; // OK } } For any function declaration, there must be a later corresponding function body (except if it contains the option "extern"). A function body without previous declaration is allowed. For any package body, there must be a previous corresponding package declaration. A package declaration without later body is allowed. If a .h source file or a package declaration contains any of : 1- a function declaration, 2- an opaque type, 3- an incomplete type without full type, or a, possibly nested, package declaration containing any of (1,2,3), then there must be a corresponding .c source file or package body containing : 1- the function body, 2- the full type for the opaque type, 3- the full type for the incomplete type. ------------------------------------------------------------------------- 3.1. Enumeration declaration ---------------------------- enumeration_declaration ::= "enum" identifier [ "(" base_type_name ")" ] "{" enumeration_literal {"," enumeration_literal} [","] "}" ";" base_type_name ::= type_name enumeration_literal ::= identifier note: the syntax allows an extra comma "," before } Each enumeration literal declares a constant value of the enumeration type in the same region as the enumeration declaration. 3.2. Typedef declaration ------------------------ A typedef declaration can be used to declare an alias name for a type, to add a type specifier to a type, or to declare a function pointer type. typedef_declaration ::= "typedef" type_definition declarator ";" | "typedef" function_specification ";" type_definition ::= type_name { type_specifier } declarator ::= identifier { type_specifier } type_specifier ::= array_specification | open_array_specification | array_constraint | discriminant_constraint | pointer_specification | unsafe_pointer_specification array_specification ::= "[" constant_expression "]" open_array_specification ::= "[" "]" array_constraint ::= "(" constant_expression ")" discriminant_constraint ::= "(" constant_expression ")" pointer_specification ::= "^" unsafe_pointer_specification ::= "*" All type specifiers of a type definition are applied successively on the type name, creating each time a new type. The type specifiers of a declarator are applied in reverse order. An array specification creates an array type. An open array specification creates an open array type. An array constraint creates an array type; it is only allowed on an open array type. A discriminant constraint creates a struct type; it is only allowed on an open struct type. A pointer specification creates a pointer type. The accessed type cannot be a jagged type. An unsafe pointer specification creates an unsafe pointer type; this is only allowed in unsafe regions (see 1.11). The accessed type of an unsafe pointer cannot be a jagged type. An array type's maximum allowed size is int'max bytes. visibility ---------- The identifier in a declarator is made visible as soon as the declarator is fully parsed (see also the visibility rules for a function specification). 3.3. Incomplete declaration --------------------------- incomplete_declaration ::= "typedef" identifier ";" 3.4. Struct declaration ----------------------- struct_declaration ::= ["packed"] "struct" identifier [ discriminant_part ] "{" { field_declaration } [ variant_part ] "}" discriminant_part ::= "(" type_name identifier ")" field_declaration ::= type_definition declarator {"," declarator} ";" variant_part ::= "switch" "(" identifier ")" "{" {struct_variant} "}" struct_variant ::= "case" constant_expression ":" variant_declarations variant_declarations ::= "null" ";" | field_declaration { field_declaration } The type name of a discriminant part must denote an enumeration type. A struct variant must be present if and only if there is a discriminant part, the struct type denotes then an open struct type. The keyword "null" indicates a variant without any fields. visibility ---------- A struct's identifier is visible as soon as it appears, however, inside the struct declaration itself, it can only be used to declare the type of a field, and in two cases only : . when there is at least a pointer or unsafe pointer specification in the type specifiers leading to the struct type, . in the variant part only, when no discriminant constraint is applied to the struct type (however array specifications may apply). (design note: this does not protect fully from mutual recursive types that can be created using mutually referencing opaque types; an error occurs during link-time at the latest when circularities are detected during the computation of a struct type's size) Example: struct Shape (TypeShape type) { Shape f1; // ERROR: only allowed with pointer specification Shape^ f2; // OK Shape[10]^ f3; // OK Shape^[10] f4; // OK Shape[2]^[3] f5; // OK Shape^[] f6; // OK switch (type) { case A: Shape(B) f7; // ERROR: no discriminant constraint allowed Shape f8; // OK case B: Shape[10] f9; // OK (array specifications allowed) case C: null; } } package P1 struct OP1; end P1; package P2 struct OP2; end P2; package body P1 struct OP1 { OP2 inner; } end P1; package body P2 struct OP2 { OP1 inner; // ERROR: mutually referencing types } end P2; packed struct BAD { int k; string str(BAD'size); // ERROR: BAD not allowed int k2; } The identifier of a field declaration is made visible only after the corresponding declarator was fully parsed. All fields and the discriminant must have unique identifiers. Fields or the discriminant cannot be referenced within the struct declaration, except the identifier in the variant part which must match the discriminant name. All constant expressions within the variant part must be of the type of the discriminant and must have distinct values. A compilation error occurs if a packed struct type contains a field of a non-packed type, or a field of an open type. A struct type's maximum allowed size is int'max bytes. 3.5. Opaque declaration ----------------------- opaque_declaration ::= "struct" identifier ";" 3.6. Union declaration ---------------------- union_declaration ::= "union" identifier "{" { field_declaration } "}" visibility ---------- The identifier of a union declaration is made visible as soon as it appears, however it may not be referenced within the union declaration. The identifier of a field declaration is made visible only after the corresponding declarator was fully parsed. Field identifiers cannot be referenced within the union declaration. Unions fields can only have packed, non-open, non-jagged types. 3.7. Object declaration ----------------------- An object declaration declares a constant or a variable. object_declaration ::= [object_qualifier] type_definition object_declarator {"," object_declarator} ";" object_qualifier ::= "const" | "volatile" object_declarator ::= declarator ["=" expression] Example: const int MAX = 100; string buffer(MAX); int i=0, j=0; visibility ---------- The identifier in an object declarator is made visible as soon as the corresponding declarator is fully parsed. However, the object entity it refers to is not accessible through an expanded name until the end of the object_declarator, otherwise a compilation error occurs. Example: int a = a; // ERROR: 'a' is not accessible int b = outer.b; // OK (if a package 'outer' declares 'b') int c = 2, // OK d = c; // OK scope ----- An object can be declared at the global level (inside compilation units or packages) or at the local level (inside functions and block statements). At the global level, the expression of an object declarator, if present, must be constant. constant and variable --------------------- If the keyword 'const' is specified, the object declaration declares a constant whose value is computed at compile-time and never changes; a constant expression is then mandatory. If no keyword 'const' is specified, the object declaration declares a variable; if an expression is specified, it is the initial value of the object, otherwise the variable's value is initially undefined. The value of a variable can only be changed by an assignment, clear statement, pre or postfixed_statement, prefixed object, postfixed object, or when the variable is updated when itself or one of its fields, or references or parameters of them, are specified as parameter of mode 'out' or 'ref' in a function call. It is an error to specify a constant at places where it can be changed. The type of the expression must be compatible with the type of the object. runtime ------- Global objects (and also heap objects) have a unique occurence; they can be accessed by several threads that can perform simultaneous updates. Local objects can have multiple occurences : a new occurence is created each time the declaring function is called; each local occurence can only be accessed by a single thread. The keyword 'volatile' should be specified for global variables that can be modified by several threads, or that might be modified by operating system calls. It instructs the compiler to always load a fresh copy of those variables, and to save them to memory immediately, instead of keeping them in registers. The keyword 'volatile' can only be specified for global variables of type integer (except long), enumeration, floating-point (except double), pointer, function pointer or unsafe pointer. (reason: volatile variables cannot be larger than 4 bytes; they must be aligned) Local objects larger than 4 Kbytes should be implicitely allocated on heap and automatically freed when leaving their declarative scope. initialization -------------- (design note: we need to ensure that variables are always initialized, in particular pointers, but we want to avoid filling them automatically with zeroes because this is very costly) Global variables, if no expression is specified in their declaration, are always initialized with binary zeroes. Local variables, if no expression is specified, have an initially undefined value. In each possible statement flow of the function, before a local variable or one of its elements, fields or references can be read, the full variable or a reference to it must receive a value (either by assignement, clear statement or by being specified as parameter of mode 'out' in a function call). Initialization of single array elements, slices or fields does not count as an initialization of the whole variable. Taking a variable's address using the unary operator (&) is considered as a write access followed by a read access of the variable. A parameter of mode 'out' (except if it has type 'object[]') is subject to the same rules as a local variable; in addition, such a parameter must receive a value for all possible statement flows of the function, i.e. there must not be a statement flow in which the function terminates while the parameter has not received a value. (note: a parameter of type 'object[]' and mode 'out' is excluded from the above rule because it cannot appear in a clear or assignment statement; this does not cause memory safety problems as pointer types cannot appear as object[] parameter) Evaluation of the statement flow must take into account that : - an object is only garanteed to be initialized if it has received a value in each possible flow of execution. - evaluation of the right-hand expression of operators || and && cannot be garanteed; - either the second or the third expression of operator ? : is evaluated; - only one branch of an if or case statement is executed; - execution of inner statements of a loop cannot be garanteed, except if the loop has a constant true condition, or no condition. - the for_next_statement of a for statement is executed at the end of the loop (before a new iteration), and it might never be executed. - break, continue and return statements might skip statements which are then never executed. Objects that are never read or used in an attribute might generate a warning saying that they are unused. 3.8. Reference declaration -------------------------- A reference declares an alias to an existing object. References can be used to rename objects into shorter names. References declarations are only allowed inside a function body or block statement. reference_declaration ::= "ref" type_definition declarator "=" object_designator ";" Example: ref char ch = buffer[i]; // reference to array element ref string str = buffer[0:3]; // reference to array slice ref int age = student.age; // reference to age field visibility ---------- The identifier in a reference declaration is made visible as soon as the corresponding declarator is fully parsed. However, the reference entity it refers to is not accessible through an expanded name until the end of the reference declaration, otherwise a compilation error occurs. (reason: to avoid that the following is valid: ref int i = i;) The object designator must denote an object. The type of the reference must be compatible with the type of the object. The reference cannot be changed after creation, i.e. it always references the same object. Assignment to a reference assigns a new value to the referenced object. (note: the object_designator can be a declared object (global or local), a heap object, a parameter, a reference, an array element or slice, a struct field, or a 'byte attribute of the earlier; it cannot denote a value, in particular discriminant values are not allowed). A reference is "read-only" if the object_designator denotes a constant, an 'in' parameter, or a reference/index/slice/field/byte_attribute of these. A reference behaves like a function parameter of mode 'out' (thus write before read) if the object_designator denotes a parameter of mode 'out', a variable that has not yet received a value or a reference/index/slice /field/byte_attribute of these. A reference behaves like a function parameter of mode 'ref' if the object_designator denotes an already initialized variable, a parameter of mode 'ref', a heap object, or a reference/index/slice/field /byte_ attribute of these. (note: when inlining a function, each parameter is replaced by a reference if the actual parameter denotes an object; or it is replaced by a variable if the actual parameter denotes a value) References (like parameters) are never constant : their value is not evaluated at compile-time. References to an array element or slice imply an index check. References to a field of an open struct imply a discriminant check to verify that the field exists. References to heap objects, or elements/slices/fields/byte_attribute of heap objects, will increment the tombstone count (see 11.2) corresponding to the heap object until the reference declaration goes out of scope (freeing the heap object when its tombstone count is not zero causes a runtime error). Example: { ref Element e = p^.next^.element; // inc tombstone count of p^.next^ e.head = 2; } // decrement tombstone count Before leaving a scope as consequence of a break, continue or return statement, or the end of a block statement or end of a function body, all tombstones counters of visible reference declarations must first be decremented. (implementation note: references are stored as one of three cases : - normal types (store address); - open array types (store address + length); - open struct types (store address + discriminant) If the reference denotes a heap object, or an element/slice/field/byte_ attribute of a heap object, an additional local variable must be allocated to store a pointer to the tombstone structure corresponding to the heap object; its count must be decremented at the end of the reference's scope) 3.9. Function declaration ------------------------- function_specification ::= [ "[" function_option "]" ] ["public"] ["inline"] ("void" | type_definition) identifier "(" formal_parameters ")" function_option ::= "callback" | "extern" string_literal function_declaration ::= function_specification ";" Specifying function options is only allowed in unsafe regions. (see 1.11) In a generic formal part, the options "callback", "extern" and the keywords "public" and "inline" are not allowed. In a typedef declaration, the option "callback" is allowed, the option "extern" and the keywords "public" and "inline" are not allowed. The keyword "public" is allowed only for a function body : it must be specified if and only if the function was declared earlier in a .h unit or package declaration. (design note: the asymmetry between declaration and body is intentional, the pragmatic goal here is to keep the use of keyword 'public' to a minimum) (implementation note: the ambiguous syntax : type_definition identifier "(" is resolved by testing the type definition: if it is an open type, it is an object_declaration, otherwise it is a function declaration) visibility ---------- For a function declaration or function body, the identifier of a function specification is made visible as soon as it appears. However, the entity it refers to is not accessible through an expanded name until the end of the function specification, otherwise a compilation error occurs. For a typedef declaration, the identifier of the function specification is made visible at the end of the function specification. Example: int func (char func); // OK int item (item t); // ERROR, global 'item' not accessible int check (int i = check); // ERROR, global 'check' not accessible int mine (int i = g.mine); // OK if g.mine exists (design note: the language does not support overloading because it makes programs harder to read) function return type -------------------- A function's return type is defined either by the keyword 'void', or by a type definition which can only denote one of the following types : integer, enumeration, floating-point, pointer, function pointer, unsafe pointer. It cannot denote an array/struct/union/opaque/limited/generic/incomplete type. Function without return value use the keyword void. Example: void initialize (int mode); formal parameters ----------------- formal_parameters ::= [formal_parameter {"," formal_parameter}] formal_parameter ::= [mode] type_definition declarator ["=" constant_expression] mode ::= "ref" | "out" If no mode is specified, the mode "in" is implicitly chosen. Parameterless functions have an empty set of parentheizes. Example: int start_treatment (); Formal parameters can have any type except type incomplete, however type incomplete is allowed if the function declaration occurs within a typedef declaration. visibility ---------- The identifier of a formal parameter is made visible as soon as the corresponding declarator is fully parsed. However, the parameter entity it refers to is not accessible through an expanded name until the end of the function specification, otherwise a compilation error occurs. Example: int check (int i = i); // ERROR, 'i' not accessible int check (int i, int j = i); // ERROR, 'i' not accessible If a parameter has a default constant expression, it must have mode 'in'; then all the following parameters must also have mode 'in' and have a default expression. mode access to formal parameter ==== ========================== in read-only ref read/write out write first, then read/write mode enum/int/float/pointer array/struct/union ===== ====================== ================== in by value by address ref by address by address out by address by address (design note: parameters of mode 'in' passed by address may nevertheless change within a function if the actual parameter is a global variable or a heap object that is modified either by the function itself, or by another thread) function specification equivalence ---------------------------------- The function specification of a function declaration and body must match. In particular : the function name, the function options ("callback"), the return type, the number of parameters, and for each parameter : its name, type, default value presence and value if any. The keywords "inline" and "public" need not match. keyword inline -------------- When the keyword "inline" is specified, the linker should replace calls to the function by an insertion of the function's code, replacing all parameters by variables or references. The linker can ignore the keyword "inline" for example in case of recursive calls where replacement is not possible. option "callback" ----------------- When the option "callback" is specified, then the function can be called by the operating system. Example: typedef [callback] int MAINWIN (HWND hwnd, UINT message, WORD w, LONG l); [callback] int MainWin (HWND hwnd, UINT message, WORD w, LONG l); [callback] int MainWin (HWND hwnd, UINT message, WORD w, LONG l) { return 0; } On the Windows platform, callback function bodies : - align the stack at M16 at entry, - save and restore registers RSI, RDI, RBX, - expect function arguments to be passed from right to left. option "extern" --------------- When the option "extern" is specified with a DLL filename, then the function is defined externally : no corresponding function body is allowed. Example: [extern "KERNEL32.DLL"] int GetStdHandle (int nStdHandle); [extern "user32.dll"] int GetPositional (IntPtr hwnd, MAINWIN callback, char * title, int nMaxCount); The main Windows DLL libraries are : Kernel32.dll, Ntdll.dll, Advapi32.dll, User32.dll, Gdi32.dll, comdlg32.dll, shell32.dll, wsock32.dll The "Platform SDK" from Microsoft details the API. On the Windows platform, extern functions : - expect function arguments to be passed from right to left. function body ------------- function_body ::= function_specification "{" declarations statements "}" Recursive function calls are allowed. A function having a non-void return type must be return-terminated, which means its last statement must be either : a return statement, an abort statement, or an if statement with its both branches being return-terminated, or a switch statement with all its branches being return-terminated, or a block statement being return-terminated, or a for-statement without condition and without inner break-statement leaving the for-statement. Before leaving a function body, the tombstone count of all referenced heap objects must first be decremented (see 3.8. reference declaration). =========================================================================== 4. Names ======== 4.1. Declarative Region ----------------------- A declarative region is a portion of a source file. It can be composed of several parts logically grouped together: - a compilation unit declaration (.h source file) together with its body (.c source file). Entities local to the unit are all entities declared immediately within it. - a package declaration (generic or not) together with its body. Entities local to the package are all entities declared immediately within it. For a generic package, this includes also any generic types and generic functions. - a function declaration together with its body. Entities local to the function declaration are all parameters, as well as all entities declared in the declarations of the function body. - a block statement Entities local to the block statement are all entities declared in the declarations of the block statement. - a struct. Entities local to the struct are the discriminant (if any) and all fields. - a union. Entities local to the union are all fields. 4.2. Simple names ----------------- simple_name ::= identifier The following declarative regions are used for searching simple names : a) anonymous root region. Entities local to this region are : - the compilation unit and body currently compiled, - all imported units (with their names replaced by an alias if any). b) compilation unit declaration and body c) package declaration (generic or not) and body. d) function declaration and body. e) block statement Simple name searching occurs successively in each declarative region, starting from the innermost to the outermost. For each region : 1- entities declared immediately within the region are searched. If an entity with the simple name is found, searching ends. 2- entities declared immediately within the region are scanned to build a list of "folder entities". Folder entities are units or non-generic packages, and they are not enclosing the current search point, i.e. we are currently not inside them. All entities declared immediately within these "folder entities" (but not their bodies) are searched for the simple name. If exactly one entity is found, searching ends. In case several entities are found, the simple name is ambiguous and an error occurs. If no entity was found, a further generation of folder entities that are local to the first generation is built (-> 2). If no more folder entities are found, the search continues with a more outer region (-> 1). Example: // ROOT // "unit": strings.h (alias is STR) int strcpy (out string target, string source); int strcat (out string target, string source); package P0 const int LENGTH = 80; end P0; // "unit": prog.h const int MAX = 1; package P1 const int TOTAL = 2; package P2 const int QUIT = 3; int f(); end P2; end P1; // "unit": prog.c from std use strings as STR; package body P1 package body P2 public int f() { int j; { int a; // X <-- search point } } end P2; end P1; Suppose we're at point X and we search for an entity. At level e, we will find : 1- a At level d, we will find : 1- j At level c, we will find : 1- QUIT, f At level c, we will find : 1- TOTAL, P2 2- nothing because X is inside P2. At level b, we will find : 1- MAX, P1 2- nothing because X is inside P1. At level a, we will find : 1- STR, prog 2- master entities: STR. locals: strcpy, strcat, P0. 2- master entities: P0 locals: LENGTH. 4.3. Expanded names ------------------- When a simple name declared in a unit declaration or package declaration is ambiguous, it can be prefixed with the unit (or its alias) or package name to make it unique. It is then called an expanded name. Example: strings.strcpy expanded_name ::= [unit_name "."] {package_name "."} simple_name The package name must not denote a generic package, unless the expanded name appears within the generic package or its body. An expanded name denotes an entity. The "." selector's prefix must denote a unit or package name, the suffix must denote an entity declared immediately within the unit declaration or package declaration. unit_name ::= expanded_name package_name ::= expanded_name function_name ::= expanded_name generic_package_name ::= expanded_name object_name ::= expanded_name enumeration_literal_name ::= expanded_name | "false" | "true" The expanded name for an object_name must denote a constant, variable, reference or parameter entity. type_name ::= "int1" | "int2" | "int4" | "int8" | "tiny" | "short" | "int" | "long" | "uint1" | "uint2" | "uint4" | "byte" | "ushort" | "uint" | "float4" | "float8" | "float" | "double" | "char" | "string" | "wchar" | "wstring" | "bool" | "object" | expanded_name The expanded name of a type_name must denote a type. A function name must denote a non-generic function declaration; it evaluates to a function pointer value to the function's start address. 4.4. Object and value Names --------------------------- Objects are entities having a type and a value. There are 6 kinds of objects : a) declared objects (global, local to a function, constant or variable) b) parameters (always local to a function, they can have 3 modes) c) references (always local to a function, referencing a global or local object, they can have 3 modes) d) dereferenced objects (dynamically allocated using the allocator new) e) array elements or slices, struct/union fields of another object. f) unsafe objects (dereferenced by an unsafe pointer) A name denotes an object or a value. name ::= | "(" expression ")" | object_name | type_name (only used as attribute prefix) | enumeration_literal_name (returns a value of type enumeration) | function_name (returns a value of type function pointer) | function_call | name array_element | name array_slice | name discriminant_value | name struct_field | name dereferenced_object | name postfixed_object | name deref_unsafe_field | name attribute array_element ::= "[" expression "]" array_slice ::= "[" expression ":" expression "]" discriminant_value ::= "." discriminant_simple_name struct_field ::= "." field_simple_name dereferenced_object ::= "^" postfixed_object ::= "--" | "++" deref_unsafe_field ::= "->" field_simple_name attribute ::= "'" attribute_id ["(" constant_expression ")"] discriminant_simple_name ::= simple_name field_simple_name ::= simple_name (note: of a struct or union) attribute_id ::= identifier | "string" | "byte" Using an object_name that denotes an unsafe pointer type is only allowed in unsafe regions. (see 1.11) The prefix of an array_element or array_slice must not be an array_slice. (reason: the first array slice is then redundant) Array Element -------------- Example: array [ index ] The prefix of the array element must denote an array object (or an unsafe pointer object or value whose accessed type is not an open type, see 1.11). The result is an object of same parameter mode than the prefix (or an unsafe object of mode ref if the prefix is an unsafe pointer). Its type is the array element type (or the accessed type for an unsafe pointer). The index expression must have type int or uint after integer promotion. For an array prefix, the index must evaluate to a value within the bounds of the array (index >= 0 && index < array'length). If this cannot be ensured at compile-time, a runtime check must be inserted. (for a prefix that is an unsafe pointer, the index is always converted implicitely to an int4 so that a signed offset is added to the prefix address; furthermore, no bound check is performed) Example: int i; for (i=0; i= 0 and offset <= array'length (2) length >= 0 and length <= array'length (3) offset + length <= array'length If this cannot be ensured at compile-time, runtime checks must be inserted. (for a prefix that is an unsafe pointer, the offset is always converted implicitely to an int4 so that a signed offset is added to the prefix address; furthermore, no checks on offset and length are performed) (implementation notes: We have 3 values : offset, length and array'length which can all three either be constant or non-constant. Array'length is never negative, but the other two can be, so this must be checked. If only offset is non-constant, the following runtime check is needed : (1) runtime unsigned check : offset <= array'length - length If only length is non-constant, the following runtime check is needed : (1) runtime unsigned check : length <= array'length - offset In all other cases, we need two run-time checks : (1) runtime unsigned check : length <= array'length (2) runtime unsigned check : offset <= array'length - length A compile-time or runtime error occurs if these checks fail. address += (offset * constant_element_size); end implementation notes) Discriminant value ------------------ Example: open_struct . discriminant The prefix of the discriminant value must denote an object of an open struct type. The result is a value of the discriminant's enumeration type. The result is a constant if the prefix object has a constant discriminant constraint, otherwise it is a runtime value. If the result is a constant, the prefix is evaluated at compile-time but no code is generated at runtime. Struct field ------------ struct . field The prefix of the field component must denote an object of a struct, open struct or union type. The result is an object of same parameter mode than the prefix. Its type is that of the field. If the field belong to a variant part of an open struct type, a runtime check is needed to verify that the open struct's discriminant value is appropriate for accessing this field, unless this can be ensured at compile-time. runtime : address += field_offset; Dereferenced object ------------------- Example: pointer ^ The prefix of the dereferenced object must denote a pointer object or value. (note: if the prefix denotes something else, then the "^" operator must not be parsed as it probably indicates a xor operator) In case of an object prefix, a "read" operation is performed on the pointer object. (this may cause a compile-time error in case of a not-yet initialized variable or parameter of mode 'out'). The result is the heap object designated by the pointer. A run-time error occurs if the prefix evaluates to null, or if the heap object was already freed earlier. Example: while (l != null) // flow analysis can tell l is never null inside loop { l = l^.cdr; // note: no null check needed before dereferencing } (implementation note: during dereferencing, the heap object is locked by atomically incrementing a counter inside its tombstone structure; it is decremented atomically again when no address inside the heap object is in use anymore, which is generally at the end of the statement or declaration in which the dereferencing appears, or until a further dereferencing - see also reference declarations) Postfixed object ---------------- The prefix must denote an object having an integer or enumeration type, or an unsafe pointer type. The object will be converted to a value, after which the object will be incremented / decremented. (in case of an unsafe pointer the pointer value will be incremented/decremented by the size of the accessed type) A "read" followed by a "write" operation are performed on the object. (note: this may cause an error in case of a not-yet initialized variable or parameter of mode 'out'). Example: int p*, q*; *p++ = *q++; Deref unsafe field ------------------ Example: unsafe_pointer -> discriminant_or_field The operator "->" is equivalent to the succession of the two operators "*" and "." (see "unary operator *", "Discriminant value" and "Struct field"). Attribute --------- The following attributes exist: min, max, first, last, length, length(N), byte, string, size. If the result is a constant, the prefix is evaluated at compile-time and no code is generated at runtime. Example: table[-1]'min // error: compile-time evaluation of prefix fails table[f()]'min // OK : f() will never be called at runtime attributes 'min 'max -------------------- I'min : minimum value (most negative) I'max : maximum value (most positive) I must denote an integer type or variable (a constant is not allowed). The result is a constant value of the same integer type. attributes 'first 'last ----------------------- E'first : first enumeration literal E'last : last enumeration literal E must denote an enumeration type or variable (a constant is not allowed). The result is a constant value of the same enumeration type. attribute 'length 'length(N) ----------------------------- A'length returns the number of elements of an array A. A must denote an array object, or a non-open array type. The optional integer constant N specifies an array dimension (>= 1) which is only valid if A is an N-dimensional array. (implementation note: length(N) with N > 1 is evaluated by successively checking the types A[i], A[i][j], A[i][j][k], .. until reaching dimension N. A compilation error occurs if any such type is not an array, or if it is an open array) The result is a value of type int; it is constant if A denotes a non-open array type. attribute 'byte --------------- O'byte converts an object O of any packed type into an array of byte (byte[]) having the same intern representation and the same size. The size of the prefix object must be evaluated (possibly at runtime for an open type, slice, parameter, heap object), after which it becomes the length of the resulting byte array. The result is an object of same parameter mode than the prefix. The result is a constant if the prefix is a constant. The result array length is constant if the prefix has a constant size. note: for enumeration, integer and floating-point constant objects, boxing (storing the value into read-only memory and taking its address and size) might be necessary to achieve the effect of the attribute. note: the 'byte attribute will return different results on big endian and little endian computer architectures. Example: int i = 4; float f; f'byte = i'byte; // copy i into f (4 bytes) if (f'byte[3] & 128 != 0) // assumes little-endian representation printf ("negative"); attribute 'string ----------------- E'string converts an enumeration value E into its literal string. The prefix must denote a value of any enumeration type, except char and wchar. The result is a read-only (parameter mode 'in') object of type string. A compile-time or runtime error occurs if the prefix value is larger than the last enumeration literal. Example: printf ("color : %s\n", color'string); attribute 'size --------------- T'size : returns the number of bytes used by type T. T must denote a packed and non-open type. (note: within an unsafe region, T can denote a non-packed type but it must not be renamed, incomplete, generic or opaque) O'size : returns the number of bytes used by object O. O must be an object of a packed type (it can be an open type). (note: within an unsafe region, O can denote an object of a non-packed type but the type must not be renamed, incomplete, generic or opaque) The result is a value of type uint. note: this can include a runtime computation for : - a reference/parameter/heap object of an open type, - a slice. note: the size of an open struct does NOT include the discriminant. Function name ------------- A function name used as name returns a value of type function pointer which corresponds to the executable code start of the corresponding function body. 4.5. Function call ------------------ function_call ::= name "(" actual_parameters ")" Example: p^.func(exp)^.f("a"); (*p)("a"); The name of a function call must be a name that denotes a value of type function pointer; if it evaluates to null, a runtime error occurs. The function referenced by the name must have a return value that is : - "void" if the function_call is used in the context of a run call; - non-void if it is used in the context of an inner name; - void or non-void if it is used in the context of a function_call_statement. The corresponding function specification must not have a "callback" option. actual_parameters ::= [actual_parameter {"," actual_parameter}] actual_parameter ::= [identifier "=>"] expression | "ref" [identifier "=>"] object_designator | "out" [identifier "=>"] object_designator (implementation note: parsing this requires a look-ahead of two tokens) Example: divide (i, j, out ans, out r); retrieve_object ( id => nr, out item => my_item); Some actual parameters may use a "named form", i.e. by prefixing the identifier of the actual parameter name. Actual parameters of packed type that are given for a formal parameter of type object[] must not be given in named form. All parameters must be provided, in order, except possibly the last parameters if a default constant expression is defined for them in the formal parameter declaration. Example: void put_int (int value, int base = 10); put_int (i, base => 16); put_int (i); // assumes base 10 parameter passing ----------------- The type of each actual parameter must be compatible with its corresponding formal parameter (see assignment statement for the notion of 'compatibility'). For parameters of mode in, additional implicit conversions may occur (see chapter 'implicit conversions'). If either the formal or the actual parameter has type array of byte, additional conversions are possible (see below). Finally, if the last formal parameter has type object[], it can accept a list of actual parameters of packed types (see below). mode ---- For a formal parameter of mode out or ref, the actual parameter must be an object (not a value); it must not be a constant, a parameter of mode in, or a reference to these. initialization -------------- For a formal parameter of mode in or ref, the actual parameter must not denote a local variable or parameter of mode 'out' that has not yet been initialized. For a formal parameter of mode out, the actual parameter is assumed to receive a value by the function call, and hence to become initialized. array/struct matching --------------------- When passing an actual parameter of open array type to a formal parameter of array type with length constraint, a check is done that the array lengths match. An error occurs if the check fails. When passing an actual parameter of open struct type to a formal parameter of open struct type with discriminant constraint, a check is done that the discriminant values match. An error occurs if the check fails. evaluation order ---------------- Actual parameters are evaluated from left to right, as they appear in the source text. Example: int i = 0; f (i++, i++, i++); // same as f (0, 1, 2); heap object passing ------------------- In case heap objects, or elements/slices/fields/.. of heap objects are passed as actual parameter, either directly or as part of a list of parameters for a formal parameter of type 'object[]', then their corresponding tombstone count must be incremented at dereferencing (before the call) and decremented after the call (see also 3.8.) This means that a pointer to the tombstone must be saved in an temporary local variable before the call. formal parameter of type array of byte -------------------------------------- An actual parameter of any packed type is compatible with a formal parameter of type array of byte (either the open array type byte[] or non-open array type byte[N]). The size of the actual parameter is computed and becomes the length of the byte array. If the formal parameter specifies a length, it must match with the computed length of the actual parameter. For a constant actual parameter of mode 'in' of a packed type, and passed by value (thus of type enumeration, integer, float or unsafe pointer), the constant is stored in the constant memory area and its address and size are passed as parameter. For a non-constant actual parameter of mode 'in' of a packed type, and passed by value (thus of type enumeration, integer, float or unsafe pointer), boxing (storing the parameter in a temporary variable and passing the address and size of this variable) occurs. For all other cases (object of mode 'out' or array/struct/union object or aggregate value), the address and size of the item are passed as parameter. If, after application of promotion rules for integer types, the type of the parameter is integer literal, then the parameter is converted implicitely either into type int if the value fits into an int's range, or into type long otherwise. If, after application of promotion rules for floating types, the type of the parameter is floating-point literal, then the parameter is converted implicitely into type float if the value fits into a float's range, or into type double otherwise. If the actual parameter is a null literal, it is converted implicitely into an unsafe pointer type; this must occur in a safe region (see 1.11). Example: void put (byte[] b); put (2); // type int put (2L); // type long put ((byte)2); // type byte put (1.2); // type float put (1.2d); // type double put (null); // type unsafe ptr formal parameter of any packed type ----------------------------------- An actual parameter of array type 'byte[]' is compatible with a formal parameter of any packed type. For a formal parameter of a non-open type, a size check is performed to make sure both sizes are identical. The address (for array/struct/union or other types of mode out/ref) or the value (for other types of mode in) of the parameter is passed. A formal parameter of open struct type is not allowed here. (design note: because the discriminant value is unknown); A formal parameter of open array type is allowed : in this case the length of the array is computed by dividing the actual parameter's size by the array element size. . a compilation error occurs if the array element size is zero. . a compilation error occurs for size truncation if the actual parameter has a constant size that cannot be divided exactly by the element size. . a compilation warning occurs for risk of size truncation if the array element size is larger than 1 and the actual parameter has a runtime size. note: these implicit conversions from, and to, byte[] will return different results on big endian and little endian computer architectures. formal parameter of type object[] --------------------------------- There's an implicit conversion of a list of actual parameters of packed types to a formal parameter of type "object[]" or "object[N]". Example: void f (string format, object[] arg) { ... arg'length arg[i]'byte } f ("%d%d%d", i, j, k); When the last formal parameter of a function has type object[], either : 1) a single actual parameter of type object[] must be given, or 2) a variable number (possibly zero) of actual parameters of packed types must be provided. In 1), the address and array length of the actual parameter of type object[] are passed as parameters (as for any open array type). In 2), a temporary jagged array[N] of (address, size) is allocated where N equals the number of actual parameters matching the object[] formal parameter. This array is then filled with the addresses and sizes of the actual parameters; boxing and implicit conversion of literals might occur as explained above in the case of a formal parameter of type byte[]; then the length and address of the temporary array are passed as parameters. If no actual parameter is provided, either the default expression of the formal parameter is passed, or a zero-length array is passed if there is no such default expression. When parsing the first actual parameter corresponding to a last formal parameter of type object[], a context of type byte[] is always used. (reason: we can't distinguish cases 1) and 2) so we are making an arbitrary choice based on the much more frequent use of case 2) If the formal parameter of type object[] is constrained by a length, then the number of corresponding actual parameters must match the length. string format ------------- In function calls where the last formal parameter has type object[] and the previous actual parameter is a constant of type string or wstring, the compiler checks the matching between the format pattern of the string and the actual parameters. A compiler error is generated in case of mismatch or in case of bad format. In case the previous actual parameter is not a constant or not of type string or wstring, no checks occur at compile-time, an error might then occur at runtime in case of bad format. Example: int trace (string format, object[] arg); trace ("number %d", 1.2); // error: string format mismatch The following format patterns are supported for a parameter of mode 'in': %[flags][width][.precision]type type output ---- ------ % '%%' will write '%' d any signed integer decimal number (-61) u any unsigned integer or enum unsigned number (12) x any integer or enum hexadecimal number (7fa) e float, double scientific (3.9265e+2) f float, double floating-point (392.65) c char or string all char, or 'precision' char C wchar or wstring all wchar, or 'precision' wchar s string stops at nul, or 'precision' char S wstring stops at Lnul, or 'precision' wchar width ----- (number) Minimum number of characters to be printed (padded with spaces or zeroes if necessary). The value is not truncated even if the result is larger. * The width is not specified in the format string but as an additional int value parameter preceding the parameter to be formatted. precision --------- .number A dot without number means precision zero. For f : this is the number of digits to be printed after the decimal point. Default precision is 6. No decimal point is printed for precision zero. For s/S : this is the maximum number of characters to be printed. Default is all characters. .* The precision is not specified in the format string but as an additional int value parameter preceding the parameter to be formatted. flags ----- - for all : left-justify within the given field width (default is right-justify). + for d, e, f : prefix '+' for positive or zero numbers. 0 for d, e, f, x, u : prefix '0' digits instead of spaces; (not valid together with flag '-') The following format patterns are supported for a parameter of mode 'out' or 'ref': white space (ascii 1 to 32) --------------------------- ignore zero or more white space characters. Non-whitespace character, except percentage sign (%) ---------------------------------------------------- must match, or the function fails. Format specifiers: %[*][width]type * data is read but not stored (there is no corresponding actual parameter). width maximum number of characters to be read. type input ---- ----- % '%%' will read '%' d any signed integer decimal number preceded by optional + or - (-61) u any unsigned integer or enum unsigned number (12) x same as d or u hexadecimal number (7fa) f float or double floating-point number (0.5) (12.4E+3) e same as f c char or string fill arg, or read max 'width' chars. C wchar or wstring fill arg, or read max 'width' wchars. s char or string same as c but stop at first white-space. S wchar or wstring same as C but stop at first white-space. calls to "callback" or "extern" functions ----------------------------------------- For "callback" or "extern" functions, the parameters calling conventions are those of the operating system. For example, for Windows : - all arguments are widened to 32 bits (or 64 bits on 64-bit Windows). - argument-passing order is right to left. - the called function is responsible for restoring the stack. - return values are widened to 32 bits and returned in the EAX register, except for 8-byte types, which are returned in the EDX:EAX register pair for 32-bit. - the system generates prolog and epilog code to save and restore the ESI, EDI, EBX, and EBP registers, if they are used in the function. - pointers and open type parameters should be avoided as the results are undefined; unsafe pointers should be used. Example: #begin unsafe [extern "KERNEL32.DLL"] int GetStdHandle (int nStdHandle); [extern "KERNEL32.DLL"] bool WriteFile ( int hFile, char *Buffer, int size, int *written, byte *lpOverlapped); void main () { int written; const string message = "Hello World !"; WriteFile (GetStdHandle(-11), &message, message'length, &written, null); } #end unsafe 4.6. Run call ------------- A run call starts a thread and passes it a function with one optional parameter. run_call ::= "run" function_call The function referenced by the function call must denote a function having a void return value. The corresponding function specification must not have a "callback" or "extern" option, nor be a generic function. (design note: threads could corrupt the stack when receiving unrestricted start parameters, hence they can't be implemented by generics and require a specialized syntax) The function started by a run call must have a void return type and at most one formal parameter that must either be : - of mode 'in', and of type integer (but not type 'long'), enumeration, pointer, function pointer or unsafe pointer, - of mode 'in', and of a non-open array/struct/union type, provided the actual parameter is a constant object, a global variable, or an element/ slice/ field thereof. - of mode 'out' or 'ref', and of any non-open type, provided the actual parameter is a global variable or an element/slice/field thereof. (reason: if a local variable, parameter or heap object passed by address were allowed, they could be freed before the thread ends, and the thread would then write anywhere and corrupt memory; heap objects cannot be passed because their tombstone counter could never be decremented again by the caller) (reason: most operating systems allow only passing a 4-byte parameter to a starting thread) Example: void T1 (param P) { } void main() { int rc; rc = run T1 (p); // starts a thread } A run call returns an int value indicating if the thread started correctly (zero indicates success, a negative value indicates an error). =========================================================================== 5. Primaries ============ A primary denotes a value. primary ::= integer_literal (integer) | floating_point_literal (floating-point) | character_literal (char, wchar) | string_literal (string, wstring) | "nul" (char) | "Lnul" (wchar) | "null" (pointer, function/unsafe pointer) | name (value or object) | run_call (int) | aggregate (array, struct) | qualified_expression (value) | new_allocator (pointer) Some primaries (aggregate, new allocator) need a "context" (the type of the variable they will be assigned to, or the type of the parameter they will be associated with) so that they can be parsed properly. A compilation error occurs if such a context is not available. integer literal --------------- an integer literal denote a value of type long or an intern type "integer literal", depending on the literal suffix. floating-point literal ---------------------- a floating-point literal denote a value of type float, double or the intern type "floating-point literal", depending on the literal suffix. character literal ----------------- a character literal denote a value of type char or wchar, depending on the character literal suffix. string literal -------------- a string literal denote a value of type string or wstring, depending on the string literal prefix. nul/Lnul -------- The reserved keywords 'nul' and 'Lnul' denote a value equal to the first enumeration literal of the types 'char' and 'wchar' respectively. null ---- The reserved keyword 'null' denotes a special pointer value that does not point at anything. It is compatible with all pointer, function pointer and unsafe pointer types. aggregate --------- An aggregate is an array or struct compound value. aggregate ::= positional_array_aggregate | open_array_aggregate | positional_struct_aggregate | named_struct_aggregate A constant aggregate contains only constant expressions, otherwise it is a runtime aggregate. Constant aggregates are stored in the constant memory area. Aggregates of jagged types are allowed, but they must be constant. The type of an aggregate is determined from the context. It cannot be a limited type (reason: this would create a copy). Aggregates used in 'new' allocators are built within a new heap object, otherwise they are built within a temporary stack variable. Aggregates used in assignments can be built directly in the result variable of the assignation if it can be garanteed that none of the aggregate expressions uses the result variable. array aggregate --------------- Two forms of array aggregates exist : a) positional array aggregate ----------------------------- positional_array_aggregate ::= "{" [ expression {"," expression} [","] ] "}" note: the syntax allows an extra comma ',' before the closing '}'. note: the syntax allows an empty aggregate with no values. Example: {'A', 'B', c1, 'D', 'I', ch1, ch2} A positional array aggregate is allowed in the context of an array with constant or runtime length (the length must then match with the aggregate length), or in the context of an open array (the aggregate defines then the array length). note: a string literal is an aggregate of type string. b) open array aggregate ----------------------- open_array_aggregate ::= "{" "all" "=>" expression "}" Example: {all => c} // c can be a variable note: these are used to fill arrays without the need for loops (replacing C's memset) An open array aggregate defines an array value in which all elements have the same value defined by the expression. An open array aggregate is allowed if the context specifies an array with constant or runtime length, but not in the context of an open array without length constraint. Example: str[ofs:len] = {all => c}; // aggregate with runtime length The expression is always evaluated once and stored in a temporary stack variable, even if the context length is zero. (implementation note: "all" is not a keyword, this causes no ambiguity because the context of the aggregate is an array type and the token after it is "=>") Each expression must, possibly after an implicit conversion, be compatible with the array element type. struct aggregate ---------------- Two forms of struct aggregates exist : a) positional struct aggregate ------------------------------ positional_struct_aggregate ::= "{" [ expression {"," expression} [","] ] "}" note: the syntax allows an extra comma ',' before the closing } note: the syntax allows an empty aggregate with no values. Example: struct Node { float length; char letter; bool active; } Node n = {1.0, 'A', b}; // b can be a variable b) named struct aggregate ------------------------- named_struct_aggregate ::= "{" [ identifier "=>" expression {"," identifier "=>" expression} [","] ] "}" note: the syntax allows an extra comma ',' before the closing } note: the syntax allows an empty aggregate with no values. Example: Node n2 = {length => 1.0, letter => 'A', active => b}; All fields in the aggregate must appear exactly once in the same order as in the struct declaration. Struct aggregates are allowed in the context of a struct, or in the context of an open struct with constant discriminant constraint; they are not allowed in the context of an open struct with runtime discriminant constraint, or an open struct without discriminant constraint. Struct aggregates are allowed for both packed and non-packed struct types. Each expression must, possibly after an implicit conversion, be compatible with the corresponding struct field type. qualified expression -------------------- qualified_expression ::= runtime_type "'" "(" expression ")" | runtime_type "'" aggregate runtime_type ::= type_name | type_name "[" "]" | type_name "[" expression "]" | type_name "(" expression ")" The expression of the runtime_type must be constant, except if the runtime_type is used inside an allocator, and the context denotes a pointer type designating an open type, and in two syntax forms only: - in a qualified expression, when followed by an open_array_aggregate, - in an allocator of the form: "new" runtime_type. a) qualified parenthesized expression ------------------------------------- Example: t = tiny ' (2); d = double ' (f+1); p = ptr ' (null); str3 = string(3) ' ("ABC"); str3 = char[3] ' ("ABC"); str3 = string ' ("ABC"); str3 = char[] ' ("ABC"); s = Shape ' (sh); The parenthesized expression is evaluated in the context of the runtime type. The parenthesized expression must, possibly after an implicit conversion, be compatible with the runtime type. b) qualified aggregate ---------------------- Example: string(3) str3; Shape(CIRCLE) sh; str3 = string ' {'a', 'b', 'c'}; str3 = char[] ' {'a', 'b', 'c'}; str3 = string(3)' {'a', 'b', 'c'}; str3 = char[3] ' {'a', 'b', 'c'}; str3 = string ' {all => '*'}; // uses length from context str3 = string(3)' {all => '*'}; sh = Shape'{0,0,12}; put (string(80)' {all => '*'}); str3[ofs:len] = string ' {all => c}; // runtime length pstr^ = string ' {all => c}; // runtime length The type of the aggregate must be compatible with the runtime type. The context type of a qualified expression can be : . a non-open type, . an open type with constant length or discriminant value, . an open type with runtime length or discriminant value (ex: formal parameter, heap object, array slice), . an open type with unspecified length or discriminant value (ex: allocator). The runtime type can be : . a non-open type, . an open type with constant length or discriminant value, . an open type with runtime length or discriminant value, . an open type with unspecified length or discriminant value. These two types are merged : . they must belong to the same base type (same elementary type, same array element type or same struct type). . if only one has a length or discriminant constraint, it is the merged type. . if both have a length or discriminant constraint, they are checked to be equal (possibly at runtime), and they are the merged type. . if none has a length or discriminant constraint, then the merged type is an open type. The merged type is used as context for the aggregate or the parenthesized expression. The expression of a runtime type is checked either at compile-time or at runtime to belong to the acceptable range for this type. In particular : - for an array : the length must be <= (int'max / element_size) - for an open struct : the discriminant must be <= (enum'last) new allocator ------------- An allocator creates a heap object. new_allocator ::= "new" qualified_expression | "new" runtime_type The allocator returns a pointer to the new heap object. If the operating system is out of memory, the program halts with an error. The context of the allocator must be a pointer to T type, T is then used as the context for the evaluation of the qualified expression; or T must be compatible with the runtime_type. For an allocator of the form: "new" runtime_type, the runtime_type must not be an unconstrained open type. If the allocator has a qualified expression, it is the initial value of the heap object; otherwise the heap object is implicitly initialized with binary zeroes. Examples with initial expression: string^ name1 = new string'{'a', 'b', 'c'}; string^ name2 = new char[]'{'a', 'b', 'c'}; string^ name3 = new string'("Hello"); string^ name4 = new string'(name2^); string^ name5 = new char[len] ' { all => c }; string^ name6 = new string'(s); // length depends on s'length p = new Shape(CIRCLE) ' { x => x, y => y+1, radius => r }; p = new Shape ' (sh); // size depends on discriminant of sh Examples without initial expression: NODE^ p = new NODE; // pointer to a new NODE Shape^ sh = new Shape(d); // size is computed using a table string^ name7 = new string; // ERROR (open types not allowed) char[100]^ name8 = new char[100]; char[100]^ name9 = new string(100); string^ name10 = new string(100); string^ name11 = new string(len); The allocator uses the context to determine if it must allocate an open type or a non-open type : if the context is a pointer type designating an open type, an additional 4-byte header is allocated at the start of the heap object for storing the array length or the discriminant value. If the heap object requires alignment at an 8-byte boundary, then there is an additional 4-byte unused filler between the header and the heap object. Example: string^ buffer = new string(30); // context : open array. string(30)^ buffer2 = new string(30); // context : non-open array buffer2 = buffer; // ERROR: non-compatible =========================================================================== 6. Expressions ============== expression ::= sub_expression ["?" sub_expression ":" sub_expression] sub_expression ::= relation { "&&" relation } | relation { "||" relation } | relation { "&" relation } | relation { "|" relation } | relation { "^" relation } relation ::= simple_expression [ relational_operator simple_expression ] relational_operator ::= "==" | "!=" | "<" | ">" | "<=" | ">=" simple_expression ::= sum | sum "<<" sum | sum ">>" sum sum ::= term { adding_operator term } adding_operator ::= "+" | "-" term ::= unary_expression { multiply_operator unary_expression } multiply_operator ::= "*" | "/" | "%" unary_expression ::= { unary_operator | conversion } primary unary_operator ::= "+" | "-" | "!" | "~" | "&" | "*" | "--" | "++" conversion ::= "(" restricted_type_definition ")" restricted_type_definition ::= type_name {"*"} object_designator ::= unary_expression constant_expression ::= expression (design note: to ease parsing, allowed conversions are restricted to type names and unsafe pointers to type names) (design note: as in Ada, for expressions like A && B || C the syntax rules require the use of parentheizes to avoid ambiguities) operator priorities ------------------- Unary + - ! ~ & * -- ++ Term * / % Sum + - Simple Expression << >> Relation == != < > <= >= Conditional && || & | ^ Conditional Test ? : evaluation ---------- Context information is only used for evaluating allocators, aggregates, and parenthesized expressions thereof (the context is then an array, a struct, or a pointer type). Evaluation of operators (except "?:") never uses the context. (design note: as a consequence they can be evaluated in one compiler pass) Expressions are evaluated from left to right, as they appear in the source text. promotion of integer types -------------------------- Operators use only types int, uint or long. All smaller types are converted to one of these three types before a computation is done. integer operators : + - * / % << >> & | ^ ~ < > == <= >= a) if all operands are integer literals : - the operator is applied to its constant operands using int8 size; an error occurs in case of overflow (for operators "+", "-", "*", "/", unary "-") or division by zero (for operators "/", "%"). - if at least one operand is an integer literal with an "L" suffix then the result type is 'long'; otherwise the result is equivalent to an "integer literal" without suffix (or the result has type bool in case of a relational operator). b) if at least one operand has type long (either an object, or an integer literal with suffix L, or is an integer literal outside both the ranges of int and uint), then the other integer operand is promoted to type long, and the result has type long (or the result has type bool in case of a relational operator). if both operands are constants, the operator is applied to its constant operands and an error occurs in case of overflow (for operators "+", "-", "*", "/", unary "-") or division by zero (for operators "/", "%"). c) if at least one operand is a signed integer (tiny, short, int), and the other operand either : - has a signed integer type (tiny, short, int), or - is an integer literal within range of an 'int', - or there is no second operand, then all operands are promoted to type 'int' and the result type is 'int' (or the result type is bool in case of a relational operator). If all operands are constants, the operator is applied to its constant operands and an error occurs in case of overflow (for operators "+", "-", "*", "/", unary "-") or division by zero (for operators "/", "%"). d) if at least one operand is an unsigned integer (byte, ushort, uint), and the other operand either : - has an unsigned integer type (byte, ushort, uint), or - is an integer literal within range of an 'uint', - or there is no second operand, then all operands are promoted to type 'uint' and the result type is 'uint' (or the result is bool in case of a relational operator). If all operands are constants, the operator is applied to its constant operands and an error occurs in case of overflow (for operators "+", "-", "*", "/", unary "-") or division by zero (for operators "/", "%"). promotion of float types ------------------------ e) if all operands are floating-point literals : - the operator is applied to its constant operands using size 'double'; an error occurs in case of overflow (for operators "+", "-", "*") or division by zero (for operator "/"). - if at least one operand is a floating-point literal with a "D" suffix then the result type is 'double'; otherwise, if at least one operand is a floating-point literal with a "F" suffix then the result type is 'float'; otherwise the result is equivalent to a "floating-point literal" without suffix (or the result has type bool in case of a relational operator). f) if at least one operand has type double (either an object, or a floating-point literal with suffix "D", or a floating-point literal outside the range of a float, then the other floating-point operand is promoted to type double, and the result has type double (or the result has type bool in case of a relational operator). If all operands are constants, the operator is applied to its constant operands and an error occurs in case of overflow (for operators "+", "-", "*") or division by zero (for operator "/"). If none of the cases a) to f) applies, an error occurs (reason: signed and unsigned types cannot be mixed; integer and floating-point types cannot be mixed) As a result, - compile-time results can have types : int, uint, long, or constant integer literal, float, double, or constant float literal. - runtime results can have types : int, uint, long, float, double. Example: i + 1 // integer literal converted to int u + 1 // integer literal converted to uint 1 - 2 // is equivalent to an integer literal u + (-1) // ERROR: -1 not in range of uint Example: tiny t1, t2; short s = (short)(t1 * 16 + t2); // conversion required here - in case of two integer literals, the resulting computed value is equivalent to a new integer literal; the compile-time computation occurs internally with range 'long'. numeric_literal + numeric_literal -> numeric literal - a mix of int and uint causes an error. uint + int (cannot be added without explicit conversion) uint < int (cannot be compared without explicit conversion) - an integer literal is converted if in range. uint + numeric_literal -> numeric literal is supposed to be uint uint < numeric_literal -> same semantic of operators --------------------- operator ? : ------------- types: argument 1 : bool argument 2 : any type argument 3 : same base type as argument 2, but possibly different array or struct constraint if unconstrained context. result : same type as argument 2, constant constraint when unconstrained context and both arguments 2 and 3 have the same constant constraint, or when constrained context and either argument 2 or 3 has a constant constraint; runtime constraint otherwise. effect: if (arg1) return arg2; else return arg3; note: arguments 1, 2 and 3 are evaluated with context, however an array or struct context with runtime constraint is passed onto the arguments 2 and 3 as an unconstrained context. In effect, this means that an open_array_aggregate cannot appear within arguments 2 or 3 without runtime_type. operators && and || ------------------- types: left : bool right : bool result : bool effect: && : if (!left) return false; else return right; || : if (left) return true; else return right; operators & | ^ --------------- 1) for integer types: left : integer right : integer result : integer effect: perform integer promotion on left & right arguments, giving result. logical bitwise operators : and, or, xor. 2) for bool types: left : bool right : bool result : bool effect: logical operators : and, or, xor. operators == != ---------------- types: left : enumeration, integer, floating-point, pointer, function pointer, unsafe pointer. right : same as left (in case of integer and float types, perform promotion on left & right argument) result : bool effect: boolean comparison. operators < > <= >= ----------------------- types: left : enumeration, integer, floating-point, unsafe pointer. right : same as left (in case of integer and float types, perform promotion on left & right argument) result : bool effect: boolean comparison. (design notes: - no array/struct/union comparisons. - arrays are not always packed; - comparison can use different methods, prefer custom functions like strcmp, stricmp, memcmp. allowing it could create confusion between nul-terminated and non nul-terminated comparisons) operators << >> ----------------- types: left : integer right : integer result : promoted integer type effect: left and right shift operators. note: perform integer promotion on both arguments. If the result has type long or is promoted to an integer literal without suffix 'L', then the right argument must be in range 0 .. 63; otherwise it must be in range 0 .. 31; this is only checked at compile-time for a constant right argument; the result is undefined if the right argument is out of range at runtime. operators + - -------------- 1) for integer, floating-point types: left : integer or floating-point type right : same as left (perform integer/float promotion on both arguments) result : promoted type effect: addition, subtraction. 2) for enumeration types: left : enumeration right : integer (perform integer promotion on right argument) result : same as left effect: addition, subtraction. 3) for unsafe pointer types: left : unsafe pointer right : integer type, but type long is not allowed. result : same as left effect: convert the right operand to a signed int4, multiply it by the accessed type's size, possibly truncate the result to an int4; then add/subtract this offset to/from the left address operand. restriction: this operator is only allowed in unsafe regions (see 1.11). the designated type of the unsafe pointer type must not be an open type. the result is undefined if the offset applied to the pointer is outside an int4 range. (design note: this garantees an identical behaviour on 32- and 64-bit platforms) 4) for unsafe pointer ("-" only) types: left : unsafe pointer right : same as left result : uint effect: subtraction of both operands, conversion to an int4, division by the size of the accessed type, conversion to an uint4. A runtime error occurs if the accessed type has size zero. restriction: this operator is only allowed in unsafe regions (see 1.11). the designated type of the unsafe pointer type must not be an open type. the result is undefined if it doesn't fit in the uint result type. (design note: this garantees an identical behaviour on 32- and 64-bit platforms) 5) for constant string ("+" only) types: left : constant string/wstring/char/wchar right : constant string/wstring/char/wchar result : constant string or wstring effect: concatenation of constant strings. if any argument is wide, then the result will have type wstring, otherwise the result has type string. example: "Marie-Jeanne is a friendly person." + (CR+LF) + "She often comes around to see us." + CR+LF binary operators * / ---------------------- types: left : integer or floating-point type right : same as left (perform integer/floating-point promotion on left & right arguments, giving the result type) result : same as left effect: multiplication, division. note: a compile-time error occurs in case of overflow or division by zero. note: a runtime error occurs only for integer types in case of division by zero. note: overflow in division can occur for example in int'min/(-1). operator % ----------- types: left : integer right : same as left (perform integer promotion on left & right arguments, giving the result type) result : same as left effect: remainder of division. note: a compile-time error occurs in case of overflow or division by zero. note: a runtime error occurs in case of division by zero. note: if the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a; this is often called "truncation toward zero". unary operators + - ------------------- types: argument : integer, floating-point (perform integer/floating-point promotion) result : same as argument effect: + : no effect. - : negation (for integer, causes a compile-time error in case of overflow) unary operators ~ ----------------- types: argument : integer (perform integer promotion) result : same as argument effect: integer not (inverse all bits) unary operator ! ----------------- types: argument : bool result : bool effect: inverse bool value (zero becomes one and non-zero becomes zero). unary operator & ----------------- types: argument : any object of a non-jagged type; it can be a constant object; it can be an object of an open type. result : if the object has an array type, then the result value has type 'unsafe pointer to the array's element type'; otherwise it has type 'unsafe pointer to the object's type'. The unsafe pointer always points to the first byte of actual data of the object. effect: take the address of the object. restriction: this operator is only allowed in unsafe regions (see 1.11). unary operator * ---------------- types: argument : value of type unsafe pointer. result : object designated by the unsafe pointer. effect: the argument is evaluated, a run-time occurs if it has a null value; the result is the object accessed by the pointer value. restriction: this operator is only allowed in unsafe regions (see 1.11); the designated type of the unsafe pointer must not be an open type. Example: *p = 2; unary operators --/++ --------------------- types: argument : an object having an integer or enumeration type, or an unsafe pointer type. result : value of object designated by the unsafe pointer. effect: The object will be incremented / decremented; then it is converted into a value. (in case of an unsafe pointer the pointer value will be incremented / decremented by the size of the accessed type) A "read" followed by a "write" operation are performed on the object. (note: this may cause an error in case of a not-yet initialized variable or parameter of mode 'out'). restriction: in case of unsafe pointer, this operator is only allowed in unsafe regions (see 1.11). Example: --i; conversion ---------- The evaluation of the unary_expression of a conversion is done without any context information. Conversion is allowed for the following types: source type target type ----------- ----------- integer -> integer enumeration -> integer floating-point -> integer integer -> enumeration floating-point -> floating-point integer -> floating-point unsafe pointer -> unsafe pointer Conversions can imply a truncation. For a constant unary_expression a compilation error occurs if the constant source value does not fit in the target type. However, no runtime check is done in case of a non-constant unary_expression. Conversion to an enumeration type might create a value for which no enumeration literal exists. Such cases cause a compilation error for a conversion of a constant expression, but they are not detected at runtime for a non-constant expression. Example: bool b1 = (bool)2; // causes a compilation error bool b2 = (bool)i; // copies lower byte of i into b2 note: conversion from float to integer truncates (it does not round). Example: int i; char *p; p = (char *)&i; (design note: further conversions are possible using the attribute 'byte) (design note: 3D requires fast double/float to int/short conversions. 'C' example: // both for signed/unsigned - uses 16-bit decimal precision __forceinline int trunc_double_to_int (double val) { val = val + (68719476736.0*1.5); return *((int *)(&((char *)&val)[2])); } end design note) implicit conversions -------------------- Implicit conversions have no syntax; they occur automatically wherever an expression is used (except for a parenthesized expression used in the primary syntax rule). The following implicit conversions exist : - conversion from an integer to another integer provided the second integer includes the range of values of the first integer. - conversion from a floating-point to a larger floating-point type. - conversion from an integer literal to an integer type, provided the value fits into the type. - conversion from a floating-point literal to a floating-point type, provided the value fits into the type. - conversion from the literal null to a pointer, function pointer or unsafe pointer type. Example: short s = 2; int i = s; Several types might be acceptable for an expression, for example both int and uint expressions are allowed within an array index or slice. =========================================================================== 7. Statements ============= statements ::= { statement } statement ::= simple_statement ";" | compound_statement simple_statement ::= null_statement | clear_statement | assignment_statement | pre_or_postfixed_statement | function_call_statement | return_statement | break_statement | continue_statement | free_statement | assert_statement | sleep_statement | abort_statement | code_statement | unused_statement compound_statement ::= block_statement | if_statement | switch_statement | while_statement | for_statement (implementation note: a declaration might begin with a type_name, a statement never begins with a type_name) 7.1. Null statement ------------------- null_statement ::= The null statement has no effect. 7.2. Clear statement -------------------- clear_statement ::= "clear" object_designator { "," object_designator } Each object_designator must denote an object. The clear statement fills all specified objects with binary zeroes. Objects are not allowed to denote declared constants, parameters of mode 'in', or references of those. The objects cannot have a jagged type. (note: thus a parameter of type object[] is not allowed; however generic, opaque and limited types are allowed) 7.3. Assignment statement ------------------------- assignment_statement ::= object_designator assignment_symbol expression assignment_symbol ::= "=" | combined_assignment_symbol combined_assignment_symbol ::= one of { += -= *= /= %= <<= >>= &= |= ^= } Example: variable = expression; The object_designator must denote a non-constant declared object, a parameter of mode out or ref, a heap object, an unsafe object, an array element/slice, a struct field, or a reference of those. The object is not allowed to have an opaque, limited or jagged type. (note: thus a parameter of type object[] is not allowed) The evaluation of an assignment statement consists of the evaluation of the object designator, followed by the evaluation of the expression, a check that the expression satifies the object's length or discriminant constraint, if any, and finally, a copy of the expression's value into the object. A statement like : v -= e; is equivalent to : v = v - e; except that the address of v is only evaluated once. The same equivalence applies to the other combined assignment symbols. After applying 'implicit conversions' (see above), the object designator and the expression must be type compatible. type compatibility ------------------ Two types are compatible either: - if their types are defined by the same declaration, - in case of arrays or open arrays if they have the same element type; if both arrays have a constant length they must have the same length to be compatible; however, one array can be open and the other non- open; - in case of an open struct (with or without discriminant constraint), if they have the same open struct type; if both open structs have a constant discriminant they must also have the same discriminant value to be compatible; however one can have a discriminant constraint and not the other. - in case of pointers or unsafe pointers if they have the same accessed type (for a designated type of an open type, both designated types must be either open, or both must be constrained with the same length or discriminant value). - in case of function pointers if both functions are compatible. function compatibility ---------------------- Two functions are compatible if they have the same return type, function options ("callback", "extern"), number of parameters, and, for each parameter : same mode, identical type. identical types --------------- Identical types means compatible types as well as: . in case of arrays they must both be either open or non-open. . in case of open structs they must both either have or not have a discriminant constraint. The keywords "public" and "inline", the function name and parameter names or any default expressions need not match. Array/Open Array Assignment --------------------------- Example: char[80] line1, line2; line1 = line2; // copy full array variable In case of an array type object, a check is done that both lengths match, either at compile-time or at runtime for non-constant array lengths. The size of the object to copy is computed by multiplying the array length by the element size. Assignment of overlapping objects --------------------------------- Example: char name[12]; name[0:9] = "tar sauce"; name[3:9] = name[0:9]; Memory areas involved in assignment can overlap, for example when copying array slices/parameters/references, byte attributes, or objects that have been converted from/to packed types. In the above example, name must receive the value "tartar sauce", not "tartartartar". (implementation note: if memory areas might overlap, then the runtime must take care of copying memory in a safe way, i.e. from higher to lower addresses if the target variable has a larger address than the source, and from lower to higher addresses otherwise). Open struct assignment ---------------------- In case of an open struct type, a check is done that both discriminant values match, either at compile-time or at runtime for non-constant discriminant values. The size of an open struct object is computed by loading the size from a look-up table indexed by discriminant value. (implementation note: the discriminant value cannot be out of range because it is constant for a declared object and checked at runtime by the allocator for a heap object). 7.4. pre or postfixed_statement ------------------------------- pre_or_postfixed_statement ::= unary_expression A pre_or_postfixed_statement is an unary expression that must either begin with the token "++" or "--", or end with a postfixed_object. It must not begin with the token "*". Example: --(*p)^.count; b[k]++; *b[k]++; // error: not a valid statement The effect of the statement is to increment/decrement the object's value; overflows are ignored; the result value is ignored. 7.5. Function call statement ---------------------------- function_call_statement ::= ["(" "void" ")"] function_call Example: unlink (filename); (void) unlink (filename); p[k](i); f()(i); f(k)^.func(i); (*p)('a'); The optional prefix (void) is only allowed for a function that returns a value, it has no effect other than explicitely showing that we wish to ignore the return value coming from the function call. 7.6. Return statement --------------------- return_statement ::= "return" | "return" expression Return statements must not have an expression if the function has return type void, otherwise they must have an expression compatible with the function's return type. The expression must not be of type array, struct, union, generic or opaque. Before leaving a function, all heap objects referenced by reference declarations or used in the return expression, as well as all large local variables allocated on heap, must be released. 7.7. Block statement -------------------- block_statement ::= "{" declarations statements "}" Block statements can appear on their own, but also inside "if", "while" and "for" statements. Identifiers declared in the block statement must be distinct from all identifiers declared in surrounding block statements or local variables. Example: void main() { F(); G(); { // block statement H(); I(); } } (design note: if any local declaration allocates stack memory dynamically, then this memory must be freed when leaving the block statement either with the normal flow or with a break, continue or return statement) Before leaving a block either with the normal flow or as consequence of a break, continue or return statement, all heap objects referenced by reference declarations as well as all large local variables allocated on heap, must be released. 7.8. If statement ----------------- if_statement ::= "if" bool_parenthesized_expression true_statement ["else" false_statement] bool_parenthesized_expression ::= "(" expression ")" true_statement ::= statement false_statement ::= statement Statements can contain inner block statements with declarations. Example: if (b) { // inner block statement int a; } The bool parenthesized expression must have type bool. If an if-statement's true_statement contains, possibly nested within for-, while-, or if-statements (but not within switch- and block-statements), another if-statement that has an else part, then a compilation error occurs because the else is ambiguous. (design note: this rule prevents the well-known "dangling else" problem) Example: if () while (exp) if () { } else // else is ambiguous and causes a compilation error { } 7.9. Switch statement --------------------- switch_statement ::= "switch" "(" expression ")" "{" { case_alternative } default_case_alternative "}" case_alternative ::= "case" constant_expression ":" {"case" constant_expression ":"} statements default_case_alternative ::= "default" ":" statements Example: switch (e) { case 0: { f(); } break; case 1: { g(); } break; default: { h(); } break; } The expression must have an integer or enumeration type. All constant expressions must have distinct values, and must have the same type than the expression in the switch statement. notes: - the syntax allows several constant expressions to be specified for a case alternative. - the syntax imposes to specify a "default" case as last alternative. More than one break statement are allowed in the statements of a case alternative. The last statement of a case alternative must be either a break, continue, return, abort statement, or a block statement having one of these statements as last statement, recursively. (note: it is not allowed to "fall through" into the next case) (design note: a switch should be implemented as a branching tree where the final leaf nodes can be jump tables for nearly contiguous values) 7.10. While statement --------------------- while_statement ::= "while" bool_parenthesized_expression statement Example: while (count < 10) count++; 7.11. For statement ------------------- for_statement ::= "for" "(" [assignment_statement {"," assignment_statement}] ";" [bool_expression] ";" [for_next_statement {"," for_next_statement}] ")" statement bool_expression ::= expression for_next_statement ::= assignment_statement | pre_or_postfixed_statement Example: for (i=0; i"] { function_specification ";" } A generic package is a package template in which some types (called generic types) are parametrizable. A generic formal part declares a list of zero or more generic types, as well as a list of zero or more generic function declarations; these are declared inside the scope of the generic package. Example (a simple algorithm for sorting) : // bubble.h generic // generic type ELEMENT int compare (ELEMENT a, ELEMENT b); // return -1 if ab package BubbleSort void sort (ref ELEMENT table[]); end BubbleSort; // bubble.c package body BubbleSort public void sort (ref ELEMENT table[]) { int i, j; ELEMENT temp; for (i=1; i0; j--) { if (compare (table[j-1], table[j]) <= 0) break; temp = table[j-1]; table[j-1] = table[j]; table[j] = temp; } } } end BubbleSort; The package body of a generic package has the same syntax as a non-generic one. visibility ---------- The identifier of a generic package declaration is made visible as soon as it appears. Inside the generic package declaration and its body, the generic package name denotes a normal (non-generic) package, i.e. instantiation of the package itself cannot occur in itself or its body (however, instantiations of other generic packages are allowed). Outside the generic package declaration and its body, the generic package name can only be used in a generic instantiation, and the entities declared within the package declaration are not visible. Packages can be nested within other packages, whether generic or not. Example (an algorithm for managing a balanced tree) : // btree.h generic int compare (KEY a, KEY b); package BalancedTree struct BALTREE; // opaque type int create (out BALTREE bt); int close (ref BALTREE bt); int insert (ref BALTREE bt, KEY k, ELEMENT e); int remove (ref BALTREE bt, KEY k); int update (ref BALTREE bt, KEY k, ELEMENT e); int retrieve (ref BALTREE bt, KEY k, out ELEMENT e); end BalancedTree; In the above examples, KEY and ELEMENT are generic types, Swapping, BubbleSort and BalancedTree are generic packages. Inside the generic package, a generic type has the same properties that all the following types have in common : (enumeration, float, integer, (non-open)array, (non-open)struct, union, pointer, function pointer). A generic type can therefore be used for : - declaring typedefs, structs, variables, parameters, heap objects of generic types. - assignement to a variable of a generic type. - creating non-constant aggregates containing elements or fields of a generic type. - a clear statement on an object of a generic type. However, the following is not allowed : - declaring constants (reason: there are no constant expressions of generic type). - a function's return type cannot be a generic type. A generic type is not packed (reason: the actual type can be a pointer or a non-packed type) and it is not limited (the actual type cannot be a limited type). Instantiation of generic packages --------------------------------- package_instantiation ::= "package" identifier "=" "new" generic_package_name ["(" generic_association {"," generic_association} ")"] ";" generic_association ::= identifier "=>" type_definition | identifier "=>" function_name A generic instantiation creates a copy of a generic package by replacing each generic type by an actual type and each generic function declaration by an actual function declaration. The new copy is a normal (non-generic) package, i.e. it cannot be instantiated again. The generic association list must be given in the order in which the identifiers appear in the generic package declaration. Example : int compare_int (int a, int b) { if (a < b) return -1; if (a > b) return +1; return 0; } package Sort_int = new BubbleSort (ELEMENT => int, compare => compare_int); void main() { int table[5] = {2, 19, 3, 9, 4}; sort (ref table); // must be written Sort_int.sort if ambiguous } The actual type of a generic type : - can be any of : integer, float, enumeration, array, struct, union, pointer, function pointer, unsafe pointer, generic. - can be packed or non-packed. - must not be an open or jagged type (reason: a generic type has a defined size). - cannot be an incomplete type. - must not be an opaque or limited type (reason: assignment is allowed for a generic type). - can be another generic type. This can however cause circularities : a compilation error occurs if several generic packages mutually instantiate each other. In a package instantiation, the generic functions must be compatible with the actual functions (after replacement of any formal parameter generic types by the actual types). (see assignment statement for a definition of compatible functions). generic implementation notes ---------------------------- Generics have huge implications on the compiler architecture : a generic package's syntax and semantic must be fully checked so that no further compilation errors (except stack overflow or type'size is too large) can occur at instantiation. Any mutually recursive instantiation of package bodies causes a compilation error. =========================================================================== 10. Compilation Units ===================== source_file ::= h_source_file | c_source_file h_source_file ::= { import_clause } interface_declarations c_source_file ::= { import_clause } body_declarations interface_declarations ::= { declaration | early_declaration } body_declarations ::= interface_declarations { early_declaration | later_declaration } early_declaration ::= function_declaration | package_declaration | generic_package_declaration | package_instantiation later_declaration ::= function_body | package_body 10.1. Units ----------- Two kinds of source files exist : - .h : interface - .c : implementation A .h source file can appear alone. A .c source file can, but need not, appear alone (for the main program). When .h and .c source files appear together, they must be stored in the same directory. 10.2. Startup ------------- The main entry point in a program is a function called 'main' declared in a .c source file : - there must be only one function with this name. - it must have return type void or int. - it must have either no parameters or a single parameter of mode in and type string[]. Example: // hello.c from std use console; void main() { printf ("Hello World !\n"); } 10.3. Import clauses -------------------- import_clause ::= [ "from" library_name ] "use" source_name [ "as" identifier ] { "," source_name [ "as" identifier ] } ";" library_name ::= identifier { "." identifier } source_name ::= { ".." "/" } identifier { "/" identifier } All identifiers appearing in library names and source names must be in lower case. (reason: to avoid compatibility problems on different file systems) Two forms of import clauses exist : a) import from .LIB library files --------------------------------- Example: from std use io, strings, sys/signal; from be.msc.webcam use cam/webcam; note: the library files "std.lib" and "be.msc.webcam.lib" must be present in one of the directories listed in the config file mk.cfg note: A ".." prefix is not allowed for source names when the import clause features the keyword 'from' (reason: all files are stored within the library's root directory) b) import from .H source files ------------------------------ Example: use util, util2; // "util.h" and "util2.h" in current directory use cam/webcam; // "webcam.h" in subdirectory "cam" use ../interface; // "interface.h" in directory above this one. use /src/testing; // ERROR: absolute paths are NOT allowed. A .h and probably a .c file must be available in the specified directory. note: absolute paths to .H files are not allowed because the project can then not be moved easily from one directory to another. Therefore, all pathnames must be relative. 10.4. Aliases ------------- In case two units with the same last name are imported (possibly from two different directories), an error occurs. An alias name can then be specified to disambiguate them. Example: from std use bintree; from std use new/bintree as bintree2; bintree2.insert ( .. ); 10.5. Effect of import ---------------------- An import of unit B makes B visible in the outer scope, at the same level as the currently compiled unit. However, it does not make visible a unit C imported by unit B, although unit C with its entities is loaded too, but in some unreachable scope. In short : only directly imported units are visible. If several interface (.h) compilation units depend on each other in a circular fashion, a compilation error occurs. A compilation error occurs if a unit imports itself or if the same unit is imported twice, even using different aliases. The order of imports is not important (reordering imports cannot create errors). 10.6. Compiling --------------- Only the .h / .c source files and the make configuration file (mk.cfg) must be provided to compile a project. Example: C:\test> mk hello The make utility (mk.exe) will compile "hello.c" and create "hello.exe". Dependencies are automatically traced and the corresponding source files are recompiled if they need it. A manual make file is not necessary. All source filenames must be in lower case. 10.7. Make configuration file (mk.cfg) -------------------------------------- The make configuration file must be present in the current directory, or its filename must be supplied to the compiler through a compiler option; it contains all compiler options, global compilation symbols and a list of directories to search for imported libraries : - memory model (32 or 64 bit) (default is 64) this has an effect on the size of pointer, function pointer and unsafe pointer types. - flag indicating if pointers are checked using a safe tombstone mecanism (default is yes) - flag indicating if array indexes are checked (default is yes) - flag indicating if assertions are checked (default is yes) - a list of directories in which the compiler searches for imported libraries. The following global symbols are predefined and cannot be redefined in mk.cfg : WINDOWS (1 for windows compiler version, 0 otherwise) UNIX (1 for any unix compiler version, 0 otherwise) MEM32 (1 for 32-bit memory model, 0 otherwise) MEM64 (1 for 64-bit memory model, 0 otherwise) Example: // mk.cfg [symbols] ; global compilation symbols (must be in upper case) DEBUG = 1 T24 = 1 [options] memory_model = 32 ; 32 or 64 (default is 64) pointer_check = yes ; default is yes array_check = yes ; default is yes assertion_check = yes ; default is yes [library] ; libraries are searched in the following directories : dir = ./ ; default if mk.cfg is not provided dir = c:/safe-c/lib/ 10.8. Linker ------------ The linker instantiates generic package bodies, replaces generic and opaque types by their actual types, performs function and program-wide inlining, removed unused code and data blocks, and merges identical code and data sequences. Finally all units are linked together. =========================================================================== 11. Implementation Issues ========================= 11.1 Data Alignment ------------------- Structs that don't use the keyword 'packed' have their fields aligned. Alignement minimizes the number of bus cycles the CPU needs to access an object in memory. Each type has : - a size - an alignement requirement (1, 2, 4 or 8 bytes) int1 aligned at 1 int2 aligned at 2 int4 aligned at 4 int8 aligned at 8 float aligned at 4 double aligned at 8 addresses aligned either at 4 or 8 bytes, depending on their size An array's alignment requirement is the same as that of its element type. A struct's alignment must be computed as follows : - padding bytes might be added before each struct field to align its offset to the field's type alignement requirement. - the alignement requirements of the whole struct is computed as the largest alignement used on all fields of the struct; trailing bytes are added at the end of the struct to align it to these requirements so that the struct can be used as an array element without the need of further padding. Example: struct MyData // size = 6 bytes { // alignement at 2 short Data1; short Data2; short Data3; } note: even if the struct is used as array element, all fields will still be aligned. -------- struct MixedData // size = 12 bytes { // alignement at 4 char Data1; char Padding0[1]; // align 'short' at 2-byte short Data2; int Data3; char Data4; char Padding1[3]; // add trailing padding } It is important to note that the last member is padded with the number of bytes required to conform to the largest type of the structure. In this case 3 bytes are added to the last member to pad the structure to the size of an int. As a result all fields will still be aligned if the struct is used as an array element. -------- struct Both // size = 36 bytes { // alignment at 4 MyData a[3]; // 18 bytes, align at 2 short filler; // add 2-byte filler for b MixedData b; // 12 bytes, align at 4 char c; // 1 byte char filler2[3]; // add 3 bytes to align to all fields } -------- note: alignement requirements are also used to align the addresses of global and local variables. note: in 64-bit CPU's pushes and pops on the stack are always in 8-byte blocks, and pointers are 8 bytes wide. note: heap blocks are aligned at 16 bytes, in order to help performance; the stack pointer is aligned at 16 bytes, except for leaf functions. Packed types ------------ Objects of the following types are "packed" : - enumeration - integer - floating-point - array with "packed" element type - struct with "packed" qualifier - union (all union fields must be packed) - unsafe pointers Objects of the following types are NOT packed : - array with non-packed element type - struct without "packed" qualifier - jagged array or record types - pointer - function pointer Opaque and generic types depend on their actual type so they are considered non-packed. A compilation error occurs if a packed struct type or an union type has a field of a non-packed type. Advantage of packed types ------------------------- - packed types can be converted to byte[] using attribute 'byte. - packed types can be passed to a formal parameter of type byte[] (in a function call) - packed types can be passed to a formal parameter list of type object[] (in a function call) - a formal parameter of a packed type can receive an actual parameter of type byte[]. Pointers and function pointers are not allowed to appear in packed structures and be passed to I/O routines. (reason: pointers must always be aligned in memory so that access to them is atomic, otherwise several threads accessing the same pointer variable might corrupt it) 11.2 Implementation of pointers and heap objects ------------------------------------------------ Pointers are implemented as 4 or 8 bytes, depending on the machine architecture (32 or 64-bit). (design note: programmers expect that pointer updates occur atomically, without explicit locking) Internally, pointers don't contain the address of the actual heap object. Instead, they point at Tombstone structures : struct Tombstone { uint4 count; // nb of references into the heap object uint4 type; // unique nr of designated type (ex: int, char[],.) ADDRESS pdata; // address of heap object } Tombstone structures are never freed once allocated; when the corresponding heap object is freed they are put on a free list for a later allocation. (design note: A programmer's mistake can cause a tombstone that was freed and reused for a new allocation to still be referenced by an old pointer that kept the Tombstone's address. However, this cannot corrupt memory because each dereferencing checks that the type designated by the pointer matches the type designated by the Tombstone - see field .type) (implementation note: once freed, a tombstone should be reused as late as possible to avoid reusing an old pointer value, it should therefore be put at the end of the free list) (design note: a .count field is necessary, otherwise a thread might free a heap object that is still being referenced by a parameter or a reference) Pointer dereferencing occurs in two steps : (address p, type T) -----> tombstone ----> heap object In the first step, the tombstone is incremented to prevent it being freed, its type is checked, and finally the real heap object's address is returned : begin dereferencing (address p, type T): 1- lock inc p^.count; // hardware check p!=null detects null ptr 2- assert (p^.type == T); // checks type of Tombstone 3- return p^.pdata; // return address of heap object When access to the heap object is finished, then the tombstone structure is decremented again so that it can be freed when count reaches zero : end dereferencing (address p): 1- lock dec p^.count; (design note: this method requires four operations on each dereferencing, which is very costly, maybe more than garbage collection. The advantage is that it is scalable, as long as the heap's critical section is no bottleneck, and it especially avoids program freezing at arbitrary points that occur when using garbage collection. A mitigating factor is also that pointers are not so often used as in C or in object languages : parameter passing and arrays don't use them. Aggregates and references can be used to minimize the number of dereferencing operations) (design note: using a compiler optimization switch, the tombstone mecanism can be disabled, so that the heap object's address is stored directly into the pointers; however, this can only be done for all the program, including all libraries) Example: dereferencing: ; (eax contains the pointer value) mov [ebp]32,eax ; save address of tombstone into TEMP variable lock inc [eax] ; increment count (thread-safe)-> error if null cmp #563,4[eax] ; compare type with constant jne error mov eax,[eax]8 ; get actual address of heap object undereferencing: mov eax,[ebp]32 ; load again address of Tombstone lock dec [eax] ; decrement count 11.3. Run-Time Errors --------------------- When a runtime error occurs, the runtime will do the following actions : 1) stop all threads 2) create a file CRASH-REPORT.TXT in the current directory and save the following : . name of the program, link date+time, current date+time; . operating system version, . available ram memory, available disk memory, screen resolution . error context (evaluation stack, registers, CPU error nr), . all global variables, the stack of all threads, all heap objects. 3) halt the program. The file's size should not exceed 10 MB (large heap objects should be reduced). If the program is part of a distribution, it should add the following steps at program start : - check if there's a CRASH-REPORT.TXT file in the current directory, - ask the user if he allows to send debugging data to the developper, - if yes, zip the file and send it per network to the developer's website, - delete CRASH-REPORT.TXT A debugger should be able to create a full context (callstack, ..) so that the problem can be examined and corrected easily. =========================================================================== 12. Libraries ------------- The standard library "std" contains the following .h units : aes : AES cryption arithm : simple arithmetic (abs, min, max) base64 : base64 encoding bintree : balanced binary trees (AVL) calendar : calender clipboard : clipboard console : writing/reading on the dos console box crc : md5, sha2, adler and crc checksums db : database (insert/delete/update/retrieve, transations on .db files) des : DES cryption directx : interface for DirectX draw : draw in a memory buffer (line, circle, text, image) exception : exception handling files : text and binary files, directories, disk. ftp : file transfer protocol (ftp) - client and server gui : user interface (windows, listbox, etc ..) with a main window using directx component hash : hash table, dictionnary (key -> value) http : internet client and serveur (login, virtual html pages, ..) image : compression and decompression (jpg, gif, png, ..), image treatment inifile : reading of .ini files integer : operations on large unsigned integers linear_algebra : operations on vectors and matrices math : math functions (sin, cos, ..) media : playing video and audio files odbc : sql database opus : audio compression printer : printer random : generation of random numbers rsa : asymmetric encryption set : interval sets sorting : sorting of arrays (bubblesort, heapsort, quicksort) sound : microphone and speaker strings : characters and strings (strcpy, sprintf, ..) system : system information (PC name, memory, nb of CPUs) tcpip : network connection by tcpip (ipv4 and ipv6) text : storage of a text composed of lines thread : parallel threads, synchronisation, timers tracing : trace files url : internet url utf : conversion between ascii, utf-8 and utf-16 webcam : interface for video source w3d : interface for creating 3D windows, uses DirectX 9 win : user interface to create windows, buttons, listboxes, .. xml : xml reader zip : data compression : zip, unzip =========================================================================== Appendix A : International support ---------------------------------- Traditional programs used to store characters in single bytes because they were developped mainly for the US or single European countries. This has changed dramatically since Unicode normalized all character sets. Currently, the Unicode standard defines 1,114,112 characters worldwide. A.1 Unicode for intern use -------------------------- Although it may be tempting to define a modern 32-bit character type for storing Unicode (some Unix systems have done it), there are two reasons why a 16-bit character type is preferable : 1- Unicode has placed all modern living languages within the first 65536 codes (U+0000 to U+FFFF). This includes European, African, Arabic, Asian, Japanese and Chinese. All other symbols belong to old, obsolete or rare historical languages (f.e. Egyptian Hieroglyphs), which will probably be supported only by a handful of specialized applications. This means that the vast majority of applications need only a 2-byte character type. 2- Unicode text display engines generally cannot display single characters; many characters are linked with the previous and next ones (ligature formation), thus storing each symbol in 32-bit won't be a big help as the display engine needs access to the full text string anyway. UTF-16 seems thus the best choice for storing characters internally. A.2 Unicode for extern use -------------------------- When transfering Unicode between computer systems, a packing in so-called "UTF-8 character strings" is widely used on the internet. In UTF-8, each Unicode character is encoded in 1 to 4 bytes : Unicode Character Byte1 Byte2 Byte3 Byte4 ----------------- -------- -------- -------- -------- 0 to 127 0xxxxxxx 128 to 2047 110yyyxx 10xxxxxx 2048 to 65535 1110yyyy 10yyyyxx 10xxxxxx 65536 to 1114111 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx Converting between UTF-16 and UTF-8 is straightforward and available in standard library routines. =========================================================================== Appendix B : Unicode conversion for source files ------------------------------------------------ When a source file contains one of the following three headers, the header is removed and the corresponding format is used : EF BB BF : UTF-8 FF FE : UTF-16 little-endian FE FF : UTF-16 big-endian Otherwise (if the file contains no header), - if the first byte is zero, UTF-16 big-endian is assumed. - if the second byte is zero, UTF-16 little-endian is assumed. In all other cases, including the case in which the file is less than two bytes long, ANSI (8-bit characters) is assumed. Furthermore : for UTF-8 : - byte sequences that don't match with the bit patterns in the following UTF-8 table are illegal : UTF-8 table ----------- Unicode Character Byte1 Byte2 Byte3 Byte4 ----------------- -------- -------- -------- -------- 0 to 127 0xxxxxxx 128 to 2047 110yyyxx 10xxxxxx 2048 to 65535 1110yyyy 10yyyyxx 10xxxxxx 65536 to 1114111 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx - the following is illegal : . 2-byte sequences yielding a character value < 128. . 3-byte sequences yielding a character value < 2048. . 4-byte sequences yielding a character value < 65536. for UTF-16 : - every two bytes are assembled into a WORD, in little-endian or big-endian format. - if the WORD is NOT in range 0xD800 to 0xDFFF, it yields the resulting character value; otherwise it indicates the first of a surrogate pair of two WORDs : . the first WORD must be in range 0xD800 to 0xDBFF; . the second WORD must be in range 0xDC00 to 0xDFFF. . Both WORDs yield a resulting character value of : 0x10000 + ((WORD1 - 0xD800) << 10) + (WORD2 - 0xDC00) for both UTF-8 and UTF-16 : - truncated byte sequences are illegal. - character values between 0xD800 and 0xDFFF are illegal. - character values where the lower 16-bits equal 0xFFFE or 0xFFFF are illegal. - character values larger than 0x10FFFF are illegal. Any illegal byte sequence generates a compilation error. Additional Note: UTF conversions routines usually convert illegal byte sequences using a '?' character as replacement; the '?' character is usually illegal in filenames, so it is a safe choice to avoid that misformed sequences are exploited by hackers. Illegal byte sequences should never be deleted or truncated, as this could bypass security checks performed before the conversion. ===========================================================================