The gcc compiler has always worked by writing out assembly code in text format. The assembler reads this text file to produce an object file. Most compilers work this way, although there have been some exceptions such as the MIPS compiler used on Irix.
Clearly this process of producing text and then parsing it again is inefficient. When the GNU assembler was first written, it read a text format that was very precisely specified. Spacing had to be exact and generally omitted, no comments were permitted, etc. A special directive, #APP, was used to tell the assembler to go back to a normal parsing mode, which was implemented using an input scrubber which converted more free-form text into the precise form. The idea was that the compiler would generate this precise specification, reducing the parsing costs in the assembler, while still permitting users to write assembler code by hand for use in asm statements.
This idea still exists in the assembler, but it has been lost to some extent. By default, today, the assembler will accept input with arbitrary spacing, comments, etc. This can be disabled using the -f option. The assembler used to also accept a #NO_APP directive to go back to precise mode, but that is now only effective if it appears in the first line of the file. On GNU/Linux, the compiler does neither of these by default. Oddly, it does generate #NO_APP after an asm statement, where the assembler ignores it. The assembler is also now built in a mode which permits some whitespace even in precise mode; for a while that whitespace was not permitted.
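To see the markers for yourself, here is a minimal sketch. The function name add_with_barrier is made up for illustration, and the exact assembler output depends on the target and the gcc version, so treat the description below as typical rather than guaranteed.

```c
/* Illustrative only: add_with_barrier is a hypothetical name.
   The point is the asm statement, not the arithmetic. */
int add_with_barrier(int a, int b)
{
    int sum = a + b;

    /* An asm statement with an empty template. The "+r" constraint
       tells gcc that sum is read and written in a register, so the
       value must be materialized at this point. */
    __asm__ volatile ("" : "+r" (sum));

    return sum;
}
```

Compiling this with gcc -S on an x86 GNU/Linux system typically produces a .s file in which the (empty) asm text is bracketed by a #APP line before it and a #NO_APP line after it, even though nothing at the top of the file put the assembler into precise mode.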
What this tells me is that nobody cares very much about how long it takes the assembler to parse text. This is not unreasonable, since the assembler is in fact quite fast, and is not a major part of overall compilation time. Still, time spent parsing in the assembler is time lost.
In 1997 David Henkel-Wallace at Cygnus proposed converting the GNU assembler into a library which would be invoked directly by the compiler. That plan was never implemented in gcc. In today’s world it no longer makes much sense. Running the compiler and assembler as separate processes takes better advantage of today’s multicore machines. It’s hard to make a compiler multi-threaded; let’s not take away the one limited form of multi-threading that we already have.
What does make sense is using a structured data format, rather than text, to communicate between the compiler and the assembler. In gcc’s terms, the compiler should generate insn patterns with associated operands. The assembler should piece those together into its own internal data structures. In fact, of course, ideally gcc and the assembler would use the same internal data structure. It would be interesting to design such a structure so that it could be transmitted in a file, or over a pipe, or in shared memory.
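As a rough sketch of what such a format might look like, the record below carries a pattern number and a few operands, and is encoded as a fixed little-endian byte sequence so that it could be written to a file or a pipe. Everything here is an assumption made for illustration: the names insn_record, insn_encode and insn_decode, the three-operand limit, and the byte layout do not come from gcc or the GNU assembler.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_OPERANDS 3              /* arbitrary limit for this sketch */

enum operand_kind { OPERAND_REG = 1, OPERAND_IMM = 2 };   /* hypothetical */

struct operand {
    uint8_t kind;                   /* one of enum operand_kind */
    int64_t value;                  /* register number or immediate */
};

struct insn_record {
    uint16_t pattern_id;            /* hypothetical index into a shared pattern table */
    uint8_t n_operands;
    struct operand ops[MAX_OPERANDS];
};

/* Encoded layout, all little-endian: 2 bytes pattern_id, 1 byte
   n_operands, then 9 bytes per operand (1 byte kind, 8 bytes value).
   Returns the number of bytes written, or 0 on error. */
size_t insn_encode(const struct insn_record *in, uint8_t *buf, size_t cap)
{
    if (in->n_operands > MAX_OPERANDS)
        return 0;
    size_t need = 3 + (size_t)in->n_operands * 9;
    if (cap < need)
        return 0;

    buf[0] = (uint8_t)(in->pattern_id & 0xff);
    buf[1] = (uint8_t)(in->pattern_id >> 8);
    buf[2] = in->n_operands;

    uint8_t *p = buf + 3;
    for (unsigned i = 0; i < in->n_operands; i++) {
        uint64_t v = (uint64_t)in->ops[i].value;
        *p++ = in->ops[i].kind;
        for (int b = 0; b < 8; b++)
            *p++ = (uint8_t)(v >> (8 * b));
    }
    return need;
}

/* Inverse of insn_encode. Returns the number of bytes consumed,
   or 0 if the buffer is too short or malformed. */
size_t insn_decode(struct insn_record *out, const uint8_t *buf, size_t len)
{
    if (len < 3 || buf[2] > MAX_OPERANDS)
        return 0;
    size_t need = 3 + (size_t)buf[2] * 9;
    if (len < need)
        return 0;

    out->pattern_id = (uint16_t)(buf[0] | (buf[1] << 8));
    out->n_operands = buf[2];

    const uint8_t *p = buf + 3;
    for (unsigned i = 0; i < out->n_operands; i++) {
        uint64_t v = 0;
        out->ops[i].kind = *p++;
        for (int b = 0; b < 8; b++)
            v |= (uint64_t)*p++ << (8 * b);
        out->ops[i].value = (int64_t)v;
    }
    return need;
}
```

The reader of such a stream never tokenizes anything; it copies a handful of bytes per instruction into its own structures. A shared-memory variant could skip even that copy if both sides agreed on the in-memory layout, which is where a common internal data structure would pay off.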
I think the time savings would happen less on the assembler side than on the gcc side: gcc would no longer have to format the output.
I think this would have the potential to cut compilation time by 5 to 10 percent. Not a big savings for the effort required, which is why nobody has done it. Compilation time is less important these days due to the use of big compilation clusters, but programs are also getting bigger, so it is not wholly unimportant.