More PCH documentation

llvm-svn: 72743
Douglas Gregor 2009-06-02 22:08:07 +00:00
parent 4b665ebb01
commit f727bb18d9
3 changed files with 1738 additions and 2 deletions


@ -64,8 +64,249 @@ with the <b><tt>-include-pch</tt></b> option:</p>
require the PCH file to be up-to-date.</li>
</ul>
<p>Clang's precompiled headers are designed with a compact on-disk
representation, which minimizes both PCH creation time and the time
required to initially load the PCH file. The PCH file itself contains
a serialized representation of Clang's abstract syntax trees and
supporting data structures, stored using the same compressed bitstream
as <a href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitcode
file format</a>.</p>
<p>Clang's precompiled headers are loaded "lazily" from disk. When a
PCH file is initially loaded, Clang reads only a small amount of data
from the PCH file to establish where certain important data structures
are stored. The amount of data read in this initial load is
independent of the size of the PCH file, such that a larger PCH file
does not lead to longer PCH load times. The actual header data in the
PCH file--macros, functions, variables, types, etc.--is loaded only
when it is referenced from the user's code, at which point only that
entity (and those entities it depends on) is deserialized from the
PCH file. With this approach, the cost of using a precompiled header
for a translation unit is proportional to the amount of code actually
used from the header, rather than being proportional to the size of
the header itself.</p>
<h2>Precompiled Header Contents</h2>
<img src="PCHLayout.png" align="right" alt="Precompiled header layout">
<p>Clang's precompiled headers are organized into several different
blocks, each of which contains the serialized representation of a part
of Clang's internal representation. Each of the blocks corresponds to
either a block or a record within <a
href="http://llvm.org/docs/BitCodeFormat.html">LLVM's bitstream
format</a>. The contents of each of these logical blocks are described
below.</p>
<h3 name="metadata">Metadata Block</h3>
<p>The metadata block contains several records that provide
information about how the precompiled header was built. This metadata
is primarily used to validate the use of a precompiled header. For
example, a precompiled header built for x86 (32-bit) cannot be used
when compiling for x86-64 (64-bit). The metadata block contains
information about:</p>
<dl>
<dt>Language options</dt>
<dd>Describes the particular language dialect used to compile the
PCH file, including major options (e.g., Objective-C support) and more
minor options (e.g., support for "//" comments). The contents of this
record correspond to the <code>LangOptions</code> class.</dd>
<dt>Target architecture</dt>
<dd>The target triple that describes the architecture, platform, and
ABI for which the PCH file was generated, e.g.,
<code>i386-apple-darwin9</code>.</dd>
<dt>PCH version</dt>
<dd>The major and minor version numbers of the precompiled header
format. Changes in the minor version number should not affect backward
compatibility, while changes in the major version number imply that a
newer compiler cannot read an older precompiled header (and
vice-versa).</dd>
<dt>Original file name</dt>
<dd>The full path of the header that was used to generate the
precompiled header.</dd>
<dt>Predefines buffer</dt>
<dd>Although not explicitly stored as part of the metadata, the
predefines buffer is used in the validation of the precompiled header.
The predefines buffer itself contains code generated by the compiler
to initialize the preprocessor state according to the current target,
platform, and command-line options. For example, the predefines buffer
will contain "<code>#define __STDC__ 1</code>" when we are compiling C
without Microsoft extensions. The predefines buffer itself is stored
within the <a href="#sourcemgr">source manager block</a>, but its
contents are verified along with the rest of the metadata.</dd> </dl>
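<p>The validation described above can be sketched as follows. This is
only an illustration: <code>PCHMetadata</code>, <code>Mismatch</code>,
and <code>validate</code> are hypothetical names, not Clang's API, and
the two language options shown stand in for the full
<code>LangOptions</code> record. The predefines buffer comparison is
omitted, and the sketch assumes that only a major version difference
makes a PCH file unreadable.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified view of the metadata stored in a PCH file.
struct PCHMetadata {
  unsigned versionMajor;     // major version of the PCH format
  unsigned versionMinor;     // minor version of the PCH format
  std::string targetTriple;  // e.g., "i386-apple-darwin9"
  bool objC;                 // a major language option
  bool bcplComments;         // a minor language option ("//" comments)
};

enum class Mismatch { None, Version, Target, LanguageOptions };

// Compare the metadata stored in the PCH file with the settings of the
// current compilation and report the first incompatibility found.
inline Mismatch validate(const PCHMetadata &stored,
                         const PCHMetadata &current) {
  // Assumption: minor version changes are compatible, major ones are not.
  if (stored.versionMajor != current.versionMajor)
    return Mismatch::Version;
  if (stored.targetTriple != current.targetTriple)
    return Mismatch::Target;
  if (stored.objC != current.objC ||
      stored.bcplComments != current.bcplComments)
    return Mismatch::LanguageOptions;
  return Mismatch::None;
}
```

<p>With this sketch, a PCH file built for
<code>i386-apple-darwin9</code> and used in a compilation targeting a
different triple is rejected with <code>Mismatch::Target</code>.</p>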
<h3 name="sourcemgr">Source Manager Block</h3>
<p>The source manager block contains the serialized representation of
Clang's <a
href="InternalsManual.html#SourceLocation">SourceManager</a> class,
which handles the mapping from source locations (as represented in
Clang's abstract syntax tree) into actual column/line positions within
a source file or macro instantiation. The precompiled header's
representation of the source manager also includes information about
all of the headers that were (transitively) included when building the
precompiled header.</p>
<p>The bulk of the source manager block is dedicated to information
about the various files, buffers, and macro instantiations into which
a source location can refer. Each of these is referenced by a numeric
"file ID", which is a unique number (allocated starting at 1) stored
in the source location. Clang serializes the information for each kind
of file ID, along with an index that maps file IDs to the position
within the PCH file where the information about that file ID is
stored. The data associated with a file ID is loaded only when
required by the front end, e.g., to emit a diagnostic that includes a
macro instantiation history inside the header itself.</p>
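<p>The file ID index can be pictured with a minimal sketch. The class
<code>LazyFileTable</code> and its record layout (NUL-terminated
strings in a byte blob) are assumptions made for illustration and do
not reflect the actual bitstream encoding. The sketch only shows the
idea: the index maps a file ID (starting at 1) to an offset, and the
record at that offset is read the first time it is requested.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the source manager block: a blob of
// NUL-terminated records plus an index of offsets, one per file ID.
class LazyFileTable {
public:
  LazyFileTable(std::string blob, std::vector<std::size_t> offsets)
      : blob_(std::move(blob)), offsets_(std::move(offsets)),
        loaded_(offsets_.size(), false), cache_(offsets_.size()) {}

  // Return the record for a file ID, reading it from the blob on first use.
  const std::string &get(std::uint32_t fileID) {
    assert(fileID >= 1 && fileID <= offsets_.size() && "file IDs start at 1");
    const std::size_t slot = fileID - 1;
    if (!loaded_[slot]) {
      // Read the NUL-terminated record that starts at the indexed offset.
      cache_[slot] = std::string(blob_.c_str() + offsets_[slot]);
      loaded_[slot] = true;
      ++loads_;
    }
    return cache_[slot];
  }

  // Number of records that have actually been read so far.
  unsigned loads() const { return loads_; }

private:
  std::string blob_;
  std::vector<std::size_t> offsets_;
  std::vector<bool> loaded_;
  std::vector<std::string> cache_;
  unsigned loads_ = 0;
};
```

<p>Requesting the same file ID twice reads the record once; file IDs
that are never requested are never read.</p>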
<p>The source manager block also contains information about all of the
headers that were included when building the precompiled header. This
includes information about the controlling macro for the header (e.g.,
when the preprocessor identified that the contents of the header
depend on a macro like <code>LLVM_CLANG_SOURCEMANAGER_H</code>)
along with a cached version of the results of the <code>stat()</code>
system calls performed when building the precompiled header. The
latter is particularly useful in reducing system time when searching
for include files.</p>
<h3 name="preprocessor">Preprocessor Block</h3>
<p>The preprocessor block contains the serialized representation of
the preprocessor. Specifically, it contains all of the macros that
have been defined by the end of the header used to build the
precompiled header, along with the token sequences that comprise each
macro. The macro definitions are only read from the PCH file when the
name of the macro first occurs in the program. This lazy loading of
macro definitions is triggered by lookups into the <a
href="#idtable">identifier table</a>.</p>
<h3 name="types">Types Block</h3>
<p>The types block contains the serialized representation of all of
the types referenced in the translation unit. Each Clang type node
(<code>PointerType</code>, <code>FunctionProtoType</code>, etc.) has a
corresponding record type in the PCH file. When types are deserialized
from the precompiled header, the data within the record is used to
reconstruct the appropriate type node using the AST context.</p>
<p>Each type has a unique integer type ID. Type ID 0 represents the NULL type, type IDs
less than <code>NUM_PREDEF_TYPE_IDS</code> represent predefined types
(<code>void</code>, <code>float</code>, etc.), while other
"user-defined" type IDs are assigned consecutively from
<code>NUM_PREDEF_TYPE_IDS</code> upward as the types are encountered.
The PCH file has an associated mapping from the user-defined type
IDs to the location within the types block where the serialized
representation of that type resides, enabling lazy deserialization of
types. When a type is referenced from within the PCH file, that
reference is encoded using the type ID shifted left by 3 bits. The
lower three bits are used to represent the <code>const</code>,
<code>volatile</code>, and <code>restrict</code> qualifiers, as in
Clang's <a
href="http://clang.llvm.org/docs/InternalsManual.html#Type">QualType</a>
class.</p>
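<p>The encoding of a type reference can be sketched as below. The
function and constant names are hypothetical, and the assignment of
individual bits to <code>const</code>, <code>restrict</code>, and
<code>volatile</code> is an assumption; the text above only states
that the type ID is shifted left by 3 bits and that the lower three
bits carry the qualifiers.</p>

```cpp
#include <cassert>
#include <cstdint>

// Assumed bit assignment for the three qualifier bits.
enum QualifierBit : std::uint32_t {
  QualConst = 0x1,
  QualRestrict = 0x2,
  QualVolatile = 0x4,
};

constexpr unsigned NumQualifierBits = 3;
constexpr std::uint32_t QualifierMask = (1u << NumQualifierBits) - 1;

// Pack a type ID and its qualifiers into a single encoded reference.
constexpr std::uint32_t encodeTypeRef(std::uint32_t typeID,
                                      std::uint32_t quals) {
  return (typeID << NumQualifierBits) | (quals & QualifierMask);
}

// Recover the type ID from an encoded reference.
constexpr std::uint32_t decodeTypeID(std::uint32_t encoded) {
  return encoded >> NumQualifierBits;
}

// Recover the qualifier bits from an encoded reference.
constexpr std::uint32_t decodeQualifiers(std::uint32_t encoded) {
  return encoded & QualifierMask;
}
```

<p>Under these assumptions, a <code>const volatile</code> reference to
type ID 5 is encoded as (5 shifted left by 3) combined with the two
qualifier bits, and decoding returns both parts unchanged.</p>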
<h3 name="decls">Declarations Block</h3>
<p>The declarations block contains the serialized representation of
all of the declarations referenced in the translation unit. Each Clang
declaration node (<code>VarDecl</code>, <code>FunctionDecl</code>,
etc.) has a corresponding record type in the PCH file. When
declarations are deserialized from the precompiled header, the data
within the record is used to build and populate a new instance of the
corresponding <code>Decl</code> node. As with types, each declaration
node has a numeric ID that is used to refer to that declaration within
the PCH file. In addition, a lookup table provides a mapping from that
numeric ID to the offset within the precompiled header where that
declaration is described.</p>
<p>Declarations in Clang's abstract syntax trees are stored
hierarchically. At the top of the hierarchy is the translation unit
(<code>TranslationUnitDecl</code>), which contains all of the
declarations in the translation unit. These declarations---such as
functions or struct types---may also contain other declarations inside
them, and so on. Within Clang, each declaration is stored within a <a
href="http://clang.llvm.org/docs/InternalsManual.html#DeclContext">declaration
context</a>, as represented by the <code>DeclContext</code> class.
Declaration contexts provide the mechanism to perform name lookup
within a given declaration (e.g., find the member named <code>x</code>
in a structure) and iterate over the declarations stored within a
context (e.g., iterate over all of the fields of a structure for
structure layout).</p>
<p>In Clang's precompiled header format, deserializing a declaration
that is a <code>DeclContext</code> is a separate operation from
deserializing all of the declarations stored within that declaration
context. Therefore, Clang will deserialize the translation unit
declaration without deserializing the declarations within that
translation unit. When required, the declarations stored within a
declaration context will be deserialized. There are two representations
of the declarations within a declaration context, which correspond to
the name-lookup and iteration behavior described above:</p>
<ul>
<li>When the front end performs name lookup to find a name
<code>x</code> within a given declaration context (for example,
during semantic analysis of the expression <code>p-&gt;x</code>,
where <code>p</code>'s type is defined in the precompiled header),
Clang deserializes a hash table mapping from the names within that
declaration context to the declaration IDs that represent each
visible declaration with that name. The entire hash table is
deserialized at this point (into the <code>llvm::DenseMap</code>
stored within each <code>DeclContext</code> object), but the actual
declarations are not yet deserialized. In a second step, those
declarations with the name <code>x</code> will be deserialized and
will be used as the result of name lookup.</li>
<li>When the front end performs iteration over all of the
declarations within a declaration context, all of those declarations
are immediately deserialized. For large declaration contexts (e.g.,
the translation unit), this operation is expensive; however, large
declaration contexts are not traversed in normal compilation, since
such a traversal is unnecessary. It is common, though, for the code
generator and semantic analysis to traverse declaration contexts for
structs, classes, unions, and enumerations, although those contexts
contain relatively few declarations in the common case.</li>
</ul>
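<p>The two-step name lookup described above can be illustrated with a
small sketch. All names here (<code>SerializedContext</code>,
<code>LazyDeclContext</code>, and so on) are hypothetical, and standard
containers stand in for the on-disk hash table and for
<code>llvm::DenseMap</code>. The first lookup copies the whole
name-to-ID table; only the declarations with the requested name are
then materialized.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <unordered_map>
#include <vector>

using DeclID = std::uint32_t;

struct Decl {
  DeclID id;
  std::string name;
};

// Stand-in for the serialized form of one declaration context.
struct SerializedContext {
  // Visible names mapped to the IDs of the declarations with that name.
  std::unordered_map<std::string, std::vector<DeclID>> visible;
  // Serialized declarations, reduced here to just their names.
  std::map<DeclID, std::string> records;
};

class LazyDeclContext {
public:
  explicit LazyDeclContext(const SerializedContext &disk) : disk_(disk) {}

  std::vector<const Decl *> lookup(const std::string &name) {
    // Step 1: load the entire name table, but no declarations.
    if (!tableLoaded_) {
      table_ = disk_.visible;
      tableLoaded_ = true;
    }
    std::vector<const Decl *> result;
    auto it = table_.find(name);
    if (it == table_.end())
      return result;
    // Step 2: deserialize only the declarations with the requested name.
    for (DeclID id : it->second)
      result.push_back(&materialize(id));
    return result;
  }

  // Number of declarations deserialized so far.
  std::size_t numDeserialized() const { return decls_.size(); }

private:
  const Decl &materialize(DeclID id) {
    auto it = decls_.find(id);
    if (it == decls_.end())
      it = decls_.emplace(id, Decl{id, disk_.records.at(id)}).first;
    return it->second;  // std::map keeps this reference stable
  }

  const SerializedContext &disk_;
  bool tableLoaded_ = false;
  std::unordered_map<std::string, std::vector<DeclID>> table_;
  std::map<DeclID, Decl> decls_;
};
```

<p>Looking up <code>x</code> in a context that also declares other
names deserializes only the declarations named <code>x</code>; the
others remain on disk until they are looked up or the context is
iterated.</p>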
<h3 name="idtable">Identifier Table Block</h3>
<p>The identifier table block contains an on-disk hash table that maps
each identifier mentioned within the precompiled header to the
serialized representation of the identifier's information (e.g.,
<code>IdentifierInfo</code> structure). The serialized representation
contains:</p>
<ul>
<li>The actual identifier string.</li>
<li>Flags that describe whether this identifier is the name of a
built-in, a poisoned identifier, an extension token, or a
macro.</li>
<li>If the identifier names a macro, the offset of the macro
definition within the <a href="#preprocessor">preprocessor
block</a>.</li>
<li>If the identifier names one or more declarations visible from
translation unit scope, the <a href="#decls">declaration IDs</a> of these
declarations.</li>
</ul>
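<p>As a rough picture of one identifier table entry, consider the
following sketch. The struct <code>IdentifierEntry</code>, its field
names, and the helper <code>macroOffsetFor</code> are hypothetical,
and an in-memory <code>std::unordered_map</code> stands in for the
on-disk hash table. The sketch shows how a lookup by name could yield
the offset of a macro definition in the preprocessor block.</p>

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory form of one serialized identifier entry.
struct IdentifierEntry {
  std::string name;                   // the actual identifier string
  bool isBuiltin = false;             // names a built-in
  bool isPoisoned = false;            // poisoned identifier
  bool isExtensionToken = false;      // extension token
  bool isMacro = false;               // names a macro
  std::uint32_t macroOffset = 0;      // valid only if isMacro is set
  std::vector<std::uint32_t> declIDs; // declarations visible at TU scope
};

using IdentifierTable = std::unordered_map<std::string, IdentifierEntry>;

// If `name` is a macro, store the offset of its definition in `offset`
// and return true; otherwise leave `offset` untouched and return false.
inline bool macroOffsetFor(const IdentifierTable &table,
                           const std::string &name, std::uint32_t &offset) {
  auto it = table.find(name);
  if (it == table.end() || !it->second.isMacro)
    return false;
  offset = it->second.macroOffset;
  return true;
}
```

<p>An identifier that names only declarations, or one that does not
appear in the table at all, yields no macro offset, so nothing is read
from the preprocessor block for it.</p>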
<p>When a precompiled header is loaded, the precompiled header
mechanism introduces itself into the identifier table as an external
lookup source. Thus, when the user program refers to an identifier
that has not yet been seen, Clang will perform a lookup into the
on-disk hash table ... FINISH THIS!
<p>A separate table provides a mapping from the numeric representation
of identifiers used in the PCH file to the location within the on-disk
hash table where that identifier is stored. This mapping is used when
deserializing the name of a declaration, the identifier of a token, or
any other construct in the PCH file that refers to a name.</p>
</div>
</body>
</html>

1495
clang/docs/PCHLayout.graffle Normal file

File diff suppressed because it is too large

BIN
clang/docs/PCHLayout.png Normal file

Binary file not shown.
