<html>
<head>
<title>Static Analyzer Design Document: Memory Regions</title>
</head>
<body>

<h1>Static Analyzer Design Document: Memory Regions</h1>

<h3>Authors</h3>

<p>Ted Kremenek, <tt>kremenek at apple</tt><br>
Zhongxing Xu, <tt>xuzhongzhing at gmail</tt></p>

<h2 id="intro">Introduction</h2>

<p>The path-sensitive analysis engine in libAnalysis employs an extensible API
for abstractly modeling the memory of an analyzed program. This API uses the
concept of "memory regions" to abstractly model chunks of program memory, such
as program variables and dynamically allocated memory returned by
'malloc' and 'alloca'. Regions are hierarchical, with subregions modeling
subtyping relationships, field and array offsets into larger chunks of memory,
and so on.</p>

<p>The region API consists of two components:</p>

<ul>
<li>A taxonomy and representation of regions themselves within the analyzer
engine. The primary definitions and interfaces are described in <tt><a
href="http://clang.llvm.org/doxygen/MemRegion_8h-source.html">MemRegion.h</a></tt>.
At the root of the region hierarchy is the class <tt>MemRegion</tt>, with
specific subclasses refining the region concept for variables, heap-allocated
memory, and so forth.</li>
<li>The modeling of the binding of values to regions. For
example, modeling the value stored to a local variable <tt>x</tt> consists of
recording the binding between the region for <tt>x</tt> (which represents the
raw memory associated with <tt>x</tt>) and the value stored to <tt>x</tt>. This
binding relationship is captured with the notion of "symbolic
stores."</li>
</ul>

<p>Symbolic stores, which can be thought of as representing the relation
<tt>regions -&gt; values</tt>, are implemented by subclasses of the
<tt>StoreManager</tt> class (<tt><a
href="http://clang.llvm.org/doxygen/Store_8h-source.html">Store.h</a></tt>). A
particular StoreManager implementation has complete flexibility concerning the
following:</p>

<ul>
<li><em>How</em> to model the binding between regions and values</li>
<li><em>What</em> bindings are recorded</li>
</ul>

<p>Together, these two points allow different StoreManagers to trade off
analysis precision against scalability when reasoning about program memory.
Meanwhile, the core path-sensitive engine makes no
assumptions about either point, and queries a StoreManager about the bindings
to a memory region through a generic interface that all StoreManagers share. If
a particular StoreManager cannot reason about the potential bindings of a given
memory region (e.g., '<tt>BasicStoreManager</tt>' does not reason about fields
of structures), then the StoreManager can simply return 'unknown' (represented by
'<tt>UnknownVal</tt>') for a particular region binding. This separation of
concerns not only isolates the core analysis engine from the details of
reasoning about program memory but also lets a client of the
path-sensitive engine easily swap in different StoreManager implementations
that internally reason about program memory in very different ways.</p>

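<p>The separation described above can be sketched in a few lines of C++. This
is a toy model under stated assumptions, not the analyzer's API: the names
<tt>ToyRegion</tt>, <tt>ToyValue</tt>, <tt>ToyStore</tt>,
<tt>VariableOnlyStore</tt>, <tt>FieldSensitiveStore</tt>, and
<tt>storeAndLoad</tt> are all hypothetical, and values are reduced to integers.
It shows two stores behind one generic interface, where the coarser store
answers 'unknown' for field regions it does not track.</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not the analyzer's real classes.
struct ToyRegion {
    std::string var;    // name of the variable the region belongs to
    std::string field;  // empty for the variable itself, else a field name
};

struct ToyValue {
    bool known;
    int number;  // meaningful only when known is true
    static ToyValue unknown() { return ToyValue{false, 0}; }
    static ToyValue concrete(int n) { return ToyValue{true, n}; }
};

// Generic interface that the engine queries; every store implements it.
class ToyStore {
public:
    virtual ~ToyStore() {}
    virtual void bind(const ToyRegion& r, const ToyValue& v) = 0;
    virtual ToyValue retrieve(const ToyRegion& r) const = 0;
};

// Records bindings for whole variables only; field regions are ignored.
class VariableOnlyStore : public ToyStore {
    std::map<std::string, ToyValue> bindings_;
public:
    void bind(const ToyRegion& r, const ToyValue& v) override {
        if (r.field.empty()) bindings_[r.var] = v;
    }
    ToyValue retrieve(const ToyRegion& r) const override {
        if (!r.field.empty()) return ToyValue::unknown();
        auto it = bindings_.find(r.var);
        return it == bindings_.end() ? ToyValue::unknown() : it->second;
    }
};

// Records bindings for variables and fields alike.
class FieldSensitiveStore : public ToyStore {
    std::map<std::pair<std::string, std::string>, ToyValue> bindings_;
public:
    void bind(const ToyRegion& r, const ToyValue& v) override {
        bindings_[std::make_pair(r.var, r.field)] = v;
    }
    ToyValue retrieve(const ToyRegion& r) const override {
        auto it = bindings_.find(std::make_pair(r.var, r.field));
        return it == bindings_.end() ? ToyValue::unknown() : it->second;
    }
};

// Client code sees only the generic interface, so stores can be swapped.
ToyValue storeAndLoad(ToyStore& store, const ToyRegion& r, int n) {
    store.bind(r, ToyValue::concrete(n));
    return store.retrieve(r);
}
```

<p>Calling <tt>storeAndLoad</tt> with a field region such as
<tt>{"s", "f"}</tt> yields an unknown value from the variable-only store and
the stored number from the field-sensitive store, without any change to the
calling code.</p>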
<p>The rest of this document is divided into two parts. We first discuss region
taxonomy and the semantics of regions. We then discuss the StoreManager
interface and the details of how the currently available StoreManager classes
implement region bindings.</p>

<h2 id="regions">Memory Regions and Region Taxonomy</h2>

<h3>Pointers</h3>

<p>Before discussing memory regions, we first discuss pointers,
since memory regions are essentially used to represent pointer values.</p>

| 
 | |
| <p>The pointer is a type of values. Pointer values have two semantic aspects.
 | |
| One is its physical value, which is an address or location. The other is the
 | |
| type of the memory object residing in the address.</p>
 | |
| 
 | |
| <p>Memory regions are designed to abstract these two properties of the pointer.
 | |
| The physical value of a pointer is represented by MemRegion pointers. The rvalue
 | |
| type of the region corresponds to the type of the pointee object.</p>
 | |
| 
 | |
| <p>One complication is that we could have different view regions on the same
 | |
| memory chunk. They represent the same memory location, but have different
 | |
| abstract location, i.e., MemRegion pointers. Thus we need to canonicalize the
 | |
| abstract locations to get a unique abstract location for one physical
 | |
| location.</p>
 | |
| 
 | |
| <p>Furthermore, these different view regions may or may not represent memory
 | |
| objects of different types. Some different types are semantically the same,
 | |
| for example, 'struct s' and 'my_type' are the same type.</p>
 | |
| 
 | |
| <pre>
 | |
| struct s;
 | |
| typedef struct s my_type;
 | |
| </pre>
 | |
| 
 | |
| <p>But <tt>char</tt> and <tt>int</tt> are not the same type in the code below:</p>
 | |
| 
 | |
| <pre>
 | |
| void *p;
 | |
| int *q = (int*) p;
 | |
| char *r = (char*) p;
 | |
| </pre
 | |
| 
 | |
<p>Thus we need to canonicalize the <tt>MemRegion</tt> that is used when binding
and retrieving values.</p>

<h3>Regions</h3>

<p>A region is the entity used to model pointer values. A region has the following
properties:</p>

<ul>
<li>Kind</li>

<li>ObjectType: the type of the object residing in the region.</li>

<li>LocationType: the type of the pointer value that the region corresponds to.
  Usually this is the pointer to the ObjectType, but sometimes we want to cache
  this type explicitly, for example, for a CodeTextRegion.</li>

<li>StartLocation</li>

<li>EndLocation</li>
</ul>

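<p>As a rough illustration of the property list above, the following sketch
gathers the five properties into one record. Everything here is an assumption
made for the example: the names <tt>ToyKind</tt> and <tt>ToyRegionInfo</tt>,
the use of strings for types, and the reading of StartLocation and EndLocation
as byte offsets. It mainly shows why LocationType is sometimes cached: for a
function type, appending a <tt>*</tt> to the object type does not spell the
pointer type.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical names throughout; this is not the MemRegion class hierarchy.
enum class ToyKind { Var, Field, Element, Symbolic, CodeText };

struct ToyRegionInfo {
    ToyKind kind;
    std::string objectType;          // type of the object residing in the region
    std::string cachedLocationType;  // empty unless cached explicitly
    long startLocation;              // assumed here to be a byte offset
    long endLocation;                // assumed here to be one past the last byte

    // Usually the pointer to ObjectType; an explicit cache takes precedence.
    std::string locationType() const {
        if (!cachedLocationType.empty()) return cachedLocationType;
        return objectType + " *";
    }
};
```

<p>For a variable region with object type <tt>int</tt> and no cached type,
<tt>locationType()</tt> derives <tt>int *</tt>. A code region would instead
carry a cached pointer-to-function type.</p>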
<h3>Symbolic Regions</h3>

<p>A symbolic region maps the concept of symbolic values into the domain
of regions. It is the way that we represent symbolic pointers. Whenever a
symbolic pointer value is needed, a symbolic region is created to represent
it.</p>

| 
 | |
| <p>A symbolic region has no type. It wraps a SymbolData. But sometimes we have
 | |
| type information associated with a symbolic region. For this case, a
 | |
| TypedViewRegion is created to layer the type information on top of the symbolic
 | |
| region. The reason we do not carry type information with the symbolic region is
 | |
| that the symbolic regions can have no type. To be consistent, we don't let them
 | |
| to carry type information.</p>
 | |
| 
 | |
| <p>Like a symbolic pointer, a symbolic region may be NULL, has unknown extent,
 | |
| and represents a generic chunk of memory.</p>
 | |
| 
 | |
| <p><em><b>NOTE</b>: We plan not to use loc::SymbolVal in RegionStore and remove it
 | |
|   gradually.</em></p>
 | |
| 
 | |
<p>Symbolic regions get their rvalue types in the following ways:</p>

<ul>
<li>Through the parameter or global variable that points to the region, e.g.:
<pre>
void f(struct s* p) {
  ...
}
</pre>

<p>The symbolic region pointed to by <tt>p</tt> has type <tt>struct
s</tt>.</p></li>

<li>Through explicit or implicit casts, e.g.:
<pre>
void f(void* p) {
  struct s* q = (struct s*) p;
  ...
}
</pre>
</li>
</ul>

<p>We attach the type information to the symbolic region lazily. For the first
case above, we create the <tt>TypedViewRegion</tt> only when the pointer is
actually used to access the pointee memory object, that is, when the element or
field region is created. For the cast case, the <tt>TypedViewRegion</tt> is
created when visiting the <tt>CastExpr</tt>.</p>

<p>The reason for doing lazy typing is that symbolic regions are sometimes
used only for location comparison.</p>

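<p>The lazy typing scheme can be sketched as follows. This is a minimal model,
assuming integer symbol ids in place of SymbolData and strings in place of
types; <tt>ToySymbolicRegion</tt>, <tt>ToyTypedView</tt>,
<tt>ToyRegionFactory</tt>, and <tt>sameLocation</tt> are hypothetical names,
not analyzer classes. The point it illustrates is that comparing locations
never creates a typed view, while the first typed access does.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical toy model; names echo the text but are not the real classes.
struct ToySymbolicRegion {
    int symbolId;  // stands in for the wrapped SymbolData
};

struct ToyTypedView {
    const ToySymbolicRegion* base;  // the untyped region underneath
    std::string type;               // type layered on top of it
};

class ToyRegionFactory {
    std::map<int, ToySymbolicRegion> symbolic_;
    std::map<int, ToyTypedView> views_;
public:
    // Used whenever a symbolic pointer value is needed; attaches no type.
    const ToySymbolicRegion* symbolicFor(int symbolId) {
        auto it = symbolic_.find(symbolId);
        if (it == symbolic_.end())
            it = symbolic_.insert(
                std::make_pair(symbolId, ToySymbolicRegion{symbolId})).first;
        return &it->second;
    }

    // Used only when the pointee is accessed or a cast is visited.
    // This toy keeps the first view created for a symbol.
    const ToyTypedView* typedViewFor(int symbolId, const std::string& type) {
        const ToySymbolicRegion* base = symbolicFor(symbolId);
        auto it = views_.find(symbolId);
        if (it == views_.end())
            it = views_.insert(
                std::make_pair(symbolId, ToyTypedView{base, type})).first;
        return &it->second;
    }

    std::size_t viewCount() const { return views_.size(); }
};

// Location comparison needs no type information at all.
bool sameLocation(const ToySymbolicRegion* a, const ToySymbolicRegion* b) {
    return a == b;
}
```

<p>Requesting the symbolic region for the same symbol twice and comparing the
results leaves <tt>viewCount()</tt> at zero. Only a call to
<tt>typedViewFor</tt> adds a view.</p>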
<h3>Pointer Casts</h3>

<p>Pointer casts allow people to impose different 'views' onto a chunk of
memory.</p>

<p>Usually we have two kinds of casts. One kind casts down within the
type hierarchy. It imposes more specific views onto more generic memory regions.
The other kind casts up within the type hierarchy. It strips away more
specific views on top of the more generic memory regions.</p>

<p>We simulate down casts by layering another <tt>TypedViewRegion</tt> on
top of the original region. We simulate up casts by stripping away the top
<tt>TypedViewRegion</tt>. Down casts are usually simple. For up casts, if
there is no <tt>TypedViewRegion</tt> to be stripped, we return the original
region. If the underlying region is of a different type than the cast-to type,
we flag an error state.</p>

<p>For toll-free bridging casts, we return the original region.</p>

<p>We can set up a partial order for pointer types, with the most general type
<tt>void*</tt> at the top. The partial order forms a tree with <tt>void*</tt> as
its root node.</p>

<p>Every <tt>MemRegion</tt> has a root position in the type tree. For example,
the pointee region of <tt>void *p</tt> has its root position at the root node of
the tree. The <tt>VarRegion</tt> of <tt>int x</tt> has its root position at the
'int type' node.</p>

<p><tt>TypedViewRegion</tt> is used to move the region down or up in the tree.
Moving down in the tree adds a <tt>TypedViewRegion</tt>. Moving up in the tree
removes a <tt>TypedViewRegion</tt>.</p>

<p>Do we want to allow moving up beyond the root position? This happens
when:</p>

<pre>
int x;
void *p = &amp;x;
</pre>

<p>The region of <tt>x</tt> has its root position at the 'int*' node. The cast to
void* moves that region up to the 'void*' node. I propose not allowing such
casts and instead assigning the region of <tt>x</tt> to <tt>p</tt>.</p>

<p>Another non-ideal case is that people might cast to a non-generic pointer
from another non-generic pointer instead of first casting it back to the generic
pointer. Handling this case directly would result in multiple layers of
TypedViewRegions. This imposes an incorrect semantic view on the region,
because we can have only one typed view on a region at a time. To avoid this
inconsistency, before casting the region, we strip the TypedViewRegion and then
do the cast. In summary, we allow only one layer of TypedViewRegion.</p>

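<p>The cast rules in this section can be summarized in a small sketch. It is a
simplification under stated assumptions: pointer types are plain strings, a
region is reduced to its root type plus an optional view, and the error state
for mismatched up casts is not modeled. <tt>ToyCastRegion</tt>,
<tt>currentType</tt>, and <tt>castTo</tt> are hypothetical names.</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical model: a region plus at most one typed view layered on it.
struct ToyCastRegion {
    std::string rootType;  // pointer type at the root position, e.g. "void*"
    std::string viewType;  // empty when no typed view is layered on top
};

// The pointer type through which the region is currently seen.
std::string currentType(const ToyCastRegion& r) {
    return r.viewType.empty() ? r.rootType : r.viewType;
}

ToyCastRegion castTo(ToyCastRegion r, const std::string& toType) {
    // Strip any existing view first, so that views never stack.
    r.viewType.clear();
    // Moving up to void*, or back to the root position, yields the
    // original region without any view.
    if (toType == "void*" || toType == r.rootType) return r;
    // Moving down: layer exactly one view.
    r.viewType = toType;
    return r;
}
```

<p>Casting a <tt>void*</tt>-rooted region to <tt>struct s*</tt> and then
directly to <tt>char*</tt> leaves a single <tt>char*</tt> view. Casting the
region of an <tt>int</tt> variable to <tt>void*</tt> returns that region
unchanged, as proposed above.</p>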
<h3>Region Bindings</h3>

<p>The following region kinds are bindable: VarRegion, CompoundLiteralRegion,
StringRegion, ElementRegion, FieldRegion, and ObjCIvarRegion.</p>

<p>When binding regions, we perform canonicalization on element regions and field
regions. This is because we can have different views on the same region, some
of which are essentially the same view with differently sugared type names.</p>

<p>To canonicalize a region, we get the canonical types for all TypedViewRegions
along the way up to the root region, and make new TypedViewRegions with those
canonical types.</p>

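<p>A minimal sketch of binding through canonicalized views follows. It assumes
a single view per region, models canonical types by resolving typedef names in
a lookup table, and stores integers as values. <tt>ToyTypeTable</tt> and
<tt>ToyCanonicalStore</tt> are hypothetical names. Clang computes canonical
types itself; the table here only stands in for that step.</p>

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical typedef table standing in for canonical-type lookup.
class ToyTypeTable {
    std::map<std::string, std::string> aliases_;  // typedef name -> target
public:
    void addTypedef(const std::string& name, const std::string& target) {
        aliases_[name] = target;
    }
    // Follows typedef chains to the underlying type. Assumes no cycles,
    // which holds for well-formed typedefs.
    std::string canonical(std::string type) const {
        auto it = aliases_.find(type);
        while (it != aliases_.end()) {
            type = it->second;
            it = aliases_.find(type);
        }
        return type;
    }
};

// Hypothetical store keyed by (base region, canonical view type).
class ToyCanonicalStore {
    ToyTypeTable types_;
    std::map<std::pair<std::string, std::string>, int> bindings_;

    std::pair<std::string, std::string> key(const std::string& base,
                                            const std::string& view) const {
        return std::make_pair(base, types_.canonical(view));
    }
public:
    explicit ToyCanonicalStore(const ToyTypeTable& types) : types_(types) {}

    void bind(const std::string& base, const std::string& view, int value) {
        bindings_[key(base, view)] = value;
    }
    // Returns false when no binding exists for the canonicalized region.
    bool retrieve(const std::string& base, const std::string& view,
                  int& out) const {
        auto it = bindings_.find(key(base, view));
        if (it == bindings_.end()) return false;
        out = it->second;
        return true;
    }
};
```

<p>With <tt>my_type</tt> registered as a typedef of <tt>struct s</tt>, a value
bound through a <tt>my_type</tt> view is found again through a
<tt>struct s</tt> view, while a <tt>char</tt> view of the same base has no
binding.</p>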
<p>For Objective-C and C++, perhaps another canonicalization rule should be
added: for a FieldRegion, the least derived class that has the field is used as
the type of the super region of the FieldRegion.</p>

<p>All bindings and retrievals are done on the canonicalized regions.</p>

<p>Canonicalization is invisible outside the region store manager and, more
specifically, outside the Bind() and Retrieve() methods. We don't need to
consider region canonicalization when doing pointer casts.</p>

<h3>Constraint Manager</h3>

<p>The constraint manager reasons about the abstract location of memory objects.
We can have different views on a region, but none of these views changes the
location of that object. Thus we should get the same abstract location for all
of those view regions.</p>

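<p>One way to picture this requirement is a constraint set keyed by the
underlying location rather than by the view. The sketch below assumes integer
ids for base regions and tracks only a non-null fact. <tt>ToyView</tt>,
<tt>abstractLocation</tt>, and <tt>ToyNullness</tt> are hypothetical names and
do not describe the analyzer's constraint manager.</p>

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical view region: a base region id plus the type it is seen as.
struct ToyView {
    int baseId;
    std::string viewType;
};

// The abstract location ignores the view and depends on the base alone.
int abstractLocation(const ToyView& r) { return r.baseId; }

// Hypothetical constraint set that records which locations are non-null.
class ToyNullness {
    std::set<int> nonNull_;
public:
    void assumeNonNull(const ToyView& r) {
        nonNull_.insert(abstractLocation(r));
    }
    bool isKnownNonNull(const ToyView& r) const {
        return nonNull_.count(abstractLocation(r)) != 0;
    }
};
```

<p>A fact assumed through an <tt>int</tt> view of a base region is then
visible through a <tt>char</tt> view of the same base, and not through views
of other bases.</p>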
</body>
</html>