.. _Readers:

Developing lld Readers
======================

Introduction
------------

The purpose of a "Reader" is to take an object file in a particular format
and create an `lld::File`:cpp:class: (which is a graph of Atoms)
representing the object file.  A Reader inherits from
`lld::Reader`:cpp:class: which lives in
:file:`include/lld/Core/Reader.h` and
:file:`lib/Core/Reader.cpp`.

The Reader infrastructure for an object format ``Foo`` requires the
following pieces in order to fit into lld:

:file:`include/lld/ReaderWriter/ReaderFoo.h`

   .. cpp:class:: ReaderOptionsFoo : public ReaderOptions

      This Options class is the only way to configure how the Reader will
      parse any file into an `lld::File`:cpp:class: object.  This class
      should be declared in the `lld`:cpp:class: namespace.

   .. cpp:function:: Reader *createReaderFoo(ReaderOptionsFoo &reader)

      This factory function configures and creates the Reader. This function
      should be declared in the `lld`:cpp:class: namespace.

:file:`lib/ReaderWriter/Foo/ReaderFoo.cpp`

   .. cpp:class:: ReaderFoo : public Reader

      This is the concrete Reader class which can be called to parse
      object files. It should be declared in an anonymous namespace or,
      if there is code shared with the `lld::WriterFoo`:cpp:class:, in
      a nested namespace (e.g. `lld::foo`:cpp:class:).

You may have noticed that :cpp:class:`ReaderFoo` is not declared in the
``.h`` file. An important design aspect of lld is that all Readers are
created *only* through an object-format-specific
:cpp:func:`createReaderFoo` factory function. The creation of the Reader is
parametrized through a :cpp:class:`ReaderOptionsFoo` class. This options
class is the one-and-only way to control how the Reader operates when
parsing an input file into an Atom graph. For instance, you may want the
Reader to only accept certain architectures. The options class can be
instantiated from command line options or be programmatically configured.

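The organization described above can be sketched in a single translation
unit as follows. This is only an illustration under stated assumptions: the
``sketch`` namespace, the ``accepts()`` method, and the ``addArch()`` and
``allowsArch()`` option methods are hypothetical stand-ins rather than the
real lld classes, and the factory returns a ``std::unique_ptr`` instead of
the raw pointer shown in the declaration above.

.. code-block:: c++

   #include <memory>
   #include <set>
   #include <string>

   namespace sketch { // stand-in for the lld namespace; every name is illustrative

   // ---- What would live in ReaderFoo.h ----

   class Reader {
   public:
     virtual ~Reader() = default;
     // Hypothetical query, used here only to make the options observable.
     virtual bool accepts(const std::string &arch) const = 0;
   };

   class ReaderOptions {
   public:
     virtual ~ReaderOptions() = default;
   };

   // The options class is the only knob callers get.
   class ReaderOptionsFoo : public ReaderOptions {
   public:
     void addArch(const std::string &arch) { archs_.insert(arch); }
     // An empty set means "accept every architecture".
     bool allowsArch(const std::string &arch) const {
       return archs_.empty() || archs_.count(arch) != 0;
     }

   private:
     std::set<std::string> archs_;
   };

   // The only way to obtain a Foo Reader.
   std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options);

   // ---- What would live in ReaderFoo.cpp ----

   namespace {
   // The concrete class is invisible outside this file.
   class ReaderFoo : public Reader {
   public:
     explicit ReaderFoo(const ReaderOptionsFoo &options) : options_(options) {}
     bool accepts(const std::string &arch) const override {
       return options_.allowsArch(arch);
     }

   private:
     ReaderOptionsFoo options_; // private copy of the configuration
   };
   } // anonymous namespace

   std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options) {
     return std::make_unique<ReaderFoo>(options);
   }

   } // namespace sketch

A caller fills in a ``ReaderOptionsFoo`` (from command line options or
programmatically), passes it to ``createReaderFoo()``, and from then on sees
only the abstract ``Reader`` interface.
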
Where to start
--------------

The lld project already has a skeleton of source code for Readers for
``ELF``, ``PECOFF``, ``MachO``, and lld's native ``YAML`` graph format.
If your file format is a variant of one of those, you should modify the
existing Reader to support your variant. This is done by customizing the Options
class for the Reader and making appropriate changes to the ``.cpp`` file to
interpret those options and act accordingly.

If your object file format is not a variant of any existing Reader, you'll need
to create a new Reader subclass with the organization described above.

Readers are factories
---------------------

The linker will usually only instantiate your Reader once.  That one Reader will
have its ``loadFile()`` method called many times with different input files.
To support multithreaded linking, the Reader may be parsing multiple input
files in parallel. Therefore, there should be no parsing state in your Reader
object.  Any parsing state should be in member variables of your File subclass
or in some temporary object.

The key method to implement in a reader is::

  virtual error_code loadFile(LinkerInput &input,
                              std::vector<std::unique_ptr<File>> &result);

It takes the linker input (a memory buffer which contains the contents of the
object file being read) and, through ``result``, returns an instantiated
lld::File object, which is a collection of Atoms. The result is a vector of
File pointers (instead of simply a single File pointer) because some file
formats allow multiple object "files" to be encoded in one file system file.

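A minimal sketch of a stateless ``loadFile()`` is shown below. The types in
the ``sketch`` namespace are simplified stand-ins, not the real lld classes,
and the "Foo" format is invented for the example: each line of the buffer
names one atom, and a line containing ``---`` starts a new embedded object.
Note that all parsing state lives in local variables and in the File objects
being built, never in the Reader itself.

.. code-block:: c++

   #include <memory>
   #include <sstream>
   #include <string>
   #include <system_error>
   #include <utility>
   #include <vector>

   namespace sketch { // hypothetical stand-ins for the lld types

   struct File {
     virtual ~File() = default;
     std::string path;
     std::vector<std::string> atomNames; // greatly simplified "Atoms"
   };

   struct LinkerInput {
     std::string path;
     std::string buffer; // contents of the object file
   };

   class Reader {
   public:
     virtual ~Reader() = default;
     virtual std::error_code
     loadFile(LinkerInput &input,
              std::vector<std::unique_ptr<File>> &result) = 0;
   };

   class ReaderFoo : public Reader {
   public:
     std::error_code
     loadFile(LinkerInput &input,
              std::vector<std::unique_ptr<File>> &result) override {
       if (input.buffer.empty())
         return std::make_error_code(std::errc::invalid_argument);

       // Everything below is local to this call, so concurrent calls on
       // the same Reader object do not share any parsing state.
       auto current = std::make_unique<File>();
       current->path = input.path;

       std::istringstream lines(input.buffer);
       std::string line;
       while (std::getline(lines, line)) {
         if (line == "---") {
           // Separator: finish this embedded object and start the next.
           result.push_back(std::move(current));
           current = std::make_unique<File>();
           current->path = input.path;
         } else if (!line.empty()) {
           current->atomNames.push_back(line);
         }
       }
       result.push_back(std::move(current));
       return std::error_code(); // success
     }
   };

   } // namespace sketch

Because one input may yield several File objects, the caller inspects the
returned error code first and then iterates over ``result``.
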
Memory Ownership
----------------

Atoms are always owned by their File object. During core linking when Atoms
are coalesced or stripped away, core linking does not delete them.
Core linking just removes those unused Atoms from its internal list.
The destructor of a File object is responsible for deleting all Atoms it
owns, and if ownership of the MemoryBuffer was passed to it, the File
destructor needs to delete that too.

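The ownership rule can be illustrated with the following sketch. The
``File`` and ``Atom`` types, the ``strip()`` helper, and the ``liveAtoms``
counter are hypothetical and exist only to make the behavior observable; the
real lld classes are organized differently. The File holds its Atoms through
``std::unique_ptr``, while the stand-in for core linking's internal list
holds plain non-owning pointers.

.. code-block:: c++

   #include <algorithm>
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>

   namespace sketch { // hypothetical stand-ins for the lld types

   int liveAtoms = 0; // number of Atom objects currently alive (demo only)

   struct Atom {
     explicit Atom(std::string n) : name(std::move(n)) { ++liveAtoms; }
     ~Atom() { --liveAtoms; }
     Atom(const Atom &) = delete;
     Atom &operator=(const Atom &) = delete;
     std::string name;
   };

   class File {
   public:
     // The File creates and owns the Atom; callers get a non-owning pointer.
     const Atom *addAtom(std::string name) {
       atoms_.push_back(std::make_unique<Atom>(std::move(name)));
       return atoms_.back().get();
     }

   private:
     // Destroyed together with the File, which deletes every Atom.
     std::vector<std::unique_ptr<Atom>> atoms_;
   };

   // Stand-in for core linking dropping an unused Atom: the pointer is
   // removed from the list, but the Atom itself is not deleted.
   void strip(std::vector<const Atom *> &live, const Atom *dead) {
     live.erase(std::remove(live.begin(), live.end(), dead), live.end());
   }

   } // namespace sketch

With this arrangement a stripped Atom stays valid until its File is
destroyed, so nothing else in the linker has to track Atom lifetimes.
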
Making Atoms
------------

The internal model of lld is purely Atom based.  But most object files do not
have an explicit concept of Atoms; instead, most have "sections". The way
to think of this is that a section is just a list of Atoms with common
attributes.

The first step in parsing section-based object files is to cleave each
section into a list of Atoms. The technique may vary by section type. For
code sections (e.g. .text), there are usually symbols at the start of each
function. Those symbol addresses are the points at which the section is
cleaved into discrete Atoms.  Some file formats (like ELF) also include the
length of each symbol in the symbol table. Otherwise, the length of each
Atom is calculated to run to the start of the next symbol or the end of the
section.

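Symbol-based cleaving can be sketched as below. The ``Symbol`` and
``AtomRange`` structures and the ``cleaveSection()`` function are
hypothetical names chosen for this example. The sketch assumes that every
symbol lies inside the section, that no two symbols share an offset, and
that a recorded size of zero means the format gave no size.

.. code-block:: c++

   #include <algorithm>
   #include <cstddef>
   #include <cstdint>
   #include <string>
   #include <vector>

   namespace sketch { // all names here are illustrative

   struct Symbol {
     std::string name;
     std::uint64_t offset; // offset from the start of the section
     std::uint64_t size;   // 0 if the format does not record a size
   };

   struct AtomRange {
     std::string name;
     std::uint64_t offset;
     std::uint64_t length;
   };

   // Cleave one section into atom ranges at its symbol offsets.
   std::vector<AtomRange> cleaveSection(std::vector<Symbol> symbols,
                                        std::uint64_t sectionSize) {
     std::sort(symbols.begin(), symbols.end(),
               [](const Symbol &a, const Symbol &b) {
                 return a.offset < b.offset;
               });

     std::vector<AtomRange> atoms;
     atoms.reserve(symbols.size());
     for (std::size_t i = 0; i < symbols.size(); ++i) {
       const Symbol &sym = symbols[i];
       // Without a recorded size, the atom runs to the next symbol or to
       // the end of the section.
       std::uint64_t end =
           (i + 1 < symbols.size()) ? symbols[i + 1].offset : sectionSize;
       std::uint64_t length = sym.size != 0 ? sym.size : end - sym.offset;
       atoms.push_back({sym.name, sym.offset, length});
     }
     return atoms;
   }

   } // namespace sketch

The symbols are taken by value so that the function can sort its own copy
without disturbing the order of the caller's symbol table.
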
Other section types can be implicitly cleaved. For instance, C-string literals
or unwind info (e.g. .eh_frame) can be cleaved by having the Reader look at
the content of the section.  It is important to cleave sections into Atoms
to remove false dependencies. For instance, the .eh_frame section often
has no symbols, but contains "pointers" to the functions for which it
has unwind info.  If the .eh_frame section were not cleaved (but left as one
big Atom), there would always be a reference (from the eh_frame Atom) to
each function, so the linker would be unable to coalesce or dead-strip
the function atoms.

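As an illustration of implicit cleaving, the sketch below splits a C-string
literal section at its NUL terminators. ``CStringAtom`` and
``cleaveCStrings()`` are hypothetical names, and ``std::string_view`` is
used in place of LLVM's own reference types to keep the example
self-contained. An unterminated tail is treated as one final atom, which is
an assumption of this sketch rather than a rule of any file format.

.. code-block:: c++

   #include <cstddef>
   #include <string_view>
   #include <vector>

   namespace sketch { // all names here are illustrative

   struct CStringAtom {
     std::size_t offset;       // offset of the string within the section
     std::string_view content; // bytes of the string, terminator included
   };

   // Cleave a C-string section into one atom per NUL-terminated string.
   std::vector<CStringAtom> cleaveCStrings(std::string_view section) {
     std::vector<CStringAtom> atoms;
     std::size_t start = 0;
     while (start < section.size()) {
       std::size_t nul = section.find('\0', start);
       // If the last string is unterminated, it runs to the section end.
       std::size_t end =
           (nul == std::string_view::npos) ? section.size() : nul + 1;
       atoms.push_back({start, section.substr(start, end - start)});
       start = end;
     }
     return atoms;
   }

   } // namespace sketch

Each atom's content is a view into the section rather than a copy, so the
section bytes must outlive the atoms.
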
The lld Atom model also requires that a reference to an undefined symbol be
modeled as a Reference to an UndefinedAtom. So the Reader also needs to
create an UndefinedAtom for each undefined symbol in the object file.

Once all Atoms have been created, the second step is to create References
(recall that Atoms are "nodes" and References are "edges"). Most References
are created by looking at the "relocation records" in the object file. If
a function contains a call to "malloc", there is usually a relocation record
specifying the address in the section and the symbol table index. Your
Reader will need to convert the address into an Atom and an offset within
that Atom, and the symbol table index into a target Atom. If "malloc" is not
defined in the object file, the target Atom of the Reference will be an
UndefinedAtom.

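The conversion from relocation records to References can be sketched as
follows. All types and functions here are hypothetical simplifications: an
undefined atom is modeled as an ``Atom`` with ``defined`` set to false, the
section's atoms are assumed to be sorted by offset and non-overlapping, and
``symbolToAtom`` is assumed to have been filled in already, with one entry
per symbol table index.

.. code-block:: c++

   #include <algorithm>
   #include <cstddef>
   #include <cstdint>
   #include <string>
   #include <vector>

   namespace sketch { // all names here are illustrative

   struct Atom {
     std::string name;
     std::uint64_t offset; // start of the atom within its section
     std::uint64_t length;
     bool defined;         // false models an UndefinedAtom
   };

   struct Relocation {
     std::uint64_t address;   // address within the section
     std::size_t symbolIndex; // index into the symbol table
   };

   struct Reference {
     const Atom *source;         // atom containing the fixup location
     std::uint64_t offsetInAtom; // fixup location relative to source
     const Atom *target;         // atom being referred to
   };

   // Find the atom whose range contains the address, or nullptr.
   const Atom *findContaining(const std::vector<Atom> &atoms,
                              std::uint64_t address) {
     auto it = std::upper_bound(
         atoms.begin(), atoms.end(), address,
         [](std::uint64_t addr, const Atom &a) { return addr < a.offset; });
     if (it == atoms.begin())
       return nullptr;
     --it; // last atom starting at or before the address
     return (address - it->offset < it->length) ? &*it : nullptr;
   }

   std::vector<Reference>
   makeReferences(const std::vector<Atom> &sectionAtoms,
                  const std::vector<const Atom *> &symbolToAtom,
                  const std::vector<Relocation> &relocs) {
     std::vector<Reference> refs;
     refs.reserve(relocs.size());
     for (const Relocation &r : relocs) {
       const Atom *source = findContaining(sectionAtoms, r.address);
       // Malformed records are skipped here; a real Reader would
       // presumably report an error instead.
       if (!source || r.symbolIndex >= symbolToAtom.size())
         continue;
       refs.push_back({source, r.address - source->offset,
                       symbolToAtom[r.symbolIndex]});
     }
     return refs;
   }

   } // namespace sketch

For a call to "malloc" that is not defined in the object file, the entry in
``symbolToAtom`` would point at the undefined atom created for that symbol,
so the resulting Reference targets it without any special casing.
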
Performance
-----------

Once you have the above working to parse an object file into Atoms and
References, you'll want to look at performance.  Some techniques that can
help performance are:

* Use llvm::BumpPtrAllocator or pre-allocate one big vector<Reference> and then
  just have each atom point to its subrange of References in that vector.
  This can be faster than allocating each Reference as a separate object.
* Pre-scan the symbol table and determine how many atoms are in each section,
  then allocate space for all the Atom objects at once.
* Don't copy symbol names or section content to each Atom; instead, use
  StringRef and ArrayRef in each Atom to point to its name and content in the
  MemoryBuffer.

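The sketch below combines these techniques in a simplified form. It is not
lld code: the types are hypothetical, ``std::string_view`` stands in for
``StringRef``, a plain ``std::vector`` stands in for a bump allocator, and
``RawSymbol`` is an invented intermediate form representing what a pre-scan
of the symbol table might produce.

.. code-block:: c++

   #include <cstddef>
   #include <string_view>
   #include <vector>

   namespace sketch { // all names here are illustrative

   struct Reference {
     std::size_t targetAtom;   // index of the target in File::atoms
     std::size_t offsetInAtom;
   };

   struct Atom {
     std::string_view name;    // points into the buffer; nothing is copied
     std::size_t firstRef = 0; // start of this atom's range in the pool
     std::size_t refCount = 0;
   };

   struct RawSymbol {
     std::size_t nameOffset;   // position of the name in the buffer
     std::size_t nameLength;
     std::vector<Reference> refs;
   };

   struct File {
     std::string_view buffer;           // must outlive the File
     std::vector<Atom> atoms;
     std::vector<Reference> references; // one pool shared by all atoms

     const Reference *refsBegin(const Atom &a) const {
       return references.data() + a.firstRef;
     }
     const Reference *refsEnd(const Atom &a) const {
       return refsBegin(a) + a.refCount;
     }
   };

   File buildFile(std::string_view buffer,
                  const std::vector<RawSymbol> &symbols) {
     // Pre-scan: count everything first so each vector allocates once.
     std::size_t totalRefs = 0;
     for (const RawSymbol &s : symbols)
       totalRefs += s.refs.size();

     File file;
     file.buffer = buffer;
     file.atoms.reserve(symbols.size());
     file.references.reserve(totalRefs);

     for (const RawSymbol &s : symbols) {
       Atom atom;
       atom.name = buffer.substr(s.nameOffset, s.nameLength);
       atom.firstRef = file.references.size();
       atom.refCount = s.refs.size();
       file.references.insert(file.references.end(), s.refs.begin(),
                              s.refs.end());
       file.atoms.push_back(atom);
     }
     return file;
   }

   } // namespace sketch

Storing each atom's range as an index and a count, rather than as pointers,
keeps the ranges valid even if the pool is later moved or reallocated.
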
Testing
-------

We are still working on infrastructure to test Readers. The issue is that
you don't want to check in binary files to the test suite, and the tools
for creating your object file from assembly source may not be available on
every OS.

We are investigating a way to use YAML to describe the sections, symbols,
and content of a file, together with some code which will write out an
object file from that YAML description.

Once that is in place, you can write test cases that contain section/symbol
YAML and are run through the linker to produce Atom/Reference-based YAML, which
is then run through FileCheck to verify that the Atoms and References are as
expected.