forked from OSchip/llvm-project
				
			
		
			
				
	
	
		
			465 lines
		
	
	
		
			16 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
			
		
		
	
	
			465 lines
		
	
	
		
			16 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
| ======================================
 | |
| Kaleidoscope: Adding Debug Information
 | |
| ======================================
 | |
| 
 | |
| .. contents::
 | |
|    :local:
 | |
| 
 | |
| Chapter 9 Introduction
 | |
| ======================
 | |
| 
 | |
| Welcome to Chapter 9 of the "`Implementing a language with
 | |
| LLVM <index.html>`_" tutorial. In chapters 1 through 8, we've built a
 | |
| decent little programming language with functions and variables.
 | |
| What happens if something goes wrong though, how do you debug your
 | |
| program?
 | |
| 
 | |
| Source level debugging uses formatted data that helps a debugger
 | |
| translate from binary and the state of the machine back to the
 | |
| source that the programmer wrote. In LLVM we generally use a format
 | |
| called `DWARF <http://dwarfstd.org>`_. DWARF is a compact encoding
 | |
| that represents types, source locations, and variable locations.
 | |
| 
 | |
| The short summary of this chapter is that we'll go through the
 | |
| various things you have to add to a programming language to
 | |
| support debug info, and how you translate that into DWARF.
 | |
| 
 | |
| Caveat: For now we can't debug via the JIT, so we'll need to compile
 | |
| our program down to something small and standalone. As part of this
 | |
| we'll make a few modifications to the running of the language and
 | |
| how programs are compiled. This means that we'll have a source file
 | |
| with a simple program written in Kaleidoscope rather than the
 | |
| interactive JIT. It does involve a limitation that we can only
 | |
| have one "top level" command at a time to reduce the number of
 | |
| changes necessary.
 | |
| 
 | |
| Here's the sample program we'll be compiling:
 | |
| 
 | |
| .. code-block:: python
 | |
| 
 | |
|    def fib(x)
 | |
|      if x < 3 then
 | |
|        1
 | |
|      else
 | |
|        fib(x-1)+fib(x-2);
 | |
| 
 | |
|    fib(10)
 | |
| 
 | |
| 
 | |
| Why is this a hard problem?
 | |
| ===========================
 | |
| 
 | |
| Debug information is a hard problem for a few different reasons - mostly
 | |
| centered around optimized code. First, optimization makes keeping source
 | |
| locations more difficult. In LLVM IR we keep the original source location
 | |
| for each IR level instruction on the instruction. Optimization passes
 | |
| should keep the source locations for newly created instructions, but merged
 | |
| instructions only get to keep a single location - this can cause jumping
 | |
| around when stepping through optimized programs. Secondly, optimization
 | |
| can move variables in ways that are either optimized out, shared in memory
 | |
| with other variables, or difficult to track. For the purposes of this
 | |
| tutorial we're going to avoid optimization (as you'll see with one of the
 | |
| next sets of patches).
 | |
| 
 | |
| Ahead-of-Time Compilation Mode
 | |
| ==============================
 | |
| 
 | |
| To highlight only the aspects of adding debug information to a source
 | |
| language without needing to worry about the complexities of JIT debugging
 | |
| we're going to make a few changes to Kaleidoscope to support compiling
 | |
| the IR emitted by the front end into a simple standalone program that
 | |
| you can execute, debug, and see results.
 | |
| 
 | |
| First we make our anonymous function that contains our top level
 | |
| statement be our "main":
 | |
| 
 | |
| .. code-block:: udiff
 | |
| 
 | |
|   -    auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
 | |
|   +    auto Proto = llvm::make_unique<PrototypeAST>("main", std::vector<std::string>());
 | |
| 
 | |
| just with the simple change of giving it a name.
 | |
| 
 | |
| Then we're going to remove the command line code wherever it exists:
 | |
| 
 | |
| .. code-block:: udiff
 | |
| 
 | |
|   @@ -1129,7 +1129,6 @@ static void HandleTopLevelExpression() {
 | |
|    /// top ::= definition | external | expression | ';'
 | |
|    static void MainLoop() {
 | |
|      while (1) {
 | |
|   -    fprintf(stderr, "ready> ");
 | |
|        switch (CurTok) {
 | |
|        case tok_eof:
 | |
|          return;
 | |
|   @@ -1184,7 +1183,6 @@ int main() {
 | |
|      BinopPrecedence['*'] = 40; // highest.
 | |
| 
 | |
|      // Prime the first token.
 | |
|   -  fprintf(stderr, "ready> ");
 | |
|      getNextToken();
 | |
| 
 | |
| Lastly we're going to disable all of the optimization passes and the JIT so
 | |
| that the only thing that happens after we're done parsing and generating
 | |
| code is that the LLVM IR goes to standard error:
 | |
| 
 | |
| .. code-block:: udiff
 | |
| 
 | |
|   @@ -1108,17 +1108,8 @@ static void HandleExtern() {
 | |
|    static void HandleTopLevelExpression() {
 | |
|      // Evaluate a top-level expression into an anonymous function.
 | |
|      if (auto FnAST = ParseTopLevelExpr()) {
 | |
|   -    if (auto *FnIR = FnAST->codegen()) {
 | |
|   -      // We're just doing this to make sure it executes.
 | |
|   -      TheExecutionEngine->finalizeObject();
 | |
|   -      // JIT the function, returning a function pointer.
 | |
|   -      void *FPtr = TheExecutionEngine->getPointerToFunction(FnIR);
 | |
|   -
 | |
|   -      // Cast it to the right type (takes no arguments, returns a double) so we
 | |
|   -      // can call it as a native function.
 | |
|   -      double (*FP)() = (double (*)())(intptr_t)FPtr;
 | |
|   -      // Ignore the return value for this.
 | |
|   -      (void)FP;
 | |
|   +    if (!F->codegen()) {
 | |
|   +      fprintf(stderr, "Error generating code for top level expr");
 | |
|        }
 | |
|      } else {
 | |
|        // Skip token for error recovery.
 | |
|   @@ -1439,11 +1459,11 @@ int main() {
 | |
|      // target lays out data structures.
 | |
|      TheModule->setDataLayout(TheExecutionEngine->getDataLayout());
 | |
|      OurFPM.add(new DataLayoutPass());
 | |
|   +#if 0
 | |
|      OurFPM.add(createBasicAliasAnalysisPass());
 | |
|      // Promote allocas to registers.
 | |
|      OurFPM.add(createPromoteMemoryToRegisterPass());
 | |
|   @@ -1218,7 +1210,7 @@ int main() {
 | |
|      OurFPM.add(createGVNPass());
 | |
|      // Simplify the control flow graph (deleting unreachable blocks, etc).
 | |
|      OurFPM.add(createCFGSimplificationPass());
 | |
|   -
 | |
|   +  #endif
 | |
|      OurFPM.doInitialization();
 | |
| 
 | |
|      // Set the global so the code gen can use this.
 | |
| 
 | |
| This relatively small set of changes get us to the point that we can compile
 | |
| our piece of Kaleidoscope language down to an executable program via this
 | |
| command line:
 | |
| 
 | |
| .. code-block:: bash
 | |
| 
 | |
|   Kaleidoscope-Ch9 < fib.ks | & clang -x ir -
 | |
| 
 | |
| which gives an a.out/a.exe in the current working directory.
 | |
| 
 | |
| Compile Unit
 | |
| ============
 | |
| 
 | |
| The top level container for a section of code in DWARF is a compile unit.
 | |
| This contains the type and function data for an individual translation unit
 | |
| (read: one file of source code). So the first thing we need to do is
 | |
| construct one for our fib.ks file.
 | |
| 
 | |
| DWARF Emission Setup
 | |
| ====================
 | |
| 
 | |
| Similar to the ``IRBuilder`` class we have a
 | |
| `DIBuilder <http://llvm.org/doxygen/classllvm_1_1DIBuilder.html>`_ class
 | |
| that helps in constructing debug metadata for an LLVM IR file. It
 | |
| corresponds 1:1 similarly to ``IRBuilder`` and LLVM IR, but with nicer names.
 | |
| Using it does require that you be more familiar with DWARF terminology than
 | |
| you needed to be with ``IRBuilder`` and ``Instruction`` names, but if you
 | |
| read through the general documentation on the
 | |
| `Metadata Format <http://llvm.org/docs/SourceLevelDebugging.html>`_ it
 | |
| should be a little more clear. We'll be using this class to construct all
 | |
| of our IR level descriptions. Construction for it takes a module so we
 | |
| need to construct it shortly after we construct our module. We've left it
 | |
| as a global static variable to make it a bit easier to use.
 | |
| 
 | |
| Next we're going to create a small container to cache some of our frequent
 | |
| data. The first will be our compile unit, but we'll also write a bit of
 | |
| code for our one type since we won't have to worry about multiple typed
 | |
| expressions:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   static DIBuilder *DBuilder;
 | |
| 
 | |
|   struct DebugInfo {
 | |
|     DICompileUnit *TheCU;
 | |
|     DIType *DblTy;
 | |
| 
 | |
|     DIType *getDoubleTy();
 | |
|   } KSDbgInfo;
 | |
| 
 | |
|   DIType *DebugInfo::getDoubleTy() {
 | |
|     if (DblTy)
 | |
|       return DblTy;
 | |
| 
 | |
|     DblTy = DBuilder->createBasicType("double", 64, 64, dwarf::DW_ATE_float);
 | |
|     return DblTy;
 | |
|   }
 | |
| 
 | |
| And then later on in ``main`` when we're constructing our module:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   DBuilder = new DIBuilder(*TheModule);
 | |
| 
 | |
|   KSDbgInfo.TheCU = DBuilder->createCompileUnit(
 | |
|       dwarf::DW_LANG_C, "fib.ks", ".", "Kaleidoscope Compiler", 0, "", 0);
 | |
| 
 | |
| There are a couple of things to note here. First, while we're producing a
 | |
| compile unit for a language called Kaleidoscope we used the language
 | |
| constant for C. This is because a debugger wouldn't necessarily understand
 | |
| the calling conventions or default ABI for a language it doesn't recognize
 | |
| and we follow the C ABI in our LLVM code generation so it's the closest
 | |
| thing to accurate. This ensures we can actually call functions from the
 | |
| debugger and have them execute. Secondly, you'll see the "fib.ks" in the
 | |
| call to ``createCompileUnit``. This is a default hard coded value since
 | |
| we're using shell redirection to put our source into the Kaleidoscope
 | |
| compiler. In a usual front end you'd have an input file name and it would
 | |
| go there.
 | |
| 
 | |
| One last thing as part of emitting debug information via DIBuilder is that
 | |
| we need to "finalize" the debug information. The reasons are part of the
 | |
| underlying API for DIBuilder, but make sure you do this near the end of
 | |
| main:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   DBuilder->finalize();
 | |
| 
 | |
| before you dump out the module.
 | |
| 
 | |
| Functions
 | |
| =========
 | |
| 
 | |
| Now that we have our ``Compile Unit`` and our source locations, we can add
 | |
| function definitions to the debug info. So in ``PrototypeAST::codegen()`` we
 | |
| add a few lines of code to describe a context for our subprogram, in this
 | |
| case the "File", and the actual definition of the function itself.
 | |
| 
 | |
| So the context:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   DIFile *Unit = DBuilder->createFile(KSDbgInfo.TheCU.getFilename(),
 | |
|                                       KSDbgInfo.TheCU.getDirectory());
 | |
| 
 | |
| giving us an DIFile and asking the ``Compile Unit`` we created above for the
 | |
| directory and filename where we are currently. Then, for now, we use some
 | |
| source locations of 0 (since our AST doesn't currently have source location
 | |
| information) and construct our function definition:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   DIScope *FContext = Unit;
 | |
|   unsigned LineNo = 0;
 | |
|   unsigned ScopeLine = 0;
 | |
|   DISubprogram *SP = DBuilder->createFunction(
 | |
|       FContext, P.getName(), StringRef(), Unit, LineNo,
 | |
|       CreateFunctionType(TheFunction->arg_size(), Unit),
 | |
|       false /* internal linkage */, true /* definition */, ScopeLine,
 | |
|       DINode::FlagPrototyped, false);
 | |
|   TheFunction->setSubprogram(SP);
 | |
| 
 | |
| and we now have an DISubprogram that contains a reference to all of our
 | |
| metadata for the function.
 | |
| 
 | |
| Source Locations
 | |
| ================
 | |
| 
 | |
| The most important thing for debug information is accurate source location -
 | |
| this makes it possible to map your source code back. We have a problem though,
 | |
| Kaleidoscope really doesn't have any source location information in the lexer
 | |
| or parser so we'll need to add it.
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|    struct SourceLocation {
 | |
|      int Line;
 | |
|      int Col;
 | |
|    };
 | |
|    static SourceLocation CurLoc;
 | |
|    static SourceLocation LexLoc = {1, 0};
 | |
| 
 | |
|    static int advance() {
 | |
|      int LastChar = getchar();
 | |
| 
 | |
|      if (LastChar == '\n' || LastChar == '\r') {
 | |
|        LexLoc.Line++;
 | |
|        LexLoc.Col = 0;
 | |
|      } else
 | |
|        LexLoc.Col++;
 | |
|      return LastChar;
 | |
|    }
 | |
| 
 | |
| In this set of code we've added some functionality on how to keep track of the
 | |
| line and column of the "source file". As we lex every token we set our current
 | |
| current "lexical location" to the assorted line and column for the beginning
 | |
| of the token. We do this by overriding all of the previous calls to
 | |
| ``getchar()`` with our new ``advance()`` that keeps track of the information
 | |
| and then we have added to all of our AST classes a source location:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|    class ExprAST {
 | |
|      SourceLocation Loc;
 | |
| 
 | |
|      public:
 | |
|        ExprAST(SourceLocation Loc = CurLoc) : Loc(Loc) {}
 | |
|        virtual ~ExprAST() {}
 | |
|        virtual Value* codegen() = 0;
 | |
|        int getLine() const { return Loc.Line; }
 | |
|        int getCol() const { return Loc.Col; }
 | |
|        virtual raw_ostream &dump(raw_ostream &out, int ind) {
 | |
|          return out << ':' << getLine() << ':' << getCol() << '\n';
 | |
|        }
 | |
| 
 | |
| that we pass down through when we create a new expression:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|    LHS = llvm::make_unique<BinaryExprAST>(BinLoc, BinOp, std::move(LHS),
 | |
|                                           std::move(RHS));
 | |
| 
 | |
| giving us locations for each of our expressions and variables.
 | |
| 
 | |
| To make sure that every instruction gets proper source location information,
 | |
| we have to tell ``Builder`` whenever we're at a new source location.
 | |
| We use a small helper function for this:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   void DebugInfo::emitLocation(ExprAST *AST) {
 | |
|     DIScope *Scope;
 | |
|     if (LexicalBlocks.empty())
 | |
|       Scope = TheCU;
 | |
|     else
 | |
|       Scope = LexicalBlocks.back();
 | |
|     Builder.SetCurrentDebugLocation(
 | |
|         DebugLoc::get(AST->getLine(), AST->getCol(), Scope));
 | |
|   }
 | |
| 
 | |
| This both tells the main ``IRBuilder`` where we are, but also what scope
 | |
| we're in. The scope can either be on compile-unit level or be the nearest
 | |
| enclosing lexical block like the current function.
 | |
| To represent this we create a stack of scopes:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|    std::vector<DIScope *> LexicalBlocks;
 | |
| 
 | |
| and push the scope (function) to the top of the stack when we start
 | |
| generating the code for each function:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   KSDbgInfo.LexicalBlocks.push_back(SP);
 | |
| 
 | |
| Also, we may not forget to pop the scope back off of the scope stack at the
 | |
| end of the code generation for the function:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   // Pop off the lexical block for the function since we added it
 | |
|   // unconditionally.
 | |
|   KSDbgInfo.LexicalBlocks.pop_back();
 | |
| 
 | |
| Then we make sure to emit the location every time we start to generate code
 | |
| for a new AST object:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|    KSDbgInfo.emitLocation(this);
 | |
| 
 | |
| Variables
 | |
| =========
 | |
| 
 | |
| Now that we have functions, we need to be able to print out the variables
 | |
| we have in scope. Let's get our function arguments set up so we can get
 | |
| decent backtraces and see how our functions are being called. It isn't
 | |
| a lot of code, and we generally handle it when we're creating the
 | |
| argument allocas in ``FunctionAST::codegen``.
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|     // Record the function arguments in the NamedValues map.
 | |
|     NamedValues.clear();
 | |
|     unsigned ArgIdx = 0;
 | |
|     for (auto &Arg : TheFunction->args()) {
 | |
|       // Create an alloca for this variable.
 | |
|       AllocaInst *Alloca = CreateEntryBlockAlloca(TheFunction, Arg.getName());
 | |
| 
 | |
|       // Create a debug descriptor for the variable.
 | |
|       DILocalVariable *D = DBuilder->createParameterVariable(
 | |
|           SP, Arg.getName(), ++ArgIdx, Unit, LineNo, KSDbgInfo.getDoubleTy(),
 | |
|           true);
 | |
| 
 | |
|       DBuilder->insertDeclare(Alloca, D, DBuilder->createExpression(),
 | |
|                               DebugLoc::get(LineNo, 0, SP),
 | |
|                               Builder.GetInsertBlock());
 | |
| 
 | |
|       // Store the initial value into the alloca.
 | |
|       Builder.CreateStore(&Arg, Alloca);
 | |
| 
 | |
|       // Add arguments to variable symbol table.
 | |
|       NamedValues[Arg.getName()] = Alloca;
 | |
|     }
 | |
| 
 | |
| 
 | |
| Here we're first creating the variable, giving it the scope (``SP``),
 | |
| the name, source location, type, and since it's an argument, the argument
 | |
| index. Next, we create an ``lvm.dbg.declare`` call to indicate at the IR
 | |
| level that we've got a variable in an alloca (and it gives a starting
 | |
| location for the variable), and setting a source location for the
 | |
| beginning of the scope on the declare.
 | |
| 
 | |
| One interesting thing to note at this point is that various debuggers have
 | |
| assumptions based on how code and debug information was generated for them
 | |
| in the past. In this case we need to do a little bit of a hack to avoid
 | |
| generating line information for the function prologue so that the debugger
 | |
| knows to skip over those instructions when setting a breakpoint. So in
 | |
| ``FunctionAST::CodeGen`` we add some more lines:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   // Unset the location for the prologue emission (leading instructions with no
 | |
|   // location in a function are considered part of the prologue and the debugger
 | |
|   // will run past them when breaking on a function)
 | |
|   KSDbgInfo.emitLocation(nullptr);
 | |
| 
 | |
| and then emit a new location when we actually start generating code for the
 | |
| body of the function:
 | |
| 
 | |
| .. code-block:: c++
 | |
| 
 | |
|   KSDbgInfo.emitLocation(Body.get());
 | |
| 
 | |
| With this we have enough debug information to set breakpoints in functions,
 | |
| print out argument variables, and call functions. Not too bad for just a
 | |
| few simple lines of code!
 | |
| 
 | |
| Full Code Listing
 | |
| =================
 | |
| 
 | |
| Here is the complete code listing for our running example, enhanced with
 | |
| debug information. To build this example, use:
 | |
| 
 | |
| .. code-block:: bash
 | |
| 
 | |
|     # Compile
 | |
|     clang++ -g toy.cpp `llvm-config --cxxflags --ldflags --system-libs --libs core mcjit native` -O3 -o toy
 | |
|     # Run
 | |
|     ./toy
 | |
| 
 | |
| Here is the code:
 | |
| 
 | |
| .. literalinclude:: ../../examples/Kaleidoscope/Chapter9/toy.cpp
 | |
|    :language: c++
 | |
| 
 | |
| `Next: Conclusion and other useful LLVM tidbits <LangImpl10.html>`_
 | |
| 
 |