402 lines
		
	
	
		
			19 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
			
		
		
	
	
			402 lines
		
	
	
		
			19 KiB
		
	
	
	
		
			ReStructuredText
		
	
	
	
| ======================================
 | |
| XRay Flight Data Recorder Trace Format
 | |
| ======================================
 | |
| 
 | |
| :Version: 1 as of 2017-07-20
 | |
| 
 | |
| .. contents::
 | |
|    :local:
 | |
| 
 | |
| 
 | |
| Introduction
 | |
| ============
 | |
| 
 | |
| When gathering XRay traces in Flight Data Recorder mode, each thread of an
 | |
| application will claim buffers to fill with trace data, which at some point
 | |
| is finalized and flushed.
 | |
| 
 | |
| A goal of the profiler is to minimize overhead, the flushed data directly
 | |
| corresponds to the buffer.
 | |
| 
 | |
| This document describes the format of a trace file.
 | |
| 
 | |
| 
 | |
| General
 | |
| =======
 | |
| 
 | |
| Each trace file corresponds to a sequence of events in a particular thread.
 | |
| 
 | |
| The file has a header followed by a sequence of discriminated record types.
 | |
| 
 | |
| The endianness of byte fields matches the endianness of the platform which
 | |
| produced the trace file.
 | |
| 
 | |
| 
 | |
| Header Section
 | |
| ==============
 | |
| 
 | |
| A trace file begins with a 32 byte header.
 | |
| 
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| | Field             | Size (bytes)    | Description                            |
 | |
| +===================+=================+========================================+
 | |
| | version           | ``2``           | Anticipates versioned  readers. This   |
 | |
| |                   |                 | document describes the format when     |
 | |
| |                   |                 | version == 1                           |
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| | type              | ``2``           | An enumeration encoding the type of    |
 | |
| |                   |                 | trace. Flight Data Recorder mode       |
 | |
| |                   |                 | traces have type == 1                  |
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| | bitfield          | ``4``           | Holds parameters that are not aligned  |
 | |
| |                   |                 | to bytes. Further described below.     |
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| | cycle_frequency   | ``8``           | The frequency in hertz of the CPU      |
 | |
| |                   |                 | oscillator used to measure duration of |
 | |
| |                   |                 | events in ticks.                       |
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| | buffer_size       | ``8``           | The size in bytes of the data portion  |
 | |
| |                   |                 | of the trace following the header.     |
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| | reserved          | ``8``           | Reserved for future use.               |
 | |
| +-------------------+-----------------+----------------------------------------+
 | |
| 
 | |
| The bitfield parameter of the file header is composed of the following fields.
 | |
| 
 | |
| +-------------------+----------------+-----------------------------------------+
 | |
| | Field             | Size (bits)    | Description                             |
 | |
| +===================+================+=========================================+
 | |
| | constant_tsc      | ``1``          | Whether the platform's timestamp        |
 | |
| |                   |                | counter used to record ticks between    |
 | |
| |                   |                | events ticks at a constant frequency    |
 | |
| |                   |                | despite CPU frequency changes.          |
 | |
| |                   |                | 0 == non-constant. 1 == constant.       |
 | |
| +-------------------+----------------+-----------------------------------------+
 | |
| | nonstop_tsc       | ``1``          | Whether the tsc continues to count      |
 | |
| |                   |                | despite whether the CPU is in a low     |
 | |
| |                   |                | power state. 0 == stop. 1 == non-stop.  |
 | |
| +-------------------+----------------+-----------------------------------------+
 | |
| | reserved          | ``30``         | Not meaningful.                         |
 | |
| +-------------------+----------------+-----------------------------------------+
 | |
| 
 | |
| 
 | |
| Data Section
 | |
| ============
 | |
| 
 | |
| Following the header in a trace is a data section with size matching the
 | |
| buffer_size field in the header.
 | |
| 
 | |
| The data section is a stream of elements of different types.
 | |
| 
 | |
| There are a few categories of data in the sequence.
 | |
| 
 | |
| - ``Function Records``: Function Records contain the timing of entry into and
 | |
|   exit from function execution. Function Records have 8 bytes each.
 | |
| 
 | |
| - ``Metadata Records``: Metadata records serve many purposes. Mostly, they
 | |
|   capture information that may be too costly to record for each function, but
 | |
|   that is required to contextualize the fine-grained timings. They also are used
 | |
|   as markers for user-defined Event Data payloads. Metadata records have 16
 | |
|   bytes each.
 | |
| 
 | |
| - ``Event Data``: Free form data may be associated with events that are traced
 | |
|   by the binary and encode data defined by a handler function. Event data is
 | |
|   always preceded with a marker record which indicates how large it is.
 | |
| 
 | |
| - ``Function Arguments``: The arguments to some functions are included in the
 | |
|   trace. These are either pointer addresses or primitives that are read and
 | |
|   logged independently of their types in a high level language. To the tracer,
 | |
|   they are all numbers. Function Records that have attached arguments will
 | |
|   indicate their presence on the function entry record. We only support logging
 | |
|   contiguous function argument sequences starting with argument zero, which will
 | |
|   be the "this" pointer for member function invocations. For example, we don't
 | |
|   support logging the first and third argument.
 | |
| 
 | |
| A reader of the memory format must maintain a state machine. The format makes no
 | |
| attempt to pad for alignment, and it is not seekable.
 | |
| 
 | |
| 
 | |
| Function Records
 | |
| ----------------
 | |
| 
 | |
| Function Records have an 8 byte layout. This layout encodes information to
 | |
| reconstruct a call stack of instrumented function and their durations.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bits)  | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | discriminant  | ``1``        | Indicates whether a reader should read a      |
 | |
| |               |              | Function or Metadata record. Set to ``0`` for |
 | |
| |               |              | Function records.                             |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | action        | ``3``        | Specifies whether the function is being       |
 | |
| |               |              | entered, exited, or is a non-standard entry   |
 | |
| |               |              | or exit produced by optimizations.            |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | function_id   | ``28``       | A numeric ID for the function. Resolved to a  |
 | |
| |               |              | name via the xray instrumentation map. The    |
 | |
| |               |              | instrumentation map is built by xray at       |
 | |
| |               |              | compile time into an object file and pairs    |
 | |
| |               |              | the function ids to addresses. It is used for |
 | |
| |               |              | patching and as a lookup into the binary's    |
 | |
| |               |              | symbols to obtain names.                      |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | tsc_delta     | ``32``       | The number of ticks of the timestamp counter  |
 | |
| |               |              | since a previous record recorded a delta or   |
 | |
| |               |              | other TSC resetting event.                    |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| On little-endian machines, the bitfields are ordered from least significant bit
 | |
| bit to most significant bit. A reader can read an 8 bit value and apply the mask
 | |
| ``0x01`` for the discriminant. Similarly, they can read 32 bits and unsigned
 | |
| shift right by ``0x04`` to obtain the function_id field.
 | |
| 
 | |
| On big-endian machine, the bitfields are written in order from most significant
 | |
| bit to least significant bit. A reader would read an 8 bit value and unsigned
 | |
| shift right by 7 bits for the discriminant. The function_id field could be
 | |
| obtained by reading a 32 bit value and applying the mask ``0x0FFFFFFF``.
 | |
| 
 | |
| Function action types are as follows.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Type          | Number       | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | Entry         | ``0``        | Typical function entry.                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Exit          | ``1``        | Typical function exit.                        |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Tail_Exit     | ``2``        | An exit from a function due to tail call      |
 | |
| |               |              | optimization.                                 |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Entry_Args    | ``3``        | A function entry that records arguments.      |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| Entry_Args records do not contain the arguments themselves. Instead, metadata
 | |
| records for each of the logged args follow the function record in the stream.
 | |
| 
 | |
| 
 | |
| Metadata Records
 | |
| ----------------
 | |
| 
 | |
| Interspersed throughout the buffer are 16 byte Metadata records. For typically
 | |
| instrumented binaries, they will be sparser than Function records, and they
 | |
| provide a fuller picture of the binary execution state.
 | |
| 
 | |
| Metadata record layout is partially record dependent, but they share a common
 | |
| structure.
 | |
| 
 | |
| The same bit field rules described for function records apply to the first byte
 | |
| of MetadataRecords. Within this byte, little endian machines use lsb to msb
 | |
| ordering and big endian machines use msb to lsb ordering.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size         | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | discriminant  | ``1 bit``    | Indicates whether a reader should read a      |
 | |
| |               |              | Function or Metadata record. Set to ``1`` for |
 | |
| |               |              | Metadata records.                             |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | record_kind   | ``7 bits``   | The type of Metadata record.                  |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | data          | ``15 bytes`` | A data field used differently for each record |
 | |
| |               |              | type.                                         |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| Here is a table of the enumerated record kinds.
 | |
| 
 | |
| +--------+---------------------------+
 | |
| | Number | Type                      |
 | |
| +========+===========================+
 | |
| | 0      | NewBuffer                 |
 | |
| +--------+---------------------------+
 | |
| | 1      | EndOfBuffer               |
 | |
| +--------+---------------------------+
 | |
| | 2      | NewCPUId                  |
 | |
| +--------+---------------------------+
 | |
| | 3      | TSCWrap                   |
 | |
| +--------+---------------------------+
 | |
| | 4      | WallTimeMarker            |
 | |
| +--------+---------------------------+
 | |
| | 5      | CustomEventMarker         |
 | |
| +--------+---------------------------+
 | |
| | 6      | CallArgument              |
 | |
| +--------+---------------------------+
 | |
| 
 | |
| 
 | |
| NewBuffer Records
 | |
| -----------------
 | |
| 
 | |
| Each buffer begins with a NewBuffer record immediately after the header.
 | |
| It records the thread ID of the thread that the trace belongs to.
 | |
| 
 | |
| Its data segment is as follows.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bytes) | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | thread_Id     | ``2``        | Thread ID for buffer.                         |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | reserved      | ``13``       | Unused.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| 
 | |
| WallClockTime Records
 | |
| ---------------------
 | |
| 
 | |
| Following the NewBuffer record, each buffer records an absolute time as a frame
 | |
| of reference for the durations recorded by timestamp counter deltas.
 | |
| 
 | |
| Its data segment is as follows.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bytes) | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | seconds       | ``8``        | Seconds on absolute timescale. The starting   |
 | |
| |               |              | point is unspecified and depends on the       |
 | |
| |               |              | implementation and platform configured by the |
 | |
| |               |              | tracer.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | microseconds  | ``4``        | The microsecond component of the time.        |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | reserved      | ``3``        | Unused.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| 
 | |
| NewCpuId Records
 | |
| ----------------
 | |
| 
 | |
| Each function entry invokes a routine to determine what CPU is executing.
 | |
| Typically, this is done with readtscp, which reads the timestamp counter at the
 | |
| same time.
 | |
| 
 | |
| If the tracing detects that the execution has switched CPUs or if this is the
 | |
| first instrumented entry point, the tracer will output a NewCpuId record.
 | |
| 
 | |
| Its data segment is as follows.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bytes) | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | cpu_id        | ``2``        | CPU Id.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | absolute_tsc  | ``8``        | The absolute value of the timestamp counter.  |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | reserved      | ``5``        | Unused.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| 
 | |
| TSCWrap Records
 | |
| ---------------
 | |
| 
 | |
| Since each function record uses a 32 bit value to represent the number of ticks
 | |
| of the timestamp counter since the last reference, it is possible for this value
 | |
| to overflow, particularly for sparsely instrumented binaries.
 | |
| 
 | |
| When this delta would not fit into a 32 bit representation, a reference absolute
 | |
| timestamp counter record is written in the form of a TSCWrap record.
 | |
| 
 | |
| Its data segment is as follows.
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bytes) | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | absolute_tsc  | ``8``        | Timestamp counter value.                      |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | reserved      | ``7``        | Unused.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| 
 | |
| CallArgument Records
 | |
| --------------------
 | |
| 
 | |
| Immediately following an Entry_Args type function record, there may be one or
 | |
| more CallArgument records that contain the traced function's parameter values.
 | |
| 
 | |
| The order of the CallArgument Record sequency corresponds one to one with the
 | |
| order of the function parameters.
 | |
| 
 | |
| CallArgument data segment:
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bytes) | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | argument      | ``8``        | Numeric argument (may be pointer address).    |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | reserved      | ``7``        | Unused.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| 
 | |
| CustomEventMarker Records
 | |
| -------------------------
 | |
| 
 | |
| XRay provides the feature of logging custom events. This may be leveraged to
 | |
| record tracing info for RPCs or similarly trace data that is application
 | |
| specific.
 | |
| 
 | |
| Custom Events themselves are an unstructured (application defined) segment of
 | |
| memory with arbitrary size within the buffer. They are preceded by
 | |
| CustomEventMarkers to indicate their presence and size.
 | |
| 
 | |
| CustomEventMarker data segment:
 | |
| 
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | Field         | Size (bytes) | Description                                   |
 | |
| +===============+==============+===============================================+
 | |
| | event_size    | ``4``        | Size of preceded event.                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | absolute_tsc  | ``8``        | A timestamp counter of the event.             |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| | reserved      | ``3``        | Unused.                                       |
 | |
| +---------------+--------------+-----------------------------------------------+
 | |
| 
 | |
| 
 | |
| EndOfBuffer Records
 | |
| -------------------
 | |
| 
 | |
| An EndOfBuffer record type indicates that there is no more trace data in this
 | |
| buffer. The reader is expected to seek past the remaining buffer_size expressed
 | |
| before the start of buffer and look for either another header or EOF.
 | |
| 
 | |
| 
 | |
| Format Grammar and Invariants
 | |
| =============================
 | |
| 
 | |
| Not all sequences of Metadata records and Function records are valid data. A
 | |
| sequence should be parsed as a state machine. The expectations for a valid
 | |
| format can be expressed as a context free grammar.
 | |
| 
 | |
| This is an attempt to explain the format with statements in EBNF format.
 | |
| 
 | |
| - Format := Header ThreadBuffer* EOF
 | |
| 
 | |
| - ThreadBuffer := NewBuffer WallClockTime NewCPUId BodySequence* End
 | |
| 
 | |
| - BodySequence := NewCPUId | TSCWrap | Function | CustomEvent
 | |
| 
 | |
| - Function := (Function_Entry_Args CallArgument*) | Function_Other_Type
 | |
| 
 | |
| - CustomEvent := CustomEventMarker CustomEventUnstructuredMemory
 | |
| 
 | |
| - End := EndOfBuffer RemainingBufferSizeToSkip
 | |
| 
 | |
| 
 | |
| Function Record Order
 | |
| ---------------------
 | |
| 
 | |
| There are a few clarifications that may help understand what is expected of
 | |
| Function records.
 | |
| 
 | |
| - Functions with an Exit are expected to have a corresponding Entry or
 | |
|   Entry_Args function record precede them in the trace.
 | |
| 
 | |
| - Tail_Exit Function records record the Function ID of the function whose return
 | |
|   address the program counter will take. In other words, the final function that
 | |
|   would be popped off of the call stack if tail call optimization was not used.
 | |
| 
 | |
| - Not all functions marked for instrumentation are necessarily in the trace. The
 | |
|   tracer uses heuristics to preserve the trace for non-trivial functions.
 | |
| 
 | |
| - Not every entry must have a traced Exit or Tail Exit. The buffer may run out
 | |
|   of space or the program may request for the tracer to finalize toreturn the
 | |
|   buffer before an instrumented function exits.
 |