
\chapter{Hardware Trace Measurement}
\label{section:trace_measurement}

Computer systems can be analyzed with measurement tools that detect events,
i.e.\ changes in the state of a system \cite[p. 28]{ferrari1978computer}.  The
same event can be interpreted on different levels as shown in
\autoref{fig:trace_event_levels}.  A hardware trace tool can detect a voltage
change in memory, e.g.\ triggered by the processor, which is a hardware event.
Accordingly, the variable that maps to the changed memory register changes too,
which is a software event.  If this variable is related to the state of a task,
a change of the variable also means a change of the task state, which is then
called a system event.

In many cases, the event of interest cannot be measured directly.  One or more
transformation steps are required to obtain the desired result.  If a
transformation process is executed, the measurement is said to be indirect
\cite[p. 28]{ferrari1978computer}.  Considering the previous example, a task
termination event cannot be measured directly.  However, a variable that
contains the current task state can be measured.  If the task corresponding
to the variable and the mapping from value to task state are known, a change of
the variable can be transformed into a higher-level event: the termination of a
task.  After the transformation process, the measurement results can be
displayed to the user as shown in \autoref{fig:concept_measurement}.
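
The transformation chain of this example can be sketched in a few lines of
Python.  The sketch is illustrative only: the address, the variable name
\lstinline{task_a_state}, and the state encoding are hypothetical and not
taken from any real system.

\begin{lstlisting}[language=Python]
# Hypothetical mapping data; a real tool would take it from the
# application binary and the operating system configuration.
ADDRESS_TO_VARIABLE = {0x70000010: "task_a_state"}
VARIABLE_TO_TASK = {"task_a_state": "TaskA"}
STATE_ENCODING = {0: "terminated", 1: "ready", 2: "running"}


def to_software_event(address, value):
    """Hardware event (memory write) -> software event (variable change)."""
    return ADDRESS_TO_VARIABLE[address], value


def to_system_event(variable, value):
    """Software event (variable change) -> system event (task state)."""
    return VARIABLE_TO_TASK[variable], STATE_ENCODING[value]


hardware_event = (0x70000010, 0)  # measured directly: address and value
software_event = to_software_event(*hardware_event)
system_event = to_system_event(*software_event)
print(system_event)  # ('TaskA', 'terminated')
\end{lstlisting}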

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/concept_measurement.pdf}
 \caption[Measurement process]{The conceptual parts of a measurement process
 according to Ferrari \cite{ferrari1978computer}.  A sensor measures data.  One
 or more transformation steps are required if the data is not yet in the
 desired format.  Finally the result can be presented to the user.}
 \label{fig:concept_measurement}
\end{figure}

During the transformation step, the collected data may be manipulated, which is
called prereduction.  Prereduction may, for example, be used when the individual
events are not required, but rather the number of events of a certain type that
occurred.  In this case, the transformer would increment a counter whenever an
event of that type is collected.  If no prereduction is executed, the
measurement process is called tracing.  Tracing is the process of recording a
sequence of events in chronological order of occurrence \cite[p.
30]{ferrari1978computer}.  The result of this process is called a trace.
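
The difference between prereduction and tracing can be illustrated with a
minimal Python sketch.  The event stream and its event type names are made up
for this example.

\begin{lstlisting}[language=Python]
from collections import Counter

# Hypothetical event stream: (timestamp, event type) in order of occurrence.
events = [(1, "task_start"), (4, "irq"), (6, "task_stop"), (9, "irq")]

# Prereduction: only a counter per event type is kept.
event_counts = Counter(event_type for _, event_type in events)

# Tracing: every event is recorded in chronological order.
trace = list(events)

print(event_counts["irq"])  # 2
print(trace[0])             # (1, 'task_start')
\end{lstlisting}

The counter needs constant memory per event type, whereas the trace grows with
every recorded event but preserves order and timing.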

\section{Trace Tools}

Ferrari \cite[p. 31ff]{ferrari1978computer} distinguishes three trace
measurement tools: software, hybrid, and hardware tools.  All tools are meant
to examine the behavior of a system.  However, there are differences in
interference, resolution, and cost as summarized in
\autoref{tab:trace_tool_overview}.

If a measurement tool uses resources of the target system, it causes
interference by using computational power and memory that could otherwise be
utilized by the application.  A tool that causes interference is said to be
intrusive and may cause degradation, a reduction in performance of the target
system \cite[p. 29]{ferrari1978computer}.  Consequently, intrusive trace tools
change the real-time behavior of an application.

An event can be represented on different levels.  A voltage level change in
memory can map to a variable which can map to the state of a task as
visualized in \autoref{fig:trace_event_levels}.  Those levels are called
hardware level, software level, and system level.  To clarify the level of a
trace, it can be mentioned explicitly. For instance, a trace consisting of
hardware level events is a hardware level trace \cite[p.  29f]{felixproject2}.
Tools that can detect hardware events occurring at a microscopic level are
said to have a higher resolution than tools that can detect software events
only.

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/trace_event_levels.pdf}
 \caption[Measurement levels]{A measurement event can be interpreted on
 different levels.  A voltage change in memory can be detected by a hardware
 trace tool capable of supervising the memory bus that triggers the voltage
 change.  The memory section can relate to a variable that changes as a
 consequence of the voltage change, which is a software event.  If the variable
 is related to the state of a task, a change of the variable also means a
 change of the task state which is then called a system event.}
 \label{fig:trace_event_levels}
\end{figure}

Different trace techniques can detect and record events with different
frequencies.  The maximum frequency is usually not limited by the speed with
which events can be detected, but by the available bandwidth to process and
record the detected events.

The cost of different trace tools depends on several factors: the price for
hardware and software licenses, the price for installing and maintaining the
tool, educational costs such as training for the users of a tool, and the costs
of operating the tool.

\textbf{Software tools} add instructions to a hardware-software system in order
to detect and record events of interest.  Added instructions are called
instrumentation.  The simplest kind of instrumentation is a classical write to
the standard output interface, e.g.\ a \lstinline{printf} statement in the C
programming language.  Instructions may be added to the application code
directly, via the compiler, or after compilation via dynamic binary
instrumentation \cite{trumper2012maintenance}\cite{felixarc2015}.  If no
standard output interface is available, events are recorded into memory on
the target.  From there, they can be read out via a debugger or a serial
interface.  Instrumentation always interferes with the application.  There are
two components of interference: a space and a time component \cite[p.
44]{ferrari1978computer}.  Execution of instrumentation code takes time and
storing detected events uses memory space.  Software tools have a low
resolution because they cannot detect events on a hardware level.  Event
detection frequency is limited by the available computational resources.  On
the upside they are usually cheap and easy to implement and use.
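
The two components of interference can be made visible in a small sketch.
Python stands in for the target language here; \lstinline{trace_point},
\lstinline{control_step}, and the buffer layout are hypothetical names chosen
for illustration.

\begin{lstlisting}[language=Python]
import time

TRACE_BUFFER = []  # space component: recorded events occupy memory


def trace_point(event_id):
    """Instrumentation: record an event identifier with a timestamp."""
    # Time component: executing this code takes time on the target.
    TRACE_BUFFER.append((time.perf_counter_ns(), event_id))


def control_step(x):
    """Hypothetical application function with added instrumentation."""
    trace_point("control_step:enter")
    result = 2 * x + 1  # the actual application code
    trace_point("control_step:exit")
    return result


control_step(3)
print([event_id for _, event_id in TRACE_BUFFER])
\end{lstlisting}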

\textbf{Hardware tools} do not rely on instrumentation, which means that they
are non-intrusive and do not interfere with the application
\cite{felixarc2014}.  Hardware tracing works via a dedicated trace device
that is located on the silicon of the CPU\@.  Trace devices provide a very high
resolution since they are capable of detecting events at hardware level
\cite{mink1989performance}.  Additionally, the event detection frequency can be
as high as the actual system frequency; thus, it is possible to record a
complete hardware-software system in real time.  Hardware tools are more
expensive than software solutions.  Installation and maintenance are
more complex and require properly qualified users.

\textbf{Hybrid tools} rely on instrumentation and  a dedicated hardware
interface to record events.  The boundary between software, hybrid, and
hardware tools can be fuzzy in certain cases.  Software tools need some kind of
hardware interface to send recorded traces off-chip.  In this sense, all
software tools are hybrid tools.  However, industry hybrid solutions often
require proprietary target interfaces which justifies why these tools fit into
a separate category \cite{richterganzheitliche}.  Compared to pure software
tools, hybrid tools interfere with the system to a lesser extent
\cite{nacht1989hardware}.  A dedicated hardware interface allows events to be
sent off-chip in real time.  Consequently, more memory becomes available on
the target.

As shown in \autoref{tab:trace_tool_overview}, hardware trace tools have many
advantages over hybrid and software-based solutions.  Hardware tracing does not
interfere with the system, which is especially important for real-time systems.
Hardware trace tools are capable of detecting events with a higher resolution
and frequency.  Additionally, the trace duration of software and hybrid traces
is limited by the available memory on the target and by the trace interface
bandwidth.  When the same quantity can be measured by a hardware and a software
tool, the values obtained by the hardware tool are usually to be considered
more accurate because of the lower interference \cite[p.
45]{ferrari1978computer}.

\begin{table}[]
  \centering
  \begin{tabular}{r|c c c}
                 & Software & Hybrid & Hardware \\
    \hline
    Interference & high     & low    & none \\
    Resolution   & low      & low    & high \\
    Cost         & low      & low    & high \\
    Frequency    & low      & low    & high \\
  \end{tabular}
  \caption[Trace techniques]{Properties of different trace
  measurement tools \cite[p.  6]{felixproject1}.  Hardware tools are superior
  to software and hybrid tools but come with higher expenses.}
  \label{tab:trace_tool_overview}
\end{table}

\section{Hardware Tracing}
\label{subsection:hardware_tracing}

Hardware tracing is capable of recording events on hardware level.  A dedicated
on-chip trace device and a trace interface are required to record hardware
events and send them off-chip \cite{mink1990multiprocessor}.  Target access
hardware is connected to the trace interface to read out the trace measurement
results.  From there, the events are forwarded to a host computer for further
processing.
Software that runs on the host computer in order to analyze the recorded trace
data is provided by the target access hardware vendor \cite{winidea}.  The term
host software is used to refer to such applications.

The on-chip trace device is designed to record hardware events that occur in
the microcontroller.  It occupies a separate section on the silicon.  Usually, a
controller is delivered in two versions, one with and one without a trace
device.  In production, the ability to execute trace measurements is not
required \cite{felixarc2014}.  There, the trace device would only increase chip
costs without providing any benefits.

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/tc27_emulation_device.png}
 \caption[Infineon TC27x trace device]{A microcontroller with hardware trace
 support consists of two sections.  A regular product chip part and the trace
 device part.  The trace device part can be omitted in the production version
 of a chip to save costs \cite{tc27block}.}
 \label{fig:tc27_emulation_device}
\end{figure}

\autoref{fig:tc27_emulation_device} shows the trace device of the Infineon
TC27x microcontroller family \cite{tc27x}.  The upper part belongs to the
product chip while the lower part displays the trace device.  The trace device
can gather data from the product part via two interfaces.  \glspl{pob}
(\glsdesc{pob}) record processor events while \glspl{bob} record bus events.
All events are collected, enhanced with a timestamp and buffered in the on-chip
trace memory.  From there they are sent off-chip via the dedicated trace
interface.


\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/timestamp_generation_event.pdf}
 \caption[Timestamp per event]{Each trace event is assigned a timestamp
 relative to the previous event.  By summing up the relative timestamps
 absolute values can be generated.}
 \label{fig:timestamp_generation_event}
\end{figure}

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/timestamp_generation_dedicated.pdf}
 \caption[Dedicated timestamp generation]{Via dedicated timestamp events, the
 timestamps of the other events can be interpolated.  In this example two
 events are recorded between the previous and the next timestamp event.  Both
 events therefore get the same timestamp, derived from these two timestamp
 events.  The value
 is calculated via \autoref{eq:timestamp_interpolation} as $t_i = 5 +
 \frac{(15-5)}{2}=10$.}
 \label{fig:timestamp_generation_dedicated}
\end{figure}


\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/timestamp_generation_io.pdf}
 \caption[Timestamp via \gls{io}]{Dedicated \gls{io} pins can be used to output
 a timestamp value whenever a measurement event is sent off-chip.}
 \label{fig:timestamp_generation_io}
\end{figure}

There exist different techniques to add timestamp information to a trace event.
The obvious way is shown in \autoref{fig:timestamp_generation_event}.  A
timestamp is added to each trace event that is sent off-chip.  To save
bandwidth, timestamps are provided relative to the previous event.  An
absolute value is computed by summing up the relative timestamps of the current
and all previous events.
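
Written as a formula, with $\Delta t_j$ denoting the relative timestamp of the
$j$-th event and $T_k$ the absolute timestamp of the $k$-th event (both symbols
are introduced here for illustration, and the trace is assumed to start at time
zero):

\[
  T_k = \sum_{j=1}^{k} \Delta t_j .
\]

For example, hypothetical relative timestamps of $3$, $4$, and $2$ cycles yield
absolute timestamps of $3$, $7$, and $9$ cycles.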

Another way is to send dedicated timestamp messages as shown in
\autoref{fig:timestamp_generation_dedicated}.  The timestamps for the actual
trace events are then interpolated, e.g., via the equation

\begin{equation}
\label{eq:timestamp_interpolation}
  t_{i} = t_p + \frac{(t_n - t_p)}{2},
\end{equation}

where $t_p$ is the previous timestamp (the latest timestamp before the event),
$t_n$ the next timestamp (the soonest timestamp after the event) and $t_i$ the
timestamp interpolated based on the dedicated timestamp events.
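
A host-side implementation of \autoref{eq:timestamp_interpolation} could look
like the following sketch.  The message format, a list of tuples tagged
\lstinline{"ts"} or \lstinline{"ev"}, is an assumption made for this example
and does not correspond to a specific trace protocol.

\begin{lstlisting}[language=Python]
def interpolate_timestamps(messages):
    """Assign interpolated timestamps to trace events.

    messages: ("ts", value) or ("ev", name) tuples in arrival order.
    Events before the first or after the last timestamp message are
    dropped because they cannot be interpolated.
    """
    result = []
    pending = []   # events seen since the previous timestamp message
    t_prev = None
    for kind, payload in messages:
        if kind == "ts":
            if t_prev is not None:
                t_i = t_prev + (payload - t_prev) / 2
                result.extend((name, t_i) for name in pending)
            pending = []
            t_prev = payload
        else:
            pending.append(payload)
    return result


stream = [("ts", 5), ("ev", "A"), ("ev", "B"), ("ts", 15)]
print(interpolate_timestamps(stream))  # [('A', 10.0), ('B', 10.0)]
\end{lstlisting}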

Finally, timestamps can also be created via dedicated \gls{io} pins as
specified by the Nexus \cite{turley2004nexus} standard.  This means that
whenever a trace event is sent off-chip via the trace interface, the current
timestamp is provided via the \gls{io} pins as shown in
\autoref{fig:timestamp_generation_io}.

Cycle-accurate timestamps are feasible with all timestamp generation
techniques.  However, timestamp accuracy and resolution are only partly
dependent on the generation technique.  More important factors are CPU and
trace device clock frequency, as well as the design of CPU and trace device.
For cycle-accurate timestamps, the trace device frequency must be greater than
or equal to the CPU frequency.  Even if this is the case, cycle-accurate
time\-stamps cannot necessarily be guaranteed.

For example, superscalar processors like the Infineon TC277 \cite{tc27x} are
capable of executing more than one instruction per cycle.  However, only one
event can be processed per cycle by the trace device as shown in
\autoref{fig:timestamp_cycle}.  The processor observation block filters the
instructions according to user specified filter rules and forwards them for
further processing.  If two instructions, executed during the same processor
cycle, match the filter and are thus forwarded to the trace device, one of
those instructions is delayed by one cycle (in this example Instruction 2.1).
For a processor running at \unit[100]{MHz} this would set the timestamp off by
\unit[10]{ns} for this particular event.
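
The offset follows directly from the cycle period.  With $f_{\mathrm{CPU}}$
denoting the processor clock frequency (notation introduced here), a delay of
one cycle corresponds to

\[
  \Delta t = \frac{1}{f_{\mathrm{CPU}}}
           = \frac{1}{100 \cdot 10^{6}\,\mathrm{Hz}}
           = 10\,\mathrm{ns}.
\]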

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/timestamp_cycle.pdf}
 \caption[Timestamp generation accuracy]{Even if the trace device runs at CPU
 clock frequency, cycle accurate timestamps cannot be guaranteed.}
 \label{fig:timestamp_cycle}
\end{figure}

The design of trace devices differs depending on the processor family and the
processor vendor.  However, the general concept and provided functionality are
the same for all devices.  Various standards for the implementation of
trace devices are specified and used by chip vendors.  Three common standards
are Nexus used by PowerPC processors \cite{turley2004nexus}, \gls{etm}
(\glsdesc{etm}) used by ARM processors \cite[p.  476]{yiu2013definitive}, and
the \glsdesc{imds} \cite{stollon2011infineon} discussed here and shown in
\autoref{fig:tc27_emulation_device}.

According to \autoref{fig:concept_measurement}, a measurement process starts
with the detection of an event by a sensor.  In the case of the trace process,
the sensors are the \glspl{pob} and \glspl{bob}.  Each \gls{pob} monitors the
instructions executed by one processor core.  This means the complete program
flow executed by a processor core can be recorded.  \glspl{bob} are connected
to the data buses of the microcontroller and can detect memory access events.
A memory access event may be, for example, writing to a variable or reading
from a special function register.  A typical data trace event contains, in
addition to the timestamp, details like address, data value, transfer size, and
whether a read or write access occurred \cite{hopkins2006debug}.

Filters can be specified by the user to reduce the amount of recorded trace
events.  They can be set for an address or for an address range.  Different
actions can be taken if an address filter matches: the corresponding event
can be recorded or discarded, or another event can be triggered.  For example, it
is possible to start or stop the trace process if a specific function is
accessed or a variable is written.  Filter configuration is done via the host
software.
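
Conceptually, such a filter is a list of address ranges with an associated
action.  The following sketch illustrates the idea; the addresses, the action
names, and the default of discarding unmatched events are assumptions for this
example and not the configuration format of any particular host software.

\begin{lstlisting}[language=Python]
# Hypothetical filter rules: (first address, last address, action).
FILTER_RULES = [
    (0x70000000, 0x700000FF, "record"),
    (0x70000100, 0x70000FFF, "discard"),
    (0x80001000, 0x80001000, "stop_trace"),  # single-address filter
]


def apply_filters(address, default="discard"):
    """Return the action of the first rule whose range contains address."""
    for first, last, action in FILTER_RULES:
        if first <= address <= last:
            return action
    return default  # assumption: unmatched events are discarded


print(apply_filters(0x70000010))  # record
print(apply_filters(0x80001000))  # stop_trace
\end{lstlisting}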

Corresponding to the two main hardware event types, instruction and data
access events, two hardware trace techniques can be distinguished: program flow
trace and data trace \cite{felixarc2014}.  The two trace techniques can be
executed in parallel or individually as configured by the user.

A \textbf{program flow trace} (also called function trace) shows the complete
execution path of an application for the duration of the trace recording.  This
means it is possible to detect when a certain function is called or which
branch of an if statement is executed.  The number of instructions executed by
a modern CPU and the resulting data stream bandwidth are too large to be
transmitted via the trace interface.  To solve this problem, trace devices use
trace compression.  The most commonly used program flow trace compression
technique works by detecting and recording only those instructions that cause a
change in program flow, such as conditional jumps and traps
\cite{hopkins2006debug}.  Using the application binary, the host software is
able to reconstruct the complete program flow.
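
The reconstruction step can be illustrated with a toy example.  The sketch
assumes a made-up instruction set with fixed 4-byte instructions and a trace
that contains only the outcome (taken or not taken) of each conditional branch;
real trace protocols use more elaborate message formats.

\begin{lstlisting}[language=Python]
# Hypothetical toy "binary": address -> (mnemonic, branch target or None).
PROGRAM = {
    0x00: ("load", None),
    0x04: ("cmp", None),
    0x08: ("branch_if", 0x14),  # conditional jump
    0x0C: ("add", None),
    0x10: ("store", None),
    0x14: ("ret", None),
}


def reconstruct_flow(program, start, branch_outcomes):
    """Rebuild the executed addresses from recorded branch outcomes."""
    outcomes = iter(branch_outcomes)
    path = []
    pc = start
    while pc in program:
        mnemonic, target = program[pc]
        path.append(pc)
        if mnemonic == "ret":
            break
        if target is not None and next(outcomes):
            pc = target  # branch taken: continue at the target
        else:
            pc += 4      # sequential execution
    return path


print([hex(a) for a in reconstruct_flow(PROGRAM, 0x00, [True])])
# ['0x0', '0x4', '0x8', '0x14']
print([hex(a) for a in reconstruct_flow(PROGRAM, 0x00, [False])])
# ['0x0', '0x4', '0x8', '0xc', '0x10', '0x14']
\end{lstlisting}

In this toy example, a single recorded bit is enough to recover a path of up to
six instructions, which is why this kind of compression is effective.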

A \textbf{data trace} is a sequence of data access events.  Data tracing makes
it possible to supervise and debug the state of variables in memory.  Data
tracing of all active units is becoming increasingly important because not all
data interactions involve a processor \cite{mayer2003debug}.  Thus, trace
devices must also be able to detect memory accesses via \gls{dma}
(\glsdesc{dma}) and accesses to memory of special on-chip modules like FlexRay
or Ethernet.  The units that are supported depend on the trace device,
but all trace devices support tracing the main memory of a controller.
Compression is also applied to data traces.  However, those techniques are
usually not sufficient to record a complete data trace of significant length
since the amount of generated data is too large.  The best way to solve this
problem is to apply filters to avoid detecting and recording data events in
memory sections that are not of interest \cite{hopkins2006debug}.

A recorded hardware trace event is buffered into an on-chip trace memory.  From
there the events can be read via the trace interface.  On-chip trace memories
can be operated in different modes \cite{felixarc2014}.  In continuous mode,
the trace data is streamed off-chip in real time.  This technique is limited by
the bandwidth of the trace interface.  If it is high enough, the trace duration
depends only on the available memory on the host computer and traces of
arbitrary length can be recorded.  If the bandwidth is too small to transmit the
recorded trace stream, \emph{buffer mode} must be used.  This means the recorded
trace is written into trace memory and read out by the target access hardware
after tracing.  Buffer mode can be used in pre- and post-trigger mode.  In
pre-trigger mode the trace buffer is filled like a circular buffer.  The oldest
events are discarded for new events.  The trace process can be stopped at an
arbitrary point in time and the latest trace events become available.  In
post-trigger mode the trace process is stopped as soon as the buffer has been
filled for the first time.
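
The two buffer modes, as described above, can be sketched as follows.  The
function names and the event stream are hypothetical; the sketch only models
which events remain in the trace memory, not the hardware itself.

\begin{lstlisting}[language=Python]
from collections import deque


def record_pre_trigger(events, capacity):
    """Circular buffer: the oldest events are discarded for new ones."""
    buffer = deque(maxlen=capacity)
    for event in events:
        buffer.append(event)
    return list(buffer)


def record_post_trigger(events, capacity):
    """Recording stops as soon as the buffer has been filled once."""
    buffer = []
    for event in events:
        if len(buffer) == capacity:
            break
        buffer.append(event)
    return buffer


events = list(range(10))  # hypothetical event stream 0..9
print(record_pre_trigger(events, 4))   # [6, 7, 8, 9]
print(record_post_trigger(events, 4))  # [0, 1, 2, 3]
\end{lstlisting}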

A trace device operated in buffer mode is limited by the available trace
memory.  The trace memory size of an Infineon TC275 microcontroller
(\autoref{fig:workbench} a) is \unit[2]{MB}, which allows for approximately
\unit[33]{ms} of unfiltered function and data trace of a single processor core
running at \unit[200]{MHz} \cite{felixarc2014}.  Depending on the measurement
use case, this may or may not be sufficient.  If the trace duration must be
increased, tracing in continuous mode is mandatory.  Continuous tracing requires
a high-bandwidth interface such as \gls{agbt} (\glsdesc{agbt}).
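
The figures quoted above imply an average trace data rate.  As a rough
estimate, assuming $1\,\mathrm{MB} = 10^{6}\,\mathrm{B}$ and writing $M$ for the
trace memory size, $T$ for the trace duration, and $R$ for the average rate at
which trace data is generated (symbols introduced here for illustration):

\[
  R = \frac{M}{T}
    \approx \frac{2 \cdot 10^{6}\,\mathrm{B}}{33 \cdot 10^{-3}\,\mathrm{s}}
    \approx 61\,\mathrm{MB/s}.
\]

At \unit[200]{MHz}, a duration of \unit[33]{ms} corresponds to
$200 \cdot 10^{6}\,\mathrm{Hz} \cdot 33 \cdot 10^{-3}\,\mathrm{s}
= 6.6 \cdot 10^{6}$ cycles, i.e.\ roughly $0.3$ bytes of trace data per cycle
on average.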

\section{Hardware Trace Toolchain}

Multiple steps are required from recording a hardware trace on target to
presenting it to the user on a personal computer as shown in
\autoref{fig:toolchain}.  Many different solutions exist for each of those
steps.  Nevertheless, the basic functionality provided by all solutions is
comparable.

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/toolchain.pdf}
 \caption[Trace toolchain]{Recording a hardware trace and making it
 available to the user requires multiple steps.  Hardware events must be
 measured on target via a trace device.  Using a trace interface the recorded
 data can be read out by the target access hardware and transmitted to a host
 computer.  Target access hardware vendors provide special software to analyze
 and visualize the recorded trace.}
 \label{fig:toolchain}
\end{figure}

The basic prerequisite for executing a hardware trace is the availability of an
on-chip trace device.  All major chip vendors provide trace devices for their
microcontrollers that support program flow and data trace.
\autoref{tab:trace_devices} gives an overview of the state-of-the-art trace
solutions.

\begin{table}[]
  \centering
  \begin{tabular}{r|c c c}
    Standard    & Architecture & Function Trace & Data Trace\\
    \hline
    Nexus                       &
    PowerPC                     &
    \begin{tabular}[x]{@{}c@{}}  Branch Trace \\ Messaging \end{tabular}  &
    \begin{tabular}[x]{@{}c@{}}  Data Trace \\ Messaging \end{tabular}    \\
    \hline
    \gls{etm}                   &
    ARM                         &
    \begin{tabular}[x]{@{}c@{}}Program Trace \\ Macrocell   \end{tabular} &
    \begin{tabular}[x]{@{}c@{}}Embedded Trace \\ Macrocell  \end{tabular} \\
    \hline
    \gls{imds}                  &
    TriCore                     &
    \begin{tabular}[x]{@{}c@{}}Processor \\ Observation Block \end{tabular} &
    \begin{tabular}[x]{@{}c@{}}Bus \\ Observation Block       \end{tabular} \\
  \end{tabular}
  \caption[Trace devices for different architectures]{Trace devices exist for
  different CPU architectures.  All solutions provide methods for recording
  program flow and data traces.}
  \label{tab:trace_devices}
\end{table}

Events that have been recorded by the trace device are sent off-chip via a
dedicated trace interface.  If the bandwidth provided by an interface is lower
than the rate at which trace data is created, continuous tracing is not
possible.  However, this use case is often required.  There are two ways to
solve this problem: the amount of created trace data can be reduced using
filters, or the available bandwidth can be increased.  If an entire application
must be analyzed as a whole, the first way is not an option.

\begin{table}[]
  \centering
  \begin{tabular}{r|l c}
    Interface  & Pros/Cons & DAQ rate {\small [MB/s]}\\
    \hline
    JTAG           &
    \begin{tabular}[x]{@{}l@{}}
      $+$ Reuse of existing interface \\
      $+$ Small chip area \\
      $-$ Low bandwidth \\
      \vspace{1mm}
    \end{tabular}  &
    1.2            \\
    DAP2/SWD   &
    \begin{tabular}[x]{@{}l@{}}
      $+$ High bandwidth with few pins \\
      $+$ Small silicon area \\
      $-$ Proprietary \\
      \vspace{1mm}
    \end{tabular}  &
    10   \\
    \gls{agbt} &
    \begin{tabular}[x]{@{}l@{}}
      $+$ Very high bandwidth with few pins \\
      $-$ Large silicon area \\
      $-$ High cost \\
      \vspace{1mm}
    \end{tabular}  &
    30   \\
    CAN        &
    \begin{tabular}[x]{@{}l@{}}
      $+$ Robust and well known standard \\
      $+$ Low cost \\
      $-$ Very low bandwidth \\
    \end{tabular}  &
    0.05 \\
  \end{tabular}
  \caption[Trace interfaces]{Commonly used trace interfaces and their \gls{daq}
  (\glsdesc{daq}) rates. \gls{agbt} (\glsdesc{agbt}) is the only interface
  capable of recording continuous hardware traces of a complete system.}
  \label{tab:interfaces}
\end{table}

Mayer et al.\ \cite{interfaces} give an overview of trace interfaces used in
the automotive industry as shown in \autoref{tab:interfaces}.  \gls{jtag}
(\glsdesc{jtag}) is a common debug standard \cite{ieee5001}, suitable for
regular debugging.  It can be used to read out a buffered trace after tracing,
but for continuous tracing it is not sufficient due to its low bandwidth of
\unit[1.2]{MB/s}.  Because of that, DAP and DAP2 were developed by Infineon and
SWD by ARM\@.  These protocols are based on \gls{jtag} but use a higher
frequency and improved communication protocols to provide more bandwidth.

\gls{agbt} is currently the fastest trace interface.  It was specified by
Xilinx and adopted by the Nexus standard.  \gls{agbt} is the only interface
that is theoretically capable of recording a continuous trace of a complete
application running on a processor with a frequency of \unit[200]{MHz}.  CAN is
used by some hybrid trace tools but is only mentioned for completeness since
its bandwidth is too low to be considered for hardware tracing.
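
Whether an interface is suitable for continuous tracing reduces to comparing
its \gls{daq} rate with the rate at which trace data is created.  The sketch
below uses the rates from \autoref{tab:interfaces}; the trace data rates passed
to the function are hypothetical.

\begin{lstlisting}[language=Python]
# DAQ rates in MB/s as listed in the interface table.
DAQ_RATE_MB_S = {"JTAG": 1.2, "DAP2/SWD": 10, "AGBT": 30, "CAN": 0.05}


def interfaces_for_continuous_trace(trace_rate_mb_s):
    """Interfaces whose DAQ rate is at least the created trace data rate."""
    return [name for name, rate in DAQ_RATE_MB_S.items()
            if rate >= trace_rate_mb_s]


# Hypothetical trace data rates after filtering, in MB/s.
print(interfaces_for_continuous_trace(0.5))  # ['JTAG', 'DAP2/SWD', 'AGBT']
print(interfaces_for_continuous_trace(20))   # ['AGBT']
\end{lstlisting}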

\begin{figure}[]
 \centering
 \includegraphics[width=\textwidth]{./media/trace/workbench.png}
 \caption[Trace workbench]{A complete trace workbench.  An Infineon TriCore
 evaluation board (a) can be traced by the iSYSTEM iC6000 (b) or the Lauterbach
 PowerTrace-II (e) via the high-speed \gls{agbt} interface.  Host software is
 used to control the hardware and to analyze the recorded trace, for example
 WinIDEA (c) by iSYSTEM and TRACE32 (d) by Lauterbach \cite{maxmaster}.}
 \label{fig:workbench}
\end{figure}

Target access hardware is connected to the hardware interface to read out
recorded trace events.  From the target access hardware, the data is transmitted
to a host computer for further analysis via USB 3.0 or Ethernet.  Examples of
target access hardware are the iC6000 by iSYSTEM \cite{ic6000}
(\autoref{fig:workbench} b) and the PowerTrace-II by Lauterbach
\cite{powertrace2} (\autoref{fig:workbench} e).  Both devices support
different architectures and trace interfaces by using architecture-specific
debug cables.  Besides reading hardware traces, those devices also support all
functionality provided by a regular debugger such as stepwise debugging,
reading of memory content, and manipulation of CPU configuration registers.

Dedicated software on the host computer is used to configure and control the
target access hardware and the trace device itself.  After recording, this
software transforms the recorded hardware trace into a software trace (see
\autoref{fig:trace_event_levels}).  For this process the host software must
have access to the \gls{elf} file of an application.  This is required to map
the addresses of hardware trace events to the corresponding software entities.
Based on the software trace, different analysis techniques such as metric
evaluation, performance analysis, and code coverage are supported.  Gantt
charts are provided to examine the trace visually.  Via export functions a
software level program flow and data trace can be made available for external
tools.  \autoref{fig:workbench} shows the toolchain described in this section.
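
The mapping from addresses to software entities can be sketched as a lookup in
a sorted symbol table.  The table below is hypothetical; a real host tool would
extract the symbols from the \gls{elf} file of the application.

\begin{lstlisting}[language=Python]
import bisect

# Hypothetical symbol table: (start address, size in bytes, symbol name),
# sorted by start address.
SYMBOLS = [
    (0x70000000, 4, "task_a_state"),
    (0x70000004, 4, "task_b_state"),
    (0x80000100, 64, "control_step"),
]
START_ADDRESSES = [start for start, _, _ in SYMBOLS]


def resolve(address):
    """Map a traced address to the symbol that contains it, if any."""
    index = bisect.bisect_right(START_ADDRESSES, address) - 1
    if index < 0:
        return None
    start, size, name = SYMBOLS[index]
    return name if address < start + size else None


print(resolve(0x70000004))  # task_b_state
print(resolve(0x80000120))  # control_step
print(resolve(0x90000000))  # None
\end{lstlisting}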