\chapter{Hardware Trace Measurement}
\label{section:trace_measurement}

Computer systems can be analyzed with measurement tools that detect events, i.e.\ changes in the state of a system \cite[p. 28]{ferrari1978computer}. The same event can be interpreted on different levels, as shown in \autoref{fig:trace_event_levels}. A hardware trace tool can detect a voltage change in memory, e.g.\ one triggered by the processor, which is a hardware event. Accordingly, the variable that maps to the changed memory location changes too, which is a software event. If this variable is related to the state of a task, a change of the variable also means a change of the task state, which is then called a system event.

In many cases, the event of interest cannot be measured directly. One or more transformation steps are required to obtain the desired result. If a transformation process is executed, the measurement is said to be indirect \cite[p. 28]{ferrari1978computer}. In the previous example, a task termination event cannot be measured directly. However, a variable that contains the current task state can be measured. If the task corresponding to the variable and the mapping from value to task state are known, a change of the variable can be transformed into a higher-level event: the termination of a task. After the transformation process, the measurement results can be displayed to the user, as shown in \autoref{fig:concept_measurement}.

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/concept_measurement.pdf}
\caption[Measurement process]{The conceptual parts of a measurement process according to Ferrari \cite{ferrari1978computer}. A sensor measures data. One or more transformation steps are required if the data is not yet in the desired format. Finally, the result can be presented to the user.}
\label{fig:concept_measurement}
\end{figure}

During the transformation step, the collected data may be manipulated, which is called prereduction.
Prereduction may, for example, be used when the individual events are not required, but rather the number of events of a certain type that occurred. In this case, the transformer increments a counter whenever an event of that type is collected. If no prereduction is executed, the measurement process is called tracing. Tracing is the process of recording a sequence of events in chronological order of occurrence \cite[p. 30]{ferrari1978computer}. The result of this process is called a trace.

\section{Trace Tools}

Ferrari \cite[p. 31ff]{ferrari1978computer} distinguishes three trace measurement tools: software, hybrid, and hardware tools. All tools are meant to examine the behavior of a system. However, there are differences in interference, resolution, and cost, as summarized in \autoref{tab:trace_tool_overview}.

If a measurement tool uses resources of the target system, it causes interference by using computational power and memory that could otherwise be utilized by the application. A tool that causes interference is said to be intrusive and may cause degradation, a reduction in performance of the target system \cite[p. 29]{ferrari1978computer}. Consequently, intrusive trace tools change the real-time behavior of an application.

An event can be represented on different levels. A voltage level change in memory can map to a variable, which can map to the state of a task, as visualized in \autoref{fig:trace_event_levels}. Those levels are called hardware level, software level, and system level. To clarify the level of a trace, it can be mentioned explicitly. For instance, a trace consisting of hardware level events is a hardware level trace \cite[p. 29f]{felixproject2}. Tools that can detect hardware events occurring at a microscopic level are said to have a higher resolution than tools that can detect software events only.
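The transformation of a hardware level event into a system level event, and the difference between tracing and prereduction, can be sketched in C as follows. This is a minimal illustration only: the address \lstinline{TASK_STATE_ADDR}, the numeric encoding of the task states, and all type and function names are assumptions made for this example and do not stem from a particular operating system or trace tool.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Hardware level event: a write access observed at a memory address. */
typedef struct {
    uint32_t address;
    uint32_t value;
} hw_event_t;

/* System level task states (hypothetical encoding). */
typedef enum {
    TASK_SUSPENDED = 0,
    TASK_READY     = 1,
    TASK_RUNNING   = 2
} task_state_t;

/* Hypothetical address of the variable that holds the state of one task. */
#define TASK_STATE_ADDR 0x70000100u

/* Transformation step: returns 1 and fills *state if the hardware event
 * is a write of a valid state to the task state variable, 0 otherwise. */
int to_task_state(const hw_event_t *ev, task_state_t *state)
{
    if (ev->address != TASK_STATE_ADDR || ev->value > 2u) {
        return 0;
    }
    *state = (task_state_t)ev->value;
    return 1;
}

/* Prereduction: instead of keeping every event, only count how often
 * the task state variable was set to TASK_SUSPENDED (task termination). */
size_t count_terminations(const hw_event_t *events, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        task_state_t s;
        if (to_task_state(&events[i], &s) && s == TASK_SUSPENDED) {
            ++count;
        }
    }
    return count;
}
\end{lstlisting}

Keeping the array of events in chronological order corresponds to a trace, whereas \lstinline{count_terminations} discards the individual events and retains only the counter.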
\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/trace_event_levels.pdf}
\caption[Measurement levels]{A measurement event can be interpreted on different levels. A voltage change in memory can be detected by a hardware trace tool capable of supervising the memory bus that triggers the voltage change. The memory section can relate to a variable that changes as a consequence of the voltage change, which is a software event. If the variable is related to the state of a task, a change of the variable also means a change of the task state, which is then called a system event.}
\label{fig:trace_event_levels}
\end{figure}

Different trace techniques can detect and record events with different frequencies. The maximum frequency is usually not limited by the speed with which events can be detected, but by the available bandwidth to process and record the detected events.

The cost of different trace tools depends on several factors: the price for hardware and software licenses, the price for installing and maintaining the tool, educational costs such as training for the users of a tool, and the costs of operating the tool.

\textbf{Software tools} add instructions to a hardware-software system in order to detect and record events of interest. Added instructions are called instrumentation. The simplest kind of instrumentation is a classical write to the standard output interface, e.g.\ a \lstinline{printf} statement in the C programming language. Instructions may be added to the application code directly, via the compiler, or post compilation via dynamic binary instrumentation \cite{trumper2012maintenance}\cite{felixarc2015}. If no standard output interface is available, events are recorded into memory on the target. From there they can be read out via a debugger or a serial interface. Instrumentation always interferes with the application. There are two components of interference, a space component and a time component \cite[p. 44]{ferrari1978computer}.
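A minimal sketch of such instrumentation, recording events into memory on the target, is given below. It is an illustration under stated assumptions rather than the implementation of any particular tool: the buffer size, the event identifiers, and the function \lstinline{read_timer} (which merely simulates a hardware timer here) are hypothetical.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Space component: the buffer occupies RAM on the target
 * (typically 256 entries * 8 bytes with this layout). */
#define TRACE_CAPACITY 256u

typedef struct {
    uint32_t timestamp;
    uint32_t event_id;
} trace_entry_t;

static trace_entry_t trace_buffer[TRACE_CAPACITY];
static size_t trace_count = 0;

/* Hypothetical time source. A real target would read a hardware timer;
 * here a simple counter is used so that the sketch is self-contained. */
static uint32_t fake_ticks = 0;
static uint32_t read_timer(void)
{
    return fake_ticks++;
}

/* Time component: every call executes additional instructions.
 * Returns 1 if the event was stored, 0 if the buffer is full. */
int trace_event(uint32_t event_id)
{
    if (trace_count >= TRACE_CAPACITY) {
        return 0; /* buffer full, the event is lost */
    }
    trace_buffer[trace_count].timestamp = read_timer();
    trace_buffer[trace_count].event_id = event_id;
    ++trace_count;
    return 1;
}

/* Instrumented application function (event ids 1 and 2 are arbitrary). */
void control_step(void)
{
    trace_event(1u); /* function entry */
    /* ... application code ... */
    trace_event(2u); /* function exit */
}
\end{lstlisting}
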
Execution of instrumentation code takes time, and storing detected events uses memory space. Software tools have a low resolution because they cannot detect events on a hardware level. Event detection frequency is limited by the available computational resources. On the upside, they are usually cheap and easy to implement and use.

\textbf{Hardware tools} do not rely on instrumentation, which means that they are non-intrusive and do not interfere with the application \cite{felixarc2014}. Hardware tracing works via a dedicated trace device that is located on the silicon of the CPU\@. Trace devices provide a very high resolution since they are capable of detecting events at hardware level \cite{mink1989performance}. Additionally, the event detection frequency can be as high as the actual system frequency, thus it is possible to record a complete hardware-software system in real-time. Hardware tools are more expensive than software solutions. Installation and maintenance are more complex and require properly qualified users.

\textbf{Hybrid tools} rely on instrumentation and a dedicated hardware interface to record events. The boundary between software, hybrid, and hardware tools can be fuzzy in certain cases. Software tools need some kind of hardware interface to send recorded traces off-chip. In this sense, all software tools are hybrid tools. However, industry hybrid solutions often require proprietary target interfaces, which justifies placing these tools in a separate category \cite{richterganzheitliche}. Compared to pure software tools, hybrid tools interfere with the system to a lesser extent \cite{nacht1989hardware}. A dedicated hardware interface allows events to be sent off-chip in real-time. Consequently, more memory becomes available on the target.

As shown in \autoref{tab:trace_tool_overview}, hardware trace tools have many advantages over hybrid and software-based solutions.
Hardware tracing does not interfere with the system, which is especially important for real-time systems. Hardware trace tools are capable of detecting events with a higher resolution and frequency. Additionally, the trace duration of software and hybrid traces is limited by the available memory on the target and by the trace interface bandwidth. When the same quantity can be measured by a hardware and a software tool, the values obtained by the hardware tool are usually to be considered more accurate because of the lower interference \cite[p. 45]{ferrari1978computer}.

\begin{table}[]
\centering
\begin{tabular}{r|c c c}
 & Software & Hybrid & Hardware \\ \hline
Interference & high & low & none \\
Resolution & low & low & high \\
Cost & low & low & high \\
Frequency & low & low & high \\
\end{tabular}
\caption[Trace techniques]{Properties of different trace measurement tools \cite[p. 6]{felixproject1}. Hardware tools are superior to software and hybrid tools but come with higher expenses.}
\label{tab:trace_tool_overview}
\end{table}

\section{Hardware Tracing}
\label{subsection:hardware_tracing}

Hardware tracing is capable of recording events at hardware level. A dedicated on-chip trace device and a trace interface are required to record hardware events and send them off-chip \cite{mink1990multiprocessor}. Target access hardware is connected to the trace interface to read out the trace measurement results. From there the events are forwarded to a host computer for further processing. Software that runs on the host computer in order to analyze the recorded trace data is provided by the target access hardware vendor \cite{winidea}. The term host software is used to refer to such applications.

The on-chip trace device is designed to record hardware events that occur on the microcontroller. It occupies a separate section on the silicon. Usually a controller is delivered in two versions, one with and one without a trace device.
In production, the ability to execute trace measurements is not required \cite{felixarc2014}. There, the trace device would only increase chip costs without providing any benefits.

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/tc27_emulation_device.png}
\caption[Infineon TC27x trace device]{A microcontroller with hardware trace support consists of two sections: a regular product chip part and the trace device part. The trace device part can be omitted in the production version of a chip to save costs \cite{tc27block}.}
\label{fig:tc27_emulation_device}
\end{figure}

\autoref{fig:tc27_emulation_device} shows the trace device of the Infineon TC27x microcontroller family \cite{tc27x}. The upper part belongs to the product chip while the lower part displays the trace device. The trace device can gather data from the product part via two interfaces. \glspl{pob} (\glsdesc{pob}) record processor events while \glspl{bob} record bus events. All events are collected, tagged with a timestamp, and buffered in the on-chip trace memory. From there they are sent off-chip via the dedicated trace interface.

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/timestamp_generation_event.pdf}
\caption[Timestamp per event]{Each trace event is assigned a timestamp relative to the previous event. By summing up the relative timestamps, absolute values can be generated.}
\label{fig:timestamp_generation_event}
\end{figure}

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/timestamp_generation_dedicated.pdf}
\caption[Dedicated timestamp generation]{Via dedicated timestamp events, the timestamps of the other events can be interpolated. In this example two events are recorded between the previous and the next timestamp event. Both events therefore get the same timestamp, which is interpolated from the two timestamp events.
The value is calculated via \autoref{eq:timestamp_interpolation} as $t_i = 5 + \frac{(15-5)}{2}=10$.}
\label{fig:timestamp_generation_dedicated}
\end{figure}

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/timestamp_generation_io.pdf}
\caption[Timestamp via \gls{io}]{Dedicated \gls{io} pins can be used to output a timestamp value whenever a measurement event is sent off-chip.}
\label{fig:timestamp_generation_io}
\end{figure}

There are different techniques to add timestamp information to a trace event. The obvious way is shown in \autoref{fig:timestamp_generation_event}. A timestamp is added to each trace event that is sent off-chip. To save bandwidth, timestamps are provided relative to the previous event. An absolute value is computed by summing up all previous timestamps. Another way is to send dedicated timestamp messages, as shown in \autoref{fig:timestamp_generation_dedicated}. The timestamps for the actual trace events are then interpolated, e.g., via the equation
\begin{equation}
\label{eq:timestamp_interpolation}
t_{i} = t_p + \frac{(t_n - t_p)}{2},
\end{equation}
where $t_p$ is the previous timestamp (the latest timestamp before the event), $t_n$ the next timestamp (the earliest timestamp after the event), and $t_i$ the timestamp interpolated based on the dedicated timestamp events. Finally, timestamps can also be created via dedicated \gls{io} pins as specified by the Nexus standard \cite{turley2004nexus}. This means that whenever a trace event is sent off-chip via the trace interface, the current timestamp is provided via the \gls{io} pins, as shown in \autoref{fig:timestamp_generation_io}.

Cycle-accurate timestamps are feasible with all timestamp generation techniques. However, timestamp accuracy and resolution are only partly dependent on the generation technique. More important factors are CPU and trace device clock frequency, as well as the design of CPU and trace device.
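The first two timestamp techniques can be sketched in C as follows. The function names are chosen for this illustration only, and the timestamps are assumed to be unsigned tick counts with $t_n \ge t_p$. The second function is a direct transcription of \autoref{eq:timestamp_interpolation}.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* Relative timestamps: the absolute timestamp of event i is the sum of
 * all relative timestamps up to and including event i. */
void accumulate_timestamps(const uint32_t *relative, uint64_t *absolute,
                           size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; ++i) {
        sum += relative[i];
        absolute[i] = sum;
    }
}

/* Dedicated timestamp events: t_i = t_p + (t_n - t_p) / 2.
 * Assumes t_n >= t_p; the division truncates towards zero. */
uint64_t interpolate_timestamp(uint64_t t_p, uint64_t t_n)
{
    return t_p + (t_n - t_p) / 2u;
}
\end{lstlisting}

With $t_p = 5$ and $t_n = 15$, \lstinline{interpolate_timestamp} yields 10, which matches the example in \autoref{fig:timestamp_generation_dedicated}.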
For cycle-accurate timestamps, the trace device frequency must be greater than or equal to the CPU frequency. Even if this is the case, cycle-accurate time\-stamps cannot necessarily be guaranteed. For example, superscalar processors like the Infineon TC277 \cite{tc27x} are capable of executing more than one instruction per cycle. However, only one event can be processed per cycle by the trace device, as shown in \autoref{fig:timestamp_cycle}. The processor observation block filters the instructions according to user-specified filter rules and forwards them for further processing. If two instructions, executed during the same processor cycle, match the filter and are thus forwarded to the trace device, one of those instructions is delayed by one cycle (in this example Instruction 2.1). For a processor running at \unit[100]{MHz}, one cycle lasts \unit[10]{ns}, so the timestamp of this particular event would be off by \unit[10]{ns}.

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/timestamp_cycle.pdf}
\caption[Timestamp generation accuracy]{Even if the trace device runs at CPU clock frequency, cycle-accurate timestamps cannot be guaranteed.}
\label{fig:timestamp_cycle}
\end{figure}

The design of trace devices differs depending on the processor family and the processor vendor. However, the general concept and provided functionality are the same for all devices. Various standards for the implementation of trace devices are specified and used by chip vendors. Three common standards are Nexus, used by PowerPC processors \cite{turley2004nexus}, \gls{etm} (\glsdesc{etm}), used by ARM processors \cite[p. 476]{yiu2013definitive}, and the \glsdesc{imds} \cite{stollon2011infineon}, discussed here and shown in \autoref{fig:tc27_emulation_device}.

According to \autoref{fig:concept_measurement}, a measurement process starts with the detection of an event by a sensor. In the case of the trace process, the sensors are the \glspl{pob} and \glspl{bob}.
Each \gls{pob} monitors the instructions executed by one processor core. This means the complete program flow executed by a processor core can be recorded. \glspl{bob} are connected to the data buses of the microcontroller and can detect memory access events. A memory access event may be, for example, writing to a variable or reading from a special function register. A typical data trace event contains, in addition to the timestamp, details like address, data value, transfer size, and whether a read or write access occurred \cite{hopkins2006debug}.

Filters can be specified by the user to reduce the number of recorded trace events. They can be set for an address or for an address range. Different actions can be taken if an address filter matches: the corresponding event can be recorded or discarded, or another action can be triggered. For example, it is possible to start or stop the trace process if a specific function is accessed or a variable is written. Filter configuration is done via the host software.

Corresponding to the two main hardware event types, instruction events and data access events, two hardware trace techniques can be distinguished: program flow trace and data trace \cite{felixarc2014}. The two trace techniques can be executed in parallel or individually, as configured by the user.

A \textbf{program flow trace} (also called function trace) shows the complete execution path of an application for the duration of the trace recording. This means it is possible to detect when a certain function is called or which branch of an if statement is executed. The number of instructions executed by a modern CPU, and the resulting data stream bandwidth, is too large to be transmitted via the trace interface. To solve this problem, trace devices use trace compression.
The most commonly used program flow trace compression technique works by detecting and recording only those instructions that cause a change in program flow, such as conditional jumps and traps \cite{hopkins2006debug}. Using the application binary, the host software is able to reconstruct the complete program flow.

A \textbf{data trace} is a sequence of data access events. Data tracing makes it possible to monitor and debug the state of variables in memory. Data tracing of all active units is becoming increasingly important because not all data interactions involve a processor \cite{mayer2003debug}. Thus, trace devices must also be able to detect memory accesses via \gls{dma} (\glsdesc{dma}) and accesses to memory of special on-chip modules like FlexRay or Ethernet. Which units of a microcontroller are supported depends on the trace device, but all trace devices support tracing the main memory of a controller. Compression is also applied to data traces. However, those techniques are usually not sufficient to record a complete data trace of significant length since the amount of generated data is too large. The best way to solve this problem is to apply filters to avoid detecting and recording data events in memory sections that are not of interest \cite{hopkins2006debug}.

A recorded hardware trace event is buffered in an on-chip trace memory. From there the events can be read via the trace interface. On-chip trace memories can be operated in different modes \cite{felixarc2014}. In \emph{continuous mode} the trace data is streamed off-chip in real-time. This technique is limited by the bandwidth of the trace interface. If the bandwidth is high enough, the trace duration depends only on the available memory on the host computer and traces of arbitrary length can be recorded. If the bandwidth is too small to process the recorded trace stream, \emph{buffer mode} must be used.
This means the recorded trace is written into trace memory and read out by the target access hardware after tracing. Buffer mode can be used in pre- and post-trigger mode. In pre-trigger mode the trace buffer is filled like a circular buffer. The oldest events are discarded to make room for new events. The trace process can be stopped at an arbitrary point in time and the latest trace events become available. In post-trigger mode the trace process is stopped as soon as the buffer has been filled for the first time. A trace device operated in buffer mode is limited by the available trace memory. The trace memory size of an Infineon TC275 microcontroller (\autoref{fig:workbench} a) is \unit[2]{MB}, which allows for approximately \unit[33]{ms} of unfiltered function and data trace of a single processor core running at \unit[200]{MHz} \cite{felixarc2014}. Depending on the measurement use case this may be sufficient or not. If the trace duration should be increased, tracing in continuous mode is mandatory. Continuous tracing requires a high-bandwidth interface such as \gls{agbt} (\glsdesc{agbt}).

\section{Hardware Trace Toolchain}

Multiple steps are required from recording a hardware trace on the target to presenting it to the user on a personal computer, as shown in \autoref{fig:toolchain}. Many different solutions exist for each of those steps. Nevertheless, the basic functionalities provided by all solutions are comparable.

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/toolchain.pdf}
\caption[Trace toolchain]{Recording a hardware trace and making it available to the user requires multiple steps. Hardware events must be measured on the target via a trace device. Using a trace interface, the recorded data can be read out by the target access hardware and transmitted to a host computer.
Target access hardware vendors provide special software to analyze and visualize the recorded trace.}
\label{fig:toolchain}
\end{figure}

The basic prerequisite for executing a hardware trace is the availability of an on-chip trace device. All major chip vendors provide trace devices for their microcontrollers that support program flow and data trace. \autoref{tab:trace_devices} gives an overview of the state-of-the-art trace solutions.

\begin{table}[]
\centering
\begin{tabular}{r|c c c}
Standard & Architecture & Function Trace & Data Trace\\ \hline
Nexus & PowerPC & \begin{tabular}[x]{@{}c@{}} Branch Trace \\ Messaging \end{tabular} & \begin{tabular}[x]{@{}c@{}} Data Trace \\ Messaging \end{tabular} \\ \hline
\gls{etm} & ARM & \begin{tabular}[x]{@{}c@{}}Program Trace \\ Macrocell \end{tabular} & \begin{tabular}[x]{@{}c@{}}Embedded Trace \\ Macrocell \end{tabular} \\ \hline
\gls{imds} & TriCore & \begin{tabular}[x]{@{}c@{}}Processor \\ Observation Block \end{tabular} & \begin{tabular}[x]{@{}c@{}}Bus \\ Observation Block \end{tabular} \\
\end{tabular}
\caption[Trace devices for different architectures]{Trace devices exist for different CPU architectures. All solutions provide methods for recording program flow and data traces.}
\label{tab:trace_devices}
\end{table}

Events that have been recorded by the trace device are sent off-chip via a dedicated trace interface. If the bandwidth provided by an interface is lower than the rate at which trace data is created, continuous tracing is not possible. However, this use case is often required. There are two ways to solve this problem. The amount of created trace data can be reduced using filters, or the available bandwidth can be increased. If an entire application must be analyzed as a whole, the first way is not an option.
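As a rough worked example, the condition for continuous tracing can be written down explicitly. The symbols $R$ (rate at which the trace device creates data without filters), $B$ (bandwidth of the trace interface), and $f$ (fraction of the trace data that passes the filters) are introduced here for illustration only. Continuous tracing requires
\[
f \cdot R \le B, \qquad 0 < f \le 1 .
\]
Filters lower $f$, whereas a faster interface raises $B$. Using the TC275 figures quoted in \autoref{subsection:hardware_tracing} (\unit[2]{MB} of trace memory filled in approximately \unit[33]{ms}) and assuming $1\,\mathrm{MB} = 10^6$ bytes, the unfiltered rate ($f = 1$) for a single core is approximately
\[
R \approx \frac{2\,\mathrm{MB}}{33\,\mathrm{ms}} \approx 61\,\mathrm{MB/s} .
\]
This is only an estimate derived from those two numbers; the actual rate depends on the application and on the trace compression that is applied.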
\begin{table}[]
\centering
\begin{tabular}{r|l c}
Interface & Pros/Cons & DAQ rate {\small [MB/s]}\\ \hline
JTAG & \begin{tabular}[x]{@{}l@{}} $+$ Reuse of existing interface \\ $+$ Small chip area \\ $-$ Low bandwidth \\ \vspace{1mm} \end{tabular} & 1.2 \\
DAP2/SWD & \begin{tabular}[x]{@{}l@{}} $+$ High bandwidth with few pins \\ $+$ Small silicon area \\ $-$ Proprietary \\ \vspace{1mm} \end{tabular} & 10 \\
\gls{agbt} & \begin{tabular}[x]{@{}l@{}} $+$ Very high bandwidth with few pins \\ $-$ Large silicon area \\ $-$ High cost \\ \vspace{1mm} \end{tabular} & 30 \\
CAN & \begin{tabular}[x]{@{}l@{}} $+$ Robust and well known standard \\ $+$ Low cost \\ $-$ Very low bandwidth \\ \end{tabular} & 0.05 \\
\end{tabular}
\caption[Trace interfaces]{Commonly used trace interfaces and their \gls{daq} (\glsdesc{daq}) rates. \gls{agbt} (\glsdesc{agbt}) is the only interface capable of recording continuous hardware traces of a complete system.}
\label{tab:interfaces}
\end{table}

Mayer et al.\ \cite{interfaces} give an overview of trace interfaces used in the automotive industry, as shown in \autoref{tab:interfaces}. \gls{jtag} (\glsdesc{jtag}) is a common debug standard \cite{ieee5001}, suitable for regular debugging. It can be used to read out a buffered trace after tracing, but for continuous tracing it is not sufficient due to its low bandwidth of \unit[1.2]{MB/s}. Because of that, DAP and DAP2 were developed by Infineon and SWD by ARM\@. These protocols are based on \gls{jtag} but use a higher frequency and improved communication protocols to provide more bandwidth. \gls{agbt} is currently the fastest trace interface. It was specified by Xilinx and adopted by the Nexus standard. \gls{agbt} is the only interface which is theoretically capable of recording a continuous trace of a complete application running on a processor with a frequency of \unit[200]{MHz}.
CAN is used by some hybrid trace tools but is only mentioned for completeness since its bandwidth is too low to be considered for hardware tracing.

\begin{figure}[]
\centering
\includegraphics[width=\textwidth]{./media/trace/workbench.png}
\caption[Trace workbench]{A complete trace workbench. An Infineon TriCore evaluation board (a) can be traced by the iSYSTEM iC6000 (b) or the Lauterbach PowerTrace-II (e) via the high-speed \gls{agbt} interface. Host software is used to control the hardware and to analyze the recorded trace, for example WinIDEA (c) by iSYSTEM and TRACE32 (d) by Lauterbach \cite{maxmaster}.}
\label{fig:workbench}
\end{figure}

Target access hardware is connected to the hardware interface to read out recorded trace events. From the target access hardware the data is transmitted to a host computer for further analysis via USB 3.0 or Ethernet. Examples of target access hardware are the iC6000 by iSYSTEM \cite{ic6000} (\autoref{fig:workbench} b) and the PowerTrace-II by Lauterbach \cite{powertrace2} (\autoref{fig:workbench} e). Both devices support different architectures and trace interfaces by using architecture-specific debug cables. Besides reading hardware traces, those devices also support all functionalities provided by a regular debugger, such as stepwise debugging, reading of memory content, and manipulation of CPU configuration registers.

Dedicated software on the host computer is used to configure and control the target access hardware and the trace device itself. After recording, this software transforms the recorded hardware trace into a software trace (see \autoref{fig:trace_event_levels}). For this process the host software must have access to the \gls{elf} file of an application. This is required to map the addresses of hardware trace events to the corresponding software entities. Based on the software trace, different analysis techniques such as metric evaluation, performance analysis, and code coverage are supported.
Gantt charts are provided to examine the trace visually. Via export functions a software level program flow and data trace can be made available for external tools. \autoref{fig:workbench} shows the toolchain described in this section.
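The mapping of hardware trace addresses to software entities can be illustrated with a small C sketch. It assumes that the symbol information has already been extracted from the \gls{elf} file into a table that is sorted by start address and free of overlaps. The type \lstinline{symbol_t} and the function \lstinline{lookup_symbol} are hypothetical and do not correspond to the interface of any host software; parsing of the \gls{elf} file itself is not shown.

\begin{lstlisting}[language=C]
#include <stddef.h>
#include <stdint.h>

/* One software entity (function or variable) taken from the symbol
 * information of the application. */
typedef struct {
    uint32_t start;   /* first address occupied by the entity */
    uint32_t size;    /* size of the entity in bytes */
    const char *name; /* symbol name */
} symbol_t;

/* Binary search for the symbol that covers addr. The table must be
 * sorted by start address and must not contain overlapping entries.
 * Returns NULL if no symbol covers the address. */
const symbol_t *lookup_symbol(const symbol_t *symbols, size_t n,
                              uint32_t addr)
{
    size_t lo = 0;
    size_t hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (addr < symbols[mid].start) {
            hi = mid;                 /* continue in the lower half */
        } else if (addr - symbols[mid].start >= symbols[mid].size) {
            lo = mid + 1;             /* continue in the upper half */
        } else {
            return &symbols[mid];     /* addr lies inside this symbol */
        }
    }
    return NULL;
}
\end{lstlisting}

A data trace event whose address falls into the range of a variable can thus be reported as an access to that variable, and an instruction address can be attributed to the function that contains it.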