Scheduling Computational and Energy Harvesting Tasks in Deadline-Aware Intermittent Systems
Bashima Islam (University of North Carolina at Chapel Hill)
Shahriar Nirjon (University of North Carolina at Chapel Hill)

Full Text Talk Slides
Video Slack Thread

The sporadic nature of harvestable energy and the mutually exclusive computing and charging cycles of intermittently powered batteryless systems pose a unique and challenging real-time scheduling problem. Existing literature focuses on either the time or the energy constraints, but not both at the same time. In this paper, we propose two scheduling algorithms for intermittent systems (one offline and one online) that schedule both computational and energy harvesting tasks by harvesting the required minimum amount of energy while maximizing the schedulability of computational jobs. To evaluate the proposed algorithms, we conduct simulation as well as trace-based and real-life experiments. Our results show that our proposed offline scheduling algorithm has performance 92% similar to that of an optimal scheduler, and our online algorithm schedules 8% – 22% more jobs than earliest deadline first (EDF), rate monotonic (RM), and as late as possible (ALAP) scheduling algorithms. We deployed solar-powered batteryless systems in which four intermittent applications execute on a TI MSP430FR5994 microcontroller, and demonstrate that the system with our proposed online scheduler misses 63% fewer deadlines than a non-real-time system and 8% fewer deadlines than a system with a baseline (as late as possible) scheduler.
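
To make the charging/computing interplay concrete, here is a minimal Python sketch of an admission test in the spirit described above. It is not the paper's algorithm: the admission rule, the energy model, and all numbers are illustrative assumptions.

```python
# Minimal sketch (not the paper's algorithm): an EDF-style admission test that treats
# charging as a task. A job is admitted only if its compute time plus the charging
# time needed to cover its energy demand fits before its deadline. Charging and
# computing are mutually exclusive, as in intermittently powered systems.
def admit(job, now, stored_energy, harvest_rate):
    """job = (compute_time, energy_needed, deadline); units are illustrative."""
    compute_time, energy_needed, deadline = job
    deficit = max(0.0, energy_needed - stored_energy)
    charge_time = deficit / harvest_rate
    return now + charge_time + compute_time <= deadline

# Example: 5 ms of compute needing 3 mJ, with 1 mJ stored, harvesting 0.5 mJ/ms,
# and a deadline at t = 15 ms: 4 ms of charging + 5 ms of compute fits, so admit.
print(admit((5.0, 3.0, 15.0), now=0.0, stored_energy=1.0, harvest_rate=0.5))  # True
```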

Real-Time Scheduling upon a Host-Centric Acceleration Architecture with Data Offloading
Jinghao Sun (Northeastern University)
Jing Li (New Jersey Institute of Technology)
Zhishan Guo (University of Central Florida)
An Zou (Washington University in St. Louis)
Xuan Zhang (Washington University in St. Louis)
Kunal Agrawal (Washington University in St. Louis)
Sanjoy Baruah (Washington University in St. Louis)

Full Text Talk Slides
Video Slack Thread

Challenging scheduling problems arise in the implementation of safety-critical cyber-physical systems (CPS) upon heterogeneous platforms with (serial) data offloading and (parallel) computation. In this paper, we adapt techniques from scheduling theory to model heterogeneous systems with offloading time. We derive algorithms for solving this problem and characterize their performance both analytically, via the approximation ratio metric, and experimentally, through simulations upon synthetic workloads that are justified via a case study on a CPU-GPU platform. These characterizations expose some divergence between the analytical and experimental results; we make recommendations regarding the choice of algorithmic approach that seek to balance these divergent characterizations.

On Dynamic Thermal Conditions in Mixed-Criticality Systems
Seyedmehdi Hosseinimotlagh (University of California, Riverside)
Ali Ghahramannezhad (University of California, Riverside)
Hyoseung Kim (University of California, Riverside)

Full Text Talk Slides
Video Slack Thread

The rising demand for powerful embedded systems to support modern complex real-time applications intensifies on-chip temperature challenges. Heat conduction between CPU cores interferes with the execution time of tasks running on other cores. Violations of thermal constraints cause timing unpredictability for real-time tasks due to transient performance degradation or permanent system failure. Moreover, dynamic ambient temperature significantly affects the operating temperature of multi-core systems.

In this paper, we propose a thermal-aware server framework to safely upper-bound the maximum temperature of multi-core mixed-criticality systems. With the proposed analysis on the impact of ambient temperature, our framework manages mixed-criticality tasks to satisfy both thermal and timing requirements. We present techniques to find the maximum ambient temperature for each criticality level to guarantee the safe operating temperature bound. We also analyze the minimum time required for a criticality mode change from one level to another. The thermal properties of our framework have been evaluated on a commercial embedded platform. A case study with real-world mixed-critical applications demonstrates the effectiveness of our framework in bounding operating temperature under dynamic ambient temperature changes.
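
As a rough illustration of how ambient temperature enters such an analysis, the sketch below uses a single lumped steady-state thermal model. This is an assumption made here for brevity; the paper's analysis is per-core, transient-aware, and tied to criticality modes, and all numbers are invented.

```python
# Illustrative lumped thermal model: with thermal resistance R_th (degC/W), the
# steady-state core temperature is T_amb + R_th * P. The largest ambient temperature
# a criticality level can tolerate is then the safe bound minus the self-heating
# caused by that level's power budget.
def max_ambient(temp_bound_c, r_th_c_per_w, power_w):
    return temp_bound_c - r_th_c_per_w * power_w

# Hypothetical numbers: 85 degC safe bound, 8 degC/W package, the high-criticality
# workload alone draws 4 W, the full (low-criticality) workload draws 6 W.
print(max_ambient(85, 8, 4))  # 53 degC tolerated when only HI tasks run
print(max_ambient(85, 8, 6))  # 37 degC tolerated when LO tasks run as well
```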

Debugging FPGA-accelerated Real-time Systems
Best Presentation Award!
Martin Geier (Technical University of Munich)
Marian Brändle (Technical University of Munich)
Dominik Faller (Technical University of Munich)
Samarjit Chakraborty (Technical University of Munich)

Full Text Talk Slides
Video Slack thread

The high computation/communication requirements, along with reliability needs and limited power budgets, necessitate complex processing platforms for emerging autonomous systems. Due to the current focus on performance, however, such platforms are increasingly difficult to predict and analyze. This holds true for both performance aspects (e.g., behavioral and temporal) and power consumption. Ensuring functional safety thus requires new techniques to analyze performance, predictability and power.

In this paper, we thus propose a novel hybrid trace methodology to monitor (and, subsequently, optimize) temporal, functional and energy-related properties of Real-time Systems (RTSs). We target current Programmable SoCs (pSoCs) integrating a fixed-function System-on-Chip (SoC) with flexible Field Programmable Gate Array (FPGA) fabric. Although such heterogeneous systems are well suited for high-end, mixed hardware/software real-time pipelines, they also offer more complex performance/energy trade-offs than software-only platforms. To systematically exploit this complexity, we present a resource-efficient trace IP core for the pSoC’s fabric and an external measurement/interface system — jointly capturing hybrid power/state traces for subsequent (i.e., offline) analysis. By fusing state data from our IP core with events-of-interest gathered from power traces of pSoC and co-monitored I/O components, we gain a holistic view on temporal RTS aspects. Events and synchronized multi-rail power data jointly extend the debugging coverage via automated identification of processing phases, computation of energy baselines plus estimation of potential savings. Our solution thus integrates functional, temporal and energy monitoring into a single, unified workflow, which, in contrast to traditional separate tools, delivers valuable new insights helpful during debugging and reduces both cost and effort. Experimental evaluations on a Zynq-based Visual Servoing System show the method’s various benefits.

Enforcing Deadlines for Skeleton-based Parallel Programming
Paul Metzger (University of Edinburgh)
Murray Cole (University of Edinburgh)
Christian Fensch (University of Edinburgh)
Marco Aldinucci (University of Turin)
Enrico Bini (University of Turin)

Full Text Talk Slides
Video Slack thread

Applications that require high throughput with real-time guarantees are increasingly popular. For these applications, parallelism must be exposed to meet deadlines. Directed Acyclic Graphs (DAGs) are a popular and very general application model that can capture any possible interaction among threads. However, we argue that by constraining the application structure to given “skeletons”, at the price of losing some generality w.r.t. DAGs, the following advantages are gained: (i) a finer model of the application enables tighter analysis, (ii) specialised scheduling policies are applicable, (iii) programming is simplified, (iv) specialised implementation techniques can be exploited transparently, and (v) the program can be automatically tuned to minimise resource usage while still meeting its hard deadlines.

We conduct a case study with the job farm skeleton and the hard real-time XMOS xCore-200 microcontroller. We present an analytical framework that reduces the number of required cores by scheduling jobs in batches, while ensuring that deadlines are still met. Our experimental results demonstrate that batching reduces the minimum sustainable period by up to 22%, leading to a reduced number of required cores. The framework chooses the best parameters in 83% of cases and never selects parameters that cause deadline misses. Finally, we show that the overheads introduced by the skeleton abstraction layer are negligible.
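
A back-of-the-envelope sketch of the batching trade-off follows. The parameters (period P, deadline D, per-job time c, per-batch overhead o) and the simple latency model are assumptions for illustration; the paper's analytical framework is more detailed.

```python
import math

# Jobs arrive every P; a batch of b jobs takes b*c + o on one core (o = fixed
# per-batch overhead). A batch can only start once its last job has arrived.
def cores_needed(P, D, c, o, b):
    service = b * c + o                   # time to serve one batch on one core
    latency = (b - 1) * P + service       # wait for the batch to fill, then serve it
    if latency > D:
        return None                       # this batch size cannot meet the deadline
    return math.ceil(service / (b * P))   # cores needed to keep up with arrivals

for b in (1, 2, 4, 6):
    print(b, cores_needed(P=1.0, D=12.0, c=1.0, o=2.0, b=b))
# 1 -> 3 cores, 2 -> 2, 4 -> 2, 6 -> None: batching amortizes the per-batch
# overhead o, reducing cores, until the batching delay breaks the deadline.
```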

A novel flow control mechanism to avoid multi-point progressive blocking in hard real-time priority-preemptive NoCs
Leandro Indrusiak (University of York)
Alan Burns (University of York)
Nickolay Smirnov (University of York)
James Harrison (University of York)

Full Text Talk Slides
Video Slack Thread

The recently uncovered problem of multi-point progressive blocking (MPB) has significantly increased the complexity of schedulability analysis of priority-preemptive wormhole networks-on-chip. While state-of-the-art analysis is currently deemed safe, there is still significant inherent pessimism when it comes to considering backpressure issues caused by downstream indirect interference. In this paper, we attempt to simplify the problem by considering a novel flow control protocol that can avoid backpressure issues, enabling simpler schedulability analysis approaches to be used. Rather than construct the analysis to fit the protocol, we modify the protocol so that effective analysis applies. We describe the changes to a baseline wormhole router needed to implement the proposed protocol, and comment on the impact on hardware overheads. We also examine the number of routers that actually require these changes. Comparative analysis of FPGA implementations shows that the hardware overheads of the proposed NoC router are comparable to or lower than those of the baseline, while analytical comparison shows that the proposed approach can guarantee schedulability in up to 77% more cases.
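
For context, the kind of simpler analysis such a protocol enables looks like the classic response-time recurrence over directly interfering flows only. The sketch below is a textbook recurrence, not the paper's analysis, and for brevity it assumes flows[:i] are exactly the higher-priority flows that share a link with flow i.

```python
import math

# flows = list of (basic_network_latency_C, period_T), highest priority first.
def flow_response_time(flows, i, deadline):
    C_i, _ = flows[i]
    R = C_i
    while True:
        interference = sum(math.ceil(R / T_j) * C_j for C_j, T_j in flows[:i])
        R_next = C_i + interference
        if R_next > deadline:
            return None                   # not schedulable within the deadline
        if R_next == R:
            return R                      # fixed point reached
        R = R_next

flows = [(2, 10), (3, 15), (4, 40)]
print([flow_response_time(flows, i, deadline=40) for i in range(len(flows))])  # [2, 5, 9]
```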

Boomerang: Real-Time I/O Meets Legacy Systems
Richard West (Boston University)
Ahmad Golchin (Boston University)
Soham Sinha (Boston University)

Full Text Talk Slides
Video Slack Thread

This paper presents Boomerang, a system that integrates a legacy non-real-time OS with one that is customized for timing-sensitive tasks. A relatively small RTOS benefits from the pre-existing libraries, drivers and services of the legacy system. Additionally, timing-critical tasks are isolated from less critical tasks by securely partitioning machine resources among the separate OSes. Boomerang guarantees end-to-end processing delays on input data that requires outputs to be generated within specific time bounds.

We show how to construct composable task pipelines in Boomerang that combine functionality spanning a custom RTOS and a legacy Linux system. By assigning time-critical I/O to the RTOS, we ensure that complementary services provided by Linux are sufficiently predictable to meet end-to-end service guarantees. While Boomerang benefits from spatial isolation, it also outperforms a standalone Linux system using deadline-based CPU reservations for pipeline tasks. We also show how Boomerang outperforms a virtualized system called ACRN, designed for automotive systems.

Time-Triggered Traffic Planning for Data Networks with Conflict Graphs
Jonathan Falk (University of Stuttgart)
Frank Dürr (University of Stuttgart)
Kurt Rothermel (University of Stuttgart)

Full Text Talk Slides
Video Slack thread

Traffic planning is the key enabler of time-triggered real-time communication in distributed systems, and it is known to be notoriously hard. Current approaches predominantly tackle the problem directly in the domain of the traffic planning problem, e.g., by formulating constraints on the transmission schedules of individual data streams or on the links used by the data streams. This results in a high degree of coupling between the configuration of an individual data stream and the global (network-wide) traffic configuration, with detrimental effects on the scalability and runtime of the planning phase.

In contrast, we present a configuration-conflict-graph-based approach, which solves the original traffic planning problem by searching for an independent vertex set in the conflict graph. We show how to derive the configuration-conflict graph and discuss the conceptual advantages of this approach. To show the practical advantages of conflict-graph-based traffic planning, we additionally present a proof-of-concept implementation and evaluate it against a reference ILP-based implementation. In our evaluations, our proof-of-concept implementation of the conflict-graph-based approach outperforms the reference ILP and is more memory efficient, making it a promising alternative to current constraint-based traffic planning approaches.
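
The conflict-graph view can be illustrated with a toy sketch: every vertex is one candidate configuration of one stream, edges join configurations that cannot coexist, and a valid network-wide plan picks one non-conflicting vertex per stream. The greedy search and the example data below are assumptions for illustration, not the paper's algorithm.

```python
# candidates: {stream: [config, ...]}; conflicts: set of frozensets of two
# (stream, config) pairs that cannot both be selected.
def plan(candidates, conflicts):
    chosen = {}
    for stream, configs in candidates.items():
        for cfg in configs:
            if all(frozenset({(stream, cfg), picked}) not in conflicts
                   for picked in chosen.items()):
                chosen[stream] = cfg      # vertex is independent of all picked vertices
                break
        else:
            return None                   # no conflict-free configuration left
    return chosen

candidates = {"s1": [0, 1], "s2": [0, 1]}          # e.g. candidate transmission offsets
conflicts = {frozenset({("s1", 0), ("s2", 0)})}    # both streams cannot use offset 0
print(plan(candidates, conflicts))                 # {'s1': 0, 's2': 1}
```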

Co-Optimizing Performance and Memory Footprint Via Integrated CPU/GPU Memory Management, an Implementation on Autonomous Driving Platform
Soroush Bateni (University of Texas at Dallas)
Zhendong Wang (University of Texas at Dallas)
Yuankun Zhu (University of Texas at Dallas)
Yang Hu (University of Texas at Dallas)
Cong Liu (University of Texas at Dallas)

Full Text Talk Slides
Video Slack Thread

Cutting-edge embedded system applications, such as self-driving cars and unmanned drone software, rely on integrated CPU/GPU platforms for their DNN-driven workloads, such as perception and other highly parallel components. In this work, we set out to explore the hidden performance implications of the GPU memory management methods of integrated CPU/GPU architectures. Through a series of experiments on micro-benchmarks and real-world workloads, we find that the performance under different memory management methods may vary according to application characteristics. Based on this observation, we develop a performance model that can predict the system overhead of each memory management method based on application characteristics. Guided by the performance model, we further propose a runtime scheduler. By conducting per-task memory management policy switching and kernel overlapping, the scheduler can significantly relieve the system memory pressure and reduce the multitasking co-run response time. We have implemented and extensively evaluated our system prototype on the NVIDIA Jetson TX2, Drive PX2, and Xavier AGX platforms, using both the Rodinia benchmark suite and two real-world case studies of drone software and autonomous driving software.
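
The per-task policy-switching idea can be sketched as follows: predict each memory-management method's overhead from simple application features and pick the cheapest method per task. The linear models, their coefficients, and the method names below are invented for illustration; the paper derives its own performance model.

```python
def predict_overhead_ms(method, footprint_mb, cpu_gpu_transfers):
    # Hypothetical per-method cost models (footprint-dominated vs. transfer-dominated).
    models = {
        "device_copy": 0.20 * footprint_mb + 0.05 * cpu_gpu_transfers,
        "unified":     0.05 * footprint_mb + 0.40 * cpu_gpu_transfers,
        "pinned":      0.10 * footprint_mb + 0.10 * cpu_gpu_transfers,
    }
    return models[method]

def pick_policy(footprint_mb, cpu_gpu_transfers):
    return min(("device_copy", "unified", "pinned"),
               key=lambda m: predict_overhead_ms(m, footprint_mb, cpu_gpu_transfers))

print(pick_policy(footprint_mb=200, cpu_gpu_transfers=2))   # large, transfer-light -> "unified"
print(pick_policy(footprint_mb=10,  cpu_gpu_transfers=50))  # small, transfer-heavy -> "device_copy"
```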

Slite: OS Support for Near Zero-Cost, Configurable Scheduling
Phani Kishore Gadepalli (The George Washington University)
Runyu Pan (The George Washington University)
Gabriel Parmer (The George Washington University)

Full Text Talk Slides
Video Slack Thread

Despite over 35 years of wildly changing requirements and applications, real-time systems have treated the kernel implementation of system scheduling policy as given. New time management policies are either adapted to the kernel, and rarely adopted, or emulated in user-level under the restriction that they must stay within the confines of the underlying system’s mechanisms. This not only hinders the agility of the system to adapt to new requirements, but also harms non-functional properties of the system such as the effective use of parallelism, and the application of isolation to promote security.

This paper introduces Slite, a system designed to investigate the possibilities of near zero-cost scheduling of system-level threads at user-level. This system efficiently and predictably enables the user-level implementation of configurable scheduling policies. This capability has wide-ranging impact, and we investigate how it can increase the isolation, thus dependability, of a user-level real-time OS, and how it can provide a real-time parallel runtime with better analytical properties using thread-based — rather than the conventional task-based — scheduling. We believe these benefits motivate a strong reconsideration of the fundamental scheduling structure in future real-time systems.

Interference-Aware Memory Allocation for Real-Time Multi-Core Systems
Simon Reder (Karlsruhe Institute of Technology (KIT))
Juergen Becker (Karlsruhe Institute of Technology (KIT))

Full Text Talk Slides
Video Slack Thread

Computing tight upper bounds for the Worst-Case Execution Time (WCET) at design-time is a crucial step when developing hard real-time software. For multi-core processors, however, timing interference between processor cores is a major problem, which may lead to overestimated WCET bounds. This work investigates possible solutions to reduce interference costs using synchronization-based interference models and appropriate memory allocation schemes. An interference-aware Integer Linear Programming (ILP) formulation of the memory allocation problem is presented to optimally map the variables of parallel programs to a set of distributed memory segments. The approach uses a generic model of the hardware platform, such that it applies to a wide range of multi-core targets, including complex Network-on-Chip (NoC) interconnects. A case study with six different platform configurations shows that interference costs can be bounded more tightly using the proposed interference model. An evaluation of the allocation scheme furthermore shows that the optimization approach can reduce interference costs by up to 49%.
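
To make the allocation problem concrete, the sketch below solves a tiny instance by exhaustive search instead of the paper's ILP: map each shared variable to a memory segment so that segment capacities are respected and a modeled interference cost is minimized. Sizes, capacities, and the cost model are invented for illustration.

```python
from itertools import product

variables = {"buf_a": 4, "buf_b": 2, "state": 1}        # variable sizes in KiB
segments  = {"local0": 4, "local1": 4, "shared": 16}    # segment capacities in KiB
# Modeled worst-case interference cost of placing a variable in a segment:
# core-local scratchpads are contention-free, the shared segment is not.
cost = {(v, s): (0 if s.startswith("local") else 10 * size)
        for v, size in variables.items() for s in segments}

def best_allocation():
    best, best_cost = None, float("inf")
    for combo in product(segments, repeat=len(variables)):
        alloc = dict(zip(variables, combo))
        used = {s: 0 for s in segments}
        for v, s in alloc.items():
            used[s] += variables[v]
        if any(used[s] > cap for s, cap in segments.items()):
            continue                                     # violates a segment capacity
        c = sum(cost[(v, s)] for v, s in alloc.items())
        if c < best_cost:
            best, best_cost = alloc, c
    return best, best_cost

print(best_allocation())   # splits the variables across the two local segments, cost 0
```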

Timing of Autonomous Driving Software: Problem Analysis and Prospects for Future Solutions
Miguel Alcon (Barcelona Supercomputing Center (BSC))
Hamid Tabani (Barcelona Supercomputing Center (BSC))
Leonidas Kosmidis (Barcelona Supercomputing Center (BSC))
Enrico Mezzetti (Barcelona Supercomputing Center (BSC))
Jaume Abella (Barcelona Supercomputing Center (BSC))
Francisco J. Cazorla (Barcelona Supercomputing Center (BSC))

Full Text Talk Slides
Video Slack thread

The software used to implement advanced functionalities in critical domains (e.g., autonomous operation) complicates software timing analysis. This is not only due to the complexity of the underlying high-performance hardware deployed to provide the required levels of computing performance, but also due to the complexity, non-deterministic nature, and huge input space of the artificial intelligence (AI) algorithms used. In this paper, we focus on Apollo as a representative industrial-quality Autonomous Driving (AD) software framework: we statistically characterize its observed execution time variability and reason about the sources behind it. We discuss the main challenges and limitations in finding a satisfactory software timing analysis solution for Apollo, and also show the main traits supporting the acceptability of statistical timing analysis techniques as a feasible path. While providing a consolidated solution for the software timing analysis of Apollo is a huge effort far beyond the scope of a research paper, our work aims to set the basis for future and more elaborate techniques for the timing analysis of AD software.
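
The kind of statistical characterization referred to above can be sketched as follows; the samples are synthetic, not Apollo measurements, and the chosen statistics are merely an example of describing execution-time variability by percentiles and dispersion rather than by a single worst observed value.

```python
import random
import statistics

random.seed(0)
# Synthetic execution-time samples (ms): a Gaussian body with a heavy right tail.
samples_ms = [random.gauss(42.0, 4.0) + random.expovariate(1 / 3.0) for _ in range(10_000)]

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

print("mean  %.1f ms" % statistics.mean(samples_ms))
print("p50   %.1f ms" % percentile(samples_ms, 50))
print("p99   %.1f ms" % percentile(samples_ms, 99))
print("p99.9 %.1f ms" % percentile(samples_ms, 99.9))
print("max   %.1f ms" % max(samples_ms))
print("CoV   %.2f"    % (statistics.stdev(samples_ms) / statistics.mean(samples_ms)))
```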

Real-Time Replica Consistency over Ethernet with Reliability Bounds
Arpan Gujarati (MPI-SWS, Germany)
Sergey Bozhko (MPI-SWS, Germany)
Björn B. Brandenburg (MPI-SWS, Germany)

Full Text Talk Slides
Video Slack thread

Ethernet is expected to play a key role in the development of the next generation of safety-critical distributed real-time systems. Unfortunately, the use of switched Ethernet in place of traditional field buses such as CAN exposes systems to the risk of Byzantine errors (or inconsistent broadcasts) due to environmentally-induced transient faults.

Byzantine fault tolerance (BFT) protocols can mitigate such errors to a large extent. However, no BFT protocol has yet been investigated from the perspective of hard real-time predictability. Classical Byzantine safety guarantees (e.g., 3f + 1 processes can tolerate up to f Byzantine faults) are oblivious to non-uniform fault rates across different system components that arise due to environmental disturbances. Furthermore, existing analyses abstract from the underlying network topology despite its strong influence on actual failure rates.

In this work, we present (i) a hard real-time interactive consistency protocol that allows distributed processes to agree on a common state despite Byzantine errors; and (ii) the first quantitative, real-time-aware reliability analysis of such a protocol deployed over switched Ethernet in the presence of stochastic transient faults. Our analysis is free of reliability anomalies and, as we show in our evaluation, can be used for a reliability-aware design space exploration of different fault tolerance alternatives.
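
Why non-uniform fault rates matter can be seen with a small numeric sketch (not the paper's analysis): the probability that more than f out of n replicas are faulty in a given interval, with a different transient-fault probability per replica, computed by dynamic programming over the Poisson-binomial distribution. The probabilities below are invented.

```python
def prob_more_than_f_faulty(p_fault, f):
    """p_fault: per-replica fault probability for the interval of interest."""
    dist = [1.0]                              # dist[k] = P(exactly k replicas faulty so far)
    for p in p_fault:
        new = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            new[k]     += q * (1 - p)
            new[k + 1] += q * p
        dist = new
    return sum(dist[f + 1:])

uniform = [1e-4] * 4                          # 3f + 1 = 4 replicas tolerating f = 1 fault
skewed  = [1e-4, 1e-4, 1e-4, 1e-2]            # same replicas, one in a noisier environment
print(prob_more_than_f_faulty(uniform, f=1))  # ~6e-8
print(prob_more_than_f_faulty(skewed,  f=1))  # ~3e-6, roughly 50x worse under the same sizing
```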

Addressing Resource Contention and Timing Predictability for Multi-Core Architectures with Shared Memory Interconnects
Haitong Wang (University of York)
Neil Audsley (University of York)
Wanli Chang (University of York)

Full Text Talk Slides
Video Slack Thread

Multi-core architectures are increasingly being used in real-time embedded systems. In general, such systems have more processors than shared memory modules, potentially causing severe interference on memory accesses. This resource contention can lead to substantial variation in the memory access latency, and thus wide fluctuation in the overall system performance, which is highly undesirable, especially for time-critical applications. In this paper, we address resource contention and timing predictability for multi-core architectures with distributed memory interconnects. We focus on the locally arbitrated interconnect constructed from pipelined multiplexing stages with local arbitration, whereas the globally arbitrated interconnect, which applies global scheduling to the same architecture, potentially suffers from synchronisation issues and requires strict coordination. Our contributions are mainly threefold: (i) We analyse the resource contention across the memory access data path and present an accurate calculation method to bound the worst-case behaviour. (ii) We compare the average-case behaviour of the locally arbitrated and globally arbitrated architectures with hardware simulations, demonstrating the varying memory latencies caused by resource sharing. (iii) We propose an architectural modification to smooth resource sharing. Evaluations on simulators and FPGA implementations with synthetic memory workloads show that the latency variation is significantly reduced, contributing towards the timing predictability of multi-core systems.

Bringing Inter-Thread Cache Benefits to Federated Scheduling
Corey Tessler (Wayne State University)
Prashant Modekurthy (Wayne State University)
Nathan Fisher (Wayne State University)
Abusayeed Saifullah (Wayne State University)

Full Text Talk Slides
Video Slack Thread

Multiprocessor scheduling of hard real-time tasks modeled by directed acyclic graphs (DAGs) exploits the inherent parallelism presented by the model. For DAG tasks, a node represents a request to execute an object on one of the available processors. In one DAG task, there may be multiple execution requests for one object, each represented by a distinct node. These distinct execution requests offer an opportunity to reduce their combined cache overhead through coordinated scheduling of objects as threads within a parallel task. The goal of this work is to realize this opportunity by incorporating the cache-aware BUNDLE-scheduling algorithm into federated scheduling of sporadic DAG task sets.

This is the first work to incorporate cache overhead into federated scheduling. The result is a modification of the DAG model named the DAG with objects and threads (DAG-OT). Under the DAG-OT model, descriptions of nodes explicitly include their underlying executable object and number of threads. When possible, nodes assigned the same executable object are collapsed into a single node, joining their threads when BUNDLE-scheduled. Compared to the DAG model, the DAG-OT model with cache-aware scheduling reduces the number of cores allocated to individual tasks by approximately 20 percent in the synthetic evaluation and by up to 50 percent on a novel parallel computing platform implementation. By reducing the number of allocated cores, the DAG-OT model is able to schedule a subset of previously infeasible task sets.
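
The "fewer cores" effect can be made concrete with the standard federated-scheduling core bound: a heavy DAG task with total work W, critical-path length L, and deadline D is given ceil((W - L) / (D - L)) dedicated cores. The numbers below are invented; in the DAG-OT setting, collapsing nodes that share an executable object mainly shrinks W by removing repeated cache-reload overhead.

```python
import math

def cores(W, L, D):
    """Standard federated-scheduling core allocation for a high-utilization DAG task."""
    return math.ceil((W - L) / (D - L))

print(cores(W=100, L=10, D=20))  # plain DAG model:                    9 cores
print(cores(W=80,  L=10, D=20))  # DAG-OT with reduced cache overhead: 7 cores
```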

A Holistic Memory Contention Analysis for Parallel Real-Time Tasks under Partitioned Scheduling
Daniel Casini (Scuola Superiore Sant’Anna)
Alessandro Biondi (Scuola Superiore Sant’Anna)
Geoffrey Nelissen (CISTER, ISEP, Polytechnic Institute of Porto)
Giorgio Buttazzo (Scuola Superiore Sant’Anna)

Full Text Talk Slides
Video Slack Thread

When adopting multi-core systems for safety-critical applications, certification requirements mandate bounding the delays incurred in accessing shared resources. This is the case for global memories, whose access is often regulated by memory controllers optimized for average-case performance and not designed to be predictable. As a consequence, worst-case bounds on memory access delays often turn out to be too pessimistic, drastically reducing the advantage of having multiple cores. This paper proposes a fine-grained analysis of the memory contention experienced by parallel tasks running on a multi-core platform.

To this end, an optimization problem is formulated to bound the memory interference by leveraging a three-phase execution model and holistically considering multiple memory transactions issued during each phase.

Experimental results show the advantage in adopting the proposed approach on both synthetic task sets and benchmarks.

CARSS: Client-Aware Resource Sharing and Scheduling for Heterogeneous Applications
Iljoo Baek (Carnegie Mellon University)
Matthew Harding (Carnegie Mellon University)
Akshit Kanda (Carnegie Mellon University)
Kyung Ryeol Choi (George Washington University)
Soheil Samii (General Motors R&D)
Ragunathan (Raj) Rajkumar (Carnegie Mellon University)

Full Text Talk Slides
Video Slack Thread

Hardware accelerators such as GP-GPUs and DSPs are increasingly being used in high-performance real-time and multimedia systems. In fact, many heterogeneous applications including 3D graphics, deep learning, planning, tracking, and other computationally demanding tasks can benefit from the use of hardware accelerators. Such heterogeneity can affect the performance of applications running on the same accelerator in parallel. Prior studies on resource sharing and scheduling on hardware accelerators have not attempted to account for the heterogeneity of applications using hardware accelerators in parallel. In this paper, we provide a portable tagging-based resource management framework for use by heterogeneous applications sharing a single hardware accelerator. We also offer practical insight into how various types of applications use the hardware accelerators differently. We substantiate the feasibility of the tagging-based approach and examine its benefits over a legacy NVIDIA GPU scheduler by presenting case studies on 2 NVIDIA platforms: a GeForce GTX 1070 GPU and a Xavier embedded platform using real-world vision applications. Although we focus on GPUs in this paper, our underlying observations and framework can also be used for sharing execution on other types of hardware accelerators.

Cache-aware response time analysis for real-time tasks with fixed preemption points
Filip Markovic (Mälardalen University)
Jan Carlson (Mälardalen University)
Radu Dobrin (Mälardalen University)

Full Text Talk Slides
Video Slack Thread

In real-time systems that employ preemptive scheduling and cache architecture, it is essential to account as precisely as possible for cache-related preemption delays in the schedulability analysis, as an imprecise estimation may falsely deem the system unschedulable.

In the current state of the art for preemptive scheduling of tasks with fixed preemption points, the existing schedulability analysis relies on an overly pessimistic estimation of cache-related preemption delay, which eventually leads to overly pessimistic schedulability results. In this paper, we propose a novel response time analysis for real-time tasks with fixed preemption points, accounting for a more precise estimation of cache-related preemption delays. The evaluation shows that the proposed analysis significantly dominates the existing approach, consistently identifying more schedulable task sets.
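
As a backdrop for the abstract above, the textbook response-time recurrence with cache-related preemption delay (CRPD) is sketched below; this is the generic recurrence, not the paper's refined analysis, and it charges each activation of a higher-priority task j an assumed CRPD bound gamma_j.

```python
import math

def response_time(tasks, i):
    """tasks: list of (C, T, gamma) with implicit deadlines, sorted by decreasing priority."""
    C_i, T_i, _ = tasks[i]
    R = C_i
    while True:
        R_next = C_i + sum(math.ceil(R / T_j) * (C_j + gamma_j)
                           for C_j, T_j, gamma_j in tasks[:i])
        if R_next > T_i:
            return None                    # deadline (= period) missed
        if R_next == R:
            return R                       # fixed point reached
        R = R_next

tasks = [(1, 5, 0.0), (2, 12, 0.25), (4, 30, 0.5)]
print([response_time(tasks, i) for i in range(len(tasks))])   # [1, 3, 8.25]
```

Tightening the gamma terms, which is what a more precise CRPD estimation buys, directly lowers the fixed point and therefore admits more task sets.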

Slow and Steady: Measuring and Tuning Multicore Interference
Dan Iorga (Imperial College London)
Tyler Sorensen (Princeton University)
John Wickerson (Imperial College London)
Alastair Donaldson (Imperial College London)

Full Text Talk Slides
Video Slack thread

Now ubiquitous, multicore processors provide replicated compute cores that allow independent programs to run in parallel. However, shared resources, such as last-level caches, can cause otherwise-independent programs to interfere with one another, leading to significant and unpredictable effects on their execution time. Indeed, prior work has shown that specially crafted enemy programs can cause software systems of interest to experience orders-of-magnitude slowdowns when both are run in parallel on a multicore processor. This undermines the suitability of these processors for tasks that have real-time constraints, including safety-critical applications.

In this work, we explore the design and evaluation of techniques for empirically testing interference using enemy programs, with an eye towards reliability (how reproducible the interference results are) and portability (how interference testing can be effective across chips). We first show that different methods of measurement yield significantly different magnitudes of, and variation in, observed interference effects when applied to an enemy process that was shown to be particularly effective in prior work. We propose a method of measurement based on percentiles and confidence intervals and show that it provides both competitive and reproducible observations. The reliability of our measurements allows us to explore auto-tuning, where enemy programs are further specialised per architecture. We evaluate three different tuning approaches (random search, simulated annealing, and Bayesian optimisation) on five different multicore chips, spanning x86 and Arm architectures. To show that our tuned enemy programs generalise to applications, we evaluate the slowdowns caused by our approach on the AutoBench and CoreMark benchmark suites. We observe statistically larger slowdowns compared to those from prior work in 35 out of 105 benchmark/board combinations, with the slowdown going up to 3.8× higher in some cases.

Ultimately, we anticipate that our approach will be valuable for ‘first pass’ evaluation when investigating which multicore processors are suitable for real-time tasks.
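
The percentile-and-confidence-interval style of reporting described above can be sketched as follows; the slowdown samples are synthetic, and the choice of the 90th percentile and the bootstrap parameters are assumptions for illustration.

```python
import random

def percentile(data, p):
    s = sorted(data)
    return s[min(len(s) - 1, int(p / 100 * len(s)))]

def bootstrap_ci(data, p=90, reps=2000, alpha=0.05):
    """Bootstrap a confidence interval for the p-th percentile by resampling with replacement."""
    estimates = sorted(percentile(random.choices(data, k=len(data)), p) for _ in range(reps))
    return estimates[int(alpha / 2 * reps)], estimates[int((1 - alpha / 2) * reps) - 1]

random.seed(1)
slowdowns = [1.0 + random.expovariate(2.0) for _ in range(200)]   # victim slowdown per run
print("p90 = %.2fx, 95%% CI = (%.2fx, %.2fx)"
      % (percentile(slowdowns, 90), *bootstrap_ci(slowdowns)))
```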

BRU: Bandwidth Regulation Unit for Real-Time Multicore Processors
Farzad Farshchi (University of Kansas)
Qijing Huang (University of California, Berkeley)
Heechul Yun (University of Kansas)

Full Text Talk Slides
Video Slack Thread

Poor time-predictability of multicore processors is a well-known issue that hinders their adoption in real-time systems due to contention for shared memory resources. In this paper, we present the Bandwidth Regulation Unit (BRU), a drop-in hardware module that enables per-core memory bandwidth regulation at fine-grained time intervals. Additionally, BRU has the ability to regulate the memory access bandwidth of multiple cores collectively to improve bandwidth utilization. Besides eliminating the overhead of software regulation methods, our evaluation results using SD-VBS and synthetic benchmarks show that BRU improves the time-predictability of real-time tasks while letting best-effort tasks better utilize the memory system bandwidth. In addition, we have synthesized our design for a 7nm technology node and show that the chip area overhead of BRU is negligible.
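
A software model of the regulation principle (what BRU implements in hardware, without the overhead of software-based regulation) is sketched below; the interval length and per-core budgets are hypothetical, and BRU's collective regulation of core groups is omitted for brevity.

```python
class Regulator:
    """Per-core access budgets replenished every regulation interval."""
    def __init__(self, interval_us, budgets):          # budgets: core -> accesses per interval
        self.interval_us, self.budgets = interval_us, budgets
        self.window_start, self.used = 0.0, {c: 0 for c in budgets}

    def allow(self, core, now_us):
        if now_us - self.window_start >= self.interval_us:   # new regulation interval
            self.window_start, self.used = now_us, {c: 0 for c in self.budgets}
        if self.used[core] < self.budgets[core]:
            self.used[core] += 1
            return True                                  # request may proceed
        return False                                     # core is throttled until the next interval

reg = Regulator(interval_us=10.0, budgets={"rt_core": 8, "be_core": 4})
print([reg.allow("be_core", t) for t in (0, 1, 2, 3, 4, 11)])
# [True, True, True, True, False, True]: the fifth access exceeds the budget and is
# stalled; the new window starting by t=11 restores it.
```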

Sharing-aware Data Acquisition Scheduling for Multiple Rules in the IoT
Seonyeong Heo (POSTECH)
Seungbin Song (Yonsei University)
Bongjun Kim (POSTECH)
Hanjun Kim (Yonsei University)

Full Text Talk Slides
Video Slack Thread

In Internet-of-Things (IoT) environments, users define event-condition-action (ECA) rules and expect IoT frameworks to evaluate conditions and take appropriate actions within a certain time limit after an event occurs. To evaluate the conditions with fresh data items, the frameworks acquire the required data from IoT sensors. Since data acquisition drains the batteries of sensors, the frameworks should minimize the number of data acquisitions while keeping sensor data fresh until the condition evaluation finishes. However, existing data acquisition schedulers acquire sensor data inefficiently because they assume that the ECA rules in a program are independent of each other, although different rules may share sensing data from the same sensors. This work proposes an efficient sharing-aware data acquisition scheduling algorithm that reduces unnecessary data acquisition by sharing sensor data commonly used in different rules while satisfying time constraints. To evaluate the proposed scheduling algorithm, this work deploys 19 devices in an office, collects values of 26 different sensors for 144 hours, and simulates the proposed algorithm and a baseline algorithm. Compared to the baseline algorithm, the proposed algorithm reduces the communication count and deadline miss ratio by 31.9% and 50.2%, respectively.
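
A toy version of the sharing idea is sketched below (not the paper's scheduler): if several rules need the same sensor and their freshness windows overlap, acquire the value once and reuse it. The rules, windows, and the late-as-possible heuristic are assumptions for illustration.

```python
def acquisitions(requests):
    """requests: (sensor, earliest, latest) acquisition windows, one per rule needing the sensor."""
    count, last_read = 0, {}
    for sensor, lo, hi in sorted(requests, key=lambda r: r[2]):   # earliest-deadline order
        t = last_read.get(sensor)
        if t is not None and lo <= t <= hi:
            continue                       # a fresh-enough reading already exists: share it
        last_read[sensor] = hi             # acquire as late as allowed, maximizing reuse
        count += 1
    return count

shared = [("temp", 0, 5), ("temp", 3, 8), ("co2", 0, 5)]
print(acquisitions(shared))                # 2 acquisitions instead of 3
```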

SubFlow: A Dynamic Induced-Subgraph Strategy Toward Real-Time DNN Inference and Training
Seulki Lee (University of North Carolina at Chapel Hill)
Shahriar Nirjon (University of North Carolina at Chapel Hill)

Full Text Talk Slides
Video Slack Thread

We introduce SubFlow, a dynamic adaptation and execution strategy for a deep neural network (DNN) that enables real-time DNN inference and training. The goal of SubFlow is to complete the execution of a DNN task within a timing constraint that may change dynamically, by executing a subset of the DNN at run-time while ensuring performance comparable to executing the full network. To this end, we propose two online algorithms that enable SubFlow: 1) dynamic sub-network construction, which constructs the best sub-network of the DNN in terms of size and configuration, and 2) time-bound execution, which executes the sub-network within a given time budget for either inference or training. We implement and open-source SubFlow by extending TensorFlow with full compatibility, adding SubFlow operations for the convolutional and fully-connected layers of a DNN. We evaluate SubFlow with three popular DNN models (LeNet-5, AlexNet, and KWS), which shows that it provides flexible run-time execution and increases the utility of a DNN under dynamic timing constraints, e.g., a 1x–6.7x range of dynamic execution speed with an average performance (inference accuracy) difference of -3%. We also implement an autonomous robot as an example system that uses SubFlow and demonstrate that its obstacle-detection DNN is flexibly executed to meet a range of deadlines that varies depending on its running speed.

Bounded-time recovery for distributed real-time systems
Neeraj Gandhi (University of Pennsylvania)
Edo Roth (University of Pennsylvania)
Robert Gifford (University of Pennsylvania)
Linh Thi Xuan Phan (University of Pennsylvania)
Andreas Haeberlen (University of Pennsylvania)

Full Text Talk Slides
Video Slack Thread

This paper explores bounded-time recovery (BTR), a new approach to making cyber-physical systems robust to crash faults. Rather than trying to mask the symptoms of a fault with massive redundancy, BTR detects faults at runtime and enables the system to recover from them – e.g., by transferring tasks to other nodes that are still working correctly. When a fault does occur, there is a brief period of instability during which the system can produce incorrect outputs. However, many cyber-physical systems have physical properties – such as inertia or thermal capacity – that limit the rate at which the state of the system can change; thus, a very brief outage is often acceptable, as long as its duration can be bounded, to perhaps a few milliseconds.

BTR has some interesting properties: for instance, it has a much lower overhead than Paxos, and, unlike Paxos, it can take useful actions even when the system partitions or a majority of the nodes fails. However, it also poses a very unusual scheduling problem that involves creating sets of interrelated schedules for different failure modes. We present a scheduling algorithm called Cascade that can quickly find suitable schedules. Using a prototype implementation, we show that Cascade scales far better than a baseline algorithm and reduces the scheduling time from hours to a few seconds, without sacrificing quality.

DRAMbulism: Balancing Performance and Predictability through Dynamic Pipelining
Reza Mirosanlou (University of Waterloo)
Mohamed Hassan (McMaster University)
Rodolfo Pellizzoni (University of Waterloo)

Full Text Talk Slides
Video Slack Thread

Worst-case execution bounds for real-time programs are highly impacted by the latency of accessing hardware shared resources, such as off-chip DRAM. While many different memory controller designs have been proposed in the literature, there is a trade-off between average-case performance and predictable worst-case bounds, as techniques targeted at improving the former can harm the latter and vice-versa. We find that taking advantage of pipelining between different commands can improve both, but incorporating pipelining effects in worst-case analysis is challenging. In this work, we introduce a novel DRAM controller that successfully balances performance and predictability by employing a dynamic pipelining scheme. We show that the schedule of DRAM commands is akin to a two-stage, two-mode pipeline, and design an easily implementable admission rule that allows us to dynamically add requests to the pipeline without hurting worst-case bounds.

Modeling Contention Interference in Crossbar-based Systems via Sequence-Aware Pairing (SeAP)
Jeremy Giesen (Barcelona Supercomputing Center (BSC))
Pedro Benedicte (Barcelona Supercomputing Center (BSC))
Enrico Mezzetti (Barcelona Supercomputing Center (BSC))
Jaume Abella (Barcelona Supercomputing Center (BSC))
Francisco J. Cazorla (Barcelona Supercomputing Center (BSC))

Full Text Talk Slides
Video Slack thread

The Infineon AURIX TriCore family of microcontrollers has established itself as the reference multicore computing platform for safety-critical systems in the automotive domain. As a distinctive trait, AURIX microcontrollers are designed to promote high timing predictability, as witnessed by the presence of large scratchpad memories and a crossbar interconnect. The latter has been introduced to reduce inter-core interference in accessing the memory system and peripherals. Nonetheless, the crossbar does not prevent requests from different cores to the same target resource from suffering contention. Applications are, therefore, inherently exposed to inter-core timing interference, which needs to be taken into account in the determination of reliable execution time bounds. In this paper, we propose a contention modeling technique for crossbar-based systems, and hence suitable for bounding contention effects in the AURIX family. Unlike state-of-the-art techniques that build on total request counts, we exploit the sequence of requests to the different target resources issued by each core to produce tighter bounds by discarding contention scenarios that cannot occur in practice. To that end, we adapt existing techniques from the pattern matching domain to derive the worst-case contention effects from the sequences of requests each core sends over the crossbar. Results on a wide set of synthetic and real scenarios and benchmarks on an AURIX TC297TX show that our technique consistently outperforms other contention modeling approaches.

The Potential of Programmable Logic in the Middle: Cache Bleaching
Shahin Roozkhosh (Boston University)
Renato Mancuso (Boston University)

Full Text Talk Slides
Video Slack Thread

Consolidating hard real-time systems onto modern multi-core Systems-on-Chip (SoC) is an open challenge. The extensive sharing of hardware resources at the memory hierarchy raises important unpredictability concerns. The problem is exacerbated as more computationally demanding workloads are expected to be handled with real-time guarantees in next-generation Cyber-Physical Systems (CPS). A large body of work has approached the problem by proposing novel hardware re-designs and by proposing software-only solutions to mitigate performance interference.

Starting from the observation that unpredictability arises from a lack of fine-grained control over the behavior of shared hardware components, we outline a promising new resource management approach. We demonstrate that it is possible to introduce Programmable Logic In-the-Middle (PLIM) between a traditional multi-core processor and main memory. This provides the unique capability of manipulating individual memory transactions. We propose a proof-of-concept system implementation of PLIM modules on a commercial multi-core SoC. The PLIM approach is then leveraged to solve long-standing issues with cache coloring. Thanks to PLIM, colored sparse addresses can be re-compacted in main memory. This is the base principle behind the technique we call Cache Bleaching. We evaluate our design on real applications and propose hypervisor-level adaptations to showcase the potential of the PLIM approach.
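
A bit-level sketch of the re-compaction idea follows; the page size, number of colors, and address layout are illustrative assumptions, not the paper's implementation.

```python
# With page coloring, a partition may only use physical pages whose color bits match,
# so its DRAM footprint is sparse; logic placed between the cores and DRAM can strip
# the color bits from each transaction so that DRAM itself is filled contiguously.
PAGE_BITS  = 12            # 4 KiB pages
COLOR_BITS = 2             # 4 cache colors

def bleach(colored_addr):
    """Drop the color field so a partition's sparse colored pages become contiguous."""
    offset = colored_addr & ((1 << PAGE_BITS) - 1)
    page   = colored_addr >> (PAGE_BITS + COLOR_BITS)    # discard the color bits
    return (page << PAGE_BITS) | offset

# A partition restricted to color 1 sees pages 0x1000, 0x5000, 0x9000, ...; after
# bleaching they occupy 0x0000, 0x1000, 0x2000, leaving no DRAM holes.
for addr in (0x1000, 0x5000, 0x9000):
    print(hex(bleach(addr)))
```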

Latency-Aware Generation of Single-Rate DAGs from Multi-Rate Task Sets
Micaela Verucchi (University of Modena and Reggio Emilia, Technical University of Munich)
Mirco Theile (Technical University of Munich)
Marco Caccamo (Technical University of Munich)
Marko Bertogna (University of Modena and Reggio Emilia)

Full Text Talk Slides
Video Slack Thread

Modern automotive and avionics embedded systems integrate several functionalities that are subject to complex timing requirements. A typical application in these fields is composed of sensing, computation, and actuation. The ever-increasing complexity of heterogeneous sensors implies the adoption of multi-rate task models scheduled onto parallel platforms. Aspects like freshness of data or first reaction to an event are crucial for the performance of the system. The Directed Acyclic Graph (DAG) is a suitable model to express the complexity and the parallelism of these tasks. However, deriving age and reaction timing bounds is not trivial when DAG tasks have multiple rates. In this paper, a method is proposed to convert a multi-rate DAG task-set with timing constraints into a single-rate DAG that optimizes schedulability, age, and reaction latency by inserting suitable synchronization constructs. An experimental evaluation is presented for an autonomous driving benchmark, validating the proposed approach against state-of-the-art solutions.
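
The job-level unrolling that underlies such a conversion can be sketched as below; the paper additionally inserts synchronization constructs and optimizes age and reaction latency, and the task names and periods here are hypothetical.

```python
import math
from functools import reduce

tasks = [("camera", 40), ("fusion", 20), ("control", 10)]        # (name, period in ms)
H = reduce(lambda a, b: a * b // math.gcd(a, b),                 # hyperperiod = lcm of periods
           (period for _, period in tasks))

# One single-rate DAG node per job instance released inside the hyperperiod; edges
# (omitted here) would then encode the chosen sensing -> computation -> actuation flow.
jobs = sorted(((name, k * period) for name, period in tasks
               for k in range(H // period)), key=lambda j: j[1])
print("hyperperiod:", H, "ms")
print(jobs)   # [('camera', 0), ('fusion', 0), ('control', 0), ('control', 10), ...]
```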

Real-Time Object Detection System with Multi-Path Neural Networks
Seonyeong Heo (POSTECH)
Sungjun Cho (POSTECH)
Youngsok Kim (Yonsei University)
Hanjun Kim (Yonsei University)

Full Text Talk Slides
Video Slack Thread

Thanks to the recent advances in Deep Neural Networks (DNNs), DNN-based object detection systems have become highly accurate and are widely used in real-time environments such as autonomous vehicles, drones, and security robots. Although these systems should detect objects within a certain time limit that can vary depending on their execution environments, such as vehicle speeds, existing systems blindly execute the entire long-latency DNNs without reflecting the time-varying time limits, and thus they cannot guarantee real-time constraints. This work proposes a novel real-time object detection system that employs multi-path neural networks based on a new worst-case execution time (WCET) model for DNNs on a GPU. This work designs the WCET model for a single DNN layer by analyzing processor and memory contention on GPUs, and extends the WCET model to end-to-end networks. This work also designs the multi-path networks with three new operators, skip, switch, and dynamic generate proposals, that dynamically change their execution paths and the number of target objects. Finally, this work proposes a path decision model that chooses the optimal execution path at run-time, reflecting dynamically changing environments and time constraints. Our detailed evaluation using widely-used driving datasets shows that the proposed real-time object detection system performs as well as a baseline object detection system without violating the time-varying time limits. Moreover, the WCET model predicts the worst-case execution latency of convolutional and group normalization layers with only 28% and 64% errors on average, respectively.
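
The path-decision step can be sketched as follows; the paths, their predicted WCETs, and accuracies are hypothetical, and the rule shown (run the most accurate path whose WCET fits the remaining budget) is a simplification of the paper's path decision model.

```python
PATHS = [                        # (name, predicted WCET in ms, expected accuracy)
    ("full",      95.0, 0.78),
    ("skip-some", 60.0, 0.74),
    ("minimal",   30.0, 0.66),
]

def choose_path(budget_ms):
    feasible = [p for p in PATHS if p[1] <= budget_ms]
    return max(feasible, key=lambda p: p[2])[0] if feasible else "minimal"

print(choose_path(100.0))   # slow vehicle, generous deadline -> "full"
print(choose_path(70.0))    # tighter deadline                -> "skip-some"
print(choose_path(40.0))    # fast vehicle, tight deadline    -> "minimal"
```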

Dissecting the CUDA scheduling hierarchy: a Performance and Predictability Perspective
Ignacio Sañudo (University of Modena and Reggio Emilia)
Nicola Capodieci (University of Modena and Reggio Emilia)
Jorge Luis Martinez Garcia (University of Modena and Reggio Emilia)
Andrea Marongiu (University of Modena and Reggio Emilia)
Marko Bertogna (University of Modena and Reggio Emilia)

Full Text Talk Slides
Video Slack thread

Over the last few years, the ever-increasing use of Graphic Processing Units (GPUs) in safety-related domains has opened up many research problems in the real-time community. The closed and proprietary nature of the scheduling mechanisms deployed in NVIDIA GPUs, for instance, represents a major obstacle in deriving a proper schedulability analysis for latency-sensitive applications. Existing literature addresses these issues by either (i) providing simplified models for heterogeneous CPU-GPU systems and their associated scheduling policies, or (ii) providing insights about these arbitration mechanisms obtained through reverse engineering.

In this paper, we take one step further by correcting and consolidating previously published assumptions about the hierarchical scheduling policies of NVIDIA GPUs and their proprietary CUDA application programming interface. We also discuss how such mechanisms evolved with recently released GPU micro-architectures, and how such changes influence the scheduling models to be exploited by real-time system engineers.