Tools and methodologies for post-silicon technologies
As our understanding of new technologies and materials improves, new application domains open up. We also get a better grasp of the potential benefits that these technologies can bring to already established application domains. Accelerating progress in these domains requires programming and design tools, which are the subject of this research line at the CC chair.
Racetrack memories (RTMs) are an exciting new class of emerging non-volatile memory (NVM) technologies that unify the qualities of today's different memory technologies. Nano-scale RTMs promise access latency comparable to SRAM, density comparable to magnetic hard disk drives, and non-volatility, which makes them energy efficient. Since their conception in 2008, RTMs have evolved significantly, with recent versions eliminating major impediments and demonstrating improvements in both device speed and resilience.
A single cell in an RTM is a magnetic nanoribbon called a track. Each track is equipped with one or more magnetic tunnel junction (MTJ) sensors, referred to as access ports, and stores a series of data bits (up to 500) in the form of magnetic domains. Tracks in RTMs can be organized vertically (3D) or horizontally (2D) on the surface of a silicon wafer as shown below.
To access data in an RTM, the desired bits must first be shifted and aligned to the port positions. These shift operations are undesirable for two reasons: (a) they consume energy, and (b) they not only prolong the RTM access latency but also make it variable. The number and impact of these shift operations on the memory subsystem can be mitigated by smart compilation tools and memory system designs.
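To make the cost of shifting concrete, the following minimal sketch counts shifts for a sequence of accesses. It assumes a deliberately simplified model that is not taken from any of our tools: a single track with one access port, where reaching a domain costs one shift per position between it and the domain currently aligned with the port. The function name `count_shifts` is hypothetical.

```python
def count_shifts(accesses, aligned=0):
    """Return the total number of shifts for a sequence of domain indices.

    Simplified model (an assumption): one track, one access port.
    `aligned` is the index of the domain currently under the port.
    """
    total = 0
    for domain in accesses:
        # Shift the track until `domain` sits under the port.
        total += abs(domain - aligned)
        aligned = domain
    return total


if __name__ == "__main__":
    # Same four domains, two different access orders.
    print(count_shifts([0, 7, 1, 6]))  # 0 + 7 + 6 + 5 = 18 shifts
    print(count_shifts([0, 1, 6, 7]))  # 0 + 1 + 5 + 1 = 7 shifts
```

Even in this toy model, the same set of accesses costs 18 or 7 shifts depending on their order, which is why both latency and energy depend on the access pattern.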
At the chair for compiler construction, we are working on compilation and architectural simulation tools that not only optimize RTM performance but also enable the exploration of RTMs at various levels of the memory hierarchy. Together with Stuart Parkin, head of the Max Planck Institute Halle, we have developed an architectural simulation tool and are conducting research on optimizing compilers for RTMs.
RTSim is an architectural-level, cycle-accurate memory simulation framework. It accurately models the shift operations in RTMs and manages both the access ports and the RTM-specific memory command sequence. RTSim is configurable and allows architects to explore the design space of RTMs by varying design parameters such as the number of tracks, domains, and access ports per track. It also implements different access port management policies, which can be selected when simulating RTMs.
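The effect of a port management policy can be illustrated with a small sketch. This is not RTSim code and does not reproduce its policies or configuration format. The `simulate` function, the policy names `"static"` and `"nearest"`, and the parameter values are assumptions made for illustration only. The model also ignores the overhead domains that a real track needs to avoid shifting data off its ends.

```python
def simulate(accesses, port_positions, policy, domains_per_track=64):
    """Count shifts on one track with several ports under a toy policy.

    The track state is `offset`: domain d currently sits at physical
    position d + offset. A port at position p can read d when d + offset == p.
    """
    offset = 0
    total = 0
    region = domains_per_track // len(port_positions)
    for d in accesses:
        if policy == "static":
            # Each port serves a fixed, contiguous region of domains.
            index = min(d // region, len(port_positions) - 1)
            port = port_positions[index]
        elif policy == "nearest":
            # Pick the port that needs the fewest shifts right now.
            port = min(port_positions, key=lambda p: abs((p - d) - offset))
        else:
            raise ValueError(f"unknown policy: {policy}")
        target = port - d  # offset at which d is aligned with the port
        total += abs(target - offset)
        offset = target
    return total


if __name__ == "__main__":
    trace = [30, 33]   # two neighbouring domains in different port regions
    ports = [16, 48]   # hypothetical port positions
    print(simulate(trace, ports, "static"))
    print(simulate(trace, ports, "nearest"))
```

For this trace, the static assignment switches ports between the two accesses and pays for a long shift, while the nearest-port choice keeps using the port that is already close.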
For computer architects and memory researchers, RTSim provides a solid foundation to explore and implement various architectural optimizations for RTMs. For instance, the latency of shift operations can be effectively hidden via pre-shifting. Similarly, smart memory controllers can be designed that promote frequently accessed data objects to domains closer to the access ports and/or reorder memory requests based on the access port positions, with the aim of minimizing the total number of shifts.
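Request reordering can be sketched in a few lines. The example below is a hedged illustration, not a controller design from RTSim: it assumes a single track with one port, a window of independent pending requests, and a greedy "nearest domain first" rule. The function names are hypothetical, and a real controller would also have to respect data dependencies and avoid starving distant requests.

```python
def total_shifts(order, aligned=0):
    """Shifts needed to serve `order` on one track with one port."""
    total = 0
    for domain in order:
        total += abs(domain - aligned)
        aligned = domain
    return total


def reorder_nearest_first(pending, aligned=0):
    """Greedily serve the pending request closest to the aligned domain."""
    remaining = list(pending)
    order = []
    while remaining:
        nxt = min(remaining, key=lambda d: abs(d - aligned))
        remaining.remove(nxt)
        order.append(nxt)
        aligned = nxt
    return order


if __name__ == "__main__":
    window = [12, 2, 11, 3, 10]          # hypothetical pending requests
    print(total_shifts(window))           # arrival order: 46 shifts
    reordered = reorder_nearest_first(window)
    print(reordered)                      # [2, 3, 10, 11, 12]
    print(total_shifts(reordered))        # 12 shifts
```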
RTSim is developed on top of the well-known memory simulator NVMain 2.0, which not only enables the simulation of other NVMs but also features connections to system simulators such as gem5, enabling full-system simulation. RTSim is open source and is hosted on GitHub.
- A. A. Khan, F. Hameed, R. Bläsing, S. Parkin and J. Castrillon, "RTSim: A Cycle-Accurate Simulator for Racetrack Memories", in IEEE Computer Architecture Letters, vol. 18, no. 1, pp. 43-46, 1 Jan.-June 2019.
Most existing methods introduce hardware extensions to abate the shifting overhead in RTMs. The additional hardware in such methods not only consumes the precious chip budget but also induces latency and energy overheads. In addition, since hardware solutions are blind to the memory access behavior of the running application(s), they in some cases might under-perform and result in low energy efficiency.
Software solutions such as compiler-guided data placement optimize RTM performance and energy consumption by leveraging knowledge of the application's memory access pattern. At the chair for compiler construction, we have developed a set of data placement techniques for RTMs that maximize the likelihood that consecutive references access the same or nearby memory locations at runtime, thereby minimizing the number of shifts. We have formulated the data placement problem in RTMs as an integer linear program (ILP) and have developed a novel heuristic called ShiftsReduce that provides a near-optimal solution [1]. Combined with a genetic search, it reduced RTM shifts by up to 52.5% in our experiments. While ShiftsReduce targets a specific RTM architecture, our generalized heuristic in [2] finds comparable solutions in an architecture-independent way.
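The data placement problem itself can be stated compactly. The sketch below is neither the ILP formulation nor the ShiftsReduce heuristic. It only illustrates the objective on a tiny, made-up access trace, using exhaustive search over all placements as a stand-in for an exact solver. It assumes one track with one port, one variable per domain, and ignores the initial alignment. All names are hypothetical.

```python
from itertools import permutations


def shift_cost(trace, placement):
    """Shifts between consecutive accesses for a given variable order."""
    position = {var: i for i, var in enumerate(placement)}
    return sum(abs(position[a] - position[b])
               for a, b in zip(trace, trace[1:]))


def best_placement(trace):
    """Exhaustive search; only feasible for a handful of variables."""
    variables = sorted(set(trace))
    return min(permutations(variables), key=lambda p: shift_cost(trace, p))


if __name__ == "__main__":
    # Hypothetical trace: a and d alternate first, then b and c.
    trace = ["a", "d", "a", "d", "b", "c", "b", "c"]
    declared = ("a", "b", "c", "d")
    print(shift_cost(trace, declared))    # 14 shifts in declaration order
    best = best_placement(trace)
    print(best, shift_cost(trace, best))  # 7 shifts for the best order
```

Placing variables that are accessed back to back in adjacent domains halves the shift count here. Exhaustive search grows factorially with the number of variables, which is why heuristics are needed for realistic programs.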
We have also investigated data layouts for high-dimensional tensor data structures in RTM-based scratchpad memories (SPMs). We examined strategies to improve performance and energy efficiency for the specific case of tensor contraction operations. Compiler optimizations such as schedule and data layout transformations, paired with suitable architectural support, were employed to avoid unnecessary shifts in RTMs. The proposed optimizations not only reduced the number of RTM shifts to the absolute minimum but also enabled guaranteed single-cycle SPM accesses [3, 4]. Our experimental results showed that the proposed optimizations were necessary for RTMs to outperform SRAMs.
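Why the data layout matters for a contraction can be seen in a toy model. The sketch below is not the layout or schedule from [3, 4]. It assumes that the operand B of C[i, j] += A[i, k] * B[k, j] is stored linearly on a single track with one port, and it counts only the shifts caused by accesses to B. The function name and the layout labels are hypothetical.

```python
def shifts_for_b(m, n, k_dim, layout):
    """Count shifts for accesses to B in C[i, j] += A[i, k] * B[k, j].

    B has shape (k_dim, n) and is mapped to domain indices either
    row by row ("row_major") or column by column ("column_major").
    """
    def domain(k, j):
        if layout == "row_major":
            return k * n + j
        if layout == "column_major":
            return j * k_dim + k
        raise ValueError(f"unknown layout: {layout}")

    aligned = 0
    total = 0
    for i in range(m):
        for j in range(n):
            for k in range(k_dim):  # innermost loop walks down a column of B
                d = domain(k, j)
                total += abs(d - aligned)
                aligned = d
    return total


if __name__ == "__main__":
    print(shifts_for_b(2, 2, 2, "row_major"))     # 13 shifts
    print(shifts_for_b(2, 2, 2, "column_major"))  # 9 shifts
```

With the column-major mapping, the innermost loop touches neighbouring domains, so the remaining shifts come only from rewinding the track between iterations of the outer loop.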
Motivated by the general-purpose heuristics in [1, 2] and the application-specific solutions in [3, 4], we extended Polly, the polyhedral optimizer in LLVM, and developed an end-to-end automatic compilation framework that generates RTM-efficient code [5]. The RTM compiler takes an input program, analyzes its memory access pattern for potential shift optimizations, and transforms the schedule and layout in a way that reduces (long) shifts in the RTM. This is joint work with Tobias Grosser and Torsten Hoefler.
In addition to the data placement optimizations, we have, in collaboration with researchers from the Tampere University of Technology, proposed shift-reducing instruction memory placement (SHRIMP), an efficient instruction placement strategy that exploits the sequentiality of both the instruction stream and the RTMs [6]. With negligible memory overhead, our experiments demonstrated a reduction of up to 40% in the number of RTM shifts and a best-case average reduction of 23% in total cycle counts.
- Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019.
- Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020.
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019.
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020.
- Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 39, no. 11, pp. 3968-3980, Oct 2020.
- Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6 pp., New York, NY, USA, Jul 2019.