Tools and methodologies for post-silicon technologies
As our understanding of new technologies and materials improves, new application domains open up, and we get a better grasp of the benefits these technologies can bring to established application domains. To accelerate progress in these domains, programming and design tools are needed, which are the focus of this research line at the CC chair.
Racetrack memories (RTMs) are an exciting new class of emerging non-volatile memory (NVM) technologies that unify the qualities of today's different memory technologies. Nanoscale RTMs promise access latency comparable to SRAM, the density of magnetic hard disk drives, and non-volatility, which makes them energy efficient. Since their conception in 2008, RTMs have evolved significantly, with recent versions eliminating major impediments and demonstrating improvements in both device speed and resilience.
A single cell in an RTM is a magnetic nanoribbon called a track. Each track is equipped with one or more magnetic tunnel junction (MTJ) sensors, referred to as access ports, and stores a series of data bits (up to 500) in the form of magnetic domains. Tracks in RTMs can be organized vertically (3D) or horizontally (2D) on the surface of a silicon wafer as shown below.
To access data in an RTM, the desired bits must first be shifted and aligned to the port positions. These shift operations are undesirable for two reasons: (a) they consume energy, and (b) they not only prolong the RTM access latency but also make it variable. The number and impact of these shift operations can be reduced through smart compilation tools and memory system designs.
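The cost of shifting can be illustrated with a toy model. The sketch below is not how RTSim or any real device models shifts; it assumes a single track with one access port, counts one shift per domain moved, and uses hypothetical names (`Track`, `shifts_per_access`) chosen only for this example.

```python
class Track:
    """Toy single-port racetrack; all names here are hypothetical."""

    def __init__(self, num_domains):
        self.num_domains = num_domains
        self.aligned = 0  # index of the domain currently under the access port

    def access(self, domain):
        """Shift until `domain` sits under the port; return the shifts used."""
        if not 0 <= domain < self.num_domains:
            raise IndexError(f"domain {domain} is outside the track")
        # Assumption: one shift moves the whole track by exactly one domain.
        shifts = abs(domain - self.aligned)
        self.aligned = domain
        return shifts


def shifts_per_access(num_domains, trace):
    """Replay a trace of domain indices on a fresh track."""
    track = Track(num_domains)
    return [track.access(domain) for domain in trace]


if __name__ == "__main__":
    per_access = shifts_per_access(8, [0, 3, 3, 7, 1])
    print(per_access)       # [0, 3, 0, 4, 6]
    print(sum(per_access))  # 13
```

The per-access list shows both effects named above: every shift costs energy, and the number of shifts (and hence the latency) differs from one access to the next.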
At the chair for compiler construction, we are working on compilation and architectural simulation tools that not only optimize the performance of RTMs but also enable their exploration at various levels of the memory hierarchy. Together with Stuart Parkin, head of the Max Planck Institute Halle, we have developed an architectural simulation tool and are conducting research on optimizing compilers for RTMs.
RTSim is an architectural-level, cycle-accurate memory simulation framework. It accurately models the shift operations in RTMs and manages the access ports and the RTM-specific memory command sequence. RTSim is configurable and allows architects to explore the design space of RTMs by varying design parameters such as the number of tracks, domains, and access ports per track. It also implements different access port management policies to choose from when simulating RTMs.
For computer architects and memory researchers, RTSim provides a solid foundation to explore and implement various architectural optimizations for RTMs. For instance, the latency of shift operations can be effectively hidden via pre-shifting. Similarly, smart memory controllers can be designed that promote frequently accessed data objects to domains closer to the access ports and/or reorder memory requests based on the access port positions, with the aim of minimizing the total number of shifts.
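As an illustration of request reordering, the sketch below compares serving a batch of requests in arrival order with a greedy nearest-first order. It is only a sketch under stated assumptions: a single track with one port, one shift per domain moved, and requests that are independent of each other. The function names are hypothetical and do not correspond to any RTSim policy.

```python
def shifts_in_order(requests, start=0):
    """Total shifts when requests are served in the given order."""
    position, total = start, 0
    for domain in requests:
        total += abs(domain - position)
        position = domain
    return total


def nearest_first(requests, start=0):
    """Greedy reordering: always serve the pending request closest to the port.

    Assumption: the requests are independent, so any order is legal.
    """
    pending = list(requests)
    order = []
    position = start
    while pending:
        nxt = min(pending, key=lambda domain: abs(domain - position))
        pending.remove(nxt)
        order.append(nxt)
        position = nxt
    return order


if __name__ == "__main__":
    batch = [6, 1, 7, 2]
    print(shifts_in_order(batch))                 # 22
    print(nearest_first(batch))                   # [1, 2, 6, 7]
    print(shifts_in_order(nearest_first(batch)))  # 7
```

A greedy order like this is not guaranteed to be optimal, and a real controller would also have to respect ordering constraints between dependent reads and writes.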
RTSim is built on top of the well-known memory simulator NVMain 2.0, which not only enables the simulation of other NVMs but also connects to system simulators such as gem5, enabling full-system simulation. RTSim is open source and hosted on GitHub.
- A. A. Khan, F. Hameed, R. Bläsing, S. Parkin and J. Castrillon, "RTSim: A Cycle-Accurate Simulator for Racetrack Memories", in IEEE Computer Architecture Letters, vol. 18, no. 1, pp. 43-46, 1 Jan.-June 2019.
Most existing methods introduce hardware extensions to abate the shifting overhead in RTMs. The additional hardware in such methods not only consumes the precious chip budget but also induces latency and energy overheads. In addition, since hardware solutions are blind to the memory access behavior of the running application(s), they in some cases might under-perform and result in low energy efficiency.
Software solutions such as compiler-guided data placement optimize RTM performance and energy consumption by leveraging knowledge of the application's memory access pattern. We at the chair for compiler construction have developed a set of data placement techniques for RTMs that maximize the likelihood that consecutive references access the same or nearby memory locations at runtime, thereby minimizing the number of shifts. We have formulated the data placement problem in RTMs as an integer linear program (ILP) and have developed a novel heuristic called ShiftsReduce that provides a near-optimal solution [1]. When combined with a genetic search, our experimental results showed a reduction in RTM shifts of up to 52.5%. While ShiftsReduce targets a specific RTM architecture, our generalized heuristic in [2] finds comparable solutions in an architecture-independent way.
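The objective behind data placement can be made concrete with a small example. The sketch below is neither the ILP formulation nor ShiftsReduce: it assumes a single track with one port, places one variable per domain, ignores the cost of the first alignment, and uses exhaustive search as a stand-in optimizer that is only feasible for a handful of variables. All names are hypothetical.

```python
from itertools import permutations


def shift_cost(trace, placement):
    """Shifts needed to replay `trace` when variables sit in `placement` order.

    Assumption: variable placement[i] occupies domain i of a single-port track.
    """
    position = {var: index for index, var in enumerate(placement)}
    return sum(abs(position[a] - position[b]) for a, b in zip(trace, trace[1:]))


def best_placement(trace):
    """Exhaustive search over all placements (tiny inputs only)."""
    variables = sorted(set(trace))
    return min(permutations(variables), key=lambda p: shift_cost(trace, p))


if __name__ == "__main__":
    trace = ["a", "c", "a", "c", "b", "d", "b", "d"]
    print(shift_cost(trace, ("a", "b", "c", "d")))  # 13
    best = best_placement(trace)
    print(best, shift_cost(trace, best))            # ('a', 'c', 'b', 'd') 7
```

Placing variables that are accessed back to back in neighboring domains roughly halves the shifts in this example. Since the number of placements grows factorially with the number of variables, exhaustive search does not scale, which is why an ILP or a heuristic is used in practice.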
We have also investigated data layouts for high-dimensional tensor data structures and non-linear tree data structures in RTM-based scratchpad memories (SPMs). We examined strategies to improve performance and energy efficiency for the specific cases of tensor contraction operations and decision trees. For tensor contraction, compiler optimizations such as schedule and data layout transformations, paired with suitable architectural support, were employed to avoid unnecessary shifts in RTMs. The proposed optimizations not only reduced the number of RTM shifts to the absolute minimum but also enabled guaranteed single-cycle SPM accesses [3, 4]. Our experimental results showed that the proposed optimizations were mandatory to make RTMs outperform SRAMs. For decision trees, in collaboration with TU Dortmund, we exploited domain knowledge, i.e., the node access probabilities, to map nodes that are accessed close together in time to successive locations in the RTM [7].
Motivated by the general-purpose heuristics in [1, 2] and the application-specific solutions in [3, 4, 7], we extended Polly, the polyhedral optimizer in LLVM, and developed an end-to-end automatic compilation framework that generates RTM-efficient code [5]. The RTM compiler takes an input program, analyzes the memory access pattern for potential shift optimizations, and transforms the schedule/layout in a way that reduces (long) shifts in the RTM. This is joint work with Tobias Grosser and Torsten Hoefler.
In addition to the data placement optimizations, we have also proposed, in collaboration with researchers from the Tampere University of Technology, shift-reducing instruction memory placement (SHRIMP), an efficient instruction placement strategy that best exploits the sequentiality of both the instruction stream and the RTMs [6]. With negligible memory overhead, our experimental results demonstrated a reduction of up to 40% in the number of RTM shifts and a best-case average reduction of 23% in total cycle counts.
- Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart S. P. Parkin, Jeronimo Castrillon, "ShiftsReduce: Minimizing Shifts in Racetrack Memory 4.0", In ACM Transactions on Architecture and Code Optimization (TACO), ACM, vol. 16, no. 4, pp. 56:1–56:23, New York, NY, USA, Dec 2019.
- Asif Ali Khan, Andrés Goens, Fazal Hameed, Jeronimo Castrillon, "Generalized Data Placement Strategies for Racetrack Memories", Proceedings of the 2020 Design, Automation and Test in Europe Conference (DATE), IEEE, pp. 1502–1507, Mar 2020.
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads", Proceedings of the 20th ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, Tools and Theory of Embedded Systems (LCTES), ACM, pp. 5–18, New York, NY, USA, Jun 2019.
- Asif Ali Khan, Norman A. Rink, Fazal Hameed, Jeronimo Castrillon, "Optimizing Tensor Contractions for Embedded Devices with Racetrack and DRAM Memories", In ACM Transactions on Embedded Computing Systems (TECS), Association for Computing Machinery, vol. 19, no. 6, New York, NY, USA, Sep 2020.
- Asif Ali Khan, Hauke Mewes, Tobias Grosser, Torsten Hoefler, Jeronimo Castrillon, "Polyhedral Compilation for Racetrack Memories", In IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), vol. 39, no. 11, pp. 3968-3980, Oct 2020.
- Joonas Multanen, Asif Ali Khan, Pekka Jääskeläinen, Fazal Hameed, Jeronimo Castrillon, "SHRIMP: Efficient Instruction Delivery with Domain Wall Memory", Proceedings of the International Symposium on Low Power Electronics and Design, ACM, 6 pp., New York, NY, USA, Jul 2019.
- Christian Hakert, Asif Ali Khan, Kuan-Hsun Chen, Fazal Hameed, Jeronimo Castrillon, Jian-Jia Chen, "BLOwing Trees to the Ground: Layout Optimization of Decision Trees on Racetrack Memory", Proceedings of the 58th ACM/IEEE Design Automation Conference (DAC), pp. 1111–1116, 2021.
Near-memory and in-memory computing
Emerging application domains such as machine learning and computational genomics require processing mammoth volumes of data and hence demand significantly higher off-chip memory bandwidth. In conventional CMOS-based Von-Neumann machines, increasing the off-chip bandwidth is becoming increasingly expensive and is strictly constrained by the chip package and system models. In contrast, non-Von-Neumann system models like computing-near-memory (CNM) and computing-in-memory (CIM) have shown great promise, outperforming conventional Von-Neumann system models by orders of magnitude in terms of both latency and energy consumption. The idea is to bring computations closer to the data, or to process data where it makes the most sense.
At the chair for compiler construction, we are exploring CNM and CIM systems based on various memory technologies, for different use cases. We are also developing tools and software methods for the design space exploration and effective utilization of these systems.
We at the chair for compiler construction have developed CNM systems for pre-alignment filtering in genome analysis. We have proposed ALPHA, a co-designed filtering solution based on conventional DRAM that minimizes the number of memory accesses, improving performance and reducing energy consumption. We have recently proposed FIRM, a CNM system based on the emerging racetrack memory (see the figure below). We have demonstrated that, with intelligent system design, RTMs outperform DRAM by more than 50% in terms of total runtime and overall energy consumption. This is joint work with Sebastien Ollivier and Alex K. Jones from the University of Pittsburgh.
- Fazal Hameed, Asif Ali Khan and Jeronimo Castrillon, "ALPHA: A Novel Algorithm-Hardware Co-design for Accelerating DNA Seed Location Filtering", in IEEE Transactions on Emerging Topics in Computing, doi: 10.1109/TETC.2021.3093840.
- Fazal Hameed, Asif Ali Khan, Sebastien Ollivier, Alex K. Jones, and Jeronimo Castrillon, "DNA Pre-alignment Filter using Processing Near Racetrack Memory", arXiv preprint arXiv:2205.02046 (2022).
Computing in-memory (CIM), unlike CNM, does not require dedicated CMOS logic for computations. Instead, computation and storage are both performed directly in memory using the properties of the memory devices. Memristor crossbars (based on phase-change memory and resistive RAM) have attracted significant interest due to their ability to efficiently perform matrix-matrix and matrix-vector multiplications, the dominant computational kernels in machine learning (deep neural networks). RTMs, on the other hand, have demonstrated their dominance in efficiently implementing various logic operations.
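The reason a crossbar maps so naturally to matrix-vector multiplication can be sketched in a few lines. The model below is idealized and is an assumption of this example, not a description of any particular device: each cell is a pure conductance, input voltages are applied to the rows, and each column current is the sum of the voltage-conductance products along that column (Ohm's law plus Kirchhoff's current law). Wire resistance, noise, and converter precision are ignored, and the function name is hypothetical.

```python
def crossbar_mvm(conductances, voltages):
    """Column currents of an idealized crossbar.

    conductances[i][j]: conductance of the cell at row i, column j.
    voltages[i]:        voltage applied to row i.
    Returns I[j] = sum_i voltages[i] * conductances[i][j].
    """
    rows = len(conductances)
    if rows != len(voltages):
        raise ValueError("one input voltage per crossbar row is required")
    cols = len(conductances[0]) if rows else 0
    return [
        sum(voltages[i] * conductances[i][j] for i in range(rows))
        for j in range(cols)
    ]


if __name__ == "__main__":
    G = [[1.0, 2.0],
         [3.0, 4.0]]
    V = [0.5, 1.0]
    print(crossbar_mvm(G, V))  # [3.5, 5.0]
```

In the physical array all column currents form at the same time, whereas this software loop computes them one product after another.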
For memristor-based CIM systems, we developed the open CIM compiler (OCC), a multi-level intermediate representation (MLIR) based compilation framework that transparently detects and offloads computational primitives to memristor blocks in the CIM system. OCC leverages the hierarchical abstractions of the MLIR compiler infrastructure (see below) to perform code matching as well as device-agnostic and device-specific code transformations at the most appropriate level. This is joint work with the University of Eindhoven, the University of Edinburgh, Inria France, and the University of Oklahoma.
For RTM-based CIM accelerators, we explored RTM architectures to implement the entire hyperdimensional computing (HDC) use case. We implemented the XOR and pop count operations using the RTM device properties and proposed an RTM nanowire-based counting mechanism. Since shifting is an inherent property of the RTM, we proposed mapping strategies that minimize the number of RTM shifts.
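To show why XOR and pop count are the operations of interest, the sketch below works through them in plain software. It assumes binary hypervectors stored as Python integers and Hamming distance as the similarity measure, which is one common HDC formulation. It says nothing about how the operations are realized inside the RTM, and all names are hypothetical.

```python
def popcount(x):
    """Number of set bits in a non-negative integer."""
    return bin(x).count("1")


def hamming(a, b):
    """Hamming distance between two binary hypervectors: popcount of their XOR."""
    return popcount(a ^ b)


def classify(query, class_vectors):
    """Return the label whose hypervector is closest to `query`."""
    return min(class_vectors, key=lambda label: hamming(query, class_vectors[label]))


if __name__ == "__main__":
    classes = {"A": 0b11110000, "B": 0b00001111}
    query = 0b11010000
    print(hamming(query, classes["A"]))  # 1
    print(hamming(query, classes["B"]))  # 7
    print(classify(query, classes))      # A
```

Real hypervectors typically have thousands of bits, so these two primitives are executed on very wide operands.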
Presently, we are investigating CIM systems that integrate multiple, heterogeneous CIM accelerators. We are developing software methodologies that progressively lower different functions in the input kernels to different CIM accelerators, depending on the underlying hardware.
- Adam Siemieniuk, Lorenzo Chelini, Asif Ali Khan, Jeronimo Castrillon, Andi Drebes, Henk Corporaal, Tobias Grosser, and Martin Kong, "OCC: An Automated End-to-End Machine Learning Optimizing Compiler for Computing-In-Memory", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2021).
- Asif Ali Khan, Sebastien Ollivier, Stephen Longofono, Gerald Hempel, Jeronimo Castrillon, and Alex K. Jones. "Brain-inspired Cognition in Next Generation Racetrack Memories", ACM Transactions on Embedded Computing Systems (TECS) (2022).