ALCHEM (Architecture Lab for Creative High-performance Energy-efficient Machines) focuses on computer architecture, its interactions with software and systems, and the implications of emerging technologies.
Memory consistency models (MCM) specify the order in which memory accesses performed by one processor become visible to other processors. Sequential Consistency (SC) is a strong and intuitive memory model that requires that the memory accesses of a program appear to have been executed in a global sequential order consistent with each processor's program order. Since the MCM is the interface between software and architecture, it is not sufficient to consider it only at the architecture or the software level. For example, compiler optimizations can easily violate SC even on a machine that implements SC. Therefore, an MCM must be supported "end-to-end" from application to architecture, which involves multiple system layers. I argue that SC is a key factor in ensuring good programmability on shared-memory multicores. An architecture should either implement SC and provide software with mechanisms to ensure "end-to-end" SC, or, if it implements a relaxed MCM, should detect SC violations (SCVs) when they happen.
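The classic store-buffering litmus test makes this concrete. The sketch below (an illustrative simulation, not tied to any particular tool) enumerates all interleavings of two tiny threads: under SC the outcome r0 == r1 == 0 can never occur, but it becomes observable as soon as one thread's store and load are reordered, as a compiler optimization or a hardware store buffer may do.

```python
from itertools import combinations

# Store-buffering litmus test: under SC, the outcome r0 == r1 == 0
# is impossible; if thread 0's store and load are reordered (as a
# compiler or a store buffer may do), it becomes observable.
# Each thread is a list of (op, var) steps: 'st' writes 1, 'ld' reads.
T0 = [('st', 'x'), ('ld', 'y')]   # thread 0: x = 1; r0 = y
T1 = [('st', 'y'), ('ld', 'x')]   # thread 1: y = 1; r1 = x

def outcomes(t0, t1):
    """All (r0, r1) results over every interleaving of t0 and t1."""
    results, n0, n1 = set(), len(t0), len(t1)
    # Choose which slots of the merged schedule belong to thread 0.
    for pos in combinations(range(n0 + n1), n0):
        mem, regs, i0, i1 = {'x': 0, 'y': 0}, {}, 0, 0
        for k in range(n0 + n1):
            if k in pos:
                (op, var), tid, i0 = t0[i0], 0, i0 + 1
            else:
                (op, var), tid, i1 = t1[i1], 1, i1 + 1
            if op == 'st':
                mem[var] = 1
            else:
                regs[tid] = mem[var]
        results.add((regs[0], regs[1]))
    return results

sc_outcomes = outcomes(T0, T1)                   # program order preserved
relaxed = sc_outcomes | outcomes(T0[::-1], T1)   # thread 0 store/load swapped
```

Running this shows (0, 0) absent from the SC outcomes but present once the reordering is allowed, which is exactly the kind of violation an SCV detector must catch.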
The fundamental limitations imposed by the increasing energy consumption of moving vast amounts of data among growing parallel compute resources have led to the so-called "bandwidth taper" prevalent in current system architectures.
Our group demonstrated that the programming model of graph processing affects both data movement and performance. A top-down approach allows consideration of the fundamental trade-offs. We proposed a 3D partitioning scheme that reduces data communication between nodes in distributed graph processing systems (e.g., PowerGraph, PowerLyra).
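As a rough illustration of the idea (a hypothetical sketch, not the published algorithm), a 2D grid partition assigns edge (u, v) to a machine by hashing its endpoints; a third dimension can then split the per-vertex property data across an additional layer of machines, trading replication against communication:

```python
# Illustrative only: P x Q x L machine grid (values are made up).
# A 2D grid partition places edge (u, v) by its endpoints; the
# third "layer" dimension splits each vertex's property vector.
P, Q, L = 2, 2, 2

def owner(u, v, layer):
    """Machine coordinates for one (edge, property-layer) work unit."""
    return (u % P, v % Q, layer % L)
```

With this placement, updates along an edge stay within one row/column of machines, which is the kind of locality the 3D partition exploits to cut inter-node traffic.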
We proposed a more expressive programming model designed for out-of-core graph processing systems (e.g., GraphChi, X-Stream, GridGraph) that allows the implementation of more efficient algorithms. We applied the same approach to optimizing Processing-In-Memory (PIM) based graph processing architectures with fewer and more regular data movements, bridging the programming model and the architecture. Leveraging emerging metal-oxide resistive random access memory (ReRAM), we proposed PipeLayer, a ReRAM-based accelerator with in-situ computation for machine learning and graph processing that fundamentally eliminates a significant amount of data movement, bridging applications and emerging technology.
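To give a flavor of the out-of-core style these systems use, here is a minimal edge-centric scatter/gather step in the spirit of X-Stream (a PageRank-like example with illustrative names, not our proposed model itself): the engine streams every edge sequentially, accumulates updates, and then applies them per vertex, keeping disk access sequential rather than random.

```python
# Hedged sketch of an edge-centric scatter/gather iteration, as used
# by out-of-core engines such as X-Stream. One PageRank-style step.
def edge_centric_step(edges, rank, out_degree, damping=0.85):
    updates = {}
    # Scatter: one sequential pass over the edge stream.
    for u, v in edges:
        updates[v] = updates.get(v, 0.0) + rank[u] / out_degree[u]
    # Gather: apply the accumulated updates per vertex.
    return {v: (1 - damping) + damping * updates.get(v, 0.0)
            for v in rank}
```

A more expressive model can, for example, let the algorithm skip converged edges or fuse passes, which is where the efficiency gains over this baseline loop come from.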
Current durable transactions for NVM are usually either 1) undo-log based, which requires a memory fence for every write, or 2) redo-log based, which must intercept all reads to uncommitted data. We proposed DeDTM, which gets the best of both worlds by decoupling the execution of a durable transaction into three fully asynchronous steps (Perform, Persist, Reproduce). DeDTM adds crash-consistency and durability guarantees to the TM with only 7.4% to 21.6% overhead, which is 2.2x to 4.6x faster than the existing systems Mnemosyne and NVML. In the future, we want to systematically study the applications of NVM in data centers.
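The decoupling can be sketched in software as follows. This is only an analogy of the three steps named above (the actual DeDTM design is runtime/hardware work, and the class and method names here are illustrative): a transaction first executes in volatile memory, then persists its write set as a redo-log entry, and finally replays the log into the NVM home locations asynchronously, so the critical path needs no per-write fence.

```python
# Illustrative analogy of the Perform / Persist / Reproduce steps;
# dicts stand in for DRAM, the persisted redo log, and NVM.
class DurableTX:
    def __init__(self, nvm):
        self.nvm = nvm          # NVM home locations (simulated)
        self.log = []           # persisted redo log (simulated)

    def perform(self, writes, dram):
        dram.update(writes)     # step 1: execute in volatile memory
        return writes

    def persist(self, writes):
        self.log.append(dict(writes))   # step 2: append write set to redo log

    def reproduce(self):
        for ws in self.log:     # step 3: lazily replay the log into NVM
            self.nvm.update(ws)
        self.log.clear()
```

Because Reproduce is asynchronous, a crash after Persist can still recover the transaction from the log, while reads of committed data never need interception.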
Stochastic Computing (SC), which uses a bit-stream to represent a number within [-1, 1] via the fraction of ones in the bit-stream, has high potential for implementing Deep CNNs (DCNNs) with high scalability and an ultra-low hardware footprint. We proposed SC-DCNN, the first comprehensive design and optimization framework for SC-based DCNNs using a bottom-up approach.
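Concretely, in the standard bipolar encoding a stream of length N with k ones represents 2k/N - 1, and multiplication reduces to a single XNOR gate per bit pair (exact only when the two streams are uncorrelated). A small sketch with deterministic streams for clarity:

```python
# Bipolar stochastic computing: a length-N bit-stream with k ones
# encodes the value 2*k/N - 1 in [-1, 1].
def decode(stream):
    return 2 * sum(stream) / len(stream) - 1

# Multiplication is bitwise XNOR (accurate only for uncorrelated
# streams; streams here are fixed lists for illustration).
def sc_multiply(a, b):
    return [1 - (x ^ y) for x, y in zip(a, b)]

a = [1, 1, 1, 1]     # decodes to 1.0
b = [1, 0, 1, 0]     # decodes to 0.0
prod = sc_multiply(a, b)   # decodes to 1.0 * 0.0 = 0.0
```

This is why SC arithmetic units are so cheap in area and power: a multiplier is one gate per bit, at the cost of stream length and stochastic error.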
Our results show that an SC-DCNN implementation of LeNet5 running at 200 MHz consumes only 17 mm2 of area and 1.53 W of power, while achieving a throughput of 781250 images/s, an area efficiency of 21439 images/s/mm2, and an energy efficiency of 221287 images/J.
In many emerging applications, machine learning and graph analytics are ideally performed at the edge (e.g., on mobile or IoT devices) so that decisions can be made, and relationships between events discovered, in the field where they unfold. Unfortunately, existing embedded systems equipped with conventional computing units like CPUs/GPUs cannot efficiently process large data sets or graphs in real time. Currently, most machine learning and graph processing tasks are performed in the cloud, with data sent back and forth between the devices and the data center. This approach incurs extra latency and energy due to data communication and only provides forensic (offline) machine learning or graph analysis. To bridge the gap between end-user application needs and the limited capability of existing systems, our group proposes to develop Embedded Intelligence systems, co-designed with an embedded machine learning and graph processing framework and emerging technologies, including low-cost ReRAM-based and SC-based acceleration. We will consider several key problems on both the device side and the data center side.