Graphics processing units (GPUs) are widely used for high-throughput general-purpose computing because of their high power efficiency. As GPU programming models such as OpenCL and NVIDIA CUDA improve, GPUs are becoming a better platform for general-purpose applications with regular parallelism.
Our research group has made contributions that make GPU architectures more application-friendly. G-TSC [HPCA'18] is a novel cache coherence protocol for GPUs based on timestamp ordering. Hardware cache coherence is critical when GPUs are used to accelerate applications with irregular parallelism (e.g., graph processing). G-TSC conducts its coherence transactions in logical time: it does not require a globally synchronized clock, and it avoids execution stalls by logically scheduling operations in the future with appropriate timestamps. [IPDPS'17] proposes a sharing-aware scheduler for cooperative thread arrays (CTAs) that attempts to assign CTAs that share data to the same streaming multiprocessor (SM), reducing redundant copies of the same data in the private L1 caches of different SMs. We further enhance the scheduler with a sharing-aware cache allocation and replacement policy.
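To make the idea of sharing-aware CTA scheduling more concrete, the following is a minimal sketch in Python. It is not the algorithm from [IPDPS'17]; it is a simple greedy heuristic written for illustration. The function names (`sharing_aware_assign`, `round_robin_assign`, `l1_footprint`), the representation of each CTA as a set of data block identifiers, and the fixed number of CTA slots per SM are all assumptions introduced here.

```python
def sharing_aware_assign(cta_blocks, num_sms, slots_per_sm):
    """Greedy placement (illustrative assumption, not the published scheduler).

    cta_blocks: dict mapping a CTA id to the set of data blocks it touches.
    Returns a dict mapping each SM index to the list of CTA ids placed on it.
    """
    sm_ctas = {sm: [] for sm in range(num_sms)}
    sm_data = {sm: set() for sm in range(num_sms)}
    for cta in sorted(cta_blocks):
        blocks = cta_blocks[cta]
        # Only SMs with a free CTA slot are candidates.
        candidates = [sm for sm in range(num_sms)
                      if len(sm_ctas[sm]) < slots_per_sm]
        if not candidates:
            raise ValueError("not enough CTA slots for all CTAs")
        # Prefer the SM whose resident CTAs already touch the most of this
        # CTA's blocks; break ties by lighter load, then by lower SM index.
        best = max(candidates,
                   key=lambda sm: (len(blocks & sm_data[sm]),
                                   -len(sm_ctas[sm]),
                                   -sm))
        sm_ctas[best].append(cta)
        sm_data[best] |= blocks
    return sm_ctas


def round_robin_assign(cta_blocks, num_sms):
    """Sharing-oblivious baseline: the i-th CTA goes to SM i mod num_sms."""
    sm_ctas = {sm: [] for sm in range(num_sms)}
    for i, cta in enumerate(sorted(cta_blocks)):
        sm_ctas[i % num_sms].append(cta)
    return sm_ctas


def l1_footprint(assignment, cta_blocks):
    """Total number of distinct blocks held across all private L1 caches."""
    total = 0
    for ctas in assignment.values():
        resident = set()
        for cta in ctas:
            resident |= cta_blocks[cta]
        total += len(resident)
    return total


# Hypothetical workload: CTAs 0 and 1 share blocks A and B,
# CTAs 2 and 3 share blocks C and D.
example = {0: {"A", "B"}, 1: {"A", "B"}, 2: {"C", "D"}, 3: {"C", "D"}}
aware = sharing_aware_assign(example, num_sms=2, slots_per_sm=2)
baseline = round_robin_assign(example, num_sms=2)
print(aware, l1_footprint(aware, example))
print(baseline, l1_footprint(baseline, example))
```

In this toy workload the greedy placement co-locates the CTAs that share data, so each block is cached in only one L1, whereas the round-robin baseline ends up with every block duplicated in both L1 caches. A real scheduler must also discover which CTAs share data and work within hardware resource limits, which this sketch does not model.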