General Memory Specialization for Massive Multi-Cores


Mr. WANG Zhengrong
Postdoc researcher
University of California, Los Angeles (UCLA)


Meeting ID: 972 0049 1693 // Passcode: 202400

(Students must log in with their CUHK account for a valid attendance record)

In the last two decades, computer architects have relied heavily on specialization and scaling up to keep improving performance and energy efficiency as Moore’s law fades away. The former customizes the system for particular program behaviors (e.g., the neural engine in Apple chips to accelerate machine learning), while the latter has evolved into massive multi-core systems (e.g., 96 cores for the AMD EPYC 9654 CPU).

This works until we hit the “memory wall”: as modern systems continue to scale up, data movement has increasingly become the bottleneck. Unfortunately, conventional memory systems are extremely inefficient at reducing data movement, suffering from excessive NoC traffic and limited off-chip bandwidth when bringing data to the computing cores.

These inefficiencies originate from an inherently core-centric view: the memory hierarchy simply reacts to individual requests from the core but is unaware of high-level program behaviors. This leaves the hardware oblivious, as it must guess highly irregular and transient memory semantics from the primitive memory abstraction of simple load and store instructions.

This calls for a fundamental redesign of the memory interface to express rich memory semantics, so that the memory system can promptly adjust to evolving program behaviors and efficiently orchestrate data and computation together throughout the entire system. For example, simple computations can be directly associated with memory requests and naturally distributed across the memory hierarchy without bringing all the data to the core. More importantly, the new interface should integrate seamlessly with conventional von Neumann ISAs, enabling end-to-end memory specialization while maintaining generality and transparency. In this talk, I will discuss our solution for enabling general memory specialization in massive multi-core systems, which unlocks order-of-magnitude improvements in speed and energy efficiency on plain-C programs. Such data-computation orchestration is the key to continued scaling of performance and energy efficiency.


Zhengrong is currently a postdoc researcher at UCLA. His research aims to build general, automatic, and end-to-end near-data acceleration by revolutionizing the orchestration between data and computation throughout the entire system. His open-source work has been accepted by multiple top-tier conferences in computer architecture, including ISCA, MICRO, ASPLOS, and HPCA, and has been recognized with Best Paper Runner-Up awards as well as IEEE Micro Top Picks Honorable Mentions. He is also one of the maintainers of gem5, a widely used cycle-accurate simulator in computer architecture.


Ms. FUNG Wing Chi Mary

Mr. WONG O-Bong


Apr 24, 2024


10:00 am - 11:00 am


