
Qingcai Jiang (姜庆彩)


Last Modified: 2023.11



Hello! I am Qingcai Jiang, a Ph.D. student at the University of Science and Technology of China. I study computer architecture under the supervision of Prof. Hong An. I have broad interests in computer architecture, parallel computing, and workload characterization. I am currently visiting Prof. Onur Mutlu's research group at ETH Zurich.


  • Ph.D. Student in Computer Architecture. University of Science and Technology of China. Advisor: Hong An. September 2019 - Present.
  • B.S. in Computer Science. University of Science and Technology of China. Advisor: Hong An. September 2015 - June 2019.

Research Experiences

Accelerate Linear-Response Time-Dependent Density Functional Theory (LR-TDDFT) calculations using a multi-GPU platform. January 2019 ~ August 2019.

  • Port LR-TDDFT calculations to an 8x V100 GPU server and design shared memory and mixed-precision techniques, among others, to improve parallel efficiency.
  • Design a pipelining strategy to handle large-scale GEMM and MPI_Reduce operations for overlapping computation and communication.
  • Output: [1] in Selected Publications.

Accelerate LR-TDDFT calculations by utilizing low-rank approximation and an iterative eigensolver. November 2019 ~ August 2020.

  • Develop a K-Means based Interpolative Separable Density Fitting (ISDF) method to approximate two-electron integrals in LR-TDDFT calculations, and implement an implicit Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) method to iteratively solve for the lowest N eigenvalues.
  • Push the limit of LR-TDDFT to 12,288 cores on NERSC's Cori supercomputer, achieving, for the first time, LR-TDDFT calculations for 4,096 silicon atoms.
  • Output: [2] in Selected Publications and [3] in Competitions and Awards.

Accelerate the discontinuous Galerkin density functional theory (DGDFT) method on the Sunway and New Sunway supercomputers. August 2019 ~ April 2022.

  • Optimize DGDFT's computational kernels to align with the Sunway processor's architecture, which includes adapting data structures and reprogramming the codes to accommodate the processor's limited local data memory.
  • Implement a key-value format and a read-and-broadcast mechanism to improve the I/O performance of the DGDFT method.
  • Output: [5] and [7] in Selected Publications and 2022 ACM Gordon Bell Finalist.

Study workload characterization for the Kunpeng processor as a joint research project with Huawei Technologies Co. Ltd, China. September 2021 ~ August 2022.

  • Establish a workflow to collect information required to describe an instruction within the framework of a basic block throughput predictor. This includes throughput, latency, port pressure, operands, and micro-operations (uops) for the Kunpeng 920 processor.
  • Construct an extensive benchmark composed of real-world applications and standard evaluation tools, along with a runtime environment system designed to execute a basic block and accurately measure the corresponding throughput.
  • Enhance the accuracy of llvm-mca by refining its pipeline simulation algorithm and calibrating it against Kunpeng 920 hardware, yielding accuracy that significantly exceeds that of current tools on the AArch64 architecture.
  • Output: [3] and [6] in Selected Publications.

Industry Positions

Software Engineer Intern at Huawei Technologies Co. Ltd, China. October 2018 ~ March 2019. Mentor: Fan Yu.

Research Intern at Fundamental Software Innovation Lab, Huawei Technologies Co. Ltd, China. June 2023 ~ September 2023. Mentor: Han Lin.

Selected Publications

  1. [HPCC'2020] Qingcai Jiang, Lingyun Wan, Shizhe Jiao, et al. An Efficient Multi-GPU Implementation for Linear-Response Time-Dependent Density Functional Theory, in 2020 IEEE 22nd International Conference on High Performance Computing and Communications (HPCC'2020). IEEE, 2020: 197-205. [pdf]

  2. [ICPP'2022] Qingcai Jiang, Junshi Chen, Lingyun Wan, et al. Accelerating Parallel First-Principles Excited-State Calculation by Low-Rank Approximation with K-Means Clustering, in 51st International Conference on Parallel Processing (ICPP'2022). [pdf] [video]

  3. [HPCC'2022] Qingcai Jiang, Shaojie Tan, Zhenwei Cao, et al. Quantifying Throughput of Basic Blocks on ARM Microarchitectures by Static Code Analyzers: A Case Study on Kunpeng 920, in 2022 IEEE 24th International Conference on High Performance Computing and Communications (HPCC'2022). [pdf]

  4. [DATE'2024] Qingcai Jiang*, Shaojie Tan*, Junshi Chen and Hong An. A3PIM: An Automated, Analytic and Accurate Processing-in-Memory Offloader, to appear in 27th Design, Automation and Test in Europe Conference (DATE'2024).

  5. [SC'2022] Wei Hu*, Hong An, Zhuoqiang Guo*, Qingcai Jiang*, et al. 2.5 Million-Atom Ab Initio Electronic-Structure Simulation of Complex Metallic Heterostructures with DGDFT, in Proceedings of the 2022 International Conference for High Performance Computing, Networking, Storage and Analysis (SC'2022). Awarded as a 2022 ACM Gordon Bell Finalist. [link] [pdf] [news in Chinese]

  6. [THPC] Shaojie Tan*, Qingcai Jiang*, Zhenwei Cao, et al. Uncovering the performance bottleneck of modern HPC processor with static code analyzer: a case study on Kunpeng 920, in CCF Trans. HPC, 2023: 1-22. [pdf]

  7. [Science Bulletin] Wei Hu, Xinming Qin, Qingcai Jiang, et al. High performance computing of DGDFT for tens of thousands of atoms using millions of cores on Sunway TaihuLight, in Science Bulletin, 2021, 66(2): 111-119. [pdf] [news in Chinese]

* : co-first author

Teaching Experiences

University of Science and Technology of China

  • Teaching Assistant for Introduction to Computing Systems A (CS1002A). Fall 2021.
  • Teaching Assistant for Computer Programs Design II (011175). Spring 2020.
  • Teaching Assistant for Introduction to Computing Systems H (011704). Fall 2019.
  • Teaching Assistant for Fundamentals of Artificial Intelligence (011119). Spring 2019.

Competitions and Awards

  1. First place in “2019 The 7th Student RDMA Programming Competition”. [news in Chinese]
  2. First place in “2020 The 8th APAC RDMA Programming Competition”. [news in Chinese] [news in English]
  3. First place in “The 8th ‘Intel Cup’ Parallel Application Challenge-PAC”. [news in Chinese] [news in English]
  4. 2020 ASML Computational Lithography Scholarship Award. [photo]
  5. 2022 Global Digital Creations Technology Scholarship. [photo]


  • Programming languages: C/C++ (programming), Python (data processing and plotting), LaTeX (drawing complex figures with TikZ [demo]).
  • Tools: Vim, Linux, Office.