| Title | Authors | Venue | Cited by | Year |
| --- | --- | --- | --- | --- |
| Automatic tuning of sparse matrix-vector multiplication on multicore clusters | SG Li, CJ Hu, JC Zhang, YQ Zhang | Science China Information Sciences 58 (9), 1-14, 2015 | 137 | 2015 |
| NUMA-aware shared-memory collective communication for MPI | S Li, T Hoefler, M Snir | Proceedings of the 22nd international symposium on High-performance parallel …, 2013 | 112 | 2013 |
| Parallel processing systems for big data: a survey | Y Zhang, T Cao, S Li, X Tian, L Yuan, H Jia, AV Vasilakos | Proceedings of the IEEE 104 (11), 2114-2136, 2016 | 88 | 2016 |
| Deep learning for post-processing ensemble weather forecasts | P Grönquist, C Yao, T Ben-Nun, N Dryden, P Dueben, S Li, T Hoefler | Philosophical Transactions of the Royal Society A 379 (2194), 20200092, 2021 | 55 | 2021 |
| Taming unbalanced training workloads in deep learning with partial collective operations | S Li, T Ben-Nun, SD Girolamo, D Alistarh, T Hoefler | Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of …, 2020 | 35 | 2020 |
| Data Movement Is All You Need: A Case Study on Optimizing Transformers | A Ivanov, N Dryden, T Ben-Nun, S Li, T Hoefler | Proceedings of Machine Learning and Systems 3, 2021 | 31 | 2021 |
| Improved MPI collectives for MPI processes in shared address spaces | S Li, T Hoefler, C Hu, M Snir | Cluster Computing 17 (4), 1139-1155, 2014 | 29 | 2014 |
| CAS‐ESM 2: Description and climate simulation performance of the Chinese Academy of Sciences (CAS) Earth System Model (ESM) version 2 | H Zhang, M Zhang, J Jin, K Fei, D Ji, C Wu, J Zhu, J He, Z Chai, J Xie, ... | Journal of Advances in Modeling Earth Systems, e2020MS002210, 2020 | 23 | 2020 |
| Cache-oblivious MPI all-to-all communications based on Morton order | S Li, Y Zhang, T Hoefler | IEEE Transactions on Parallel and Distributed Systems, 2018 | 19 | 2018 |
| Kernel optimization for short-range molecular dynamics | C Hu, X Wang, J Li, X He, S Li, Y Feng, S Yang, H Bai | Computer Physics Communications, 2016 | 18 | 2016 |
| Massively Scaling the Metal Microscopic Damage Simulation on Sunway TaihuLight Supercomputer | S Li, B Wu, Y Zhang, X Wang, J Li, C Hu, J Wang, Y Feng, N Nie | Proceedings of the 47th International Conference on Parallel Processing, 47, 2018 | 14 | 2018 |
| Asynchronous work stealing on distributed memory systems | S Li, J Hu, X Cheng, C Zhao | 2013 21st Euromicro International Conference on Parallel, Distributed, and …, 2013 | 14 | 2013 |
| Efficient parallel optimizations of a high-performance SIFT on GPUs | Z Li, H Jia, Y Zhang, S Liu, S Li, X Wang, H Zhang | Journal of Parallel and Distributed Computing, 2018 | 13 | 2018 |
| Fast Convolution Operations on Many-Core Architectures | S Li, Y Zhang, C Xiang, L Shi | High Performance Computing and Communications (HPCC), 2015 IEEE 7th …, 2015 | 13 | 2015 |
| Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation | B Wu, S Li, Y Zhang, N Nie | Computer Physics Communications, 2016 | 12 | 2016 |
| Chimera: efficiently training large-scale neural networks with bidirectional pipelines | S Li, T Hoefler | Proceedings of the International Conference for High Performance Computing …, 2021 | 11 | 2021 |
| A Cross-Platform SpMV Framework on Many-Core Architectures | Y Zhang, S Li, S Yan, H Zhou | ACM Transactions on Architecture and Code Optimization (TACO) 13 (4), 33, 2016 | 10 | 2016 |
| Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations | X Zhu, J Zhang, K Yoshii, S Li, Y Zhang, P Balaji | Cluster, Cloud and Grid Computing (CCGrid), 2015 15th IEEE/ACM International …, 2015 | 10 | 2015 |
| Asynchronous Decentralized SGD with Quantized and Local Updates | G Nadiradze, A Sabour, P Davies, S Li, D Alistarh | Advances in Neural Information Processing Systems 34, 2021 | 9* | 2021 |
| OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight | K Li, H Shang, Y Zhang, S Li, B Wu, D Wang, L Zhang, F Li, D Chen, ... | Proceedings of the International Conference for High Performance Computing …, 2019 | 9 | 2019 |