Ammar Ahmad Awan
Microsoft
Verified email at osu.edu - Homepage
Title | Cited by | Year
S-Caffe: Co-designing MPI runtimes and Caffe for scalable deep learning on modern GPU clusters
AA Awan, K Hamidouche, JM Hashmi, DK Panda
ACM PPoPP '17 52 (8), 193-205, 2017
Cited by 170 | 2017
An in-depth performance characterization of CPU- and GPU-based DNN training on modern architectures
AA Awan, H Subramoni, DK Panda
Proceedings of the Machine Learning on HPC Environments, 1-8, 2017
Cited by 70 | 2017
DeepSpeed-MoE: Advancing mixture-of-experts inference and training to power next-generation AI scale
S Rajbhandari, C Li, Z Yao, M Zhang, RY Aminabadi, AA Awan, J Rasley, ...
International Conference on Machine Learning, 18332-18346, 2022
Cited by 58 | 2022
Efficient large message broadcast using NCCL and CUDA-aware MPI for deep learning
AA Awan, K Hamidouche, A Venkatesh, DK Panda
Proceedings of the 23rd European MPI Users' Group Meeting, 15-22, 2016
Cited by 52 | 2016
Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation
AA Awan, J Bédorf, CH Chu, H Subramoni, DK Panda
2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid …, 2019
Cited by 48 | 2019
Privacy-aware searching with oblivious term matching for cloud storage
Z Pervez, AA Awan, AM Khattak, S Lee, EN Huh
The Journal of Supercomputing 63, 538-560, 2013
Cited by 45 | 2013
Optimized broadcast for deep learning workloads on dense-GPU InfiniBand clusters: MPI or NCCL?
AA Awan, CH Chu, H Subramoni, DK Panda
Proceedings of the 25th European MPI Users' Group Meeting, 1-9, 2018
Cited by 44 | 2018
NV-Group: Link-efficient reduction for distributed deep learning on modern dense GPU systems
CH Chu, P Kousha, AA Awan, KS Khorassani, H Subramoni, DK Panda
Proceedings of the 34th ACM International Conference on Supercomputing, 1-12, 2020
Cited by 35 | 2020
Scalable and efficient MoE training for multitask multilingual models
YJ Kim, AA Awan, A Muzio, AFC Salinas, L Lu, A Hendy, S Rajbhandari, ...
arXiv preprint arXiv:2109.10465, 2021
Cited by 34 | 2021
OC-DNN: Exploiting advanced unified memory capabilities in CUDA 9 and Volta GPUs for out-of-core DNN training
AA Awan, CH Chu, H Subramoni, X Lu, DK Panda
2018 IEEE 25th International Conference on High Performance Computing (HiPC …, 2018
Cited by 33 | 2018
1-bit Adam: Communication efficient large-scale training with Adam's convergence speed
H Tang, S Gan, AA Awan, S Rajbhandari, C Li, X Lian, J Liu, C Zhang, ...
International Conference on Machine Learning, 10118-10129, 2021
Cited by 32 | 2021
Performance characterization of DNN training using TensorFlow and PyTorch on modern clusters
A Jain, AA Awan, Q Anthony, H Subramoni, DK Panda
2019 IEEE International Conference on Cluster Computing (CLUSTER), 1-11, 2019
Cited by 28 | 2019
GEMS: GPU-enabled memory-aware model-parallelism system for distributed DNN training
A Jain, AA Awan, AM Aljuhani, JM Hashmi, QG Anthony, H Subramoni, ...
SC20: International Conference for High Performance Computing, Networking …, 2020
Cited by 26 | 2020
CUDA kernel based collective reduction operations on large-scale GPU clusters
CH Chu, K Hamidouche, A Venkatesh, AA Awan, DK Panda
2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid …, 2016
Cited by 24 | 2016
Efficient and scalable multi-source streaming broadcast on GPU clusters for deep learning
CH Chu, X Lu, AA Awan, H Subramoni, J Hashmi, B Elton, DK Panda
2017 46th International Conference on Parallel Processing (ICPP), 161-170, 2017
Cited by 23 | 2017
Exploiting GPUDirect RDMA in designing high performance OpenSHMEM for NVIDIA GPU clusters
K Hamidouche, A Venkatesh, AA Awan, H Subramoni, CH Chu, ...
2015 IEEE International Conference on Cluster Computing, 78-87, 2015
Cited by 22 | 2015
Scaling TensorFlow, PyTorch, and MXNet using MVAPICH2 for high-performance deep learning on Frontera
A Jain, AA Awan, H Subramoni, DK Panda
2019 IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS), 76-83, 2019
Cited by 19 | 2019
Communication profiling and characterization of deep-learning workloads on clusters with high-performance interconnects
AA Awan, A Jain, CH Chu, H Subramoni, DK Panda
IEEE Micro 40 (1), 35-43, 2019
Cited by 19 | 2019
Designing non-blocking personalized collectives with near perfect overlap for RDMA-enabled clusters
H Subramoni, AA Awan, K Hamidouche, D Pekurovsky, A Venkatesh, ...
High Performance Computing: 30th International Conference, ISC High …, 2015
Cited by 18 | 2015
DeepSpeed Inference: Enabling efficient inference of transformer models at unprecedented scale
RY Aminabadi, S Rajbhandari, M Zhang, AA Awan, C Li, D Li, E Zheng, ...
arXiv preprint arXiv:2207.00032, 2022
Cited by 15 | 2022