Porting Computational Physics Applications to the Titan Supercomputer with OpenACC and OpenMP

This session presents valuable "lessons learned" during the process of porting computational physics applications to the Titan supercomputer with OpenMP and OpenACC.

• 1. Porting Computational Physics Applications to the Titan Supercomputer with OpenACC and OpenMP
    Aaron Vose - Cray Inc.
    GTC - 03/19/2015
• 2. Porting - Overview
    ● Porting methodology:
      ● Express underlying algorithmic parallelism.
      ● Port to OpenMP first.
      ● Port to OpenACC second.
    ● Case studies / examples:
      ● TACOMA
      ● Delta5D
      ● NekCEM
• 3. Porting - Overview
    ● For each case study / example code (TACOMA, Delta5D, and NekCEM):
      ● Introduction to the code and example loop.
      ● OpenMP / OpenACC porting of the loop.
      ● Express underlying algorithmic parallelism.
      ● OpenACC data motion with simplified call tree.
      ● Performance results.
• 4. Porting - OpenMP
    ● Express existing loop-level parallelism with OpenMP directives (see the sketch after this slide).
      ● Cray's Reveal tool can do much of this automatically.
    ● Port to OpenMP before OpenACC:
      ● OpenACC can reuse most OpenMP scoping.
      ● OpenMP porting to CPU is easier than OpenACC porting to GPU.
      ● Data motion can be ignored when porting to OpenMP.
    ● Modify loops to expose more of the underlying algorithms' parallelism.
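    The slides show no directive at this point; as a minimal, hedged illustration, a loop in the style of the TACOMA example later in the deck could be annotated as below. Only the conflict-free part of the update is kept, since the neighbor updates are exactly what the later slides address; dflux is written as on the slides, whose shape/kind is not specified there.

      !$omp parallel do private(df)                      ! df is a small per-thread work array
      do k = 1, n3
        do j = 1, n2
          do i = 1, n1
            df(1:3) = dflux(i,j,k)
            R(i,j,k) = R(i,j,k) + df(1) + df(2) + df(3)  ! each iteration writes only its own R entry
          end do
        end do
      end do
      !$omp end parallel do

    The private scoping worked out here is what later carries over to the OpenACC port.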
• 5. Porting - OpenACC
    ● Identify candidate loops:
      ● Check loops' trip/iteration counts (CrayPAT).
    ● Add OpenACC directives / optimize kernels (see the sketch after this slide):
      ● Check the compiler listing for proper vectorization.
      ● Ignore data motion (best performed once kernels are done and have known data requirements).
    ● Finally, optimize device/host data motion:
      ● Perform bottom-up, hierarchical data optimization.
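    Again, no listing appears on this slide; a hedged sketch of what the end state might look like for the same hypothetical loop as above, with the data region added last as the slide recommends (if dflux is a user function rather than an array, it would also need an !$acc routine seq declaration):

      !$acc data copy(R)                        ! keep R resident on the GPU across kernels
      !$acc parallel loop collapse(3) private(df)
      do k = 1, n3
        do j = 1, n2
          do i = 1, n1
            df(1:3) = dflux(i,j,k)
            R(i,j,k) = R(i,j,k) + df(1) + df(2) + df(3)
          end do
        end do
      end do
      !$acc end data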
• 6. Porting - TACOMA
    Case Study I: TACOMA
• 7. Porting - TACOMA
    ● From GE's Brian Mitchell.
    ● Computational fluid dynamics is essential to design jet engines, gas/steam turbines, and more.
    ● Finite-volume, block-structured, compressible flow solver, with stability achieved via JST.
• 8. Porting - TACOMA
    ● Example loop nest from TACOMA.
    ● Representative of a number of costly routines.
    ● Can be made to parallelize on CPUs with OpenMP.
    ● GPU vectorization requires more work.
• 9. TACOMA - Algo. Parallelism
    do k=1,n3
      do j=1,n2
        do i=1,n1
          df(1:3) = dflux(i,j,k)
          R(i,j,k)   += df(1) + df(2) + df(3)
          R(i-1,j,k) -= df(1)
          R(i,j-1,k) -= df(2)
          R(i,j,k-1) -= df(3)
        end do
      end do
    end do
• 10. TACOMA - Algo. Parallelism
    do k=1,n3
      do j=1,n2
        do i=1,n1
          df(1:3) = dflux(i,j,k)
          R(i,j,k)   += df(1) + df(2) + df(3)
          R(i-1,j,k) -= df(1)
          R(i,j-1,k) -= df(2)
          R(i,j,k-1) -= df(3)
        end do
      end do
    end do
    [Diagram: the loop iterations split between OpenMP threads t0 and t1]
• 11. TACOMA - Algo. Parallelism
    Original loop:
    do k=1,n3
      do j=1,n2
        do i=1,n1
          df(1:3) = dflux(i,j,k)
          R(i,j,k)   += df(1) + df(2) + df(3)
          R(i-1,j,k) -= df(1)
          R(i,j-1,k) -= df(2)
          R(i,j,k-1) -= df(3)
        end do
      end do
    end do

    OpenMP version (per-thread tiles with ownership test):
    do k=ts3,tn3
      do j=ts2,tn2
        do i=ts1,tn1
          df(1:3) = dflux(i,j,k)
          if mycolor(i,j,k,tid)   R(i,j,k)   += df(1) + df(2) + df(3)
          if mycolor(i-1,j,k,tid) R(i-1,j,k) -= df(1)
        end do
      end do
    end do
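    The slide omits the enclosing OpenMP construct; a hedged sketch of how the tiled loop might be driven. The tile-bounds helper get_tile_bounds is a placeholder implied by the slide, not actual TACOMA code; mycolor is the ownership test shown on the slide.

      ! use omp_lib (in the routine's specification part)
      !$omp parallel private(tid, ts1,tn1, ts2,tn2, ts3,tn3, i, j, k, df)
      tid = omp_get_thread_num()
      call get_tile_bounds(tid, ts1,tn1, ts2,tn2, ts3,tn3)  ! placeholder: this thread's tile plus halo
      do k = ts3, tn3
        do j = ts2, tn2
          do i = ts1, tn1
            df(1:3) = dflux(i,j,k)
            ! each R entry is written only by the thread that "owns" it:
            if (mycolor(i,j,k,tid))   R(i,j,k)   = R(i,j,k)   + df(1) + df(2) + df(3)
            if (mycolor(i-1,j,k,tid)) R(i-1,j,k) = R(i-1,j,k) - df(1)
            ! ... likewise for the j-1 and k-1 neighbors ...
          end do
        end do
      end do
      !$omp end parallel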
• 12. TACOMA - Algo. Parallelism
    Original loop:
    do k=1,n3
      do j=1,n2
        do i=1,n1
          df(1:3) = dflux(i,j,k)
          R(i,j,k)   += df(1) + df(2) + df(3)
          R(i-1,j,k) -= df(1)
          R(i,j-1,k) -= df(2)
          R(i,j,k-1) -= df(3)
        end do
      end do
    end do

    OpenACC version (two passes; a directive sketch follows below):
    do
      df(i,j,k,1:3) = dflux(i,j,k)
    end do
    do
      R(i,j,k) += df(i,j,k,1) + df(i,j,k,2) + df(i,j,k,3)
      R(i,j,k) -= df(i+1,j,k,1) + df(i,j+1,k,2) + df(i,j,k+1,3)
    end do
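    The slide leaves the loop bounds and directives out; a hedged sketch of how the two passes could be expressed, assuming df is a temporary with a zero-initialized one-cell halo (e.g. real :: df(0:n1+1,0:n2+1,0:n3+1,3)) and that both arrays are already on the device from an enclosing data region. These assumptions are not shown on the slide.

      ! Pass 1: store every flux once (a user-function dflux would need !$acc routine seq).
      !$acc parallel loop collapse(3) present(df)
      do k = 1, n3
        do j = 1, n2
          do i = 1, n1
            df(i,j,k,1:3) = dflux(i,j,k)
          end do
        end do
      end do

      ! Pass 2: each (i,j,k) now updates only R(i,j,k), so there are no conflicting
      ! writes and the nest can vectorize on the GPU; the zeroed halo supplies the
      ! missing neighbor fluxes at the block boundaries.
      !$acc parallel loop collapse(3) present(df, R)
      do k = 0, n3
        do j = 0, n2
          do i = 0, n1
            R(i,j,k) = R(i,j,k) + df(i,j,k,1) + df(i,j,k,2) + df(i,j,k,3) &
                                - df(i+1,j,k,1) - df(i,j+1,k,2) - df(i,j,k+1,3)
          end do
        end do
      end do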
• 13. TACOMA - OpenACC Data
    ● Create OpenACC data regions:
      ● Keep data on the GPU device as long as possible.
      ● Create data regions in bottom-up, hierarchical fashion (sketch below).
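    The slide describes the strategy but shows no code; a hedged sketch of the bottom-up idea with placeholder routine names (not TACOMA's actual call tree):

      subroutine compute_step(R, flux)            ! placeholder driver routine
        real :: R(:,:,:), flux(:,:,:,:)
        !$acc data copy(R) copyin(flux)           ! outermost region, added last:
        call flux_kernels(R, flux)                ! data stays resident across both calls
        call smooth_kernels(R)                    ! (another placeholder kernel routine)
        !$acc end data
      end subroutine compute_step

      subroutine flux_kernels(R, flux)
        real :: R(:,:,:), flux(:,:,:,:)
        !$acc data present_or_copy(R) present_or_copyin(flux)  ! written first (bottom-up);
        ! ... !$acc parallel loop kernels here ...              ! a no-op when the caller's
        !$acc end data                                          ! region already holds the data
      end subroutine flux_kernels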
• 14. TACOMA - Performance
• 15. Porting - Delta5D
    Case Study II: Delta5D
• 16. Porting - Delta5D
    ● From ORNL's Donald Spong.
    ● Monte-Carlo fusion code.
    ● Boozer space particle orbits.
    ● Hamiltonian guiding center equations solved with 4th-order Runge-Kutta.
• 17. Porting - Delta5D
    ● Example loop from Delta5D.
    ● Fast enough to run in serial on the CPU; slow on the GPU.
    ● Data motion rules out running on the CPU.
    ● Needs to run in parallel on the GPU.
• 18. Delta5D - Algo. Parallelism
    ● If a particle's trajectory takes it outside the confined plasma volume, append it to a list of escaped particles.
• 19. Delta5D - Algo. Parallelism
    v1 (original):
    do i=1,maxorb
      ! -- Record this particle if it has "escaped".
      if(psinor(i) .gt. 1.) then
        iloss = iloss + 1
        phi_loss(iloss)  = y(6*i-3)
        psi_loss(iloss)  = y(6*i-4)/psimax
        thet_loss(iloss) = y(6*i-5)
        elost = elost + hkin(i)/ejoule
      end if
    end do
• 20. Delta5D - Algo. Parallelism
    v2 (with OpenACC atomic capture; the enclosing directive is sketched below):
    do i=1,maxorb
      ! -- Record this particle if it has "escaped".
      if(psinor(i) .gt. 1.) then
        !$acc atomic capture
        iloss = iloss + 1        ! update-statement
        my_iloss = iloss         ! capture-statement
        !$acc end atomic
        phi_loss(my_iloss)  = y(6*i-3)
        psi_loss(my_iloss)  = y(6*i-4)/psimax
        thet_loss(my_iloss) = y(6*i-5)
        elost = elost + hkin(i)/ejoule
      end if
    end do
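    The slide shows only the loop body; a hedged sketch of how the enclosing construct might look. The reduction on elost, the explicit private(my_iloss), and the data clauses are assumptions not shown on the slide, and the present() clause assumes an enclosing data region holds the arrays.

      !$acc parallel loop private(my_iloss) reduction(+:elost) copy(iloss) &
      !$acc&  present(psinor, y, hkin, phi_loss, psi_loss, thet_loss)
      do i = 1, maxorb
        if (psinor(i) .gt. 1.) then
          !$acc atomic capture
          iloss    = iloss + 1          ! shared escaped-particle counter
          my_iloss = iloss              ! each iteration gets a unique slot
          !$acc end atomic
          phi_loss(my_iloss)  = y(6*i-3)
          psi_loss(my_iloss)  = y(6*i-4)/psimax
          thet_loss(my_iloss) = y(6*i-5)
          elost = elost + hkin(i)/ejoule
        end if
      end do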
• 21. Delta5D - OpenACC Data
• 22. Delta5D - OpenACC Performance
      OpenACC Sequential   OpenACC Atomics
      19.446s              0.425s
    ● 45x kernel speedup.
    ● Up to ~5-10% improvement in total runtime.
• 23. Porting - NekCEM
    Case Study III: NekCEM
• 24. Porting - NekCEM
    ● From ANL's Mi Sun Min.
    ● Nekton for Computational ElectroMagnetics.
    ● High-fidelity electromagnetics solver based on spectral element methods.
    ● Written in Fortran and C.
• 25. Porting - NekCEM
    ● Example loop from NekCEM.
    ● Initial loop structure does not vectorize on the GPU.
    ● Gather/scatter benefits from high GPU bandwidth.
    ● Data motion needed around MPI communication (sketch below).
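    The slide states the requirement without code; a hedged sketch of the usual pattern, with placeholder buffer names and MPI arguments (not NekCEM's actual communication routine):

      !$acc update host(dbuf(1:nsend))          ! move the packed send buffer to the host
      call MPI_Isend(dbuf, nsend, MPI_REAL8, dest, tag, comm, req(1), ierr)
      call MPI_Irecv(rbuf, nrecv, MPI_REAL8, src,  tag, comm, req(2), ierr)
      call MPI_Waitall(2, req, MPI_STATUSES_IGNORE, ierr)
      !$acc update device(rbuf(1:nrecv))        ! push the received data back to the GPU

    With a CUDA-aware MPI stack, the update directives can be dropped and device pointers passed directly inside an !$acc host_data use_device(dbuf, rbuf) region.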
• 26. NekCEM - Algo. Parallelism
    ● Scatter from u to dbuf with indirect addressing, using description vector snd_map, internally terminated by -1.
    [Diagram: example contents of snd_map[], dbuf[], and u[]]
• 27. NekCEM - Algo. Parallelism
    for(k=0; k ...   (loop listing truncated in the transcript)
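    Since the listing is cut off, here is an illustrative sketch of the scatter described on the previous slide, written in Fortran for consistency with the other sketches; it is not NekCEM's actual code, and only the basic indirect access is shown (the real snd_map is grouped, with internal -1 terminators).

      ! snd_map: integer index map, sentinel-terminated; dbuf: packed send buffer; u: field data
      m = 0
      k = 1
      do while (snd_map(k) /= -1)
        m = m + 1
        dbuf(m) = u(snd_map(k) + 1)   ! +1 assuming a 0-based map from the C side
        k = k + 1
      end do
      ! The sentinel test gives the loop an unknown trip count, which is one reason the
      ! initial structure does not vectorize on the GPU; a precomputed count (not
      ! necessarily what NekCEM does) would give a fixed trip count suitable for
      ! !$acc parallel loop.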