International Workshop on OpenPOWER for HPC (IWOPH'16)

chaired by Dirk Pleiter (Forschungszentrum Juelich & Universitaet Regensburg), Jack C. Wells (Oak Ridge Leadership Computing Facility)
Thursday, 23 June 2016 (Europe/Berlin)
Description
This workshop will take place within the ISC High Performance 2016 conference. It is intended to provide a venue for the broader HPC community to better understand OpenPOWER technologies and to discuss how they can be harnessed for the needs of HPC applications. The latest advances in OpenPOWER, and how they might be used to address challenges in system architecture, networking, memory design, exploitation of accelerators, programming models, and application porting, are of current interest to the HPC community and to this workshop.
  • Thursday, 23 June 2016
    • 09:00 - 11:20 Session 1
      • 09:00 Welcome 15'
      • 09:15 Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond 30'
        With the appearance of the heterogeneous OpenPOWER platform, many-core accelerator devices have been coupled with POWER host processors for the first time. To utilize their full potential, it is worth investigating performance-portable algorithms that allow choosing the best-fitting hardware for each domain-specific compute task. To match even the high level of parallelism on modern GPGPUs, our approach relies heavily on abstract meta-programming techniques, which are essential for focusing on fine-grained tuning rather than code porting. With this in mind, the CUDA-based open-source plasma simulation code PIConGPU is currently being abstracted to support the heterogeneous OpenPOWER platform using our fast porting interface cupla, which wraps the abstract parallel C++11 kernel acceleration library Alpaka. We demonstrate how PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, POWER8 CPUs and NVIDIA GPUs.
        Material: Slides pdf file
      • 09:45 High Performance Computing on the IBM Power8 platform 45'
        This paper discusses the performance of IBM’s Power8 CPUs on a number of skeleton, financial and CFD benchmarks and applications. Implicitly, the performance of the software toolchain is also tested: the bare-bones Little-Endian Ubuntu, the GNU 5.3 and XL 14.1.3 compilers, and the OpenMP runtimes. First, we aim to establish some roofline numbers on bandwidth and compute throughput, then move on to benchmark explicit and implicit one-/three-factor Black-Scholes computations, and CFD applications based on the OP2 and OPS frameworks, such as the Airfoil and BookLeaf unstructured-mesh codes, and the CloverLeaf 2D/3D structured-mesh simulations. These applications all exhibit different characteristics in terms of computations, communications, memory access patterns, etc. Finally, we briefly discuss the performance of an industrial CFD code, Rolls-Royce Hydra, and we show initial results from IBM’s CUDA Fortran compiler. Both absolute and relative performance metrics are computed and compared to NVIDIA GPUs and Intel Xeon CPUs.
        Material: Slides pdf file
      • 10:30 Early Application Performance at the Hartree Centre with the OpenPOWER Architecture 30'
        The Hartree Centre has been established as a UK focus for industrial engagement. STFC has acquired a new IBM system based on the OpenPOWER architecture, comprising 32 nodes with POWER8 CPUs and NVIDIA Kepler K80 GPUs. We report an early evaluation of the system using some real applications based on the Lattice Boltzmann Method and on FFTs. No optimisation has been carried out yet, but results are encouraging, with performance comparable to or better than Intel Ivy Bridge CPUs on a per-core basis. Use of the GPUs for suitable algorithms, such as Lattice Boltzmann kernels and FFTs, provides further performance enhancements.
        Material: Slides pdf file
    • 11:00 - 11:30 Coffee
    • 11:30 - 13:00 Session 2
      • 11:30 Early Experiences Porting the NAMD and VMD Molecular Simulation and Analysis Software to GPU-Accelerated OpenPOWER Platforms 45'
        All-atom molecular dynamics simulations of biomolecules provide a powerful tool for exploring the structure and dynamics of large protein complexes within realistic cellular environments. Unfortunately, such simulations are extremely demanding in terms of their computational requirements, and they present many challenges in terms of preparation, simulation methodology, and analysis and visualization of results. We describe our early experiences porting the popular molecular dynamics simulation program NAMD and the simulation preparation, analysis, and visualization tool VMD to GPU-accelerated OpenPOWER hardware platforms. We report our experiences with compiler-provided autovectorization and compare with hand-coded vector intrinsics for the POWER8 CPU. We explore the performance benefits obtained from unique POWER8 architectural features such as 8-way SMT and its value for particular molecular modeling tasks. Finally, we evaluate the performance of several GPU-accelerated molecular modeling kernels and relate them to other hardware platforms.
        Material: Slides pdf file
      • 12:15 Performance Analysis of Spark/GraphX on POWER8 Cluster 45'
        POWER8, the latest RISC (Reduced Instruction Set Computer) microprocessor of the IBM Power architecture family, was designed to significantly benefit emerging workloads, including Business Analytics, Cloud Computing and High Performance Computing. In this paper, we provide a thorough performance evaluation of a widely used large-scale graph processing framework, Spark/GraphX, on a POWER8 cluster. We examine the performance with several important graph kernels such as Breadth-First Search, Connected Components, and PageRank, using both large real-world social graphs and synthetic graphs of billions of edges. We study the Spark/GraphX performance with respect to several architectural aspects and perform the first Spark/GraphX scalability test with up to 16 POWER8 nodes.
        Material: Slides pdf file
    • 13:00 - 14:00 Lunch
    • 14:00 - 16:00 Session 3
      • 14:00 Measuring and Managing Energy in OpenPOWER 45'
        This paper presents the design and implementation of energy measurement and management features found in OpenPOWER systems. The firmware and its ecosystem are open source to allow the community to extend the capabilities.
        Material: Slides pdf file
      • 14:45 Exploring Energy Efficiency for GPU-Accelerated POWER Servers 45'
        Modern servers provide different features for managing the amount of energy that is needed to execute a given workload. In this article we focus on a new generation of GPU-accelerated servers with POWER8 processors. For different scientific applications, which have in common that they have been written for massively parallel computers, we measure energy-to-solution for different system configurations. By combining previously developed performance models and a simple power model, we derive an energy model that can help to optimize for energy efficiency.
        Material: Slides pdf file
      • 15:30 First Experiences with ab initio Molecular Dynamics on OpenPOWER: the case of CPMD 30'
        In this article we present the algorithmic rethinking and code re-engineering required to port highly successful and popular planewave codes to next-generation heterogeneous OpenPOWER architectures that foster acceleration and high-bandwidth links to GPUs. In this work we focus on CPMD as the most representative software for ab initio molecular dynamics simulations. We have ported to the GPU the construction of the electronic density, the application of the potential to the wavefunctions, and the orthogonalization procedure. The different GPU kernels consist mainly of fast Fourier transforms (FFT) and basic linear algebra operations (BLAS). The performance of the new implementation obtained on Firestone (POWER8/Tesla) is discussed. We show that the communication between the host and the GPU accounts for a large fraction of the total run time. We expect a strong attenuation of the communication bottleneck once the NVLink high-speed interconnect becomes available.
        Material: Slides pdf file
    • 16:00 - 16:30 Coffee
    • 16:30 - 18:05 Session 4
      • 16:30 Performance of the 3D Combustion Simulation code RECOM-AIOLOS on IBM POWER8 architecture 30'
        The IBM POWER8 CPU is high-performance multi-core hardware that targets computationally intense numerical codes. Combustion modeling is among the most computationally demanding mathematical problems. Therefore, in this paper we present a performance analysis of the 3D combustion modeling software RECOM-AIOLOS on a POWER8 node. The analysis reveals the strengths of the POWER8 hardware as a NUMA system, but also the importance of proper memory allocation when using OpenMP or a hybrid (OpenMP + MPI) parallelization approach on such a system.
        Material: Slides pdf file
      • 17:00 Panel discussion 45'
        The panel will discuss opportunities for HPC based on OpenPOWER technologies. On the panel will be:
        * Jeff Vetter (ORNL Future Technologies Group)
        * Rich Graham (Mellanox)
        * Piero Altoe (E4)
        * John Stone (University of Illinois at Urbana-Champaign)
      • 17:45 Conclusion 15'