International Workshop on OpenPOWER for HPC (IWOPH'19)

chaired by Dirk Pleiter (JSC), Jack Wells (ORNL), Jesus Labarta
Thursday, 20 June 2019 from 09:00 to 18:00 (Europe/Berlin)
at Frankfurt

This workshop will take place as part of the ISC High Performance 2019 conference. It will provide a venue for HPC and extreme-scale data technology experts, as well as application developers, to present results obtained on OpenPOWER technologies. The latest advances in OpenPOWER that address challenges in system architecture, networking, memory designs, exploitation of accelerators, programming models, and porting applications in machine learning, data analytics, modeling and simulation are of current interest within the HPC community and to this workshop. In particular, the use of the new NVLink technology and of the CAPI/OpenCAPI technology is of interest. Closely related topics include OpenMP, OpenACC, MPI, OpenSHMEM, and developer and performance tools for the OpenPOWER ecosystem.

  • Thursday, 20 June 2019
    • 09:00 - 11:00 Session 1
      • 09:00 Welcome 15'
      • 09:15 Keynote: Landing on Mars: Petascale Unstructured Computational Fluid Dynamics on Summit 1h0'
        Human exploration of Mars will require retropropulsion to decelerate payloads an order of magnitude beyond those of recent efforts. Simulating the interactions between the atmosphere and retropropulsion exhaust plumes at sufficient spatial resolution to resolve governing phenomena with a high level of confidence is not feasible with conventional computational capabilities. Researchers from NASA Langley and Ames Research Centers, NVIDIA Corporation, and Old Dominion University have developed a GPU-accelerated version of Langley's FUN3D flow solver. An ongoing campaign on Oak Ridge National Lab's Summit supercomputer is using this capability to apply detached eddy simulation (DES) methods to retropropulsion in atmospheric environments for nominal operation of a human-scale Mars lander concept.
        Speaker: Mr. Aaron Walden (NASA)
      • 10:15 Performance Evaluation of MPI Libraries on GPU-enabled OpenPOWER Architectures: Early Experiences 45'
        The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures is empowering the advancement of various High-Performance Computing (HPC) applications, from dynamic modular simulation to deep learning training. GPU-Aware Message Passing Interface (MPI) is one of the most efficient libraries used to exploit the computing power of GPU-enabled HPC systems at scale. However, there is a lack of thorough performance evaluations of GPU-Aware MPI libraries that provide insights into the varying costs and benefits of using each one on GPU-enabled OpenPOWER systems. In this paper, we provide a detailed performance evaluation and analysis of point-to-point communication using various GPU-Aware MPI libraries, including SpectrumMPI, OpenMPI+UCX, and MVAPICH2-GDR, on an OpenPOWER GPU-enabled system. We demonstrate that all three MPI libraries deliver approximately 98% of achievable bandwidth for NVLink communication between two GPUs on the same socket. For inter-node communication, where the InfiniBand network dominates the peak bandwidth, MVAPICH2-GDR attains approximately 95% of achievable bandwidth, while SpectrumMPI achieves approximately 58% and OpenMPI delivers close to 50%. This evaluation is useful for determining which MPI library can provide the highest performance enhancement.
        Speaker: Mr. Hari Subramoni
    • 11:00 - 11:30 Coffee break
    • 11:30 - 13:00 Session 2
      • 11:30 Porting Adaptive Ensemble Workflows to the Summit Supercomputer 45'
        Molecular dynamics (MD) simulations must take very small steps in simulation time to avoid numerical errors. Efficient use of parallel programming models and accelerators in state-of-the-art MD programs is now pushing Moore's limit for time per MD step. Therefore, directly simulating experimental timescales will not be attainable, even at exascale. Using concepts from statistical physics, many parallel simulations can be combined to provide information about longer timescales and to adequately sample the simulation space, while maintaining details about the dynamics of the system. Implementing such an approach requires a workflow program that allows interactive steering of task assignments based on extensive statistical analysis of intermediate results. Here we report the implementation of such an adaptable workflow program to drive simulations on the Summit supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). This type of workflow generator can be used to provide adaptive steering of ensemble simulations of all types, not just for MD. We compare to experiences on Titan, report the performance of the workflow and its components, and describe the porting process.
      • 12:15 Evaluating POWER Architecture for Distributed Training of Generative Adversarial Networks 45'
        The increased availability of High-Performance Computing resources can enable data scientists to deploy and evaluate data-driven approaches, notably in the field of deep learning, at a rapid pace. As deep neural networks become more complex and ingest increasingly large datasets, it becomes impractical to perform the training phase on single machine instances because of memory constraints and extremely long training times. Rather than scaling up, scaling out the computing resources is a productive approach to improve performance. The paradigm of data parallelism allows us to split the training dataset into manageable chunks that can be processed in parallel. In this work, we evaluate the scaling performance of training a 3D generative adversarial network (GAN) on an IBM POWER8 cluster equipped with 12 NVIDIA P100 GPUs. The full training duration of the GAN, including evaluation, is reduced from 20 hours and 16 minutes on a single GPU to 2 hours and 14 minutes on all 12 GPUs. We achieve a scaling efficiency of 98.9% when scaling from 1 to 12 GPUs, taking only the training process into consideration.
    • 13:00 - 14:00 Lunch break
    • 14:00 - 16:00 Session 3
      • 14:00 Scaling the Summit: Deploying the World's Fastest Supercomputer 45'
        Summit, the latest flagship supercomputer deployed at the Oak Ridge Leadership Computing Facility (OLCF), became the number one system in the Top500 [17] list in June 2018 and remains in the top spot in the most recent edition of the list. An extensive acceptance test plan was developed to evaluate the unique features introduced in the Summit architecture and system software stack. The acceptance test also includes tests to ensure that the system is reliable, stable, and performant.
      • 14:45 Performance Comparison for Neuroscience Application Benchmarks 45'
        Researchers within the Human Brain Project and related projects have in the last couple of years expanded their needs for high-performance computing infrastructures. The needs arise from a diverse set of science challenges that range from large-scale simulations of brain models to processing of extreme-scale experimental data sets. The ICEI project, which is in the process of creating a distributed infrastructure optimised for brain research, has started to build up a set of benchmarks that reflect the diversity of applications in this field. In this paper we analyse the performance of some selected benchmarks on IBM POWER8 and Intel Skylake based systems with and without GPUs.
      • 15:30 Parallelware tools: An experimental evaluation on POWER systems 30'
        Static code analysis tools are designed to help software developers build better quality software in less time by detecting defects early in the software development life cycle. Even the most experienced developer regularly introduces coding defects. Identifying, mitigating and resolving defects is an essential part of the software development process, but defects frequently go undetected. One defect can lead to a minor malfunction or cause serious security and/or safety issues. Thus, there is an urgent need for new static code analysis tools to help in building better concurrent and parallel software. The paper reports preliminary results about the use of Appentra's Parallelware technology to address this problem from the following three perspectives: finding concurrency issues in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent code that enables tasks to run faster. The paper also presents experimental results using well-known scientific codes and POWER systems.
    • 16:00 - 16:30 Coffee break
    • 16:30 - 18:00 Session 4
      • 16:30 Enabling Fast and Highly Effective FPGA Design Process Using the CAPI SNAP Framework 30'
        CAPI SNAP (Storage, Network, and Analytics Programming) is an open source framework which enables C/C++ as well as FPGA programmers to quickly create FPGA-based accelerated computing that works on server host data, as well as data from storage, flash, Ethernet, or other connected resources. The SNAP framework is based on the IBM Coherent Accelerator Processor Interface (CAPI). From POWER8 with CAPI 1.0 to POWER9 with CAPI 2.0 and OpenCAPI, programmers have access to a very simple framework for developing accelerated applications using high-speed, very low latency interfaces to access an external FPGA. With SNAP, no specific hardware skill is required to port or develop an application and then accelerate it. In addition, a cloud environment is offered as a cost-effective, ready-to-use environment for a first-time-right experience as well as for deeper development, so that both can be achieved with very little investment.
        Speakers: Mr. Alexandre Castellane (IBM France), Mr. Bruno Mesnet (IBM France)
      • 17:00 Exploring the Behavior of Coherent Accelerator Processor Interface (CAPI) on IBM Power8+ Architecture and FlashSystem 900 30'
        The Coherent Accelerator Processor Interface (CAPI) is a general term for the infrastructure that provides a high-throughput, low-latency path to the flash storage connected to the IBM POWER8+ system. The CAPI accelerator card is attached coherently as a peer to the POWER8+ processor. This removes the overhead and complexity of the IO subsystem and allows the accelerator to operate as part of an application. In this paper, we present the results of experiments on IBM FlashSystem 900 (FS900) with a CAPI accelerator card using the “CAPI-Flash - IBM Data Engine for NoSQL Software” library. This library gives the application direct access to the underlying flash storage through user space APIs to manage and access the data in flash. This offloads kernel IO driver functionality to dedicated CAPI FPGA accelerator hardware. We conducted experiments to analyze the performance of FS900 with a CAPI accelerator card, using the Key Value Layer APIs and employing NASA’s MODIS Land Surface Reflectance dataset as a large dataset use case. We performed read and write operations on datasets ranging in size from 1MB to 3TB while varying the number of threads. We then compared this performance with other heterogeneous storage and memory devices such as NVM, SSD and RAM, without using the CAPI accelerator, in synchronous and asynchronous file IO modes of operation. The asynchronous mode had the best performance on all the memory devices that we used for this study. In particular, the results indicate that FS900 with CAPI, together with the metadata cache in RAM, delivers the highest IO/s and OP/s for read operations. This was higher than using RAM alone, while using fewer CPU resources. Among FS900, SSD and NVM, FS900 had the highest write IO/s. Another important observation is that, when the size of the input dataset exceeds the capacity of RAM and the data access is non-uniform and sparse, FS900 with CAPI would be a cost-effective alternative.
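
As a worked check on the figures quoted in the 12:15 abstract (Evaluating POWER Architecture for Distributed Training of Generative Adversarial Networks), the Python sketch below computes the end-to-end speedup and parallel efficiency implied by the reported wall-clock times. The function and variable names are illustrative and do not come from the paper; the only inputs are the two durations and the GPU count stated in the abstract, and efficiency is taken here as speedup divided by the number of GPUs, which is an assumption about the definition used.

```python
def to_minutes(hours, minutes):
    """Convert an hours/minutes duration to total minutes."""
    return hours * 60 + minutes


def speedup(t_single, t_parallel):
    """Ratio of single-device time to multi-device time."""
    return t_single / t_parallel


def parallel_efficiency(t_single, t_parallel, workers):
    """Speedup divided by the number of workers (1.0 would be ideal scaling)."""
    return speedup(t_single, t_parallel) / workers


# Durations quoted in the abstract (full run, including evaluation).
T_1GPU = to_minutes(20, 16)   # 20 h 16 min on a single GPU
T_12GPU = to_minutes(2, 14)   # 2 h 14 min on all 12 GPUs
GPUS = 12

end_to_end_speedup = speedup(T_1GPU, T_12GPU)
end_to_end_efficiency = parallel_efficiency(T_1GPU, T_12GPU, GPUS)

if __name__ == "__main__":
    print(f"single GPU: {T_1GPU} min, {GPUS} GPUs: {T_12GPU} min")
    print(f"end-to-end speedup:    {end_to_end_speedup:.2f}x")
    print(f"end-to-end efficiency: {end_to_end_efficiency:.1%}")
```

Under these assumptions the full run, including evaluation, shows a speedup of roughly 9x and an efficiency of roughly 76%. The 98.9% figure in the abstract covers the training process only, so the two numbers measure different things and are not directly comparable.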