Academia Discussion Group Workshop 2020
Chaired by Dirk Pleiter (Forschungszentrum Jülich & Universität Regensburg), Ganesan Narayanasamy (IBM), Sameer Shende, Fabrizio Magugliani
In the run-up to the SC20 conference, the 5th OpenPOWER Academia Discussion Group Workshop will take place as a virtual event on Friday, November 6, 2-8pm UTC.
Despite this being a virtual event, our goal is to facilitate further interaction between members of the Academia Discussion Group (ADG) and other people interested in OpenPOWER. The workshop will focus on supporting developers using OpenPOWER technologies. We encourage developers from different application areas of scientific computing, data analytics and AI to join. The workshop will promote the exchange of results among participants, enhance their technical knowledge and skills, and facilitate networking.
A large fraction of the OpenPOWER members are academic institutions. The ADG serves as a platform within the OpenPOWER Foundation for academic members as well as a platform for other academics interested in OpenPOWER. Current ADG membership comprises a broad range of academic institutions with strong competence, e.g., in providing and operating high-performance computing facilities or in developing scientific applications, scalable numerical methods, new approaches for processing extreme scale data or new methods for data analytics.
14:00 - 15:40
Extreme-scale Scientific Software Stack (E4S)
The DOE Exascale Computing Project (ECP) Software Technology focus area is developing an HPC software ecosystem that will enable the efficient and performant execution of exascale applications. The Extreme-scale Scientific Software Stack (E4S) [https://e4s.io] is a coherent software stack that will enable application developers to write parallel applications targeting diverse exascale architectures. E4S provides both source builds through the Spack platform and a set of containers that feature a broad collection of HPC software packages. E4S exists to accelerate the development, deployment, and use of HPC software, lowering the barriers for HPC users. It provides container images, build manifests, and turn-key, from-source builds of popular HPC software packages developed as Software Development Kits (SDKs). The talk will describe how E4S containers are being deployed on HPC systems at DOE national laboratories using the Singularity, Shifter, and Charliecloud container runtimes.
Speaker: Prof. Sameer Shende (University of Oregon) Material: Slides
Open OnDemand platform for POWER systems
IBM Power plus PowerAI systems are arguably the most advanced and highly architected systems for machine learning/deep learning on the market. In most installations, these systems are part of a high performance computing (HPC) cluster. For a new user, mastering the intricacies of computing in an HPC cluster is a daunting task, yet one well worth tackling. Here we introduce Open OnDemand as a platform to enable new users on Power-based HPC clusters. Open OnDemand is a web-based platform for interacting with HPC (and cloud) clusters. We demonstrate how new users interact with the IBM Watson AI toolset. As one progresses from new to intermediate user, many components of computing, such as optimizing, remain to be learned. XDMoD is a job log analysis tool that exposes job performance metrics to users. We end by showing how Open OnDemand, in combination with XDMoD, can assist users in job optimization.
Speaker: Prof. Robert Settlage (Virginia Tech) Material: Slides
LBM performance in the Exascale era
In the next two years, the first Exascale-class supercomputers will be delivered. In this talk, starting from results obtained on Marconi100 (9th in the Top500 list), we try to extrapolate a reasonable performance scenario that a Lattice Boltzmann Method (LBM) based code can achieve on this class of high-end HPC machines, with the idea of setting a realistic baseline to help the LBM community carefully plan larger and more complex simulations on Exascale supercomputers.
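To make this kind of extrapolation concrete, here is a back-of-envelope sketch (my own illustration, not taken from the talk) of the bandwidth-bound performance ceiling of a D3Q19 LBM code: each lattice update streams 19 double-precision populations in and out of memory, so peak throughput is essentially memory bandwidth divided by the bytes moved per update.

```c
/* Hypothetical back-of-envelope model: a D3Q19 LBM update reads and
   writes 19 double-precision populations, moving 2 * 19 * 8 = 304 bytes
   per lattice site, so throughput is typically memory-bandwidth bound. */
double lbm_peak_glups(double bandwidth_gbs) {
    const double bytes_per_update = 2.0 * 19 * 8;  /* read + write, 8 B each */
    return bandwidth_gbs / bytes_per_update;       /* giga lattice updates/s */
}
```

With an assumed aggregate bandwidth of, say, 3600 GB/s per node (four HBM2 GPUs), this gives a ceiling of roughly 12 giga lattice updates per second per node; real codes achieve some fraction of this bound, which is what makes such baselines useful for planning Exascale runs.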
Speaker: Dr. Giorgio Amati (CINECA) Material: Slides
Parallel Comparison of Huge DNA Sequences in Multiple GPUs with Pruning
Sequence comparison is a task performed daily in Bioinformatics applications all over the world. Algorithms that retrieve the optimal result have quadratic time complexity, requiring a huge amount of computing power when the compared sequences are long. In order to reduce the execution time, many parallel solutions have been proposed in the literature. Nevertheless, depending on the sizes of the sequences, even those parallel solutions take hours or days to complete. Pruning techniques can significantly improve the performance of the parallel solutions, and a few approaches have been proposed to provide pruning capabilities for sequence comparison applications. This talk presents a variant of the block pruning approach that runs on multiple GPUs, in homogeneous or heterogeneous environments. Experimental results obtained with DNA sequences in two testbeds show that significant performance gains are obtained with pruning, compared to the non-pruning counterpart. We also tested the solution in the U.Oregon environment with up to 8 Volta GPUs, achieving the impressive performance of 2.35 TCUPS (Trillions of Cells Updated per Second).
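The core idea behind pruning can be illustrated with a minimal sketch. The code below is my own simplified, single-threaded Smith-Waterman variant with per-cell pruning (the talk's solution prunes whole blocks across multiple GPUs, which is far more elaborate); the function name, scoring parameters, and size limit are all assumptions for illustration. A cell is skipped when even a perfect-match extension of its score could not reach the best score found so far.

```c
#include <string.h>

#define MATCH     1
#define MISMATCH -3
#define GAP      -2

static int max2(int a, int b) { return a > b ? a : b; }

/* Simplified Smith-Waterman local alignment with a pruning test.
   Assumes strlen(t) < 512 for the fixed-size row buffers. */
int sw_with_pruning(const char *s, const char *t) {
    int m = (int)strlen(s), n = (int)strlen(t);
    int prev[512] = {0}, curr[512] = {0};
    int best = 0;

    for (int i = 1; i <= m; i++) {
        curr[0] = 0;
        for (int j = 1; j <= n; j++) {
            int diag = prev[j-1] + (s[i-1] == t[j-1] ? MATCH : MISMATCH);
            int score = max2(0, max2(diag,
                        max2(prev[j] + GAP, curr[j-1] + GAP)));

            /* pruning test: the remaining cells can add at most MATCH per
               step; if that upper bound is still below the best score,
               this cell can never contribute to the optimum. */
            int remaining = max2(m - i, n - j);
            if (score + remaining * MATCH < best)
                score = 0;  /* prune: treat as a dead cell */

            curr[j] = score;
            if (score > best) best = score;
        }
        memcpy(prev, curr, sizeof(curr));
    }
    return best;
}
```

Because pruned cells provably cannot reach the optimum, the result is unchanged while large regions of the quadratic matrix are skipped, which is where the reported speedups come from.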
Speakers: Prof. Alba Cristina Magalhaes Alves de Melo (University of Brasilia), Mr. Marco Figueiredo (University of Brasilia) Material: Slides
15:40 - 16:10 Virtual break
16:10 - 17:50
Middleware for Message Passing Interface (MPI) and Deep Learning on OpenPOWER platforms
This talk will focus on high-performance and scalable middleware for Message Passing Interface (MPI) and Deep Learning on OpenPOWER platforms with NVIDIA GPGPUs and RDMA-enabled interconnects (InfiniBand and RoCE). We will first focus on the OSU MVAPICH2 MPI libraries and their capabilities for high-performance computing applications with both CPUs (OpenPOWER) and GPUs (NVIDIA). Next, we will focus on the challenges faced by the AI community in achieving high-performance, scalable, and distributed DNN training on modern HPC systems with both scale-up and scale-out strategies. The talk will cover a range of solutions being developed in my group to address these challenges, including: 1) MPI-driven Deep Learning, 2) Out-of-core DNN training, and 3) Hybrid (Data and Model) parallelism. Performance results from the ORNL Summit system (ranked 2nd) and Lassen (ranked 14th) with thousands of GPUs and POWER9 CPUs will be presented.
Speaker: Prof. DK Panda (Ohio State University) Material: Slides
Parallelware Analyzer: Data race detection for GPUs using OpenMP
The development and maintenance of parallel software is far more complex than that of sequential software. Bugs related to parallelism are difficult to find and fix because a buggy parallel code might run correctly 99% of the time and fail just the remaining 1%. This is also true for Graphics Processing Units (GPUs). In order to take advantage of the performance promised by GPUs, developers must write bug-free parallel code using the C/C++ and Fortran programming languages. This talk presents an innovative approach to parallel programming based on two pillars: first, an open catalog of rules and recommendations that capture parallel programming best practices; and second, the automation of quality assurance for parallel programming through new static code analysis tools specializing in parallelism that integrate seamlessly into professional software development tools. We also present new Parallelware Analyzer capabilities for data race detection for GPUs using OpenMP.
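As an illustration of the class of bug such tools target (my own example, not taken from the talk), here is a classic OpenMP data race and its fix: the scalar accumulator is read and written concurrently by all threads, so the result is wrong only some of the time, which is exactly what makes such bugs hard to catch by testing.

```c
/* BUG: unsynchronized concurrent updates to `sum` (a data race). */
double dot_product_racy(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];   /* data race on sum */
    return sum;
}

/* FIX: a reduction clause gives each thread a private accumulator
   and combines the partial sums safely at the end. */
double dot_product_fixed(const double *a, const double *b, int n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

The same `reduction` clause applies to OpenMP GPU offload constructs such as `target teams distribute parallel for`, which is why static detection of races like this matters for GPU programming as well.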
Speaker: Dr. Manuel Arenaz (Appentra / University of A Coruña) Material: Slides
The Marconi100 high-frequency power monitoring system
The availability of high-resolution, real-time power measurements of high-performance computing (HPC) systems opens the door to new applications, especially in the fields of security, malware detection and anomaly detection. This presentation describes the initial work done on the Marconi100 system where, using the OpenBMC framework, it is possible to obtain high-resolution power measurements of a server without the need for additional hardware. It will also describe the scalable storage and analysis platform to which data is sent in real time, and how users can implement applications based on this data.
Speaker: Mr. Francesco Beneventi (University of Bologna) Material: Slides
Exploring the Power of Containerization to Improve Data- and HPC-Education
Hands-on learning over big data sets is complicated by several factors, including data movement and bandwidth consumption. To understand analytics at scale, students need a uniform learning environment that fulfills their learning needs on top of multiple different infrastructures, from their personal laptop to different types of high-performance computing clusters. In this presentation, I will demonstrate how ‘Onstitute’ addresses this issue utilizing the power of containerization on top of cutting-edge HPC technologies such as POWER-based hardware. We developed an end-to-end learning management framework that provides optimized containers with different analytic software for multiple different platforms. Our platform separates the computation from the data in order to save bandwidth while executing containers on demand on the big data that resides on the HPC system. Our management platform and containers are designed to meet a broad range of learning needs for different job roles, including data curator, data architect, data scientist, and data engineer.
Speaker: Prof. Arghya Das (University of Wisconsin Platteville)
17:50 - 18:20 Virtual break
18:20 - 20:00
Memory is everywhere and....is often the bottleneck
Two years ago, the POWER9 processor introduced the OpenCAPI interface, which provides huge bandwidth and very low latency to your system while also maintaining data coherency. This standard is used by hardware accelerators to off-load applications, bringing dedicated optimized processing to your server, but also to manage new host memory technologies. During this presentation, we will show real use cases that demonstrate its capabilities through a huge data acquisition chain used in a synchrotron. We will also introduce the Memory Inception concept, which is changing server architecture, and Hybrid Memory Subsystems, which allow the use of any type of memory and not just the usual DDR. Last but not least, OpenCAPI technology comes with an open-source framework that allows a software developer to easily program these FPGA-based hardware accelerators.
Speakers: Mr. Alexandre Castellane (IBM France), Mr. Bruno Mesnet (IBM France) Material: Slides
FPGA acceleration of Spark SQL queries using Apache Arrow and OpenCAPI
Apache Spark is one of the most widely used big data analytics frameworks due to its user-friendly API. However, the high level of abstraction in Spark introduces large overheads to access modern high-performance heterogeneous hardware accelerators such as FPGAs. This talk discusses solutions to accelerate Spark SQL queries using FPGAs and to offload these computations transparently with little user configuration. In order to achieve this, we use the Apache Arrow in-memory format and the Fletcher hardware interface generator to exchange data efficiently with the accelerators. The performance of the proposed approach was benchmarked on a Power9 system with OpenCAPI, where our proof-of-concept accelerator was able to achieve 13x speedup for a filter-reduce query use case compared to a CPU-based Apache Spark implementation.
Speaker: Mr. Akos Hadnagy (Delft University of Technology) Material: Slides
POWER10 features
POWER10 is IBM's next-generation POWER microprocessor, with superior attributes for enterprise, cognitive and high-performance computing. This talk will describe many of the innovations and capabilities of POWER10 that provide a strong foundation for high-performance computing workloads. POWER10 features a new core microarchitecture focused on energy efficiency, thread strength, increased SIMD execution capabilities, and instruction set enhancements targeted toward AI optimization. The Open Memory Interface (OMI) provides superior memory bandwidth, low access latency, and a technology-agnostic physical interface. A substantial data plane bandwidth increase supports the OMI memory attach as well as next-generation I/O capabilities, including robust PCIe and OpenCAPI accelerator attach. The PowerAXON high-bandwidth, low-latency, multi-protocol link architecture enables scale-out system architectures with disaggregated memory clustering and superior scaling capabilities.
Speakers: Mr. Brian Thompto (IBM), Mr. Bill Starke (IBM) Material: Slides
Disaggregated memory technologies and future opportunities in the HPC world
Traditional server trays encapsulate memory, computational units and accelerators, creating physical constraints that limit cluster design flexibility. But, as far as memory is concerned, this is no longer the only option. Memory disaggregation technologies allow us to go beyond the tray's physical boundaries, opening new opportunities. ThymesisFlow, a hardware and software prototype for memory disaggregation on POWER9, and Memory Inception, a new POWER10 feature enabling a system to map another system's memory as its own, make disaggregated memory a reality and open new horizons for the management and execution of HPC workloads. During this talk we will explore how disaggregated memory technologies bring future opportunities in HPC scenarios.
Speaker: Mr. Michele Gazzetti (IBM Research Europe in Dublin)