Second International Workshop on Deepening Performance Models for Automatic Tuning (DPMAT)

August 28th (Mon), 2017, 11:00 -
August 31st (Thu), 2017, -17:30

4F Training Room,
Information Technology Center, Nagoya University

Higashiyama Campus, Nagoya University

Main Sponsorship:
Japan Society for the Promotion of Science (JSPS) Bilateral Joint Research Projects (Open Partnership), “Deepening Performance Models for Automatic Tuning with International Collaboration”

Grant-in-Aid for challenging Exploratory Research, “Development of Technologies of High Performance Computing for Accuracy Assurance”

Grant-in-Aid for Scientific Research (B), “A Novel Development of Auto-tuning Technologies for Communication Avoiding and Reducing Algorithms”

“Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures” and “High Performance Computing Infrastructure” in Japan (JHPCN), “High-performance Randomized Matrix Computations for Big Data Analytics and Applications”

Information Technology Center, Nagoya University

Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures

Aim of the workshop

Advanced computer architectures on the road to exascale computing are becoming increasingly complex, featuring many-core processors, deep memory hierarchies, and heterogeneous components. Development of high-performance software is also becoming costly. One candidate technology for addressing this complexity is automatic performance tuning (AT).

In this workshop, we discuss emerging problems and state-of-the-art technologies for AT with researchers from Japan and Taiwan. The speakers of the workshop are as follows:

Invited Speakers:
Dr. Osni Marques (Lawrence Berkeley National Laboratory, USA)
Prof. Richard Vuduc (Georgia Institute of Technology, USA)
Prof. Weichung Wang (National Taiwan University, Taiwan)
Prof. Feng-Nan Hwang (National Central University, Taiwan)
Prof. Ray-Bing Chen (National Cheng Kung University, Taiwan)

Domestic Speakers:
Prof. Takahiro Katagiri (Nagoya University, Japan)
Prof. Satoshi Ohshima (Research Institute for Information Technology, Kyushu University, Japan)
Prof. Reiji Suda  (The University of Tokyo, Japan)
Prof. Akihiro Fujii (Kogakuin University, Japan)
Prof. Teruo Tanaka (Kogakuin University, Japan)
Mr. Tomoya Ikeda (Graduate School of Information Science, Nagoya University, Japan)
Mr. Koki Masui (Graduate School of Information Science, Nagoya University, Japan)


August 28th (Mon), 2017

11:00-13:00 Lunch Meeting

    @ Seminar Room, 5th Floor, Information Technology Center, Nagoya University
Welcome Address by Prof. Takahiro Katagiri (Nagoya University)

13:00-17:00 Group Meeting
    @ Seminar Room, 5th Floor, Information Technology Center, Nagoya University

*17:30- Welcome Reception

August 29th (Tue), 2017

*11:00-12:45 Lunch Meeting
    @ Seminar room, 5F ITC, Nagoya University

The first day of DPMAT @ Training Room, 4F ITC, Nagoya University

* 13:00 - 13:10 Opening Talk
Takahiro Katagiri (Information Technology Center, Nagoya University, Japan)

* Invited Talk (1)  13:10 - 14:10
(Chair: Takahiro Katagiri, Nagoya University )


Unconstrained Functionals for Efficient Parallel Scaling of Conjugate Gradient Eigensolvers


Dr. Osni Marques (Lawrence Berkeley National Laboratory, USA)

A large number of scientific applications require the solution of eigenvalue problems. For applications where only a small percentage of the eigenpairs is required, rather than the full spectrum, iterative eigensolvers are typically used. Often, the scaling of these iterative eigensolvers on modern massively parallel multi-core computers is limited by the reorthogonalization step, which typically uses direct diagonalization of the subspace matrix, Cholesky factorization, or QR decomposition. Algorithms that reduce the number of explicit matrix diagonalizations to as few as possible are expected to display much more favorable scaling, and therefore to be the preferred choice for very large systems, compared to the current state-of-the-art parallel implementations. We focus on the unconstrained energy functional formalism for electronic structure calculations, which offers the possibility of fulfilling that requirement. Unconstrained functionals are more complex than the standard functionals but avoid the reorthogonalization step, so that the trial vectors are orthogonal only at the minimum. In the presentation, we compare the performance of the unconstrained energy functional formalism with that of existing approaches on realistic problems, and discuss opportunities for performance improvement.

Joint work with Mauro Del Ben and Andrew Canning.

* 14:10 – 14:20 Break

(Chair: Akihiro Fujii, Kogakuin University)
* 14:20-14:50

Auto-tuning to Scientific Applications –Traditional Approach by Code Transformations and New Approach by AI –

Prof. Takahiro Katagiri (Information Technology Center, Nagoya University, Japan)

Currently, multi-core and many-core architectures are pervasive. On these architectures, the loop length of computational kernels must be considered to obtain high parallelism: a length of several hundred iterations or more is needed. If the application or target problem does not provide enough loop length, traditional approaches such as loop transformation, including loop collapse of the kernel, still work. This is an old-fashioned technology, but it remains a strong one on current many-core CPUs, such as the Xeon Phi. On the other hand, AI technology is being widely adopted in many applications. Applying AI to auto-tuning (AT) technology is a natural direction from the viewpoint of parameter optimization.

In this presentation, we present two kinds of examples of AT. The first is based on the traditional approach: it performs code transformations on codes of typical scientific simulations, such as seismic wave analysis or plasma simulation. The second is based on a new approach: it performs parameter optimization by AI for a numerical library. Preliminary results for the two AT approaches, obtained on supercomputers in Japan, are shown.

* 14:50 – 15:20

Simultaneous estimation of multiple performance parameters using multi step single dimensional d-Spline estimation

Prof. Teruo Tanaka (Department of Computer Science, Kogakuin University, Japan)

In AT, to improve the performance of a target program that is executed iteratively and occupies a major portion of the execution time, it is important to find the most suitable performance parameter point in the parameter space efficiently. At this optimal point, the execution time (measured value) of the target program is minimal. To achieve accurate estimation of the optimal performance parameter point within a short estimation time, we have proposed the Incremental Performance Parameter Estimation (IPPE) method. This method is characterized by the successive addition of new sampling parameter points to a small set of sampling parameter points. IPPE uses a discretized spline function (d-Spline) as its fitting function. The d-Spline is highly adaptable and requires little estimation time, but must be updated with a new sampling parameter point at each iteration. When the dimension of the performance parameters increases, the computational cost of the multi-dimensional d-Spline increases exponentially. We proposed an enhanced IPPE/d-Spline method that reduces the computational cost significantly by defining relationships among the multi-dimensional performance parameters and by using a single-dimensional d-Spline search repeatedly. Our method operates like a steepest descent method. We show cases with more than three simultaneously estimated parameters, and discuss the behavior of this method in terms of the dependency between parameters.
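As a rough illustration of the d-Spline idea, the sketch below fits a discretized function on a uniform grid to a few sampled performance-parameter points, regularized by second differences (smoothness), and proposes the estimated minimum as the next sampling candidate. The formulation, function names, and parameter values here are my own illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def d_spline_fit(n, sampled_idx, sampled_val, alpha=1.0):
    """Fit a discretized spline f on grid points 0..n-1 to the sampled
    values, penalizing the second differences of f (smoothness)."""
    m = len(sampled_idx)
    # Data-fitting rows: f[idx_j] ~= val_j
    E = np.zeros((m, n))
    E[np.arange(m), sampled_idx] = 1.0
    # Smoothness rows: alpha * (f[i] - 2 f[i+1] + f[i+2]) ~= 0
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = alpha * np.array([1.0, -2.0, 1.0])
    A = np.vstack([E, D])
    b = np.concatenate([sampled_val, np.zeros(n - 2)])
    f, *_ = np.linalg.lstsq(A, b, rcond=None)
    return f

# Toy performance surface: minimum near grid point 12 on 32 parameter values
grid = np.arange(32)
true_time = 0.05 * (grid - 12.0) ** 2 + 1.0

# Start from a few sampled parameter points, as in IPPE
idx = np.array([0, 8, 16, 24, 31])
f = d_spline_fit(32, idx, true_time[idx])
candidate = int(np.argmin(f))   # next point to sample
print("next candidate:", candidate)  # expected near the true optimum around 12
```

The estimated minimum of the fitted d-Spline becomes the next sampling point, and the fit is then updated incrementally with the newly measured value.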

* 15:20-15:50
Auto-tuning of directives: tuning directives of OpenMP and OpenACC

Prof. Satoshi Ohshima (Research Institute for Information Technology, Kyushu University, Japan)

In OpenMP and OpenACC programming, tuning directives is effective and important work. For example, the number of threads and the loop scheduling of OpenMP are controlled by directives and have a large impact on performance on current multi-/many-core processors. The execution configuration of OpenACC (gang, worker, and vector) also has a large impact on GPU programs and is likewise controlled by directives. Because directives are processed before compile time, it is difficult to apply common optimization techniques. We are developing ppOpen-AT, a directive-based auto-tuning language that supports tuning of directives. The current ppOpen-AT provides several features and directives that modify other directives, and new features are under discussion.

In this talk, the current status of the “directives tuning directives” feature and its future are shown.

* 15:50 - 16:00 Break (Taking a Group Photo)

(Chair: Takahiro Katagiri, Nagoya University )
* Invited Talk (2) 16:00 -17:00

Nonlinear Preconditioner for Full-space Lagrange-Newton-Krylov Algorithms with applications in Large-scale PDE-constrained Optimization Problems

Prof. Feng-Nan Hwang (Department of Mathematics, National Central University, Taiwan)

PDE-constrained optimization problems are a class of important and computationally challenging problems. The full-space Lagrange-Newton algorithm is one of the most popular numerical algorithms for solving such problems, since Newton-type methods enjoy fast convergence when the nonlinearities in the system are well balanced. However, in many practical problems such as flow control, if some of the equations are much more nonlinear than the others in the system, the convergence of the method becomes slow or, in the worst case, fails entirely. The radius of convergence is often constrained by a small number of variables or equations with strong nonlinearities. In this talk, we introduce and study a parallel nonlinear elimination preconditioned inexact Newton algorithm for the boundary control of thermally convective flows based on a field-variable partition.

In this approach, once the objective function and the PDE constraints are discretized by some numerical scheme in the standard manner, we convert the constrained optimization problem into an unconstrained one by introducing the augmented Lagrangian function, and then find the candidate optimal solution by solving the first-order necessary conditions using an inexact Newton method with backtracking techniques. The key point of the newly proposed algorithm is that, before performing the global Newton update, we first identify the to-be-eliminated components that drive the Newton method into slow convergence, and then remove the strong nonlinearity by a subspace correction, which can be interpreted as applying a nonlinear-elimination-based preconditioner to the nonlinear system.

As a result, the new approach shows significantly improved performance compared to a standard Lagrange-Newton type method or its grid-sequencing version. Some numerical results are presented to demonstrate the robustness and efficiency of the proposed algorithm.

* 17:00- Closing of The First day

*17:30- Workshop Reception

August 30th (Wed), 2017

The second day of DPMAT @ Training Room, 4F ITC, Nagoya University

(Chair: Takahiro Katagiri, Nagoya University )
*11:00 - 12:00 Invited Talk (3)

A power-tunable single-source shortest path algorithm

Prof. Richard Vuduc (School of Computational Science and Engineering, Georgia Institute of Technology, USA)

I'll discuss a novel methodology to control power consumption by tuning the amount of parallelism that is available during the execution of an algorithm. The specific algorithm is a tunable variation of delta-stepping for computing a single-source shortest path (SSSP); its available parallelism is highly irregular and depends strongly on the input. Informed by an analysis of these runtime characteristics, we propose a software-based controller that uses online learning techniques to dynamically tune the available parallelism to meet a given target, thereby improving the average available parallelism while reducing its variability. We verify experimentally that this mechanism makes it possible for the algorithm to “self-tune” the tradeoff between performance and power. The prototype extends Gunrock's GPU SSSP implementation, and the experimental apparatus is an embedded CPU+GPU platform (NVIDIA TK1), which has separately tunable GPU core and memory frequency knobs, and is connected to an external power monitoring device (PowerMon 2).

This work is led by Sara Karamati, a Ph.D. student, and joint with Jeff Young, both at Georgia Tech.
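For background, delta-stepping groups vertices into buckets of width Δ by tentative distance; every vertex in the current bucket can be relaxed in parallel, so Δ directly controls the available parallelism that the talk's controller tunes. The following sequential sketch is my own simplification (the bucket-size "parallelism proxy" is an assumption, and this is not the Gunrock implementation):

```python
import math
from collections import defaultdict

def delta_stepping(adj, src, delta):
    """Simplified (sequential) delta-stepping SSSP.
    adj: {u: [(v, w), ...]} with nonnegative weights.
    Bucket i holds vertices with tentative distance in [i*delta, (i+1)*delta);
    all vertices in a bucket could be relaxed in parallel, so the bucket
    width delta tunes the available parallelism."""
    dist = defaultdict(lambda: math.inf)
    dist[src] = 0.0
    buckets = defaultdict(set)
    buckets[0].add(src)
    max_bucket = 0  # crude proxy for available parallelism

    while True:
        nonempty = [i for i, b in buckets.items() if b]
        if not nonempty:
            break
        i = min(nonempty)
        # Reprocess bucket i until it stops refilling (short edges may
        # reinsert vertices into the same bucket).
        while buckets.get(i):
            frontier, buckets[i] = buckets[i], set()
            max_bucket = max(max_bucket, len(frontier))
            for u in frontier:
                for v, w in adj.get(u, []):
                    nd = dist[u] + w
                    if nd < dist[v]:
                        if dist[v] < math.inf:  # move v to its new bucket
                            buckets[int(dist[v] // delta)].discard(v)
                        dist[v] = nd
                        buckets[int(nd // delta)].add(v)
    return dict(dist), max_bucket

adj = {
    'a': [('b', 1.0), ('c', 4.0)],
    'b': [('c', 2.0), ('d', 6.0)],
    'c': [('d', 3.0)],
}
dist, width = delta_stepping(adj, 'a', delta=2.0)
print(dist['d'])  # 1 + 2 + 3 = 6.0
```

A large Δ yields large buckets (more parallelism, more wasted relaxations); a small Δ approaches Dijkstra's algorithm (little parallelism, little wasted work), which is exactly the knob the power-tuning controller exploits.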

* 12:00-13:30 A Lunch Meeting
     @ Seminar room, 5F ITC, Nagoya University

(Chair: Takahiro Katagiri, Nagoya University )
*13:30 - 14:30 Invited Talk (4)

Surrogate-Assisted Tuning for Computer Experiments with Qualitative and Quantitative Parameters

Prof. Ray-Bing Chen (Department of Statistics, National Cheng Kung University, Taiwan)

Performance tuning of computer codes is an essential issue in computer experiments. By suitably choosing the values of the tuning parameters, we can optimize the codes in terms of timing, accuracy, robustness, or other performance objectives. As computer software and hardware become more and more complicated, such a tuning process is not an easy task, and there is a strong need for efficient and automatic tuning methods. In this talk, we consider software auto-tuning problems that involve qualitative and quantitative tuning parameters by solving the resulting optimization problems. Because the performance objective functions in the target optimization problems are usually not explicitly defined, we build surrogates from the response data and attempt to mimic the true, yet unknown, performance response surfaces. The proposed surrogate-assisted tuning process is an iterative procedure: at each iteration, surrogates are updated and new experimental points are chosen based on the prediction uncertainties provided by the surrogate models, until a satisfactory solution is obtained. We propose two surrogate construction methods that adopt two infill criteria for tuning problems containing qualitative and quantitative parameters. The four variants of the proposed algorithm are used to optimize computational fluid dynamics simulation codes and artificial problems, illustrating the usefulness and strengths of the proposed algorithms.
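To illustrate the general shape of such a surrogate-assisted loop, here is a hedged sketch for a single quantitative parameter, using a Gaussian-process surrogate and a lower-confidence-bound infill rule. The talk's methods also handle qualitative parameters and use their own surrogates and infill criteria; everything below (function names, kernel, criterion) is my own illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def gp_fit_predict(X, y, Xs, length=1.0, noise=1e-6):
    """Gaussian-process surrogate with an RBF kernel: posterior mean and
    standard deviation at the candidate points Xs."""
    def k(a, b):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / length) ** 2)
    K = k(X, X) + noise * np.eye(len(X))
    Ks = k(Xs, X)
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, np.sqrt(np.maximum(var, 0.0))

# Unknown "performance" function of one quantitative tuning parameter
f = lambda x: (x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

cand = np.linspace(0.0, 1.0, 101)    # candidate parameter values
X = np.array([0.0, 0.5, 1.0])        # initial design
y = f(X)
for _ in range(10):                  # iterative infill
    mu, sd = gp_fit_predict(X, y, cand, length=0.2)
    score = mu - 2.0 * sd            # lower-confidence-bound criterion
    x_new = cand[np.argmin(score)]
    if np.any(np.isclose(X, x_new)): # avoid resampling the same point
        x_new = cand[rng.integers(len(cand))]
    X, y = np.append(X, x_new), np.append(y, f(x_new))

best = X[np.argmin(y)]
print("best parameter found:", best)
```

The prediction uncertainty `sd` is what drives exploration: points far from previous samples get a low (optimistic) score even if their predicted mean is mediocre, exactly the role the infill criteria play in the talk.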

* 14:30-14:45 Break

(Chair: Satoshi Ohshima, Kyushu University)
* 14:45-15:10

Study on the precision and the type of implementation of iterative methods for solving complex symmetric linear systems

Mr. Koki Masui (Graduate School of Information Science, Nagoya University, Japan), Prof. Masao Ogino (Information Technology Center, Nagoya University, Japan)

In solving complex symmetric linear systems arising from the edge finite element analysis of high-frequency electromagnetic fields, we suffer from slow convergence of iterative methods. On the other hand, by using high-precision calculation such as quadruple precision, we expect to improve the rate of convergence. However, since high-precision calculation requires much more computation time than double precision, it is necessary to implement high-precision complex number calculations efficiently. In this study, we focus on an implementation of a mixed-precision iterative method based on double-double (DD) precision complex number calculations. In particular, we consider optimizing DD/DD and DD/double arithmetic by using FMA instructions. Moreover, we carry out performance measurements on the latest computers with the developed codes.
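For reference, DD arithmetic builds on the standard error-free transformations (Knuth's two-sum and Dekker's split/two-prod), which FMA instructions can accelerate. A minimal real-valued sketch in Python, assuming round-to-nearest double arithmetic (illustrative only, not the authors' code):

```python
def two_sum(a, b):
    """Knuth's error-free addition: returns (s, e) with s + e == a + b exactly."""
    s = a + b
    bb = s - a
    return s, (a - (s - bb)) + (b - bb)

def split(a):
    """Dekker/Veltkamp split of a double into high and low halves."""
    t = 134217729.0 * a  # 2**27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    """Error-free multiplication: (p, e) with p + e == a * b exactly.
    On hardware with FMA, e = fma(a, b, -p) replaces the splitting below."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    return p, ((ah * bh - p) + ah * bl + al * bh) + al * bl

def dd_add(x, y):
    """Add two double-double numbers given as (hi, lo) pairs."""
    s, e = two_sum(x[0], y[0])
    e += x[1] + y[1]
    return two_sum(s, e)  # renormalize

print(dd_add((1.0, 2**-60), (2**-60, 0.0)))  # (1.0, 2**-59)
```

A DD complex multiply combines four `two_prod` calls plus DD additions, which is why reducing the instruction count of these kernels (e.g., via FMA) matters so much for the iterative solver.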

* 15:10-15:40

Halo communication using RDMA interface on FX10 supercomputer system

Prof. Akihiro Fujii (Department of Computer Science, Kogakuin University, Japan)


Sparse matrix-vector product (SpMV) is a key component of HPC applications. This presentation discusses reducing the communication time of SpMV. Recently, HPC applications are often executed with thousands of processes. Especially in strong-scaling settings, communication time reduction is one of the most important aspects of developing parallel software. The FX10 system has an RDMA (Remote Direct Memory Access) interface which allows users to perform direct RDMA communication from each application. This presentation evaluates the direct RDMA routines for SpMV halo communication in a strong-scaling setup.

* 15:40-15:50 Break

(Chair: Takahiro Katagiri, Nagoya University )

* 15:50-16:50 Invited Talk(5)

Parallel Singular Value Decomposition for Large Matrices by Multiple Random Sketches

Prof. Weichung Wang (Institute of Applied Mathematical Sciences, National Taiwan University, Taiwan)

The singular value decomposition (SVD) of large-scale matrices is a key tool in data analytics and scientific computing. The rapid growth in the size of matrices further increases the need for efficient SVD algorithms. Randomized SVD based on one-time sketching has been studied, and its potential has been demonstrated for computing a low-rank SVD. In this talk, we present a Monte Carlo type integrated SVD algorithm based on multiple random sketches. The proposed algorithm takes multiple random sketches and then integrates the results obtained from the sketched subspaces, so that the integrated SVD achieves higher accuracy and lower stochastic variation. The main components of the integration are an optimization problem and an averaging scheme over a matrix Stiefel manifold. In addition to theoretical and statistical analyses, we also consider practical algorithms suitable for parallel computers. The proposed algorithms can be implemented on the latest multi-core CPUs, many-core GPUs, and MPI-based clusters. Numerical results suggest that the proposed integrated SVD algorithms are promising.

This is a joint work with Ting-Li Chen and Su-Yun Huang at the Institute of Statistical Science, Academia Sinica, Da-Wei D. Chang, Hung Chen, Chen-Yao Lin, and Mu Yang at Institute of Applied Mathematical Sciences, National Taiwan University.
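As background, the one-time sketching baseline looks as follows in Python/NumPy (the standard Halko-Martinsson-Tropp scheme; the talk's multiple-sketch integration over the Stiefel manifold is not reproduced here, but it would run several such sketches independently and combine the resulting subspaces):

```python
import numpy as np

rng = np.random.default_rng(0)

def randomized_svd(A, rank, oversample=10):
    """One-sketch randomized SVD: project A onto the range of A @ Omega
    for a random test matrix Omega, then take an exact SVD of the small
    projected matrix."""
    m, n = A.shape
    Omega = rng.standard_normal((n, rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)   # sketch and orthonormalize
    B = Q.T @ A                      # small (rank+oversample) x n matrix
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ Ub)[:, :rank], s[:rank], Vt[:rank, :]

# Low-rank test matrix: the sketch captures its range almost surely
A = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 300))
U, s, Vt = randomized_svd(A, rank=10)
err = np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A)
print(f"relative error: {err:.2e}")  # near machine precision for exact rank 10
```

Each sketch needs only one pass of matrix products over A, which is what makes the scheme attractive on multi-core CPUs, GPUs, and MPI clusters; averaging multiple independent sketches then reduces the stochastic variation, as the talk discusses.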

* 16:50-17:00 Break

(Chair: Akihiro Fujii, Kogakuin University)
* 17:00-17:30

Optimizing Forward and Backward computations in the adjoint method via Multi-level Blocking

Mr. Tomoya Ikeda (Graduate School of Information Science, Nagoya University, Japan), Dr. Shin-ichi Ito (Earthquake Research Institute, The University of Tokyo, Japan), Prof. Hiromichi Nagao (Earthquake Research Institute / Graduate School of Information Science and Technology, The University of Tokyo, Japan), Prof. Takahiro Katagiri, Prof. Toru Nagai, Prof. Masao Ogino (Information Technology Center, Nagoya University, Japan)

Data Assimilation (DA) is a computational technique to integrate large-scale numerical simulations and observed data, and the Adjoint Method (AM) is classified as a non-sequential DA technique.

The target simulation model in our research is the phase-field model, which is often used to simulate the temporal evolution of the internal structures of materials. Since the phase-field model computes a continuous field, a naïve implementation of the AM requires an enormous amount of computation time. One reason for this increase in computation time is that the size of data required for simulations is much larger than the cache capacity of the computers.

To reduce memory access and achieve better performance, it is necessary to use computational blocking, which involves reusing data within the cache as much as possible. In order to improve the performance of the AM, we propose Multi-Level Blocking (MLB) to optimize Forward and Backward computations, which are the bottleneck in AM.

We investigated the effectiveness of the proposed method on the Fujitsu PRIMEHPC FX100 supercomputer, and attained a 1.18x speedup in the overall performance of the AM by applying MLB.

* 17:30-18:00


Communication-Avoiding CG Method

Prof. Reiji Suda (Graduate School of Information Science and Technology, The University of Tokyo, Japan)

As the number of processors is still expected to increase, the cost of data communication among processors is one of the major performance bottlenecks. In such situations, communication-avoiding algorithms, which require fewer or smaller communication messages, possibly at the cost of some increase in computation, may improve the effective performance. We are working on communication-avoiding CG methods. It is easy to block the CG iteration so as to construct a larger Krylov subspace with the same number of messages, but the straightforward way deteriorates numerical stability. We will report some techniques to improve numerical stability, and some results in seeking the best blocking sizes.
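The stability issue can be seen directly: the blocked (monomial) Krylov basis [v, Av, ..., A^s v] needs only one communication round per s steps, but its condition number grows rapidly with s, even with normalized columns. An illustrative sketch (my own example, not the talk's method):

```python
import numpy as np

rng = np.random.default_rng(0)

# SPD test matrix with condition number 1e4, and a random starting vector
n = 200
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1.0, 1e4, n)) @ Q.T
v = rng.standard_normal(n)

def monomial_basis(A, v, s):
    """Krylov basis [v, Av, ..., A^s v] with normalized columns:
    computable with one round of neighbor communication per s steps,
    but numerically fragile as s grows."""
    V = [v / np.linalg.norm(v)]
    for _ in range(s):
        w = A @ V[-1]
        V.append(w / np.linalg.norm(w))
    return np.column_stack(V)

for s in (2, 4, 8):
    cond = np.linalg.cond(monomial_basis(A, v, s))
    print(f"s = {s}: basis condition number = {cond:.1e}")
```

The successive powers align with the dominant eigenvectors, so the basis loses linear independence in finite precision; better-conditioned polynomial bases (e.g., Newton or Chebyshev) and careful orthogonalization are the kind of stability techniques the talk addresses.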

* 18:00– Closing Remarks
Prof. Takahiro Katagiri (Information Technology Center, Nagoya University, Japan)

* 18:30 - Reception (Speaker Only)


August 31st (Thu), 2017

*11:00-13:00 A Lunch Meeting
    @ Seminar room, 5F ITC, Nagoya University

*13:00-17:00 Group meeting
    @ Seminar Room, 5th Floor, Information Technology Center, Nagoya University

*17:30- Farewell Reception


* Reception:
We are planning a reception with speakers in Nagoya after finishing the first day of the workshop.

If you are interested in joining the reception, please contact Prof. Katagiri. His e-mail address is “katagiri _AT_”, where “_AT_” needs to be replaced with “@”.

The reception fee is around 5,000 yen per person. The venue is still being planned, but it will be inside or near Nagoya University.


The workshop has ended successfully.
Thank you very much for your participation!

The number of attendees: 21 (including 5 international participants) on August 29th (Tue), 2017.
The number of attendees: 20 (including 5 international participants) on August 30th (Wed), 2017.


[Latest updates] September 12th, 2017