Len WisniewskiLen Wisniewski is Director of Engineering at IQSS.  Before IQSS, Len was Engineering Manager for the High Performance Computing group at Sun Microsystems, leading the teams developing the Sun HPC ClusterTools and Sun Studio Performance Analyzer products.   In that role, he actively supported the Open MPI open-source community.   Before entering a management role, Len held software engineering roles at various companies, including Sun/OracleAcopia Networks (acquired by F5 Networks), and Thinking Machines (acquired by Sun).  His primary technical interests have been in developing algorithms and file systems for parallel and distributed computers.

Len is a native Philadelphian with a B.A. in mathematics and computer science from the Honors Program at LaSalle University and a Ph.D. in computer science from Dartmouth College.  Len also has taught computer science and software security practices for the Master of Science in Cybersecurity program at Northeastern University.

Publications

Steven Abramson, William Horka, and Leonard Wisniewski. 6/30/2014. “A Hybrid Cloud Architecture for a Social Science Research Computing Data Center.” In The 4th IEEE International Workshop on Data Center Performance (DCPerf '14) in the Proceedings of the 2014 IEEE 34th International Conference on Distributed Computing Systems Workshops (ICDCSW '14) , Pp. 45-50. Madrid, Spain: IEEE Computer Society Washington, DC, USA ©6/30/2014. Abstract
Research computing in the social sciences requires access to statistical software and quantitative tools that perform embarrassingly parallel computation at moderate scale, large memory to fit entire data sets, and secure storage for potentially confidential data. The Research Computing Environment (RCE) was designed as a three-tier system to satisfy these requirements in a transparent manner. We extend the RCE to use cloud resources while maintaining the transparency of the resources from the user. This paper describes this use case and highlights the significant resource management and networking decisions made when designing and implementing a hybrid cloud architecture for a research computing environment to support the social sciences.
Terry Dontje, Don Kerr, Daniel Lacher, Pak Lui, Ethan Mallove, Karen Norteman, Rolf Vandevaart, and Leonard Wisniewski. 2008. “Sun HPC ClusterToolsTM 7+: A Binary Distribution of Open MPI.” In Tools for High Performance Computing, edited by M. Resch, R. Keller, V. Himmler, B. Krammer, and A. Schulz, Pp. 3-18. Berlin, Heidelberg: Springer.Abstract
The Sun HPC ClusterTools 7 release was Sun’s first binary distribution of the Open MPI software. This release marked a change in source-code base for Sun from a proprietary code base derived from the Thinking Machines Corporation Globalworks™ software to the open-source Open MPI software. Sun HPC ClusterTools includes packages of binaries built from the Open MPI source code by the Sun™ Studio compilers and install scripts for installing those packages across a cluster of nodes. The Sun HPC ClusterTools team contributed a Sun Grid Engine plug-in and developed the uDAPL Byte Transfer Layer module as its Infiniband solution on Solaris™ operating system. Additionally, Sun HPC ClusterTools includes examples of using DTrace to analyze performance and debug MPI applications. Other product-focused activity included significant contribution to the development of the MPI Test Tool (MTT) and development of a set of user documentation. This paper describes the new Sun HPC ClusterTools based on Open MPI, focusing on areas where Sun has contributed to Open MPI.
Geeta Chaudhry, Thomas H. Cormen, and Leonard F. Wisniewski. 7/3/2001. “Columnsort Lives! An Efficient Out-of-Core Sorting Program.” In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '01 ), Pp. 169-178. Crete Island, Greece: ACM New York, NY, USA ©7/3/2001. Abstract
We present the design and implementation of a parallel out-of-core sorting algorithm, which is based on Leighton's columnsort algorithm. We show how to relax some of the steps of the original columnsort algorithm to permit a faster out-of-core implementation. Our algorithm requires only 4 passes over the data, and a 3-pass implementation is possible. Although there is a limit on the number of records that can be sorted—as a function of the memory used per processor—this upper limit need not be a severe restriction, and it increases superlinearly with the per-processor memory. To the best of our knowledge, our implementation is the first out-of-core multiprocessor sorting algorithm whose output is in the order assumed by the Parallel Disk Model. We define several measures of sorting efficiency and demonstrate that our implementation's sorting efficiency is competitive with that of NOW-Sort, a sorting algorithm developed to sort large amounts of data quickly on a cluster of workstations.
Thomas H. Cormen, Thomas Sundquist, and Leonard F. Wisniewski. 2/1999. “Asymptotically Tight Bounds for Performing BMMC Permutations on Parallel Disk Systems.” SIAM Journal on Computing, 28, 1, Pp. 105-136.Abstract
This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data.

The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations.

Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.

This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data.

The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations.

Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.




Read More: https://epubs.siam.org/doi/abs/10.1137/S0097539795283681

This paper presents asymptotically equal lower and upper bounds for the number of parallel I/O operations required to perform bit-matrix-multiply/complement (BMMC) permutations on the Parallel Disk Model proposed by Vitter and Shriver. A BMMC permutation maps a source index to a target index by an affine transformation over GF(2), where the source and target indices are treated as bit vectors. The class of BMMC permutations includes many common permutations, such as matrix transposition (when dimensions are powers of 2), bit-reversal permutations, vector-reversal permutations, hypercube permutations, matrix reblocking, Gray-code permutations, and inverse Gray-code permutations. The upper bound improves upon the asymptotic bound in the previous best known BMMC algorithm and upon the constant factor in the previous best known bit-permute/complement (BPC) permutation algorithm. The algorithm achieving the upper bound uses basic linear-algebra techniques to factor the characteristic matrix for the BMMC permutation into a product of factors, each of which characterizes a permutation that can be performed in one pass over the data.

The factoring uses new subclasses of BMMC permutations: memoryload-dispersal (MLD) permutations and their inverses. These subclasses extend the catalog of one-pass permutations.

Although many BMMC permutations of practical interest fall into subclasses that might be explicitly invoked within the source code, this paper shows how to quickly detect whether a given vector of target addresses specifies a BMMC permutation. Thus, one can determine efficiently at run time whether a permutation to be performed is BMMC and then avoid the general-permutation algorithm and save parallel I/Os by using the BMMC permutation algorithm herein.




Read More: https://epubs.siam.org/doi/abs/10.1137/S0097539795283681
Len Wisniewski, Brad Smisloff, and Nils Nieuwejaar. 11/1999. “Sun MPI I/O: Efficient I/O for Parallel Applications.” In SC '99: Proceedings of the 1999 ACM/IEEE Conference on Supercomputing. Portland, OR, USA, USA: IEEE.Abstract
Many parallel applications require high-performance I/O to avoid negating some or all of the benefit derived from parallelizing its computation. When these applications are run on a loosely-coupled cluster of SMPs, the limitations of existing hardware and software present even more hurdles to performing high-performance I/O. In this paper, we describe our full implementation of the I/O portion of the MPI-2 specification. In particular, we discuss the limitations inherent in performing high-performance I/O on a cluster of SMPs and demonstrate the benefits of using a cluster-based filesystem over a traditional node-based filesystem.
Leonard F. Wisniewski. 3/1996. “Efficient Design and Implementation of Permutation Algorithms on the Memory Hierarchy.” Dartmouth College Department of Computer Science, Hanover, NH, USA ©3/1996. Abstract
For many scientific applications, the data set cannot entirely fit in main memory. The data must reside out-of-core, i.e., on parallel disks. For many basic data-movement operations such as permuting, if the programmer does not design efficient algorithms for accessing data on the parallel disks, the penalty can be quite severe. As processor speeds increase more rapidly than disk speeds, additional disk accesses become even more costly in a relative sense. In this thesis, we focus on efficiently performing the data movement for permutations when the data reside out-of-core.

We present a unified approach for performing bit-matrix-multiply/complement (BMMC) permutations at different levels of the memory hierarchy. Our unified approach uses a linear-algebraic technique to decompose an arbitrary BMMC permutation into a number of permutations which we know how to perform efficiently. We show that this technique is flexible enough to apply at the following levels of memory abstraction: parallel disk access, interprocessor communication on distributed memory multiprocessors (with the processors connected by either a mesh or a multistage network), uniprocessor memory access, and the design of combinational circuits. The BMMC permutations include commonly used permutations such as matrix transposition, bit-reversal permutations (used in performing FFTs), vector-reversal permutations, hypercube permutations, matrix reblocking, and permutations used by fast cosine transforms (FCTs).

This thesis presents additional work at the parallel disk level of abstraction. We define in-place algorithms for permuting out-of-core data and show how to efficiently perform BMMC, mesh, and torus permutations in place. We also define an application user interface (API) for performing parallel data accesses in the manner suggested by Vitter and Shriver's Parallel Disk Model (PDM).

Matt Bishop, Mark Valence, and Leonard F. Wisniewski. 8/1995. Process Migration for Heterogeneous Distributed Systems. Hanover, NH: Dartmouth College.Abstract
The policies and mechanisms for migrating processes in a distributed system become more complicated in a heterogeneous environment, where the hosts may differ in their architecture and operating systems. These distributed systems include a large quantity and great diversity of resources which may not be fully utilized without the means to migrate processes to the idle resources. In this paper, we present a graph model for single process migration which can be used for load balancing as well as other non-traditional scenarios such as migration during the graceful degradation of a host. The graph model provides the basis for a layered approach to implementing the mechanisms for process migration in a Heterogeneous Migration Facility (HMF). HMF provides the user with a library to automatically migrate processes and checkpoint data.
Sean S.B. Moore and Leonard F. Wisniewski. 10/1995. Complexity Analysis of Two Permutations Used by Fast Cosine Transform Algorithms. Hanover, NH: Dartmouth College.Abstract
Recently developed fast cosine transform (FCT) algorithms require fewer operations than any other known general algorithm. Similar to related fast transform algorithms (e.g., the FFT), these algorithms permute the data before, during, or after the computation of the transform. The choice of this permutation may be an important consideration in reducing the complexity of the permutation algorithm. In this paper, we derive the complexity to generate the permutation mappings used in these FCT algorithms for power-of-2 data sets by representing them as linear index transformations and translating them into combinational circuits. Moreover, we show that one of these permutations not only allows efficient implementation, but is also self-invertible, i.e., we can use the same circuit to generate the permutation mapping for both the fast cosine transform and its inverse, like the bit-reversal permutation used by FFT algorithms. These results may be useful to designers of low-level algorithms for implementing fast cosine transforms.
Elizabeth A. M. Shriver and Leonard F. Wisniewski. 11/1995. An API for Choreographing Data Access. Hanover, NH: Dartmouth College.Abstract

Current APIs for multiprocessor multi-disk file systems are not easy to use in developing out-of-core algorithms that choreograph parallel data accesses. Consequently, the efficiency of these algorithms is hard to achieve in practice. We address this deficiency by specifying an API that includes data-access primitives for data choreography. With our API, the programmer can easily access specific blocks from each disk in a single operation, thereby fully utilizing the parallelism of the underlying storage system.

Our API supports the development of libraries of commonly-used higher-level routines such as matrix-matrix addition, matrix-matrix multiplication, and BMMC (bit-matrix-multiply/complement) permutations. We illustrate our API in implementations of these three high-level routines to demonstrate how easy it is to use.