Gonzalo P. Rodrigo Álvarez

PhD Candidate, gonzalo@cs.umu.se
Department of Computing Science.
Umeå University, SE-901 87
Umeå, Sweden

Supervisors

Prof. Erik Elmroth, Umeå University.
Dr. Lavanya Ramakrishnan, Lawrence Berkeley National Lab.
Dr. P-O Östberg, Umeå University.


Thesis: manuscript, publications, and tools


HPC Scheduling in a Brave New World: Thesis Full Text

Summary: This thesis focuses on understanding which new scheduling models are, and will be, required for future HPC systems. It starts by presenting how workloads have evolved over the lifetime of recent and current systems (Paper 1) and identifies specific new workload challenges that affect scheduling performance (Paper 1). It then analyzes and proposes general scheduling models for HPC systems (Papers 2 and 3). Next, it presents the set of tools that we have developed to perform scheduling research (Paper 4). Finally, it presents a new scheduling algorithm for one of the identified challenges: efficient scheduling of workflows (Paper 5).

In addition to the publications, the outcome of this thesis includes two open source projects:


Paper 1: Understanding the workload towards future systems.

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2016, May). Towards Understanding HPC Users and Systems: A NERSC Case Study. Submitted to JPDC (Journal of Parallel and Distributed Computing). Full Text-Draft

The journal paper compiles two earlier short papers and adds some original work. The two papers are:

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2016, May). Towards Understanding Job Heterogeneity in HPC: A NERSC Case Study. 6th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016). Full Text

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2015, June). HPC System Lifetime Story: Workload Characterization and Evolutionary Analyses on NERSC Systems. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2015) (pp. 57-60). ACM. Full Text


Papers 2 and 3: Understanding scheduling mechanisms and their role in current and future HPC systems.

Rodrigo Álvarez, G. P., Östberg, P-O., Elmroth, E. (2014). Priority Operators for Fairshare Scheduling. 18th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2014) co-located with the IPDPS 2014 conference. Full Text

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Ramakrishnan, L. (2015, June). A2L2: An Application Aware Flexible HPC Scheduling Model for Low-Latency Allocation. In Proceedings of the 8th International Workshop on Virtualization Technologies in Distributed Computing (VTDC 2015) (pp. 11-19). ACM. Co-located with the HPDC 2015 conference. Full Text


Paper 4: Creation of a scheduling research tool.

Rodrigo Álvarez, G. P., Elmroth, E., Östberg, P. O., Ramakrishnan, L. ScSF: A Scheduling Simulation Framework. 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2017) co-located with the IPDPS 2017 conference. Full Text


Paper 5: A new scheduling algorithm for workflows in HPC systems.

Rodrigo Álvarez, G. P., Elmroth, E., Östberg, P. O., Ramakrishnan, L. Enabling workflow aware scheduling on HPC systems. 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2017). Full Text



Technical reports produced during thesis work


Rodrigo Álvarez, G. P. Establishing the equivalence between operators: theorem to establish a sufficient condition for two operators to produce the same ordering in a Fairshare prioritization system. January 2014. Full Text

Rodrigo Álvarez, G. P. Proof of compliance for the relative operator on the proportional distribution of unused share in an ordering fairshare system. January 2014. Full Text



Scheduling for future HPC systems. Swedish e-Science Academy 2016 - eSSENCE, Lund, Sweden, October 12, 2016. Video presentation. Slides

Towards understanding today’s and tomorrow’s scheduling challenges in HPC systems. NorduGrid 2016, Košice, Slovakia, 3 June, 2016. Slides

Towards understanding today’s and tomorrow’s scheduling challenges in HPC systems. Mid-Thesis seminar, February 2016. Slides

Analysis of job traces from Carver, Hopper, and Edison. Brown-bag seminar at NERSC, Oakland, California. May 2014. Slides


Open source projects part of thesis work


WoAS, Workflow Aware System (Slurm): Scheduling plug-in for Slurm to support workflow aware jobs, i.e. a new way to run static workflows with fine-grained resource allocation without long turnaround times. Fork of the Slurm project. Owner, in the process of being released.

ScSF, a Scheduling Simulation Framework: Tool set to perform scheduling research, including workload modeling, generation, analysis, and HPC system simulation (based on Slurm). Includes an orchestration layer to deploy concurrent simulations over distributed resources. Owner, in the process of being released.

QDO (kew-doo): a lightweight high-throughput queuing system for workflows that have many small tasks to perform. Contributor.

qdo webserver: A REST API to execute QDO remotely on a yet more remote server through NEWT or SSH. Main contributor.

sremote: Simple remote is a Python library to run Python code remotely. It is simple, it deploys the code itself, and the communication channel can be anything that allows copying and reading files and command line execution. Includes NEWT and SSH connectors. Owner.


Other relevant work



Work as reviewer for conferences and journals



Research visits and internships


2016, 6 months: Systems engineer at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. Employed by LBNL. Supervised by L. Ramakrishnan

2015, 6 months: PhD student intern at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. 95% employed by LBNL, 5% employed by UmU. Supervised by L. Ramakrishnan

2014, 4 months: Software Engineering intern at Google Inc. Cluster management group in Mountain View, CA. Work performed on auto-scaling of data-intensive workflows. Employed by Google Inc. Supervised by J. Wilkes

2014, 5 months: Visiting PhD student at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. Funded by the Berkeley exchange scholarship for PhD studies of the Faculty of Sciences and Technology, Umeå University. Supervised by L. Ramakrishnan