Gonzalo P. Rodrigo Álvarez
PhD Candidate
gonzalo@cs.umu.se
Department of Computing Science.
Umeå University, SE-901 87
Umeå, Sweden
Supervisors
Prof. Erik Elmroth, Umeå University.
Dr. Lavanya Ramakrishnan, Lawrence Berkeley National Lab.
Dr. P-O Östberg, Umeå University.
Thesis: manuscript, publications, and tools
HPC Scheduling in a Brave New World: Thesis Full Text
Summary: This thesis focuses on understanding which new scheduling models are, and will be, required for future HPC systems. It starts by presenting how workloads have evolved over the lifetime of recent and current systems (Paper 1). It identifies new, specific workload challenges that affect scheduling performance (Paper 1). It then analyzes and proposes general scheduling models for HPC systems (Papers 2 and 3). Next, it presents the set of tools that we have developed to perform scheduling research (Paper 4). Finally, it ends by presenting a new scheduling algorithm for one of the identified challenges: efficient scheduling of workflows (Paper 5).
In addition to the publications, the outcome of this thesis includes two open source projects:
- ScSF, a scheduling simulation framework that will provide the community with tools to perform scheduling research: workload modeling, generation, simulation, and analysis.
- WoAS, a workflow-aware scheduling algorithm implementation integrated in Slurm to provide short workflow turnaround times without over-allocating resources.
Paper 1: Understanding the workload towards future systems.
Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2016, May). Towards Understanding HPC Users and Systems: A NERSC Case Study. Submitted to JPDC (Journal of Parallel and Distributed Computing). Full Text-Draft
The journal paper compiles two earlier short papers and adds some original work. These are the papers:
Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2016, May). Towards Understanding Job Heterogeneity in HPC: A NERSC Case Study. 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016). Full Text
Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2015, June). HPC System Lifetime Story: Workload Characterization and Evolutionary Analyses on NERSC Systems. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2015) (pp. 57-60). ACM. Full Text
Papers 2 and 3: Understanding scheduling mechanisms and their role in current and future HPC systems.
Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E. (2014). Priority Operators for Fairshare Scheduling. 18th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2014), co-located with the IPDPS 2014 conference. Full Text
Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Ramakrishnan, L. (2015, June). A2L2: An Application Aware Flexible HPC Scheduling Model for Low-Latency Allocation. In Proceedings of the 8th International Workshop on Virtualization Technologies in Distributed Computing (VTDC 2015) (pp. 11-19). ACM. Co-located with the HPDC 2015 conference. Full Text
Paper 4: Creation of a scheduling research tool.
Rodrigo Álvarez, G. P., Elmroth, E., Östberg, P. O., Ramakrishnan, L. ScSF: A Scheduling Simulation Framework. 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2017), co-located with the IPDPS 2017 conference. Full Text
Paper 5: A new scheduling algorithm for workflows in HPC systems.
Rodrigo Álvarez, G. P., Elmroth, E., Östberg, P. O., Ramakrishnan, L. Enabling workflow aware scheduling on HPC systems. 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2017). Full Text
Technical reports produced during thesis work
Rodrigo Álvarez, G. P. Establishing the equivalence between operators: theorem to establish a sufficient condition for two operators to produce the same ordering in a Fairshare prioritization system. January 2014. Full Text
Rodrigo Álvarez, G. P. Proof of compliance for the relative operator on the proportional distribution of unused share in an ordering fairshare system. January 2014. Full Text
Seminars and talks related to thesis work
Scheduling for future HPC systems. Swedish e-Science Academy 2016 - eSSENCE, Lund, Sweden, October 12, 2016. Video presentation. Slides
Towards understanding today’s and tomorrow’s scheduling challenges in HPC systems. NorduGrid 2016, Košice, Slovakia, June 3, 2016. Slides
Towards understanding today’s and tomorrow’s scheduling challenges in HPC systems. Mid-Thesis seminar, February 2016. Slides
Analysis of job traces from Carver, Hopper, and Edison. Brown-bag seminar at NERSC, Oakland, California, May 2014. Slides
Open source projects part of thesis work
WoAS, Workflow Aware System (Slurm): Scheduling plug-in for Slurm to support workflow-aware jobs, i.e. a new way to run static workflows with fine-grained resource allocation and without long turnaround times. Fork of the Slurm project. Owner, in the process of being released as open source.
ScSF, a Scheduling Simulation Framework: Tool set to perform scheduling research, including workload modeling, generation, analysis, and HPC system simulation (based on Slurm). Includes an orchestration layer to deploy concurrent simulations over distributed resources. Owner, in the process of being released as open source.
QDO (kew-doo): a lightweight high-throughput queuing system for workflows that have many small tasks to perform. Contributor.
qdo webserver: A REST API to execute QDO remotely on another, more remote server through NEWT or SSH. Main contributor.
sremote: Simple remote is a Python library to run Python code remotely. It is simple, deploys the code itself, and can use any communication channel that allows copying and reading files and executing commands on the command line. Includes NEWT and SSH connectors. Owner.
Other relevant work
- Planning committee member at Supercomputing 2017, Denver, CO, USA: SCinet, Architecture.
- Program committee member at 10th Cloud Control Workshop, 2017, Umeå, Sweden.
- Program committee member at CCGRID 2017, Madrid, Spain: Scheduling and Resource Management track.
- Planning committee member at Supercomputing 2016, Salt Lake City, UT, USA: SCinet, WAN Transport.
- Program committee member at 8th Cloud Control Workshop, 2016, Lövånger, Sweden.
- Student volunteer at Supercomputing 2015, Austin, TX, USA: SCinet, WAN Transport.
Work as reviewer for conferences and journals
- 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017).
- 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016).
- Supercomputing (SC) 2015.
- 8th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2015).
- 35th IEEE International Conference on Distributed Computing Systems (ICDCS 2015).
- IEEE Transactions on Cloud Computing (TCC), 2014.
Research visits and internships
2016, 6 months: Systems engineer at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. Employed by LBNL. Supervised by L. Ramakrishnan.
2015, 6 months: PhD student intern at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. 95% employed by LBNL, 5% employed by UmU. Supervised by L. Ramakrishnan.
2014, 4 months: Software Engineering intern at Google Inc. Cluster management group in Mountain View, CA. Work performed on auto-scaling of data-intensive workflows. Employed by Google Inc. Supervised by J. Wilkes.
2014, 5 months: Visiting PhD student at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. Funded by the Berkeley exchange scholarship for PhD studies of the Faculty of Sciences and Technology, Umeå University. Supervised by L. Ramakrishnan.