Summary

(2017-2018) Postdoctoral researcher at the Data Science and Technology department at the Lawrence Berkeley National Lab.
(2012-2017) PhD in Computing Science: Distributed Systems group. Department of Computing Science, Umeå University, Umeå, Sweden.
(2011-2012) Executive MBA: ESIC Zaragoza, Spain. (Focus on technological businesses).
(1998-2004) Master’s degree in Computer Engineering: Universidad De Zaragoza, Zaragoza, Spain.
(2003-2004) Master’s Thesis: Systemteknik/Datalogi, Luleå Universitet. Luleå, Sweden.

Science Search project

Institutions: Lawrence Berkeley National Lab, NERSC, and UC Berkeley.
Principal Investigators: Katie Antypas (NERSC), Lavanya Ramakrishnan (LBNL), and Joseph M. Hellerstein (UC Berkeley).

Scientific facilities produce large datasets, and individual data elements become increasingly hard to find as the data ages. While explicit data tagging could solve the problem, consistency and quality are hard to achieve across different scientific groups, even in the same field. Science Search proposes to create a search engine for scientific datasets by applying machine learning to the metadata surrounding individual data pieces to infer their content and semantics.

Postdoctoral experience in this project includes: web-service development (Django), NLP analysis, use of deep learning (TensorFlow), design of semantic models, and continuous work with large and diverse scientific datasets.

Doctoral Thesis

Thesis title: Scheduling in a Brave new World (Thesis Full Text): This thesis focuses on understanding which new scheduling models are, and will be, required for future HPC systems. It starts by presenting how workloads have evolved over the lifetime of recent and current systems, and identifies new workload challenges that affect scheduling performance. It then analyzes and proposes general scheduling models for HPC systems. Next, it presents the set of tools that we have developed to perform scheduling research. Finally, it presents a new scheduling algorithm for one of the identified challenges: efficient scheduling of workflows.

In addition to seven peer-reviewed publications, the outcome of this thesis includes two open source projects:

WoAS, Workflow Aware System (Slurm): Scheduling plug-in for Slurm to support workflow-aware jobs, i.e. a new way to run static workflows with fine-grained resource allocation without long turnaround times. Fork of the Slurm project. Project lead. Download

ScSF, a Scheduling Simulation Framework: Tool set to perform scheduling research including workload modeling, generation, analysis, and HPC system simulation (based on Slurm). Includes an orchestration layer to deploy current simulations over distributed resources. Project lead. Download

Conference, workshop, and journal publications.

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2016, May). Towards Understanding HPC Users and Systems: A NERSC Case Study. In Journal of Parallel and Distributed Computing (JPDC), Volume 111, 2018, Pages 206-221, ISSN 0743-7315. Full Text

Rodrigo Álvarez, G. P., Elmroth, E., Östberg, P. O., Ramakrishnan, L. Enabling workflow aware scheduling on HPC systems. 26th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2017). Full Text

Rodrigo Álvarez, G. P., Elmroth, E., Östberg, P. O., Ramakrishnan, L. ScSF: A Scheduling Simulation Framework. 21st Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2017), co-located with the IPDPS 2017 conference. Full Text

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2016, May). Towards Understanding Job Heterogeneity in HPC: A NERSC Case Study. 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016). Full Text

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Antypas, K., Gerber, R., Ramakrishnan, L. (2015, June). HPC System Lifetime Story: Workload Characterization and Evolutionary Analyses on NERSC Systems. In Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2015) (pp. 57-60). ACM. Full Text

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E., Ramakrishnan, L. (2015, June). A2L2: An Application Aware Flexible HPC Scheduling Model for Low-Latency Allocation. In Proceedings of the 8th International Workshop on Virtualization Technologies in Distributed Computing (VTDC 2015) (pp. 11-19). ACM. Co-located with the HPDC 2015 conference. Full Text

Rodrigo Álvarez, G. P., Östberg, P. O., Elmroth, E. (2014). Priority Operators for Fairshare Scheduling. 18th Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2014), co-located with the IPDPS 2014 conference. Full Text

Technical reports

Rodrigo Álvarez, G. P. Establishing the equivalence between operators: theorem to establish a sufficient condition for two operators to produce the same ordering in a Fairshare prioritization system. January 2014. Full Text

Rodrigo Álvarez, G. P. Proof of compliance for the relative operator on the proportional distribution of unused share in an ordering fairshare system. January 2014. Full Text

Seminars and talks

Scheduling for future HPC systems. Swedish e-Science Academy 2016 - eSSENCE, Lund, Sweden, October 12, 2016. Video presentation. Slides

Towards understanding today’s and tomorrow’s scheduling challenges in HPC systems. NorduGrid 2016, Košice, Slovakia, June 3, 2016. Slides

Towards understanding today’s and tomorrow’s scheduling challenges in HPC systems. Mid-Thesis seminar, February 2016. Slides

Analysis of job traces from Carver, Hopper, and Edison. Brown-bag seminar at NERSC, Oakland, California, May 2014. Slides

Open source projects

WoAS, Workflow Aware System (Slurm): Scheduling plug-in for Slurm to support workflow-aware jobs, i.e. a new way to run static workflows with fine-grained resource allocation without long turnaround times. Fork of the Slurm project. Project lead. Download

ScSF, a Scheduling Simulation Framework: Tool set to perform scheduling research including workload modeling, generation, analysis, and HPC system simulation (based on Slurm). Includes an orchestration layer to deploy current simulations over distributed resources. Project lead. Download

Docker containers for data transfer applications: Collection of Docker “recipes” to ease the deployment of scientific data transfer applications, developed during volunteer work in SCinet: Globus Connect Server, Grid-ftp+Jupyter, and a Jupyter notebook to monitor file transfers (github).

QDO (kew-doo): a lightweight high-throughput queuing system for workflows that have many small tasks to perform. Contributor.

qdo webserver: A REST API to execute QDO remotely on a yet more remote server through NEWT or SSH. Main contributor.

sremote: Simple remote is a Python library to run Python code remotely. It is simple, it deploys the code itself, and its communication channel can be anything that allows copying and reading files and command-line execution. Includes NEWT and SSH connectors. Owner.

Other relevant academic work

  • Program committee member at 22nd Workshop on Job Scheduling Strategies for Parallel Processing (JSSPP 2018), May 2018.

  • Planning committee member at Super Computing 2017, Denver, CO, USA: SCinet, Architecture, and DTN depl

  • Program committee member at 10th Cloud Control Workshop, 2017, Umeå, Sweden.

  • Program committee member at CCGRID 2017, Madrid, Spain: Scheduling and Resource management track.

  • Planning committee member at Super Computing 2016, Salt Lake City, UT, USA: SCinet, WAN Transport.

  • Program committee member at 8th Cloud Control Workshop, 2016, Lövånger, Sweden.

  • Student Volunteer at Super Computing 2015, Austin, TX, USA: SCinet, WAN Transport.

Work as reviewer for conferences and journals

  • 37th IEEE International Conference on Distributed Computing Systems (ICDCS 2017).

  • 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2016).

  • Super Computing (SC) 2015.

  • 8th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2015).

  • 35th IEEE International Conference on Distributed Computing Systems (ICDCS 2015).

  • IEEE Transactions on Cloud Computing (TCC), 2014.

Research visits and internships

2016, 6 months: Systems engineer at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. Employed by LBNL. Supervised by L. Ramakrishnan.

2015, 6 months: PhD student intern at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. 95% employed by LBNL, 5% employed by UmU. Supervised by L. Ramakrishnan.

2014, 4 months: Software Engineering intern at Google Inc. Cluster management group in Mountain View, CA. Work performed on auto-scaling of data-intensive workflows. Employed by Google Inc. Supervised by J. Wilkes.

2014, 5 months: Visiting PhD student at Lawrence Berkeley National Lab. Data Science and Technology department, CRD. Funded by the Berkeley exchange scholarship for PhD studies of the Faculty of Sciences and Technology, Umeå University. Supervised by L. Ramakrishnan.