Graduate Course in Cloud Computing Resource Management (6 Credits)

Johan Eker

Cloud computing is rapidly transforming the ICT industry and moving compute and storage to data centers. A main driving force is cost reduction: moving CAPEX to OPEX and providing computational power in a pay-as-you-go fashion. A key reason for the decreased cost is improved utilization of hardware. However, major cloud vendors still report average utilization levels below 30%. To improve efficiency, scheduling of hardware resources is an important topic and research area. This course gives an overview of cloud computing and popular computational models, while diving deep into state-of-the-art systems for cloud resource management.

The course is set up as a reading group in which all participants are expected to have read all papers and to be able to present any given part of the current paper to the group. In addition, a cloud resource management project corresponding to one week of full-time work is required. The project details are defined together with the course leader on a per-project basis.

Meeting #1 - Sept 25

Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation (NSDI'11). USENIX Association, Berkeley, CA, USA, 295-308.

Meeting #2 - Oct 2

Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems (EuroSys '13). ACM, New York, NY, USA, 351-364. DOI=10.1145/2465351.2465386

Meeting #3 - Oct 9
D. R. Engler, M. F. Kaashoek, and J. O'Toole, Jr. 1995. Exokernel: an operating system architecture for application-level resource management. In Proceedings of the fifteenth ACM symposium on Operating systems principles (SOSP '95), Michael B. Jones (Ed.). ACM, New York, NY, USA, 251-266. DOI=10.1145/224056.224076

Meeting #4 - Oct 21
Christina Delimitrou and Christos Kozyrakis. 2014. Quasar: resource-efficient and QoS-aware cluster management. In Proceedings of the 19th international conference on Architectural support for programming languages and operating systems (ASPLOS '14). ACM, New York, NY, USA, 127-144. DOI=10.1145/2541940.2541941

Meeting #5 - Nov 6
Anshul Gandhi, Mor Harchol-Balter, Ram Raghunathan, and Michael A. Kozuch. 2012. AutoScale: Dynamic, Robust Capacity Management for Multi-Tier Data Centers. ACM Trans. Comput. Syst. 30, 4, Article 14 (November 2012), 26 pages. DOI=10.1145/2382553.2382556

Meeting #6 - Nov 13
Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache Hadoop YARN: yet another resource negotiator. In Proceedings of the 4th annual Symposium on Cloud Computing (SOCC '13). ACM, New York, NY, USA, Article 5, 16 pages. DOI=10.1145/2523616.2523633

Meeting #7 - Nov 20
Michael Isard, Vijayan Prabhakaran, Jon Currey, Udi Wieder, Kunal Talwar, and Andrew Goldberg. 2009. Quincy: fair scheduling for distributed computing clusters. In Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles (SOSP '09). ACM, New York, NY, USA, 261-276. DOI=10.1145/1629575.1629601

Meeting #8 - Nov 27
Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Sethuraman Subbiah, and John Wilkes. 2013. AGILE: Elastic distributed resource scaling for infrastructure-as-a-service. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC '13). 69-82.

Meeting #9 - Dec 4
Rishi Kapoor, George Porter, Malveeka Tewari, Geoffrey M. Voelker, and Amin Vahdat. 2012. Chronos: predictable low latency for data center applications. In Proceedings of the Third ACM Symposium on Cloud Computing (SoCC '12). ACM, New York, NY, USA, Article 9, 14 pages. DOI=10.1145/2391229.2391238

Meeting #10 - Dec 11
Kay Ousterhout, Patrick Wendell, Matei Zaharia, and Ion Stoica. 2013. Sparrow: distributed, low latency scheduling. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP '13). ACM, New York, NY, USA, 69-84. DOI=10.1145/2517349.2522716

Meeting #11 - Dec 18: Project Presentations

Project proposals: