Computer Science Department
Swarthmore College
500 College Avenue
Swarthmore, PA 19081
Phone: 610.328.8272
Fax: 610.328.8606
Email: info at cs.swarthmore.edu
Talk by Alison Norman, Department of Computer Science at the University of Texas at Austin
Towards Scalable Checkpointing in Supercomputing Applications
Thursday, February 17, 2011
SCI 240, 4:00 pm (refreshments at 3:45)
Abstract
Long-running parallel applications must occasionally save their state in a "checkpoint" so that the computation can be recovered after a failure in software, hardware, or environment (e.g., power). However, current checkpointing methods are becoming untenable for large-scale parallel applications on supercomputers. Many applications checkpoint all the parallel processes simultaneously, a technique that is easy to implement but can saturate the network and file system, causing a significant increase in checkpoint overhead.
This talk introduces "compiler-assisted staggered checkpointing", where processes can checkpoint at different places in the application text, thereby reducing contention for the network and file system. Placing staggered checkpoints is algorithmically challenging because the number of possible solutions is enormous and the number of desirable solutions is small. We have developed a compiler algorithm that both places staggered checkpoints in an application and ensures that the solution is desirable. This algorithm successfully places staggered checkpoints in parallel applications configured to use tens of thousands of processes. For our benchmarks, the checkpoints it finds and places complete significantly faster than those of the current state of the art.