Fault Tolerance in High Performance Computing: MPI and Checkpoint/Restart

Modern HPC applications must tolerate inevitable
faults if they are to make full use of current and
future HPC systems. Detecting and responding to
such failures in distributed systems poses complex
and intriguing research questions. Researchers
at Indiana University are leading the development
of transparent checkpoint/restart fault tolerance
in Open MPI. Its novel architecture lets
applications use the fault tolerance services that
Open MPI provides without modification, across the
variety of interconnects Open MPI supports, including
InfiniBand, Myrinet, shared memory, and Ethernet.


Friday, November 14, 2008

