Modern HPC applications must be able to tolerate
inevitable faults if they are to harness current and
future HPC systems. Detecting and responding to
such failures in distributed systems poses complex
and intriguing research questions. Researchers
at Indiana University are leading the development of
transparent checkpoint/restart fault tolerance in
Open MPI. Their novel architecture lets applications
transparently take advantage of Open MPI's fault
tolerance services across the variety of interconnects
it supports, including InfiniBand, Myrinet, shared
memory, and Ethernet.