
Keras Tutorial: Checkpointing distributed models with Orbax

Don't let device failures or power outages ruin your training runs. In this tutorial, Yufeng Guo demonstrates how to use Keras with the Orbax checkpointing library. Learn how to implement a custom checkpoint manager and Keras callbacks to ensure your model state is always safely stored.

0:00 Introduction to Orbax & Keras Integration
0:39 Exploring Keras Checkpointing
1:11 Why Extend Keras for Multi-Host Environments?
1:48 What is Orbax?
2:29 Building Utility Classes: KerasOrbaxCheckpointManager & OrbaxCheckpointCallback
2:57 Deep Dive into KerasOrbaxCheckpointManager
3:45 Coding the Get, Save, and Restore State Functions
4:37 Implementing the OrbaxCheckpointCallback
5:12 Protecting Against Device Failures & Preemption
5:31 Implementation Details & Model.fit Integration
6:07 Checkpointing in Action: File Directory Walkthrough
6:56 Summary & Final Tips
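
The manager-plus-callback pattern named in the chapters above can be sketched roughly as follows. This is a stdlib-only illustration under stated assumptions, not the code from the video: it uses neither Keras nor Orbax, and `SimpleCheckpointManager`, `CheckpointCallback`, `max_to_keep`, `get_state`, and the `step_N.json` file layout are hypothetical names chosen for the sketch. In a real setup the callback would presumably subclass `keras.callbacks.Callback` and the manager would delegate to Orbax instead of writing JSON files.

```python
import json
import os
import tempfile


class SimpleCheckpointManager:
    """Hypothetical stand-in for a checkpoint manager (NOT the Orbax API)."""

    def __init__(self, directory, max_to_keep=2):
        if max_to_keep < 1:
            raise ValueError("max_to_keep must be at least 1")
        self.directory = directory
        self.max_to_keep = max_to_keep
        os.makedirs(directory, exist_ok=True)

    def _path(self, step):
        return os.path.join(self.directory, f"step_{step}.json")

    def all_steps(self):
        """Return the saved step numbers in ascending order."""
        steps = []
        for name in os.listdir(self.directory):
            if name.startswith("step_") and name.endswith(".json"):
                steps.append(int(name[len("step_"):-len(".json")]))
        return sorted(steps)

    def save(self, step, state):
        # Write to a temporary file, then rename it into place, so an
        # interruption mid-write does not leave a partial checkpoint.
        tmp = self._path(step) + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self._path(step))
        # Keep only the newest `max_to_keep` checkpoints.
        for old in self.all_steps()[:-self.max_to_keep]:
            os.remove(self._path(old))

    def restore_latest(self):
        """Return (step, state) for the newest checkpoint, or (None, None)."""
        steps = self.all_steps()
        if not steps:
            return None, None
        with open(self._path(steps[-1])) as f:
            return steps[-1], json.load(f)


class CheckpointCallback:
    """Mirrors the on_epoch_end hook of a Keras-style callback (no Keras import)."""

    def __init__(self, manager, get_state):
        self.manager = manager
        self.get_state = get_state  # hypothetical: returns a JSON-serializable dict

    def on_epoch_end(self, epoch, logs=None):
        self.manager.save(epoch, self.get_state())


def demo():
    # A toy "model state" that changes every epoch.
    state = {"weights": [0.0, 0.0], "epoch": -1}
    with tempfile.TemporaryDirectory() as directory:
        manager = SimpleCheckpointManager(directory, max_to_keep=2)
        callback = CheckpointCallback(manager, lambda: state)
        for epoch in range(4):
            state["weights"] = [w + 1.0 for w in state["weights"]]
            state["epoch"] = epoch
            callback.on_epoch_end(epoch)
        kept = manager.all_steps()
        step, restored = manager.restore_latest()
    return kept, step, restored
```

Calling `demo()` runs four toy epochs, keeps only the two newest checkpoints, and restores the last one. Saving from a per-epoch hook is what lets a run resume from the latest stored state after a device failure or preemption, rather than starting over.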

Resources:
Orbax checkpointing in Keras - Developer guide → https://goo.gle/40T2LI8
ModelCheckpoint - Keras 3 API documentation → https://goo.gle/3PkAlEq


Subscribe to Google for Developers → https://goo.gle/developers

Speaker: Yufeng Guo
Products Mentioned: Google AI