It is no exaggeration to say that ChatGPT and similar models have had a revolutionary impact on the online world. To make ChatGPT-style models more accessible, the AI open-source community has launched several initiatives (such as ChatLLaMa, Alpaca, etc.). These models are remarkably versatile, handling tasks such as coding, translation, and summarisation at or even beyond human levels of proficiency.
Despite these commendable efforts, a strong ChatGPT-like model still cannot be trained through a publicly accessible end-to-end RLHF pipeline. Even with access to multi-GPU clusters, training efficiency is usually less than 5% of what these machines can deliver, and existing systems cannot support the fast, cheap, and simple training of state-of-the-art ChatGPT models with billions of parameters.
These limitations stem from the fact that current DL systems, designed for more traditional pre-training and fine-tuning pipelines, do not adequately support the more involved RLHF training pipeline used by InstructGPT. To make RLHF training, and with it ChatGPT-like models, more accessible, the Microsoft team is releasing DeepSpeed-Chat, which offers an end-to-end RLHF pipeline for training ChatGPT-like models. It has the following characteristics:
DeepSpeed-RLHF System: the Hybrid Engine (DeepSpeed-HE) for RLHF is a powerful, sophisticated system that combines DeepSpeed’s training and inference capabilities. Drawing on DeepSpeed-Inference optimisations such as tensor parallelism and high-performance transformer kernels for generation, together with training-side memory optimisation techniques such as ZeRO and LoRA, the Hybrid Engine can switch quickly between RLHF’s inference (generation) and training modes. DeepSpeed-HE is also aware of the entire RLHF pipeline, which lets it further optimise memory management and data movement across the various phases of RLHF. With its unprecedented efficiency at scale, the DeepSpeed-RLHF system enables the AI community to access training of complex RLHF models easily, quickly, and affordably (a brief configuration sketch follows this list of characteristics).
Efficiency and affordability: RLHF training can be done quickly and cheaply because DeepSpeed-HE is nearly 15 times faster than existing systems.
Strong scalability: DeepSpeed-HE can handle models with hundreds of billions of parameters thanks to its excellent scalability on multi-node, multi-GPU systems.
Increasing access to RLHF training: using just a single GPU, DeepSpeed-HE enables data scientists without access to multi-GPU systems to train large, powerful RLHF models that can be used in real-world settings (a single-GPU configuration sketch follows below).
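As a rough illustration of how the Hybrid Engine fits into an RLHF iteration, the sketch below shows a DeepSpeed-style config with a hybrid_engine section and a loop that alternates between generation and training. The exact config keys, the model choice, and the compute_ppo_loss placeholder are assumptions for illustration, not the official DeepSpeed-Chat API.

```python
import deepspeed
import torch
from transformers import AutoModelForCausalLM

# Hypothetical config, written in the general DeepSpeed style; verify the exact
# keys (especially "hybrid_engine") against the DeepSpeed-Chat documentation.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},   # partition params, grads, optimiser state
    "hybrid_engine": {"enabled": True},  # let DeepSpeed-HE flip between modes
}

actor = AutoModelForCausalLM.from_pretrained("facebook/opt-1.3b")  # illustrative model
engine, optimizer, _, _ = deepspeed.initialize(
    model=actor, model_parameters=actor.parameters(), config=ds_config
)

# One RLHF iteration uses the same engine in two modes.
prompts = torch.randint(0, 50000, (4, 32), device=engine.device)  # dummy prompt ids

engine.eval()
with torch.no_grad():
    # Generation phase: inference kernels and tensor parallelism apply here.
    responses = engine.module.generate(prompts, max_new_tokens=64)

engine.train()
# Training phase: compute the PPO loss on the generated experience and step.
# loss = compute_ppo_loss(engine, prompts, responses)  # placeholder, see PPO sketch below
# engine.backward(loss)
# engine.step()
```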
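On the single-GPU side, what makes training a model far larger than GPU memory feasible is mainly aggressive offloading. Below is a minimal sketch of a ZeRO stage-3 configuration with CPU offload of parameters and optimiser state; the batch size and learning rate are illustrative values, not recommendations from the DeepSpeed team.

```python
# Minimal single-GPU sketch: ZeRO-3 with CPU offload keeps only the tensors
# needed for the current computation on the GPU, so multi-billion-parameter
# models can be fine-tuned on one device (at the cost of extra data movement).
ds_single_gpu_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
    },
}
# This dict would be passed to deepspeed.initialize(...) just as in the sketch
# above, but launched on a single GPU.
```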
To make the training process as efficient as possible, the researchers have integrated a complete end-to-end training pipeline into DeepSpeed-Chat, modelled after InstructGPT.
The pipeline consists of three steps:
- Supervised fine-tuning (SFT): the pretrained language model is fine-tuned on carefully selected human responses to a variety of queries (see the first sketch after this list).
- Reward model fine-tuning: a separate model (RW), typically smaller than the SFT model, is trained on a dataset containing human-provided rankings of multiple responses to the same question (see the second sketch after this list).
- RLHF training: the Proximal Policy Optimisation (PPO) algorithm is used to further fine-tune the SFT model based on reward feedback from the RW model (see the third sketch after this list).
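As a rough picture of step 1, SFT is ordinary causal-language-model fine-tuning on curated (prompt, human response) pairs. The snippet below uses Hugging Face Transformers purely for illustration; the model name and the prompt/response formatting are assumptions, not DeepSpeed-Chat defaults.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; DeepSpeed-Chat itself supports a range of base models.
model_name = "facebook/opt-1.3b"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One curated example: a prompt concatenated with the human-written answer.
text = "Human: How do I sort a list in Python?\nAssistant: Use the built-in sorted() function."
batch = tokenizer(text, return_tensors="pt")

# Standard causal-LM objective: predict each token of the response given its prefix.
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()  # in a real run, an optimiser step (and a DeepSpeed engine) follows
```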
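To make step 2 concrete, the reward model is typically trained with a pairwise ranking loss: for each prompt, the score of the human-preferred response should exceed the score of the rejected one. The sketch below is a generic illustration of that loss, not DeepSpeed-Chat's exact implementation; reward_model here stands for any network that maps a (prompt, response) token sequence to a scalar score.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_ids, rejected_ids):
    """Generic pairwise ranking loss for reward-model fine-tuning.

    chosen_ids / rejected_ids: token-id tensors of shape (batch, seq_len) for the
    human-preferred and the rejected response to the same prompt.
    reward_model(input_ids) is assumed to return one scalar score per sequence.
    """
    chosen_scores = reward_model(chosen_ids)      # (batch,)
    rejected_scores = reward_model(rejected_ids)  # (batch,)
    # Maximise the log-probability that the chosen response outranks the rejected one.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()
```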
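Step 3 optimises the SFT policy against the reward model with PPO, usually with a KL penalty toward the frozen SFT model so the policy does not drift too far from it. The sketch below shows the clipped PPO objective in a simplified, per-sequence form; real implementations (including DeepSpeed-Chat's) work per token and add value-function and other auxiliary terms.

```python
import torch

def ppo_policy_loss(logprobs_new, logprobs_old, advantages, clip_ratio=0.2):
    """Clipped PPO surrogate objective (simplified, per-sequence form).

    logprobs_new: log pi_theta(response | prompt) under the current policy
    logprobs_old: the same log-probabilities recorded when the responses were sampled
    advantages:   advantage estimates built from reward-model scores (minus a KL
                  penalty toward the frozen SFT model) and a value baseline
    """
    ratio = torch.exp(logprobs_new - logprobs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio) * advantages
    # Take the pessimistic (minimum) objective and negate it to obtain a loss.
    return -torch.min(unclipped, clipped).mean()


def shaped_reward(rm_score, logprob_policy, logprob_sft, kl_coef=0.1):
    """Reward-model score with a KL penalty that keeps the policy near the SFT model."""
    return rm_score - kl_coef * (logprob_policy - logprob_sft)
```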
DeepSpeed-Chat has been open-sourced and is now accessible to the AI community. Users are encouraged to report issues, submit PRs, and join discussions on the DeepSpeed GitHub page.