2020-02-10

Microsoft Research today announced DeepSpeed, a new deep learning optimization library that can train massive models with up to 100 billion parameters. Larger natural language models generally achieve better accuracy, but training them is time-consuming and expensive. Microsoft claims that DeepSpeed improves training speed, cost, scalability, and ease of use.

Microsoft also said that DeepSpeed enables language models with up to 100 billion parameters and includes ZeRO (Zero Redundancy Optimizer), a parallelized optimizer that reduces the resources required for model and data parallelism while increasing the number of parameters that can be trained. Using DeepSpeed and ZeRO, Microsoft researchers developed Turing Natural Language Generation (Turing-NLG), the largest language model to date, with 17 billion parameters.
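To make the memory argument behind ZeRO concrete, here is a rough back-of-the-envelope sketch. It assumes the accounting commonly used for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states) and assumes that the first ZeRO stage partitions only the optimizer states across data-parallel GPUs. The function name, the GPU count of 64, and the decision to ignore activations and buffers are illustrative assumptions, not figures from Microsoft's announcement.

```python
def bytes_per_gpu(num_params: float, num_gpus: int, zero_stage_1: bool) -> float:
    """Approximate model-state memory per GPU (activations and buffers excluded)."""
    fp16_params = 2 * num_params         # 2 bytes per fp16 weight
    fp16_grads = 2 * num_params          # 2 bytes per fp16 gradient
    optimizer_states = 12 * num_params   # fp32 weights, momentum, variance (4 bytes each)
    if zero_stage_1:
        # Assumption: each GPU stores only its own 1/num_gpus slice of optimizer state.
        optimizer_states /= num_gpus
    return fp16_params + fp16_grads + optimizer_states


if __name__ == "__main__":
    GB = 1e9          # decimal gigabytes, for readability
    NUM_GPUS = 64     # hypothetical data-parallel degree

    # Model sizes quoted in the article.
    for name, params in [("GPT-2", 1.5e9), ("Megatron-LM", 8.3e9), ("T5", 11e9)]:
        baseline = bytes_per_gpu(params, NUM_GPUS, zero_stage_1=False) / GB
        zero1 = bytes_per_gpu(params, NUM_GPUS, zero_stage_1=True) / GB
        print(f"{name}: {baseline:.1f} GB replicated vs {zero1:.1f} GB with stage 1")
```

Under these assumptions the per-GPU footprint approaches a quarter of the fully replicated figure as the number of GPUs grows, which is one way to see why larger models become feasible on the same hardware.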

DeepSpeed highlights:

Scale: State-of-the-art large models such as OpenAI GPT-2, NVIDIA Megatron-LM, and Google T5 have 1.5 billion, 8.3 billion, and 11 billion parameters, respectively. ZeRO stage 1 in DeepSpeed provides system support for running models with up to 100 billion parameters, roughly ten times larger.

Speed: Throughput is up to five times higher than the state of the art across a range of hardware. On NVIDIA GPU clusters with low-bandwidth interconnect (without NVIDIA NVLink or InfiniBand), DeepSpeed achieves a 3.75-fold throughput improvement over Megatron-LM alone for a standard GPT-2 model with 1.5 billion parameters. On NVIDIA DGX-2 clusters with high-bandwidth interconnect, it is three to five times faster for models with 20 to 80 billion parameters.

Cost: Improved throughput can translate into significantly lower training costs. For example, DeepSpeed requires three times fewer resources to train a model with 20 billion parameters.

Usability: Only a few lines of code need to change for a PyTorch model to use DeepSpeed and ZeRO. Unlike current model parallelism libraries, DeepSpeed does not require code redesign or model refactoring.
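As an illustration of the "few code changes" claim, the sketch below builds the kind of JSON configuration that DeepSpeed reads at startup. The key names (`train_batch_size`, `fp16`, `optimizer`, `zero_optimization`) follow DeepSpeed's published configuration schema, but the exact form has changed between releases, so treat both the keys and the values as assumptions to check against the version in use. The helper `write_config` and the file name are hypothetical.

```python
import json

# Hypothetical settings; every value here is a placeholder, not a recommendation.
ds_config = {
    "train_batch_size": 64,
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # Assumed form for enabling the first ZeRO stage (optimizer state partitioning).
    "zero_optimization": {"stage": 1},
}


def write_config(path: str = "ds_config.json") -> str:
    """Serialize the configuration so a launcher can pass it to the training script."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(ds_config, f, indent=2)
    return path


# In the training script itself, the changes are expected to look roughly like this
# (shown as comments because they need DeepSpeed, PyTorch, and GPUs to run):
#
#   model_engine, optimizer, _, _ = deepspeed.initialize(
#       args=args, model=model, model_parameters=model.parameters())
#   loss = model_engine(batch)
#   model_engine.backward(loss)   # replaces loss.backward()
#   model_engine.step()           # replaces optimizer.step()
```

The model definition itself stays untouched in this pattern; only the wrapper call and the two training-loop lines differ from a plain PyTorch loop.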

Microsoft is open-sourcing DeepSpeed and ZeRO. The code is available on GitHub.

Source: Microsoft

