Gemini - A Family of Highly Capable Multimodal Models: Training Dataset

Written by escholar | Published 2023/12/24
Tech Story Tags: generative-ai | gemini-training-dataset | gemini-google | sentencepiece-tokenizer | gemini-model-training-dataset | gemini-family-of-models | multimodal-gemini-models | machine-learning

TLDR: This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks — notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of Gemini models in cross-modal reasoning and language understanding will enable a wide variety of use cases and we discuss our approach toward deploying them responsibly to users.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Gemini Team, Google.

Table of Links

Abstract and Introduction

Model Architecture

Training Infrastructure

Training Dataset

Evaluation

Responsible Deployment

Discussion and Conclusion, References

Contributions and Acknowledgments

Appendix

4. Training Dataset

Gemini models are trained on a dataset that is both multimodal and multilingual. Our pretraining dataset uses data from web documents, books, and code, and includes image, audio, and video data.

We use the SentencePiece tokenizer (Kudo and Richardson, 2018) and find that training the tokenizer on a large sample of the entire training corpus improves the inferred vocabulary and subsequently improves model performance. For example, we find Gemini models can efficiently tokenize non-Latin scripts which can, in turn, benefit model quality as well as training and inference speed.
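As a rough illustration only (not the Gemini team's actual pipeline), the snippet below trains a SentencePiece tokenizer on a corpus sample and encodes non-Latin text with it. The file names, vocabulary size, model type, and coverage setting are assumptions chosen for the sketch.

```python
# Minimal sketch: train a SentencePiece tokenizer on a large corpus sample.
# File names, vocab size, and coverage are illustrative assumptions, not the
# settings used for Gemini.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",        # hypothetical multilingual text sample
    model_prefix="demo_tokenizer",    # writes demo_tokenizer.model / .vocab
    vocab_size=32000,
    model_type="bpe",                 # SentencePiece also supports "unigram"
    character_coverage=0.9995,        # high coverage helps non-Latin scripts
)

sp = spm.SentencePieceProcessor(model_file="demo_tokenizer.model")
print(sp.encode("こんにちは世界", out_type=str))  # fewer pieces => cheaper training/inference
```

A tokenizer trained on a broader, more representative sample tends to split non-Latin scripts into fewer pieces, which is the efficiency effect the paragraph above describes.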

The number of tokens used to train the largest models was determined following the approach in Hoffmann et al. (2022). The smaller models are trained for significantly more tokens to improve performance for a given inference budget, similar to the approach advocated in Touvron et al. (2023a).
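For intuition, the compute-optimal analysis of Hoffmann et al. (2022) is often summarized by the rule of thumb of roughly 20 training tokens per parameter, with training compute approximated as C ≈ 6ND. The sketch below applies those widely cited heuristics; they are not figures reported for Gemini.

```python
# Back-of-the-envelope compute-optimal sizing in the spirit of Hoffmann et al.
# (2022). The ~20 tokens-per-parameter ratio and the C ~= 6*N*D approximation
# are common rules of thumb, not details disclosed for Gemini.

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Approximate compute-optimal number of training tokens."""
    return tokens_per_param * n_params

def train_flops(n_params: float, n_tokens: float) -> float:
    """Standard approximation: ~6 FLOPs per parameter per training token."""
    return 6.0 * n_params * n_tokens

for n in (1e9, 10e9, 70e9):  # hypothetical model sizes
    d = chinchilla_tokens(n)
    print(f"{n/1e9:>4.0f}B params -> ~{d/1e9:.0f}B tokens, ~{train_flops(n, d):.2e} FLOPs")
```

Training a smaller model well past its compute-optimal token count, as described above, trades extra training compute for lower inference cost at a given quality level.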

We apply quality filters to all datasets, using both heuristic rules and model-based classifiers. We also perform safety filtering to remove harmful content. We filter our evaluation sets from our training corpus. The final data mixtures and weights were determined through ablations on smaller models. We stage training to alter the mixture composition during training, increasing the weight of domain-relevant data towards the end of training. We find that data quality is critical to a high-performing model, and believe that many interesting questions remain around finding the optimal dataset distribution for pretraining.
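As an illustrative sketch only, a filtering pass of the kind described above typically combines cheap heuristic rules, model-based quality and safety scores, and an n-gram overlap check against evaluation sets to avoid contamination. The thresholds, the scorer interfaces, and the 8-gram window below are assumptions, not details disclosed for Gemini.

```python
# Illustrative data-filtering sketch: heuristic rules, model-based quality and
# safety scores, and n-gram decontamination against evaluation sets. All
# thresholds and interfaces here are assumptions for the sketch.
from typing import Callable, Iterable, Set, Tuple

def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def passes_heuristics(doc: str) -> bool:
    """Cheap rule-based checks, e.g. minimum length and alphabetic ratio."""
    if len(doc.split()) < 50:
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6

def filter_corpus(
    docs: Iterable[str],
    quality_score: Callable[[str], float],  # hypothetical model-based quality classifier
    safety_score: Callable[[str], float],   # hypothetical safety classifier
    eval_ngrams: Set[Tuple[str, ...]],      # n-grams pooled from held-out evaluation sets
    quality_threshold: float = 0.5,
    safety_threshold: float = 0.9,
) -> Iterable[str]:
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        if quality_score(doc) < quality_threshold:
            continue
        if safety_score(doc) < safety_threshold:
            continue
        if ngrams(doc) & eval_ngrams:       # drop documents overlapping eval sets
            continue
        yield doc
```

The staged mixture weighting mentioned above would sit downstream of a filter like this, re-weighting how often each cleaned source is sampled as training progresses.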

