Baichuan Intelligent Technology Releases Its First Large-scale Pre-trained Language Model, baichuan-7B

Baichuan Intelligent Technology, a company founded by former Sogou CEO Wang Xiaochuan, has officially launched its first large-scale Chinese and English pre-trained model, baichuan-7B. The 7-billion-parameter model has been released on several platforms, including Hugging Face, GitHub, and ModelScope, and has achieved top results in multiple benchmark tests.

The project's GitHub page states: baichuan-7B is an open-source, large-scale pre-trained language model developed by Baichuan Intelligent Technology. It is based on the Transformer architecture, contains 7 billion parameters, and was trained on approximately 1.2 trillion tokens. It supports both Chinese and English, with a context window length of 4096 tokens. It has achieved the best performance among models of the same size on authoritative Chinese and English benchmarks (C-Eval, MMLU, etc.).

The performance of baichuan-7B has been assessed on influential Chinese benchmarks including C-Eval, AGIEval, and Gaokao. In these evaluations, the model consistently surpassed other pre-trained models of the same parameter scale, making it the top-performing natively pre-trained Chinese model of its size. In the AGIEval assessment, baichuan-7B scored 34.4 points, significantly outperforming other open-source models such as LLaMA-7B, Falcon-7B, Bloom-7B, and ChatGLM-6B.

In the C-Eval test, baichuan-7B scored 42.8 points, exceeding ChatGLM-6B’s 38.9 points, and in the Gaokao evaluation, it scored 36.2 points, clearly leading other pre-trained models of the same parameter scale.

AGIEval is a benchmark launched by Microsoft Research that aims to comprehensively evaluate the capabilities of foundation models on human cognition and problem-solving tasks. C-Eval, co-created by Shanghai Jiao Tong University, Tsinghua University, and the University of Edinburgh, is a comprehensive exam-based evaluation set for Chinese language models, covering 52 subjects across different disciplines. The Gaokao benchmark, created by a research team at Fudan University, uses Chinese college entrance examination questions as a dataset to test large models’ capabilities in Chinese language understanding and logical reasoning.

Baichuan-7B not only excels in Chinese but also performs strongly in English. In the MMLU assessment, baichuan-7B scored 42.5 points, well ahead of the English open-source pre-trained model LLaMA-7B’s 34.2 points and the Chinese open-source model ChatGLM-6B’s 36.9 points.


The training corpus is crucial to the results of large model training. Baichuan Intelligent Technology built its pre-training corpus primarily from high-quality Chinese corpora, while also integrating high-quality English data. The raw data includes a massive amount of Chinese and English internet data, some open-source Chinese and English datasets, and a large amount of high-quality knowledge data.

The model was trained with an efficient and stable training process. Compared to models of the same parameter scale, baichuan-7B performs better on key indicators such as perplexity (PPL) and training loss.
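
For readers unfamiliar with the metric, perplexity is conventionally defined as the exponential of the average negative log-likelihood a model assigns to each token, so lower values mean the model was less "surprised" by the text. The minimal Python sketch below illustrates only that standard definition; it is not Baichuan's evaluation code, and the function name `perplexity` and the sample log-probabilities are hypothetical values chosen for illustration.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-probability) over a token sequence.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token (hypothetical inputs here, not real model output).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Illustrative, made-up sequences: one where every token got probability
# 0.25, and one where every token got probability 0.5.
uncertain = [math.log(0.25)] * 8
confident = [math.log(0.5)] * 8

ppl_uncertain = perplexity(uncertain)
ppl_confident = perplexity(confident)

# The more confident predictions yield the lower perplexity.
print(ppl_uncertain, ppl_confident)
```

Because perplexity is a monotonic function of the average per-token loss, a model with a lower cross-entropy training loss on the same data will also show a lower perplexity, which is why the two indicators are usually reported together.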

Most existing open-source models have a context window of 2K tokens or less. With an optimized tokenization algorithm, baichuan-7B extends this to a dynamic window of 4K (4096) tokens, making it suitable for a wider range of applications.