Abstract

In recent months, large language models (LLMs) have sparked a new revolution in AI, demonstrating impressive capabilities across various fields. Huawei is currently developing LLMs and the related infrastructure. However, LLM training and inference face numerous reliability challenges. For instance, distributed training clusters frequently encounter faults and soft failures whose detection and recovery require extensive trial and error, which increases training costs. Additionally, emergent capabilities make model performance difficult to predict. LLMs also sometimes hallucinate, generating outputs that appear correct but are actually wrong, which makes the output untrustworthy. This talk aims to address these challenges with techniques that monitor LLM training and enable early detection and localization of faults.

Additionally, this talk will explore the mechanisms behind scaling laws, with the goal of predicting emergent capabilities and potential risks by building scalable metrics. Furthermore, it will cover LLM calibration models for measuring output reliability. Overall, the talk focuses on improving the reliability of LLM training and inference.

Speaker

Dr. Zheng Hu, China

Director of Reliability Technology Lab

Huawei Technologies Co., Ltd.

Dr. Zheng Hu is the Director of the Reliability Technology Lab at Huawei Technologies Co., Ltd. Dr. Hu currently leads the trustworthy AI project and the reliable AI cluster project. His research also covers software reliability, ad hoc networks, functional safety, databases, and site reliability engineering (SRE), among other topics. Dr. Hu received his PhD in Computer Science from Lyon University in Lyon, France. Before joining Huawei, he was a senior researcher at Orange Labs (France Telecom), working on self-configuring networks for smart homes and smart buildings.