In recent years, global media coverage of China’s AI development has been dominated by sweeping narratives of geopolitical tension and technological one-upmanship. In these narratives, the Communist Party is portrayed as mobilizing China's vast state apparatus and civil ingenuity to push its AI capabilities—hardware, software, deployment, and application—toward parity with the West. However, beneath these sweeping narratives lies a critical but widely overlooked issue: given that the successful development and application of AI models is fundamentally dependent on high-quality data, shortcomings in Chinese data quality and availability will be a key limiting factor in the effective development and deployment of China’s AI technologies. These shortcomings represent a technological ceiling imposed by authoritarian control over information, one that will have significant implications not only for China's AI landscape and its developmental trajectory, but also for China’s broader social, political, and economic stability and well-being in the years ahead.
Let us first briefly discuss why high-quality data is so important. At the heart of computing lies a fundamental principle: “Garbage in, garbage out.” Unlike humans, who can draw on broad experience to make inferences and logical leaps, AI systems are entirely dependent on the quality of their training data. Poor-quality data can undermine even the most sophisticated algorithms, compromising the accuracy, performance, and reliability of AI systems. A 2024 Appen report surveying 300 companies found data quality to be a top challenge in the construction of AI applications. A 2025 survey from Qlik noted that 81% of AI professionals acknowledged significant data quality challenges within their organizations, threatening returns on investment and business stability. As Drew Clarke, EVP and GM of Qlik's Data Business Unit, explained, “Companies are essentially building AI skyscrapers on sand... The most advanced algorithms can't compensate for unreliable data inputs.” Data quality has become one of the main obstacles to AI project success.
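To make the principle concrete, here is a minimal illustrative sketch (not drawn from the Appen or Qlik studies cited above) in Python: it trains the same simple classifier on scikit-learn's bundled digits dataset while randomly corrupting a growing share of the training labels, and reports how test accuracy falls. The dataset, model, and noise levels are assumptions chosen for illustration; the point is the general pattern that an unchanged algorithm degrades as its training data degrades.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative "garbage in, garbage out" demo: corrupt an increasing fraction
# of training labels and watch the same model's test accuracy decline.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
for noise in (0.0, 0.2, 0.4):
    y_noisy = y_train.copy()
    n_flip = int(noise * len(y_noisy))
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[flip_idx] = rng.integers(0, 10, size=n_flip)  # overwrite with random digit labels

    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_noisy)
    print(f"label noise {noise:.0%}: test accuracy = {model.score(X_test, y_test):.3f}")
```

With no label noise the model scores well; as more labels are corrupted, accuracy drops, even though the algorithm itself never changes.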
In China, these issues are even more pronounced, for several reasons. First, there is simply less Chinese-language information available. On Common Crawl, a massive, open-access web archive that contains petabytes of data collected from billions of webpages across the internet, Chinese comprises only 5.2% of content (compared to 43.2% for English). As of March 2025, English accounted for about 49% of all internet content, while Chinese made up only 1.1%. This trend shows no sign of improving; indeed, much of the Chinese-language web is being systematically erased. In 2023, China had only 3.9 million websites, down from 5.3 million in 2017. Although Chinese internet users make up nearly one-fifth of the global total, a mere 1.3% of global websites use Chinese, down from 4.3% in 2013, a 70% decline in a decade. To make matters worse, many of China’s public databases have been shut down or restricted in recent years. The Supreme People’s Court once maintained an open database of court verdicts, posting 23 million rulings online in 2020; in 2023, it released only 3 million. Access to public health data is similarly restricted. And, due to concerns about intellectual property and commercial competition, Chinese firms like Tencent and ByteDance, each of which controls huge swathes of the Chinese-language internet, are reluctant to share data with third parties for the purpose of training LLMs.