Inside the AI “portraits of Vietnamese people” dataset in the global top 15 trending list

A Vietnamese-language dataset developed by FPT and NVIDIA to support the development of sovereign artificial intelligence (AI) in Viet Nam entered the top 15 trending datasets on Hugging Face just four days after its release.

Four days after its launch, Nemotron-Personas-Viet Nam, a dataset jointly developed by FPT Corporation and NVIDIA, entered the top 15 trending datasets on Hugging Face, the world’s leading open-source platform for sharing AI models and datasets.

A “Portraits of Vietnamese people” dataset for AI development

On Hugging Face, the trending rankings reflect the level of community interest in a resource, typically measured through downloads, likes, and user engagement.

Nemotron-Personas-Viet Nam’s inclusion in the top 15 trending datasets indicates that a dataset specifically designed for the Vietnamese language and the Vietnamese context is attracting attention from the international AI community. It also reflects the increasingly important role of localised data resources as many countries seek to advance sovereign AI development.

Nemotron-Personas-Viet Nam is not a large language model but a foundational dataset: source data that developers use as a basis for AI development.

The dataset is built in the form of Vietnamese-language personas, or “character profiles”, designed to simulate the diversity of Vietnamese people in terms of daily life, education, work, and personal interests.

These personas are not based on real individuals. Instead, they are synthetic data generated by AI systems using statistical distributions and validation methods to more accurately reflect the realities of Vietnamese society.

The publicly released version of Nemotron-Personas-Viet Nam contains 100,000 records, equivalent to 900,000 Vietnamese personas, with a total volume of 118 million tokens, including 52 million persona tokens. A token is the basic unit that AI models use to “read” and process language. At 118 million tokens, the dataset holds enough text to support developers in generating training data, fine-tuning models, or evaluating Vietnamese-language AI systems.
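For a rough sense of scale, the figures quoted above can be turned into per-record and per-persona averages. The short Python sketch below uses only the rounded totals reported in this article, so the results are approximations rather than official statistics; the variable names are chosen here for illustration.

```python
# Rounded totals as quoted in the article; results are therefore approximate.
RECORDS = 100_000
PERSONAS = 900_000
TOTAL_TOKENS = 118_000_000
PERSONA_TOKENS = 52_000_000

# Average number of personas attached to each record.
personas_per_record = PERSONAS / RECORDS

# Average amount of text per record and per persona, in tokens.
tokens_per_record = TOTAL_TOKENS / RECORDS
tokens_per_persona = PERSONA_TOKENS / PERSONAS

# Share of all tokens that belong to persona text.
persona_share = PERSONA_TOKENS / TOTAL_TOKENS

print(f"personas per record: {personas_per_record:.0f}")
print(f"tokens per record:   {tokens_per_record:.0f}")
print(f"tokens per persona:  {tokens_per_persona:.1f}")
print(f"persona token share: {persona_share:.1%}")
```

On these rounded figures, each record carries about nine personas and roughly 1,180 tokens, and persona text accounts for a little under half of the total.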

Each record in the dataset carries multiple fields, including occupation; skills; career goals; interests in sports, arts, travel, and cuisine; age; gender; educational attainment; marital status; and residential region and locality.

Describing personas across multiple dimensions enables developers to filter, categorise, and generate data scenarios tailored to specific user groups, professions, or application contexts.
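As an illustration of this kind of filtering, the Python sketch below selects personas by field values and counts them by region. It is a minimal sketch under stated assumptions: the field names (`occupation`, `age`, `region`), the toy records, and the helper `filter_personas` are all hypothetical and are not taken from the published dataset schema.

```python
from collections import Counter

# Hypothetical toy records: field names and values are invented for
# illustration and do not come from the published dataset.
records = [
    {"occupation": "teacher", "age": 34, "region": "Ha Noi"},
    {"occupation": "farmer", "age": 52, "region": "Can Tho"},
    {"occupation": "software engineer", "age": 27, "region": "Da Nang"},
    {"occupation": "teacher", "age": 45, "region": "Da Nang"},
]


def filter_personas(rows, **criteria):
    """Return the rows whose fields satisfy every criterion.

    A criterion is either a plain value (tested for equality) or a
    callable predicate applied to the field value.
    """
    def matches(row):
        for field, wanted in criteria.items():
            value = row.get(field)
            if callable(wanted):
                # Missing fields never satisfy a predicate.
                if value is None or not wanted(value):
                    return False
            elif value != wanted:
                return False
        return True

    return [row for row in rows if matches(row)]


# Filter by a single profession, then by an age range.
teachers = filter_personas(records, occupation="teacher")
under_40 = filter_personas(records, age=lambda a: a < 40)

# Categorise: how many toy personas fall in each region.
by_region = Counter(row["region"] for row in records)

print(len(teachers), len(under_40), dict(by_region))
```

The same pattern would apply to the real records once the actual column names are read from the dataset card on Hugging Face.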

The dataset covers six centrally governed cities and provinces, namely Ha Noi, Ho Chi Minh City, Hai Phong, Da Nang, Can Tho, and Dong Nai, based on Viet Nam’s new administrative boundaries following the 2025 restructuring.

Nemotron-Personas-Viet Nam is openly released on Hugging Face and can be used for both commercial and non-commercial purposes, provided appropriate attribution is given.

As a result, researchers, start-ups, businesses, and the AI development community in Viet Nam can access a foundational dataset for experimenting with, training, fine-tuning, and evaluating AI systems.

Advancing sovereign AI for Viet Nam

With Nemotron-Personas-Viet Nam, developers now have access to a data resource that more accurately reflects the characteristics of Vietnamese people, enabling the generation of additional synthetic data, reducing bias during model training, and improving diversity in the responses of Vietnamese-language AI models.

This is an important step towards ensuring that AI systems not only “speak Vietnamese” but also better understand Vietnamese people, Vietnamese society, and the specific challenges facing Viet Nam.

Assoc Prof, Dr Ngo Xuan Bach, Director of the AI Product Division at FPT Smart Cloud and Director of the Quantum AI & Cyber Security Institute at FPT Corporation, said: “FPT believes that sovereign AI must be built from the ground up to reflect local language, culture, and economic realities.”

“The Nemotron-Personas-Viet Nam dataset demonstrates our commitment to helping local AI developers gain easier access to the resources needed to build AI solutions tailored to Vietnamese users and capable of scaling across the region,” Ngo Xuan Bach emphasised.

The collaboration between FPT and NVIDIA stems from a shared objective of providing efficient open models, datasets, and libraries for the AI development community. These resources help developers build AI systems that better reflect each country’s language, culture, regulations, data infrastructure, and economic priorities, rather than relying entirely on global general-purpose models.

Within this partnership, NVIDIA contributes its open model framework, the NVIDIA NeMo Data Designer synthetic data library, and the Nemotron-Personas methodology. This structured approach enables the creation of large-scale synthetic datasets capable of reflecting the demographic, geographical, and usage-context characteristics of individual countries.

FPT contributes local expertise, contextual understanding, data validation capabilities, data infrastructure, and AI research capacity through units including FPT Smart Cloud, the Quantum AI & Cyber Security Institute, and FPT DC5.

Globally, persona datasets are becoming an increasingly important approach in AI development, particularly for models that require diverse synthetic data, reduced bias, and a better representation of user contexts.

Within the Nemotron-Personas series, NVIDIA has developed persona datasets for a number of countries and regions, including the US, Japan, India, Singapore, Brazil, and France.

Most popular AI models today are trained primarily on English-language data and Western contexts. When applied in Viet Nam, these models may not fully understand the differences in language, culture, occupations, regional characteristics, communication styles, and the practical needs of Vietnamese users. This can result in responses that are less natural, less accurate, or insufficiently suited to local contexts.

The presence of Nemotron-Personas-Viet Nam among the trending datasets on Hugging Face demonstrates the growing importance of localised data in AI development. For Viet Nam, it represents a practical step towards expanding resources for the technology community, supporting businesses and researchers in developing AI systems that better understand Vietnamese people, serve Vietnamese users more effectively, and possess the potential to scale across the region.
