
IIT Guwahati Innovates Method to Enhance Wikipedia's Accuracy

Researchers from IIT Guwahati have developed a method to identify and correct Surface Name Errors (SNEs) in Wikipedia, improving the reliability of information for both human users and AI systems. Their study found that roughly three to six percent of entity mentions in Wikipedia contain such errors, which can undermine credibility. The method relies on mathematical frequency patterns and has been tested on eight languages with accurate results. The work also underlines the importance of quality data for AI applications. Comparing English Wikipedia snapshots from 2018 and 2022, the team found that around 30 percent of the errors their method predicted had been corrected over those four years.
 

Advancements in Wikipedia Error Correction

GUWAHATI, March 6: Researchers at the Indian Institute of Technology (IIT) Guwahati have introduced a groundbreaking multilingual and scalable technique aimed at identifying and rectifying Surface Name Errors (SNEs) in Wikipedia. This innovation is set to bolster the reliability of information for both human readers and artificial intelligence (AI) systems.

A surface name is the term used in a Wikipedia article to reference or link to another entity, and an SNE arises when this term is incorrect.

The IIT Guwahati research team conducted a study revealing that approximately three to six percent of all entity mentions in Wikipedia contain SNEs. Although these errors may seem trivial, they can have profound consequences.

For users, an erroneous surface name can diminish the perceived trustworthiness and reliability of the information presented.

Moreover, numerous machine learning and deep learning models rely on Wikipedia as a primary dataset. Errors in surface names can adversely affect AI tasks and the performance of these models, according to the research team.

To tackle this issue, Prof Amit Awekar, an Associate Professor in the Department of Computer Science and Engineering at IIT Guwahati, along with MTech student Anuj Khare (2022 batch), developed a method that relies on mathematical frequency patterns, which makes it applicable across languages. Their approach identifies and classifies SNEs in three steps.

The initial step involved scanning Wikipedia and transforming each link into a quadruplet that includes details about the page where the link is found, the page it directs to, the surface name used, and the surrounding textual context.
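As an illustration of this first step, the following is a minimal sketch and not the researchers' code. It assumes links are written in standard wikitext form, [[Target]] or [[Target|surface name]], and that a fixed window of characters on either side of the link is enough context. The names LinkRecord, extract_links and the window parameter are hypothetical.

```python
import re
from typing import List, NamedTuple


class LinkRecord(NamedTuple):
    """One link as a quadruplet (field names are illustrative)."""
    source_page: str   # page where the link is found
    target_page: str   # page the link directs to
    surface_name: str  # text shown to the reader
    context: str       # text around the link


# Matches [[Target]] and [[Target|surface name]] wikitext links.
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def extract_links(source_page: str, wikitext: str, window: int = 30) -> List[LinkRecord]:
    """Turn every link in one page's wikitext into a quadruplet."""
    records = []
    for match in LINK_RE.finditer(wikitext):
        target = match.group(1).strip()
        # Without a pipe, the surface name is the target title itself.
        surface = (match.group(2) or target).strip()
        before = wikitext[max(0, match.start() - window):match.start()]
        after = wikitext[match.end():match.end() + window]
        records.append(LinkRecord(source_page, target, surface, before + after))
    return records
```

For a page containing "[[Guwahati|Gawahati]]", this yields a record whose target page is "Guwahati" and whose surface name is "Gawahati". A real pipeline would also need to handle templates, redirects and section links, which this sketch ignores.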

In the subsequent step, the method evaluated the surface name, deeming it correct only if it appeared at least ten times and constituted at least five percent of all links leading to a specific page.

Surface names failing to meet these criteria were flagged as potential errors.
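The two thresholds can be sketched in a few lines. This is an illustration under assumptions rather than the team's implementation: it assumes both the ten-use minimum and the five percent share are computed per target page, and that surface names are compared exactly as written. The function name flag_suspect_surface_names and the constants MIN_COUNT and MIN_SHARE are hypothetical.

```python
from collections import Counter, defaultdict

MIN_COUNT = 10    # a surface name must be used at least ten times
MIN_SHARE = 0.05  # and make up at least five percent of links to the page


def flag_suspect_surface_names(links):
    """links: iterable of (target_page, surface_name) pairs.

    Returns {target_page: set of surface names flagged as potential errors}.
    Assumption: both thresholds are evaluated per target page.
    """
    counts = defaultdict(Counter)
    for target, surface in links:
        counts[target][surface] += 1

    suspects = {}
    for target, surface_counts in counts.items():
        total = sum(surface_counts.values())
        flagged = {
            surface
            for surface, n in surface_counts.items()
            if n < MIN_COUNT or n / total < MIN_SHARE
        }
        if flagged:
            suspects[target] = flagged
    return suspects
```

A surface name must pass both tests to count as correct. A name used 12 times out of 300 links to a page clears the count threshold but, at four percent, still fails the share threshold and is flagged.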

The final step involved categorizing the identified errors into ‘typing mistakes’, such as ‘Gawahati’ instead of ‘Guwahati’, or ‘entity span errors’, where incorrect or additional words are mistakenly included in the link.
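The article does not say how the two categories are told apart, so the sketch below is only one plausible heuristic and should not be read as the researchers' procedure. It assumes a flagged name is an entity span error when it differs from an accepted name only by extra or missing whole words, and a typing mistake when it is merely close in spelling, measured here with Python's difflib. The function names and the 0.8 similarity threshold are hypothetical.

```python
from difflib import SequenceMatcher


def _contains(words, sub):
    """True if the word list `sub` appears as a contiguous run inside `words`."""
    n = len(sub)
    return n > 0 and any(words[i:i + n] == sub for i in range(len(words) - n + 1))


def categorize_error(suspect, accepted_names, typo_threshold=0.8):
    """Label a flagged surface name by comparing it with accepted names.

    Heuristic (an assumption, not taken from the article):
    - extra or missing whole words around an accepted name -> entity span error
    - high character-level similarity to an accepted name  -> typing mistake
    """
    suspect_words = suspect.lower().split()
    for name in accepted_names:
        name_words = name.lower().split()
        if suspect_words != name_words and (
            _contains(suspect_words, name_words) or _contains(name_words, suspect_words)
        ):
            return "entity span error"
    for name in accepted_names:
        if SequenceMatcher(None, suspect.lower(), name.lower()).ratio() >= typo_threshold:
            return "typing mistake"
    return "unclassified"
```

Under this heuristic, 'Gawahati' checked against the accepted name 'Guwahati' is labelled a typing mistake, while 'city of Guwahati' is labelled an entity span error.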

The researchers tested their method on eight languages (English, Sanskrit, German, Italian, Urdu, Hindi, Marathi, and Gujarati) and reported accurate results.

Discussing the practical implications of their method, Prof Awekar stated, “This work emphasizes the importance of not blindly trusting data from the web, both for human users and for training AI models. Quality data is fundamental to any effective AI model and its subsequent applications.”

To validate their method, the research team compared snapshots of English Wikipedia from 2018 and 2022 and found that around 30 percent of the errors predicted by their method had been rectified on Wikipedia over those four years, which the team presents as evidence of its effectiveness.
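The snapshot comparison amounts to simple set arithmetic. The sketch below is a hypothetical illustration: it assumes each predicted error is identified by a (source page, target page, surface name) triple and counts it as rectified if that exact triple no longer appears in the later snapshot. The function name fraction_rectified is not from the study.

```python
def fraction_rectified(predicted_errors, later_snapshot_links):
    """Share of predicted errors that no longer appear in a later snapshot.

    Both arguments are iterables of (source_page, target_page, surface_name)
    triples. Assumption: a prediction counts as rectified when its exact
    triple is absent from the later snapshot.
    """
    predicted = set(predicted_errors)
    if not predicted:
        return 0.0
    still_present = predicted & set(later_snapshot_links)
    return (len(predicted) - len(still_present)) / len(predicted)
```

This simple check would also count a link as rectified if its whole page or paragraph was deleted, so a careful evaluation would need to separate genuine corrections from unrelated edits.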

Wikipedia is curated by volunteers globally, and this method can assist editors in uncovering hidden typos and linking errors that might otherwise go unnoticed for extended periods, according to Prof Awekar. The Wikipedia community has accepted over 99 percent of the manual corrections proposed by the researchers.