User:AlekseyBot
This user account is a bot operated by AlekseyFy (talk). It is used to make repetitive automated or semi-automated edits that would be extremely tedious to do manually, in accordance with the bot policy. The bot is currently inactive but retains the approval of the community. Administrators: if this bot is malfunctioning or causing harm, please block it.
This bot reads Wikipedia to build collaborative filtering models to aid link disambiguation. Right now, it should not be making any edits.
Algorithm description
Collaborative filtering is a technique designed to predict the unknown preferences of a user based on their previous preferences and the preferences of other users. Similar systems are used to predict whether a person will like a particular movie or product.
This concept can be applied to ambiguous links on Wikipedia. In this context, consider the links from a disambiguation page to be the possible targets for an ambiguous link. To build a model, we look at all the articles that currently (and unambiguously) link to a target. We call these articles pages; they fill the same role that users do in the examples above. Next, we look at all the links present in each page. Each article linked to from a page we call an item; items fill the role of movies or products. A page linking to an item is considered a vote, or preference, for that item. We expect that pages linking to a specific target will have similar "preferences", meaning that they also link to a similar set of items. When presented with a new page that has an ambiguous link to one of our targets, we expect that if the new page links to a substantially similar set of items as the pages linking to a particular target, then the new page probably intends that particular target as well.
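As a rough illustration, the following Python sketch implements the voting scheme above. The article names in the training data and the use of Jaccard similarity as the preference measure are assumptions made for this example; the bot's actual data would come from Wikipedia's link tables, and its similarity function may differ.

# A minimal sketch of the collaborative filtering idea described above.
# The page/link data and the similarity measure are hypothetical;
# a real bot would harvest its data from Wikipedia's link tables.

# Training data: for each disambiguation target, the set of items
# (outgoing links) found on each page that unambiguously links to it.
training = {
    "Mandarin Chinese": [
        {"China", "Tone (linguistics)", "Standard Chinese", "Pinyin"},
        {"Beijing", "China", "Pinyin", "Chinese characters"},
    ],
    "Mandarin orange": [
        {"Citrus", "Fruit", "Orange (fruit)", "Tangerine"},
        {"Citrus", "Vitamin C", "Fruit"},
    ],
}

def jaccard(a, b):
    """Jaccard similarity between two sets of linked items."""
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_target(new_page_items, training):
    """Score each candidate target by the average similarity between the
    new page's items and the items of the pages linking to that target."""
    scores = {}
    for target, pages in training.items():
        scores[target] = sum(jaccard(new_page_items, p) for p in pages) / len(pages)
    return max(scores, key=scores.get), scores

# A new page containing an ambiguous link to "Mandarin":
new_page = {"China", "Pinyin", "Grammar", "Standard Chinese"}
best, scores = predict_target(new_page, training)
print(best, scores)

In this toy example the new page's links overlap heavily with those of the pages pointing to Mandarin Chinese, so that target wins the vote.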
Bot description
This bot implements a system like the one described above. So far, initial testing has been conducted on the Mandarin disambiguation page and has given the results summarized here. Official bot status would help reduce the time needed to build a model (which is transfer-intensive) and allow formal trials to determine whether the system could eventually disambiguate some links automatically with an acceptably small error rate.