Automatically matching new Wikipedia articles with Wikidata items using Python
Closed, Resolved (Public)

Description

IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Approved license

I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.

  • Yes

No proprietary software:

I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.

  • Yes

How long has your team been accepting publicly submitted contributions?

  • 1 year

How many regular contributors does your team have?

  • 1-2 people

Brief summary

Wikidata is a structured data repository linked to Wikipedia and the other Wikimedia projects. It holds structured data about a huge number of concepts, including every topic covered by a Wikipedia article, and many scientific papers and other topics. It also includes the interlanguage links between Wikipedia articles in different languages, links from Wikipedia to Commons, and between other Wikimedia projects.

It was started by importing all Wikipedia interwiki links, and has been steadily expanding since. However, when a new Wikipedia article is started, it is not automatically matched to an existing Wikidata item, nor is a new item created for it. For a limited number of wikis, an automated Python script creates new items, but it can easily create duplicate items.

In this project you will match new articles against existing Wikidata items using ancillary data (such as identifiers that appear in both the Wikipedia article and the Wikidata entry). You will start with existing Python scripts, which use the 'pywikibot' package to edit Wikidata, and significantly expand them to handle more situations automatically. This code will then be used live to create new Wikidata items, replacing the existing scripts.
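As a rough illustration of identifier-based matching, the sketch below builds a SPARQL query that asks which items carry a given external identifier. It is not taken from the existing scripts: the helper name `build_identifier_query` is hypothetical, P214 (VIAF ID) is just one example property, and the value "12345" is a made-up placeholder. The resulting query could be sent to the Wikidata Query Service at https://query.wikidata.org/sparql, which is not done here.

```python
import re


def build_identifier_query(property_id: str, value: str, limit: int = 5) -> str:
    """Build a SPARQL query for items whose `property_id` statement equals `value`.

    Hypothetical helper for illustration; it only constructs the query text.
    """
    if not re.fullmatch(r"P[1-9]\d*", property_id):
        raise ValueError(f"not a Wikidata property ID: {property_id!r}")
    # Escape backslashes and double quotes so the value stays a valid
    # SPARQL string literal.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "SELECT ?item WHERE { "
        f'?item wdt:{property_id} "{escaped}" . '
        "} "
        f"LIMIT {int(limit)}"
    )


# Example: P214 is the VIAF ID property; "12345" is a placeholder value.
print(build_identifier_query("P214", "12345"))
```

If such a query returns exactly one item, that item is a strong candidate for the new article; zero or several results would need a different strategy or a human check.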

If there is time, you will also expand it to work with matching categories/articles from other Wikimedia projects, such as Wikimedia Commons or Wikisource, and/or look into creating a Wikidata Game that people can play to add sitelinks in cases where it's less clear for an automated tool.

This project is mentored by Mike Peel. Knowledge of Python is an advantage, although it can be learnt during the project. Knowledge of machine learning techniques might be useful (but this can also be achieved with non-ML approaches). Knowing multiple human languages is useful to work with multiple Wikipedia language communities, but is not required.

Minimum system requirements

You will need a computer with a working Python 3 installation; you can install pywikibot and other useful modules using standard package systems.

How can applicants make a contribution to your project?

You will start by understanding how Wikidata works, looking through Wikipedia articles and seeing how the information is stored on Wikidata. From there you will identify patterns that can be used to match articles with the Wikidata item where that link does not already exist. You will then code up automated matching functions and test how well they work with currently unmatched articles. Ultimately, these will be integrated into the live code to keep Wikidata and the different language Wikipedias in sync.
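One example of such a pattern is an external identifier that appears in the article's wikitext. The sketch below pulls IMDb identifiers out of plain IMDb URLs; the function name `extract_imdb_ids` and the sample text are hypothetical, and it assumes the link is written as a raw URL rather than through a template. The extracted value could then be compared against the IMDb ID property (P345) on Wikidata.

```python
import re
from typing import List

# IMDb identifiers: "nm" + digits for people, "tt" + digits for titles.
IMDB_URL_RE = re.compile(r"imdb\.com/(?:name|title)/((?:nm|tt)\d+)")


def extract_imdb_ids(wikitext: str) -> List[str]:
    """Return the distinct IMDb IDs linked from the wikitext, in order of appearance."""
    seen: List[str] = []
    for match in IMDB_URL_RE.finditer(wikitext):
        imdb_id = match.group(1)
        if imdb_id not in seen:
            seen.append(imdb_id)
    return seen


# Made-up wikitext fragment used only to exercise the function.
sample = "== External links ==\n* [https://www.imdb.com/name/nm0000001/ IMDb profile]"
print(extract_imdb_ids(sample))
```

An article that yields exactly one identifier is a good candidate for automatic matching; several different identifiers in one article suggest it needs a closer look.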

You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .

Repository

https://bitbucket.org/mikepeel/wikicode/

Issue tracker

N/A

Tasks

There are three 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, but you don't have to do them in order (except you must do task 1 first!), and you don't have to do all of them.

  1. T290719 Look through a class of articles (e.g., books, authors, games, etc.) and identify what information is in common between Wikipedia and Wikidata
  2. T290720 Set up pywikibot on your computer, and understand how it interacts with Wikidata
  3. T290721 Write a function that searches Wikidata for a match to a term
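For task 3, one possible starting point (not the task's reference solution) is the `wbsearchentities` action of the Wikidata API, called here with the standard library only. The function names, the User-Agent string, and the shape of the returned dictionaries are assumptions made for this sketch; Pywikibot could be used instead. The `fetch` parameter exists so that the parsing logic can be tried without network access.

```python
import json
from typing import Callable, Dict, List
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://www.wikidata.org/w/api.php"


def build_search_url(term: str, language: str = "en", limit: int = 5) -> str:
    """Build the URL for a wbsearchentities request restricted to items."""
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def fetch_json(url: str) -> Dict:
    """Fetch a URL and decode the JSON body (performs a real network request)."""
    # Placeholder User-Agent; replace it with one that identifies your script.
    request = Request(url, headers={"User-Agent": "wikidata-search-sketch/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def search_wikidata(
    term: str,
    language: str = "en",
    limit: int = 5,
    fetch: Callable[[str], Dict] = fetch_json,
) -> List[Dict[str, str]]:
    """Return id, label and description for each search hit."""
    data = fetch(build_search_url(term, language, limit))
    return [
        {
            "id": hit["id"],
            "label": hit.get("label", ""),
            "description": hit.get("description", ""),
        }
        for hit in data.get("search", [])
    ]
```

Calling `search_wikidata("Douglas Adams")` performs a live request; passing a stub as `fetch` keeps tests offline.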

These tasks also form the start of the main project, which will integrate them into an automated script that either adds sitelinks for new articles to existing items or creates new items where there is no match.
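The link-or-create decision could be sketched as below. This is only an assumption about how such a script might be organised, not the existing code: the function `decide_action`, the candidate dictionary layout (`id`, `label`, `sitelinks`), and the rule of comparing labels case-insensitively are all hypothetical. A label match on its own is weak evidence, so a real script would likely combine it with identifiers or other checks.

```python
from typing import Dict, List, Tuple


def decide_action(
    article_title: str, candidates: List[Dict], wiki: str = "enwiki"
) -> Tuple[str, List[str]]:
    """Return (action, item_ids) where action is 'link', 'create' or 'review'.

    A candidate counts as a possible match when its label equals the article
    title (ignoring case) and it has no sitelink to `wiki` yet.
    """
    free = [
        c
        for c in candidates
        if c.get("label", "").casefold() == article_title.casefold()
        and wiki not in c.get("sitelinks", [])
    ]
    if len(free) == 1:
        # Exactly one plausible item: add the sitelink to it.
        return ("link", [free[0]["id"]])
    if not free:
        # Nothing plausible: a new item would be created.
        return ("create", [])
    # Several plausible items: leave the choice to a human.
    return ("review", [c["id"] for c in free])


# Made-up candidates; the Q-numbers are placeholders, not real matches.
candidates = [
    {"id": "Q1", "label": "Example Bridge", "sitelinks": ["dewiki"]},
    {"id": "Q2", "label": "Example Bridge", "sitelinks": ["enwiki"]},
]
print(decide_action("Example bridge", candidates))
```

Sending ambiguous cases to review rather than guessing is what keeps an automated run from creating wrong sitelinks or duplicate items.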

Application and timeline

The Outreachy positions are assessed solely on the contributions and the application you submit for the project; the best things you can do are to do well with the contributions, and include all relevant information in your application. Contributions are evaluated based on their completeness, coding style, and any additional work beyond the core task. I generally look for applicants who have demonstrated that they understand the tasks and the Wikimedia community.

When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:

  • You should split the timeline into periods, e.g., weekly or two-weekly, and write a short summary of what you expect to be doing in that period.
  • The aim of the project is to match all new Wikipedia articles with Wikidata items, but this will be done in stages (e.g., different topic areas, like buildings vs. statues; different language wikis; drafting vs. testing on different pages vs. running code).
  • Large runs to add sitelinks will need bot approval (this can take 2 weeks, or longer if controversial), and you should include time for that (you can work on other parts while waiting for approval!)
  • Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
  • If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!

There are no community-specific questions to answer in your application for this project. If you can demonstrate general knowledge of the community, or previous Python coding activities, in your application, that will be really helpful.

Also, please bear in mind that I can only accept one intern for this project, so I would strongly recommend contributing to multiple Outreachy projects (particularly those with few applicants) to increase your chances of getting an internship.

Benefits

You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata, and how it relates to Wikipedia articles.

Community benefits

More interwiki links with Wikipedia articles. Fewer duplicate Wikidata items that need to be merged.

Questions?

Please feel free to ask questions in this Phabricator task, or in the subtasks. You can also email me if you want (my address is available via Outreachy).

Event Timeline

@Mike_Peel Sept. 23 is the deadline for mentors to upload the projects on the Outreachy website. Maybe you can do so in the next few days and then I can approve? Step 3 under "Before the program"

@srishakatux Now done, hopefully correctly!

@Mike_Peel Approved! One thing I noticed is that you have shared your email address as the preferred method of communication. For example, I wonder if you would like to move the project discussions to someplace public on Zulip. Just a suggestion, but it is totally up to you what works best for you.

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)". Sep 20 2021, 9:14 PM

Thanks! I set up the preferred method of communication as here on Phabricator, then on-wiki - I've added Zulip as well now. I see my email address under 'Contact info' - but no idea how to change that!

srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Oct 8 2021, 10:19 PM

Hi, I am Nancy. I have been approved for the contribution period for the December 2021 internship. Looking forward to a great time ahead.

@Mike_Peel Hello, I am Nafiya. I have been approved for the contribution period for the December 2021 internship, and I am interested in contributing to the project "Automatically matching new Wikipedia articles with Wikidata items using Python". However, I have forgotten a lot about Python, as I have been out of touch with it for a very long time. Can you please give me some guidelines...

Hi Nancy, welcome! Feel free to ask here if you have any questions about how to get started!

Hi Nafiya, welcome! It's OK, you can learn Python again as you go through the project, although it is useful to have beforehand. I recommend trying the first task to start with - T290719 - this doesn't require any knowledge of Python to complete. The second task - T290720 - guides you through setting up Python and Pywikibot, and making the first edits with it. If you encounter problems while doing that, then we can talk through them to sort them out. :-)

Hi all. Just to let you know, I'm marking this project as 'already has many strong applicants' (which may also appear on the Outreachy website as 'closed'). This is because I think everyone who's submitted a task already is doing well. If you've done task 1 already, please continue with the other tasks - this doesn't stop you from submitting contributions! If you haven't started yet, and we've been in contact (even if it's a simple 'hi' on Zulip, Phabricator, or by email), you're very much welcome to continue as well. If you're completely new, please say hi before starting the tasks!

Hi, a bit too late? :-)

Quite late, but you still have a bit more time before the deadline tomorrow!

Dear @Mike_Peel, I am really puzzled by the item "Outreachy internship project timeline" in the final application. How could I plan a timeline without knowing exactly what the internship tasks are? From your explanation above, it seems that we have to make a work plan on our own?

It's an open-ended task, somewhat different from the contributor tasks. :-) But generally it's summarised as 'The aim of the project is to match all new Wikipedia articles with Wikidata items'. Think about how you would break that problem up into specific parts you can solve throughout the internship.

Thank you very much for your rapid answer, @Mike_Peel! Does the work include the elimination of existing inconsistencies and shortcomings in already existing items?

That's not part of the core work, which really is just focused on adding the links, but can be worked on if time permits.