What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata
Closed, Resolved (Public)

Description

IMPORTANT: Make sure to read the Outreachy participant instructions and communication guidelines thoroughly before commenting on this task. This space is for project-specific questions, so avoid asking questions about getting started, setting up Gerrit, etc. When in doubt, ask your question on Zulip first!

Approved license

I assert that this Outreachy internship project will be released under either an OSI-approved open source license that is also identified by the FSF as a free software license, OR a Creative Commons license approved for free cultural works.

  • Yes

No proprietary software:

I assert that this Outreachy internship project will forward the interests of free and open source software, not proprietary software.

  • Yes

How long has your team been accepting publicly submitted contributions?

  • 1 year

How many regular contributors does your team have?

  • 1-2 people

Brief summary

Names are really complex. Which part is the first name? Which is the middle name? How do you define your surname? What happens if you have multiple family names? How do names work across multiple languages and cultures?

Accurately recording this information is important for scientific references that are used in Wikipedia articles and Wikidata items. If it is wrong, it's easy to misattribute publications, or to miss connections between different works by the same author. It's also very difficult to get right, since naming conventions are complex and differ between languages.

This project will focus on understanding what makes a name, and how it can be recorded in structured data, across many languages and conventions. The project focuses on Wikidata, which is the structured data repository linked to Wikipedia and the other Wikimedia projects. Wikidata holds records of millions of scientific publications as part of WikiCite. However, identifying individual author names and linking between their different publications is still in its early stages.

In this project, you will use currently available BibTeX author information to split author names into 'first' and 'last' names, and you will add this information to thousands of Wikidata items using Pywikibot. You will also explore other approaches, potentially including machine learning, to see how reliably first and last names can be identified.
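
As a rough illustration of the name-splitting step, here is a minimal Python sketch. It assumes the two common BibTeX name forms ("Last, First" and "First von Last") and that authors are separated by " and ". The function names `split_authors` and `split_name` are hypothetical and are not taken from the project repository. The sketch ignores braces and many real-world conventions (multiple family names, family-name-first ordering, mononyms with titles), which is exactly where the project gets interesting.

```python
import re
from typing import List, Tuple


def split_authors(author_field: str) -> List[str]:
    """Split a BibTeX author field into individual names on ' and '.

    Simplification: braces, which BibTeX uses to protect a literal
    'and' inside a name, are not handled here.
    """
    parts = re.split(r"\s+and\s+", author_field.strip())
    return [p.strip() for p in parts if p.strip()]


def split_name(name: str) -> Tuple[str, str]:
    """Return (first, last) for a single author name.

    Assumed rules (a simplified reading of BibTeX conventions):
    - 'Last, First' or 'Last, Jr, First': text before the first comma is
      the last name, text after the final comma is the first name.
    - 'First von Last': the last name starts at the first lowercase
      token (e.g. 'van', 'de'); otherwise it is the final token.
    """
    name = " ".join(name.split())  # normalise whitespace
    if not name:
        return ("", "")
    if "," in name:
        parts = [p.strip() for p in name.split(",")]
        return (parts[-1], parts[0])
    tokens = name.split()
    last_start = len(tokens) - 1  # the final token is always in the last name
    for i, token in enumerate(tokens[:-1]):
        if token[0].islower():
            last_start = i
            break
    return (" ".join(tokens[:last_start]), " ".join(tokens[last_start:]))


if __name__ == "__main__":
    for author in split_authors("Knuth, Donald E. and Ludwig van Beethoven"):
        print(split_name(author))
```

A rule-based splitter like this is only a baseline: comparing its output against names that are already correctly recorded is one way to measure how often the simple rules fail.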

This project is mentored by Mike Peel and Andy Mabbett. Knowledge of scientific references and Python are useful, although they can be learnt during the project.

Minimum system requirements

You will need a computer with a working Python 3 installation; you can install pywikibot and other useful modules using standard package systems.

How can applicants make a contribution to your project?

You will start by learning how scientific references in Wikidata are structured, particularly with their author names. You will then investigate how author names are described, and how to identify first and last names of the authors. You will then write code that automatically identifies first and last names, and adds them to Wikidata.

You will need to create an account on Wikipedia (if you don't already have one), and install the pywikibot package (https://www.mediawiki.org/wiki/Manual:Pywikibot). I can provide guidance for each specific starting task, and in general please feel free to ask questions through Outreachy, by email, or at https://www.wikidata.org/wiki/User_talk:Mike_Peel .

Repository

https://github.com/mpeel/wikicode/

Issue tracker

N/A

Tasks

There are three 'starter' tasks that can be done as Outreachy contributions. These aim to guide you through how Wikipedia and Wikidata are structured, and how Pywikibot interacts with them. They get progressively harder, and you should do them in order. You don't have to do all of them, but it's recommended that you try. These tasks also form the start of the main project.

  1. T301733 Look at existing Wikidata items for scientific articles. Document how author names have been recorded in them, and how they could be improved.
  2. T301735 Set up Pywikibot on your computer, and understand how it interacts with Wikidata.
  3. T301737 Take a specific item (specified by Mike or Andy), and try to identify the first and last names of the authors (using BibTeX or other means).
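
For the first task, it can help to see how author names sit in an item's data. The sketch below is an assumption-laden illustration, not project code: it takes entity JSON in the shape returned by the Wikidata `wbgetentities` API and lists the "author name string" statements (property P2093) together with their "series ordinal" qualifier (P1545). The function name `author_name_strings` and the sample item are made up for the example.

```python
from typing import List, Optional, Tuple

AUTHOR_NAME_STRING = "P2093"  # Wikidata property: author name string
SERIES_ORDINAL = "P1545"      # Wikidata property: series ordinal


def author_name_strings(entity: dict) -> List[Tuple[Optional[str], str]]:
    """Return (ordinal, name) pairs from an entity's JSON, sorted by ordinal.

    Statements without a numeric ordinal are kept, after the numbered
    ones, in their original order.
    """
    rows = []
    for statement in entity.get("claims", {}).get(AUTHOR_NAME_STRING, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip 'unknown value' / 'no value' statements
        name = snak["datavalue"]["value"]
        ordinal = None
        for qualifier in statement.get("qualifiers", {}).get(SERIES_ORDINAL, []):
            if qualifier.get("snaktype") == "value":
                ordinal = qualifier["datavalue"]["value"]
                break
        rows.append((ordinal, name))

    def sort_key(row):
        ordinal = row[0]
        if ordinal is not None and ordinal.isdigit():
            return (0, int(ordinal))
        return (1, 0)

    return sorted(rows, key=sort_key)  # sorted() is stable


def _statement(name: str, ordinal: str) -> dict:
    """Build one made-up statement in the Wikibase JSON shape."""
    return {
        "mainsnak": {"snaktype": "value", "datavalue": {"value": name}},
        "qualifiers": {
            SERIES_ORDINAL: [{"snaktype": "value", "datavalue": {"value": ordinal}}]
        },
    }


# Made-up sample data, deliberately stored out of order.
sample = {
    "claims": {
        AUTHOR_NAME_STRING: [
            _statement("Leslie Lamport", "2"),
            _statement("Donald E. Knuth", "1"),
        ]
    }
}

if __name__ == "__main__":
    for ordinal, name in author_name_strings(sample):
        print(ordinal, name)
```

Working on plain JSON keeps the example independent of any bot framework; the same information can also be read through Pywikibot once it is set up in the second task.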

Application and timeline

The Outreachy positions are assessed solely on the contributions and the application you submit for the project; the best things you can do are to do well with the contributions, and include all relevant information in your application. Contributions are evaluated based on their completeness, coding style, understanding of the tasks, and any additional work beyond the core task. I generally look for applicants who have demonstrated that they understand the tasks and the Wikimedia community.

When filling in your application, you will be asked about a timeline for the work during the project. I encourage you to draft a rough timeline yourself, bearing in mind:

  • You should split the timeline into periods, e.g., weekly or two-weekly, and write a short summary of what you expect to be doing in that period.
  • The aim of the project is to match all new Wikipedia articles with Wikidata items, but this will be done in stages (e.g., different topic areas, like buildings vs. statues; different language wikis; drafting vs. testing on different pages vs. running code)
  • Large runs to add sitelinks will need bot approval (can be 2 weeks, can be longer if controversial), and you should include time for that (waiting for approval while working on other parts!)
  • Be realistic about what you think you will be able to achieve during the internship - you won't be able to do everything!
  • If you are accepted, we will work together to revise the timeline as the work progresses - it doesn't have to be perfect!

There are no community-specific questions to answer in your application for this project. If you can demonstrate general knowledge of the Wikimedia community (e.g., past editing of Wikipedia/Wikidata), or previous Python coding activities in your application, that will be really helpful, but it is not essential.

Also, please bear in mind that we will only accept one intern for this project, so we strongly recommend contributing to multiple Outreachy projects (particularly those with few applicants) to increase your chances of getting an internship.

You are also encouraged to attend the Wikimedia Hackathon on 20-22 May:

Benefits

You will learn, or improve your knowledge of, Python coding. You will gain familiarity with how structured data is maintained on Wikidata and in other scientific databases.

Community benefits

Better metadata for WikiCite items, and the ability to sort references by surname on Wikipedia.

Questions?

Please feel free to ask questions in this Phabricator task, or in the subtasks. You can also email me if you want (my address is available via Outreachy).

Event Timeline

@Mike_Peel Also, upload the project on the Outreachy site whenever you feel ready, and I will then approve. Thank you!

srishakatux changed the visibility from "Public (No Login Required)" to "acl*outreachy-mentors (Project)". Feb 22 2022, 8:13 PM

Sure, could you remind me where to do that please?

@srishakatux Please could you include @Pigsonthewing as a mentor for the proposal on Outreachy (he's now registered on Outreachy under that username), and also give him view permissions for the Outreachy tickets?

Done done!

Mike_Peel updated the task description.
srishakatux changed the visibility from "acl*outreachy-mentors (Project)" to "Public (No Login Required)". Mar 25 2022, 5:33 PM

Hello Mentors.
I found this project very interesting and I have started doing the tasks provided by you.
I have 3 years of experience in Python. I have developed many projects in Python and completed two research internships at a reputed university, which involved Python, machine learning and deep learning. I have also done an internship at Microsoft. I really enjoy working on open source.
I will be contributing to this community for the next one or two years, as I really appreciate and love the work done by you all.
Looking forward to working with this community.
Regards

Great - please have a go at the three starter tasks, and let me know when you're ready for me to look at your work!

Hello.
I am Amitha, a third-year undergraduate at IIITBangalore. I found this project quite interesting. Looking forward to getting started and making contributions. I have good experience with Python and machine learning.

Greetings everyone!
I am Sonali Rastogi, a prospective Outreachy intern from India.
I am a junior undergrad student pursuing my bachelor's at NIT Agartala. I am an innovator and have been developing multiple tech projects ranging from core engineering to web platforms.

I am interested in "What's in a name? Automatically identifying first and last author names for Wikicite and Wikidata", a very exciting project that resonates with my interest in research and journal papers. I am looking forward to working with this spectacular community.

Hi @Amitha67 and @Sonali.Rastogi, welcome! Please try the first microtask to get started; see T301733.

I've marked this as 'Closed to new applicants' on Outreachy as there are now several people that have completed all three tasks. The project is still open for contributions, though, and will remain so until the contribution period ends. And if anyone new still wants to have a go at the tasks, that's also fine.

Hello Everyone, I'm Pinalee, an Outreachy applicant from India. I want to contribute to this project. Looking forward to working with the Wikimedia Foundation. Thank you.

Hi Pinalee, please try task 1: T301733

Here is a blog post about the project: https://colonelsheep.github.io/outreachy24-blog/week-5 . It's a nice summary of what is spread out over this Phabricator ticket. While it does not have pointers to more detailed information, this wiki page does.

To mentors monitoring this task - could you ensure all relevant project updates get added to https://www.mediawiki.org/wiki/Outreachy/Past_projects? If there isn't anything remaining to be resolved, please close this Phabricator task. Move any pending items to a separate task.

Mike_Peel claimed this task.