User:GreenC/WaybackMedic 2.1

From Wikipedia, the free encyclopedia

Wayback Medic 2.1 is a bot that adds and maintains links from the list of known webarchive services in use on the English Wikipedia.

Edits made after 2017-01-07 are by version 2.1.

The bot operator is User:GreenC. The bot account is User:GreenC bot. The bot (software) is "WaybackMedic".


WM fixes
WaybackMedic Fixes
Fix number Function name Example edit Description Notes Date added
1 fixthespuriousone Example Remove spurious |1= in cite templates. August 2016
2 fixmissingprotocol Example 1. Add https if protocol missing from the archive.org URL.
2. Convert existing protocol http to https.
3. Add second-level domain web if missing (archive.org/web/ → web.archive.org/web/)
4. Add /web/ path (web.archive.org/2016/ → web.archive.org/web/2016/). In some URLs adding /web/ breaks the link, test for those.
HTTPS per RFC August 2016
3 fixemptyarchive Example 1. If |archiveurl= is empty or missing but |archivedate= has content, attempt to find a working archive URL based on the archive date, otherwise add {{dead link}} if appropriate.
2. If |archivedate= is empty or missing but |archiveurl= has content, generate date value based on timestamp in the archive URL.
3. If |archiveurl= and |archivedate= are empty, remove both and leave a {{dead link}} if appropriate.
August 2016
4 fixbadstatus Example Check all Wayback Machine URLs for response code errors (anything but 200s). If an error code is returned, try for a better URL via the Wayback API, first using accessdate, then using the earliest date available. If none is found there, check the WebCite API, then try the Memento API, which checks a few dozen other archives. Other techniques are undocumented. If still none is found, remove |archiveurl= and |archivedate= and add {{dead link}}. August 2016
5 Retired
6 fixemptywayback Example The wayback template is mangled in a certain way. Action: re-assemble. It won't delete multiple instances if they exist in the same ref (as in the Example). August 2016
7 fixencodedurl Example The URL was incorrectly encoded. Fully decode URL and re-encode. August 2016
8 fixdatemismatch Example 1. Ensure |archivedate= matches the snapshot date in the URL
2. Ensure date format matches dmy or mdy if set (retain ymd if in use)
August 2016
9 fixwebcitlong Example
Example
Convert WebCite URLs from short-form to long-form
Convert Freezepage.com URLs from short-form to long-form
WebCite Usage January 2017
10 fixstraydt Example Remove stray {{dead link}} template when an archive exists for the link January 2017
11 fixwam Example Merge {{wayback}} and {{webcite}} --> {{webarchive}}
Merge completed February 5, 2017
Webarchive TfM January 2017
12 fixiats Example Repair malformed archive parameter names (e.g. archive url -> |archive-url=) January 2017
13 fixswitchurl Example Move an archive.org URL from |url= to |archiveurl= and add |archivedate= if missing. January 2017
14 Retired
15 fixembway Example
Example
1. A {{wayback}} is embedded in a CS template.
2. A {{dead link}} is embedded in a CS template.
January 2017
16 <various> Example Timestamp and/or |archivedate= is 19700101 and/or out-of-bounds. January 2017
17 fixdoubleurl Example archive.org URLs are doubled, tripled, etc. January 2017
18 fixemptywebarchive Example {{webarchive}} |date= is missing or empty value. January 2017
19 fixdoublewebarchive Example Remove duplicate {{webarchive}} instances. January 2017
20 fixembwebarchive Example A {{cite web}} is embedded in a {{webarchive}} January 2017
21 fixarchiveis Example
Example
1. Convert Archive.is URLs from short-form to long-form
2. Fix URL encoding of broken links
Archive.is Usage January 2017
22 fixitems Example Change "/items/" URLs that are using machine IDs BRFA January 2017
23 encodemag Example Convert MediaWiki encoding to URL encoding in URLs (i.e. {{!}} and {{=}}) RFC3986 January 2017
24 decodespace Example Convert %20 to +, + to %20, etc. in URLs that can be repaired this way See also June 2017
25 waytree_trailgarb Example
Example
Example
Remove typical garbage characters found at the end of URLs: .,;:-"l(%XX)('') February 2018
26 fixcommentarchive Example Uncomment commented-out archives and add |deadurl= "yes" or "no" February 2018
27 waytree_x2encoding Example Repair double URL-encoding, e.g. %253A -> %3A February 2018
28 fixencodebug Example Repair missed URL-encoding of square brackets T186417 February 2018
29 fixiats Example
Example
Restore truncated Wayback URL February 2018
30 fixiats Example Convert |title={title} -> |title=Archived copy T203865 September 2018
31 urlchanger Example Move broken URL to a new working URL and undo previous archives. BOTREQ November 2018
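
Several of the fixes in the table are plain string transformations, and a short sketch may make them easier to picture. The Python below is illustrative only and is not WaybackMedic's own code: the function names, the regular expressions and the choice of Python are all assumptions. It covers the URL normalisation of fix 2, the date derivation used by fixes 3 and 8, and the double-encoding repair of fix 27. It does not include the check mentioned under fix 2 for URLs where adding /web/ breaks the link.

```python
import re
from datetime import datetime

# Hypothetical pattern: a Wayback snapshot URL with or without a scheme,
# the "web." host prefix, or the "/web/" path segment.
_WAYBACK_RE = re.compile(
    r"^(?:https?://)?(?:web\.)?archive\.org/(?:web/)?(\d{4,14})/(.+)$",
    re.IGNORECASE,
)


def normalize_wayback_url(url):
    """Fix 2 (sketch): force https, the web. host and the /web/ path.

    URLs that do not look like a Wayback snapshot are returned unchanged.
    """
    m = _WAYBACK_RE.match(url.strip())
    if m is None:
        return url
    timestamp, target = m.groups()
    return f"https://web.archive.org/web/{timestamp}/{target}"


def archive_date_from_url(url):
    """Fixes 3 and 8 (sketch): derive YYYY-MM-DD from a 14-digit timestamp."""
    m = re.search(r"/web/(\d{14})", url)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%Y%m%d%H%M%S").date().isoformat()


def repair_double_encoding(url):
    """Fix 27 (sketch): collapse %25XX back to %XX.

    A real repair would also confirm that the rewritten URL resolves,
    since a literal "%25" followed by two hex digits can be legitimate.
    """
    return re.sub(r"%25([0-9A-Fa-f]{2})", r"%\1", url)
```

For example, normalize_wayback_url("archive.org/web/20160101000000/http://example.com/") yields the https://web.archive.org/web/ form, and archive_date_from_url applied to that result gives "2016-01-01". The date is returned in ymd form; converting to dmy or mdy, as fix 8 requires, would be a separate step.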


Technical details
  • Real-time operations, no link database.
  • Many APIs including Internet Archive, Memento, WebCite and "Timemap" APIs at individual service sites
  • Multiple HTTP header status code checks at the application (WaybackMedic) layer
  • Additional time-out & retries built-in to the web transfer libraries.
  • Additional operating-procedure level checks against network and other errors - semi-supervised.
  • Multiple redundant checks of the APIs using multiple dates to ensure a page really is unavailable
  • Accepts API results but then verifies by looking at page headers and/or contents
  • If IA returns a 404 with the message "Bummer. The machine that serves this file is down.", treat it as a code 200 and leave the link alone.
  • If a link is policy-blocked by robots.txt, log it but leave it alone. In the future, the robots block may be lifted by the site owner or IA.
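
The checks listed above can be pictured with a small sketch. This is not the bot's code. It assumes the public Wayback availability endpoint (https://archive.org/wayback/available, which takes url and timestamp parameters); the source does not say which endpoints the bot calls. The function names, the action labels and the test for "robots.txt" in the page body are hypothetical. No network requests are made here: the code only builds the queries for several candidate dates and classifies a response that has already been fetched.

```python
from urllib.parse import urlencode

# Public Wayback availability API; whether the bot uses it is an assumption.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_queries(url, timestamps):
    """Build one availability query per candidate timestamp (YYYYMMDD...)."""
    return [
        f"{AVAILABILITY_ENDPOINT}?{urlencode({'url': url, 'timestamp': ts})}"
        for ts in timestamps
    ]


def classify_snapshot(status, body):
    """Map an HTTP status and page body to a hypothetical action label."""
    bummer = "Bummer" in body and "machine that serves this file is down" in body
    if status == 404 and bummer:
        return "keep"          # transient IA outage: treated like a 200
    if 200 <= status < 300:
        return "keep"
    if "robots.txt" in body:
        return "log-and-keep"  # policy block may be lifted later
    return "dead"              # candidate for replacement or {{dead link}}
```

Querying with more than one timestamp (for example the access date and an early date) mirrors the idea of redundant checks, and classifying on both the status code and the body mirrors the practice of verifying API results against the page itself.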

Statistics

The bot runs through a batch of articles about every 2-3 months, taking a break between runs. Below are some stats from the first two runs.

Run 1: 2016-12-15 -> 2017-03-15

From December 15, 2016 to March 15, 2017, WaybackMedic processed 336,271 articles. This set comprises articles edited by InternetArchiveBot from July 2016 to February 2017, plus articles requiring a merge of {{wayback}} -> {{webarchive}}. WaybackMedic made 115,066 changes in 47,810 articles. All changes are logged and available on request, e.g. which articles had the Archive.is URL fix. Diffs of each article before and after the edit are also saved and searchable.
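
To put the run totals in proportion, the figures from the paragraph above can be turned into two simple ratios. The variable names below are illustrative; the numbers are the ones stated for run 1.

```python
processed = 336_271  # articles processed in run 1
edited = 47_810      # articles in which changes were made
changes = 115_066    # individual changes

edit_rate = edited / processed         # share of processed articles edited
changes_per_edited = changes / edited  # average changes per edited article

print(f"{edit_rate:.1%} of processed articles were edited")
print(f"{changes_per_edited:.2f} changes per edited article")
```

This prints 14.2% and 2.41 respectively.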

Bummer          : 507     (Wayback links that return "Bummer page not found")
Robots.txt      : 6477    (Wayback links blocked by robots.txt)
Bogusapi        : 13273   (Wayback API-returned links that don't match real status code)
API mismatch    : 17117   (Wayback API returned fewer records than sent.)
JSON mismatch   : 28972   (Wayback API returned different size JSON)
Discovered      : 47810   (Number of articles edited by WaybackMedic)
Log 404         : 9894    (Dead wayback links)
Log emptyarch   : 2001    (Empty archiveurl arguments)
Log emptyway    : 0       (Ref has an empty {{wayback}})
Log encode      : 0       (URL misencoded)
Log spurious 1  : 191     (Spurious "|1=" parameter)
Log trail       : 3       (URL has a trailing bad character)
Log dead URL    : 185     (|url= is dead even though dead-url=no, archiveurl is dead and no {{dead}})
Log skindeep    : 8466    (changes to URL are skindeep)
Log doubleurl   : 416     (Double archive.org URL error)
Log datemismatch: 27433   (Date in archive URL doesn't match archivedate argument in cite template)
Log wrong https : 895     (https and :80 conflict)
Log WAM         : 32198   (webarchive merge)
Log stray dead  : 2709    (stray {{dead link}} - straydt.awk)
Log WC|IS->IA   : 1022    (Convert WebCite|Archive.is to Wayback et al.)
Log short url   : 10552   (WebCite URL elongated - webcitlong.awk)
Log short url   : 522     (Archive.is URL elongated - archiveis.awk)
Log citeaddl    : 256     (webarchive merge - citeaddl.awk)
Log nowikiway   : 41      (Wayback mangled a certain way)
Log br bug      : 0       (br bug)
Log miss timest : 3043    (Timestamp missing from IA URL)
Log embedded way: 559     (embedded wayback template in cite template)
Log embedded wa : 18      (embedded cite template in webarchive template)
Log switch URL  : 6051    (archive in url= field)
Log dead /items/: 281     (/items/ URL dead replacement)
Log x2 webarch  : 2311    (double webarchive template)
Log pct encode  : 15      (pct encode magic characters in URLs)
New alt archive : 1009    (Replaced with archive URL found at Mementoweb.org)
New IA link     : 509     (Added new IA link)
New IA date     : 1642    (Changed snapshot date)
Redirects       : 52      (Page was a redirect)
Zombie links    : 650     (Links needing removal by hand)
Wayback RM      : 2836    (Wayback link deleted)

; Links found

Wayback All     : 1099355 (Wayback links total found)
WebCite All     : 39120   (WebCite links total found)
Archive.is All  : 1288    (Archive.is links total found)
Loc.gov All     : 410     (Loc.gov links total found)
Portugal All    : 180     (Portugal links total found)
Stanford All    : 30      (Stanford links total found)
Archive-it All  : 76      (Archive-it.org links total found)
Bibalex All     : 17      (Bibalex.org links total found)
NatArchiveUK All: 4668    (National Archives (UK) links total found)
Europa Archives : 2       (Europa Archives (Ireland) links total found)
Perma.cc All    : 0       (Perma.CC links total found)
PRONI All       : 0       (PRONI links total found)
UK Parliament   : 1       (UK Parliament links total found)
UK Web Archive  : 125     (UK Web Archive (British Library) links total found)
Canada All      : 68      (Canada links total found)
Catalonian All  : 1       (Catalonian links total found)
Singapore Archiv: 10      (Singapore Archives links total found)
Slovenian Archiv: 1       (Slovenian Archives links total found)
Freezepage.com  : 1524    (Freezepage.com links total found)
Webharvest.gov  : 4       (US Nat. Archives links total found)
NLA AU ALL      : 2610    (AU Nat. Archives links total found)
archiveorg items: 419     (Archive.org /items/ total found)

Run 2: 2017-03-19 -> 2017-04-07

From March 19, 2017 to April 7, 2017, WaybackMedic processed 149,195 articles. These were all articles on English Wikipedia containing a {{dead link}} template. WaybackMedic checked each tagged link and replaced it with a working archive if available. It also made its other standard fixes. The number of links saved was 31,317.

  • Archive.org: 12,804
  • Archive.is: 16,541
  • Webcite: 20
  • Library of Congress: 413
  • National Archives UK: 405
  • NLA Australia: 6
  • arquivo.pt (Portugal): 284
  • Stanford University: 17
  • Archive-It.org: 584
  • BibAlex: 86
  • National Archives Iceland: 61
  • Europa Archives Ireland: 29
  • Proni Web Archives: 6
  • Parliament UK: 34
  • UK Web Archive (British Library): 55
  • Canada: 8

The Archive.is count is so high because most of the articles had already been checked for archive.org snapshots on previous runs of IABot. Archive.is has many pages unavailable anywhere else, and WaybackMedic is the only bot adding Archive.is links. Generally, WaybackMedic uses Archive.is as a last resort.

Notes
