Article counts revisited

From Meta, a Wikimedia project coordination wiki

This page in a nutshell:

On 29 March 2015, the article counts of many Wikimedia wikis suddenly changed by significant amounts, causing a great deal of confusion among folks who pay attention to such matters. The quick explanation is that a maintenance script called updateArticleCount.php, which counts articles "from scratch", was run on almost all of Wikimedia's "content" wikis (specifically, all of the individual languages of Wikipedia, Wiktionary, Wikiquote, Wikisource, Wikinews, Wikiversity — including Wikiversity Beta — and Wikivoyage, but not Wikibooks nor any other multilingual project such as Wikimedia Commons or the Wikimedia Incubator). For several reasons (explained further below), the article counts reported by most of these wikis in the past have been Just Plain Wrong — many by negligible amounts, but some by ridiculously huge amounts — and now the counts are "correct". Unfortunately, not all of the causes of the unreliable counts have been fixed, and bugs affecting article counting continue to crop up, so the counts will still not be completely reliable going forward. To deal with this fact, the current plan is to recount the articles on these wikis on the 21st of each month. This should ensure that the article counts can't get too far off of their "correct" values for too long.

February 2018 update:

On 14–15 February 2018, and then again on 21 February 2018, the site statistics (including article counts) were recalculated for all Wikimedia wikis. While this changes some aspects of the situation described on this page, much of the information about how MediaWiki counts articles (and how it has been done in the past) is still relevant.

March 2018 update:

On 6 March 2018, the option to use the 'comma' method of article counting was removed from MediaWiki (affecting versions 1.31 onwards).

April 2018 update:

As of 15 April 2018, all Wikimedia wikis are now getting their on-wiki statistics (not just articles) recounted from scratch on the 1st and 15th day of each month, starting at 5:39 a.m. UTC.

August 2022 update:

As of 29 August 2022, all Wikimedia wikis are now getting their on-wiki statistics recounted from scratch every day at 9 p.m. UTC.

The article-count problem

For years, people have been using the {{NUMBEROFARTICLES}} variable (which can be retrieved through the MediaWiki API using the meta=siteinfo&siprop=statistics query) to track the article counts of Wikimedia wikis on pages such as Wikipedia:Milestone statistics (at the English Wikipedia) and Wikimedia News, as well as in tables of collected statistics at List of Wikipedias/Table, Wiktionary/Table, and so forth. Unfortunately, for various reasons these article counts have never been very reliable; in fact, they have been much less reliable than most people probably think.
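As a rough illustration of the API route mentioned above, the following Python sketch builds the query URL and pulls the article count out of a JSON response. The `action=query` and `format=json` parameters, the helper names, and the sample response body (with made-up numbers) are assumptions added for the example; only `meta=siteinfo&siprop=statistics` comes from the text above.

```python
import json
from urllib.parse import urlencode


def statistics_url(api_endpoint: str) -> str:
    """Build the siteinfo/statistics query URL for a wiki's api.php endpoint."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "statistics",
        "format": "json",
    }
    return f"{api_endpoint}?{urlencode(params)}"


def article_count(response_text: str) -> int:
    """Extract the article count from a JSON statistics response."""
    return json.loads(response_text)["query"]["statistics"]["articles"]


# Made-up response body for illustration; the numbers are not real.
SAMPLE_RESPONSE = '{"query": {"statistics": {"pages": 120, "articles": 45, "edits": 900}}}'

print(statistics_url("https://en.wikipedia.org/w/api.php"))
print(article_count(SAMPLE_RESPONSE))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out so the sketch runs without network access.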

The many relevant issues boil down to the following:

  1. The intended definition of what constitutes an "article" (also known as a "content page") has varied in the past.
  2. The actual implementation of article counting in different parts of the MediaWiki software has been inconsistent in the past.
  3. The article counts have never been recomputed for all Wikimedia wikis to "fix" the problems caused by points 1 and 2.
  4. The article count of a high-traffic wiki is inherently difficult to pin down.

The following sections provide more in-depth discussions of these issues, followed by a detailed summary of the changes seen when the articles were recounted on most Wikimedia wikis on 29 March 2015.

(Note that this account is almost entirely concerned with the way the MediaWiki software itself counts articles, not the way Wikistats counts articles. If you are familiar with Wikistats, forget what you know about the way it counts articles until the section dealing explicitly with Wikistats near the bottom of this page.)

Shifting definitions of what constitutes an article

The definition of what constitutes an article has changed over time. When MediaWiki first started counting articles (which, at the time, were Wikipedia articles, specifically), it simply checked whether a page contained a comma or not.

In particular, as of early 2003, an article was:

(1) a non-redirect in the main namespace, containing at least one comma

This worked fine for the English Wikipedia, but once other projects in other languages started up, people realized that this method would not work for all wikis. A very quick (one week!) discussion and vote was held here on the Wikimedia Meta-Wiki in March 2003, the details of which can be found at:

Based on the results of that vote, it was decided that a page would be counted as an article if it was:

(2) a non-redirect in the main namespace, containing at least one [[wikilink]]

Unfortunately, the implementation of this definition in the MediaWiki software left a lot to be desired (details in next section): the last criterion, having a wikilink, was checked merely by looking for the string "[[" anywhere in the wikicode "source" for the page. Eventually, this shortcoming led some editors to game the system by routinely placing the text "<!--[[-->" (using an HTML comment) on all their pages, just to get them counted as articles (something actually foreseen but dismissed in the announcement of the 2003 vote results)!

In any case, at this point the real definition of an article was:

(3) a non-redirect in the main namespace, containing the string "[["

In June 2006, the $wgContentNamespaces configuration variable was added to MediaWiki (specifically, in revision 14738) to enable namespaces other than the main one (ns0) to count as "content". (This change would eventually be taken advantage of mainly by Wikisources.)

Thus the real definition of an article became:

(4) a non-redirect in a content namespace, containing the string "[["

And this is how things stayed until May 2011, when Everything Changed. (More about this below.)
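To make the difference between these definitions concrete, here is a minimal Python sketch of definitions (1) and (3)/(4) as simple predicates on raw wikicode. The function names and arguments are invented for illustration and are not MediaWiki's actual code; the point is only that a bare "[[" check cannot tell a real link from the HTML-comment trick.

```python
def is_article_comma(wikitext: str, namespace: int, is_redirect: bool) -> bool:
    """Definition (1): non-redirect in the main namespace with at least one comma."""
    return namespace == 0 and not is_redirect and "," in wikitext


def is_article_bracket(wikitext: str, namespace: int, is_redirect: bool,
                       content_namespaces=(0,)) -> bool:
    """Definitions (3)/(4): non-redirect in a content namespace containing "[["."""
    return namespace in content_namespaces and not is_redirect and "[[" in wikitext


plain = "A short page with no links"
gamed = plain + "<!--[[-->"  # the comment trick described above

print(is_article_bracket(plain, 0, False))  # False
print(is_article_bracket(gamed, 0, False))  # True
```

Passing `content_namespaces=(0,)` reproduces definition (3); a wiki with extra content namespaces would pass a longer tuple, as in definition (4).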

Problems with how article counting has been implemented

Despite the best intentions of Wikimedians, the de jure definition of what constitutes an article (as in the previous section) has probably never matched the de facto one implemented in the MediaWiki software (except possibly in the very early days, when the code was much smaller). In fact, it makes little sense to talk about "one" definition used by the software, since there are several points at which pages are checked to see if they count as articles, and these have not always used compatible criteria.

To be specific, this is how articles get counted: The article count is set to an initial value by a script when a wiki is first created, and afterwards can be reset by using the updateArticleCount.php script, but otherwise all operations on the article count are relative changes in response to different actions on the wiki itself.

  • When a new page is created, the count may increase by 1 or not change.
  • When a page is edited, the count may increase by 1, decrease by 1, or not change.
  • When a page is moved, the count may increase by 1, decrease by 1 or not change.[1][2]
  • When a page is imported, the count may increase by 1 or not change.
  • When a page is deleted, the count may decrease by 1 or not change.
  • When a page is undeleted, the count may increase by 1 or not change.
  • When the edit histories of two pages are merged, the count may decrease by 1 or not change.

Even these descriptions don't entirely match the reality of how articles have been counted in the past (and possibly still today), since bugs have existed — for years, in some cases — that caused unexpected things to happen beyond the options listed above. (More about this later.)
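The "relative change" model in the list above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not MediaWiki's code: each action is reduced to whether the page counted as an article before and after it, and the event list is hypothetical.

```python
def count_delta(was_article: bool, is_article_now: bool) -> int:
    """Relative change applied to the stored count for one action: +1, -1, or 0."""
    return int(is_article_now) - int(was_article)


# Hypothetical sequence of actions: (action, counted before, counted after).
events = [
    ("create", False, True),    # new page that qualifies
    ("edit", True, False),      # edit removes the qualifying feature
    ("edit", False, True),      # edit restores it
    ("delete", True, False),    # deleting an article
    ("create", False, False),   # new page that does not qualify
]

stored_count = 10
for _action, before, after in events:
    stored_count += count_delta(before, after)

print(stored_count)  # 10
```

Because only deltas are ever applied, a single missed or wrong delta stays in the stored count until something recounts from scratch.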

As mentioned in the last section, when having a wikilink became the salient feature of an article, the software was changed to look for the string "[[" anywhere in the wikicode source. This failed to distinguish between many different types of legitimate wikilinks (1–5 below), as well as two types of "fake" wikilinks (6 and 7), and one type of non-wikilink (8):

  1. page links — examples: [[Babel]], [[Talk:Babel]], etc.
  2. category links — example: [[Category:Software]]
  3. image/file links — example: [[File:Yes.png]]
  4. interlanguage links — example: [[de:Wikipedia:Hauptseite]] or [[:de:Wikipedia:Hauptseite]]
  5. interwiki links — example: [[species:]]
  6. hidden links — example: <!-- [[don't look at me]] -->
  7. deactivated links — example: <nowiki>[[look at me]]</nowiki>
  8. any text containing the string "[[" — example: wikilinks start with "[["…

(Note that links like [[:Category:Software]] and [[:File:Yes.png]], which start with an initial colon, are regular page links of type 1.)
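The distinctions in the list above can be sketched in Python as follows. This is a deliberately simplified classifier, not how MediaWiki parses links: the regular expressions ignore nesting and templates, and the two prefix tables are hypothetical stand-ins for a wiki's real interlanguage and interwiki configuration.

```python
import re

COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
NOWIKI = re.compile(r"<nowiki>.*?</nowiki>", re.DOTALL | re.IGNORECASE)
LINK = re.compile(r"\[\[([^\[\]|]*)(?:\|[^\[\]]*)?\]\]")

# Hypothetical prefix tables; a real wiki has many more of each.
LANGUAGE_PREFIXES = {"de", "fr", "es"}
INTERWIKI_PREFIXES = {"species", "commons"}


def classify_link(target: str) -> str:
    """Name the link type (1 to 5 in the list above) for one link target."""
    leading_colon = target.startswith(":")
    name = target.lstrip(":")
    prefix = name.split(":", 1)[0].strip().lower() if ":" in name else ""
    if prefix in LANGUAGE_PREFIXES:
        return "interlanguage"
    if prefix in INTERWIKI_PREFIXES:
        return "interwiki"
    if not leading_colon and prefix == "category":
        return "category"
    if not leading_colon and prefix in ("file", "image"):
        return "file"
    return "page"


def visible_links(wikitext: str) -> list:
    """Classify links that survive once comments and <nowiki> spans are dropped."""
    visible = NOWIKI.sub("", COMMENT.sub("", wikitext))
    return [classify_link(target.strip()) for target in LINK.findall(visible)]


fake = "<!-- [[don't look at me]] --> <nowiki>[[look at me]]</nowiki> starts with [["
print("[[" in fake, visible_links(fake))  # True []
```

The last line shows the gap: the plain string test says "has a link", while nothing that survives comment and nowiki stripping is a link at all.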

In November 2007, bug 11868 was submitted requesting that links provided by templates be counted, too. In the course of the ensuing discussion, it was pointed out that links other than page links (types 2, 3, etc.) were being counted, and that in fact three different counting methods (all of which correctly ruled out redirects and pages outside of content namespaces) were being employed at different places in the code:

  • Every time a page was saved, the "[["-string criterion was used to see whether the page would count as an article.
  • When the statistics-initializing script (now called initSiteStats.php) was run, it just checked to see whether the pages were non-empty.
  • When the updateArticleCount.php script was run, it checked whether a certain database table actually contained page links originating from each page in question (thus counting only type 1 links, but including links provided by templates).

In addition, when pages were imported into a wiki, the article count was not updated correctly (see, for example, bugs 2483, 5703, and 6600). All of these inconsistencies allowed the on-wiki article counts (the respective {{NUMBEROFARTICLES}}) to diverge from the "correct" counts over time.
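A small Python sketch shows how the three methods described above can disagree on the same set of pages. The page records and the `page_links` field (standing in for rows in the links table the recount script consulted) are invented for illustration; all pages are assumed to be non-redirects in a content namespace.

```python
# Hypothetical pages: raw wikicode plus the number of page links recorded
# for the page after templates are expanded.
pages = [
    {"text": "See [[Babel]].", "page_links": 1},
    {"text": "{{infobox}}", "page_links": 2},            # links come from a template
    {"text": "[[Category:Software]]", "page_links": 0},   # category link only
    {"text": "[[File:Yes.png]]", "page_links": 0},        # file link only
    {"text": "Plain text only.", "page_links": 0},
    {"text": "", "page_links": 0},                        # empty page
]


def count_on_save(pages) -> int:
    """Save-time criterion: the string "[[" appears in the wikicode."""
    return sum("[[" in page["text"] for page in pages)


def count_on_init(pages) -> int:
    """Statistics-initializing criterion: the page is non-empty."""
    return sum(page["text"] != "" for page in pages)


def count_on_recount(pages) -> int:
    """Recount criterion: at least one recorded page link."""
    return sum(page["page_links"] > 0 for page in pages)


print(count_on_save(pages), count_on_init(pages), count_on_recount(pages))  # 3 5 2
```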

The major (largely unnoticed) change of 14 May 2011

In May 2011, almost five years after the "content namespace" change, a developer finally acted to "rationalize" the way articles were counted, and in revision 88113 introduced the $wgArticleCountMethod configuration variable to specify which type of non-empty, non-redirect, content-namespace pages would count as articles. Article.php and SiteStats.php (parts of the MediaWiki software concerned with tracking article counts) were modified to reflect this change.

So now there are actually three different possible definitions of an "article", based on whether $wgArticleCountMethod is set to "link", "comma", or "any" for a particular wiki:

(5a) a non-redirect in a content namespace, containing (after parsing) at least one true [[wikilink]] to another page on the same wiki ("link")
(5b) any non-empty non-redirect in any content namespace ("any")
(5c) a non-redirect in a content namespace, containing (after parsing) at least one comma ("comma")

Definition (5a) is the default and is used on the vast majority of Wikimedia wikis; notice how closely it adheres to the "original intent" of the "article reform" voters back in March 2003. (It just took 8 years to get there!) Definition (5b) is currently used by three wikis: the Czech Wikinews, Chinese Wikinews, and Gujarati Wikisource. Definition (5c) is used by two wikis: the English Wikibooks and Portuguese Wikibooks; it is similar to the original comma-based definition (1). (InitialiseSettings.php lists all configuration settings for Wikimedia wikis.) Unfortunately, neither the updateArticleCount.php script nor the initSiteStats.php script seems to correctly implement definition (5c), as they rely on SiteStats.php, which contains a "fake" comma-based criterion. It is unclear whether the same thing is true of any other parts of the MediaWiki code concerned with article counting.

Note that because of the "after parsing" part of definitions (5a) and (5c), for most wikis one can no longer tell whether a page will count as an article simply by examining its page source; if the page contains templates these must be fully parsed in order for any links provided by those templates to be accounted for. Fortunately, this is done when pages are saved.
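The three definitions can be pictured as one check that switches on the configured method. The Python sketch below is an assumption-laden illustration, not MediaWiki's implementation: `ParsedPage` and `is_article` are hypothetical names, and the page is assumed to have been parsed already, so `text` and `page_links` reflect template output.

```python
from dataclasses import dataclass, field


@dataclass
class ParsedPage:
    namespace: int
    is_redirect: bool
    text: str                                       # content after parsing
    page_links: list = field(default_factory=list)  # same-wiki links after parsing


def is_article(page: ParsedPage, method: str = "link", content_namespaces=(0,)) -> bool:
    """Apply definition (5a), (5b), or (5c) depending on `method`."""
    if page.is_redirect or page.namespace not in content_namespaces:
        return False
    if method == "link":
        return len(page.page_links) > 0
    if method == "any":
        return page.text != ""
    if method == "comma":
        return "," in page.text
    raise ValueError(f"unknown article count method: {method!r}")


page = ParsedPage(namespace=0, is_redirect=False, text="Hello world")
print([is_article(page, m) for m in ("link", "any", "comma")])  # [False, True, False]
```

The same page counts under "any" but not under "link" or "comma", which is why a wiki's configured method matters when comparing counts.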

Because article counts were not updated across all Wikimedia wikis to reflect this software change, very few people realized in May 2011 that anything significant had just happened. It would be a full year before the magnitude of the changes was seen "publicly".

Non-systematic fixing of article counts in the past

Given the vast changes in the definition of an article and the actual methods used in the software to count them, it is perhaps surprising that articles have never been recounted for all Wikimedia wikis at once to correct the errors that have crept into the on-wiki counts. Instead, various bug reports have been submitted and acted upon individually. Presumably, some wikis have never had their articles recounted since they opened for editing. (This may be true of some of the languages of Wikibooks, and may or may not be true of the multilingual Wikispecies. The site statistics of Wikimedia Commons were completely recalculated on 26 August 2009; Wikidata seems to have had its "items" recounted on 9 October 2014, resulting in a 21% loss in article count: from 15,839,462 some unknown number of hours before the recount occurred, to 12,572,779 twenty-one hours later.)

There's "no telling" (unless a script is run, of course) how far off the article counts are on the Wikimedia Incubator, and all the "coordinating" wikis, such as the MediaWiki wiki — not to mention all the Wikimania and chapter wikis. Fortunately, people don't seem to care as much about these article counts, so it's probably not a big deal if they're wrong. (Some people would say it's not a big deal if any Wikimedia wikis' article counts are wrong. These people should count themselves lucky that they get to ignore reality in this way. ;-)

Changes to article counts on 10 May 2012

On 10 May 2012, in response to a bug report, the updateArticleCount.php maintenance script was run on all of the language editions of Wiktionary and Wikisource (the articles of which are often called "entries" and "text units", respectively). This caused 8 Wiktionaries to rise to higher milestone levels tracked at Wikimedia News, and 24 to fall to lower levels; also, 15 Wikisources rose to higher levels and 13 fell to lower levels. (For example, the Greek Wiktionary rose from 191,251 entries to 290,691, passing the 200,000 milestone level.)

The most extreme changes in Wiktionary counts were seen in the Russian Wiktionary, which increased by 109,838 entries (a 34% increase), and the Chinese Wiktionary, which decreased by 376,510 entries (a 31% decrease). The largest relative increase was seen in the Western Panjabi Wiktionary, which rose by 148% (2,968 entries); this was the only Wiktionary to more than double its entry count. Two other Wiktionary languages increased by more than 50% (Interlingue, +54%, +99 entries; Greek, +52%, +99,440 entries). 14 Wiktionaries lost all of their entries (a 100% decrease), all of which had 5 or fewer entries before the change.

Of the 171 language editions of Wiktionary (at the time), 54 of them (32%) gained or lost fewer than 10 entries, 34 (20%) between 10 and 99 entries, 53 (31%) between 100 and 999 entries, and 30 (18%) at least 1000 entries (percents don't add to 100% due to rounding). The mean absolute change (i.e., ignoring positive and negative signs of the changes) was 5,452 entries; the median absolute change was 87 entries. The total entry count (still speaking of "articles", not "total pages") summed across all language editions of Wiktionary decreased by 220,590 entries (a 1.6% decrease).
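For readers who want to reproduce this kind of summary, here is a short Python sketch. The per-wiki changes in `sample_changes` are made up; the group sizes passed to `rounded_shares` are the four Wiktionary groups quoted above, and show why the rounded percentages sum to 101 rather than 100.

```python
from statistics import mean, median


def summarize(changes):
    """Return (mean, median) of the absolute changes, ignoring signs."""
    absolute = [abs(change) for change in changes]
    return mean(absolute), median(absolute)


def rounded_shares(group_sizes):
    """Percentage of the total in each group, rounded to whole percents."""
    total = sum(group_sizes)
    return [round(100 * size / total) for size in group_sizes]


sample_changes = [3, -45, 820, -1500, 0]  # hypothetical per-wiki changes
print(summarize(sample_changes))           # (473.6, 45)

shares = rounded_shares([54, 34, 53, 30])  # group sizes from the paragraph above
print(shares, sum(shares))                 # [32, 20, 31, 18] 101
```

A mean far above the median, as in the real figures (5,452 versus 87), indicates that a few very large changes dominate the average.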

As for Wikisource, the most extreme changes were seen in the French Wikisource, which increased by 819,297 text units (a 291% increase), and the Thai Wikisource, which decreased by 8,548 units (a 63% decrease). After the French Wikisource, the next two largest relative increases were seen in the Belarusian Wikisource, which rose by 215% (1,029 units), and the German Wikisource, which rose by 193% (166,943 units). No Wikisources lost all of their articles (so, no 100% decreases), but 3 Wikisources lost more than 80% (Slovak, −85%, −194 units; Azeri, −84%, −1,999 units; Faroese, −82%, −49 units), and 10 Wikisources more than doubled their text-unit count (greater than 100% increase).

Of the 64 language editions of Wikisource (at the time), only 6 of them (9%) gained or lost fewer than 10 text units, 9 (14%) between 10 and 99 units, 20 (31%) between 100 and 999 units, and 29 (45%) at least 1000 units (percents don't add to 100% due to rounding). The mean absolute change was 25,951 text units; the median absolute change was 819 text units. The total text-unit count (still "articles") summed across all language editions of Wikisource increased by 1,599,639 text units (a whopping 100% increase).

For complete details of the changes seen in Wiktionary and Wikisource article counts on 10 May 2012, see the tables at Article counts revisited/2012-05-10 recount changes.

Potential changes in other Wikimedia projects

Since it was clear at the time that the updateArticleCount.php script would eventually have to be run on the wikis in Wikimedia's other content projects (Wikipedia, etc.), additional tables were created in May 2012 that showed the article-count changes that would occur if these wikis had their articles recounted. These "potential changes" were determined by processing database dumps (in much the same way that Wikistats does).

Not surprisingly, it was found that changes of a similar magnitude to those already observed would be seen in the rest of the projects. In particular, 3 Wikipedias, 2 Wikibooks, and 1 Wikiversity would have risen to new milestone levels, whereas 26 Wikipedias, 29 Wikibooks, 30 Wikiquotes, 11 Wikinews, and 2 Wikiversities would have fallen to lower levels. (Note that Wikivoyage was not a Wikimedia project at this time.) The most extreme potential increases and decreases in article counts in each project were: for Wikipedia, an increase of 41,026 articles (1%) in English and a decrease of 11,348 (11%) in Hindi; for Wikibooks, an increase of 2,399 (13%) in German and a decrease of 4,493 (88%) in Vietnamese; for Wikiquote, an increase of 553 (8%) in German and a decrease of 4,039 (22%) in Polish; for Wikinews, an increase of 487 (4%) in German and a decrease of 2,343 (18%) in Polish; for Wikiversity, an increase of 8,605 (219%) in German and a decrease of 2,846 (16%) in English. (German, Polish, and English are among the largest and most active editions of most Wikimedia content projects, which is probably why they recur so often in this list of extreme changes. Other "major" languages showed smaller potential changes, possibly because their article counts had already been fixed at least once through bug reports, or perhaps for other reasons related to differing patterns of editing and importing activities on those wikis.)

For a full accounting of the potential article-count changes for the remaining Wikimedia content projects, see the tables at Article counts revisited/2012-05-10 potential recount changes.

Changes to article counts on 29 March 2015

On 29 March 2015, the updateArticleCount.php maintenance script was run on all Wikipedia, Wiktionary, Wikiquote, Wikisource, Wikinews, Wikiversity, and Wikivoyage wikis. The following table lists article counts and milestone levels for the 68 wikis that changed milestone levels tracked at Wikimedia News. This information was collected off-wiki by a Perl script written (and periodically run) by User:Dcljr.

Assuming the updateArticleCount.php script executed when it was supposed to (05:00 UTC on 29 March 2015), the data below was collected about 2 hours before that script was run, at 02:55 UTC on 29 March 2015, and 21 hours after it was started, at 01:50 UTC on 30 March 2015. (It is not clear how long it took the script to complete the recounting process.)

Note that the "%chg" column gives the percent by which the article count changed from before to after the running of the maintenance script; the "%off" column gives the percent by which the "before" article count differed from the "after" (i.e., "true") article count. Thus, the former is a chronological "percent change" and the latter is a "percent error".
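The two columns differ only in which count serves as the base. A minimal Python sketch, assuming "%off" is signed so that a "before" count above the true count comes out positive (the sign convention is an assumption, not stated on this page), using the Wikisource project totals from the summary table on this page as a worked example:

```python
def percent_change(before: int, after: int) -> float:
    """Chronological change, relative to the "before" count."""
    return (after - before) / before * 100


def percent_off(before: int, after: int) -> float:
    """Error of the "before" count, relative to the "after" (true) count."""
    return (before - after) / after * 100


before, after = 4_201_239, 3_680_009  # all Wikisources combined
print(round(percent_change(before, after), 1))  # -12.4
print(round(percent_off(before, after), 1))     # 14.2
```

The two figures differ in size because a drop of 521,230 is a smaller fraction of the larger "before" count than of the smaller "after" count. Either function is undefined when its base count is zero.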

A second table lists information collected many hours before and after the maintenance script was run, but for all of the recounted wikis. It also shows total pages and edit counts for comparison. For this larger table, see Article counts revisited/2015-03-29 changes to all recounted wikis.

Among the 679 recounted wikis, comparing article counts from 1 or 2 days before the articles were recounted to those from almost 3 days after, the most extreme changes were seen in the English Wikisource, which decreased by 281,199 (a 27% decrease), and the English Wikipedia, which increased by 97,285 (a 2% increase). The largest relative increases were seen in the Norwegian (Nynorsk) Wikiquote, which rose by 40% (479 pages); the Bengali Wiktionary, 27% (241 pages); and the Mirandese Wikipedia, 23% (513 pages). No other wikis' article counts increased by more than 15%. One Wikipedia and 14 Wikiquotes lost all of their articles (a 100% decrease), all of which had 7 or fewer articles before the change. (Obviously, with such a large window of time, the smaller observed changes could well be due to bot edits or even regular user edits to the wikis. It is significant, however, that the overall largest change occurred on a wiki that had its articles recounted three years earlier.)

Of the 679 wikis, 113 of them (17%) did not gain or lose any articles, 117 (17%) gained or lost between 1 and 9 articles, 207 (30%) between 10 and 99 articles, 182 (27%) between 100 and 999 articles, and 60 (9%) at least 1000 articles. The mean absolute change (ignoring positive and negative signs of the changes) was 1,340 articles; the median absolute change was 39 articles.

Summary of article-count changes by project

| Project | Wikis | Articles before | Articles after | Change | Percent change | Decreased wikis | Unchanged wikis | Increased wikis | Total decrease | Total increase |
|---|---|---|---|---|---|---|---|---|---|---|
| Wikipedia | 288 | 34,795,319 | 34,836,734 | +41,415 | +0.1% | 187 | 6 | 95 | −124,868 | +166,283 |
| Wiktionary | 171 | 22,263,625 | 22,244,087 | −19,538 | −0.1% | 49 | 80 | 42 | −30,148 | +10,610 |
| Wikiquote | 89 | 180,392 | 158,845 | −21,547 | −11.9% | 86 | 1 | 2 | −22,582 | +1,035 |
| Wikisource | 65 | 4,201,239 | 3,680,009 | −521,230 | −12.4% | 30 | 16 | 19 | −521,649 | +419 |
| Wikinews | 33 | 217,288 | 202,885 | −14,403 | −6.6% | 23 | 2 | 8 | −15,167 | +764 |
| Wikiversity | 16 | 86,031 | 73,137 | −12,894 | −15.0% | 11 | 2 | 3 | −13,049 | +155 |
| Wikivoyage | 17 | 84,315 | 81,072 | −3,243 | −3.8% | 7 | 6 | 4 | −3,307 | +64 |
| All | 679 | 61,828,209 | 61,276,769 | −551,440 | −0.9% | 393 | 113 | 173 | −730,770 | +179,330 |
Note: The "before" counts were collected at 00:00 UTC on 28 March 2015 for Wikipedia and Wikivoyage, and 12:00 UTC on 27 March 2015 for the other projects (29 hours and 41 hours before the script was run, respectively) — except the Marathi and Oriya Wikisources, which had their counts collected at 08:33 UTC on 19 March 2015. The "after" counts were collected at 12:00 UTC on 31 March 2015 for all wikis (67 hours after the script was started) — except the two Wikisources, which were collected at 05:58 UTC on 4 April 2015.
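The summary figures can be cross-checked mechanically. The Python sketch below copies the project rows from the table and verifies that the columns add up to the "All" row and that each net change equals the sum of its total decrease and total increase; the helper names are invented for the example.

```python
# (wikis, before, after, decreased, unchanged, increased, total_decrease, total_increase)
ROWS = {
    "Wikipedia":   (288, 34_795_319, 34_836_734, 187, 6, 95, -124_868, 166_283),
    "Wiktionary":  (171, 22_263_625, 22_244_087, 49, 80, 42, -30_148, 10_610),
    "Wikiquote":   (89, 180_392, 158_845, 86, 1, 2, -22_582, 1_035),
    "Wikisource":  (65, 4_201_239, 3_680_009, 30, 16, 19, -521_649, 419),
    "Wikinews":    (33, 217_288, 202_885, 23, 2, 8, -15_167, 764),
    "Wikiversity": (16, 86_031, 73_137, 11, 2, 3, -13_049, 155),
    "Wikivoyage":  (17, 84_315, 81_072, 7, 6, 4, -3_307, 64),
}
ALL_ROW = (679, 61_828_209, 61_276_769, 393, 113, 173, -730_770, 179_330)


def column_sums(rows) -> tuple:
    """Sum each column across all project rows."""
    return tuple(sum(column) for column in zip(*rows.values()))


def rows_consistent(rows) -> bool:
    """Check wiki tallies and net changes within every row."""
    return all(
        wikis == dec_w + same_w + inc_w and after - before == dec + inc
        for wikis, before, after, dec_w, same_w, inc_w, dec, inc in rows.values()
    )


print(column_sums(ROWS) == ALL_ROW, rows_consistent(ROWS))  # True True
```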

Is there really a true article count?

For most wikis, given a definition of what constitutes an article, there is a relatively stable "true" or "correct" article count that the maintenance script should be able to arrive at. However, for the highest-traffic wikis, like the English Wikipedia, it may be debatable (at times, anyway) whether there even is a true article count to be determined in the first place.

Over all of 2014, the English Wikipedia averaged about 3 million page edits per month. This works out to just over 1 edit per second. At peak periods of editing, the number of edits per second can be far higher. Any of these edits can potentially change the article count (although, of course, most do not). Even a single edit to a template can change the "article status" of several pages at the same time. Add to this the fact that several servers are simultaneously involved in handling incoming user requests, updating page content, updating site statistics, rendering pages (when necessary), and caching rendered pages, and the very concept of a "true" article count at a particular instant becomes a slippery one.
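The "just over 1 edit per second" figure is simple arithmetic, sketched here in Python. The average month length (365 days divided by 12) is an assumption made for the calculation.

```python
SECONDS_PER_DAY = 24 * 60 * 60
AVERAGE_DAYS_PER_MONTH = 365 / 12  # assumed average month length


def edits_per_second(edits_per_month: float) -> float:
    """Average edit rate implied by a monthly edit total."""
    return edits_per_month / (AVERAGE_DAYS_PER_MONTH * SECONDS_PER_DAY)


print(round(edits_per_second(3_000_000), 2))  # 1.14
```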

Even if one assumes there is always a true count, it may not be possible for the script to successfully determine it, since the counting process itself is not instantaneous.

Wikistats

For many years now, Erik Zachte has been tracking monthly article counts (and many, many other statistics) for all of Wikimedia's content projects (and a few other related wikis), counted offline "from scratch" based on periodic database dumps, using a set of custom Perl scripts he wrote for that purpose. This "Wikistats" information is available at stats.wikimedia.org. It includes monthly article counts going back to the very beginning of each wiki (January 2001, in the case of the English Wikipedia).

Some people may view these numbers as representing the "true" article counts that the live, on-wiki counts are trying to estimate. There are several problems with this idea. First, consider that these database dumps are happening continually throughout each month (in fact, a handful are always in progress at any particular moment), and that completing the dumps required to count articles can take hours, or even days for the largest wikis. Thus, the Wikistats counts share the same problem as the counts given by the updateArticleCount.php script: they do not represent "snapshots" of the "true" counts at any specific instant in time.

More importantly, though, all of the counts for past months shown at Wikistats are recalculated every month based on the then-current edit histories of each wiki's content pages. This means that all articles that have been deleted (or been the target of any other action that decreases the article count) since the last full recount will "disappear" from all the previous monthly counts going back to when those articles were created. (To be more specific: in Feb 2015, the count Wikistats gives for a specific wiki for, say, the month of Feb 2014 ("count A"), will likely not match the Feb 2014 count Wikistats gave back in Feb 2014 ("count B"). This is because count A is based on the edit histories of pages on the wiki in Feb 2015, whereas count B was based on the edit histories of pages on the wiki in Feb 2014.) Clearly, this makes the Wikistats counts for past months completely unreliable as a measure of how many articles the wiki actually had at specific times in the past.
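The difference between "count A" and "count B" can be made concrete with a toy model in Python. The page list is hypothetical and months are plain "YYYY-MM" strings (which compare correctly as text); this only illustrates the effect described above and is not how Wikistats is implemented.

```python
# Hypothetical pages: (month created, month deleted or None if still present).
pages = [
    ("2013-05", None),
    ("2013-11", "2014-09"),
    ("2014-01", None),
    ("2014-02", "2014-12"),
    ("2014-06", None),
]


def count_as_seen_at(month: str, pages) -> int:
    """Count B: articles that existed at the end of `month`, counted at that time."""
    return sum(
        created <= month and (deleted is None or deleted > month)
        for created, deleted in pages
    )


def count_recomputed_later(month: str, pages, computed_in: str) -> int:
    """Count A: the figure for `month` rebuilt from pages surviving in `computed_in`."""
    survivors = [
        (created, deleted) for created, deleted in pages
        if deleted is None or deleted > computed_in
    ]
    return sum(created <= month for created, _ in survivors)


print(count_as_seen_at("2014-02", pages))                   # 4
print(count_recomputed_later("2014-02", pages, "2015-02"))  # 2
```

Two of the four articles that existed in Feb 2014 were deleted later in the year, so they vanish from the recomputed Feb 2014 figure.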

Erik Zachte puts it this way: "[F]or many years articles which were deleted were widely seen as just bad content, which shouldn't have been there in the first place. My own assessment of what Wikistats page counts meant evolved… into 'the total of number of articles which survived scrutiny/cleaning up', [then] into… 'the number that should have been presented earlier'… In short, Wikistats is not about total historical articles, but rather about total historical vetted articles. This wasn't by design, it just happened, but I came to see this (and still do) as one valid way of presenting article counts in a meaningful way."

Perhaps more disturbingly, Wikistats defines articles differently than the MediaWiki software does. Nominally, Wikistats defines an article as:

(6) a non-redirect in a content namespace, containing (after parsing) a [[wikilink]] to another page on the same wiki or a [[Category:]] link

(Hence pages with links of type 1 or 2 in the above list of link types are counted by this definition.)

Note that Wikistats does not distinguish between "link", "any", and "comma" methods of article counting: it treats every wiki as if it uses this "link-plus" (so to speak) counting method (but content-namespace settings are correctly respected).

The definition just given is called the "official" one at Wikistats itself; there is also an "alternate" definition intended to rule out stubs:

(7) a non-redirect in a content namespace, containing (before parsing) at least 200 characters of readable text (at least 50, for some languages) and (after parsing) a [[wikilink]] to another page on the same wiki or a [[Category:]] link

Here the phrase "readable text" means the text left over when characters that act as wiki markup (including template calls), HTML markup, and section headings are removed.
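A very rough Python approximation of the "readable text" idea is sketched below. It is not the Wikistats implementation (which is a set of Perl scripts): the regular expressions handle only simple, non-nested markup, and keeping link labels while counting spaces toward the length are assumptions made for the example.

```python
import re


def readable_text(wikitext: str) -> str:
    """Strip common markup and return what a reader would roughly see."""
    text = re.sub(r"<!--.*?-->", "", wikitext, flags=re.DOTALL)        # HTML comments
    text = re.sub(r"\{\{[^{}]*\}\}", "", text)                         # simple template calls
    text = re.sub(r"^=+.*?=+\s*$", "", text, flags=re.MULTILINE)       # section headings
    text = re.sub(r"<[^>]+>", "", text)                                # HTML tags
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)  # keep link labels
    text = re.sub(r"'{2,}", "", text)                                  # bold/italic quotes
    return text.strip()


def passes_length_test(wikitext: str, minimum: int = 200) -> bool:
    """Length part of definition (7); `minimum` would be 50 for some languages."""
    return len(readable_text(wikitext)) >= minimum


sample = "== History ==\n{{stub}}\n'''Babel''' is a [[tower|big tower]].<br/>"
print(readable_text(sample))       # Babel is a big tower.
print(passes_length_test(sample))  # False
```

The link requirement in definition (7) is checked after parsing and is not covered by this sketch.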

But the situation is actually "worse" than that, because the above definition doesn't apply uniformly to every wiki. Most Wikipedias, in particular, have their article counts determined based on so-called "stub dumps", which contain "all revisions of every article, meta data, but no page content." (See, for example, the notice at the top of stats:EN/TablesWikipediaES.htm, containing historical counts for the Spanish Wikipedia.) This makes the article counts for these wikis "a few percent higher than when a full archive dump had been processed… because no page content was available to check for internal or category links". (So in these cases, definition (5b), the "any" counting method, seems to be the one being used.)

As of the February 2015 "Tables" pages, only 5 Wikipedias (Arabic, Indonesian, Javanese, Swahili, and Swedish) have a complete set of monthly article counts purportedly all based on "full dumps", which contain "all revisions of every article, meta data and raw content" and thus allow definition (6) to be used. The remaining 234 listed Wikipedias have historical counts based only on stub dumps. (Note that 45 Wikipedias contained too few articles or garnered too few edits in February 2015 to be fully analyzed at Wikistats, and 4 Wikipedias — Gagauz, Maithili, Palatinate German, and Mingrelian — were not included for other, unspecified reasons.)

As for the other content projects, all Wiktionary, Wikiquote, Wikisource, Wikibooks, Wikinews, Wikiversity, and Wikivoyage wikis analyzed at Wikistats (in February 2015, anyway) list historical article counts reportedly based on full dumps.

Finally, as a practical matter, all historical article counts greater than 999 are reported at Wikistats in a rounded "human readable" format such as "625 k" or "4.8 M", so the exact counts cannot be determined for past months, anyway (although "Summary" pages such as stats:EN/SummaryEN.htm give precise article counts for the most recent month for which data was available).

To be fair, article counting is not the primary focus of Wikistats (as the information it provides can be used for so much more than that), so the above critique should not be seen as a criticism of Wikistats itself but merely of the idea that the article counts it gives are directly comparable to the on-wiki article counts. (In addition to the documentation, tables, graphs, animations, etc., at Wikistats itself, see mw:Analytics/Wikistats for more information about what Wikistats is really intended to be used for.)

What is to be done?

As things currently stand, the updateArticleCount.php script is set to run on the 21st of each month. This should keep the on-wiki article counts reasonably correct going forward. However, there are some remaining issues that probably require the attention of the wider Wikimedia community. In most cases, this will require working with the MediaWiki developers to determine what decisions may be realistic to implement.

  1. Do we need to reconsider what should count as an article?
    • Wikistats treats category links the same as page links when counting articles. Should MediaWiki's "link" criterion do this, too? Note that this would require checking two databases per page instead of one. Another argument against this idea is that categories are often used for page maintenance — for example, to collect pages that need deleting. But then, templates that perform this categorizing also tend to contain page links, and so pages in the main namespace marked for deletion typically already get counted as articles. (This complicates tracking article-count milestones and should probably be addressed somehow.)
    • Should page links only "count" when they are to pages in the main namespace? Is this a realistic goal?
    • Should interlanguage links be treated the same as regular page links? How would this work with Wikidata?
    • Should an intersection of two criteria be considered (e.g., a page must contain at least one link and be in at least one category)?
    • Should the "comma" article-counting criterion be abandoned? As previously mentioned, this criterion does not seem to be implemented at all in the updateArticleCount.php and initSiteStats.php scripts. (Note that comma checking used to be in Article.php but is no longer there; it is now in WikitextContent.php, but this doesn't appear to be used by either maintenance script.)
    • Is the "any" criterion implemented consistently (in particular, checking that the relevant pages are non-empty) across all parts of the MediaWiki code?
  2. Can Wikibooks be included in the periodic recounts?
    • The English and Portuguese Wikibooks are the only Wikimedia wikis that use the "comma" criterion. Should those wiki communities consider switching to the "any" criterion instead?
    • Can the maintenance scripts be fixed to correctly implement the "comma" criterion? (And are there any other problems with comma-based article counting in the MediaWiki code?)
    • Can the rest of the Wikibooks wikis be recounted without those two?
    • Can those two wikis be recounted correctly (at least once) some other way?
  3. Should other content projects, such as Wikispecies or Commons, be recounted?
    • Should other Wikimedia wikis have their articles recounted at least once, if not periodically? Should every Wikimedia wiki get recounted at least once?
    • Has anyone analyzed the additional server load caused by the periodic recounting that has already happened?
  4. Can article counting in MediaWiki ever be "fixed" to the point where script-based recounting is unnecessary to maintain correct on-wiki counts?
    • Are the "live" article counts just inherently unreliable? The article count of the English Wikisource, which had been recounted in May 2012, was already almost 40% off the correct value when it was recounted again 3 years later in March 2015, but this was likely due to known bugs that existed for much of the intervening time. In contrast, the second monthly recount in April 2015 left barely any trace: statistics collected 50 and 10 minutes before the recount, and 1 and 2 hours after it, showed nothing beyond normal editing. In other words, none of the 679 recounted wikis experienced "surprising" changes in their article counts (5 wikis showed a change in the number of articles slightly larger than the number of edits made to the wiki, which is possible, although unlikely, in the course of normal editing). This may indicate that recounts can be done less frequently than once a month. (Or it might indicate that the script hadn't finished recounting all the wikis after 2 hours! How long does it take, anyway?)
  5. Might other site statistics, such as total pages or images, need recounting?
    • Images have been recounted on some wikis in the past, in response to at least one identified bug in image counting. Should this be done, at least once, for all Wikimedia wikis?
    • Is it possible that full site statistics might need to be recounted for all wikis? Could doing so make some of the statistics less correct than they currently are? (In other words, are there any known bugs affecting the way initSiteStats.php counts the various site statistics other than articles?)
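
The alternative criteria raised under question 1 can be compared on a toy data set. The sketch below is only a simplified model, not MediaWiki code: the Page class, the criterion functions, and the sample pages are all hypothetical, and it assumes that redirects are never counted and that the "any" criterion requires non-empty text.

```python
from __future__ import annotations

from dataclasses import dataclass, field

MAIN_NS = 0  # number of the main (article) namespace in MediaWiki


@dataclass
class Page:
    """Hypothetical, simplified stand-in for a wiki page (not a MediaWiki class)."""
    text: str
    link_namespaces: list[int] = field(default_factory=list)  # namespace of each page-link target
    categories: list[str] = field(default_factory=list)
    is_redirect: bool = False


# Each criterion answers: does this page count as an article?
CRITERIA = {
    "link": lambda p: bool(p.link_namespaces),
    "link to main namespace": lambda p: MAIN_NS in p.link_namespaces,
    "link or category": lambda p: bool(p.link_namespaces) or bool(p.categories),
    "link and category": lambda p: bool(p.link_namespaces) and bool(p.categories),
    "comma": lambda p: "," in p.text,
    "any": lambda p: p.text.strip() != "",  # assumed: empty pages are excluded
}


def count_articles(pages: list[Page], criterion: str) -> int:
    """Count non-redirect pages that satisfy the named criterion."""
    test = CRITERIA[criterion]
    return sum(1 for p in pages if not p.is_redirect and test(p))


pages = [
    Page("A short stub with no links"),
    # Deletion template adds a project-namespace link (namespace 4) and a maintenance category.
    Page("Nonsense text tagged for deletion", link_namespaces=[4],
         categories=["Candidates for deletion"]),
    Page("Paris is the capital, and largest city, of France",
         link_namespaces=[0, 0], categories=["Capitals"]),
    Page(""),
    Page("Redirect to Paris", link_namespaces=[0], is_redirect=True),
]

for name in CRITERIA:
    print(f"{name:>22}: {count_articles(pages, name)}")
```

In this toy set, the page tagged for deletion is counted under the plain "link" criterion but not when links must point to the main namespace, which illustrates the milestone-tracking concern mentioned above.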

These issues can be discussed on the talk page.