
Instrumentation & data gathering to inform future performance & templating improvements
Open, High, Public

Description

We will collect data for three modes of incremental parsing: article edits with template parse reuse, template edits with main article parse reuse, and section edits. For each we will quantify the number of edits that would qualify for optimized reuse, the number of such edits rejected due to cache misses or wikitext structure, and the time spent on the portion of the parse which could be optimized away in a future parser evolution. The results will allow us to quantify the performance improvement (CPU time savings) expected for various evolutions of our parsing pipeline, helping to determine which form(s) of incremental parsing to pursue (article edits, template edits, or section edits). Alternatively, we may determine that cache size, cache lifetime, or wikitext structure limitations (eg, unbalanced templates or malformed sections) restrict the current opportunities for performance improvements of this form, allowing us to pivot appropriately.

Event Timeline

Change #1059404 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Provide previous parse results to parser when rendering

https://gerrit.wikimedia.org/r/1059404

ssastry triaged this task as High priority. (Aug 23 2024, 8:13 PM)

In initial exploration, it was discovered to be quite difficult to plumb through precise reasons for each update (eg, "update to section X of revision Y of page Z" or "update because of an update to template A included by template B included by page C").

In order to quickly collect statistics, it was decided to reverse-engineer the details of the edit from the before/after wikitext-and-HTML pairs. It was fairly easy to provide the wikitext and HTML for "some previous version" from the ParserCache. From that, we can (approximately) classify edits as follows:

  • If the before-and-after wikitext and DOM output are identical, this was a "useless" update (eg templates A and B were edited, but the reparse after template A already included the update to template B)
  • If the before-and-after wikitext and DOM are identical except for the region corresponding to a single section, this was a "section edit".
    • This may or may not have come from the actual section edit API in core and/or the Visual Editor "section edit" mode, so this is somewhat optimistic if selective update were to be tightly tied to those edit mechanisms
  • If the before-and-after wikitext is identical, and the DOM is identical except for the regions corresponding to certain templates, this is a "template update"
  • If the before-and-after wikitext differs, but the DOM for templated regions is identical, this is an "edit preserving templates" (bikeshed the name?)
    • False negatives here if templates are not idempotent but vary every time they are rendered
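The classification above can be sketched roughly as follows. This is a minimal illustration, not the code in the linked patches: the `Region` and `Snapshot` types, the idea that wikitext and DOM arrive already split by section and region, and the "other" fallback are all assumptions made for the example. The categories overlap (a template update confined to one section also looks like a section edit), so the sketch checks the wikitext-identical cases first; that ordering is a choice made here, not something specified in this task.

```python
from dataclasses import dataclass
from typing import Dict, Set


@dataclass(frozen=True)
class Region:
    section: str      # id of the section that contains this DOM region
    templated: bool   # True when the region is template output
    html: str


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical before/after record; not a real MediaWiki type."""
    wikitext: Dict[str, str]  # section id -> wikitext of that section
    dom: Dict[str, Region]    # region id -> rendered region


def changed_keys(before: dict, after: dict) -> Set[str]:
    """Keys whose value differs, including keys present on only one side."""
    return {k for k in before.keys() | after.keys() if before.get(k) != after.get(k)}


def classify(before: Snapshot, after: Snapshot) -> str:
    wt_changed = changed_keys(before.wikitext, after.wikitext)
    dom_changed = changed_keys(before.dom, after.dom)

    def region(key: str) -> Region:
        # Prefer the "after" rendering; fall back to "before" for removed regions.
        return after.dom[key] if key in after.dom else before.dom[key]

    if not wt_changed and not dom_changed:
        return "useless"
    if not wt_changed and all(region(k).templated for k in dom_changed):
        return "template update"
    touched = wt_changed | {region(k).section for k in dom_changed}
    if len(touched) == 1:
        return "section edit"
    if wt_changed and not any(region(k).templated for k in dom_changed):
        return "edit preserving templates"
    return "other"  # assumption: anything else needs a full reparse
```

A non-deterministic template shows up here as a changed templated region, which is how the false negatives for "edit preserving templates" arise.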

These should provide reasonable numbers for the feasibility (and time savings) of doing selective update in these cases. If it turns out that most of the possible performance improvement comes from a single one of these categories, we might redo the plumbing to focus on that particular update path. In the completely general case we'd have to do a database fetch for the "before" wikitext and a 'diff' operation on the before-and-after wikitext to determine which regions could be reused. But in some update paths, for example, the "before" wikitext might already be present in the client, so focusing on that update path may allow us to avoid doing additional fetches of the "before" wikitext from the DB. Alternatively, knowing that this was an "edit to section X" might allow us to avoid doing a diff operation to determine which section of the "after" wikitext was edited to differ from the "before" wikitext.
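To make the cost of the general case concrete, here is a rough sketch of the diff step: split both wikitext versions at heading lines and diff the resulting section lists. The heading regex is a deliberate simplification (real wikitext headings have more edge cases, eg headings inside templates or comments), and `split_sections` and `changed_sections` are hypothetical names, not functions in core or Parsoid.

```python
import difflib
import re
from typing import List

# Simplified heading matcher: a line that starts and ends with '='.
HEADING = re.compile(r"^=+.*=+[ \t]*$", re.MULTILINE)


def split_sections(wikitext: str) -> List[str]:
    """Split at heading lines; chunk 0 is whatever precedes the first heading."""
    starts = [0] + [m.start() for m in HEADING.finditer(wikitext) if m.start() > 0]
    ends = starts[1:] + [len(wikitext)]
    return [wikitext[s:e] for s, e in zip(starts, ends)]


def changed_sections(before: str, after: str) -> List[int]:
    """Indices into the 'after' section list that are new or modified.

    Sections deleted outright have no index in 'after' and are not reported.
    """
    old, new = split_sections(before), split_sections(after)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed: List[int] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed
```

If the update path already knows which section was edited, both the fetch of the "before" wikitext and this diff can be skipped.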

Change #1065296 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Add DataAccess::fetchTemplateTouched() for Parsoid dependency tracking

https://gerrit.wikimedia.org/r/1065296

Change #1065297 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] Record the last modification time (page_touched) for transclusions

https://gerrit.wikimedia.org/r/1065297

Hm. The page_touched property of pages doesn't quite do what we need it to. Consider a page [[Foo]] which contains:

{{1x|{{Bar}}}}

Now, Foo's backlinks contain Template:1x and Template:Bar, and if either of them is edited, [[Foo]]'s page_touched will be bumped as well.

But when looking at {{1x|{{Bar}}}} to determine whether we need to re-render it, we can't just look at the page_touched for {{Template:1x}}, since that won't be updated if {{Template:Bar}} is updated.

So we actually need to keep track of the ParserOutput::addTemplate() dependencies for the entire expansion of {{1x|{{Bar}}}}, which will include both {{Template:1x}} and {{Template:Bar}}, and re-render if *any* of them is edited.
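A small sketch of why the top-level check is not enough. It assumes the dependency set recorded for a transclusion is a set of template titles (as ParserOutput::addTemplate() would collect them) and that page_touched values can be compared as integers; `needs_rerender` and the timestamps are made up for illustration.

```python
from typing import Dict, FrozenSet


def needs_rerender(deps: FrozenSet[str], rendered_at: int, touched: Dict[str, int]) -> bool:
    """True if any dependency was touched after the cached render.

    A dependency with no known page_touched is treated as changed, which is
    a conservative assumption made for this sketch.
    """
    return any(
        title not in touched or touched[title] > rendered_at
        for title in deps
    )


# Hypothetical page_touched values: Template:Bar was edited after the render.
touched = {"Template:1x": 100, "Template:Bar": 250}
rendered_at = 200

# Looking only at the top-level template misses the edit to Template:Bar.
top_level_only = needs_rerender(frozenset({"Template:1x"}), rendered_at, touched)

# Looking at the full expansion of {{1x|{{Bar}}}} catches it.
full_expansion = needs_rerender(
    frozenset({"Template:1x", "Template:Bar"}), rendered_at, touched
)
```

The cost is one page_touched lookup per dependency, which is the recursive lookup the next paragraph describes as unfortunate.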

This is a little bit unfortunate, since the whole point of watching page_touched was to allow HTMLCacheUpdateJob to handle the recursion here so that we /wouldn't/ have to do a recursive lookup in order to determine if we need to re-render a template.

Two possible ways out:

  • Explicitly pass around the 'template that has been updated' along with the entire $deps array for each top-level template. We can then tell that (eg) "we're doing an update of Template:Bar" and "our expansion of {{1x|{{Bar}}}} depended on Template:Bar" without doing any additional database queries to determine the latest values of page_touched.
  • Restructure the backlinks array to group by top-level transclusion. Then we can still do the "hard part" in HTMLCacheUpdateJob to update the page_touched for a specific transclusion. That is, instead of recording simply that [[Foo]] depends on Template:1x and Template:Bar etc we can record that "[[Foo]] transclusion #1" depends on Template:1x and Template:Bar. Then when Template:Bar is edited we directly invalidate [[Foo]] transclusion #1 (by updating its page_touched) instead of invalidating the entire page. We still do a single lookup (of the page_touched for "[[Foo]] transclusion #1") to determine if a transclusion needs to be re-rendered. The actual size of the backlinks table may increase if a common template is used at multiple transclusion sites on the same page.
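The second option can be modelled in memory as below. This is only a toy: `TransclusionIndex`, its methods, and the integer timestamps are hypothetical, there is no real database or HTMLCacheUpdateJob here, and removal of stale dependencies on re-record is omitted.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Site = Tuple[str, int]  # (page title, transclusion number on that page)


class TransclusionIndex:
    """Toy model of backlinks grouped by top-level transclusion."""

    def __init__(self) -> None:
        self._backlinks: Dict[str, Set[Site]] = defaultdict(set)
        self._touched: Dict[Site, int] = {}

    def record(self, page: str, site: int, deps: Iterable[str], now: int) -> None:
        """Store the dependencies found while expanding one transclusion."""
        for template in deps:
            self._backlinks[template].add((page, site))
        self._touched[(page, site)] = now

    def template_edited(self, template: str, now: int) -> None:
        """The 'hard part': bump every transclusion site that depends on it."""
        for key in self._backlinks.get(template, ()):
            self._touched[key] = now

    def is_stale(self, page: str, site: int, rendered_at: int) -> bool:
        """Single lookup at render time; no recursion over dependencies."""
        return self._touched[(page, site)] > rendered_at

    def row_count(self) -> int:
        """Number of (template, site) rows, to show how the table grows."""
        return sum(len(sites) for sites in self._backlinks.values())


idx = TransclusionIndex()
idx.record("Foo", 1, {"Template:1x", "Template:Bar"}, now=100)
idx.record("Foo", 2, {"Template:1x"}, now=100)
idx.template_edited("Template:Bar", now=200)
```

After the edit to Template:Bar, only transclusion #1 of [[Foo]] is stale. Because Template:1x appears at two sites on the same page it contributes two rows, which is the table growth mentioned above.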

Change #1059404 merged by jenkins-bot:

[mediawiki/core@master] Provide previous parse results to parser when rendering

https://gerrit.wikimedia.org/r/1059404

Change #1072819 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Randomly sample statistics for Parsoid Selective Update

https://gerrit.wikimedia.org/r/1072819

Change #1072819 merged by jenkins-bot:

[mediawiki/core@master] Randomly sample statistics for Parsoid Selective Update

https://gerrit.wikimedia.org/r/1072819

Change #1073497 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/extensions/Linter@master] Collect selective update statistics from LintUpdate job

https://gerrit.wikimedia.org/r/1073497

Change #1073890 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] stats: collect timing information for parsercache_selective_* sample

https://gerrit.wikimedia.org/r/1073890

Change #1074484 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] WIP: classify parses to identify opportunities for selective update

https://gerrit.wikimedia.org/r/1074484

Change #1073890 merged by jenkins-bot:

[mediawiki/core@master] stats: collect timing information for parsercache_selective_* sample

https://gerrit.wikimedia.org/r/1073890

Change #1073497 merged by jenkins-bot:

[mediawiki/extensions/Linter@master] Collect selective update statistics from LintUpdate job

https://gerrit.wikimedia.org/r/1073497

Change #1077460 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[operations/mediawiki-config@master] Turn on Parsoid Selective Update metrics

https://gerrit.wikimedia.org/r/1077460

Change #1077460 merged by jenkins-bot:

[operations/mediawiki-config@master] Turn on Parsoid Selective Update metrics

https://gerrit.wikimedia.org/r/1077460

Mentioned in SAL (#wikimedia-operations) [2024-10-03T21:13:05Z] <brennen@deploy2002> Started scap sync-world: Backport for [[gerrit:1077460|Turn on Parsoid Selective Update metrics (T371713)]]

Mentioned in SAL (#wikimedia-operations) [2024-10-03T21:15:07Z] <brennen@deploy2002> cscott, brennen: Backport for [[gerrit:1077460|Turn on Parsoid Selective Update metrics (T371713)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-10-03T21:28:36Z] <brennen@deploy2002> Finished scap sync-world: Backport for [[gerrit:1077460|Turn on Parsoid Selective Update metrics (T371713)]] (duration: 15m 30s)

Change #1077811 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/core@master] Parsoid selective update metrics: add labels for wiki id and content model

https://gerrit.wikimedia.org/r/1077811

Change #1077811 merged by jenkins-bot:

[mediawiki/core@master] Parsoid selective update metrics: add labels for wiki id and content model

https://gerrit.wikimedia.org/r/1077811

Change #1079274 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[operations/mediawiki-config@master] Turn on Parsoid Selective Update metrics (take 2)

https://gerrit.wikimedia.org/r/1079274

Change #1079274 merged by jenkins-bot:

[operations/mediawiki-config@master] Turn on Parsoid Selective Update metrics (take 2)

https://gerrit.wikimedia.org/r/1079274

Mentioned in SAL (#wikimedia-operations) [2024-10-10T13:20:59Z] <lucaswerkmeister-wmde@deploy2002> Started scap sync-world: Backport for [[gerrit:1075635|Turn on mobile support for Parsoid Read Views (but not on talk pages) (T269499 T376048)]], [[gerrit:1079274|Turn on Parsoid Selective Update metrics (take 2) (T371713 T376433)]]

Mentioned in SAL (#wikimedia-operations) [2024-10-10T13:23:01Z] <lucaswerkmeister-wmde@deploy2002> lucaswerkmeister-wmde, cscott: Backport for [[gerrit:1075635|Turn on mobile support for Parsoid Read Views (but not on talk pages) (T269499 T376048)]], [[gerrit:1079274|Turn on Parsoid Selective Update metrics (take 2) (T371713 T376433)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)

Mentioned in SAL (#wikimedia-operations) [2024-10-10T13:37:09Z] <lucaswerkmeister-wmde@deploy2002> Finished scap sync-world: Backport for [[gerrit:1075635|Turn on mobile support for Parsoid Read Views (but not on talk pages) (T269499 T376048)]], [[gerrit:1079274|Turn on Parsoid Selective Update metrics (take 2) (T371713 T376433)]] (duration: 16m 09s)

Change #1074484 merged by jenkins-bot:

[mediawiki/services/parsoid@master] Classify parses to identify opportunities for selective update

https://gerrit.wikimedia.org/r/1074484

Change #1079939 had a related patch set uploaded (by Isabelle Hurbain-Palatin; author: Isabelle Hurbain-Palatin):

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a25

https://gerrit.wikimedia.org/r/1079939

Change #1079939 merged by jenkins-bot:

[mediawiki/vendor@master] Bump wikimedia/parsoid to 0.20.0-a25

https://gerrit.wikimedia.org/r/1079939

Change #1076300 had a related patch set uploaded (by C. Scott Ananian; author: C. Scott Ananian):

[mediawiki/services/parsoid@master] WIP: Track expansion time consumed by modified/unmodified templates

https://gerrit.wikimedia.org/r/1076300