Module talk:Plain text

This module was considered for deletion on 2018 May 5. The result of the discussion was "no consensus".

strip_apostrophe_markup

@Galobtter: The function string.gsub() is quite forgiving, so you don't need to test for each case. Also ' doesn't need to be escaped when used as a search pattern. You can't sensibly export the strip_apostrophe_markup function, so it should be local, or could just go inline. You can simplify strip_apostrophe_markup to

local function strip_apostrophe_markup(txt)
	txt = txt:gsub("'''''", ""):gsub("''''", ""):gsub("'''", ""):gsub("''", "")
	return txt
end

In the main function, text should be a local variable:

local text = frame.args[1]

I don't like altering code while others are developing it, so I'll leave you to update it as you see fit. --RexxS (talk) 19:56, 14 April 2018 (UTC)
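A quick way to check the simplified function above is to run it on a string that mixes italic, bold and a plain apostrophe. This is only a sketch; the sample string is made up for illustration. Note that gsub returns two values (the new string and the match count), which is why the result is assigned to txt before being returned.

```lua
local function strip_apostrophe_markup(txt)
	-- Longest runs first, so that ''''' is not consumed as '' followed by '''.
	txt = txt:gsub("'''''", ""):gsub("''''", ""):gsub("'''", ""):gsub("''", "")
	return txt -- the assignment above has already dropped gsub's match count
end

-- Single apostrophes (as in "it's") are left alone.
print(strip_apostrophe_markup("''HMS Victory'' and '''bold''' and it's"))
-- HMS Victory and bold and it's
```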

RexxS: on the second point, yeah, I forgot to localize. Regarding strip_apostrophe_markup(txt), I was also wondering why there were so many ifs etc., but I was too lazy to look over it (as you can see, I just copied it from Module:Citation/CS1/COinS). I wonder if the same change should be made on Module:Citation/CS1/COinS; pinging Trappist the monk on that. Galobtter (pingó mió) 20:05, 14 April 2018 (UTC)
- It's best to use the ustring library (as you have done), mainly because the module is likely to be reused in other languages, so your new code ends up not quite as simple, but is still fine. Nice work! --RexxS (talk) 20:36, 14 April 2018 (UTC)
  - Thanks! Yeah, it is good to allow easy reuse, plus we ourselves occasionally use Unicode characters for places, I believe. Galobtter (pingó mió) 20:43, 14 April 2018 (UTC)
    - We do sometimes use Unicode characters for places, but interestingly string.gsub copes perfectly well with all of the Latin diacriticals and Greek or Cyrillic script that I've tried: Module talk:RexxS #Test stripApost. There will almost certainly be some characters that trip it up, but they won't be common. --RexxS (talk) 21:16, 14 April 2018 (UTC)
      - That is rather interesting. I think that as long as nothing is being done to the Unicode characters themselves, it may be OK. Galobtter (pingó mió) 06:06, 15 April 2018 (UTC)
        - I suspect that it's a case of not using a function that makes use of absolute positioning within the string, because the byte count that the string library uses much of the time is obviously going to be incorrect with Unicode characters. We probably just struck lucky with gsub. --RexxS (talk) 20:55, 15 April 2018 (UTC)

I replaced the mw.ustring.gsub with plain gsub because ustring is a lot slower than gsub and is not needed in this module. The optimization is not necessary, but since people are looking at the code I thought it worth mentioning that wikitext will always use UTF-8, and that means Lua gsub with the patterns in this module will work well. Lua gsub works in any language with a pattern like '[12]' ('1' or '2'), but mw.ustring.gsub would be needed for a pattern like '[১২]' (that might be used at the Bengali Wikipedia to search for their equivalent). In the first case (Lua gsub), the pattern finds the first location matching any of the bytes between [ and ]. In the Bengali case, each digit is three bytes in UTF-8, so there are six bytes between the square brackets. If Lua gsub were used, it would match any one of those bytes individually rather than a whole digit. Johnuniq (talk) 09:47, 18 April 2018 (UTC)
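The byte-versus-character point above can be seen in plain Lua without mw.ustring. The following is a sketch only: the digits are written as decimal byte escapes so the byte values are visible, and the variable names are made up for illustration.

```lua
-- Bengali digits as explicit UTF-8 bytes (three bytes each).
local one   = "\224\167\167" -- U+09E7
local two   = "\224\167\168" -- U+09E8
local three = "\224\167\169" -- U+09E9

-- A literal ASCII pattern is safe: the multi-byte characters pass through untouched.
local stripped = ("''" .. one .. two .. "''"):gsub("''", "")
print(stripped == one .. two) -- true

-- A byte-wise set is not safe: "[" .. one .. two .. "]" is really the byte set
-- {224, 167, 168}, so two of the three bytes of U+09E9 match and are removed.
local damaged, count = three:gsub("[" .. one .. two .. "]", "")
print(count, #damaged) -- prints 2 and 1: one stray byte is left behind
```

The leftover single byte is not valid UTF-8 on its own, which is the kind of damage mw.ustring.gsub avoids by treating each digit as one character.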

Could remove indentations

Can be combined with leading spaces: gsub("^[:;%s]+", "") -- 𝐆𝐮𝐚𝐫𝐚𝐩𝐢𝐫𝐚𝐧𝐠𝐚 (talk) 20:31, 24 May 2021 (UTC)
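For illustration, the suggested pattern behaves as follows in plain Lua. The wrapper name strip_leading_indent is hypothetical and not part of the module.

```lua
local function strip_leading_indent(text)
	-- ^ anchors at the start of the string; the set covers ":" and ";"
	-- indentation markup plus any whitespace.
	return (text:gsub("^[:;%s]+", ""))
end

print(strip_leading_indent(":: ; indented reply")) -- indented reply
print(strip_leading_indent("time: 10:30"))         -- time: 10:30
```

Because of the anchor, only the start of the whole string is affected; colons later in the text, or at the start of later lines, are left alone.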

Performance improvements (and other) in the sandbox

I made a few performance (and other) improvements to this module in the sandbox based on the work with Module:User scripts table (for which I started using Module:Plain text, and ended up forking and customising it for the needs there). The two performance improvements are:

Use greedy [^x]+x instead of ungreedy .-x whenever possible; and
Use a single gsub for all File:, Category:, Media:, etc, instead of a gsub for each.

-- 𝐆𝐮𝐚𝐫𝐚𝐩𝐢𝐫𝐚𝐧𝐠𝐚 ☎ 13:48, 21 June 2021 (UTC)
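As a rough sketch of how the two ideas could combine (this is not the sandbox code; the namespace list, the function name and the pattern are assumptions made for illustration), a single gsub can capture the namespace and let a callback decide whether to drop the link:

```lua
-- Hypothetical namespace list; the sandbox may use a different set.
local STRIP_NS = { file = true, image = true, category = true, media = true }

local function strip_ns_links(text)
	-- [^%]]* consumes up to the first "]" in one run, whereas a lazy ".-"
	-- re-tests the rest of the pattern after every character.
	return (text:gsub("%[%[%s*(%a+)%s*:[^%]]*%]%]", function(ns)
		if STRIP_NS[ns:lower()] then
			return ""
		end
		-- Returning nothing keeps the original match (e.g. [[Help:Foo]]).
	end))
end

print(strip_ns_links("A [[File:X.png|thumb]] B [[Help:Foo]] C [[Category:Y]]"))
```

One limitation of this sketch: a file link whose caption itself contains a wikilink ends at the inner "]]", so such cases would need separate handling.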

nowiki text removed?

The documentation has this in its example: <nowiki>?</nowiki> (a question mark in a nowiki tag).

The module removes this wikitext altogether, including the question mark. Why is this "other stuff" to be removed? -DePiep (talk) 13:41, 2 September 2021 (UTC)

Tag stripping

Currently, this module strips both the tags and their contents for all HTML-style tags, except for , , , , and  (and the last three only because I just added cases for them). However, there are a variety of other tags which are valid in wikitext, and whose contents arguably should be kept after discarding the tags themselves, e.g. <h2>, <dfn>, , . These could continue to be added here individually, but I think it's probably simpler to reverse the module's behavior: discard the contents of tags only for a curated list, and otherwise keep the contents.

The main issue I can see with that would be for <sup> and <sub>, where just stripping them often results in confusing text, e.g. stripping "2³²" would produce "232", or "vₑ" producing "ve"; in these cases it might be better to replace the tags with "^"/"_" (resulting, for the aforementioned examples, in output of "2^32" and "v_e") or other appropriate characters (though the suggested characters, I believe, are the ones most often used for indicating super/subscript when formatting options are limited). 「ディノ奴千？！」☎ Dinoguy1000 04:13, 6 October 2021 (UTC)
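To make the proposal above concrete, here is a minimal sketch in plain Lua. It is not the module's code: the DISCARD list, the PREFIX table and the function name are all assumptions chosen for illustration, and which tags belong in the curated list is exactly what would need discussion.

```lua
-- Hypothetical curated list of tags whose contents are dropped.
local DISCARD = { ref = true, gallery = true, math = true }
-- Hypothetical markers for formatting that plain text cannot show.
local PREFIX = { sup = "^", sub = "_" }

local function strip_tags(text)
	-- Self-closing tags such as <br /> have no contents to keep.
	text = text:gsub("<[^<>]-/%s*>", "")
	local n
	repeat
		-- %1 inside the pattern requires the closing name to equal the captured
		-- opening name; %f[^%w] stops "<s" from matching the start of "<sub>".
		text, n = text:gsub("<(%a%w*)%f[^%w][^>]*>(.-)</%1>", function(tag, inner)
			tag = tag:lower()
			if DISCARD[tag] then
				return ""
			end
			return (PREFIX[tag] or "") .. inner
		end)
	until n == 0 -- repeat so that tags nested inside kept contents are handled too
	-- Remove any unpaired tags that are left over.
	return (text:gsub("<[^>]+>", ""))
end

print(strip_tags("2<sup>32</sup>"))                   -- 2^32
print(strip_tags("v<sub>e</sub>"))                    -- v_e
print(strip_tags("a <dfn>term</dfn><ref>cite</ref>")) -- a term
```

In this sketch the opening and closing tag names must match in case, so <SUP>...</sup> would not be paired and would fall through to the final clean-up without a marker.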

Reasonable, especially when whitelist/blacklist are argued well & systematically -- and so more stable. Module history shows that it was never approached this systematically.

Now, the documentation has this peculiar sentence: "other stuff that needs removing from short descriptions". It looks like the module was purpose-built for WP:SHORTDESC and WP:SDFORMAT, then. But it is actually unused in {{short description}}; ask WP:WPSHORTDESC? And what effect would this have on the existing 1M+ usages (that's the module; {{Plain text}} has 35k)? For this reason, could the proposed extended removals be put in a separate function? -DePiep (talk) 06:06, 6 October 2021 (UTC)

I saw that the original intention was for SHORTDESCs before editing the module and starting this discussion, though I didn't actually check to see if the module is currently used for that; it's mildly amusing to me that it isn't.

To be honest, I have no idea how this change might affect current uses. If performance isn't too much of a concern, we could simply add tracking for cases where nonspecific tags are being stripped and give things a while to filter in before looking through them and seeing if anything interesting appears. That being said, I'd expect the vast majority of uses to be via other templates or modules; some quick searches show it's only being directly used in ~3 dozen templates and modules, which shouldn't be too hard to look through by hand (though TBH I don't know what I'd be looking for).

So what tags should be fully discarded? There's the obvious   (which is already stripped, though not very robustly), and the currently-not-stripped <hr />, , and most (almost all?) of the parser/extension tags (and maybe , though that doesn't get displayed anyways); <table> and <div> are potential candidates, though I could definitely see arguments for keeping their contents at least sometimes (so maybe make them optional somehow?).

Conversely, looking through WP:HTML reminds me of <abbr>, which would probably also need some specific consideration akin to /. The contents of <bdi>/<bdo> might be able to just be presented as they are in the original string, but I'm not an expert in this area, so at the least it would probably require a bit of discussion.

I'm getting away from this discussion at this point, but after thinking about it earlier I concluded that probably the best method for stripping templates while optionally keeping some of their contents would be for templates/modules to have some sort of "plain output" mode, that would be "safe" for applications like this or WP:NAVPOP, which currently just strips most templates entirely. Though obviously this would require quite a bit of work, and some planning/consideration on the implementation (which I don't have many thoughts on myself, but maybe one idea would be adding some feature to TemplateData to indicate "safe" parameters to output directly, assuming TemplateData doesn't already have such a feature). 「ディノ奴千？！」☎ Dinoguy1000 08:39, 6 October 2021 (UTC)

I've started {{Navbox wikitext-handling templates}}, to see what is related. -DePiep (talk) 20:06, 6 October 2021 (UTC)

Link stripping

Since I've been thinking about this module today anyways, I realized that the link stripping here duplicates the stripping done by Module:Delink, albeit probably less robustly and (I think?) not catching as many cases. Are there any major reasons (other than performance maybe) not to just use Module:Delink for that functionality in this module? 「ディノ奴千？！」☎ Dinoguy1000 08:42, 6 October 2021 (UTC)

We need a more general approach to any "non-ascii"-stripping. Think HTML, html-tags, wm-extension tags, wikicode like [[{{!}}]], parser-strips, etc. -DePiep (talk) 20:27, 6 October 2021 (UTC)

Keeping contents of <sup>/<sub>

(starting a new topic because #Tag stripping is old and only touches on these tags as a side-comment)

At WP:VPT#Wikilinks from italic titles, I came across this template as a solution for one editor's issue (italics in ship-names), but the lack of preservation for superscript/subscript text breaks another related use (chemical names). The general goal is to convert a properly formatted visual text into a bluelink. But as simple examples that currently don't work correctly for that use, we would want H2O (H₂O) to become "H2O" not "HO" and 3H (³H) to become "3H" not "H". DMacks (talk) 16:04, 26 June 2022 (UTC)

This happens because of line 21. Could be fixed by adding these lines ahead of line 21:

		:gsub('<sub.->(.-)</sub>', '%1') --remove subscript markup; retain contents
		:gsub('<sup.->(.-)</sup>', '%1') --remove superscript markup; retain contents

caveat lector: not tested

--Trappist the monk (talk) 16:25, 26 June 2022 (UTC)
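The two lines above are a fragment of a longer method chain. A standalone check in plain Lua, using the chemical examples from this thread, might look like the following; the wrapper name keep_sub_sup_contents is hypothetical.

```lua
local function keep_sub_sup_contents(text)
	text = text
		:gsub('<sub.->(.-)</sub>', '%1') -- remove subscript markup; retain contents
		:gsub('<sup.->(.-)</sup>', '%1') -- remove superscript markup; retain contents
	return text
end

print(keep_sub_sup_contents("H<sub>2</sub>O")) -- H2O
print(keep_sub_sup_contents("<sup>3</sup>H"))  -- 3H
```

For these simple inputs the patterns behave as intended; inputs with attributes containing ">" or with mismatched tags are not covered by this check.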

I came here to ask for this feature, once again. Given existing behaviour & usage, I suggest this should be an opt-in by parameter (|"keep-tag-content"=T [F-by-default]). (alternative: by fork)

Other tag content to be kept? , , <hn>, <dfn>, ? <nowiki> content can not be kept because of unknown code injection.

From #Tag stripping: replace , .. with ⟨^⟩ or generic whitespace, etcetera? As option? -DePiep (talk) 07:53, 21 February 2023 (UTC)

I have adjusted the sandbox and added underline, subscript, and superscript test cases to the testcases page. Is there further testing that needs to be done, or should I roll this out to a million pages? – Jonesey95 (talk) 15:41, 23 July 2023 (UTC)

Looks good! Johnuniq (talk) 23:27, 23 July 2023 (UTC)