Tutorial: Python regex for finding contents of MediaWiki markup links



Question:

If I have some XML containing MediaWiki markup like the following:

" ...collected in the 12th century, of which [[Alexander the Great]] was the hero, and in which he was represented, somewhat like the British [[King Arthur|Arthur]]"

what would be the appropriate arguments to something like:

re.findall([[__?__]], article_entry)

I am stumbling a bit on escaping the double square brackets, and getting the proper link for text like: [[Alexander of Paris|poet named Alexander]]


Solution:1

Here is an example:

    import re

    pattern = re.compile(r"\[\[([\w \|]+)\]\]")
    text = "blah blah [[Alexander of Paris|poet named Alexander]] bldfkas"
    results = pattern.findall(text)

    # Keep only the link target, i.e. the part before the pipe
    output = []
    for link in results:
        output.append(link.split("|")[0])

    print(output)
    # outputs ['Alexander of Paris']

Version 2 puts more into the regex, which changes the shape of the output (findall returns one tuple per link):

    import re

    pattern = re.compile(r"\[\[([\w ]+)(\|[\w ]+)?\]\]")
    text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
    results = pattern.findall(text)
    # results is [('a', '|b'), ('c', '|d'), ('efg', '')]

    print([link[0] for link in results])
    # outputs ['a', 'c', 'efg']

Version 3, if you only want the link target without the display text. The optional part is a non-capturing group, so findall returns plain strings:

    import re

    pattern = re.compile(r"\[\[([\w ]+)(?:\|[\w ]+)?\]\]")
    text = "[[a|b]] fdkjf [[c|d]] fjdsj [[efg]]"
    results = pattern.findall(text)
    # results is ['a', 'c', 'efg']
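The patterns above use [\w ]+, so they skip link targets that contain punctuation, such as [[Stack Overflow (website)]]. Below is a minimal sketch of one way around that, using negated character classes instead. The names LINK_RE and extract_links are made up for this illustration, and it assumes simple, non-nested links (it makes no attempt to handle templates or links nested inside other links).

```python
import re

# Hypothetical pattern for this sketch: the target is anything up to the
# first pipe or bracket; the optional label is anything up to the closing ]].
LINK_RE = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]*))?\]\]")


def extract_links(wikitext):
    """Return (target, label) pairs; the label falls back to the target."""
    links = []
    for target, label in LINK_RE.findall(wikitext):
        target = target.strip()
        # findall gives '' for the label when the link has no pipe
        links.append((target, label.strip() or target))
    return links


sample = ("of which [[Alexander the Great]] was the hero, somewhat like "
          "[[King Arthur|Arthur]]; see also [[Stack Overflow (website)]]")
for target, label in extract_links(sample):
    print(target, "->", label)
```

Returning both parts leaves it to the caller to decide whether the link target or the display text is wanted.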


Solution:2

RegExp: \w+( \w+)+(?=]])

input

[[Alexander of Paris|poet named Alexander]]

output

poet named Alexander

input

[[Alexander of Paris]]

output

Alexander of Paris
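To try this pattern from Python, here is a small sketch. The helper name display_texts is hypothetical. Two things to be aware of: the group in the pattern is capturing, so re.findall would return only the last " word" rather than the whole match, which is why the sketch uses finditer with group(0); and the pattern requires at least two words, so a single-word link such as [[efg]] is not matched.

```python
import re

# Solution 2's pattern, unchanged.
pattern = re.compile(r"\w+( \w+)+(?=]])")


def display_texts(text):
    # group(0) is the whole match, regardless of the capturing group
    return [m.group(0) for m in pattern.finditer(text)]


print(display_texts("[[Alexander of Paris|poet named Alexander]]"))
# ['poet named Alexander']
print(display_texts("[[Alexander of Paris]]"))
# ['Alexander of Paris']
print(display_texts("[[efg]]"))
# []
```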


Solution:3

    import re

    # Capture the target, then stop at either a pipe or the closing brackets
    pattern = re.compile(r"\[\[([\w ]+)(?:\||\]\])")
    text = "of which [[Alexander the Great]] was somewhat like [[King Arthur|Arthur]]"
    results = pattern.findall(text)
    print(results)

This gives the output:

    ['Alexander the Great', 'King Arthur']


Solution:4

If you are trying to get all the links from a page, it is much easier to use the MediaWiki API where that is possible, for example: http://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Stack_Overflow_(website)

Note that both these methods miss links embedded in templates.
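As a rough sketch of the API route, the query above can be built and its JSON reply read with the standard library. The function names and the User-Agent string are placeholders invented for this example; the sketch assumes the default JSON layout (query, then pages, then a links list per page) and reads only the first batch of results, without following continuation.

```python
import json
import urllib.parse
import urllib.request

# Same endpoint as the URL above, over https
API_URL = "https://en.wikipedia.org/w/api.php"


def build_links_url(title):
    """Build a prop=links query for one page title."""
    params = {
        "action": "query",
        "prop": "links",
        "titles": title,
        "pllimit": "max",
        "format": "json",
    }
    return API_URL + "?" + urllib.parse.urlencode(params)


def titles_from_response(data):
    """Collect link titles from a decoded JSON reply (assumed layout)."""
    titles = []
    for page in data.get("query", {}).get("pages", {}).values():
        for link in page.get("links", []):
            titles.append(link["title"])
    return titles


def fetch_links(title):
    """Fetch the first batch of links for a page (needs network access)."""
    request = urllib.request.Request(
        build_links_url(title),
        # Placeholder User-Agent; replace with one identifying your script
        headers={"User-Agent": "link-example/0.1 (placeholder)"},
    )
    with urllib.request.urlopen(request) as response:
        return titles_from_response(json.load(response))
```

Calling fetch_links("Stack_Overflow_(website)") would then return a list of page titles rather than raw markup.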

