Tutorial: Regular expression from font to span (size and colour) and back (VB.NET)



Question:

I am looking for a regular expression that can convert my font tags (only with size and colour attributes) into span tags with the relevant inline css. This will be done in VB.NET if that helps at all.

I also need a regular expression to go the other way as well.

To elaborate, below is an example of the conversion I am looking for:

<font size="10">some text</font>  

To then become:

<span style="font-size:10px;">some text</span>  

So converting the tag and putting a "px" at the end of whatever the font size is (I don't need to change/convert the font size, just stick px at the end).

The regular expression needs to cope with a font tag that only has a size attribute, only a color attribute, or both:

<font size="10">some text</font>  
<font color="#000000">some text</font>  
<font size="10" color="#000000">some text</font>  
<font color="#000000" size="10">some text</font>  

I also need another regular expression to do the opposite conversion. So for example:

<span style="font-size:10px;">some text</span>  

Will become:

<font size="10">some text</font>  

As before, this converts the tag, but this time it removes the "px"; I don't need to worry about changing the font size.

Again, this will also need to cope with size styling, colour styling, and a combination of both:

<span style="font-size:10px;">some text</span>  
<span style="color:#000000;">some text</span>  
<span style="font-size:10px; color:#000000;">some text</span>  
<span style="color:#000000; font-size:10px;">some text</span>  

I am extracting basic HTML & text from CDATA tags in an XML file and then displaying them on a web-page. The text also appears in a rich-text editor so it can be edited/translated, and then saved back into a new XML file. The XML is then going to be read by a flash file, hence the need to use old-fashioned HTML.

The reason I want to convert this code is mainly for display purposes. In order to show the text sizes correctly and for it to work with my rich text editor they need to be converted to XHTML/inline CSS. The rich text editor will also generate XHTML/inline CSS that I need to convert 'back' to standard HTML before it is saved in the XML file.

I don't know a lot about XSLT transformation but I'm not sure that is what I need for this, or it might be more than I need right now, but please correct me if I'm wrong (and point me in the direction of any helpful links you may have on it).

I know the temptation will be to tell me a number of different ways to set up my code to do what I want but there are so many other permutations I haven't even mentioned which have forced me down this route, so literally all I want to do is convert a string containing standard HTML to XHTML/inline CSS, and then the same but the other way round.


Solution:1

Since some people have already given you warnings I'll skip ahead to the regex solution.

First off, I'll lay out a couple of assumptions that aren't set in stone but allow the problem to be approached as you presented it without me doing extra work:

  1. You can use LINQ (otherwise this will need to be updated)
  2. Font/Span tags will be in lowercase (font and span not FONT or SpAn)
  3. Each style attribute value will be properly formatted, ending with a semi-colon ; similar to your samples

Case-insensitivity can be worked in rather simply via RegexOptions.IgnoreCase, although the dictionary keys will then need to be stored in lowercase (ToLower) so that lookups stay consistent when the values are later accessed. The third assumption ensures that splitting the style text doesn't go haywire.
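A minimal sketch of that case-insensitive variant, assuming the same two patterns used in this answer. The module name and the mixed-case input are made up for illustration:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text.RegularExpressions

Module IgnoreCaseSketch ' hypothetical name
    Sub Main()
        ' IgnoreCase lets the tag pattern match FONT, Font, SPAN, and so on.
        Dim tagRx As New Regex("<(?<Tag>font|span)\b(?<Attributes>[^>]+)>(?<Content>.+?)</\k<Tag>>", _
                               RegexOptions.IgnoreCase)
        Dim attributeRx As New Regex("(?<Key>\b[a-zA-Z]+)=""(?<Value>.+?)""")

        Dim m As Match = tagRx.Match("<FONT Size=""10"" COLOR=""#000000"">some text</FONT>")
        If Not m.Success Then Return

        ' Normalise the tag name and the attribute names to lowercase
        ' so that later comparisons and dictionary lookups are consistent.
        Dim tag As String = m.Groups("Tag").Value.ToLowerInvariant()
        Dim attributes As Dictionary(Of String, String) = _
            attributeRx.Matches(m.Groups("Attributes").Value).Cast(Of Match)() _
                       .ToDictionary(Function(a) a.Groups("Key").Value.ToLowerInvariant(), _
                                     Function(a) a.Groups("Value").Value)

        Console.WriteLine(tag)                 ' font
        Console.WriteLine(attributes("size"))  ' 10
        Console.WriteLine(attributes("color")) ' #000000
    End Sub
End Module
```

Another option would be to pass StringComparer.OrdinalIgnoreCase to ToDictionary instead of lowercasing the keys; either way, the comparisons against "font", "size", and "style" need to agree with whatever casing is stored.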

Below is a sample program that demonstrates the replacements.

Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module Program
    Sub Main()
        Dim inputs As String() = { _
            "<font size=""10"">some text</font>", _
            "<font color=""#000000"">some text</font>", _
            "<font size=""10"" color=""#000000"">some text</font>", _
            "<font color=""#000000"" size=""10"">some text</font>", _
            "<font size=""10"">some text</font> other text <font color=""#000000"">some text</font>", _
            "<span style=""font-size:10px;"">some text</span>", _
            "<span style=""color:#000000;"">some text</span>", _
            "<span style=""font-size:10px; color:#000000;"">some text</span>", _
            "<span style=""color:#000000; font-size:10px;"">some text</span>", _
            "<span style=""color:#000000; font-size:10px;"">some text</span> other <font color=""#000000"" size=""10"">some text</font>" _
        }

        ' Matches a complete font or span element: tag name, attributes, content.
        Dim pattern As String = "<(?<Tag>font|span)\b(?<Attributes>[^>]+)>(?<Content>.+?)</\k<Tag>>"
        Dim rx As New Regex(pattern)

        For Each input As String In inputs
            ' TransformTags is called once for every matched element.
            Dim result As String = rx.Replace(input, AddressOf TransformTags)
            Console.WriteLine("Before: " & input)
            Console.WriteLine("After: " & result)
            Console.WriteLine()
        Next
    End Sub

    Public Function TransformTags(ByVal m As Match) As String
        ' Split the attribute text of the matched tag into name/value pairs.
        Dim rx As New Regex("(?<Key>\b[a-zA-Z]+)=""(?<Value>.+?)""")
        Dim attributes = rx.Matches(m.Groups("Attributes").Value).Cast(Of Match)() _
                           .ToDictionary(Function(attribute) attribute.Groups("Key").Value, _
                                         Function(attribute) attribute.Groups("Value").Value)

        If m.Groups("Tag").Value = "font" Then
            ' font -> span: size="N" becomes font-size:Npx; other attributes become name:value.
            Dim newAttributes = String.Join("; ", attributes.Select(Function(item) _
                                                    If(item.Key = "size", "font-size", item.Key) _
                                                    & ":" _
                                                    & If(item.Key = "size", item.Value & "px", item.Value)) _
                                            .ToArray()) _
                                & ";"
            Return "<span style=""" & newAttributes & """>" & m.Groups("Content").Value & "</span>"
        Else
            ' span -> font: split the style value on ";" and turn each name:value into name="value".
            Dim newAttributes = String.Join(" ", attributes("style") _
                                                 .Split(New Char() {";"c}, StringSplitOptions.RemoveEmptyEntries) _
                                                 .Select(Function(s) _
                                                    s.Trim().Replace("px", "").Replace("font-", "").Replace(":", "=""") _
                                                    & """") _
                                            .ToArray())
            Return "<font " & newAttributes & ">" & m.Groups("Content").Value & "</font>"
        End If
    End Function
End Module

If you have any questions let me know. Some enhancements can be made if a large amount of text is expected to be processed. For example, the regex object in the TransformTags method can be moved to the class level so it isn't recreated on every transformation.
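As a rough sketch of that enhancement, the attribute regex can live in a module-level field so that it is constructed once. The module name and the GetAttributeValue helper are hypothetical and only serve to show the shared field in use:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SharedRegexSketch ' hypothetical name
    ' Built once when the module is first used, then reused by every call.
    Private ReadOnly AttributeRx As New Regex("(?<Key>\b[a-zA-Z]+)=""(?<Value>.+?)""")

    ' Hypothetical helper: returns the value of one attribute, or Nothing if it is absent.
    Function GetAttributeValue(ByVal attributeText As String, ByVal name As String) As String
        For Each m As Match In AttributeRx.Matches(attributeText)
            If m.Groups("Key").Value = name Then Return m.Groups("Value").Value
        Next
        Return Nothing
    End Function

    Sub Main()
        Console.WriteLine(GetAttributeValue(" size=""10"" color=""#000000""", "color")) ' #000000
    End Sub
End Module
```

Passing RegexOptions.Compiled to the constructor is a further option for heavy workloads, at the cost of a slower first use.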

EDIT: Here's the explanation of the first pattern: <(?<Tag>font|span)\b(?<Attributes>[^>]+)>(?<Content>.+?)</\k<Tag>>

  • <(?<Tag>font|span)\b - matches the opening < followed by font or span, captured in a named group, Tag. The \b matches a word boundary to ensure that nothing beyond the specified tag names is matched.
  • (?<Attributes>[^>]+)> - named group, Attributes, matches everything else in the tag as long as it is not a > symbol, then it matches the closing >
  • (?<Content>.+?) - named group, Content, matches everything between the opening and closing tags (non-greedily, so it stops at the first matching closing tag)
  • </\k<Tag>> - matches the closing tag by back-referencing the Tag group
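To see what each named group captures, here is a small sketch that runs the first pattern against one of the sample inputs. The module name is made up for illustration:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module FirstPatternSketch ' hypothetical name
    Sub Main()
        Dim rx As New Regex("<(?<Tag>font|span)\b(?<Attributes>[^>]+)>(?<Content>.+?)</\k<Tag>>")
        Dim m As Match = rx.Match("<font size=""10"" color=""#000000"">some text</font>")

        If m.Success Then
            Console.WriteLine(m.Groups("Tag").Value)        ' font
            ' The Attributes group keeps the leading space after the tag name.
            Console.WriteLine(m.Groups("Attributes").Value) '  size="10" color="#000000"
            Console.WriteLine(m.Groups("Content").Value)    ' some text
        End If
    End Sub
End Module
```

Two limits are worth keeping in mind: by default . does not match a newline, so content that spans several lines needs RegexOptions.Singleline, and a font nested inside another font will end the match at the first closing tag.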

The second pattern is used to match key-value pairs for the attributes: (?<Key>\b[a-zA-Z]+)=""(?<Value>.+?)""

  • (?<Key>\b[a-zA-Z]+) - named group, Key, matches any word (letters only) starting at a word boundary
  • ="" - matches the equals sign and the opening quotation mark (each "" is a single quotation mark, doubled because it is escaped inside a VB.NET string literal)
  • (?<Value>.+?) - named group, Value, matches anything up to the closing quotation mark. It is non-greedy because of the ? symbol after the + symbol. It could have been [^""]+, similar to how the Attributes group was handled in the first pattern.
  • "" - matches the closing quotation mark
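A short sketch of the second pattern on its own, using the [^""]+ variant mentioned above in place of .+? for the Value group. The module name and input string are illustrative:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SecondPatternSketch ' hypothetical name
    Sub Main()
        ' [^""]+ matches one or more characters that are not a quotation mark.
        Dim rx As New Regex("(?<Key>\b[a-zA-Z]+)=""(?<Value>[^""]+)""")

        For Each m As Match In rx.Matches(" size=""10"" color=""#000000""")
            Console.WriteLine(m.Groups("Key").Value & " -> " & m.Groups("Value").Value)
        Next
        ' Prints:
        ' size -> 10
        ' color -> #000000
    End Sub
End Module
```

Note that the Key group only accepts letters, so a hyphenated attribute name would not be captured in full by this pattern.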


Solution:2

I don't think regular expressions are the way to go for this problem.

Stick to XML-based technologies, such as XSLT, to do the transformation.


Solution:3

You shouldn't try to parse HTML with regex. Use XML parsing instead.
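For comparison, a minimal sketch of the XML-parsing route using LINQ to XML, shown for the font-to-span direction only. It assumes the fragment is well-formed XML (unclosed tags such as a bare br would make the parse throw); the module and function names are hypothetical:

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Xml.Linq

Module XmlParsingSketch ' hypothetical name
    Function FontToSpan(ByVal fragment As String) As String
        ' Wrap the fragment in a temporary root so several top-level nodes can be parsed.
        Dim root As XElement = XElement.Parse("<root>" & fragment & "</root>", LoadOptions.PreserveWhitespace)

        ' ToList() takes a snapshot, because the loop renames the elements it visits.
        For Each font As XElement In root.Descendants("font").ToList()
            Dim styles As New List(Of String)
            Dim size As XAttribute = font.Attribute("size")
            Dim color As XAttribute = font.Attribute("color")
            If size IsNot Nothing Then styles.Add("font-size:" & size.Value & "px;")
            If color IsNot Nothing Then styles.Add("color:" & color.Value & ";")

            font.Name = "span"
            font.RemoveAttributes()
            If styles.Count > 0 Then font.SetAttributeValue("style", String.Join(" ", styles.ToArray()))
        Next

        ' Serialise the children of the temporary root, leaving the root itself out.
        Return String.Concat(root.Nodes().Select(Function(n) n.ToString(SaveOptions.DisableFormatting)).ToArray())
    End Function

    Sub Main()
        Console.WriteLine(FontToSpan("<font color=""#000000"" size=""10"">some text</font> other text"))
        ' <span style="font-size:10px; color:#000000;">some text</span> other text
    End Sub
End Module
```

Unlike the regex approach, this handles nested elements and any attribute order, but it always emits font-size before color and drops any other attributes on the font element.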


Solution:4

I have found a solution to this issue, although it is not one that involves a regular expression. I am still very interested in the idea of creating a custom program in a GUI creation tool to accomplish this. The link below provides the easiest way to convert any deprecated font tags to inline span tags. It is a crucial and awesome tool.

http://tinymce.moxiecode.com/tryit/full.php

Clicking on html will show the html code for the message. Then you can replace that with the html that has the deprecated <font> tags and they will be converted to inline <span> tags.


Solution:5

It might be a good idea to explain why you need to do this, as unless there's a particular goal, this seems to turn one kind of non-semantic code into another kind of non-semantic code.

Might the time be better spent converting to separate HTML and CSS code, based on class and id attributes?


Solution:6

I agree with both comments saying that XSLT should be used for XML transformation and that style shouldn't be mixed into HTML, but here is a starting point for your regex if you're in a hurry (in Perl; I don't know any VB, but it shouldn't be too far off):

's/<font(.*)size="([^ ]*)"(.*)color="([^ ]*)"(.*)<\/font>/<span$1style="font-size:$2px;color:$4"$3$5<\/span>/g'  

I don't think you can do this in one regex. This one handles the case where size comes before color; you can derive the three missing regexes from it.
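A hedged sketch of how the same substitution might look in VB.NET with Regex.Replace. This is a direct port for illustration rather than a tested equivalent of the Perl one-liner, and the module name is made up:

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module PerlPortSketch ' hypothetical name
    Sub Main()
        ' Same pattern as the Perl version; quotation marks are doubled for the VB.NET string literal.
        Dim pattern As String = "<font(.*)size=""([^ ]*)""(.*)color=""([^ ]*)""(.*)</font>"

        ' .NET uses $1 or ${1} for numbered groups; the braces keep each group
        ' number clearly separated from the text that follows it.
        Dim replacement As String = "<span${1}style=""font-size:${2}px;color:${4}""${3}${5}</span>"

        Dim input As String = "<font size=""10"" color=""#000000"">some text</font>"
        Console.WriteLine(Regex.Replace(input, pattern, replacement))
    End Sub
End Module
```

Because each (.*) is greedy, input with more than one font element on the same line is unlikely to convert correctly, and any whitespace captured between the attributes is carried over into the output. Replacing (.*) with something tighter such as [^>]* for the parts inside the tag would be a reasonable next step.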

