Tutorial: Tricky pattern match



Question:

This could be tricky, easy or impossible... I'm not sure

I have a list of domains and I'm trying to match them as closely as possible to the website name in the "title" tag.

For example...

  Domain: www.yahoo.com
  Title: Yahoo!
  Result: Yahoo!

  Domain: www.thegreenpages.com
  Title: Welcome to The Green Pages.
  Result: The Green Pages

  Domain: www.experts-exchange.com
  Title: Experts Exchange - The #1 resource on the web for solving technology problems.
  Result: Experts Exchange

So you can see the problem. I need to account for case, spaces, and special characters in the domain (such as hyphens). I also need to capture special characters that belong to the name, like the ! in Yahoo!, but not something like a period that merely ends a sentence, plus whatever other cases you can think of.

Make sense?

In PHP.

I truly, truly suck at these types of pattern matching problems :)


Solution:1

I'm not sure you'll ever come up with a pattern that will solve all the eventualities you can run into with a problem like this. A title tag could be totally random text that wouldn't match at all.

For instance, here's a random site I picked from a random Google search. The site domain is "plus2net.com", and the title (obviously geared for SEO) is "PHP HTML MySQL articles tutorials, free scripts and programming forum". How would you ever correlate those two things? Theoretically you could use something like the levenshtein() function to get a rough similarity score, but I think coming up with a regexp to solve this problem is the wrong approach.
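To give a feel for what a levenshtein()-based score might look like, here is a minimal sketch. The helper names normalizeForCompare() and domainTitleDistance() are made up for illustration, and the sketch assumes the "www." prefix and the TLD have already been stripped from the domain.

```php
<?php
// Reduce a string to lowercase ASCII letters and digits so that case,
// spaces and punctuation do not count as differences.
function normalizeForCompare(string $text): string
{
    return preg_replace('/[^a-z0-9]/', '', strtolower($text));
}

// Hypothetical helper: edit distance between a bare domain label
// (no "www." or TLD) and a page title. Lower means more similar.
function domainTitleDistance(string $label, string $title): int
{
    return levenshtein(normalizeForCompare($label), normalizeForCompare($title));
}
```

With this, 'yahoo' against 'Yahoo!' scores 0 and 'thegreenpages' against 'Welcome to The Green Pages.' scores 9, while the plus2net title scores above 50. Where to draw the cut-off is a judgment call, and the score only says how close the strings are, not which part of the title is the site name.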

I'd re-think the problem. What are you trying to accomplish? If you're just trying to correlate a list of domain names and title tags, couldn't you write a quick script to scrape the title tags from the list of domains you have and get the exact data?
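If scraping is an option, the title extraction itself could be sketched roughly like this. Both function names are hypothetical, the regex is a simplification rather than a real HTML parser, and fetchTitle() assumes the domain answers over plain HTTP and that allow_url_fopen is enabled.

```php
<?php
// Pull the text of the first <title> element out of an HTML string.
// Returns null when the page has no title element.
function extractTitle(string $html): ?string
{
    if (!preg_match('/<title[^>]*>(.*?)<\/title>/is', $html, $m)) {
        return null;
    }
    // Decode entities such as &amp; and collapse runs of whitespace.
    $title = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return trim(preg_replace('/\s+/', ' ', $title));
}

// Hypothetical fetch step: assumes plain HTTP works for the domain.
// Returns null if the page cannot be loaded.
function fetchTitle(string $domain): ?string
{
    $html = @file_get_contents('http://' . $domain . '/');
    return $html === false ? null : extractTitle($html);
}
```

This only gets you the raw title text; picking the site name out of it is still the hard part.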


Solution:2

Try this code:

$sites = array(
    array('domain' => 'www.yahoo.com', 'title' => 'Yahoo!'),
    array('domain' => 'www.thegreenpages.com', 'title' => 'Welcome to The Green Pages.'),
    array('domain' => 'www.experts-exchange.com', 'title' => 'Experts Exchange - The #1 resource on the web for solving technology problems.'),
);

foreach ($sites as $idx => $site) {
    // Strip the leading "www." and a few common TLDs.
    $domain = preg_replace('/^www\./i', '', $site['domain']);
    $domain = preg_replace('/\.(com|net|org|info|us)$/i', '', $domain);

    // Build a pattern that allows optional whitespace after every character
    // and makes non-letters (such as "-") optional.
    $expression = '/';
    for ($i = 0; $i < strlen($domain); $i++) {
        $char = $domain[$i];
        $expression .= preg_quote($char, '/') . (ctype_alpha($char) ? '' : '?');
        $expression .= '\s*';
    }
    $expression .= '/i';

    // Store the matched text, or null when the title does not contain the name.
    $sites[$idx]['name'] = preg_match($expression, $site['title'], $matches) ? $matches[0] : null;
}

If you print_r($sites) you'll get:

Array
(
    [0] => Array
        (
            [domain] => www.yahoo.com
            [title] => Yahoo!
            [name] => Yahoo
        )

    [1] => Array
        (
            [domain] => www.thegreenpages.com
            [title] => Welcome to The Green Pages.
            [name] => The Green Pages
        )

    [2] => Array
        (
            [domain] => www.experts-exchange.com
            [title] => Experts Exchange - The #1 resource on the web for solving technology problems.
            [name] => Experts Exchange 
        )

)

No matter what, you'll have to tweak your script until you get it right, but this is a place to start.
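As one example of such a tweak, the sketch below wraps the same idea in a function, only allows whitespace between characters (so no trailing space is captured), keeps a trailing "!" if the title has one, and returns null when nothing matches. The function name matchSiteName() and the choice to treat only "!" as part of the name are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns the site name as written in the title, or null.
function matchSiteName(string $domain, string $title): ?string
{
    // Strip a leading "www." and a few common TLDs, as in the loop above.
    $label = preg_replace('/^www\./i', '', $domain);
    $label = preg_replace('/\.(com|net|org|info|us)$/i', '', $label);
    if ($label === '') {
        return null;
    }

    $parts = array();
    foreach (str_split($label) as $char) {
        // Letters are required; anything else (digits, "-") is optional.
        $parts[] = preg_quote($char, '/') . (ctype_alpha($char) ? '' : '?');
    }

    // Whitespace is allowed between characters only, plus an optional "!".
    $pattern = '/' . implode('\s*', $parts) . '!?/i';
    return preg_match($pattern, $title, $m) === 1 ? $m[0] : null;
}
```

For the three sample sites this returns 'Yahoo!', 'The Green Pages' and 'Experts Exchange'. For a title that does not contain the name at all, such as the plus2net one, it returns null.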


Solution:3

You could build a regular expression based on the domain name such as:

t\s*h\s*e\s*g\s*r\s*e\s*e\s*n\s*p\s*a\s*g\s*e\s*s  

In case-insensitive mode, this would match The Green Pages.
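A quick way to check this against one of the sample titles, with the "i" modifier supplying the case-insensitive mode:

```php
<?php
// The pattern above, wrapped in delimiters with the case-insensitive flag.
$pattern = '/t\s*h\s*e\s*g\s*r\s*e\s*e\s*n\s*p\s*a\s*g\s*e\s*s/i';

// $match[0] holds the matched text with the title's own case and spacing.
preg_match($pattern, 'Welcome to The Green Pages.', $match);
```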


Edit: Here’s an example of how you could build such a regular expression:

$data = array(
    array('yahoo', 'Yahoo!'),
    array('thegreenpages', 'Welcome to The Green Pages.'),
    array('experts-exchange', 'Experts Exchange - The #1 resource on the web for solving technology problems.')
);

foreach ($data as $item) {
    // Split the domain into single characters.
    $domain = preg_split('/(.)/', $item[0], -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    // Drop hyphens; the separator below already allows them.
    foreach ($domain as $key => $chr) {
        if ($chr == '-') {
            unset($domain[$key]);
        }
    }
    // Allow whitespace or hyphens between characters and an optional trailing "!".
    $pattern = '/'.implode('[\s-]*', $domain).'!?/i';
    if (preg_match($pattern, $item[1], $match)) {
        var_dump($match[0]);
    }
}


Solution:4

I see this as at least a three-step process.

  • Remove punctuation from both the title and the URL.
  • Split the URL into words, if necessary.
  • Use the URL to find the correctly cased name by comparing it to the title.

'www.thegreenpages.com'
'Welcome to The Green Pages.'
'The Green Pages'

    'thegreenpages'                                        # remove punctuation
    'the green pages'    <= 'Welcome to The Green Pages'   # split url (if necessary)
                         =>            'The Green Pages'   # result of search

'www.experts-exchange.com'
'Experts Exchange - The #1 res ...'
'Experts Exchange'

    'experts exchange'        'Experts Exchange   The  1 res    '  # remove punctuation
#   'experts exchange'     <= 'Experts Exchange   The  1 res    '  # split url
                           => 'Experts Exchange'                   # result of search

'www.yahoo.com'
'Yahoo!'
'Yahoo!'

    'yahoo'        'Yahoo'   # remove punctuation
#   'yahoo'     <= 'Yahoo'   # split url (if necessary)
                => 'Yahoo'   # result of search
# whoops left off the exclamation point
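In PHP, the three steps above could be sketched as follows. This is only one way to read them: findSiteName() is a made-up name, and instead of a real word split in step 2 it allows optional whitespace between every pair of characters, which sidesteps the need to know where the words end. It also assumes ASCII titles and a single-part TLD.

```php
<?php
// Hypothetical helper following the three steps above.
function findSiteName(string $domain, string $title): ?string
{
    // Step 1: drop "www." and the last dot-suffix, then remove punctuation
    // from the URL and replace punctuation in the title with spaces.
    $label = preg_replace('/^www\.|\.[a-z]+$/i', '', $domain);
    $label = preg_replace('/[^a-z0-9]/i', '', $label);
    $cleanTitle = preg_replace('/[^a-z0-9\s]/i', ' ', $title);
    if ($label === '') {
        return null;
    }

    // Steps 2 and 3: search the cleaned title, ignoring case and spacing,
    // and return the text as the title wrote it.
    $pattern = '/' . implode('\s*', str_split($label)) . '/i';
    return preg_match($pattern, $cleanTitle, $m) === 1 ? $m[0] : null;
}
```

Like the hand-worked trace, this gives 'The Green Pages' and 'Experts Exchange' but only 'Yahoo' for the third site, because the exclamation point is removed along with the rest of the punctuation.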


Solution:5

Unless you seriously confine the problem domain, I would say that this is impossible.

The title tag can contain any arbitrary string in any human language (symbols, foreign characters, "smart" stuff, you name it). How would a regex be smart enough to catch the relevant part? Can you even formally define the relevant part in your own words?

Regexes suck when applied to languages, and even much more complex systems tend to suck when applied to human languages.


Solution:6

Is your list of domains fixed? If so could you build regex for each domain?

Obviously, you can strip out the domain fairly simply, but as Tomalak says, unless the problem domain is much more restricted, this is actually quite a complex computational problem!

From a domain, you need to extract the words, for which you would need a reference dictionary (or one for each language), along with some kind of word matching and perhaps some kind of voting among potential matches. Really, though, without a more specific problem domain this isn't likely to be accurate.
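To illustrate the dictionary idea, here is a rough sketch of splitting a domain label into known words. The function name segmentLabel() is hypothetical and the word list is something you would have to supply. It returns the first full split it finds and does no voting between alternatives.

```php
<?php
// Split $label into dictionary words, or return null when no full split exists.
function segmentLabel(string $label, array $dictionary): ?array
{
    $words = array_flip(array_map('strtolower', $dictionary));
    $label = strtolower($label);
    $n = strlen($label);

    // $best[$i] holds one way to split the first $i characters, or null.
    $best = array_fill(0, $n + 1, null);
    $best[0] = array();

    for ($end = 1; $end <= $n; $end++) {
        for ($start = 0; $start < $end; $start++) {
            if ($best[$start] === null) {
                continue;
            }
            $piece = substr($label, $start, $end - $start);
            if (isset($words[$piece])) {
                // Extend a known split of the prefix with this word.
                $best[$end] = array_merge($best[$start], array($piece));
                break;
            }
        }
    }
    return $best[$n];
}
```

With a dictionary containing "the", "green" and "pages", the label 'thegreenpages' comes back as those three words, which could then be joined with spaces and searched for in the title without regard to case. A label such as 'plus2net' comes back as null unless the dictionary happens to cover it.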

It might be good to know more about what you are trying to achieve?

