Prerequisites: PHP, Theming, Drupal 6, Regular Expressions
Adventures With Nested Regular Expression Functions
Sometimes a situation calls for skills you know you should have but just don't use often enough to claim as a weapon in your arsenal of tools. For me, that skill is competence with regular expressions. It seems like every six months or so I'm back to reading Mastering Regular Expressions and scouring the interwebs for help deciphering or creating one of those cryptic phrases.
Many thanks to regular-expressions.info, a great resource for learning about regular expressions.
Here's the scenario: a client's site has nodes that are tagged with multiple taxonomy terms, sometimes enough to make the list of terms span multiple lines. Sometimes the terms are phrases, so they contain multiple words and as a result are sometimes split across lines. We want to ensure that taxonomy terms are always kept together.
The simplest answer is probably to use display: inline-block; in the appropriate CSS declaration, but even though that's what we ended up doing, I thought this alternate solution was an interesting exercise in the use of regular expressions.
The Code
function exampletheme_preprocess_node(&$variables) {
  // Replace all spaces within <a> tags with non-breaking spaces so they don't span lines
  $pattern = '/(<a\b[^>]*>)(.*?)(<\/a>)/';
  $variables['terms'] = preg_replace_callback($pattern, create_function('$matches', 'return $matches[1] . preg_replace("/ /", "&nbsp;", $matches[2]) . $matches[3];'), $variables['terms']);
}
The code goes in your theme's template.php file, in the exampletheme_preprocess_node function, where "exampletheme" stands in for the name of your own theme.
Now let's have a look at what's going on. The parameter &$variables is a keyed array, in which each of the keys eventually becomes a variable that is made available to the node template (e.g., node.tpl.php). The variable we're interested in is 'terms', which is the fully rendered list of taxonomy terms for the node. For example,
<ul class="links inline">
<li class="taxonomy_term_79 first"><a href="/category/business-categories/accessories-jewelry" rel="tag" title="">Accessories / Jewelry</a></li>
<li class="taxonomy_term_92"><a href="/category/business-categories/energy-savings" rel="tag" title="">Energy Savings</a></li>
<li class="taxonomy_term_100"><a href="/category/business-categories/home-furnishing" rel="tag" title="">Home Furnishing</a></li>
<li class="taxonomy_term_101"><a href="/category/business-categories/household-cleaning-products" rel="tag" title="">Household Cleaning Products</a></li>
<li class="taxonomy_term_106"><a href="/category/business-categories/paper-products" rel="tag" title="">Paper Products</a></li>
<li class="taxonomy_term_110 last"><a href="/category/business-categories/retail-stores" rel="tag" title="">Retail Stores</a></li>
</ul>
Our goal is to take spaces that occur within <a> tag text and replace them with non-breaking spaces (&nbsp;). For example, change Paper Products into Paper&nbsp;Products. To do that, we nest a call to the PHP function preg_replace inside a call to its sibling, preg_replace_callback.
Let's look at the innermost preg_replace:
preg_replace("/ /", "&nbsp;", $matches[2])
This simply says, "find each space in $matches[2] and replace it with &nbsp;." The first parameter is a regular expression consisting of the start delimiter (a slash), a space, and an end delimiter (another slash). We'll get to $matches[2] in a sec.
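As a quick check of that inner call on its own, here is a minimal sketch run against a made-up string rather than real node output. It also shows that, for a fixed single-character search like this one, str_replace gives the same result; that comparison is an aside, not something the theme code relies on.

```php
<?php
// Standalone illustration of the inner replacement; the sample string is made up.
$text = 'Household Cleaning Products';

// Every literal space becomes the HTML entity for a non-breaking space.
$with_regex = preg_replace('/ /', '&nbsp;', $text);

// For a fixed one-character search, str_replace produces the same string.
$with_str_replace = str_replace(' ', '&nbsp;', $text);

echo $with_regex, "\n"; // Household&nbsp;Cleaning&nbsp;Products
```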
The outer function, preg_replace_callback, is a variant of preg_replace that takes a callback function as its second parameter instead of a replacement string. It calls that function once for each match it finds and substitutes whatever the function returns. In our case, we define the function inline using create_function, but it can just as easily be done with a normal function declaration.
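Here is a rough sketch of the "normal function declaration" route, run on a single hard-coded list item so it works outside Drupal. The helper name exampletheme_nbsp_link_text and the sample markup are invented for illustration. This form is also worth knowing because create_function was deprecated in PHP 7.2 and removed in PHP 8.0.

```php
<?php
// Hypothetical helper name; not part of Drupal or of the original theme.
function exampletheme_nbsp_link_text($matches) {
  // $matches[1] is the opening tag, $matches[2] the link text, $matches[3] the closing tag.
  return $matches[1] . preg_replace('/ /', '&nbsp;', $matches[2]) . $matches[3];
}

$pattern = '/(<a\b[^>]*>)(.*?)(<\/a>)/';
$terms = '<li><a href="/category/paper-products" rel="tag">Paper Products</a></li>';

// Pass the function name as the callback instead of building one with create_function.
$result = preg_replace_callback($pattern, 'exampletheme_nbsp_link_text', $terms);

echo $result, "\n";
// <li><a href="/category/paper-products" rel="tag">Paper&nbsp;Products</a></li>
```

Note that the spaces inside the opening tag (between the attributes) are left alone; only the link text is rewritten.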
preg_replace_callback passes the callback an array of the form:
- $matches[0] = the full matching text
- $matches[1] = backreference 1
- $matches[2] = backreference 2
- ...
- $matches[n] = backreference n
So, going back to the inner function, we're operating on the string passed as backreference 2. What is backreference 2? For that, we need to look at the regular expression:
/(<a\b[^>]*>)(.*?)(<\/a>)/
We already know that the slashes on the ends are just delimiters, so let's look at the three parenthetical clauses. Using parentheses around these clauses causes the regex engine to create backreferences which can then be referred to by our callback function.
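To see what those backreferences actually hold, here is a small sketch that uses preg_match on one list item from the sample output. preg_match fills its third argument with the same kind of array that preg_replace_callback hands to the callback, so it is a convenient way to inspect the pieces.

```php
<?php
$pattern = '/(<a\b[^>]*>)(.*?)(<\/a>)/';
$html = '<li class="taxonomy_term_106"><a href="/category/business-categories/paper-products" rel="tag" title="">Paper Products</a></li>';

// $matches[0] is the whole match; $matches[1] to $matches[3] are the three parenthesized groups.
preg_match($pattern, $html, $matches);

echo $matches[1], "\n"; // <a href="/category/business-categories/paper-products" rel="tag" title="">
echo $matches[2], "\n"; // Paper Products
echo $matches[3], "\n"; // </a>
```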
First we have
(<a\b[^>]*>)
which matches
- The literal "<a" as long as it is followed by a word boundary (\b). That distinguishes an <a> tag from, say, an <acronym> tag.
- Followed by any number of characters that are not a closing angle bracket ("[^>]*")
- Followed by the literal ">"
So, that's the <a> tag and all of its attributes.
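The effect of the word boundary is easy to check in isolation. This sketch tests the first clause against three made-up opening tags; the tag strings are illustrative only.

```php
<?php
$opening_tag = '/(<a\b[^>]*>)/';

// Matches: "a" is followed by a space, which is a word boundary.
$anchor = preg_match($opening_tag, '<a href="/category/retail-stores" rel="tag">');

// Matches: "a" is followed by ">", which is also a word boundary.
$bare = preg_match($opening_tag, '<a>');

// No match: "a" is followed by "c", so there is no boundary after "<a".
$acronym = preg_match($opening_tag, '<acronym title="Hypertext Markup Language">');

var_dump($anchor, $bare, $acronym); // int(1), int(1), int(0)
```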
Let's skip the second one for now and look at number three:
(<\/a>)
- matching a literal "<"
- followed by a "/", which needs to be escaped with a "\" first
- followed by a literal "a>"
That's our closing tag.
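The slash only needs escaping because it is also the pattern delimiter. As an aside, PHP accepts other delimiters, and with "#" the closing tag can be written without the backslash. This sketch (with a made-up link) shows the two forms capturing the same text.

```php
<?php
$subject = '<a href="/category/retail-stores">Retail Stores</a>';

// With "/" as the delimiter, the slash in the closing tag must be escaped.
$slash_delimited = '/(<\/a>)/';

// With "#" as the delimiter, the slash is an ordinary character.
$hash_delimited = '#(</a>)#';

preg_match($slash_delimited, $subject, $a);
preg_match($hash_delimited, $subject, $b);

var_dump($a[1] === $b[1]); // bool(true)
```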
Going back to number two:
(.*?)
which matches any character except a newline (".") any number of times ("*"). The "?" makes the star "lazy", so it matches as little as possible and stops at the first closing tag rather than the last one. This represents the stuff in between the tags.
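The difference between the lazy and greedy star shows up as soon as two links share a line. In this sketch the markup is made up, and the second pattern is the same regex with the "?" removed.

```php
<?php
// Two links on one line, so a greedy match has room to overshoot.
$html = '<a href="/one">Paper Products</a>, <a href="/two">Retail Stores</a>';

// Lazy: stops at the first closing tag it can reach.
preg_match('/(<a\b[^>]*>)(.*?)(<\/a>)/', $html, $lazy);

// Greedy: runs on to the last closing tag on the line.
preg_match('/(<a\b[^>]*>)(.*)(<\/a>)/', $html, $greedy);

echo $lazy[2], "\n";   // Paper Products
echo $greedy[2], "\n"; // Paper Products</a>, <a href="/two">Retail Stores
```

With the greedy version, the callback would replace the spaces inside the second opening tag as well, which would break its attributes.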
Lastly, we don't just need the stuff in the middle, we also need the tags, so we prepend $matches[1] and append $matches[3] to the string before we return the final value from the callback.
So there it is. A rather long explanation for a rather short amount of code, but a good opportunity to brush up on some regular expression basics.