9.3. Regular Expressions
Although regular expressions are very powerful, they are difficult to use, especially if you're new to them. So, instead of jumping straight to the functions that PHP provides for working with regular expressions, we cover the pattern matching syntax first. If PCRE is enabled, it shows up in the phpinfo() output, as shown in Figure 9.3.
9.3.1. Syntax
PCRE functions check whether a text string matches a pattern. The syntax of a pattern always has the following format:
<delimiter> <pattern> <delimiter> [<modifiers>]
The modifiers are optional. The delimiter separates the pattern from the modifiers. PCRE uses the first character of the expression as the delimiter. You should use a character that does not exist in the pattern itself. Or, you can use a character that exists in your expression, but then you must escape it with the \. Traditionally, the / is used as the delimiter, but other common delimiters are | or @. It's your choice. Personally, in most cases, we would pick the @, unless we need to do matching on an email or similar pattern that contains the @, in which case we would use the /.
The PHP function preg_match() is used to match regular expressions. The first parameter passed to the function is the pattern. The second parameter is the string to be matched against the pattern, also called the subject. The function returns 1 (the pattern matches) or 0 (the pattern does not match); in a condition, these values behave like TRUE and FALSE. You can also pass a third parameter: a variable name. The text that matches is stored by reference in an array with this name. If you don't need to use the matching text but just want to know if there is a match or not, you can leave out the third parameter. In short, the format is as follows, with $matches being optional:
$result = preg_match($pattern, $subject, $matches);
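As a minimal sketch of the delimiter rules and the call format described above, the following script matches one subject with three different delimiters and then inspects the return value and the $matches array. The subjects and variable names are made up for illustration.

```php
<?php
// The same pattern written with three different delimiters.
$subject    = 'PHP 5';
$with_slash = preg_match('/PHP.?5/', $subject);
$with_at    = preg_match('@PHP.?5@', $subject);
$with_pipe  = preg_match('|PHP.?5|', $subject);

// A delimiter that also occurs in the pattern has to be escaped,
$escaped = preg_match('/^\/usr\/bin$/', '/usr/bin');
// or avoided by choosing a different delimiter.
$other   = preg_match('@^/usr/bin$@', '/usr/bin');

// With the optional third argument, the matched text ends up in $matches.
$result = preg_match('@PHP.?5@', 'I use PHP 5 daily', $matches);
var_dump($result);   // int(1)
print_r($matches);   // element 0 holds "PHP 5"
```

Choosing a delimiter that does not occur in the pattern usually keeps the pattern easier to read than escaping every occurrence.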
Note
The examples in this section will not use the <?php and ?> tags, but of course, they are required.
9.3.1.1 Pattern Syntax
PCRE's matching syntax is very complex. A full discussion of all its details would exceed the scope of this book. We cover just the basics here, which is enough to be very useful. On most UNIX systems with the PCRE library installed, you can use man pcrepattern to read about the whole pattern matching language, or have a look at the (somewhat outdated) PHP Manual page at http://www.php.net/manual/en/pcre.pattern.syntax.php. But here we start with the simple things:
9.3.1.2 Metacharacters
The characters in Table 9.1 are metacharacters: they have a special meaning when used to construct patterns.
Table 9.1. Metacharacters
Character | Description |
---|
\ | The general escape character. You need it whenever you want to use one of the metacharacters, or the delimiter, literally in your pattern. The backslash can also introduce other special characters, which are listed in Table 9.2.
| . | Matches exactly one character, except a newline character.
preg_match('/./', 'PHP 5', $matches);
$matches now contains
Array
(
[0] => P
)
| ? | Marks the preceding character or sub-pattern as optional.
preg_match('/PHP.?5/', 'PHP 5', $matches);
This matches both PHP5 and PHP 5.
| + | Matches the preceding character or sub-pattern one or more times.
'/a+b/' matches 'ab', 'aab', and 'aaaaaaaab', but not 'b'. preg_match() also reports a match in the following example, but $matches does not contain the surrounding characters.
preg_match('/a+b/', 'caaabc', $matches);
$matches now contains
Array
(
[0] => aaab
)
| * | Matches the preceding character or sub-pattern zero or more times.
'/de*f/' matches 'df', 'def', and 'deeeef'. Again, surrounding characters are not part of the matched substring, but do not cause the match to fail.
| {m} {m,n} | Matches the preceding character or sub-pattern exactly m times if the {m} variant is used, or m to n times if the {m,n} variant is used.
'/tre{1,2}f/' matches 'tref' and 'treef', but not 'treeef'. If the number after the comma is missing ({m,}), there is no upper boundary. To allow zero repetitions, write the lower boundary explicitly ({0,n}); the {,n} shorthand is treated as literal text by older PCRE versions.
'/fo{2,}ba{0,2}r/' matches 'foobar', 'fooooooobar', and 'fooobaar', but not 'foobaaar'.
| ^ | Marks the beginning of the subject.
'/^ghi/' matches 'ghik' and 'ghi', but not 'fghi'.
| $ | Marks the end of the subject. If the last character is a newline (\n) character, it also matches just before that newline character. '/Derick$/' matches "Rethans, Derick" and "Rethans, Derick\n" but not "Derick Rethans".
| [ ... ] | Makes a character class out of the characters between the opening and closing bracket. You can use this to create a group of characters to match. Using a hyphen inside the character class creates a range of characters. If you want the hyphen itself to be part of the class, put it last in the class. The caret (^) has a special meaning if it is used as the first character in the class. In this case, it negates the character class, which means that it matches any character that is not listed.
Example 1:
preg_match('/[0-9]+/', 'PHP is released in 2005.',
$matches);
$matches now contains
Array
(
[0] => 2005
)
Example 2:
preg_match('/[^0-9]+/', 'PHP is released in 2005.', $matches);
$matches now contains
Array
(
[0] => PHP is released in
)
Note that the $matches does not include the dot from the subject because a pattern always matches a consecutive string of characters.
Inside the character class, none of the metacharacters from this table has a special meaning, except for ^ (to negate the character class), - (to create a range), ] (to end the character class), and \ (to escape special characters).
| ( ... ) | Creates a sub-pattern, which can be used to group certain elements in a pattern. For example, if we had the string 'PHP in 2005.' and wanted to extract the century and the year as two separate entries in the $matches array, we would use the following regexp: '/([12][0-9])([0-9]{2})/'
This creates two sub-patterns:
([12][0-9]) to match all centuries from 10 to 29.
([0-9]{2}) to match the year in the century.
preg_match(
'/([12][0-9])([0-9]{2})/',
'PHP in 2005.',
$matches
);
$matches now contains
Array
(
[0] => 2005
[1] => 20
[2] => 05
)
The element with index 0 is always the fully matched string, and all sub-patterns are assigned a number in the order in which they occur in the pattern.
| (?: ...) | Creates a sub-pattern that is not captured in the output. You can use this to assert that the pattern is followed by something.
preg_match('@([A-Za-z ]+)(?:hans)@', 'Derick Rethans', $matches);
$matches now contains
Array
(
[0] => Derick Rethans
[1] => Derick Ret
)
As you can see, the full match still includes the fully matched part of the subject, but there is only one extra element for the sub-pattern matches. Without the ?: in the second sub-pattern, there would also have been an element containing hans.
| (?P<name>...) | Creates a named sub-pattern. It is the same as a normal sub-pattern, but it generates additional elements in the $matches array.
preg_match(
'/(?P<century>[12][0-9])(?P<year>[0-9]{2})/',
'PHP in 2005.',
$matches
);
$matches now contains:
Array
(
[0] => 2005
[century] => 20
[1] => 20
[year] => 05
[2] => 05
)
This is useful in case you have a complex pattern and don't want to bother finding out the correct index number in the $matches array. |
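The claims in Table 9.1 are easy to check with a small script. The following is only a sketch: the list of pattern/subject pairs is our own selection, and the expected results are what the rules in the table predict.

```php
<?php
// Each entry: pattern, subject, and whether a match is expected.
$checks = array(
    array('/PHP.?5/',      'PHP5',              true),
    array('/a+b/',         'b',                 false),
    array('/de*f/',        'df',                true),
    array('/^tre{1,2}f$/', 'treeef',            false),
    array('/^ghi/',        'fghi',              false),
    array('/Derick$/',     "Rethans, Derick\n", true),
    array('/[^0-9]+/',     '2005',              false),
);

$failures = 0;
foreach ($checks as $check) {
    list($pattern, $subject, $expected) = $check;
    $matched = preg_match($pattern, $subject) === 1;
    if ($matched !== $expected) {
        $failures++;   // the rule in the table and the engine disagree
    }
    printf("%-16s %s\n", $pattern, $matched ? 'match' : 'no match');
}
echo "unexpected results: $failures\n";
```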
9.3.1.3 Example 1
Let's dissect some useful complex regular expressions that we can create with the metacharacters from Table 9.1:
$pattern = "/^([0-9a-f][0-9a-f]:){5}[0-9a-f][0-9a-f]$/";
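This pattern matches a MAC address (a unique number bound to a network card) in the format 00:04:23:7c:5d:01.
The pattern is bound to the start and end of our subject string with ^ and $, and it contains two parts: the sub-pattern ([0-9a-f][0-9a-f]:){5}, which matches five groups of two hexadecimal digits that are each followed by a colon, and the final [0-9a-f][0-9a-f], which matches the last two hexadecimal digits.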
This regexp could also have been written as /^([0-9a-f]{2}:){5}[0-9a-f]{2}$/, which would have been a bit shorter. To test the text against the pattern, use the following code:
preg_match($pattern, '00:04:23:7c:5d:01', $matches);
print_r($matches);
With either pattern, the output would be the same, as follows:
Array
(
[0] => 00:04:23:7c:5d:01
[1] => 5d:
)
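The shorter pattern can be wrapped in a small validation function. This is a sketch only: is_mac_address() is a hypothetical helper of our own, and the i and D modifiers (both described in Table 9.3) are additions that the pattern above does not use.

```php
<?php
// Hypothetical helper (not part of PHP): checks the format of a MAC address.
function is_mac_address($candidate)
{
    // Six groups of two hex digits separated by colons. The i modifier
    // accepts uppercase digits; D stops $ from matching before a
    // trailing newline.
    return preg_match('/^([0-9a-f]{2}:){5}[0-9a-f]{2}$/iD', $candidate) === 1;
}

$candidates = array(
    '00:04:23:7c:5d:01',   // well-formed
    '00:04:23:7C:5D:01',   // uppercase digits
    '00:04:23:7c5d:01',    // missing colon
    '00:04:23:7c:5d',      // only five groups
);
foreach ($candidates as $candidate) {
    printf("%-20s %s\n", $candidate, is_mac_address($candidate) ? 'valid' : 'invalid');
}
```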
9.3.1.4 Example 2
"/([^<]+)<([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+)>/"
This pattern is used to match email addresses in the following format:
'Derick Rethans <derick@php.net>'
This pattern is not good enough to match all email addresses, and validates some addresses that should not be matched. It only serves as a simple example.
The first part is ([^<]+)<, as follows:
/.
Delimiter used in this pattern.
([^<]+).
Subpattern that matches all characters unless it is the '<' character.
<.
The < character which is not part of any sub-pattern.
The second part is ([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+), which is used to match the email address itself:
[a-zA-Z0-9_-]+.
This matches everything until the @ and consists of one or more characters from the specified character class.
@.
The @ sign.
([a-zA-Z0-9_-]+\\.)+.
A subpattern that matches one or more levels of subdomains. Notice that the . in the pattern is escaped with the \, but also note that this \ is escaped with another \. This is needed because the pattern is enclosed in double quotes ("). You need to be careful with this. It would usually be better to use single quotes for the pattern.
[a-zA-Z0-9_-]+.
The top-level domain name (as in .com). As you can see, the regexp is not correct here; the last part should have been simply [a-z]{2,4}.
Then there is the trailing > and delimiter.
The following example shows the contents of the $matches array after running the preg_match() function:
<?php
$string = 'Derick Rethans <derick@php.net>';
preg_match(
"/([^<]+)<([a-zA-Z0-9_-]+@([a-zA-Z0-9_-]+\\.)+[a-zA-Z0-9_-]+)>/",
$string,
$matches
);
print_r($matches);
?>
The output is
Array
(
[0] => Derick Rethans <derick@php.net>
[1] => Derick Rethans
[2] => derick@php.net
[3] => php.
)
The fourth element is a side effect of using a capturing sub-pattern for the (sub)domain part of the pattern. It doesn't hurt to have it, and a non-capturing (?: ...) sub-pattern would leave it out.
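Using the (?: ...) and (?P<name>...) constructs from Table 9.1, the same match can be written so that the (sub)domain group is not captured and the interesting parts are available by name. This is a sketch of one possible variant, not a better email validator; the names "name" and "email" are our own choice.

```php
<?php
$string = 'Derick Rethans <derick@php.net>';

// Single quotes, so the dot only needs one backslash.
$pattern = '/(?P<name>[^<]+)<(?P<email>[a-zA-Z0-9_-]+@(?:[a-zA-Z0-9_-]+\.)+[a-zA-Z0-9_-]+)>/';

preg_match($pattern, $string, $matches);

// The first sub-pattern also captures the space before the <.
$name  = trim($matches['name']);
$email = $matches['email'];

echo "$name has the address $email\n";
print_r($matches);
```

Because the (sub)domain group is non-capturing, $matches contains only the full match and the two named sub-patterns, each under its name and its number.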
9.3.1.5 Escape Sequences
As shown in the previous table, the \ character is the general escape character. In combination with the character that follows it, the \ stands for a special group of characters. Table 9.2 shows the different cases.
Table 9.2. Escape Sequences
Case | Description |
---|
\? \+ \* \[ \] \{ \} | The first use of the escape character is to take away the special meaning of the other metacharacters. For example, if you need to match 4** in your pattern, you can use
'/^4\*\*$/'
Be careful with using double quotes around your patterns, because PHP gives a special meaning to the \ in there too. The following pattern is therefore equal to the one above.
"/^4\\*\\*$/"
(Note: In this case, "/^4\*\*$/" would also have worked, because PHP does not recognize \* as an escape sequence and leaves the backslash in place, but relying on that is not the correct way to do it.)
| \\ | Escapes the \ so that it can be used in patterns.
<?php
$subject = 'PHP\5';
$pattern1 = '/^PHP\\\5$/';
$pattern2 = "/^PHP\\\\5$/";
$ret1 = preg_match($pattern1, $subject,
$matches1);
$ret2 = preg_match($pattern2, $subject,
$matches2);
var_dump($matches1, $matches2);
?>
Now you are probably wondering why we used three backslashes in $pattern1; this is because PHP recognizes the \ as a special character inside single quotes when it parses the script. This is because you need to use the \ to escape a single quote in such a string ($str = 'derick\'s';). So, the first \ escapes the second \ for the PHP parser, and that combined character escapes the third backslash for PCRE.
The second pattern, inside double quotes, even has four backslashes. This is because inside double quotes \5 has a special meaning to PHP. It means "the character with octal code 5," which is, of course, not really useful at all, but it does cause a problem for our pattern, so we have to escape this backslash with another backslash, too.
| \a | The BEL character (ASCII 7).
| \e | The Escape character (ASCII 27).
| \f | The Formfeed character (ASCII 12).
| \n | The Newline character (ASCII 10).
| \r | The Carriage Return character (ASCII 13).
| \t | The Tab character (ASCII 9).
| \xhh | Any character represented by its hexadecimal code (hh). Use \xdf for the ß (iso-8859-15), for example.
| \ddd | Any character represented by its octal code (ddd).
| \d | Any decimal digit, which is the same as specifying the character class [0-9] in a pattern.
| \D | Any character that is not a decimal digit (the same as [^0-9]).
| \s | Any whitespace character. (It is the same as [\t\f\r\n ], or in words: tab, formfeed, carriage return, newline, and space.)
| \S | Any character that is not a whitespace character.
| \w | Any character that is part of a word, meaning any letter or digit, or the underscore character. Letters are the letters used in the current locale (language-specific):
<?php
$subject = "Montréal";
/* The 'default' locale */
setlocale(LC_ALL, 'C');
preg_match('/^\w+/', $subject, $matches);
print_r($matches);
/* Set the locale to Dutch, which has the é in its alphabet */
setlocale(LC_ALL, 'nl_NL');
preg_match('/^\w+/', $subject, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => Montr
)
Array
(
[0] => Montréal
)
Tip
For this example to work, you will need to have the locale nl_NL installed. Names of locales are system-dependent, too; for example, on Windows, the locale is called nld_nld. See http://www.macmax.org/locales/index_en.html for locale names for MacOS X and http://msdn.microsoft.com/library/default.asp?url=/library/en-us/vclib/html/_crt_language_strings.asp for Windows.
| \W | Any character that does not belong to the \w set.
| \b | An anchor point for a word boundary. In simple words, this means a point in a string between a word character (\w) and a non-word character (\W). The following example matches only the word characters in the subject:
<?php
$string = "##Testing123##";
preg_match('@\b.+\b@', $string, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => Testing123
)
| \B | The opposite of the \b, it acts as an anchor between either two word characters in the \w set, or between two non-word characters from the \W set. The first position that satisfies this restriction lies between the T and the e, and the last one between the n and the g, so the following example only prints estin:
<?php
$string = "Testing";
preg_match('@\B.+\B@', $string, $matches);
echo $matches[0]. "\n";
?>
| \Q ... \E | Can be used inside patterns to turn off the special meaning of metacharacters. The pattern '@\Q.+*?\E@' will therefore match the string '.+*?'. |
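When the text to match comes from a variable rather than from a hand-written pattern, escaping every metacharacter by hand is error-prone. PHP's preg_quote() function does the escaping for you. The following sketch compares it with \Q ... \E; the sample strings are invented for illustration.

```php
<?php
// Hypothetical input that happens to contain metacharacters.
$needle = '4** (approx.)';

// preg_quote() escapes the metacharacters; the second argument also
// escapes the delimiter in case it occurs in the input.
$pattern = '/' . preg_quote($needle, '/') . '/';
$found   = preg_match($pattern, 'Rated 4** (approx.) by readers');

// Unquoted, the dot matches any character; quoted, only a literal dot.
$loose  = preg_match('/a.c/', 'abc');
$strict = preg_match('/' . preg_quote('a.c', '/') . '/', 'abc');

// \Q ... \E achieves the same inside a hand-written pattern.
$literal = preg_match('@\Q.+*?\E@', 'the string .+*? is literal here');

var_dump($found, $loose, $strict, $literal);
```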
9.3.1.6 Examples
'/\w+\s+\w+/'
Matches two words separated by whitespace.
'/(\d{1,3}\.){3}\d{1,3}/'
Matches (but does not validate) an IP address. The IP address may appear anywhere in the string.
<?php
$str = "My IP address is 212.187.38.47.";
preg_match('/(\d{1,3}\.){3}\d{1,3}/', $str, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => 212.187.38.47
[1] => 38.
)
It is interesting to notice that the second element only contains the last one of the three matched subpatterns.
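If all four numbers are needed, each one has to get its own sub-pattern, because a repeated sub-pattern only keeps its last repetition. The following sketch spells the groups out and adds a simple range check; the variable names are our own.

```php
<?php
$str = 'My IP address is 212.187.38.47.';

// One capturing sub-pattern per number instead of a repeated group.
$pattern = '/(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/';

$octets = array();
if (preg_match($pattern, $str, $matches)) {
    // Drop element 0 (the full match) and keep the four sub-patterns.
    $octets = array_slice($matches, 1);
}

// The pattern itself does not validate, so check the range separately.
$valid = count($octets) === 4;
foreach ($octets as $octet) {
    if ((int) $octet > 255) {
        $valid = false;
    }
}

print_r($octets);
var_dump($valid);
```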
9.3.1.7 Lazy Matching
Suppose you have the following string and you want to match the string inside the first <a /> tag:
<a href="http://php.net/">PHP</a> has an <a href="http://php.net/manual">excellent</a> manual.
The following pattern looks like it will work:
'@<a.*>(.*)</a>@'
However, when you run the following example, you see that it outputs the wrong result:
<?php
$str = '<a href="http://php.net/">PHP</a> has an '.
'<a href="http://php.net/manual">excellent</a> manual.';
$pattern = '@<a.*>(.*)</a>@';
preg_match($pattern, $str, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => <a href="http://php.net/">PHP</a> has an <a href="http://php.net/manual">excellent</a>
[1] => excellent
)
The example fails because the * and the + are greedy operators. They try to match as many characters as possible. In this case, <a.*> will match everything up to manual">. You can tell the PCRE engine not to do this by appending the ? to the quantifier. If the ? is added, the PCRE engine tries to match as few characters or sub-patterns as possible, which is what we want here.
When the pattern @<a.*?>(.*?)</a>@ is used, the output is correct:
Array
(
[0] => <a href="http://php.net/">PHP</a>
[1] => PHP
)
However, this is not the most efficient way. It's usually better to use the pattern @<a[^>]+>([^<]+)</a>@, which requires less processing by the PCRE engine.
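The three variants discussed in this section can be compared side by side. The following sketch runs each of them against the same subject and collects what the sub-pattern captured; the array keys are labels of our own.

```php
<?php
$str = '<a href="http://php.net/">PHP</a> has an '.
       '<a href="http://php.net/manual">excellent</a> manual.';

$patterns = array(
    'greedy'  => '@<a.*>(.*)</a>@',
    'lazy'    => '@<a.*?>(.*?)</a>@',
    'negated' => '@<a[^>]+>([^<]+)</a>@',
);

$captured = array();
foreach ($patterns as $label => $pattern) {
    preg_match($pattern, $str, $matches);
    $captured[$label] = $matches[1];
    printf("%-8s %s\n", $label, $matches[1]);
}
```

The greedy pattern captures the text of the last link, while the lazy pattern and the negated character classes both capture the text of the first one.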
9.3.1.8 Modifiers
The modifiers "modify" the behavior of the pattern matching engine. Table 9.3 lists them all with descriptions and examples.
Table 9.3. Modifiers
Modifier | Description |
---|
i | Makes the PCRE engine match in a case-insensitive way.
/[a-z]/ matches a letter in the range a..z. /[a-z]/i matches a letter in the ranges A..Z and a..z.
| m | Changes the behavior of ^ and $ so that ^ also matches just after a newline character, and $ also matches just before a newline character.
<?php
$str = "ABC\nDEF\nGHI";
preg_match('@^DEF@', $str, $matches1);
preg_match('@^DEF@m', $str, $matches2);
print_r($matches1);
print_r($matches2);
?>
outputs
Array
(
)
Array
(
[0] => DEF
)
| s | With this modifier set, the . (dot) also matches the newline character; without this modifier set (the default), it does not match the newline character.
<?php
$str = "ABC\nDEF\nGHI";
preg_match('@BC.DE@', $str, $matches1);
preg_match('@BC.DE@s', $str, $matches2);
print_r($matches1);
print_r($matches2);
?>
outputs
Array
(
)
Array
(
[0] => BC
DE
)
| x | If this modifier is set, you can put arbitrary whitespace inside your pattern, except of course in character classes.
<?php
$str = "ABC\nDEF\nGHI";
preg_match('@A B C@', $str, $matches1);
preg_match('@A B C@x', $str, $matches2);
print_r($matches1);
print_r($matches2);
?>
outputs
Array
(
)
Array
(
[0] => ABC
)
| e | Only has an effect on the preg_replace() function. When it is set, preg_replace() performs the normal replacement of back references and then evaluates the replacement string as PHP code. For an example, see the section "Replacement Functions." (This modifier was deprecated in PHP 5.5 and removed in PHP 7.0; on those versions, use preg_replace_callback() instead.)
| A | Setting this modifier has the same effect as using ^ as the first character in your pattern unless the m modifier is set.
<?php
$str = "ABC";
preg_match('@BC@', $str, $matches1);
preg_match('@BC@A', $str, $matches2);
print_r($matches1);
print_r($matches2);
?>
outputs
Array
(
[0] => BC
)
Array
(
)
| D | Makes the $ only match at the very end of the subject string, and not one character before the end in case that is a newline character.
<?php
$str = "ABC\n";
preg_match('@BC$@', $str, $matches1);
preg_match('@BC$@D', $str, $matches2);
print_r($matches1);
print_r($matches2);
?>
outputs
Array
(
[0] => BC
)
Array
(
)
| U | Swaps the "greediness" of the PCRE engine. Quantifiers become ungreedy by default, and the ? character turns on greediness. This makes the pattern we saw in an earlier example ('@<a.*?>(.*?)</a>@') equivalent to '@<a.*>(.*)</a>@U'.
<?php
$str = '<a href="http://php.net/">PHP</a> has an '.
'<a href="http://php.net/manual">'.
'excellent</a> manual.';
$pattern = '@<a.*>(.*)</a>@U';
preg_match($pattern, $str, $matches);
print_r($matches);
?>
outputs
Array
(
[0] => <a href="http://php.net/">PHP</a>
[1] => PHP
)
| X | Turns on extra features in the PCRE engine. At the moment, the only feature it turns on is that the engine will throw an error in case an unknown escape sequence was detected. Normally, this would just have been treated as a literal. (Notice that we still have to escape the one \ for PHP itself.)
<?php
$str = '\\h';
preg_match('@\\h@', $str, $matches1);
preg_match('@\\h@X', $str, $matches2);
?>
output:
Warning: preg_match(): Compilation failed: unrecognized character follows \ at offset 1 in /dat/docs/book/prenticehall/php5powerprogramming/chapters/draft/10-mainstream-extensions/pcre/mod-X.php on line 4
| u | Turns on UTF-8 mode. In UTF-8 mode, the PCRE engine treats the pattern and the subject as UTF-8 encoded. This means, for example, that the . (dot) matches a complete multi-byte character. (The next example assumes that the script is saved in UTF-8, in which the é takes up two bytes.)
<?php
$str = 'Dérick';
preg_match('@D.rick@', $str, $matches1);
preg_match('@D.rick@u', $str, $matches2);
print_r($matches1);
print_r($matches2);
?>
outputs
Array
(
)
Array
(
[0] => Dérick
)
|
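Modifiers can be combined by writing them one after another. The following sketch combines x and m and uses the fact that, with the x modifier set, PCRE also ignores everything from an unescaped # to the end of the line, which allows comments inside a pattern. The subject string and the sub-pattern names are invented for illustration.

```php
<?php
$str = "log\n2005-01-31\nend";

// x: whitespace and # comments in the pattern are ignored.
// m: ^ and $ also match at the line breaks inside the subject.
$pattern = '/
    ^ (?P<year>  \d{4} )   # four-digit year
    - (?P<month> \d{2} )   # two-digit month
    - (?P<day>   \d{2} )   # two-digit day
    $
/xm';

$result = preg_match($pattern, $str, $matches);
print_r($matches);
```

Without the m modifier, the same pattern does not match this subject, because the date is neither at the start nor at the end of the string.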
9.3.2. Functions
Three groups of PCRE-related functions are available: matching functions, replacement functions, and splitting functions. preg_match(), discussed previously, belongs to the first group. The second group contains functions that replace substrings that match a specific pattern. The last group splits strings based on regular expression matches.
9.3.2.1 Matching Functions
preg_match() is the function that matches one pattern against the subject string and returns 1 or 0 depending on whether the subject matched the pattern. It can also fill an array with the contents of the different sub-pattern matches.
The function preg_match_all() is similar, except that it matches the pattern with the subject repeatedly. Finding all the matches is useful when extracting information from documents. Take, for example, the situation in which you want to extract email addresses from a web site:
<?php
$raw_document = file_get_contents('http://www.w3.org/TR/CSS21');
$doc = html_entity_decode($raw_document);
$count = preg_match_all(
'/<(?P<email>([a-z.]+).?@[a-z0-9]+\.[a-z]{1,6})>/Ui',
$doc,
$matches
);
print_r($matches);
?>
outputs
Array
(
[0] => Array
(
[0] => <bert @w3.org>
[1] => <tantekc @microsoft.com>
[2] => <ian @hixie.ch>
[3] => <howcome @opera.com>
)
[email] => Array
(
[0] => bert @w3.org
[1] => tantekc @microsoft.com
[2] => ian @hixie.ch
[3] => howcome @opera.com
)
[1] => Array
(
[0] => bert @w3.org
[1] => tantekc @microsoft.com
[2] => ian @hixie.ch
[3] => howcome @opera.com
)
[2] => Array
(
[0] => bert
[1] => tantekc
[2] => ian
[3] => howcome
)
)
This example reads the contents of the CSS 2.1 specification into a string and decodes the HTML entities in it. The script then uses preg_match_all() on the document, using a pattern that matches < + an email address + >, and stores the email addresses in the $matches array. The output shows that preg_match_all() doesn't store all sub-patterns belonging to one match in one element of the $matches array. Instead, it groups the results by sub-pattern: each element of $matches holds the matches of one sub-pattern across all the matches.
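If you would rather have one element per match, preg_match_all() accepts the PREG_SET_ORDER flag as a fourth argument. The following sketch contrasts the two orderings on a short, made-up string instead of a downloaded document; the addresses and the simplified pattern are for illustration only.

```php
<?php
$doc = 'Contact <bert@example.org> or <ian@example.com> for details.';
$pattern = '/<(?P<email>(?P<user>[a-z.]+)@[a-z0-9.]+\.[a-z]{2,6})>/i';

// Default ordering (PREG_PATTERN_ORDER): one element per sub-pattern.
$count = preg_match_all($pattern, $doc, $by_pattern);

// PREG_SET_ORDER: one element per match, each with its sub-patterns.
preg_match_all($pattern, $doc, $by_match, PREG_SET_ORDER);

print_r($by_pattern['user']);      // all user names
foreach ($by_match as $match) {
    echo "{$match['user']} => {$match['email']}\n";
}
```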
preg_grep() performs similarly to the UNIX egrep command. It compares a pattern against elements of an array containing the subjects. It returns an array containing the elements that were successfully matched against the pattern. See the next example, which returns all valid IP addresses from the array $addresses:
<?php
$addresses =
array('212.187.38.47', '188.141.21.91', '2.9.256.7', '<<empty>>');
// Anchor both ends so that the whole element must be an IP address.
$pattern =
'@^((\d?\d|1\d\d|2[0-4]\d|25[0-5])\.){3}'.
'(\d?\d|1\d\d|2[0-4]\d|25[0-5])$@';
$addresses = preg_grep($pattern, $addresses);
print_r($addresses);
?>
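preg_grep() keeps the original array keys of the elements it returns, and with the PREG_GREP_INVERT flag it returns the elements that do not match instead. The following sketch shows both on the same list of addresses; the helper variable $octet is our own shorthand for the repeated part of the pattern.

```php
<?php
$addresses = array('212.187.38.47', '188.141.21.91', '2.9.256.7', '<<empty>>');

// A number from 0 to 255, reused four times in the pattern.
$octet   = '(\d?\d|1\d\d|2[0-4]\d|25[0-5])';
$pattern = '@^(' . $octet . '\.){3}' . $octet . '$@D';

$valid   = preg_grep($pattern, $addresses);
$invalid = preg_grep($pattern, $addresses, PREG_GREP_INVERT);

print_r($valid);                  // keys 0 and 1
print_r($invalid);                // keys 2 and 3
print_r(array_values($invalid));  // renumbered from 0
```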
9.3.2.2 Replacement Functions
In addition to the matching described in the previous section, PHP's regular expression functions can also replace text based on pattern matching. The replacement functions replace each substring that matches a pattern with different text. In the replacement, you can refer to the sub-pattern matches by using back references. Here is an example that explains the replacement functions. In this example, we use preg_replace() to replace a pseudo-link, such as [link url="http://php.net"]PHP[/link], with a real HTML link:
<?php
$str = '[link url="http://php.net"]PHP[/link] is cool.';
$pattern = '@\[link\ url="([^"]+)"\](.*?)\[/link\]@';
$replacement = '<a href="\\1">\\2</a>';
$str = preg_replace($pattern, $replacement, $str);
echo $str;
?>
The script outputs
<a href="http://php.net">PHP</a> is cool.
The pattern consists of two sub-patterns: ([^"]+) for the URL and (.*?) for the link text. Instead of returning the substrings of the subject that match the two sub-patterns, the PCRE engine assigns them to back references, which you can access by using \\1 and \\2 in the replacement string. If you don't want to use \\1, you may use $1. Be careful when putting the replacement string into double quotes, because PHP then interprets backslash sequences itself before PCRE sees them; "\1", for example, becomes the character with octal code 1. It is simplest to always put the replacement string in single quotes.
The full pattern match is assigned to back reference 0, just like the element with key 0 in the matches array of the preg_match() function.
Tip
If a back reference needs to be followed directly by a digit, use the ${1}1 notation: this is the first back reference, followed by the number 1.
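The different ways of writing a back reference can be tried out with a few one-line replacements. This is a small sketch with made-up subjects; all replacement strings are in single quotes.

```php
<?php
// \\1 and $1 are two spellings of the same back reference.
$swapped_dollar    = preg_replace('@(\w+) (\w+)@', '$2 $1', 'hello world');
$swapped_backslash = preg_replace('@(\w+) (\w+)@', '\\2 \\1', 'hello world');

// Back reference 0 is the full pattern match.
$whole = preg_replace('@\w+ \w+@', '[$0]', 'hello world');

// The braces separate back reference 1 from the literal digit after it.
$with_braces = preg_replace('@(\d)@', '${1}1', 'version 5');

echo "$swapped_dollar\n$swapped_backslash\n$whole\n$with_braces\n";
```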
preg_replace() can replace more than one subject at the same time by using an array of subjects. For instance, the following example script changes the format of the names in the array $names:
<?php
$names = array(
'rethans, derick',
'sæther bakken, stig',
'gutmans, andi'
);
$names = preg_replace('@([^,]+).\ (.*)@', '\\2 \\1', $names);
?>
The names array is changed to
array('derick rethans', 'stig sæther bakken', 'andi gutmans');
However, names usually start with an uppercase letter. You can uppercase the first letter by using either the /e modifier or preg_replace_callback(). The /e modifier causes the replacement string to be evaluated as PHP code; the return value of that code is used as the replacement:
<?php
$names = array(
'rethans, derick',
'sæther bakken, stig',
'gutmans, andi'
);
$names = preg_replace('@([^,]+).\ (.*)@e', 'ucwords("\\2 \\1")', $names);
?>
If you need to do more complex manipulation with the matched patterns, evaluating replacement strings becomes complicated. You can use the preg_replace_callback() function instead:
<?php
function format_string($matches)
{
return ucwords("{$matches[2]} {$matches[1]}");
}
$names = array(
'rethans, derick',
'sæther bakken, stig',
'gutmans, andi'
);
$names = preg_replace_callback(
'@([^,]+).\ (.*)@', // pattern
'format_string', // callback function
$names // array with 'subjects'
);
print_r($names);
?>
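The callback does not have to be a named function. Assuming PHP 5.3 or later, which supports anonymous functions, the same replacement can be sketched with the callback written inline. The pattern below uses a literal comma instead of the dot, which is our own small change.

```php
<?php
$names = array(
    'rethans, derick',
    'gutmans, andi',
);

// The anonymous function receives the same $matches array that a
// named callback would receive.
$names = preg_replace_callback(
    '@([^,]+),\ (.*)@',
    function ($matches) {
        return ucwords("{$matches[2]} {$matches[1]}");
    },
    $names
);

print_r($names);
```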
Here's one more useful example:
<?php
$show_with_vat = true;
$format = '€ %.2f';
$exchange_rate = 1.2444;
function currency_output_vat ($data)
{
$price = $data[1];
$vat_percent = $data[2];
$show_vat = isset ($GLOBALS['show_with_vat']) &&
$GLOBALS['show_with_vat'];
$amount = ($show_vat)
? $price * (1 + $vat_percent / 100)
: $price;
return sprintf(
$GLOBALS['format'],
$amount / $GLOBALS['exchange_rate']
);
}
$data = "This item costs {amount: 27.95 %19%} ".
"and the other one costs {amount: 29.95 %0%}.\n";
echo preg_replace_callback (
'/\{amount\:\ ([0-9.]+)\ \%([0-9.]+)\%\}/',
'currency_output_vat',
$data
);
?>
This example originates from a webshop where the format and exchange rate are decoupled from the text, which is stored in a cache file. With this solution, it is possible to use caching techniques and still have a dynamic exchange rate.
preg_replace() and preg_replace_callback() allow the pattern to be an array of patterns. When an array is passed as the first parameter, every pattern is matched against the subject. preg_replace() also enables you to pass an array for the replacement string when the first parameter is an array with patterns:
<?php
$text = "This is a nice text; with punctuation AND capitals";
$patterns = array('@[A-Z]@e', '@[\W]@', '@_+@');
$replacements = array('strtolower("\\0")', '_', '_');
$text = preg_replace($patterns, $replacements, $text);
echo $text."\n";
?>
The first pattern @[A-Z]@e matches any uppercase character and, because the e modifier is used, the accompanying replacement string strtolower("\\0") is evaluated as PHP code. The second pattern @[\W]@ matches all non-word characters and, because the second replacement string is simply _, all non-word characters are replaced by the underscore (_). Because the replacements are done in order, the third pattern matches the already modified subject, replacing each run of multiple underscores with a single one. Table 9.4 shows the subject string after each pattern/replacement step.
Table 9.4. Replacement Steps
Step | Result |
---|
Before: | This is a nice text; with punctuation AND capitals |
Step 1: | this is a nice text; with punctuation and capitals |
Step 2: | this_is_a_nice_text__with_punctuation_and_capitals |
Step 3: | this_is_a_nice_text_with_punctuation_and_capitals |
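The same three steps can also be written without evaluating code in the replacement string. The following sketch, which assumes PHP 5.3 or later for the anonymous function, does the first step with preg_replace_callback() and the remaining two with an array of patterns.

```php
<?php
$text = "This is a nice text; with punctuation AND capitals";

// Step 1: lowercase every uppercase letter through a callback.
$text = preg_replace_callback(
    '@[A-Z]@',
    function ($matches) {
        return strtolower($matches[0]);
    },
    $text
);

// Steps 2 and 3: plain replacements, applied in order.
$text = preg_replace(
    array('@\W@', '@_+@'),
    array('_', '_'),
    $text
);

echo $text . "\n";
```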
9.3.2.3 Splitting Strings
The last group of functions includes only preg_split(), which can be used to split a string into substrings by using a regular expression match for the delimiters. PHP provides an explode() function that also splits strings, but explode() can only use a simple string as the delimiter. explode() is much faster than using a regular expression, so you might be better off using explode() when possible. A simple example of preg_split()'s usage might be to split a string into the words it contains. See the following example:
<?php
$str = 'This is an example for preg_split().';
$words = preg_split('@[\W]+@', $str);
print_r($words);
?>
The script outputs
Array
(
[0] => This
[1] => is
[2] => an
[3] => example
[4] => for
[5] => preg_split
[6] =>
)
As you can see, the last element is empty. By default, the function returns empty elements, too. The character(s) before the end of the string are non-word characters, so they act as a delimiter, resulting in an empty element. You can pass two more parameters to the preg_split() function: a limit and a flag. The "limit" parameter controls how many elements are returned before the splitting stops. In the next example, the limit is 2, so two elements are returned:
<?php
$str = 'This is an example for preg_split().';
$words = preg_split('@[\W]+@', $str, 2);
print_r($words);
?>
The output is
Array
(
[0] => This
[1] => is an example for preg_split().
)
In the next example, we use -1 as the limit. -1 means that there is no limit at all, and allows us to pass flags without shortening our output array. Three flags specify what is returned:
PREG_SPLIT_NO_EMPTY.
Prevents empty elements from ending up in the returned array:
<?php
$str = 'This is an example.';
$words = preg_split('@[\W]+@', $str, -1, PREG_SPLIT_NO_EMPTY);
print_r($words);
?>
The script outputs
Array
(
[0] => This
[1] => is
[2] => an
[3] => example
)
PREG_SPLIT_DELIM_CAPTURE.
Returns the delimiters themselves, but only if the delimiter pattern is surrounded by parentheses. We combine the flag with PREG_SPLIT_NO_EMPTY:
<?php
$str = 'This is an example.';
$words = preg_split(
'@([\W]+)@', $str, -1,
PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
);
print_r($words);
?>
The script outputs
Array
(
[0] => This
[1] =>
[2] => is
[3] =>
[4] => an
[5] =>
[6] => example
[7] => .
)
PREG_SPLIT_OFFSET_CAPTURE.
Specifies that the function return a two-dimensional array containing both the text and the offset in the string where the element started. In this example, we combine all three flags:
<?php
$str = 'This is an example.';
$words = preg_split(
'@([\W]+)@', $str, -1,
PREG_SPLIT_OFFSET_CAPTURE |
PREG_SPLIT_DELIM_CAPTURE |
PREG_SPLIT_NO_EMPTY
);
var_export($words);
?>
The script outputs (reformatted):
array (
0 => array ( 0 => 'This', 1 => 0 ),
1 => array ( 0 => ' ', 1 => 4 ),
2 => array ( 0 => 'is', 1 => 5 ),
3 => array ( 0 => ' ', 1 => 7 ),
4 => array ( 0 => 'an', 1 => 8 ),
5 => array ( 0 => ' ', 1 => 10 ),
6 => array ( 0 => 'example', 1 => 11 ),
7 => array ( 0 => '.', 1 => 18 ),
)
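A typical case where explode() is not enough is a delimiter with a variable amount of whitespace around it. The following sketch splits such a list with preg_split(); the input strings are made up for illustration.

```php
<?php
$line = 'apples ,  pears,bananas ,, kiwis';

// The delimiter is a comma with optional whitespace on either side.
// PREG_SPLIT_NO_EMPTY drops the empty element between the two commas.
$items = preg_split('@\s*,\s*@', $line, -1, PREG_SPLIT_NO_EMPTY);
print_r($items);

// With a limit of 2, the remainder of the string is returned unsplit.
$head_and_rest = preg_split('@\s*,\s*@', 'a, b, c', 2);
print_r($head_and_rest);
```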