
Regular Expression


A regular expression is a way of describing a pattern of characters to a program so that it can display or modify the occurrences of that pattern. The UNIX utility grep (whose name is commonly expanded as Global Regular Expression Print) is the most basic UNIX tool that supports regular expressions; given a regular expression and one or more file names as arguments (in that order), grep writes to standard output each line of the files that contains a match for the regular expression.

Regular expressions are a very powerful tool for searching and processing text, and even in their simplest form can make many seemingly complex tasks very easy.

Fortunately, Perl successfully integrates the power of regular expressions. Its flexible pattern matching ability gives it a great advantage over other languages (such as C) in the CGI world. For example, one statement is normally enough for Perl to process one string, while C would probably require many statements to implement the same task.

An example is converting every + (plus) sign in a string back to a space. (Recall that input data is encoded before it is sent to the CGI program, and that the encoding turns each space into a plus sign. After we receive the input data, we need to convert each plus sign back to the original space. This is usually the first step after receiving input from the client.) In Perl a single statement:

  $str=~tr/+/ /;
does the job, while in C we need a function:
 void ADDToSpace(char *str)
  {
   register int i;
   /* walk the string and replace each '+' with a space */
   for(i=0;str[i];i++)
     if(str[i]=='+')
        str[i]=' ';
 }
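
As a rough sketch of how that first decoding step might look inside a CGI script, the helper below converts plus signs and then also restores %XX hex escapes. The name decode_value and the hex-escape step are assumptions added for illustration; they are not part of the tutorial's own code. The s operator with its 'e' and 'g' modifiers is described later in this section.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the tutorial): decode one encoded form value.
sub decode_value {
    my ($str) = @_;
    $str =~ tr/+/ /;                               # plus signs back to spaces
    $str =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # %XX escapes back to characters
    return $str;
}

print decode_value("I+love+Perl%21"), "\n";   # prints "I love Perl!"
```

The plus signs are converted before the hex escapes so that an encoded plus (%2B) survives as a literal '+'.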

Syntax for regular expression

Regular expression format:
    /pattern/
The following patterns are commonly used:

/pattern/    Description
x?           match zero or one 'x'
x*           match zero or more 'x' characters
.*           match zero or more of any character (except newline)
x+           match one or more 'x' characters
.+           match one or more of any character (except newline)
x{m}         match exactly m 'x' characters
[]           match any one character listed inside []
[^]          match any one character not listed inside []
[0-9]        match any digit from '0' to '9'
[a-z]        match any character from 'a' to 'z'
[^0-9]       match any character outside '0' to '9'
[^a-z]       match any character outside 'a' to 'z'
^            match at the start of the string
$            match at the end of the string
\d           match one digit, same as [0-9]
\d+          match one or more digits, same as [0-9]+
\D           match one non-digit, same as [^0-9]
\D+          match one or more non-digits, same as [^0-9]+
\w           match one word character (letter, digit, or underscore), same as [a-zA-Z0-9_]
\w+          match one or more word characters, same as [a-zA-Z0-9_]+
\W           match one non-word character, same as [^a-zA-Z0-9_]
\W+          match one or more non-word characters, same as [^a-zA-Z0-9_]+
\s           match one whitespace character, same as [ \n\t\r\f]
\s+          match one or more whitespace characters, same as [ \n\t\r\f]+
\S           match one non-whitespace character, same as [^ \n\t\r\f]
\S+          match one or more non-whitespace characters, same as [^ \n\t\r\f]+
a|b|c        match 'a' or 'b' or 'c'
abc          match the substring "abc"
(pattern)    () is a very useful operator which remembers the string it matched. The string matched by the first () is assigned to $1; the second, to $2, and so on. I will give an example later on.
/pattern/i   match the pattern without regard to uppercase or lowercase
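
To see a few of these patterns at work, here is a small sketch that sorts strings into rough categories. The sub name classify and the sample strings are made up for illustration; only the patterns come from the table above.

```perl
use strict;
use warnings;

# Hypothetical helper: label a string using patterns from the table.
sub classify {
    my ($text) = @_;
    return "blank"           if $text =~ /^\s*$/;     # only whitespace, or empty
    return "number"          if $text =~ /^\d+$/;     # one or more digits, nothing else
    return "lowercase word"  if $text =~ /^[a-z]+$/;  # only 'a' to 'z'
    return "word characters" if $text =~ /^\w+$/;     # letters, digits, underscore
    return "other";
}

for my $sample ("711", "chmod", "Zhanshou_1", "i love perl", "   ") {
    print "'$sample' => ", classify($sample), "\n";
}
```

The ^ and $ anchors make each pattern describe the whole string; without them, /\d+/ would also accept "abc711" because it only needs to find digits somewhere in the string.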

Regular Expression Example

If this is all new to you, the above pattern table was probably quite confusing. Here are some examples:

Example      Description
/perl/       match a string that contains the substring "perl"
/^perl/      match a string that starts with "perl"
/perl$/      match a string that ends with "perl"
/c|g|i/      match a string that contains 'c' or 'g' or 'i'
/cg{2,4}i/   match a string that contains 'c' followed by 2 to 4 'g' characters followed by 'i'
/cg*i/       match a string that contains 'c' followed by zero or more 'g' characters followed by 'i'
/c..i/       match a string that contains 'c' followed by any two characters followed by 'i'
/[cgi]/      match a string that contains 'c' or 'g' or 'i'
/\d/         match a string that contains at least one digit
/\W/         match a string that contains at least one non-word character
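
One way to check how these examples behave is to run several of them against a few test strings. The sketch below does that; the sub name matching_patterns and the sample strings are invented for this illustration.

```perl
use strict;
use warnings;

# Patterns taken from the example table; the sample strings are made up.
my @patterns = ('^perl', 'perl$', 'cg{2,4}i', 'cg*i', 'c..i');

# Return the patterns from @patterns that match the given text.
sub matching_patterns {
    my ($text) = @_;
    return grep { $text =~ /$_/ } @patterns;
}

for my $sample ("perl cgi", "cggi", "ci", "i love perl") {
    print "'$sample' matches: ", join(", ", matching_patterns($sample)), "\n";
}
```

Note that "perl cgi" matches /cg*i/ but not /cg{2,4}i/, because its "cgi" has only one 'g', and it does not match /c..i/ because only one character sits between the 'c' and the 'i'.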

Operators and Functions

  • =~ (Pattern matching operator)

    The pattern matching operator ( =~ ) lets us examine a scalar variable and test whether its string contains a particular pattern. Example:

      print "Please input a string: \n";
      $string=<STDIN>;                   #accept an input string from standard input
      chop($string);                     #built-in function that removes the last character (here, the newline)
      if($string=~/cgi/){
       print "The input string includes the substring cgi! \n";
      }else{
       print "The input string does not include the substring cgi! \n";
      }
    
  • !~ (Pattern not matching operator)

    The pattern not matching operator ( !~ ) is the negation of the pattern matching operator ( =~ ): it is true when the string does not match the pattern.
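
A minimal sketch of !~ in use, assuming we want to reject any input that is not made up entirely of digits. The sub name is_invalid_number and the sample inputs are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical check: true (1) when the input is NOT made up entirely of digits.
sub is_invalid_number {
    my ($input) = @_;
    return $input !~ /^\d+$/ ? 1 : 0;
}

for my $input ("711", "7a1") {
    if (is_invalid_number($input)) {
        print "'$input' is not a number\n";
    } else {
        print "'$input' is a number\n";
    }
}
```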

  • /pattern/

    Let's take a look at several examples to see how it works:
    • Example 1
         $string="chmod 711 cgi";
         $string=~/(\w+)\s+(\d+)/;
       
      The (\w+) matches one or more word characters (letters, digits, or underscore), and the matched substring is assigned to the variable $1. The \s+ matches one or more whitespace characters. The (\d+) matches one or more digits, which are assigned to $2. So now $1="chmod"; $2="711". Note that () is the remembering operator listed in the table above.

    • Example 2
        $_="chmod 711 cgi";
        /(\w+)\s+(\d+)/;  
       
      We get the same result as in Example 1. Note that if we do not specify a string to operate on, the default variable $_ is used.

    • Example 3
       
          $string="chmod 711 cgi";
          @list=split(/\s+/,$string); #split string using space 
        
      Now we get:
       @list=("chmod","711","cgi");
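
Since $1 and $2 are only meaningful after a successful match, it is usually safer to test the match before using them. The sketch below shows one way to do that by capturing directly into named variables; the sub name parse_command is an assumption made for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: return (command, mode) from a line such as
# "chmod 711 cgi", or an empty list when the line does not match.
sub parse_command {
    my ($line) = @_;
    # In list context a match returns the strings captured by each ().
    if (my ($cmd, $mode) = $line =~ /(\w+)\s+(\d+)/) {
        return ($cmd, $mode);
    }
    return;
}

my @parts = parse_command("chmod 711 cgi");
print "command=$parts[0] mode=$parts[1]\n" if @parts;   # prints "command=chmod mode=711"
```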
        
  • tr (Translation function)

    Syntax:
    tr/SEARCHLIST/REPLACELIST/
    This translates each character in SEARCHLIST to the character at the same position in REPLACELIST. Here are two examples:
    1.         $string="testing";
              $string=~tr/et/ET/;   #now $string="TEsTing"
              $string=~tr/a-z/A-Z/; #now $string="TESTING"
       
    2.         $string="CGI+Perl";
              $string=~tr/+/ /;     #now $string="CGI Perl"
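
tr also returns the number of characters it matched, which makes it handy for counting. The following sketch illustrates this alongside the translations shown above; the sub names and the sample query string are hypothetical.

```perl
use strict;
use warnings;

# Count the plus signs in a string. With an empty REPLACELIST and no
# modifiers, tr leaves the string unchanged and just returns the count.
sub count_plus {
    my ($str) = @_;
    return ($str =~ tr/+//);
}

# Return a copy with plus signs turned into spaces and letters upper-cased.
sub decode_and_upcase {
    my ($str) = @_;
    $str =~ tr/+/ /;
    $str =~ tr/a-z/A-Z/;
    return $str;
}

my $query = "CGI+Perl+tutorial";
print count_plus($query), "\n";          # prints 2
print decode_and_upcase($query), "\n";   # prints "CGI PERL TUTORIAL"
```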
       
  • s (Substitution function)

    Syntax:
    s/PATTERN/REPLACE/eg
    This replaces the text matched by PATTERN with REPLACE. The trailing 'e' and 'g' are optional modifiers:
    • 'g' replaces every match of PATTERN in the string; without it, only the first occurrence is replaced.
    • 'e' evaluates the REPLACE part as a Perl expression instead of treating it as an ordinary string.
    Here are some examples:
    • Example 1:
      $string="i:love:perl";
      $string=~s/:/*/;       # now $string="i*love:perl"
      $string=~s/:/*/;       # now $string="i*love*perl"
      $string=~s/\*/+/g;     # now $string="i+love+perl" (* must be escaped in PATTERN)
      $string=~s/\+/ /g;     # now $string="i love perl" (+ must be escaped in PATTERN)
      $string=~s/perl/cgi/;  # now $string="i love cgi"
       
    • Example 2:
      $string="i love perl";
      $string=~s/(love)/<$1>/; # now $string="i <love> perl"
       # Here the first match "love" is assigned to $1 
      
    • Example 3:
      $string="www22cgi44";
      $string=~s/(\d+)/$1*2/e; # now $string="www44cgi44"
      # the 'e' modifier makes $1*2 an expression to evaluate instead of an ordinary string;
      # without 'g', only the first number is doubled
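
To illustrate the two modifiers together, here is a short sketch that doubles every number in a string and also uses the value returned by s, which is the number of substitutions made. The sub names are made up for this example.

```perl
use strict;
use warnings;

# Double every run of digits: 'e' evaluates $1*2, 'g' repeats for all matches.
sub double_all_numbers {
    my ($str) = @_;
    $str =~ s/(\d+)/$1*2/eg;
    return $str;
}

# Replace every colon with '*' and report how many were replaced.
sub replace_colons {
    my ($str) = @_;
    my $count = ($str =~ s/:/*/g);   # s returns the number of substitutions
    return ($str, $count);
}

print double_all_numbers("www22cgi44"), "\n";    # prints "www44cgi88"
my ($result, $count) = replace_colons("i:love:perl");
print "$result ($count replaced)\n";             # prints "i*love*perl (2 replaced)"
```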
       
