PHP regular expressions examples

The regular expression, as a pattern, can match all kinds of text strings helping your application validate, compare, compute, decide etc. It can do simple or very complex string manipulations. The list of possibilities is enormous when it comes to what you can achieve using regular expressions. You can take any phrase that starts with an “A” or any character and do various things with it. You can match phone numbers, email addresses, url’s, credit card numbers, social security numbers, zip codes, states, cities…..(the list never ends). A huge script that is supposed to validate a user input and prepare it for sql can be reduced to only one line with the help of preg_replace.

Mastering Regular Expressions quickly covers the basics of regular-expression syntax, then delves into the mechanics of expression-processing, common pitfalls, performance issues, and implementation-specific differences. Written in an engaging style and sprinkled with solutions to complex real-world problems, MRE offers a wealth information that you use.

I will start with some simple usage examples of the regular expressions and continue with a huge list of cases for various situations where we would normally need a regex to operate. We will use simple functions which return TRUE or FALSE. $regex will serve as our regular expression to match against and $text will be our text (pretty obvious):

function do_reg($text, $regex)
{
	if (preg_match($regex, $text)) {
		return TRUE;
	} 
	else {
		return FALSE;
	}
}

The next function will get the part of a given string ($text) matched by the regex ($regex) using a group srorage ($regs). By changing the $regs[0] to $regs[1] we can use a capturing group (in this case griup 1) to match against. The capturing group can also have a name ($regs[‘groupname’]):

function do_reg($text, $regex, $regs)
{
	if (preg_match($regex, $text, $regs)) {
		$result = $regs[0];
	} 
	else {
		$result = "";
	}
	return $result;
}

The following function will return an array of all regex matches in a given string ($text):

function do_reg($text, $regex)
{
	preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
	return $result = $result[0];
}

Next we can iterate (loop) over all matches in a string ($text) and output the results:

function do_reg($text, $regex)
{
	preg_match_all($regex, $text, $result, PREG_PATTERN_ORDER);
	for ($i = 0; $i < count($result[0]); $i++) {
	$result[0][$i];
}
}

Extending the above one we can iterate over all matches ($text) and capture groups in a string ($text):

function do_reg($text, $regex)
{
	preg_match_all($regex, $text, $result, PREG_SET_ORDER);
	for ($matchi = 0; $matchi < count($result); $matchi++) {
		for ($backrefi = 0; $backrefi < count($result[$matchi]); $backrefi++) {
			$result[$matchi][$backrefi];
		} 
	}
}
}

REGULAR EXPRESSION EXAMPLES BY SITUATIONS AND NEEDS:
Addresses

//Address: State code (US)
'/\\b(?:A[KLRZ]|C[AOT]|D[CE]|FL|GA|HI|I[ADLN]|K[SY]|LA|M[ADEINOST]|N[CDEHJMVY]|O[HKR]|PA|RI|S[CD]|T[NX]|UT|V[AT]|W[AIVY])\\b/'

//Address: ZIP code (US)
'\b[0-9]{5}(?:-[0-9]{4})?\b'

Columns

//Columns: Match a regex starting at a specific column on a line.
'^.{%SKIPAMOUNT%}(%REGEX%)'

//Columns: Range of characters on a line, captured into backreference 1
//Iterate over all matches to extract a column of text from a file
//E.g. to grab the characters in colums 8..10, set SKIPAMOUNT to 7, and CAPTUREAMOUNT to 3
'^.{%SKIPAMOUNT%}(.{%CAPTUREAMOUNT%})'

Credit cards

//Credit card: All major cards
'^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|6011[0-9]{12}|3(?:0[0-5]|[68][0-9])[0-9]{11}|3[47][0-9]{13})$'

//Credit card: American Express
'^3[47][0-9]{13}$'

//Credit card: Diners Club
'^3(?:0[0-5]|[68][0-9])[0-9]{11}$'

//Credit card: Discover
'^6011[0-9]{12}$'

//Credit card: MasterCard
'^5[1-5][0-9]{14}$'

//Credit card: Visa
'^4[0-9]{12}(?:[0-9]{3})?$'

//Credit card: remove non-digits
'/[^0-9]+/'

CSV

//CSV: Change delimiter
//Changes the delimiter from a comma into a tab.
//The capturing group makes sure delimiters inside double-quoted entries are ignored.
'("[^"\r\n]*")?,(?![^",\r\n]*"$)'

//CSV: Complete row, all fields.
//Match complete rows in a comma-delimited file that has 3 fields per row, 
//capturing each field into a backreference.  
//To match CSV rows with more or fewer fields, simply duplicate or delete the capturing groups.
'^("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)$'

//CSV: Complete row, certain fields.
//Set %SKIPLEAD% to the number of fields you want to skip at the start, and %SKIPTRAIL% to 
//the number of fields you want to ignore at the end of each row.  
//This regex captures 3 fields into backreferences.  To capture more or fewer fields, 
//simply duplicate or delete the capturing groups.
'^(?:(?:"[^"\r\n]*"|[^,\r\n]*),){%SKIPLEAD%}("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)(?:(?:"[^"\r\n]*"|[^,\r\n]*),){%SKIPTRAIL%}$'

//CSV: Partial row, certain fields
//Match the first SKIPLEAD+3 fields of each rows in a comma-delimited file that has SKIPLEAD+3 
//or more fields per row.  The 3 fields after SKIPLEAD are each captured into a backreference.  
//All other fields are ignored.  Rows that have less than SKIPLEAD+3 fields are skipped.  
//To capture more or fewer fields, simply duplicate or delete the capturing groups.
'^(?:(?:"[^"\r\n]*"|[^,\r\n]*),){%SKIPLEAD%}("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)'

//CSV: Partial row, leading fields
//Match the first 3 fields of each rows in a comma-delimited file that has 3 or more fields per row.  
//The first 3 fields are each captured into a backreference.  All other fields are ignored.  
//Rows that have less than 3 fields are skipped.  To capture more or fewer fields, 
//simply duplicate or delete the capturing groups.
'^("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)'

//CSV: Partial row, variable leading fields
//Match the first 3 fields of each rows in a comma-delimited file.  
//The first 3 fields are each captured into a backreference.
//All other fields are ignored.  If a row has fewer than 3 field, some of the backreferences 
//will remain empty.  To capture more or fewer fields, simply duplicate or delete the capturing groups.  
//The question mark after each group makes that group optional.
'^("[^"\r\n]*"|[^,\r\n]*),("[^"\r\n]*"|[^,\r\n]*)?,("[^"\r\n]*"|[^,\r\n]*)?'

Dates

//Date d/m/yy and dd/mm/yyyy
//1/1/00 through 31/12/99 and 01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'\b(0?[1-9]|[12][0-9]|3[01])[- /.](0?[1-9]|1[012])[- /.](19|20)?[0-9]{2}\b'

//Date dd/mm/yyyy
//01/01/1900 through 31/12/2099
//Matches invalid dates such as February 31st
'(0[1-9]|[12][0-9]|3[01])[- /.](0[1-9]|1[012])[- /.](19|20)[0-9]{2}'

//Date m/d/y and mm/dd/yyyy
//1/1/99 through 12/31/99 and 01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
//Accepts dashes, spaces, forward slashes and dots as date separators
'\b(0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])[- /.](19|20)?[0-9]{2}\b'

//Date mm/dd/yyyy
//01/01/1900 through 12/31/2099
//Matches invalid dates such as February 31st
'(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)[0-9]{2}'

//Date yy-m-d or yyyy-mm-dd
//00-1-1 through 99-12-31 and 1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'\b(19|20)?[0-9]{2}[- /.](0?[1-9]|1[012])[- /.](0?[1-9]|[12][0-9]|3[01])\b'

//Date yyyy-mm-dd
//1900-01-01 through 2099-12-31
//Matches invalid dates such as February 31st
'(19|20)[0-9]{2}[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])'

Delimiters

//Delimiters: Replace commas with tabs
//Replaces commas with tabs, except for commas inside double-quoted strings
'((?:"[^",]*+")|[^,]++)*+,'

Email addresses

//Email address
//Use this version to seek out email addresses in random documents and texts.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.  
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b'

//Email address (anchored)
//Use this anchored version to check if a valid email address was entered.
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.
//Requires the "case insensitive" option to be ON.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$'

//Email address (anchored; no consecutive dots)
//Use this anchored version to check if a valid email address was entered.
//Improves on the original email address regex by excluding addresses with consecutive dots such as [email protected]
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.  
//Including these increases the risk of false positives when applying the regex to random documents.
'^[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}$'

//Email address (no consecutive dots)
//Use this version to seek out email addresses in random documents and texts.
//Improves on the original email address regex by excluding addresses with consecutive dots such as [email protected]
//Does not match email addresses using an IP address instead of a domain name.
//Does not match email addresses on new-fangled top-level domains with more than 4 letters such as .museum.  
//Including these increases the risk of false positives when applying the regex to random documents.
'\b[A-Z0-9._%-]+@(?:[A-Z0-9-]+\.)+[A-Z]{2,4}\b'

//Email address (specific TLDs)
//Does not match email addresses using an IP address instead of a domain name.
//Matches all country code top level domains, and specific common top level domains.
'^[A-Z0-9._%-]+@[A-Z0-9.-]+\.(?:[A-Z]{2}|com|org|net|biz|info|name|aero|biz|info|jobs|museum|name)$'

//Email address: Replace with HTML link
'\b(?:mailto:)?([A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b'

HTML

//HTML comment
'<!--.*?-->'

//HTML file
//Matches a complete HTML file.  Place round brackets around the .*? parts you want to extract from the file.
//Performance will be terrible on HTML files that miss some of the tags 
//(and thus won't be matched by this regular expression).  Use the atomic version instead when your search 
//includes such files (the atomic version will also fail invalid files, but much faster).
'<html>.*?<head>.*?<title>.*?</title>.*?</head>.*?<body[^>]*>.*?</body>.*?</html>'

//HTML file (atomic)
//Matches a complete HTML file.  Place round brackets around the .*? parts you want to extract from the file.
//Atomic grouping maintains the regular expression's performance on invalid HTML files.
'<html>(?>.*?<head>)(?>.*?<title>)(?>.*?</title>)(?>.*?</head>)(?>.*?<body[^>]*>)(?>.*?</body>).*?</html>'

//HTML tag
//Matches the opening and closing pair of whichever HTML tag comes next.
//The name of the tag is stored into the first capturing group.
//The text between the tags is stored into the second capturing group.
'<([A-Z][A-Z0-9]*)[^>]*>(.*?)</\1>'

//HTML tag
//Matches the opening and closing pair of a specific HTML tag.
//Anything between the tags is stored into the first capturing group.
//Does NOT properly match tags nested inside themselves.
'<%TAG%[^>]*>(.*?)</%TAG%>'

//HTML tag
//Matches any opening or closing HTML tag, without its contents.
'</?[a-z][a-z0-9]*[^<>]*>'

 

IP addresses

//IP address
//Matches 0.0.0.0 through 999.999.999.999
//Use this fast and simple regex if you know the data does not contain invalid IP addresses.
'\b([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\b'

//IP address
//Matches 0.0.0.0 through 999.999.999.999
//Use this fast and simple regex if you know the data does not contain invalid IP addresses, 
//and you don't need access to the individual IP numbers.
'\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'

//IP address
//Matches 0.0.0.0 through 255.255.255.255
//Use this regex to match IP numbers with accurracy, without access to the individual IP numbers.
'\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

//IP address
//Matches 0.0.0.0 through 255.255.255.255
//Use this regex to match IP numbers with accurracy.
//Each of the 4 numbers is stored into a capturing group, so you can access them for further processing.
'\b(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b'

Lines

//Lines: Absolutely blank (no whitespace)
//Regex match does not include line break after the line.
'^$'

//Lines: Blank (may contain whitespace)
//Regex match does not include line break after the line.
'^[ \t]*$'

//Lines: Delete absolutely blank lines
//Regex match includes line break after the line.
'^\r?\n'

//Lines: Delete blank lines
//Regex match includes line break after the line.
'^[ \t]*$\r?\n'

//Lines: Delete duplicate lines
//This regex matches two or more lines, each identical to the first line.  
//It deletes all of them, except the first.
'^(.*)(\r?\n\1)+$'

//Lines: Truncate a line after a regex match.
//The regex you specify is guaranteed to match only once on each line.  
//If the original regex you specified should match more than once, 
//the line will be truncated after the last match.
preg_replace('^.*(%REGEX%)(.*)$', '$1$2', $text);

//Lines: Truncate a line before a regex match.
//If the regex matches more than once on the same line, everything before the last match is deleted.
preg_replace('^.*(%REGEX%)', '$1', $text);

//Lines: Truncate a line before and after a regex match.
//This will delete everything from the line not matched by the regular expression.
preg_replace('^.*(%REGEX%).*$', '$1', $text);

Logs

//Logs: Apache web server
//Successful hits to HTML files only.  Useful for counting the number of page views.
'^((?#client IP or domain name)\S+)\s+((?#basic authentication)\S+\s+\S+)\s+\[((?#date and time)[^]]+)\]\s+"(?:GET|POST|HEAD) ((?#file)/[^ ?"]+?\.html?)\??((?#parameters)[^ ?"]+)? HTTP/[0-9.]+"\s+(?#status code)200\s+((?#bytes transferred)[-0-9]+)\s+"((?#referrer)[^"]*)"\s+"((?#user agent)[^"]*)"$'

//Logs: Apache web server
//404 errors only
'^((?#client IP or domain name)\S+)\s+((?#basic authentication)\S+\s+\S+)\s+\[((?#date and time)[^]]+)\]\s+"(?:GET|POST|HEAD) ((?#file)[^ ?"]+)\??((?#parameters)[^ ?"]+)? HTTP/[0-9.]+"\s+(?#status code)404\s+((?#bytes transferred)[-0-9]+)\s+"((?#referrer)[^"]*)"\s+"((?#user agent)[^"]*)"$'

Numbers

//Number: Currency amount
//Optional thousands separators; optional two-digit fraction
'\b[0-9]{1,3}(?:,?[0-9]{3})*(?:\.[0-9]{2})?\b'

//Number: Currency amount
//Optional thousands separators; mandatory two-digit fraction
'\b[0-9]{1,3}(?:,?[0-9]{3})*\.[0-9]{2}\b'

//Number: floating point
//Matches an integer or a floating point number with mandatory integer part.  The sign is optional.
'[-+]?\b[0-9]+(\.[0-9]+)?\b'

//Number: floating point
//Matches an integer or a floating point number with optional integer part.  The sign is optional.
'[-+]?\b[0-9]*\.?[0-9]+\b'

//Number: hexadecimal (C-style)
'\b0[xX][0-9a-fA-F]+\b'

//Number: Insert thousands separators
//Replaces 123456789.00 with 123,456,789.00
'(?<=[0-9])(?=(?:[0-9]{3})+(?![0-9]))'

//Number: integer
//Will match 123 and 456 as separate integer numbers in 123.456
'\b\d+\b'

//Number: integer
//Does not match numbers like 123.456
'(?<!\S)\d++(?!\S)'

//Number: integer with optional sign
'[-+]?\b\d+\b'

//Number: scientific floating point
//Matches an integer or a floating point number.
//Integer and fractional parts are both optional.
'[-+]?(?:\b[0-9]+(?:\.[0-9]*)?|\.[0-9]+\b)(?:[eE][-+]?[0-9]+\b)?'

//Number: scientific floating point
//Matches an integer or a floating point number with optional integer part.
//Both the sign and exponent are optional.
'[-+]?\b[0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?\b'

Passwords

//Password complexity
//Tests if the input consists of 6 or more letters, digits, underscores and hyphens.
//The input must contain at least one upper case letter, one lower case letter and one digit.
'\A(?=[-_a-zA-Z0-9]*?[A-Z])(?=[-_a-zA-Z0-9]*?[a-z])(?=[-_a-zA-Z0-9]*?[0-9])[-_a-zA-Z0-9]{6,}\z'

//Password complexity
//Tests if the input consists of 6 or more characters.
//The input must contain at least one upper case letter, one lower case letter and one digit.
'\A(?=[-_a-zA-Z0-9]*?[A-Z])(?=[-_a-zA-Z0-9]*?[a-z])(?=[-_a-zA-Z0-9]*?[0-9])\S{6,}\z'

File paths

//Path: Windows
'\b[a-z]:\\[^/:*?"<>|\r\n]*'

//Path: Windows
//Different elements of the path are captured into backreferences.
'\b((?#drive)[a-z]):\\((?#folder)[^/:*?"<>|\r\n]*\\)?((?#file)[^\\/:*?"<>|\r\n]*)'

//Path: Windows or UNC
'(?:(?#drive)\b[a-z]:|\\\\[a-z0-9]+)\\[^/:*?"<>|\r\n]*'

//Path: Windows or UNC
//Different elements of the path are captured into backreferences.
'((?#drive)\b[a-z]:|\\\\[a-z0-9]+)\\((?#folder)[^/:*?"<>|\r\n]*\\)?((?#file)[^\\/:*?"<>|\r\n]*)'

Phone numbers

//Phone Number (North America)
//Matches 3334445555, 333.444.5555, 333-444-5555, 333 444 5555, (333) 444 5555 and all combinations thereof.
//Replaces all those with (333) 444-5555
preg_replace('\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})', '(\1) \2-\3', $text);

//Phone Number (North America)
//Matches 3334445555, 333.444.5555, 333-444-5555, 333 444 5555, (333) 444 5555 and all combinations thereof.
'\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'

Postal codes

//Postal code (Canada)
'\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b'

//Postal code (UK)
'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b'

Programming

//Programming: # comment
//Single-line comment started by # anywhere on the line
'#.*$'

//Programming: # preprocessor statement
//Started by # at the start of the line, possibly preceded by some whitespace.
'^\s*#.*$'

//Programming: /* comment */
//Does not match nested comments.  Most languages, including C, Java, C#, etc. 
//do not allow comments to be nested.  I.e. the first */ closes the comment.
'/\*.*?\*/'

//Programming: // comment
//Single-line comment started by // anywhere on the line
'//.*$'

//Programming: GUID
//Microsoft-style GUID, numbers only.
'[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}'

//Programming: GUID
//Microsoft-style GUID, with optional parentheses or braces.
//(Long version, if your regex flavor doesn't support conditionals.)
'[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}|\([A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}\)|\{[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}\}'

//Programming: GUID
//Microsoft-style GUID, with optional parentheses or braces.
//Short version, illustrating the use of regex conditionals.  Not all regex flavors support conditionals.  
//Also, when applied to large chunks of data, the regex using conditionals will likely be slower 
//than the long version.  Straight alternation is much easier to optimize for a regex engine.
'(?:(\()|(\{))?[A-Z0-9]{8}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{4}-[A-Z0-9]{12}(?(1)\))(?(2)\})'

//Programming: Remove escapes
//Remove backslashes used to escape other characters
preg_replace('\\(.)', '\1', $text);

//Programming: String
//Quotes may appear in the string when escaped with a backslash.
//The string may span multiple lines.
'"[^"\\]*(?:\\.[^"\\]*)*"'

//Programming: String
//Quotes may appear in the string when escaped with a backslash.
//The string cannot span multiple lines.
'"[^"\\\r\n]*(?:\\.[^"\\\r\n]*)*"'

//Programming: String
//Quotes may not appear in the string.  The string cannot span multiple lines.
'"[^"\r\n]*"'

Quotes

//Quotes: Replace smart double quotes with straight double quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace('[\x84\x93\x94]', '"', $text);

//Quotes: Replace smart double quotes with straight double quotes.
//Unicode version for use with Unicode regex engines.
preg_replace('[\u201C\u201D\u201E\u201F\u2033\u2036]', '"', $text);

//Quotes: Replace smart single quotes and apostrophes with straight single quotes.
//Unicode version for use with Unicode regex engines.
preg_replace("[\u2018\u2019\u201A\u201B\u2032\u2035]", "'", $text);

//Quotes: Replace smart single quotes and apostrophes with straight single quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace("[\x82\x91\x92]", "'", $text);

//Quotes: Replace straight apostrophes with smart apostrophes
preg_replace("\b'\b", "?", $text);

//Quotes: Replace straight double quotes with smart double quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace('\B"\b([^"\x84\x93\x94\r\n]+)\b"\B', '?\1?', $text);

//Quotes: Replace straight double quotes with smart double quotes.
//Unicode version for use with Unicode regex engines.
preg_replace('\B"\b([^"\u201C\u201D\u201E\u201F\u2033\u2036\r\n]+)\b"\B', '?\1?', $text);

//Quotes: Replace straight single quotes with smart single quotes.
//Unicode version for use with Unicode regex engines.
preg_replace("\B'\b([^'\u2018\u2019\u201A\u201B\u2032\u2035\r\n]+)\b'\B", "?\1?", $text);

//Quotes: Replace straight single quotes with smart single quotes.
//ANSI version for use with 8-bit regex engines and the Windows code page 1252.
preg_replace("\B'\b([^'\x82\x91\x92\r\n]+)\b'\B", "?\1?", $text);

Escape

//Regex: Escape metacharacters
//Place a backslash in front of the regular expression metacharacters
preg_replace("[][{}()*+?.\\^$|]", "\\$0", $text);

Security

//Security: ASCII code characters excl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Excludes tabs and line breaks.
'[\x00\x08\x0B\x0C\x0E-\x1F]'

//Security: ASCII code characters incl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Includes tabs and line breaks.
'[\x00-\x1F]'

//Security: Escape quotes and backslashes
//E.g. escape user input before inserting it into a SQL statement
preg_replace("\\$0", "\\$0", $text);

//Security: Unicode code and unassigned characters excl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Also matches any Unicode code point that is unused in the current Unicode standard, 
//and thus should not occur in text as it cannot be displayed.
//Excludes tabs and line breaks.
'[^\P{C}\t\r\n]'

//Security: Unicode code and unassigned characters incl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Also matches any Unicode code point that is unused in the current Unicode standard, 
//and thus should not occur in text as it cannot be displayed.
//Includes tabs and line breaks.
'\p{C}'

//Security: Unicode code characters excl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Excludes tabs and line breaks.
'[^\P{Cc}\t\r\n]'

//Security: Unicode code characters incl. tab and CRLF
//Matches any single non-printable code character that may cause trouble in certain situations.
//Includes tabs and line breaks.
'\p{Cc}'

SSN (Social security numbers)

//Social security number (US)
'\b[0-9]{3}-[0-9]{2}-[0-9]{4}\b'

Trim

//Trim whitespace (including line breaks) at the end of the string
preg_replace("\s+\z", "", $text);

//Trim whitespace (including line breaks) at the start and the end of the string
preg_replace("\A\s+|\s+\z", "", $text);

//Trim whitespace (including line breaks) at the start of the string
preg_replace("\A\s+", "", $text);

//Trim whitespace at the end of each line
preg_replace("[ \t]+$", "", $text);

//Trim whitespace at the start and the end of each line
preg_replace("^[ \t]+|[ \t]+$", "", $text);

//Trim whitespace at the start of each line
preg_replace("^[ \t]+", "", $text);

URL’s

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into backreferenes 1 through 4
'\b((?#protocol)https?|ftp)://((?#domain)[-A-Z0-9.]+)((?#file)/[-A-Z0-9+&@#/%=~_|!:,.;]*)?((?#parameters)\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Different URL parts
//Protocol, domain name, page and CGI parameters are captured into named capturing groups.
//Works as it is with .NET, and after conversion by RegexBuddy on the Use page with Python, PHP/preg and PCRE.
'\b(?<protocol>https?|ftp)://(?<domain>[-A-Z0-9.]+)(?<file>/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(?<parameters>\?[-A-Z0-9+&@#/%=~_|!:,.;]*)?'

//URL: Find in full text
//The final character class makes sure that if an URL is part of some text, punctuation such as a 
//comma or full stop after the URL is not interpreted as part of the URL.
'\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]'

//URL: Replace URLs with HTML links
preg_replace('\b(https?|ftp|file)://[-A-Z0-9+&@#/%?=~_|!:,.;]*[-A-Z0-9+&@#/%=~_|]', '<a href="\0">\0</a>', $text);

Words

//Words: Any word NOT matching a particular regex
//This regex will match all words that cannot be matched by %REGEX%.
//Explanation: Observe that the negative lookahead and the \w+ are repeated together.  
//This makes sure we test that %REGEX% fails at EVERY position in the word, and not just at any particular position.
'\b(?:(?!%REGEX%)\w)+\b'

//Words: Delete repeated words
//Find any word that occurs twice or more in a row.
//Delete all occurrences except the first.
preg_replace('\b(\w+)(?:\s+\1\b)+', '\1', $text);

//Words: Near, any order
//Matches word1 and word2, or vice versa, separated by at least 1 and at most 3 words
'\b(?:word1(?:\W+\w+){1,3}\W+word2|word2(?:\W+\w+){1,3}\W+word1)\b'

//Words: Near, list
//Matches any pair of words out of the list word1, word2, word3, separated by at least 1 and at most 6 words
'\b(word1|word2|word3)(?:\W+\w+){1,6}\W+(word1|word2|word3)\b'

//Words: Near, ordered
//Matches word1 and word2, in that order, separated by at least 1 and at most 3 words
'\bword1(?:\W+\w+){1,3}\W+word2\b'

//Words: Repeated words
//Find any word that occurs twice or more in a row.
'\b(\w+)\s+\1\b'

//Words: Whole word
'\b%WORD%\b'

//Words: Whole word
//Match one of the words from the list
'\b(?:word1|word2|word3)\b'

//Words: Whole word at the end of a line
//Whitespace permitted after the word
'\b%WORD%\s*$'

//Words: Whole word at the end of a line
'\b%WORD%$'

//Words: Whole word at the start of a line
'^%WORD%\b'

//Words: Whole word at the start of a line
//Whitespace permitted before the word
'^\s*%WORD%\b'

Leave a Comment!