A good model and a proper database design form the foundation of an information system. Building the data layer is often the first critical step towards implementing a new system, and getting it right requires attention to detail and a whole lot of careful planning. A database, like any computer system, is a model of a small piece of the real world. And, like any model, it’s a narrow representation that disregards much of the complexity of the real thing. Read the rest of this entry »
Database Design: Choosing a Primary Key
Posted Aug 14 in Database, Tutorials, Web Development 18 Comments »Full text search with Apache Lucene
Posted Jul 03 in Software, Tutorials, Web Development 11 Comments »It’s rather ironic that, while search is nearly ubiquitous on the web, there is no perfect solution for adding search functionality to a web application. Many developers simply use the basic search functionality built into whatever database server they’re using. Until recently, systems that required a more feature-rich, efficient, or flexible search solution had to turn to proprietary commercial software. But this is no longer the case. Apache’s Lucene project has brought the open source community a sophisticated and flexible search solution that rivals most commercial packages. Read the rest of this entry »
Extreme regex foo: what you need to know to become a regular expression pro
Posted Jun 21 in Programming, Tutorials 23 Comments »This tutorial is intended for advanced audiences. If you’re new to regular expressions, or if you could use a quick refresher, go read my intro to regular expressions, and work through a few examples. Trust me, it’ll be one of the most rewarding twenty minutes you’ve ever spent. If you’re familiar with the basic regex concepts then read on and learn all you need to know to be a regex pro. Read the rest of this entry »
Getting Started with Google Gears
Posted May 31 in Google, Tutorials, Web Development 29 Comments »Yesterday, Google announced Google Gears, an open source browser extension that lets web applications work offline (though I think there are more interesting ways to use the platform). Lots of blogs have picked up the story, including TechCrunch, Mashable, and lifehacker. But most simply regurgitate information from Google’s official press release. They don’t explain how the platform works at all.
I’ve spent the morning reading over the Gears documentation, and working through some sample applications. Here are my notes. I’m hoping they’ll provide enough of an overview to get started with Gears without wading through dozens of pages of documentation yourself. Read the rest of this entry »
How to write a simple PHP template engine
Posted May 13 in Programming, Tutorials 8 Comments »Templates are a great way to separate logic from presentation in an application. There’s no shortage of template engines available for PHP, so why would you want to write another one? Well, because sometimes you don’t need a full fledged template language like smarty, and writing your own simple engine is as easy as it is useful. Read the rest of this entry »
How to measure your web site’s performance
Posted May 06 in Tutorials, Web Development 2 Comments »This is the first in a two part series where I’ll be describing how to load test a web server. In this post I’ll go over some basic load testing concepts and get everything up and running. In the next post I’ll compare the performance of two popular web servers (Apache and lighttpd) running a popular content management system (Wordpress) and explore how performance can be improved. Read the rest of this entry »
Fork PHP! (and speed up your scripts)
Posted Apr 08 in Linux, Programming, Tutorials 10 Comments »I just came across a forum post discussing what the author calls “multithreaded PHP,” and thought I’d clear a few things up about concurrency in PHP. First of all, this is not multhreading. It’s not even “sort of multithreading,” as the author implies (no offense, it’s just not). What the author is actually doing is creating multiple processes. A process is not the same thing as a thread. In fact, PHP does not support multiple threads at all, and doesn’t plan to do so anytime soon.
Threads and processes both increase the concurrency of an application, meaning you can do more things at the same time. A good example of a program that would benefit from multithreading/multiple processes is a web server. A single process or thread can only handle one I/O operation at a time (that’s not completely accurate, but it simplifies things so whatever… you can read more here if you want to know the truth). I/O operations are s - l - o - w (in computer-terms, anyways). If we fork a second process, or create a new thread, we can handle two I/O operations simultaneously! This simple model is the way 90% of network servers work:
While( 1 ) Process 1: Listen for connection Process 1: Connection made, fork Process 2 End Process 2: Handle request, terminate.
That way nobody is turned away by our server because the process that is supposed to be listening for incoming connections is handling a request.
Concurrency can come in handy for client-side programs too from time to time. Take the PHP script I wrote to download YouTube videos, for example. The script connects to YouTube, finds the URLs of the relevant FLV files, and pushes each URL onto a stack. Then it pops each URL off the stack one by one and downloads the FLV file. But YouTube throttles the downloads to about 60K/s. That’s fine when you’re streaming the video in your browser since the bitrate is much lower than that. But it’s not cool when you’re on a box with a 100mbit connection! So what can we do about it?
Well, we can’t change the fact that downloads from YouTube are capped at 60K/s, but we can download more than one movie at once by forking additional processes. And it turns out there is an easy way to do it built right in to PHP. The solution is the pcntl_fork() function, which provides an interface to the underlying fork syscall (only available on *nix platforms, sorry Windows users).
The trick to the fork syscall is that it returns twice. Once in the parent, and once in the child. It’s confusing at first, but if you think about it, it makes sense. In the parent the return value is the new child’s process ID. In the child it’s zero. Thus, we can use a simple if statement to determine whether we’re in the parent or the child process, which will usually determine the execution path we follow.
The changes that I had to make to the youtube script were almost trivial. The general pattern looks like this:
$pid = pcntl_fork();
if($pid == -1) {
die('could not fork')
}
else if ($pid) {
// positive value means we're in the parent.
// do whatever parents do
....
// wait for children to complete by calling
// pcntl_wait() or a variant
} else {
// zero value means we're in the child.
// do whatever children do
// (e.g. download the files, then exit)
}
Check out the source code here (plaintext version), and compare it to the previous version (plaintext version). The differences are minimal, but the forked version can substantially increase the execution speed if you’re downloading several files on a fast connection.
5 Regular Expressions Every Web Programmer Should Know
Posted Apr 06 in Lists, Programming, Tutorials 55 Comments »I’m going to assume you have a basic understanding of regular expressions at this point. If you’re a regex n00b (or /n0{2}b/, as I like to call them), or if you need a quick refresher, check out my previous post on the absolute bare minimum that every programmer should know about regular expressions. You won’t be disappointed.
So, without further adu, here are the five regular expressions that I have found the most useful for day-to-day web programming tasks.
Matching a username
This one’s quite easy, but it’s really invaluable if you’re trying to build a user registration system for a website. We typically want to limit usernames to a restricted set of characters in order to make development easier, and to keep malicious users from spoofing someone else’s name (e.g. replacing a space with multiple spaces or a newline character, which are all displayed the same by a web browser).
Without regular expressions, this would be a tedious exercise that would involve splitting the string into it’s component characters and examining each one individually. With regular expressions, it’s a breeze. First, let’s define what we want to accept, we’ll keep it simple and limit the example to the following characters:
We’ll also want to enforce a 3 character minimum and a 16 character maximum length. Here’s the regular expression that matches this fairly standard set of criteria:
/[a-zA-Z0-9_]{3,16}/
If you’re familiar with regular expressions you may have notice something missing at this point - don’t worry, I’ll get to it.
If you’ve read my introductory to regular expressions you should already know how this regex works. First we’re defining a character class that will match any letters (a through z, and A through Z) and any numbers (0 through 9), as well as the _ (underscore) character. Next comes an interval quantifier that tells the regex engine we’ll only match sequences of between 3 and 16 characters. Because the quantifier follows a character class rather than a single character it attaches itself to the entire class, and will match every sequence between 3 and 16 characters so long as each character falls within our restricted character set.
So what’s missing? As it stands our regex will match anywhere within a string. It won’t just match ‘mike_84′, it will also match any ‘%! mike_84&’, which contains several characters we don’t want. What we need are anchors, the ^ (caret) and $ (dollar) characters will anchor our regex to the beginning and end of the string, ensuring that the whole username meets our requirements and not just a portion of it.
So our revised regex will look like this:
/^[a-zA-Z0-9_]{3,16}$/
Here’s a quick PHP code snippet that shows how we can use this regex in production (we could just as easily use perl, java, ruby, or even javascript to do this validation).
function validate_username( $username ) {
if(preg_match('/^[a-zA-Z0-9_]{3,16}$/', $_GET['username'])) {
return true;
}
return false;
}
Matching an XHTML/XML tag
Matching an XML or XHTML tag can be extremely useful if you’re scraping a website for data, or trying to quickly extract information from an XML document. A simple regex to accomplish this sort of extraction follows this form (the word ‘tag’ should be replaced with whatever tag you are looking for):
{<tag[^>]*>(.*?)</tag>}
The question mark following the star turns the start into a lazy quantifier. By default, quantifiers are greedy, meaning they’ll consume as much of the input text as they can. Lazy quantifiers, by contrast, will match as little of the input text as they can. If we used a greedy quantifier in this case, our regex would not work as advertised on an input document like
<tag>item 1</tag><tag>item 2</tag>
Instead of matching a single tag, a greedy quantifier would match up to the final closing tag in the input text.
Here’s a simple PHP function to extract the contents of each matching XML or XHTML tag as an array:
function get_tag( $tag, $xml ) {
$tag = preg_quote($tag);
preg_match_all('{<'.$tag.'[^>]*>(.*?)</'.$tag.'>.'}',
$xml,
$matches,
PREG_PATTERN_ORDER);
return $matches[1];
}
Matching an XHTML/XML tag with a certain attribute value (e.g. class or tag)
This regex is very similar to the last example, except we only want tags with a certain attribute value. This comes in handy when you want to extract a tag with a particular class or ID value, for example. The regex is just slightly more complicated than our previous example (again, replace tag, attribute, and value with whatever you’re looking for):
{<tag[^>]*attribute\\s*=\\s*(["'])value\\\\1[^>]*>(.*?)</tag>}
We use a character class to allow either single or double quotes around our value. The portion of the regex following the value is called a backreference. It will be replaced with whatever is captured by the first set of parenthesis in the expression (either a single quote or double quote). That way we can be sure that the opening and closing quotes match.
Here’s a PHP function that shows how you can extract information form an XHTML document with this regex. The function tags an attribute, value, input text, and an optional tag name as arguments. If no tag name is specified it will match any tag with the specified attribute and attribute value.
function get_tag( $attr, $value, $xml, $tag=null ) {
if( is_null($tag) )
$tag = '\\w+';
else
$tag = preg_quote($tag);
$attr = preg_quote($attr);
$value = preg_quote($value);
$tag_regex = "/<(".$tag.")[^>]*$attr\\s*=\\s*".
"(['\\"])$value\\\\2[^>]*>(.*?)<\\/\\\\1>/"
preg_match_all($tag_regex,
$xml,
$matches,
PREG_PATTERN_ORDER);
return $matches[3];
}
Matching and parsing an email address
This one comes courtesy of Cal Henderson, the programmer behind Flickr and author of Building Scalable Web Sites (a great read). For more information check out Cal’s article on parsing email addresses.
This one’s such a behemoth that it’s easier to digest when broken into it’s component parts. Constructing a regex like this is a bit like describing a grammer in Backus-Naur form (BNF), which is convenient because many of the things we’re trying to match are already described using BNF in their specifications. This is the case for email addresses, which are described in RFC 822. Anyways, here’s a PHP function that will check the validity of an e-mail address:
function is_valid_email_address($email){
$qtext = '[^\\x0d\\x22\\x5c\\x80-\\xff]';
$dtext = '[^\\x0d\\x5b-\\x5d\\x80-\\xff]';
$atom = '[^\\x00-\\x20\\x22\\x28\\x29\\x2c\\x2e\\x3a-\\x3c'.
'\\x3e\\x40\\x5b-\\x5d\\x7f-\\xff]+';
$quoted_pair = '\\x5c[\\x00-\\x7f]';
$domain_literal = "\\x5b($dtext|$quoted_pair)*\\x5d";
$quoted_string = "\\x22($qtext|$quoted_pair)*\\x22";
$domain_ref = $atom;
$sub_domain = "($domain_ref|$domain_literal)";
$word = "($atom|$quoted_string)";
$domain = "$sub_domain(\\x2e$sub_domain)*";
$local_part = "$word(\\x2e$word)*";
$addr_spec = "$local_part\\x40$domain";
return preg_match("!^$addr_spec$!", $email) ? 1 : 0;
}
The ‘\x##’ sequences are hexadecimal character references. It’s just a fancy way of specifying a character using it’s underlying code point (the numerical representation of a particular symbol). Otherwise this is a fairly straightforward, albeit incredibly complex regular expression. I’ll refrain from any further analysis since it’s been done elsewhere.
Tim Fletcher has ported Cal’s original PHP function into Ruby and Perl, if that’s what you’re into..
Matching a URL
Matching a URL is a lot like matching an e-mail address, except that you tend to do it in more controlled, and less critical situations where you can tolerate a few false positives. I use this regex frequently in projects when I need to automatically generate links when a URL is typed in a comment field, for example. Like the email regex, this one’s a doozy, but it’s pretty easy to understand.
I’ve taken advantage of the ‘x’ and ‘i’ pattern modifiers for this regex. Pattern modifiers are tacked onto the end of a regex and change the way the regex engine interprets the expression. The ‘x’ modifier tells the engine to ignore whitespace, except when escaped or used inside of a character class. It also tells the engine to interpret any text following a ‘#’ character outside of a character class as a comment (i.e. ignore it). The ‘i’ modifier makes the regex case insensitive. This can drastically simplify a complicated regex like this one when case doesn’t matter. This regex is derived from one developed by Jeffrey Friedl in his book Mastering Regular Expressions.
{
\\b
# Match the leading part (proto://hostname, or just hostname)
(
# http://, or https:// leading part
(https?)://[-\\w]+(\\.\\w[-\\w]*)+
|
# or, try to find a hostname with more specific sub-expression
(?i: [a-z0-9] (?:[-a-z0-9]*[a-z0-9])? \\. )+ # sub domains
# Now ending .com, etc. For these, require lowercase
(?-i: com\\b
| edu\\b
| biz\\b
| gov\\b
| in(?:t|fo)\\b # .int or .info
| mil\\b
| net\\b
| org\\b
| [a-z][a-z]\\.[a-z][a-z]\\b # two-letter country code
)
)
# Allow an optional port number
( : \\d+ )?
# The rest of the URL is optional, and begins with /
(
/
# The rest are heuristics for what seems to work well
[^.!,?;"\\'<>()\[\]\{\}\s\x7F-\\xFF]*
(
[.!,?]+ [^.!,?;”\\’<>()\\[\\]\{\\}\s\\x7F-\\xFF]+
)*
)?
}ix
The comments in this expression are fairly self explanatory, so I don’t think it needs a whole lot of explanation. There are a few things to watch out for though. First, this regex will match some things that are not valid URLs. The regex assumes that any two-letter combination is a valid top-level domain (TLD), which is not the case. It also misses TLDs that were recently added to the IANA list like .travel, .name, and .museum. You can fix this by downloading the latest IANA TLD list and adding any missing TLDs in the list of alternatives mid-way through the expression.
That being said, this regex works great 99.9% of the time. Here’s a quick PHP function that will parse a section of text, replacing any URLs it finds with links. I’m going to assume you’ve set the variable $url_regex to the the above pattern so I don’t have to repeat it here.
function auto_link( $text ) {
$url_regex = ...
return preg_replace( $url_regex,
'<a href="$0"^gt;$0=</a>',
$text );
}
So that’s it. If you think I left something off the list that deserves mention, or if you have any suggestions for improvements, post a comment and let me know.
The absolute bare minimum every programmer should know about regular expressions
Posted Apr 06 in Programming, Tutorials 44 Comments »Wtf is a regular expression?
Regular expressions are strings formatted using a special pattern notation that allow you to describe and parse text. Many programmers (even some good ones) disregard regular expressions as line noise, and it’s a damned shame because they come in handy so often. Once you’ve gotten the hang of them, regular expressions can be used to solve countless real world problems.
Regular expressions work a lot like the filename “globs” in Windows or *nix that allow you to specify a number of files by using the special * or ? characters (oops, did I just use a glob while defining a glob?). But with regular expressions the special characters, or metacharacters, are far more expressive.
Like globs, regular expressions treat most characters as literal text. The regular expression mike, for example, will only match the letters m - i - k - e, in that order. But regular expressions sport an extensive set of metacharacters that make the simple glob look downright primitive.
Meet the metacharacters: ^[](){}.*?\|+$ and sometimes -
I know, they look intimidating, but they’re really nice characters once you get to know them.
The Line Anchors: ‘^’ and ‘$’
The ‘^’ (caret) and ‘$’ (dollar) metacharacters represent the start and end of a line of text, respectively. As I mentioned earlier, the regular expression mike will match the letters m - i - k - e, but it will match anywhere in a line (e.g. it would match “I’m mike”, or even “carmike”). The ‘^’ character is used to anchor the match to the start of the line, so ^mike would only find lines that start with mike. Similarly, the expression mike$ would only find m - i - k - e at the end of a line (but would still match ‘carmike’).
If we combine the two line anchor characters we can search for lines of text that contain a specific sequence of characters. The expression ^mike$ will only match the word mike on a line by itself - nothing more, nothing less. Similarly the expression ^$ is useful for finding empty lines, where the beginning of the line is promptly followed by the end.
The Character Class: ‘[]’
Square brackets, called a character class, let you match any one of several characters. Suppose you want to match the word ‘gray’, but also want to find it if it was spelled ‘grey’. A character class will allow you to match either. The regular expression gr[ea]y is interpreted as “g, followed by r, followed by either an e or an a, followed by y.”
If you use [^ ... ] instead of [ ... ], the class matches any character that isn’t listed. The leading ^ “negates” the list. Instead of listing all of the characters you want to included in the class, you list the characters you don’t want included. Note that the ^ (caret) character used here has a different meaning when it’s used outside of a character class, where it is used to match the beginning of a line.
The Character Class Metacharacter: ‘-’
Within a character-class, the character-class metacharacter ‘-’ (dash) indicates a range of characters. Instead of [01234567890abcdefABCDEF] we can write [0-9a-fA-F]. How convenient. The dash is a metacharacter only within a character class, elsewhere it simply matches the normal dash character.
But wait, there’s a catch. If a dash is the first character in a character class it is not considered a metacharacter (it can’t possibly represent a range, since a range requires a beginning and an end), and will match a normal dash character. Similarly, the question mark and period are usually regex metacharacters, but not when they’re inside a class (in the class [-0-9.?] the only special character is the dash between the 0 and 9).
Matching Any Character With a Dot: ‘.’
The ‘.’ metacharacter (called a dot or point) is shorthand for a character class that matches any character. It’s very convenient when you want to match any character at a particular position in a string. Once again, the dot metacharacter is not a metacharacter when it’s inside of a character class. Are you beginning to see a pattern? The list of metacharacters is different inside and outside of a character class.
The Alternation Metacharacter: ‘|’
The ‘|’ metacharacter, (pipe) means “or.” It allows you to combine multiple expressions into a single expression that matches any of the individual ones. The subexpressions are called alternatives.
For example, Mike and Michael are separate expressions, but Mike|Michael is one expression that matches either.
Parenthesis can be used to limit the scope of the alternatives. I could shorten our previous expression that matched Mike or Michael with creative use of parenthesis. The expression Mi(ke|chael) matches the same thing. I probably would use the first expression in practice, however, as it is more legible and therefore more maintainable.
Matching Optional Items: ‘?’
The ‘?’ metacharacter (question mark) means optional. It is placed after a character that is allowed, but not required, at a certain point in an expression. The question mark attaches only to the immediately preceding character.
If I wanted to match the english or american versions of the word ‘flavor’ I could use the regex flavou?r, which is interpreted as “f, followed by l, followed by a, followed by v, followed by o, followed by an optional u, followed by r.”
The Other Quantifiers: ‘+’ and ‘*’
Like the question mark, the ‘+’ (plus) and ‘*’ (star) metacharacters affect the number of times the preceding character can appear in the expression (with ‘?’ the preceding character could appear 0 or 1 times). The metacharacter ‘+’ matches one or more of the immediately preceding item, while ‘*’ matches any number of the preceding item, including 0.
If I was trying to determine the score of a soccer match by counting the number of times the announcer said the word ‘goal’ in a transcript, I might use the regular expression go+al, which would match ‘goal’, as well as ‘gooooooooooooooooal’ (but not ‘gal’).
The three metacharacters, question mark, plus, and star are called quantifiers because they influence the quantity of the item they’re attached to.
The Interval Quantifier: ‘{}’
The ‘{min, max}’ metasequence allows you to specify the number of times a particular item can match by providing your own minimum and maximum. The regex go{1,5}al would limit our previous example to matching between one and five o’s. The sequence {0,1} is identical to a question mark.
The Escape Character: ‘\’
The ‘\’ metacharacter (backslash) is used to escape metacharacters that have special meaning so you can match them in patterns. For example, if you would like to match the ‘?’ or ‘\’ characters, you can precede them with a backslash, which removes their meaning: ‘\?’ or ‘\\’.
When used before a non-metacharacter a backslash can have a different meaning depending on the flavor of regular expression you’re using. For perl compatible regular expressions (PCREs) you can check out the perldoc page for perl regular expressions. PCREs are extremely common, this flavor of regexes can be used in PHP, Ruby, and ECMAScript/Javascript, and many other languages.
Using Parenthesis for Matching: ‘()’
Most regular expression tools will allow you to capture a particular subset of an expression with parenthesis. I could match the domain portion of a URL by using an expression like http://([^/]+). Let’s break this expression down into it’s components to see how it works.
The beginning of the expression is fairly straightforward: it matches the sequence “h - t - t - p - : - / - /”. This initial sequence is followed by parenthesis, which are used to capture the characters that match the subexpression they surround. In this case the subexpression is ‘[^/]+’, which matches any character except for ‘/’ one or more times. For a URL like http://immike.net/blog/Some-blog-post, ‘immike.net’ will be captured by the parenthesis.
Want to know more?
I’ve only touched the surface on what can be done with regular expressions. If want to learn more, check out Jeffrey Friedl’s book Mastering Regular Expressions. Friedl did an outstanding job with this book, it’s accessible and a surprisingly fun and interesting read, considering the dry subject matter.
Check out my follow up to this article where I take a look at some of the most useful regular expressions for common programming tasks. And once you understand the basics read on to learn all you need to know to become a regex pro.

Recent Comments