TheJach.com

Jach's personal blog

I love regular expressions

Regular expressions are seriously awesome, the Perl-compatible (pre-Perl 6) kind included. I think what Perl 6 has done in changing them is for the better (|| becoming the ordered "or", instead of |), but I'm sticking with what I know for now.

Nevertheless, regexes can be annoying to work with. Especially when you start trying to replace patterns with your own text. And yes, I do have an example!


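// Escape everything HTML-ish first: <, >, & and friends (quotes are left alone).
$post = htmlentities($post, ENT_NOQUOTES);
if (!$restricted) {
    // Turn anything that now looks like an escaped tag back into a real tag.
    $post = preg_replace('/&lt;(.+?)&gt;/', '<$1>', $post);
}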


What's this for? Well, on this blog I want to paste code snippets like the one above without manually typing the escaped characters for less-thans that aren't part of a tag, while still being able to write literal HTML. In theory I want to type "[b]x[/b] [ 3" (only with angle brackets) and expect x < 3. (AHA! I confused it. Which means I need to rethink it.)

To be honest, I was kind of confused about how this is working in this post, because it still lets me write actual HTML. Oh yeah, a very good regex helper is this... It doesn't get confused.

And I was confused why it was not confused, and didn't actually look at the source at first.

I thought: if it sees < here and > there, shouldn't it be thinking "< here and >" is a tag and not convert it? (What I really mean is it won't convert the &lt; to <, but bear with me.)

And then I looked at the source. So now I'm worried about whether other browsers will handle this oddly, whether I can expect it to stay the same, and what strange edge-cases will pop up. It will not convert the above-mentioned symbols to &lt; and &gt;, but Firefox will display them properly anyway! Because < blah blah blah > is a senseless tag, plus there's no closing tag. Go Firefox.
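To make the confusing case concrete, here is a minimal sketch that wraps the two steps from the snippet above in a function and runs a few inputs through it. The function name render_post and the sample strings are made up for illustration; only the htmlentities and preg_replace calls come from the post.

```php
<?php
// Hypothetical wrapper around the two steps from the snippet above.
function render_post(string $post, bool $restricted = false): string
{
    // Escape <, >, & and other entities; ENT_NOQUOTES leaves quotes alone.
    $post = htmlentities($post, ENT_NOQUOTES);
    if (!$restricted) {
        // Un-escape anything between an escaped < and the next escaped >.
        $post = preg_replace('/&lt;(.+?)&gt;/', '<$1>', $post);
    }
    return $post;
}

// A real tag comes back as a tag, and a lone less-than stays escaped.
echo render_post('<b>x</b> < 3'), "\n";      // <b>x</b> &lt; 3
// A less-than with a greater-than later on the line is treated as one "tag".
echo render_post('a < b and c > d'), "\n";   // a < b and c > d
// In restricted mode everything stays escaped.
echo render_post('<b>x</b>', true), "\n";    // &lt;b&gt;x&lt;/b&gt;
```

The second call is the case that goes out raw: the regex can't tell a comparison from a tag, so what the reader sees is up to the browser.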

I think the final solution might just be to escape everything, and then replace a white-list of HTML with the real deals. But that seems too restrictive, and this essentially works, so it stays. Regexes are great.
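For comparison, the escape-everything-then-whitelist idea might look roughly like the sketch below. The helper name render_whitelisted and the default tag list are assumptions for illustration, not what this blog runs, and tags carrying attributes are deliberately left escaped.

```php
<?php
// Hypothetical helper: escape everything, then restore only whitelisted tags.
function render_whitelisted(string $post, array $allowed = ['b', 'i', 'em', 'strong', 'code', 'pre']): string
{
    $escaped = htmlentities($post, ENT_NOQUOTES);

    // Build an alternation such as b|i|em|strong|code|pre.
    $names = implode('|', array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $allowed));

    // Restore bare opening and closing tags only; tags with attributes stay escaped.
    return preg_replace('/&lt;(\/?)(' . $names . ')&gt;/i', '<$1$2>', $escaped);
}

echo render_whitelisted('<b>x</b> < 3 and <script>y</script>'), "\n";
// <b>x</b> &lt; 3 and &lt;script&gt;y&lt;/script&gt;
echo render_whitelisted('a < b and c > d'), "\n";
// a &lt; b and c &gt; d
```

The trade-off is the one mentioned above: anything not on the list, links and images included, shows up as escaped text.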

Now, here's a regex I completely understand, and I think is actually pretty awesome. It validates emails, but also covers edge-cases like Gmail's blah+whateveryoulike@gmail.com which is very useful. This is meant to be done case-insensitively, though, so watch your data.


^[a-z0-9][a-z0-9\.\-_]*\+?[a-z0-9\.]*@[a-z0-9\.\-_]+\.[a-z]{2,4}$


Lesson to be learned: the complexity of a regular expression is not in its size or how bad it looks at first glance, but in the pattern you're actually trying to match. Email addresses are fairly simple and have few edge-cases, and I believe this one matches most cases, though not all: the {2,4} at the end rejects top-level domains longer than four letters, and only a single + is allowed.
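A quick way to check what the pattern accepts is to run a handful of addresses through preg_match. This is only a sketch: the / delimiters, the i flag, the helper name looks_like_email and the sample addresses are all additions for illustration.

```php
<?php
// The pattern from the post, with delimiters and the i flag for case-insensitivity.
const EMAIL_PATTERN = '/^[a-z0-9][a-z0-9\.\-_]*\+?[a-z0-9\.]*@[a-z0-9\.\-_]+\.[a-z]{2,4}$/i';

// Hypothetical helper name; returns true when the whole address matches.
function looks_like_email(string $address): bool
{
    return preg_match(EMAIL_PATTERN, $address) === 1;
}

$samples = [
    'blah+whateveryoulike@gmail.com', // true: one plus-tag is allowed
    'Jach@Example.COM',               // true: the i flag ignores case
    'a+b+c@example.com',              // false: only one + is allowed
    '.leading@example.com',           // false: must start with a letter or digit
    'someone@example.museum',         // false: {2,4} caps the last label at four letters
];

foreach ($samples as $address) {
    echo $address, ' => ', var_export(looks_like_email($address), true), "\n";
}
```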

Edit: I've decided this method of un-escaping HTML is borderline madness. While regular expressions are awesome, parsing HTML with them is awful. Any passers-by with PHP library suggestions are welcome to comment, since while this has been queued on the todo list, it likely won't be done any time soon. (Especially if I have to go looking for libraries myself.)


Posted on 2009-11-11 by Jach

Tags: programming

Permalink: https://www.thejach.com/view/id/45
