I love regular expressions

Regular expressions are seriously awesome. Perl (< 6) compatible ones, too. While I think what Perl 6 has done with changing them is for the better (|| becoming or, instead of |), I'm sticking with what I know for now.

Nevertheless, regexes can be annoying to work with. Especially when you start trying to replace patterns with your own text. And yes, I do have an example!


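// Escape everything first (quotes are left alone).
$post = htmlentities($post, ENT_NOQUOTES);
if (!$restricted) {
  // Then turn anything that looks like an escaped tag back into real HTML.
  $post = preg_replace('/&lt;(.+?)&gt;/', '<$1>', $post);
}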

What's this for? Well, on this blog, I want to paste code snippets like the above without manually typing in the escaped characters for non-literal less-thans, but also do literal HTML. I want to theoretically do "[b]x[/b] [ 3" (only with angle brackets) and expect x < 3. (AHA! I confused it. Which means I need to rethink it.)
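To make the behaviour concrete, here is a minimal sketch that wraps the two steps above in a function. The name render_post and the default value for $restricted are made up for illustration; this is not necessarily how the blog itself is wired up.

```php
<?php
// Hypothetical wrapper around the snippet from the post.
function render_post(string $post, bool $restricted = false): string
{
    // Step 1: escape every <, > and & (quotes untouched).
    $post = htmlentities($post, ENT_NOQUOTES);
    if (!$restricted) {
        // Step 2: un-escape the shortest "&lt;...&gt;" runs back into tags.
        $post = preg_replace('/&lt;(.+?)&gt;/', '<$1>', $post);
    }
    return $post;
}

// The tag comes back, and the lone less-than stays escaped.
echo render_post('<b>x</b> < 3'), "\n";    // <b>x</b> &lt; 3

// A later > pairs up with the earlier <, so both come back raw.
echo render_post('x < 3 and y > 2'), "\n"; // x < 3 and y > 2
```

The second call is the confusing case: the regex cannot tell a real tag from a less-than that happens to be followed by a greater-than somewhere later in the text.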

To be honest, I was kind of confused how this is working in this post... Because it still lets me do actual HTML. Oh yeah, a very good regex helper is this... It doesn't get confused.

And I was confused why it was not confused, and didn't actually look at the source at first.

I thought: if it sees < here and > there, shouldn't it be thinking "< here and >" is a tag and not convert it? (What I really mean is it won't convert the < to &lt;, but bear with me.)

And then I looked at the source. So now I'm worried about whether other browsers will handle this oddly, or whether I can expect it to remain the same, and what strange edge-cases will pop up. It will not convert the above-mentioned symbols to &lt; and &gt;, but Firefox will display them properly anyway! Because < blah blah blah > is a senseless tag, plus there's no closing tag. Go Firefox.

I think the final solution might just be to escape everything, and then replace a white-list of HTML with the real deals. But that seems too restrictive, and this essentially works, so it stays. Regexes are great.
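For what it's worth, the white-list idea could look something like the sketch below. The function name render_with_whitelist and the list of allowed tags are assumptions picked for illustration, and it only handles bare tags without attributes.

```php
<?php
// Hypothetical white-list variant: escape everything, then restore
// only a fixed set of simple tags (no attributes).
function render_with_whitelist(
    string $post,
    array $allowed = ['b', 'i', 'em', 'strong', 'code', 'pre']
): string {
    $post = htmlentities($post, ENT_NOQUOTES);

    // Build an alternation like "b|i|em|..." from the allowed tag names.
    $names = array_map(function ($tag) {
        return preg_quote($tag, '/');
    }, $allowed);
    $pattern = '/&lt;(\/?(?:' . implode('|', $names) . '))&gt;/i';

    // Un-escape opening and closing forms of the allowed tags only.
    return preg_replace($pattern, '<$1>', $post);
}

echo render_with_whitelist('<b>x</b> < 3 and y > 2 <script>'), "\n";
// <b>x</b> &lt; 3 and y &gt; 2 &lt;script&gt;
```

Here the stray less-than and greater-than stay escaped, and so does anything not on the list, which is exactly the restrictiveness complained about above.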

Now, here's a regex I completely understand, and I think is actually pretty awesome. It validates emails, but also covers edge-cases like Gmail's blah+whateveryoulike@gmail.com which is very useful. This is meant to be done case-insensitively, though, so watch your data.


^[a-z0-9][a-z0-9\.\-_]*\+?[a-z0-9\.]*@[a-z0-9\.\-_]+\.[a-z]{2,4}$

Lesson to be learned: the complexity of a regular expression is not in its size or how bad it looks at first glance, but in the pattern you're actually trying to match. Everyday email addresses are fairly simple and have few edge-cases, and I believe this one matches most of them, though the {2,4} at the end means it rejects top-level domains longer than four letters.
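Here is a small sketch of how the pattern above might be used from PHP, with the i modifier taking care of the case-insensitive part. The function name looks_like_email is made up for the example.

```php
<?php
// Hypothetical helper: applies the post's pattern case-insensitively.
function looks_like_email(string $address): bool
{
    $pattern = '/^[a-z0-9][a-z0-9\.\-_]*\+?[a-z0-9\.]*@[a-z0-9\.\-_]+\.[a-z]{2,4}$/i';
    return preg_match($pattern, $address) === 1;
}

var_dump(looks_like_email('blah+whateveryoulike@gmail.com')); // bool(true)
var_dump(looks_like_email('Jach@Example.COM'));               // bool(true)
var_dump(looks_like_email('not an email'));                   // bool(false)
var_dump(looks_like_email('someone@example.museum'));         // bool(false)
```

The last call shows the limit of the {2,4} at the end: a six-letter top-level domain is rejected.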

Edit: I've decided this method is borderline madness. While regular expressions are awesome, parsing HTML with them is awful. Any passers-by with PHP library suggestions will be happily welcomed to comment, since while this has been queued on the todo list it likely won't be done any time soon. (Especially if I have to go looking for libraries myself.)

Posted on 2009-11-11 by Jach

Tags: programming

Permalink: https://www.thejach.com/view/id/45

Trackback URL: https://www.thejach.com/view/2009/11/i_love_regular_expressions

TheJach.com

Jach's personal blog

(Largely containing a mind-dump to myselves: past, present, and future)

Current favorite quote: "Supposedly smart people are weirdly ignorant of Bayes' Rule." William B Vogt, 2010
