What a good topic for a post, eh? I spoke about Regular Expressions (RE) in my first episode of Taming The Electron and felt it was only fair to write a full intro for newbies to the subject. I have been reading “Mastering Regular Expressions” off and on for a good while now, but recently I have been reading it with more focus. This is probably due to my Kindle DX purchase, which you can read about or watch in a previous post Here. Regular Expressions are amazing. They are “magic.” They can make matches with EXTREME precision.
Now, most people, including most of my friends, think that languages that do operations on text are lame. Well, UNIX-based OSes revolve around text files. Operations on those files are what customize your OS. Customization is key to creating synergy between you and your OS. It entices you and makes you want to come back.
Most files in Unix-based OSes that I deal with are in /usr/share or /etc, or are sometimes “live” files, like those in /proc. /proc is called a “virtual file-system,” and many of its files are dynamic: they change constantly along with the state of your overall system. If you wanted to write a cool application that checks your memory or CPU, you can use the files in /proc (/proc/meminfo, /proc/cpuinfo) to do that. You can even find kernel messages, bus information, and more there. In fact, on Linux many of the tools used for administration, such as ps and top, get their data from “live” files in /proc. Here is a great article for anyone new to /proc: /proc
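As a small taste of what that looks like, here is a minimal sketch that pulls the total memory line out of /proc/meminfo with grep. It assumes a Linux system (macOS and the BSDs do not have this file), and the variable name mem_total is just something I made up for the example.

```shell
# Assumption: Linux. /proc/meminfo does not exist on every Unix-like OS,
# so check that it is readable before searching it.
if [ -r /proc/meminfo ]; then
    # Keep only the line that mentions MemTotal.
    mem_total=$(grep 'MemTotal' /proc/meminfo)
    echo "$mem_total"
else
    mem_total=""
    echo "/proc/meminfo is not readable here (not a Linux system?)"
fi
```

The number it prints will be different on every machine, which is exactly the point of a “live” file.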
When you open a file and read it in an application, you will most likely want to search for useful information. Sometimes you are simply handed raw data that you need to “parse” (break apart or change in some way) to turn the data into information. The simplest way to search is line by line, testing each line against a pattern that is matched character by character, which is how grep and egrep work. egrep (the same thing as grep -E on current systems) is a powerful matching tool that lets you build complicated yet precise matching expressions.
The word “expression” in Regular Expressions has much the same meaning that it does in plain Algebra. Regular expressions were actually developed back in the 1950s by Stephen Kleene as part of formal language theory and automata theory (theoretical machines and problem solving). Both are basic branches of theoretical computer science. The syntax can act like algebra in some ways, and many languages that deal with “lame” text have their own regular expression syntax built right into them. RE should have one standard syntax. POSIX does define “basic” and “extended” flavors (the ones grep and egrep use), but I hear a lot about how Perl 5, Perl 6, Python, and Ruby all have slight differences in their Regular Expression syntax.
Okay, so that’s what they are good for and a bit about their history. Now, let’s try some matching patterns, meta-characters, and meta-sequences. Think about characters and put them into classes. Class alpha will be your alphabet, uppercase A-Z and lowercase a-z. You can specify a “range” in Regular Expressions inside the square brackets “[]”. Say we have a text file that has a few lines with letters and a few lines with digits, like so:
abcdefg
abc
12345
ghostbusters
007
31337
drums
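If you want to follow along, here is one way to create that file. This is only a sketch: filename.txt is simply the placeholder name used in the commands in this post, and line_count is a variable I added so you can check that all seven lines were written.

```shell
# Write the seven sample lines, one per line, into filename.txt
# (a placeholder name; use whatever you like).
printf '%s\n' abcdefg abc 12345 ghostbusters 007 31337 drums > filename.txt

# Count the lines to confirm the file was written; tr strips the padding
# that some versions of wc put in front of the number.
line_count=$(wc -l < filename.txt | tr -d ' ')
echo "$line_count lines written"
```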
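Now, say we cat the file (show its contents) and we only want to get (filter) the lines that have letters in them. A bracketed class matches ONE character, so the pattern below matches any line that contains at least one letter. In this file those happen to be the letters-only lines. We can use egrep or grep like so: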
cat filename.txt | egrep '[A-Za-z]'
And this will display the lines:
abcdefg
abc
ghostbusters
drums
Now, we can change the class to Numeric (numbers) and do [0-9].
cat filename.txt | egrep '[0-9]'
This will print the lines that have at least one digit in them, which in our file are the lines made of numbers only (the alpha lines are filtered OUT). This is a very basic example. We can further filter our output by using “anchors.” Anchors are meta-characters (a special word for operators, usually not of the alpha or numeric classes) that make a pattern match ONLY at the beginning of a line or at the end of a line: the “caret,” or “^”, and the “dollar,” or “$”, respectively.
So say we add a caret:
cat filename.txt | grep '^[0-9]'
This will display all lines that START with a number, whatever comes after it. If we drop the “^” and put a “$” after the [0-9] range, like ‘[0-9]$’, it will match all lines that END with a number. In our file both patterns happen to print the same numbers-only lines, but a line like “abc123” would match the second and not the first. What if we put the meta-character “^” into the square brackets before our range? Well, most meta-characters lose their meta when put into those brackets. For example the period “.”, which usually means “any one character,” becomes a simple period. The “^” is different: as the first character inside the brackets it negates the range. So ‘[^0-9]’ means match any one character that is NOT in our numeric class. Here’s a cool tip: the range meta-character “-” is only a meta-character INSIDE of the square brackets. Yep, that means outside it’s just a plain old “-” character.
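To match lines that contain ONLY numbers (or ONLY letters), one approach is to use both anchors at once and repeat the class. The sketch below uses the “+” meta-character, which I have not covered above: in egrep / grep -E it means “one or more of the preceding item.” The variable names and the extra test line “abc123” are made up for the example, and the sample data is held in a variable so the block runs on its own.

```shell
# The same seven lines as the sample file, kept in a variable.
sample=$(printf '%s\n' abcdefg abc 12345 ghostbusters 007 31337 drums)

# Lines made up ONLY of digits: anchor both ends, repeat the class with +.
digits_only=$(printf '%s\n' "$sample" | grep -E '^[0-9]+$')

# Lines made up ONLY of letters.
letters_only=$(printf '%s\n' "$sample" | grep -E '^[A-Za-z]+$')

# "abc123" contains a digit, so the unanchored class matches it...
loose=$(printf '%s\n' abc123 | grep -E '[0-9]')
# ...but it is not all digits, so the anchored pattern does not.
# (grep exits non-zero when nothing matches, hence the "|| true".)
strict=$(printf '%s\n' abc123 | grep -E '^[0-9]+$' || true)

echo "digits only:";  echo "$digits_only"
echo "letters only:"; echo "$letters_only"
echo "loose match: $loose"
echo "strict match: ${strict:-<none>}"
```

The unanchored patterns earlier in this post only gave clean output because every line in the sample file is all letters or all digits.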
There are parentheses in the language, just like in Algebra, that group together “expressions.” For instance, here is an example based on one from “Mastering Regular Expressions – O’Reilly” that searches for all instances of July 4th. The question mark meta-character means “zero or one instance of the preceding character or group,” in other words it makes that item optional, and the pipe “|” means “or.” So:
'(July?) (Four|4)(th)?'
Will find ALL values like:
July 4th
Jul Fourth
Jul 4
July 4
July Fourth
Their example in the book was:
'July? (Fourth|4(th)?)'
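which spells out “Fourth” in full and only makes the “th” optional after the “4”. Mine pulls the “th” out so that it is shared by both alternatives, but that makes it a little looser: it will also match “July Four”. Yeah, you get picky like that, and once you get into the swing of computer languages and Regular Expressions, you start to look for the most efficient way to code.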
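One way to compare the two patterns is to run both against the same list of dates. This is just a sketch: the candidate lines and the variable names are mine, and I use grep’s -x option so that a pattern has to match the WHOLE line instead of any piece of it.

```shell
# Candidate lines: the five good dates plus two that should be rejected.
candidates=$(printf '%s\n' 'July 4th' 'Jul Fourth' 'Jul 4' 'July 4' \
    'July Fourth' 'July Four' 'June 4th')

# My grouped version: "th" is optional after either "Four" or "4".
mine=$(printf '%s\n' "$candidates" | grep -E -x '(July?) (Four|4)(th)?')

# The book's version: "Fourth" is spelled out, "th" is optional after "4" only.
book=$(printf '%s\n' "$candidates" | grep -E -x 'July? (Fourth|4(th)?)')

echo "mine matches:"; echo "$mine"
echo "book matches:"; echo "$book"
```

Both versions accept the five dates listed above and reject “June 4th”. The only difference is “July Four”, which the grouped version lets through and the book’s version does not.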
These small examples can help you get into the flow, I’m sure, or at least pique your interest in Regular Expressions. Just like any language, Regular Expressions will open up a lot of doors for you as a developer or system administrator. They can help you with problem solving too. Sometimes you can perform large, usually complicated tasks with them, and sometimes small tasks would be large tasks without them. Mastering Regular Expressions is a good book. It also covers a few Awk, Sed, and Grep topics and brings everything together in one cool place: a bunch of papers wrapped in heavier paper.
I realize, without a doubt, that my code isn’t always efficient. Perl’s motto is TIMTOWTDI, “There’s more than one way to do it.” The Perl community accepts newbie code and, more importantly, the perl interpreter accepts newbie code. It’s a good language to start programming with, and I am still a beginner programmer. If you find anything wrong with what I have stated above, simply let me know and I will fix it. I am forever a student of Awk, Sed, Grep, Vi and Regular Expressions.