Thursday, February 5, 2009

Ruby - Regex Non-Greedy Operator (?) - Using Ruby to easily scrape/search/spider a web page for "things" - Part 2

Hello again,

This is part 2 of the "Using Ruby to easily scrape/search/spider a web page for "things"" multi-part post.

In this post I will explain the use of the non-greedy regex operator. In the spider example used in part 1, I scraped for all uses of the HTML heading elements (h1 through h6). More specifically, I was looking for all matches of the regex: <h[0-9]>(.*?)</h[0-9]>

The following breaks down and explains the above regex:
<h[0-9]> → Find <h followed by one digit ([0-9]) followed by >
.*? → Find any character except a newline (.) zero or more times (*) non-greedily (?)
</h[0-9]> → Find </h followed by one digit ([0-9]) followed by >

The parentheses () mark what I want to capture from my regex match. In this case it is the actual heading text; I don't want to capture the <h[0-9]> or the </h[0-9]>.
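
The breakdown above can be tried out with a small Ruby sketch. This is only an illustration, not the code from part 1: the sample HTML, the constant names and the extract_headings method are all made up for this example, and it assumes the headings carry no attributes and each sit on a single line.

```ruby
# Hypothetical sample page; a real spider would fetch this HTML instead.
SAMPLE_HTML = <<~HTML
  <html><body>
  <h1>Main Title</h1>
  <p>Some text</p>
  <h2>Section One</h2>
  <h2>Section Two</h2>
  </body></html>
HTML

# The regex from this post; the forward slash is escaped inside the /.../ literal.
HEADING_REGEX = /<h[0-9]>(.*?)<\/h[0-9]>/

# String#scan returns one array per match, holding that match's captured groups.
# There is a single capture group here, so the first element is the heading text.
def extract_headings(html)
  html.scan(HEADING_REGEX).map { |captures| captures.first }
end

puts extract_headings(SAMPLE_HTML)
```

Because of the parentheses, scan hands back only the heading text and not the surrounding tags.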

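The non-greedy operator (?) makes the quantifier before it match as few characters as possible: the regex engine stops extending .*? at the first point where the rest of the pattern can match. A greedy .* does the opposite. It first consumes as much as it can, then gives characters back only until the rest of the pattern matches, so it ends at the last </h[0-9]> on the line rather than the first. In the above example, the non-greedy operator was used to prevent the .* from running across several headings when more than one appears on the same line. With only a single heading on the line, both versions capture the same text. The following examples demonstrate greedy vs non-greedy:

Example 1: Greedy:
Regex: <h[0-9]>.*</h[0-9]>
Input: <h3>My Title</h3><h4>My Subtitle</h4>
<h[0-9]> matches: <h3>
.* matches: My Title</h3><h4>My Subtitle
</h[0-9]> matches: </h4>

Example 2: Non-Greedy:
Regex: <h[0-9]>.*?</h[0-9]>
Input: <h3>My Title</h3><h4>My Subtitle</h4>
<h[0-9]> matches: <h3>
.*? matches: My Title
</h[0-9]> matches: </h3>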
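
To see the difference in Ruby itself, here is a minimal sketch. It assumes an input string with two headings on one line (a hypothetical string, chosen because the two regexes only behave differently when more than one closing tag is available), and the constant names are made up for this example.

```ruby
# Hypothetical input: two headings on the same line.
INPUT = "<h3>My Title</h3><h4>My Subtitle</h4>"

GREEDY     = /<h[0-9]>(.*)<\/h[0-9]>/   # .*  takes as much as it can
NON_GREEDY = /<h[0-9]>(.*?)<\/h[0-9]>/  # .*? takes as little as it can

# scan returns an array of capture arrays; flatten leaves just the captured strings.
GREEDY_CAPTURES     = INPUT.scan(GREEDY).flatten
NON_GREEDY_CAPTURES = INPUT.scan(NON_GREEDY).flatten

p GREEDY_CAPTURES      # => ["My Title</h3><h4>My Subtitle"]
p NON_GREEDY_CAPTURES  # => ["My Title", "My Subtitle"]
```

The greedy version finds one match that swallows both headings, while the non-greedy version finds each heading separately.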

Further Reading:
Ruby regex, quick reference guide
Ruby-doc, user's guide to regex
Ruby API: Regexp
Rubular: A Ruby regular expression editor (Interactive)

4 comments:

simon said...

To save some typing, use \d instead of [0-9]. There are a few of those, e.g. \s (whitespace), \S (non-whitespace), \w (word char).

Robert Pyke said...

I will add a link to further reading with regards to regex. That said, I am intentionally choosing not to use many shortcut regex tokens in my post. This is because I am trying to give a "simple" tutorial. I imagine it is easier for a regex newcomer to read regex with character classes in it rather than the regex shortcuts.

Demon said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

Here is a post about Ruby regex. I wrote it when I first learned Ruby regex, so it will be helpful for new coders.

Tejuteju said...

Thank you. This was a nice post with very helpful information.