Wednesday, February 4, 2009

Ruby - Using Ruby to easily scrape/search/spider a web page for "things" - Part 1

Hello,

This is part 1 of the multi-part post "Using Ruby to easily scrape/search/spider a web page for 'things'".

This part contains the code necessary to scrape a web site for a specific thing; in this example, a heading.

The other parts of this multi-part post will explain specific sections of the code and will demonstrate how to spider through a website by scraping for links.

The following code can be used to scrape/search/spider a web site for headings:

#!/sw/bin/ruby
require 'open-uri'
require 'pp'

spider_url = "http://robertpyke.com/"
pp "Looking up #{spider_url}"

# The parentheses mark what we want to capture
heading_pattern = /<h[0-9]>(.*?)<\/h[0-9]>/

headings = []
# URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
URI.open(spider_url) do |f|
  f.each do |line|
    matchdata = line.scan(heading_pattern)
    matchdata.each do |match| # Each match is an array of captured strings
      match.each do |string| # Each string within a match
        headings << string # Store the heading we found
      end
    end
  end
end
pp headings # Print the headings we found
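
To see why the code above loops twice over the result of scan, it may help to run the same pattern against a single line. The snippet below is only a small sketch: sample_line is a made-up string standing in for one line of a fetched page, and flatten is shown as one possible shorthand for the two inner loops.

```ruby
require 'pp'

# Same pattern as in the post above.
heading_pattern = /<h[0-9]>(.*?)<\/h[0-9]>/

# Hypothetical sample line, standing in for one line of a fetched page.
sample_line = "<h1>Welcome</h1><p>intro</p><h2>News</h2>"

# Because the pattern has a capture group, scan returns an array of arrays:
# one inner array per match, holding one string per capture group.
matches = sample_line.scan(heading_pattern)
pp matches

# flatten collapses the nesting, which has the same effect as the two inner loops.
headings = matches.flatten
pp headings
```

With a single capture group each inner array holds exactly one string, which is why flattening is enough here.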


Note: The regex used to capture the heading is by no means perfect. It is provided as a simple starting point for people wanting to scrape web pages. Both the limitations of this regex and a more advanced regex example will be covered in a later part of this multi-part blog post.

4 comments:

simon said...

You've got a problem with multiline headings. What you want to do is use scan on the entire HTML content, and use /m at the end of your regex. I'll expect a follow-up post. :)
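
The suggestion in this comment could be sketched roughly as follows. This is an illustration only: the html string is a made-up stand-in for the full page body (which, with open-uri, could presumably be obtained via URI.open(spider_url).read), and the whitespace clean-up at the end is an extra step that was not part of the suggestion.

```ruby
require 'pp'

# The /m flag lets "." match newlines, so a heading may span several lines.
heading_pattern = /<h[0-9]>(.*?)<\/h[0-9]>/m

# Hypothetical page content, held in one string instead of being read line by line.
html = "<h1>First\nheading</h1>\n<p>text</p>\n<h2>Second</h2>\n"

# Scan the whole document at once and flatten the one-element match arrays.
headings = html.scan(heading_pattern).flatten

# Collapse the line breaks inside a heading into single spaces (an added assumption).
headings = headings.map { |heading| heading.gsub(/\s+/, " ").strip }
pp headings
```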

simon said...

second point, consider using this

Robert Pyke said...

Do not fear, there will be a follow-up post. I have kept the regex nice and simple so as to build upon it. The regex I have used doesn't support attributes such as classes within a heading tag. A heading won't be ignored if it is inside a comment. The regex will match incorrectly formatted headings such as <h3>lol</h4>. Also, as you point out, I scan on a per-line basis, meaning it will never match a heading declared across multiple lines. I intend to step through the code and build upon it. As you have probably noticed from my recent posts, I am attempting to explain almost everything I do. This code is simply "play around" code to give some context to my upcoming tutorial posts. Oh, and I have played with hpricot briefly and yes, it is awesome.
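
Some of the limitations listed above can be addressed with a slightly stricter pattern. The following is a minimal sketch, not the regex from the promised follow-up post: strict_pattern and the sample html are made up for illustration. It allows attributes in the opening tag, uses a backreference so the closing tag must carry the same level as the opening tag, and uses /m for multi-line headings. It still does not skip headings inside HTML comments.

```ruby
require 'pp'

# ([1-6])  captures the heading level
# [^>]*    allows attributes such as class="..." in the opening tag
# \1       requires the closing tag to use the same level as the opening tag
# /m       lets "." match newlines
strict_pattern = /<h([1-6])\b[^>]*>(.*?)<\/h\1>/m

# Hypothetical sample document.
html = <<~HTML
  <h2 class="title">Styled heading</h2>
  <h3>lol</h4>
  <h1>Spans
  two lines</h1>
HTML

# Each match is now [level, text]; keep the text and tidy its whitespace.
headings = html.scan(strict_pattern).map do |_level, text|
  text.gsub(/\s+/, " ").strip
end
pp headings
```

The mismatched <h3>lol</h4> is skipped here only because no </h3> appears later in the sample. In a real page a later </h3> would be picked up as the closing tag, which is one reason a proper HTML parser such as hpricot tends to be more robust than a regex.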

Wolf said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.

http://icfun.blogspot.com/2008/04/ruby-regular-expression-handling.html

This link is about Ruby regex. I posted it when I was first learning Ruby regex, so it should be helpful for new coders.