Friday, January 28, 2011

Perl: Retrieve URLs with LWP and LWP::Simple

Perl offers many ways to make requests over the web. One is the LWP module. Below is an example that uses it to grab the contents of a web page:

use strict;
use warnings;
use Carp;
use LWP;

my $url = 'http://prystash.blogspot.com';
my $contents = get_contents_from($url);

print $contents;

# Fetch $url with a GET request and return the response body.
# Croaks if the request does not succeed.
sub get_contents_from {
    my ($url) = @_;

    my $agent    = LWP::UserAgent->new;
    my $request  = HTTP::Request->new(GET => $url);
    my $response = $agent->request($request);

    if (!$response->is_success) {
        croak "Could not get URL '$url'";
    }

    return $response->content;
}

Another, simpler method is to use the LWP::Simple module:

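use strict;
use warnings;
use Carp;
use LWP::Simple;

my $url = 'http://prystash.blogspot.com';
my $contents = get_contents_from($url);

print $contents;

# get() returns undef on failure, so test with defined() rather than
# truthiness; an empty page or a body of "0" is still a successful fetch.
sub get_contents_from {
    my ($url) = @_;
    my $contents = get($url);
    croak "Could not get URL '$url'" unless defined $contents;
    return $contents;
}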
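
Neither version says why a request failed. Here is a minimal sketch, assuming you want the HTTP status in the error message: the get method of LWP::UserAgent returns an HTTP::Response object, and its status_line method gives the status code and message together. The 10-second timeout and the command-line handling are illustrative assumptions, not part of the examples above.

```perl
use strict;
use warnings;
use Carp;
use LWP::UserAgent;

# Fetch $url and return the response body, croaking with the HTTP
# status line (code plus message) when the request fails.
sub get_contents_from {
    my ($url) = @_;

    # timeout => 10 is an illustrative choice, not from the original post.
    my $agent    = LWP::UserAgent->new(timeout => 10);
    my $response = $agent->get($url);

    if (!$response->is_success) {
        croak "Could not get URL '$url': " . $response->status_line;
    }

    return $response->content;
}

# Only fetch when a URL is given on the command line.
if (@ARGV) {
    print get_contents_from($ARGV[0]);
}
```

Run it with a URL as the first argument. LWP::Simple's get only returns undef on failure, so switching to LWP::UserAgent is what makes the status available.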

3 comments:

  1. Please, can you tell me how to refresh or reload the page?

  2. I'm not so sure you can, but wouldn't a refresh simply be another GET of the page?

    There's a little cookbook here, and I don't see anything regarding refreshing.

  3. Hi John, can you help me?
    I have a problem getting some URLs from www.nytimes.com. My goal is to fetch the newspaper for a given period.

    For some URLs, I get the form: http://www.nytimes.com/2014/04/27/opinion/sunday/the-media-has-a-woman-problem.html?hp&rref=opinion
    For others, I get: http://www.nytimes.com/2014/04/27/? , which is a useless page.

    --------------
    Here is my source:
    for (my $y=2014; $y <= $MaxYear; $y--) { #looping years
    for (my $m=1; $m <= $MaxMonth; $m++) { #looping months
    for (my $d=1; $d <= $MaxDays; $d++) { #looping days

    #my $URL =
    if ($d < 10 && $m < 10) {
    $URL= "http://www.nytimes.com/".$y."/".$digitZero.$m."/".$digitZero.$d."/";
    }
    elsif ($d > 9 && $m < 10) {
    $URL= "http://www.nytimes.com/".$y."/".$digitZero.$m."/".$d."/";
    }
    elsif ($d < 10 && $m > 10) {
    $URL= "http://www.nytimes.com/".$y."/".$m."/".$digitZero.$d."/";
    }
    else {
    $URL= "http://www.nytimes.com/".$y."/".$m."/".$d."/";
    }

    my ($content, $status, $is_Success, $resp) = robot_get($myURL);

    sub robot_get {
    my ($URL) = @_;
    my $browser = LWP::UserAgent->new unless $browser;
    my $response = $browser->get($URL);
    return($response->content,$response->status_line,$response->is_success, $response) if wantarray;
    return unless $response->is_success ;
    return $response->content;
    }
    ---------

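
On the last comment: the if/elsif chain that pads the month and day by hand can be replaced with a single sprintf. As posted, the third branch tests $m > 10 rather than $m > 9, so October with a single-digit day falls through to the final branch and the day is left unpadded. Here is a minimal sketch, assuming the URL layout shown in the comment; date_url is a hypothetical helper name.

```perl
use strict;
use warnings;

# Build a date-based URL of the form http://www.nytimes.com/YYYY/MM/DD/
# (layout taken from the comment). %02d pads month and day to two digits,
# so no if/elsif chain is needed. date_url is a hypothetical helper.
sub date_url {
    my ($year, $month, $day) = @_;
    return sprintf 'http://www.nytimes.com/%04d/%02d/%02d/',
        $year, $month, $day;
}

print date_url(2014, 4, 27), "\n";   # http://www.nytimes.com/2014/04/27/
print date_url(2014, 10, 5), "\n";   # http://www.nytimes.com/2014/10/05/
```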