Create a Simple Web Crawler in PHP

Published on: 28 May 2014. By Auriga team.


A web crawler is a program that moves through sites on the Web and collects the URLs it finds. Search engines typically use crawlers to discover URLs. The early Google crawler, for example, was written in Python, and other search engines use crawlers of their own design.

To crawl a website, we perform the following steps:

1. First, set the URL of the page we want to crawl.
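
Step 1 can be as simple as assigning a string, but it may help to check that the start URL is well formed before fetching it. The following is a minimal sketch, not part of the original article: the helper name is_crawlable_url() and the address http://www.example.com/ are assumptions made for illustration.

```php
<?php
// Hypothetical helper (an assumption, not from the article): accept only
// absolute http/https URLs as a starting point for the crawler.
function is_crawlable_url($url) {
    // FILTER_VALIDATE_URL rejects strings that are not syntactically valid URLs
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }
    // Restrict the scheme so that ftp:, mailto: and similar URLs are skipped
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    return $scheme === 'http' || $scheme === 'https';
}

$url = "http://www.example.com/"; // Placeholder start page
```

A relative path or free text fails the first check, while a valid URL with another scheme fails the second.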

2. Next, fetch the content of that page. The following curl() function downloads the page using cURL:

function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // Return the webpage data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // Follow 'Location' HTTP headers
        CURLOPT_AUTOREFERER => true,     // Automatically set the referer when following 'Location' HTTP headers
        CURLOPT_HEADER => false,         // Keep response headers out of the returned data so only the HTML is parsed
        CURLOPT_CONNECTTIMEOUT => 1200,  // Maximum time (in seconds) to wait while connecting
        CURLOPT_TIMEOUT => 1200,         // Maximum time (in seconds) for the whole request
        CURLOPT_MAXREDIRS => 10,         // Maximum number of redirections to follow
        CURLOPT_USERAGENT => "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",  // Setting the user agent
        CURLOPT_URL => $url,             // The URL passed into the function
        CURLOPT_ENCODING => 'gzip,deflate',  // Accept compressed responses; cURL decodes them
    );

    $ch = curl_init();                  // Initialising cURL
    curl_setopt_array($ch, $options);   // Applying the options assigned above

    $data = curl_exec($ch);             // Executing the request; $data is false on a transport failure
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // HTTP status code (0 if no response was received)
    if ($httpcode != 200)
    {
        return "error";                 // Anything other than 200 OK is treated as a failure
    }

    return $data;   // Returning the page content
}

3. The crawl() function parses the fetched HTML and prints every link found on the page.

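function crawl($html)
{
    $dom = new DOMDocument();   // Create the DOM parser
    @$dom->loadHTML($html);     // Suppress warnings triggered by malformed real-world HTML
    $content = $dom->getElementsByTagName('a');   // Collect all anchor elements
    foreach ($content as $item)
    {
        echo $item->getAttribute('href'), "\n";   // Print each link on its own line
    }
}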
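
crawl() prints href values exactly as they appear in the markup, so relative links such as /about come out without a host. A possible variation, sketched here under simplifying assumptions, returns unique absolute URLs instead of printing them. The names resolve_href() and extract_links() are hypothetical, and the resolver ignores "../" segments and <base> tags, which a complete implementation would need to handle.

```php
<?php
// Hypothetical helper: turn an href found on the page $base into an absolute URL.
// Returns null for links a crawler would not follow.
function resolve_href($base, $href) {
    $href = trim($href);
    if ($href === '' || $href[0] === '#') {
        return null; // Empty link or in-page anchor
    }
    // The href carries its own scheme: keep http/https, drop mailto:, javascript:, ...
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return preg_match('#^https?://#i', $href) ? $href : null;
    }
    $parts = parse_url($base);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return null; // Base URL is unusable
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href; // Protocol-relative link
    }
    if ($href[0] === '/') {
        return $origin . $href; // Root-relative link
    }
    if ($href[0] === '?') {
        return $origin . $path . $href; // Same page, different query string
    }
    // Otherwise resolve against the directory of the base page
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Hypothetical variant of crawl(): returns the links instead of echoing them.
function extract_links($html, $base) {
    if (trim($html) === '') {
        return array(); // Nothing to parse
    }
    $dom = new DOMDocument();
    // Collect libxml warnings internally instead of printing them
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $absolute = resolve_href($base, $anchor->getAttribute('href'));
        if ($absolute !== null) {
            $links[$absolute] = true; // Array keys remove duplicates
        }
    }
    return array_keys($links);
}
```

Returning an array rather than echoing makes it possible to feed the links back into the fetch step.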

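4. Finally, we call the two functions in sequence:
$url = "http://www.example.com/";   // Step 1: the page to crawl (placeholder address)
$results_page = curl($url);         // Step 2: fetch the page
crawl($results_page);               // Step 3: print the links it contains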
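
The calls above handle a single page. To keep following the links that are found, a crawler usually keeps a queue of URLs to visit and a record of those already seen. The following is a rough sketch of that loop, not part of the original article: crawl_site(), its parameters, and the a.test addresses are hypothetical, and the fetch and link-extraction steps are passed in as callables so the example runs without network access.

```php
<?php
// Hypothetical driver: breadth-first crawl with a visited set and a page limit.
// $fetch:   function ($url) returning the HTML, or "error" on failure
// $extract: function ($html, $url) returning an array of absolute URLs
// Returns every URL that was attempted, in visiting order.
function crawl_site($start, $fetch, $extract, $maxPages = 10) {
    $queue = array($start);
    $visited = array();
    while (!empty($queue) && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // Already attempted
        }
        $visited[$url] = true;
        $html = call_user_func($fetch, $url);
        if ($html === "error") {
            continue; // Same failure marker that curl() returns
        }
        foreach (call_user_func($extract, $html, $url) as $link) {
            if (!isset($visited[$link])) {
                $queue[] = $link;
            }
        }
    }
    return array_keys($visited);
}

// Offline demonstration with an in-memory "web": each key is a page and
// each value is the list of links on that page.
$pages = array(
    'http://a.test/'  => array('http://a.test/1', 'http://a.test/2'),
    'http://a.test/1' => array('http://a.test/', 'http://a.test/2'),
    'http://a.test/2' => array('http://a.test/missing'),
);
$fetch = function ($url) use ($pages) {
    return isset($pages[$url]) ? $url : "error"; // The "HTML" is just the URL here
};
$extract = function ($html, $url) use ($pages) {
    return $pages[$url];
};
$seen = crawl_site('http://a.test/', $fetch, $extract);
```

For real pages, $fetch could be the curl() function and $extract a DOMDocument-based link extractor. The page limit keeps the loop from running indefinitely on large sites.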