Create a Simple Web Crawler in PHP

Published On: 28 May 2014
  • Product & platform Engineering

A web crawler is a program that visits pages on the Web and collects the URLs it finds on them. Search engines normally use a crawler to discover URLs on the Web. Google's early crawler, for example, was written in Python, and other search engines use different types of crawlers.

[Figure: web crawler architecture (WebCrawlerArchitecture.svg)]

Web crawling involves the following steps:

1. First, build the URL of the page we want to crawl.

2. Then fetch the page at that URL. The following curl() function fetches the page and returns its HTML:

function curl($url) {
    // Assigning cURL options to an array
    $options = array(
        CURLOPT_RETURNTRANSFER => true,  // Return the page data instead of printing it
        CURLOPT_FOLLOWLOCATION => true,  // Follow 'Location' HTTP headers (redirects)
        CURLOPT_AUTOREFERER => true,     // Automatically set the referer when following redirects
        CURLOPT_HEADER => false,         // Keep response headers out of the returned data so only HTML reaches the parser
        CURLOPT_CONNECTTIMEOUT => 1200,  // Maximum time (in seconds) to wait while connecting
        CURLOPT_TIMEOUT => 1200,         // Maximum time (in seconds) the whole request may take
        CURLOPT_MAXREDIRS => 10,         // Maximum number of redirections to follow
        CURLOPT_USERAGENT => "Googlebot/2.1 (+http://www.googlebot.com/bot.html)",  // Setting the user agent
        CURLOPT_URL => $url,             // The URL passed into the function
        CURLOPT_ENCODING => 'gzip,deflate',  // Accept compressed responses; cURL decodes them
    );

    $ch = curl_init();                   // Initialising cURL
    curl_setopt_array($ch, $options);    // Applying the options assigned above

    $data = curl_exec($ch);              // Executing the request; returns the page data, or false on failure
    $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);  // HTTP status code of the final response

    // Treat a failed transfer or any status other than 200 as an error
    if ($data === false || $httpcode != 200) {
        return "error";
    }

    return $data;                        // Returning the page data from the function
}

3. The crawl() function parses the fetched HTML and prints every link found on the page.

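function crawl($html)
{
    $dom = new DOMDocument();            // The parser object was never created in the original snippet
    libxml_use_internal_errors(true);    // Collect warnings about malformed HTML instead of printing them
    $dom->loadHTML($html);
    $content = $dom->getElementsByTagName('a');  // Every <a> element on the page
    foreach ($content as $item)
    {
        echo $item->getAttribute('href'), PHP_EOL;  // One link per line
    }
}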
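The crawl() function above prints each href as it finds it. When the links need further processing, a variant that returns them as an array can be easier to work with. The following is a minimal sketch only: the name extract_links() and the rules for skipping empty and fragment-only links are assumptions for illustration, not part of the original article.

```php
<?php
// Hypothetical helper (not from the original article): returns links instead of echoing them.
function extract_links(string $html): array
{
    $links = [];
    if (trim($html) === '') {
        return $links; // DOMDocument::loadHTML() rejects an empty string
    }

    $dom = new DOMDocument();
    // Collect parser warnings internally instead of printing them for sloppy markup
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($dom->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        // Assumption: skip empty hrefs and in-page fragments such as "#top"
        if ($href === '' || $href[0] === '#') {
            continue;
        }
        $links[] = $href;
    }

    // Remove duplicates and re-index the array from 0
    return array_values(array_unique($links));
}

// Small offline demonstration
$sample = '<html><body><a href="/about">About</a> <a href="#top">Top</a> '
        . '<a href="http://example.com/">Home</a> <a href="/about">About again</a></body></html>';
print_r(extract_links($sample));
```

Note that relative links such as /about are returned unchanged; they would still have to be resolved against the page URL before they could be fetched.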

4. Finally, call the two functions, with $url holding the address built in step 1:

$results_page = curl($url);
if ($results_page !== "error") {
    crawl($results_page);
}
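The steps above handle a single page. To move from page to page, a crawler usually keeps a queue of URLs still to fetch and a record of URLs already tried. The following is a rough sketch of that idea under stated assumptions: crawl_site(), its $fetch and $maxPages parameters, and the example.com pages are all hypothetical, and only absolute links on the starting host are followed. The fetcher is passed in as a callable so the sketch can be run offline.

```php
<?php
// Hypothetical sketch (names are illustrative, not from the original article).
// $fetch is any callable that takes a URL and returns HTML, or "error" on failure,
// which matches the contract of the curl() function in step 2.
function crawl_site(string $startUrl, callable $fetch, int $maxPages = 10): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];   // URLs waiting to be fetched
    $visited = [];            // URL => true for every URL already attempted

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue;
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if (!is_string($html) || $html === '' || $html === 'error') {
            continue; // nothing to parse for pages that could not be fetched
        }

        $dom = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $dom->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($dom->getElementsByTagName('a') as $anchor) {
            $href = $anchor->getAttribute('href');
            // Assumption: only absolute links on the starting host are followed;
            // relative links are ignored to keep the sketch short.
            if (parse_url($href, PHP_URL_HOST) === $host && !isset($visited[$href])) {
                $queue[] = $href;
            }
        }
    }

    return array_keys($visited); // URLs in the order they were attempted
}

// Offline demonstration with canned pages instead of real HTTP requests
$pages = [
    'http://example.com/'  => '<a href="http://example.com/a">A</a> <a href="http://other.test/">X</a>',
    'http://example.com/a' => '<a href="http://example.com/">Home</a> <a href="/relative">R</a>',
];
$fakeFetch = function (string $url) use ($pages) {
    return $pages[$url] ?? 'error';
};
print_r(crawl_site('http://example.com/', $fakeFetch));
```

With the curl() function from step 2 defined, a call such as crawl_site($url, 'curl', 10) would fetch real pages instead. The $maxPages limit keeps the crawl from running indefinitely on a large site.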
