Create A Simple Web Crawler in php
- Product & platform Engineering
Create A Simple Web Crawler in php
A Web Crawler is a program that crawls through the sites in the Web and find URL’s. Normally Search Engines uses a crawler to find URL’s on the Web. Google uses a crawler written in Python. There are some other search engines that uses different types of crawlers.
For Web crawling we have to perform following steps-
1.Firstly make url of page which we have to crawl.
2.Then we have to fetch link of that particular website.Following curl () function fetches link of website–
function curl($url) { // Assigning cURL options to an array $options = Array( CURLOPT_RETURNTRANSFER => TRUE, // Setting cURL's option to return the webpage data CURLOPT_FOLLOWLOCATION => TRUE, // Setting cURL to follow 'location' HTTP headers CURLOPT_AUTOREFERER => TRUE, // Automatically set the referer where following 'location' HTTP headers CURLOPT_HEADER=> TRUE, CURLOPT_CONNECTTIMEOUT => 1200, // Setting the amount of time (in seconds) before the request times out CURLOPT_TIMEOUT => 1200, // Setting the maximum amount of time for cURL to execute queries CURLOPT_MAXREDIRS => 10, // Setting the maximum number of redirections to follow CURLOPT_USERAGENT => "Googlebot/2.1 (+http://www.googlebot.com/bot.html)", // Setting the useragent CURLOPT_URL => $url, // Setting cURL's URL option with the $url variable passed into the function CURLOPT_ENCODING=>'gzip,deflate', ); $ch = curl_init(); // Initialising cURL curl_setopt_array($ch, $options); // Setting cURL's options using the previously assigned array data in $options $data = curl_exec($ch); // Executing the cURL request and assigning the returned data to the $data variable $httpcode = curl_getinfo($ch, CURLINFO_HTTP_CODE);//to check whether any error occur or not if($httpcode!="200") { return "error"; } return $data; // Returning the data from the function }
3.Function crawl is for crawling the website and get all links of the webpage.
function crawl($html) { $dom->loadHTML($html); $content = $dom->getElementsByTagname('a'); foreach ($content as $item) { echo $item->getAttribute('href'); } } 4.Finally we will call $results_page=curl($url); crawl($results_page);
Related content
Toll mangement and command centre with TMCC
We’re passionately committed to helping our clients and their customers thrive, working side by side to drive customer value and results..
A Smarter Health Safety Solution
We’re passionately committed to helping our clients and their customers thrive, working side by side to drive customer value and results..
Building fastest loan portal in India
We’re passionately committed to helping our clients and their customers thrive, working side by side to drive customer value and results..
Toll mangement and command centre with TMCC
We’re passionately committed to helping our clients and their customers thrive, working side by side to drive customer value and results...
Toll mangement and command centre with TMCC
We’re passionately committed to helping our clients and their customers thrive, working side by side to drive customer value and results..
Entreprise IT Transformation and Automation
We understand user and market, create product strategy and design experience for customers and employees to make breakthrough digital products and services