php - Most efficient way of scraping web pages
I am using cURL to get the contents of a page and then scraping it with Simple HTML DOM. I have to scrape thousands of pages, and the code takes a lot of time to execute. I am looking for methods to speed it up, or an alternative to cURL that is more efficient and less time-consuming. I know I can use file_get_contents, but it gives me an error after a few calls.
So my question is:
How can I scrape web pages in bulk with the best efficiency?
I am sharing my code below. It would be great if you could point out how to speed it up. Any help is appreciated.
<?php
include_once('simple_html_dom.php');

// Fetch a URL with cURL and return it as a simple_html_dom object.
function do_it_with_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $server_output = curl_exec($ch);
    $error = curl_error($ch);
    $errno = curl_errno($ch);
    curl_close($ch);
    return str_get_html($server_output);
}

// Return the href or plain text of a DOM node, or false if the lookup failed.
function check_response_object($str_obj, $type) {
    if (is_object($str_obj)) {
        if ($type == 'url') {
            return $str_obj->href;
        } else if ($type == 'text') {
            return $str_obj->plaintext;
        }
    } else {
        return false;
    }
}

$scrap_url = '';
$scrap_err = '';
if ($_SERVER["REQUEST_METHOD"] == "POST") {
    if (empty($_POST["scrap_url"])) {
        $scrap_err = "URL required";
    } else {
        $scrap_url = $_POST["scrap_url"];
        // Stream the CSV straight to the browser as a download.
        header('Content-Type: text/csv; charset=utf-8');
        header('Content-Disposition: attachment; filename=yellow-pages.csv');
        $output = fopen('php://output', 'w');
        fputcsv($output, array('name', 'website', 'email', 'phone', 'address', 'reference url'));
        $url = $scrap_url;
        // Follow rel="next" pagination links until the last page.
        do {
            $html = do_it_with_curl($url);
            $next_page = check_response_object($html->find('[rel="next"]', 0), 'url');
            $results = $html->find('div.organic div.result');
            foreach ($results as $single_result) {
                $item = array();
                $next_url = check_response_object($single_result->find('a.business-name', 0), 'url');
                $next_html = do_it_with_curl($next_url);
                if ($next_html) {
                    $item['name'] = check_response_object($next_html->find('h1[itemprop="name"]', 0), 'text');
                    $item['website'] = check_response_object($next_html->find('a.website-link', 0), 'url');
                    // substr(..., 7) strips the leading "mailto:".
                    $item['email'] = substr(check_response_object($next_html->find('a.email-business', 0), 'url'), 7);
                    $item['phone'] = check_response_object($next_html->find('p.phone', 0), 'text');
                    $item['address'] = check_response_object($next_html->find('h2[itemprop="address"]', 0), 'text');
                    $item['ypref'] = strtok($next_url, '?');
                }
                fputcsv($output, $item);
            }
            $url = $next_page;
        } while ($next_page);
        exit();
    }
}
?>
<form method="post" action="<?php echo htmlspecialchars($_SERVER["PHP_SELF"]); ?>">
    URL: <input type="text" name="scrap_url" value="<?php echo $scrap_url; ?>" style="width:80%;">
    <span class="error">* <?php echo $scrap_err; ?></span>
    <br><br>
    <input type="submit" name="submit" value="Submit">
</form>
The bottleneck is not cURL. It is that the scraping operations happen sequentially. When you have thousands of web pages to scrape, the delays add up: the HTTP request to fetch each page, the parsing operation to read the HTML, and the file operation to save the results to CSV.
You will not get a performance improvement by switching the cURL mechanism. A better approach is to make the algorithm multi-threaded, so that the scraping operations happen in parallel.
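As an aside, the HTTP requests alone can be overlapped without threads by using PHP's built-in curl_multi API, which drives many cURL transfers concurrently in one process. A minimal sketch, where fetch_all() is a hypothetical helper and the list of URLs to fetch is assumed:

<?php
// Minimal sketch: fetch many URLs concurrently with curl_multi.
function fetch_all(array $urls) {
    $mh = curl_multi_init();
    $handles = array();
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }
    // Drive every transfer until all of them have finished.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($active && $status == CURLM_OK);
    $bodies = array();
    foreach ($handles as $url => $ch) {
        $bodies[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}

$pages = fetch_all(array('http://example.com/a', 'http://example.com/b'));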
You can do multi-threading in PHP using pthreads:
pthreads is an object-orientated API that allows user-land multi-threading in PHP. It includes all the tools you need to create multi-threaded applications targeted at the Web or the Console. PHP applications can create, read, write, execute and synchronize with Threads, Workers and Stackables.
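For instance, a minimal sketch of that API, assuming a thread-safe (ZTS) build of PHP with the pthreads extension, run from the CLI; FetchThread is a hypothetical name:

<?php
// Minimal pthreads sketch: one thread fetching one page.
class FetchThread extends Thread {
    public $url;
    public $body;

    public function __construct($url) {
        $this->url = $url;
    }

    public function run() {
        // Executes in its own thread once start() is called.
        $this->body = file_get_contents($this->url);
    }
}

$thread = new FetchThread('http://example.com/');
$thread->start(); // spawn the thread
$thread->join();  // block until it finishes
echo strlen($thread->body), "\n";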
You can create a thread pool, with 20 threads for example, and have a separate thread handle each iteration of the result loop:

foreach ($results as $single_result) { ... }
You can find a simple example of multi-threading and thread pools in the PHP documentation, and more examples with a Google search. A sketch of the pool applied to this scraper follows.
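This sketch assumes the pthreads extension is available and reuses do_it_with_curl(), check_response_object() and $results from the question's script; ScrapeTask is a hypothetical name:

<?php
// Sketch only: requires the pthreads extension (ZTS PHP, CLI).
include_once('simple_html_dom.php');

class ScrapeTask extends Threaded {
    private $url;

    public function __construct($url) {
        $this->url = $url;
    }

    public function run() {
        // Each task fetches and parses one business page in parallel.
        $html = do_it_with_curl($this->url);
        if ($html) {
            $name = check_response_object($html->find('h1[itemprop="name"]', 0), 'text');
            // ...collect the other fields the same way. Do not call
            // fputcsv() from many threads at once; hand the rows to a
            // single synchronized writer instead, or rows will interleave.
        }
    }
}

$pool = new Pool(20); // 20 worker threads, as suggested above

foreach ($results as $single_result) {
    $next_url = check_response_object($single_result->find('a.business-name', 0), 'url');
    $pool->submit(new ScrapeTask($next_url));
}

$pool->shutdown(); // blocks until all submitted tasks have run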