Friday, September 23, 2011

DOMDocument: accessing other sites' HTML, bypassing the Same Origin Policy

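Below is code that retrieves the <a> tags from a webpage on another server using DOMDocument, a built-in PHP class that loads and parses HTML and XML. Because PHP makes the request on the server rather than in the browser, the browser's Same Origin Policy does not apply. This is an alternative to downloading the page with cURL and parsing the raw HTML string yourself. Note that loadHTMLFile still downloads the entire page, as cURL would; the difference is that the result is parsed into a DOM you can query.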



$keywords = array();
$domain = array('http://bing.com'); // websites to extract links from

$doc = new DOMDocument;
$doc->preserveWhiteSpace = FALSE;

foreach ($domain as $url) {
    // Load and parse the HTML at this URL; @ suppresses warnings about malformed markup
    @$doc->loadHTMLFile($url);
    $anchor_tags = $doc->getElementsByTagName('a'); // get <a> tags by accessing the DOM
    foreach ($anchor_tags as $tag) {
        $keywords[] = strtolower($tag->nodeValue); // store the lowercased link text
    }
}
echo '<pre>';
print_r($keywords);
echo '</pre>';
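The same DOM traversal can be tried without a network request. The sketch below is not from the original post: the helper name extractLinks and the sample markup are made up for illustration. It parses an HTML string with loadHTML, collects each link's href along with its text, and uses libxml_use_internal_errors() in place of the @ operator, so that parse warnings are collected rather than hidden.

```php
<?php
// Hypothetical helper (not part of the original post): returns the lowercased
// text and the href attribute of every <a> element found in an HTML string.
function extractLinks(string $html): array
{
    // Collect libxml parse warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous error setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $tag) {
        $links[] = [
            'text' => strtolower(trim($tag->nodeValue)),
            'href' => $tag->getAttribute('href'),
        ];
    }
    return $links;
}

// Sample markup, made up for illustration.
$html = '<html><body><p><a href="/images">Images</a> <a href="/maps">Maps</a></p></body></html>';
print_r(extractLinks($html));
```

To run this against a live page, as the code above does, loadHTML($html) could be swapped for loadHTMLFile($url); fetching a URL that way assumes allow_url_fopen is enabled in the PHP configuration.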
