The purpose of this tool is simply to output the URLs and titles found in an HTML fragment or at a URL. The DMOZ URL cleaning engine (used by "Add a page of links to unreviewed") sometimes fails to recognize the URLs inside a web page and, instead of returning all of them, returns only one or two. To compare, test both tools (the official DMOZ one and this one) with the URL http://grandeminas.globo.com/unainet/index_jornais.htm. Clean HTML still has some improvements pending, but it works.
Usage
Enter the HTML fragment or the URL. If you choose the output type "URL and Titles", you will get an HTML fragment that can be parsed by the official DMOZ multilinks tool.
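As a rough illustration of the "URL and Titles" output, the sketch below pulls URL and title pairs out of a fragment and rebuilds them as plain anchors. It is only a sketch under stated assumptions: the function names extract_links and links_to_fragment and the sample markup are hypothetical and not part of the tool, and the one-anchor-per-line output is an assumed format, since the exact input the DMOZ multilinks tool expects is not described here.

```php
<?php
// Hypothetical helper for illustration only; not the tool's own code.
function extract_links(string $html): array
{
    // Group 1 captures the href value, group 2 the anchor text.
    $pattern = '/<a[^>]*href=["\']?([^"\'> ]*)[^>]*>(.*?)<\/a>/i';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Drop nested tags and surrounding whitespace from the title.
        $links[] = ['url' => $m[1], 'title' => trim(strip_tags($m[2]))];
    }
    return $links;
}

// Hypothetical helper: rebuild a minimal fragment, one plain anchor per line
// (assumed output format).
function links_to_fragment(array $links): string
{
    $out = '';
    foreach ($links as $link) {
        $out .= sprintf(
            "<a href=\"%s\">%s</a>\n",
            htmlspecialchars($link['url']),
            htmlspecialchars($link['title'])
        );
    }
    return $out;
}

// Made-up sample fragment with extra attributes and nested markup.
$sample = '<p><a class="x" href="http://example.com/a"><b>First</b> paper</a> and '
        . "<a href='http://example.org/b'>Second</a></p>";

$links = extract_links($sample);
echo count($links), " urls found.\n";
echo links_to_fragment($links);
```

A regular expression like this is convenient for loosely formed fragments, but it can miss anchors whose text spans several lines or whose attributes contain a ">" character.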
<?php
/*
/* @description Parse an HTML fragment and return only the URLs
/* and titles.
/***************************************************************/
/*
* defining things
*/
define("ERROR_NO_HREF",-1);
define("ERROR_NO_TYPE",-2);
define("ERROR_URL_ERROR",-3);
define("ERROR_NONE",1);
define("TYPE_URL_TITLE",1);
define("TYPE_URL",2);
$text = $_POST["text"];
$preg = <<<EOF
/<a[^>]*href=["']?([^"'> ]*)[^>]*>(.*?)<\/a>/i
EOF;
define("URL_PREG",$preg);
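// set_magic_quotes_runtime() was removed in PHP 7; call it only where it still exists.
if (function_exists("set_magic_quotes_runtime"))
{
set_magic_quotes_runtime(0);
}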
if (!empty($_POST["url"]))
{
$urlextern = @file($_POST["url"]);
if ($urlextern == FALSE)
{
$errorcode = ERROR_URL_ERROR;
}
else
{
$text = join('',$urlextern);
}
}
if (!empty($text) && empty($errorcode))
{
$errorcode = clearhtml($text, $_POST["type"], $cleaned, $totalurl);
}
if ($errorcode == ERROR_NONE)
{
printf("
Parsed html
%d urls found.
%s
",$totalurl,htmlentities($cleaned));
}
if ($errorcode < 0)
{
print "Error: ";
switch ($errorcode)
{
case ERROR_NO_HREF:
print "Cannot get any url.";
break;
case ERROR_NO_TYPE:
print "Invalid type parameter.";
break;
case ERROR_URL_ERROR:
print "Cannot open given URL.";
break;
}
print "";
unset($text);
}
if (empty($text))
{
?>