Ticket #15 (open enhancement)
Improve URL validation
- Created: 2010-05-25 22:29:38
- Reported by: Reines
- Assigned to: ridgerunner
- Milestone: 2.0-alpha5
- Component: parser
- Priority: normal
The regular expression in the parser to match URLs could be improved, to handle IPv6 addresses for example.
It would also be smart to add URL validation to the URL field in users' profiles, since some people seem to put stupid things in there, which causes an invalid link (e.g. http://hello%20there) to be generated in their profile.
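For the IPv6 case mentioned in the description, a host check might look something like the sketch below. This is only an illustration: the helper name is_ipv6_host() is hypothetical (it is not part of the parser), and it assumes the host has already been extracted from the URL in its bracketed form, e.g. [2001:db8::1].

```php
<?php
// Hypothetical helper (not from the FluxBB code base): returns true when
// $host is a bracketed IPv6 literal such as "[2001:db8::1]".
function is_ipv6_host($host)
{
    // An IPv6 literal in a URL must be wrapped in square brackets
    if (strlen($host) < 2 || $host[0] !== '[' || substr($host, -1) !== ']')
        return false;

    // Strip the brackets and let PHP's filter extension validate the address
    $address = substr($host, 1, -1);
    return filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}
```

Something like this would only cover the host part; the rest of the URL would still need its own checks.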
History
Franz 2010-05-25 22:38:37
Reference topic (for developers): http://fluxbb.org/forums/viewtopic.php?id=3729
Reines 2010-06-28 16:28:33
Although this is marked for 1.4.1 it isn't exactly high priority so I think we'd be best doing this properly rather than rushing it.
taylorchu 2011-02-10 00:35:02
regexp would be really long to handle all the cases.
here is my solution:
function isURL($url)
{
    $oldURL = parse_url($url);
    if ($oldURL === false)
        return false;
    $newURL = '';
    // Strict version: both scheme and host are required
    if (!isset($oldURL['scheme']))
        return false;
    $newURL .= $oldURL['scheme'] . '://';
    if (!isset($oldURL['host']))
        return false;
    $newURL .= $oldURL['host'];
    if (isset($oldURL['path']))
        $newURL .= $oldURL['path'];
    if (isset($oldURL['query']))
        $newURL .= '?' . $oldURL['query'];
    // parse_url() names this component 'fragment', not 'anchor'
    if (isset($oldURL['fragment']))
        $newURL .= '#' . $oldURL['fragment'];
    return $url === $newURL;
}
taylorchu 2011-02-10 00:41:14
here is a less strict solution (I prefer this one).
The idea is that if it is a valid url, after slicing it and putting all pieces back together, it should still be itself.
function isURL($url)
{
    $oldURL = parse_url($url);
    if ($oldURL === false)
        return false;
    $newURL = '';
    // Less strict version: every component is optional
    if (isset($oldURL['scheme']))
        $newURL .= $oldURL['scheme'] . '://';
    if (isset($oldURL['host']))
        $newURL .= $oldURL['host'];
    if (isset($oldURL['path']))
        $newURL .= $oldURL['path'];
    if (isset($oldURL['query']))
        $newURL .= '?' . $oldURL['query'];
    // parse_url() names this component 'fragment', not 'anchor'
    if (isset($oldURL['fragment']))
        $newURL .= '#' . $oldURL['fragment'];
    return $url === $newURL;
}
Reines 2011-02-10 00:42:44
It isn't just a case of validating; we also need to match URLs in post text to make do_clickable work.
Also, to quote php.net:
This function is not meant to validate the given URL
ridgerunner 2011-02-10 01:24:56
See new "url_valid()" function in the new parser code (in functions.php). This function validates and decomposes a URI into its various components and returns an associative array containing the components. There is also a new do_clickable function replacement: "Linkify()". Both new functions are located at the end of functions.php.
ridgerunner 2011-03-12 04:16:12
I've added "url_valid()" to functions.php.
This function correctly validates HTTP, HTTPS and FTP URLs. It is used by the new parser, but I'm not sure where else it might be used. (I'm not going to hack on the old parser.)
For details on how the regular expressions were assembled, refer to my article: Regular Expression URI Validation
Franz 2011-03-12 09:22:16
- Milestone changed from 1.4.6 to 1.4.5.
Does this mean somebody else will yet have to add the function call to the parser code?
Reines 2011-03-12 10:31:45
I would be tempted to simply ignore the IPv6 problem in the parser for this, since as ridgerunner said it is a pain to handle simply due to how the parser is written, and it will be fixed by the new parser in 2.0.
The url_valid() function could simply be used to validate the website field in users' profiles, which was the other thing suggested.
It appears that fluxbb.org is not a valid url, but http://fluxbb.org is. Is it the case that the url has to be prefixed with a protocol or a prefix such as www or ftp?
ridgerunner 2011-03-13 07:16:11
Yes. url_valid() only validates absolute HTTP and FTP schemes and requires the scheme, unless the URL begins with www. or ftp. and is otherwise valid.
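The rule described above (scheme required, unless it can be inferred from a leading www. or ftp.) can be sketched roughly as follows. The helper names infer_scheme() and has_usable_scheme() are made up for illustration and are not taken from functions.php; the actual url_valid() may well do this differently, and this sketch only looks at the scheme, not at whether the rest of the URL is valid.

```php
<?php
// Hypothetical helper (name is an assumption, not from functions.php):
// prepend a scheme when the URL starts with "www." or "ftp.".
function infer_scheme($url)
{
    if (strpos($url, 'www.') === 0)
        return 'http://' . $url;
    if (strpos($url, 'ftp.') === 0)
        return 'ftp://' . $url;
    return $url; // nothing to infer, leave the URL unchanged
}

// Hypothetical helper: true when the URL has an explicit or inferable
// http, https or ftp scheme. It does not validate the rest of the URL.
function has_usable_scheme($url)
{
    return preg_match('#^(?:https?|ftp)://#i', infer_scheme($url)) === 1;
}
```

Under these assumptions, http://fluxbb.org and www.fluxbb.org pass the scheme check, while a bare fluxbb.org does not.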
taylorchu 2011-03-13 17:53:53
Why should we rely on regexp if php defines a function for us?
/*
this version of url_valid will return false if no host is detected, the url is malformed, or result != input
note that user, pass and port are not usually in a url, so we will return false for them
even if scheme is not defined, it will still succeed if the other components are valid.
*/
function url_valid($url)
{
    $oldURL = parse_url($url);
    if ($oldURL === false)
        return false;
    $newURL = '';
    if (isset($oldURL['scheme']))
        $newURL .= $oldURL['scheme'] . '://';
    // The host is the only mandatory component
    if (isset($oldURL['host']))
        $newURL .= $oldURL['host'];
    else
        return false;
    if (isset($oldURL['path']))
        $newURL .= $oldURL['path'];
    if (isset($oldURL['query']))
        $newURL .= '?' . $oldURL['query'];
    // parse_url() names this component 'fragment', not 'anchor'
    if (isset($oldURL['fragment']))
        $newURL .= '#' . $oldURL['fragment'];
    // Return the parsed components only if the rebuilt URL matches the input
    if ($url === $newURL)
        return $oldURL;
    else
        return false;
}
Reines 2011-03-13 18:00:48
Why should we rely on regexp if php defines a function for us?
As I said further up the thread, parse_url is not designed to validate a URL. The PHP documentation specifically says you should not use it for validation.
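To illustrate the point, here is a small sketch using the invalid example from the ticket description. As far as I can tell, parse_url() only splits the string into components and does not check whether the host is a legal host name, so a split-and-rebuild comparison on this input would still succeed.

```php
<?php
// The invalid example from the ticket description
$parts = parse_url('http://hello%20there');

// parse_url() splits the string without checking that the host is a
// legal host name, so it does not return false here
var_dump($parts !== false);
var_dump($parts['host']);

// Putting the pieces back together reproduces the input exactly, so a
// split-and-rebuild comparison cannot detect the bad host
$rebuilt = $parts['scheme'] . '://' . $parts['host'];
var_dump($rebuilt === 'http://hello%20there');
```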
taylorchu 2011-03-13 18:13:03
They mean that some people simply rely on the fact that parse_url will return false if the url is seriously malformed. My function checks "whether url contains and only contains what it should have."
It appears that ridgerunner's function validates on 2-letter combination such as:
http://ab
http://cd
http://yz
Reines 2011-03-19 20:23:08
- Milestone changed from 1.4.5 to 2.0-alpha5.
- Status changed from fixed to open.
Reines 2011-03-19 20:47:08
I've re-opened this and moved it to 2.0 - It isn't fixed at the moment since we still need to actually add checks/make them use the new function. I would say it's rather low priority though, so unless anyone specifically wants to do it soon then it can wait until 2.0.
ridgerunner 2011-04-30 02:39:38
For the same reason "program.py" is not a valid absolute URL. This function requires an explicit scheme (unless it can be inferred by a www. or ftp. first subdomain).

