
Ticket #15 (open enhancement)

Improve URL validation

  • Created: 2010-05-25 22:29:38
  • Reported by: Reines
  • Assigned to: ridgerunner
  • Milestone: 2.0-alpha5
  • Component: parser
  • Priority: normal

The regular expression in the parser to match URLs could be improved, to handle IPv6 addresses for example.
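As a rough illustration of the IPv6 case, here is a minimal sketch that checks whether a URL host is a bracketed IPv6 literal. The helper name is_ipv6_host_literal() is hypothetical and not part of FluxBB; it leans on PHP's filter extension rather than on the parser's regular expression, so it only shows the idea, not how the parser would actually do it.

```php
<?php
// Hypothetical helper (not FluxBB code): returns true if $host is a
// bracketed IPv6 literal such as "[2001:db8::1]".
function is_ipv6_host_literal($host)
{
    // Inside a URL, an IPv6 address must be wrapped in square brackets.
    if (strlen($host) < 4 || $host[0] !== '[' || substr($host, -1) !== ']')
        return false;

    // Strip the brackets and let the filter extension judge the address.
    $address = substr($host, 1, -1);
    return filter_var($address, FILTER_VALIDATE_IP, FILTER_FLAG_IPV6) !== false;
}

var_dump(is_ipv6_host_literal('[2001:db8::1]')); // bool(true)
var_dump(is_ipv6_host_literal('[hello there]')); // bool(false)
var_dump(is_ipv6_host_literal('2001:db8::1'));   // bool(false), brackets missing
```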

It would also be smart to add URL validation to the URL field in users' profiles, since some people seem to put stupid things in there, which causes an invalid link (e.g. http://hello%20there) to be generated in their profile.

History

FSX 2010-06-21 11:10:55

  • Owner set to FSX.

Franz 2010-06-26 22:32:25

I know that Jeff (ridgerunner) had some ideas on this topic.

Reines 2010-06-28 16:28:33

Although this is marked for 1.4.1 it isn't exactly high priority so I think we'd be best doing this properly rather than rushing it.

Reines 2010-06-30 21:55:44

  • Milestone changed from 1.4.1 to 2.0-beta1.

FSX 2010-08-11 19:11:37

  • Owner FSX removed.

Reines 2011-01-30 22:40:17

  • Milestone changed from 2.0-beta1 to 1.4.5.

Reines 2011-02-08 12:56:42

  • Owner set to ridgerunner.

taylorchu 2011-02-10 00:35:02

regexp would be really long to handle all the cases.

here is my solution:

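function isURL($url)
{
    // Split the URL into its components; false means it is seriously malformed
    $oldURL = parse_url($url);
    if($oldURL === false)
        return false;
    $newURL = '';
    // Strict version: both scheme and host are required
    if(!isset($oldURL['scheme']))
        return false;
    $newURL .= $oldURL['scheme'].'://';
    if(!isset($oldURL['host']))
        return false;
    $newURL .= $oldURL['host'];
    if(isset($oldURL['path']))
        $newURL .= $oldURL['path'];
    if(isset($oldURL['query']))
        $newURL .= '?'.$oldURL['query'];
    // parse_url() stores the part after '#' under the 'fragment' key
    if(isset($oldURL['fragment']))
        $newURL .= '#'.$oldURL['fragment'];
    // Valid only if the rebuilt URL is identical to the input
    return $url === $newURL;
}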

taylorchu 2011-02-10 00:41:14

here is a less strict solution(I prefer this one.)
The idea is that if it is a valid url, after slicing it and putting all pieces back together, it should still be itself.


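function isURL($url)
{
    // Split the URL into its components; false means it is seriously malformed
    $oldURL = parse_url($url);
    if($oldURL === false)
        return false;
    $newURL = '';
    // Every component is optional here; only the ones present are put back
    if(isset($oldURL['scheme']))
        $newURL .= $oldURL['scheme'].'://';
    if(isset($oldURL['host']))
        $newURL .= $oldURL['host'];
    if(isset($oldURL['path']))
        $newURL .= $oldURL['path'];
    if(isset($oldURL['query']))
        $newURL .= '?'.$oldURL['query'];
    // parse_url() stores the part after '#' under the 'fragment' key
    if(isset($oldURL['fragment']))
        $newURL .= '#'.$oldURL['fragment'];
    // Valid only if the rebuilt URL is identical to the input
    return $url === $newURL;
}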

Reines 2011-02-10 00:42:44

It isn't just a case of validating, we also need to make URLs to make do_clickable work.

Also, to quote php.net:

parse_url wrote:

This function is not meant to validate the given URL

Reines 2011-02-10 00:43:42

urh sorry that was meant to say "need to match"... long day...

taylorchu 2011-02-10 01:11:27

is it only for do_clickable or more at somewhere else?

ridgerunner 2011-02-10 01:24:56

See new "url_valid()" function in the new parser code (in functions.php). This function validates and decomposes a URI into its various components and returns an associative array containing the components. There is also a new do_clickable function replacement: "Linkify()". Both new functions are located at the end of functions.php.

Reines 2011-03-11 09:24:23

  • Milestone changed from 1.4.5 to 1.4.6.

ridgerunner 2011-03-12 04:16:12

I've added "url_valid()" to functions.php.

This function correctly validates HTTP, HTTPS and FTP URLs. It is used by the new parser, but I'm not sure where it might be used elsewhere. (I'm not going to hack on the old parser.)

For details on how the regular expressions were assembled, refer to my article: Regular Expression URI Validation

ridgerunner 2011-03-12 04:16:51

  • Status changed from open to fixed.

Franz 2011-03-12 09:22:16

  • Milestone changed from 1.4.6 to 1.4.5.

Does this mean somebody else will yet have to add the function call to the parser code?

Reines 2011-03-12 10:31:45

For this ticket I would be tempted to simply ignore the IPv6 problem in the parser, since as ridgerunner said it is a pain to handle simply due to how the parser is written, and it will be fixed by the new parser in 2.0.

The url_valid() function could simply be used to validate the website field in users' profiles, which was the other thing suggested.

quy 2011-03-13 06:53:50

It appears that fluxbb.org is not a valid url, but http://fluxbb.org is. Is it the case that the url has to be prefixed with a protocol or a prefix such as www or ftp?

ridgerunner 2011-03-13 07:16:11

Yes. url_valid() only validates absolute HTTP and FTP schemes and requires the scheme, unless the URL begins with www. or ftp. and is otherwise valid.
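The scheme rule described above can be sketched roughly as follows. This is only an illustration and not the code in functions.php: the helper name infer_scheme() is hypothetical, and mapping a www. prefix to http:// and an ftp. prefix to ftp:// is an assumption about which scheme gets inferred. It does not validate the rest of the URL.

```php
<?php
// Hypothetical helper (not the actual url_valid() code): returns the URL
// with a scheme, or false when no scheme is given and none can be inferred.
function infer_scheme($url)
{
    // An explicit http, https or ftp scheme is kept as it is.
    if (preg_match('%^(?:https?|ftp)://%i', $url))
        return $url;

    // Assumption: a leading "www." implies http, a leading "ftp." implies ftp.
    if (preg_match('%^www\.%i', $url))
        return 'http://'.$url;
    if (preg_match('%^ftp\.%i', $url))
        return 'ftp://'.$url;

    // Nothing to go on, e.g. "fluxbb.org" or "program.py".
    return false;
}

var_dump(infer_scheme('http://fluxbb.org')); // string "http://fluxbb.org"
var_dump(infer_scheme('www.fluxbb.org'));    // string "http://www.fluxbb.org"
var_dump(infer_scheme('fluxbb.org'));        // bool(false)
```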

Franz 2011-03-13 08:38:57

Huh? It is valid if you don't recognize the format?

taylorchu 2011-03-13 17:53:53

Why should we rely on regexp if php defines a function for us?


/*
this version of url_valid will return false if no host is detected, mal-formed, or result != input

note that user, pass, port is not usually in url, so we will return false

even if scheme is not defined, it will still return true if other components are valid.
*/
function url_valid($url)
{
    // Split the URL into its components; false means it is seriously malformed
    $oldURL = parse_url($url);
    if($oldURL === false)
        return false;
    $newURL = '';
    if(isset($oldURL['scheme']))
        $newURL .= $oldURL['scheme'].'://';

    // The host is the only required component
    if(isset($oldURL['host']))
        $newURL .= $oldURL['host'];
    else
        return false;

    if(isset($oldURL['path']))
        $newURL .= $oldURL['path'];

    if(isset($oldURL['query']))
        $newURL .= '?'.$oldURL['query'];

    // parse_url() stores the part after '#' under the 'fragment' key
    if(isset($oldURL['fragment']))
        $newURL .= '#'.$oldURL['fragment'];

    // Return the components only if the rebuilt URL is identical to the input
    if($url === $newURL)
        return $oldURL;
    else
        return false;
}

Reines 2011-03-13 18:00:48

taylorchu wrote:

Why should we rely on regexp if php defines a function for us?

As I said further up the thread, parse_url is not designed to validate a URL. The PHP documentation specifically says you should not use it for validation.
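A quick way to see this is to feed parse_url() the malformed link from the ticket description. The snippet below is only a small demonstration, not FluxBB code: the function splits the string into components and hands back the bad host without complaint.

```php
<?php
// The malformed link quoted in the ticket description.
$parts = parse_url('http://hello%20there');

// parse_url() only splits the string; it does not reject the bad host.
var_dump($parts !== false);
var_dump($parts['scheme']);
var_dump($parts['host']);
```

So a check built on parse_url() still needs its own rules for what each component may contain.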

taylorchu 2011-03-13 18:13:03

they mean that some people simply rely on the feature that parse_url will return false if url is seriously mal-formed. my function checks "whether url contains and only contains what it should have."

quy 2011-03-18 21:30:59

It appears that ridgerunner's function validates on 2-letter combination such as:
http://ab
http://cd
http://yz

ridgerunner 2011-03-19 07:14:10

And a two letter top level domain is a perfectly valid host!

Reines 2011-03-19 20:23:08

  • Milestone changed from 1.4.5 to 2.0-alpha5.
  • Status changed from fixed to open.

Reines 2011-03-19 20:47:08

I've re-opened this and moved it to 2.0 - It isn't fixed at the moment since we still need to actually add checks/make them use the new function. I would say it's rather low priority though, so unless anyone specifically wants to do it soon then it can wait until 2.0.

quy 2011-04-30 01:48:46

Why isn't fluxbb.org (without protocol and www.) considered a valid URL format?

ridgerunner 2011-04-30 02:39:38

For the same reason "program.py" is not a valid absolute URL. This function requires an explicit scheme (unless it can be inferred by a www. or ftp. first subdomain).