Thread ID: 81295 2007-07-23 07:54:00 Validating URLs in a form using PHP Morgenmuffel (187) Press F1
Post ID Timestamp Content User
571721 2007-07-23 07:54:00 You think I can get a straight answer on the net? My god, there is so much contradictory stuff out there

anyhow

what I want is to be able to take in URLs in the forms

example.com or
www.example.com or
http://www.example.com

and store them all as example.com

but I can't seem to get it to work. I've tried using parse_url() but keep getting errors when the scheme isn't there

so any help would be appreciated
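For what it's worth, a minimal sketch of one way to do this with parse_url() - the trick is to prepend a scheme when one is missing, since parse_url() puts a scheme-less URL entirely in the 'path' component. The function name normalise_url() is made up for this sketch:

```php
<?php
// One possible approach with parse_url(): prepend a scheme when one is
// missing (parse_url() puts a scheme-less URL entirely in 'path'),
// then take the host and strip a leading "www.".
// normalise_url() is an invented name for illustration.
function normalise_url($url) {
    // make sure there is always a scheme so parse_url() finds the host
    if (!preg_match('#^[a-z]+://#i', $url)) {
        $url = 'http://' . $url;
    }
    $host = parse_url($url, PHP_URL_HOST);
    // drop a leading "www."
    return preg_replace('/^www\./i', '', $host);
}

echo normalise_url('http://www.example.com');
```

With this, example.com, www.example.com and http://www.example.com should all come back as example.com.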
Morgenmuffel (187)
571722 2007-07-23 08:00:00 Use the preg_match() function (see nz.php.net) for this - it will allow you to capture the text you are after using a regular expression. Erayd (23)
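A minimal sketch of that preg_match() approach: skip an optional scheme and "www." prefix and capture everything up to the first "/" or whitespace. The helper name extract_host() is invented for illustration:

```php
<?php
// Capture the bare host with preg_match(): the scheme and "www."
// prefixes are optional, the host is whatever follows up to the
// first "/" or whitespace. extract_host() is an invented name.
function extract_host($url) {
    if (preg_match('#^(?:[a-z]+://)?(?:www\.)?([^/\s]+)#i', $url, $m)) {
        return $m[1];
    }
    return false;
}

echo extract_host('http://www.example.com/page');
```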
571723 2007-07-23 08:34:00 Or use eregi_replace(). It matches and replaces.

For example:

$enteredUrl = "(http://)?([^[:space:]]+)([[:alnum:]\.,_?/&=-])";
$newUrl = "<a href=\"http://\\2\\3\">\\2\\3</a>";
$storedUrl["URL"] = eregi_replace($enteredUrl, $newUrl, $storedUrl["URL"]);

Note:
For convenience (for me, anyway), it is always handy to chop up a URL into three parts: the "http://", the middle part, and the part from the last full stop (e.g. .com, .net, etc.) onwards.

For the $enteredUrl, the middle part is only matched (note the [^[:space:]]). It does not validate; it is very permissive. I am feeling a bit lazy here. You are strongly advised to strengthen it.

Please test this thoroughly before you decide to use it.
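For reference, the same match-and-replace idea sketched with preg_replace(), the PCRE counterpart of eregi_replace() (which was later deprecated and removed); the pattern here is deliberately loose, not a validator, and linkify() is an invented name:

```php
<?php
// The PCRE counterpart of eregi_replace() is preg_replace(); this
// turns a plain URL in the text into an anchor tag. The pattern is
// a loose sketch, not a validator.
function linkify($text) {
    return preg_replace(
        '#\b(?:http://)?((?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:/\S*)?)#i',
        '<a href="http://$1">$1</a>',
        $text
    );
}

echo linkify('Visit www.example.com for details');
```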

Why do you want to leave out the "http://" part?
vinref (6194)
571724 2007-07-23 09:13:00 Also note processing requirements - regex replace operations use more resources than simple matches; however unless your site is very highly loaded it's unlikely to matter. Erayd (23)
571725 2007-07-26 08:04:00 Thanks vinref and bletch


Or use eregi_replace(). It matches and replaces.

For example:

$enteredUrl = "(http://)?([^[:space:]]+)([[:alnum:]\.,_?/&=-])";
$newUrl = "<a href=\"http://\\2\\3\">\\2\\3</a>";
$storedUrl["URL"] = eregi_replace($enteredUrl, $newUrl, $storedUrl["URL"]);

Note:
For convenience (for me, anyway), it is always handy to chop up a URL into three parts: the "http://", the middle part, and the part from the last full stop (e.g. .com, .net, etc.) onwards.

For the $enteredUrl, the middle part is only matched (note the [^[:space:]]).

Thanks for that. Regular expressions are not my forte, but the eregi_replace() was a start. I finally hacked together something that seems to work well, although my understanding of it is pretty thin in places (the comments are mine):
function strip_url($url){
    // Strip the scheme (http://, https://, ftp:// etc.)
    // basically deletes anything before and including the //.
    $url = eregi_replace('[a-zA-Z]+://([.])*', '', $url);
    // Strip www.
    // this is the one that confuses me:
    // I know the caret ^ means look at the start of the string,
    // but the pipe | throws me. I saw it in other people's regex, so I copied it and it worked.
    $url = eregi_replace('(^| )(www([-]*[.])*)', '', $url);
    return $url;
}
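For readers on a current PHP, a sketch of the same function using preg_replace() in place of eregi_replace() (which was removed in PHP 7); the behaviour is intended to match the version above:

```php
<?php
// strip_url() rewritten with preg_replace() so it runs on current PHP
// (eregi_replace() was removed in PHP 7); behaviour is meant to match.
function strip_url($url) {
    // delete the scheme: anything up to and including the "://"
    $url = preg_replace('#[a-z]+://#i', '', $url);
    // delete "www." (with optional dashes) at the start or after a space
    $url = preg_replace('/(^| )www([-]*\.)+/i', '', $url);
    return $url;
}

echo strip_url('http://www.example.com');
```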
Why do you want to leave out the "http://" part?

Because it is unnecessary. When I did coding many years ago I was told to always leave out things from a database that don't have to be there, and as the only links that will be submitted are just going to be standard web pages, I may as well save a femtosecond and just put the http:// in my actual code:


'<a href="http://'.$url.'">bla</a>'
Also, I am slightly anal and like all my links to begin the same.
Morgenmuffel (187)
571726 2007-07-26 10:42:00 One small note: some sites make a distinction between site.com and www.site.com, so you should really be storing the 'www' (or lack of) in the database with the rest of the URL. The "http://" can go however.


I was told to always leave out things from a database that don't have to be there.

Very true, although best taken with a grain of salt - unless you have a truly huge database, or a very heavily loaded server, it's unlikely to make much difference. The resource savings made by leaving out information are in the areas of full-table, non-indexed searches (RAM and CPU time) and memory caching of tables (RAM).

Throughput between the database and the webserver isn't an issue - if you have a situation where the bottleneck is the connection between the db and the webserver, then something is seriously wrong with your setup (one of the few exceptions to this is if you are serving large amounts of binary data from your db).
Erayd (23)
571727 2007-07-26 21:50:00 One small note: some sites make a distinction between site.com and www.site.com, so you should really be storing the 'www' (or lack of) in the database with the rest of the URL. The "http://" can go however.

That shouldn't be a problem, as 99% of the links will be from a known site that handles them identically, but I will file it for future reference

Very true, although best taken with a grain of salt - unless you have a truly huge database, or a very heavily loaded server, it's unlikely to make much difference.

I know, but I first learnt a little programming in the early 90s and old habits die hard
Morgenmuffel (187)
571728 2007-07-27 00:10:00 That shouldn't be a problem as 99% of the links will be from a known site that handles them identically, but I will file it for future reference.

Can we assume that this means you are happy for 1% of your URLs to break? :rolleyes: Erayd (23)
571729 2007-07-27 01:09:00 The pipe ("|") is equivalent to "or".

You could reduce the two eregi_replace() to one, so that both the "http://" and the "www" are stripped in one go.
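A sketch of that suggestion - both prefixes stripped in one preg_replace() call (optional groups here rather than an explicit alternation; adapt as needed, and strip_url_once() is an invented name):

```php
<?php
// Both prefixes stripped in one pass: the optional groups allow the
// scheme and the "www." to be present or absent in any combination.
function strip_url_once($url) {
    return preg_replace('#^(?:[a-z]+://)?(?:www\.)?#i', '', $url);
}

echo strip_url_once('http://www.example.com');
```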

Keeping the "http://", "ftp://" etc. may also help in reducing ambiguity with the "www" part. I.e., storing the full URL removes all ambiguity.

Good luck with the regex. Show us your code as you go along.
vinref (6194)
571730 2007-07-28 05:13:00 Bletch is right on the point of not considering "www." superfluous. The assumption that 99% of the time they're the same is also way incorrect, I can almost guarantee. I would recommend including the "www." if it is used in the submission. sal (67)