#1 2010-01-10 22:19:16
- MattF
- Member

- From: South Yorkshire, England
- Registered: 2008-05-06
- Posts: 1,230
- Website
UTF-8 and html characters
One of those possibly stupid questions, but I started wondering earlier and it's bugging me now. If a client browser is using UTF-8 encoding, are all characters submitted to the server in that encoding, or are the likes of <>& sent in the normal ISO-8859-1 type format? Hope that makes sense, btw.
Screw the chavs and God save the Queen!
#2 2010-01-12 00:56:54
- MattF
- Member

- From: South Yorkshire, England
- Registered: 2008-05-06
- Posts: 1,230
- Website
Re: UTF-8 and html characters
Would I be correct in assuming that these two expressions:
'/</'
'/\x3c/u'
would match the < symbol using preg_replace whichever encoding is used?
Last edited by MattF (2010-01-12 00:57:37)
#3 2010-01-12 08:55:26
- Reines
- Lead developer

- From: Scotland
- Registered: 2008-05-11
- Posts: 3,165
- Website
Re: UTF-8 and html characters
MattF wrote:
One of those possibly stupid questions, but I started wondering earlier and it's bugging me now. If a client browser is using UTF-8 encoding, are all characters submitted to the server in that encoding, or are the likes of <>& sent in the normal ISO-8859-1 type format? Hope that makes sense, btw.
UTF-8 is backwards compatible with ASCII so <>& are the same no matter which encoding they use.
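To make that concrete, here is a rough sketch in Python (not PHP, and not anything a browser literally runs). It assumes the form is sent as application/x-www-form-urlencoded and uses urllib.parse.quote as a stand-in for the browser's percent-encoding. The helper name submitted_form is made up for illustration.

```python
from urllib.parse import quote


def submitted_form(text: str, page_encoding: str) -> str:
    """Percent-encode text roughly as a urlencoded form body would carry it.

    Hypothetical helper: real browsers leave a slightly different set of
    characters unescaped, but <, > and & are escaped either way.
    """
    return quote(text, safe="", encoding=page_encoding)


for enc in ("utf-8", "iso-8859-1"):
    # <>& come out as the same bytes in both encodings; only é differs.
    print(enc, submitted_form("<>&", enc), submitted_form("\u00e9", enc))
```

The <>& part is identical for both encodings. Only characters outside ASCII, such as é, end up as different bytes.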
#4 2010-01-12 15:24:54
- MattF
- Member

- From: South Yorkshire, England
- Registered: 2008-05-06
- Posts: 1,230
- Website
Re: UTF-8 and html characters
So they're always submitted as <>& rather than their UTF-8 equivalent, regardless? Trying to get up to speed on this Unicode thing is doing my nut in at the moment.
Just out of curiosity, would that regex be correct for matching the Unicode equivalent of <?
Cheers, Reines.
#5 2010-01-12 15:43:59
- Reines
- Lead developer

- From: Scotland
- Registered: 2008-05-11
- Posts: 3,165
- Website
Re: UTF-8 and html characters
In ISO-8859-1 < is stored as 3C (60 in decimal).
In UTF-8 < is stored as 3C (60 in decimal).
They are not "equivalent", they are the same.
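Since the byte is the same, a pattern for it behaves the same on either encoding. A minimal sketch, using Python's re module on raw bytes as a stand-in for preg_replace (the helper name escape_lt and the sample text are made up for illustration):

```python
import re


def escape_lt(raw: bytes, pattern: bytes) -> bytes:
    """Replace every "<" byte (0x3C) in raw with the entity &lt;."""
    return re.sub(pattern, b"&lt;", raw)


text = "a < b, caf\u00e9"
for encoding in ("iso-8859-1", "utf-8"):
    raw = text.encode(encoding)
    literal = escape_lt(raw, rb"<")     # pattern written as the character
    escaped = escape_lt(raw, rb"\x3c")  # pattern written as a hex escape
    # Both patterns hit the same byte, whichever encoding produced raw.
    print(encoding, literal == escaped, literal)
```

One caveat for the PHP version: as far as I know, the u modifier makes PCRE treat the subject as UTF-8, and preg_replace returns NULL if the subject is not valid UTF-8. So '/\x3c/u' is probably only safe on input that really is UTF-8, while plain '/</' works byte-wise on both.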
Think of ISO-8859-1 as an extension of ASCII - The first 128 characters (00 to 7F, or 0 to 127 in decimal) are the same, then it has some extra tacked on.
UTF-8 is the same idea - The first 128 characters are the same single bytes, then it has some (quite a lot!) extra tacked on as multi-byte sequences.
If you look at http://www.fileformat.info/info/charset … 1/list.htm and http://www.fileformat.info/info/charset/UTF-8/list.htm you will see up until 7F (127) they are both identical.
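If you would rather check it than read the tables, here is a small Python sketch that walks the code points both charsets share and reports the first one whose bytes differ. The function name first_difference is made up for illustration.

```python
def first_difference(limit: int = 0x100) -> int:
    """Return the first code point below limit whose ISO-8859-1 and UTF-8
    bytes differ, or -1 if every one of them matches."""
    for code_point in range(limit):
        ch = chr(code_point)
        if ch.encode("iso-8859-1") != ch.encode("utf-8"):
            return code_point
    return -1


print(hex(first_difference()))  # first code point where the two diverge
print(chr(0x3C).encode("iso-8859-1"), chr(0x3C).encode("utf-8"))  # "<"
print(chr(0xE9).encode("iso-8859-1"), chr(0xE9).encode("utf-8"))  # "é"
```

Everything from 00 to 7F matches byte for byte. From 80 upwards ISO-8859-1 still uses one byte per character, while UTF-8 switches to two or more.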
#6 2010-01-12 18:16:09
- MattF
- Member

- From: South Yorkshire, England
- Registered: 2008-05-06
- Posts: 1,230
- Website
Re: UTF-8 and html characters
Cheers for that explanation, Reines. The penny has dropped now.
There is some awfully confusing documentation on the web. Been trying to get to grips with this stuff over the last day or two and it has obviously caused me more confusion than help.
#7 2010-01-12 18:41:24
- Reines
- Lead developer

- From: Scotland
- Registered: 2008-05-11
- Posts: 3,165
- Website
Re: UTF-8 and html characters
Character sets can be incredibly confusing.
#9 2010-01-13 01:02:37
- MattF
- Member

- From: South Yorkshire, England
- Registered: 2008-05-06
- Posts: 1,230
- Website
Re: UTF-8 and html characters
They definitely had that effect on me. The more I read, the less I understood.