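A small correction to patrick at hexane dot org's mb_str_replace function. The original function does not work as intended when $replacement contains $needle: it restarts the search from the beginning of the string after every replacement, so it keeps finding the needle inside the text it has just inserted and never terminates.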
<?php
function mb_str_replace($needle, $replacement, $haystack)
{
    // An empty needle can never be replaced meaningfully; bail out early.
    if ($needle === '') {
        return $haystack;
    }
    $needle_len = mb_strlen($needle);
    $replacement_len = mb_strlen($replacement);
    $pos = mb_strpos($haystack, $needle);
    while ($pos !== false)
    {
        $haystack = mb_substr($haystack, 0, $pos) . $replacement
            . mb_substr($haystack, $pos + $needle_len);
        // Continue searching after the inserted replacement, not inside it.
        $pos = mb_strpos($haystack, $needle, $pos + $replacement_len);
    }
    return $haystack;
}
?>
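For readers who do not want to depend on the configured internal encoding, here is a minimal sketch of the same loop with the encoding passed explicitly. The name mb_str_replace_enc and its $encoding parameter are hypothetical, chosen for illustration; they are not part of mbstring. It assumes PHP 7 or later and the mbstring extension.

```php
<?php
// Sketch only: mb_str_replace_enc() and $encoding are hypothetical names.
function mb_str_replace_enc($needle, $replacement, $haystack, $encoding = 'UTF-8')
{
    if ($needle === '') {
        return $haystack; // nothing to search for
    }
    $needle_len      = mb_strlen($needle, $encoding);
    $replacement_len = mb_strlen($replacement, $encoding);
    $pos = mb_strpos($haystack, $needle, 0, $encoding);
    while ($pos !== false) {
        // Rebuild the string around the match (positions are in characters).
        $haystack = mb_substr($haystack, 0, $pos, $encoding) . $replacement
                  . mb_substr($haystack, $pos + $needle_len, null, $encoding);
        // Resume after the inserted text so it is never rescanned.
        $pos = mb_strpos($haystack, $needle, $pos + $replacement_len, $encoding);
    }
    return $haystack;
}
```

With this sketch, mb_str_replace_enc('a', 'aa', 'banana') returns 'baanaanaa' instead of looping forever.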
Multibyte String Functions
References
Multibyte character sets and the techniques for handling them are very complex and cannot be fully covered in this documentation. Refer to the following URLs for additional resources:
- Unicode/UTF/UCS/etc
- Japanese/Korean/Chinese
Table of Contents
- mb_check_encoding — Checks whether a string is valid for a specific encoding
- mb_convert_case — Changes the case of a string
- mb_convert_encoding — Converts character encoding
- mb_convert_kana — Converts one "kana" to another ("zen-kaku", "han-kaku" and more)
- mb_convert_variables — Converts the character encoding of variables
- mb_decode_mimeheader — Decodes a MIME header
- mb_decode_numericentity — Decodes HTML numeric entities to characters
- mb_detect_encoding — Detects character encoding
- mb_detect_order — Gets/sets the encoding detection order
- mb_encode_mimeheader — Encodes a string for a MIME header
- mb_encode_numericentity — Encodes characters to HTML numeric entities
- mb_ereg_match — POSIX regular expression match for multibyte strings
- mb_ereg_replace — Replaces string segments, with multibyte regular expression support
- mb_ereg_search_getpos — Returns the start position of the next regular expression match
- mb_ereg_search_getregs — Retrieves the last multibyte string segment that matched the pattern
- mb_ereg_search_init — Sets up the string and regular expression for multibyte regular expression matching
- mb_ereg_search_pos — Returns the position and length of the string segment matched by the regular expression
- mb_ereg_search_regs — Returns the string segment matched by a multibyte regular expression
- mb_ereg_search_setpos — Sets the start position for the regular expression search
- mb_ereg_search — Multibyte regular expression search
- mb_ereg — Regular expression match with multibyte support
- mb_eregi_replace — Regular expression replace with multibyte support, ignoring case
- mb_eregi — Case-insensitive regular expression match with multibyte support
- mb_get_info — Gets the internal settings of the mbstring extension
- mb_http_input — Detects the HTTP input character encoding
- mb_http_output — Gets/sets the HTTP output character encoding
- mb_internal_encoding — Gets/sets the internal encoding
- mb_language — Gets/sets the current language
- mb_list_encodings — Returns an array of all supported encodings
- mb_output_handler — Output buffer handler function
- mb_parse_str — Parses HTTP GET/POST/COOKIE data and sets global variables
- mb_preferred_mime_name — Gets the preferred MIME charset name
- mb_regex_encoding — Returns the current character encoding for regular expressions
- mb_regex_set_options — Gets and sets the options of the multibyte regular expression functions
- mb_send_mail — Sends encoded mail
- mb_split — Splits a string into an array using a multibyte regular expression
- mb_strcut — Cuts out part of a string
- mb_strimwidth — Truncates a string to a given width
- mb_stripos — Finds the position of the first occurrence of a string within another, case-insensitive
- mb_stristr — Finds the first occurrence of a string within another, case-insensitive
- mb_strlen — Returns the length of a string
- mb_strpos — Finds the position of the first occurrence of a string within another
- mb_strrchr — Finds the last occurrence of a character of one string within another
- mb_strrichr — Finds the last occurrence of a character of one string within another, case-insensitive
- mb_strripos — Finds the position of the last occurrence of a string within another, case-insensitive
- mb_strrpos — Finds the position of the last occurrence of a string within another
- mb_strstr — Finds the first occurrence of a string within another
- mb_strtolower — Converts all characters to lowercase
- mb_strtoupper — Converts all characters to uppercase
- mb_strwidth — Returns the width of a string
- mb_substitute_character — Gets/sets the substitution character
- mb_substr_count — Counts the number of occurrences of a substring
- mb_substr — Gets part of a string
Multibyte String Functions
marc at ermshaus dot org
04-Oct-2008 06:05
patrick at hexane dot org
27-Jun-2008 11:18
I wonder why there isn't a mb_str_replace(). Here's one for now:
function mb_str_replace( $needle, $replacement, $haystack ) {
    $needle_len = mb_strlen($needle);
    $pos = mb_strpos( $haystack, $needle);
    while (!($pos === false)) {
        $front = mb_substr( $haystack, 0, $pos );
        $back = mb_substr( $haystack, $pos + $needle_len);
        $haystack = $front.$replacement.$back;
        // Caveat: the search restarts at the beginning of the string, so this
        // loops forever when $replacement contains $needle.
        $pos = mb_strpos( $haystack, $needle);
    }
    return $haystack;
}
tonyboyd
18-Oct-2007 02:52
JOECOLE, isn't this the same thing?
$str = mb_convert_case($str, MB_CASE_TITLE, "UTF-8");
Smelly
26-Apr-2007 01:09
Below is some code to output a UTF-8 encoded CSV in a way understandable by Excel. It requires iconv instead of mbstring.
header("Content-type: application/octet-stream");
header("Content-Transfer-Encoding: binary");
header("Content-Disposition: attachment; filename=report.xls");
// assume $tmpString contains UTF-8 encoded CSV:
$tmpString = iconv ( 'UTF-8', 'UTF-16LE//IGNORE', $tmpString );
print chr(255).chr(254).$tmpString;
chris at maedata dot com
25-Apr-2007 12:50
The opposite of what Eugene Murai wrote in a previous comment is true when importing/uploading a file. For instance, if you export an Excel spreadsheet using the Save As Unicode Text option, you can use the following to convert it to UTF-8 after uploading:
//Convert file to UTF-8 in case Windows mucked it up
$file = explode( "\n", mb_convert_encoding( trim( file_get_contents( $_FILES['file']['tmp_name'] ) ), 'UTF-8', 'UTF-16' ) );
mdoocy at u dot washington dot edu
15-Mar-2007 02:30
Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
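A small sketch of what this means in practice, assuming UTF-8 input and PHP 7 or later. Both function names are hypothetical. The first walks the string with mb_substr(), which has to scan from the start of the string on every call, so the loop as a whole is roughly quadratic; the second splits the string once with preg_split() and the u modifier, then relies on ordinary array indexing.

```php
<?php
// Hypothetical helper: one mb_substr() call per character, each of which
// scans from the beginning of the string to find character index $i.
function chars_via_substr($s)
{
    $out = [];
    $n = mb_strlen($s, 'UTF-8');
    for ($i = 0; $i < $n; $i++) {
        $out[] = mb_substr($s, $i, 1, 'UTF-8');
    }
    return $out;
}

// Hypothetical helper: a single pass over the string; afterwards each
// character is reachable by array index.
function chars_via_split($s)
{
    return preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
}
```

Both return the same array of characters; the difference only matters for long strings or tight loops.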
motin at demomusic dot nu
16-Feb-2007 09:24
Follow up on last note from 2007-jan-20: http://se2.php.net/manual/en/function.mb-strlen.php#72979
That note describes the correct way of simulating single-byte strlen, as well as some pitfalls to watch out for when developing in an environment where mbstring.func_overload is enabled.
motin at demomusic dot nu
20-Jan-2007 09:12
As peter dot albertsson at spray dot se already pointed out, overloading strlen may break code that handles binary data and relies upon strlen for bytelengths.
The problem occurs when a file is filled with a string using fwrite in the following manner:
$len = strlen($data);
fwrite($fp, $data, $len);
fwrite takes the number of bytes as its third parameter, but with strlen overloaded by mb_strlen, $len is the number of characters in the string. Since a multibyte character can occupy more than one byte, the last characters of $data never get written to the file.
After hours of investigating why PEAR::Cache_Lite didn't work - the above is what I found.
I made an attempt at using single byte functions, but it doesn't work. Posting here anyway in case it helps someone else:
/**
 * PHP single-byte functions simulation (not successful)
 *
 * Usage: sb_string(functionname, arg1, arg2, etc);
 * Example: sb_string("strlen", "tuöéä"); returns 8 (should...)
 */
function sb_string() {
    $arguments = func_get_args();
    $func_overloading = ini_get("mbstring.func_overload");
    ini_set("mbstring.func_overload", 0);
    $ret = call_user_func_array(array_shift($arguments), $arguments);
    ini_set("mbstring.func_overload", $func_overloading);
    return $ret;
}
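One likely reason the attempt above fails is that mbstring.func_overload is a PHP_INI_SYSTEM setting, so ini_set() cannot change it at runtime. A commonly used alternative, sketched here under the assumption that the mbstring extension is loaded, is to ask mb_strlen() for the length in the '8bit' encoding, which counts bytes regardless of overloading. The helper names byte_length and write_all are hypothetical.

```php
<?php
// Hypothetical helper: byte count of a string, independent of whether
// strlen() has been overloaded by mbstring.func_overload.
function byte_length($data)
{
    return mb_strlen($data, '8bit');
}

// Hypothetical helper: write the whole string by passing fwrite() a
// length in bytes rather than in characters.
function write_all($fp, $data)
{
    return fwrite($fp, $data, byte_length($data));
}
```

For the example string from the docblock above, byte_length("tuöéä") gives 8 when the string is UTF-8 encoded. Note that mbstring.func_overload was deprecated in PHP 7.2 and removed in PHP 8.0, so this only matters on older installations.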
pdezwart .at. snocap
11-Oct-2006 02:28
If you are trying to emulate the UnicodeEncoding.Unicode.GetBytes() function in .NET, the encoding you want to use is: UCS-2LE
hayk at mail dot ru
18-Aug-2006 03:36
Since PHP 5.1.0 and PHP 4.4.2 there is an Armenian ArmSCII-8 (ArmSCII-8, ArmSCII8, ARMSCII-8, ARMSCII8) encoding available.
daniel at softel dot jp
24-Jul-2006 07:41
Note that although "multi-byte" hints at total internationalization, the mb_ API was designed by a Japanese person to support the Japanese language.
Some of the functions, for example mb_convert_kana(), make absolutely no sense outside of a Japanese language environment.
It should perhaps be considered "lucky" if the functions work with non-Japanese multi-byte languages.
I don't mean any disrespect to the mb_ API because I'm using it everyday and I appreciate its usefulness, but maybe a better name would be the jp_ API.
Aardvark
14-Mar-2006 03:37
Since not all hosted services currently support the multi-byte function set, it may still be necessary to process Unicode strings using standard single-byte functions. The function at the following link - http://www.kanolife.com/escape/2006/03/php-unicode-processing.html - shows by example how to do this. While this only covers UTF-8, the standard PHP function "iconv" allows conversion into and out of UTF-8 if strings need to be input or output in other encodings.
peter kehl
10-Mar-2006 12:34
UTF-16LE solution for CSV for Excel by Eugene Murai works well:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
However, Excel on Mac OS X then doesn't identify columns properly and puts each whole row in its own cell. To fix that, use the TAB character ("\t") as the CSV delimiter rather than a comma or colon.
You may also want to use HTTP encoding header, such as
header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
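Putting the pieces of this note together, a minimal sketch might look like the following. The function name excel_utf16_tsv is hypothetical, the input is assumed to be UTF-8, and fields are assumed not to contain tabs or line breaks, since no quoting or escaping is attempted.

```php
<?php
// Hypothetical helper: turn rows of UTF-8 strings into tab-delimited
// UTF-16LE text, prefixed with the FF FE byte order mark.
// Assumption: no field contains a tab or a line break.
function excel_utf16_tsv(array $rows)
{
    $lines = [];
    foreach ($rows as $row) {
        $lines[] = implode("\t", $row);
    }
    $utf8 = implode("\r\n", $lines) . "\r\n";
    return chr(255) . chr(254) . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');
}
```

The returned string can be printed after the header() call shown above.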
15-Aug-2005 10:24
To get the size of a string in octets when mbstring.func_overload is set to 2:
<?php
function str_sizeof($string) {
    // Without the "u" modifier the pattern matches one byte at a time; the "s"
    // modifier makes "." match newline bytes too, so they are counted as well.
    return count(preg_split("`.`s", $string)) - 1;
}
?>
In answer to peter albertsson: once you have the octet size of your data, you can access each octet with something like
$string[0] ... $string[$size-1], since the [ operator works on bytes and is not multibyte-aware.
peter dot albertsson at spray dot se
21-May-2005 06:43
Setting mbstring.func_overload = 2 may break your applications that deal with binary data.
After having set mbstring.func_overload = 2 and mbstring.internal_encoding = UTF-8 I can't even read a binary file and print/echo it to output without corrupting it.
nzkiwi at NOSPAMmte dot biglobe dot ne dot jp
14-Apr-2005 07:37
A friend has pointed out that the entry "mbstring.http_input PHP_INI_ALL" in Table 1 on the mbstring page appears to be wrong: above Example 4 it says that "There is no way to control HTTP input character conversion from PHP script. To disable HTTP input character conversion, it has to be done in php.ini".
Also, the table shows the defaults for old PHP versions:
;; Disable HTTP Input conversion
mbstring.http_input = pass
*BUT* (for PHP 4.3.0 or higher)
;; Disable HTTP Input conversion
mbstring.encoding_translation = Off
Eugene Murai
24-Feb-2005 02:20
PHP can input and output Unicode, but in a slightly different way from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian with a BOM. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.
Example:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');
Geoffrey
01-Feb-2005 04:59
For Windows users, php_mbstring can be added as follows:
If you have downloaded the "short" version of PHP (php-4.3.10-installer.exe), download the full version (php-4.3.10-Win32.zip).
Unzip it, find php_mbstring.dll in f:\php-4.3.10-Win32\extensions, and copy it across to your php\extensions directory.
Use Notepad to open your PHP.INI.
Change the extension_dir line to read
extension_dir = "e:\php\extensions\" (or whatever your directory is called).
Remove the semi-colon on the line
; extension=php_mbstring.dll
Save PHP.INI and restart PHP.
