Multibyte String Functions

References

Multibyte character encodings and the techniques for handling them are very complex and cannot be covered in full in this documentation. Refer to the following URLs for further resources:

Unicode/UTF/UCS/etc.

» http://www.unicode.org/

Japanese/Korean/Chinese

» https://resources.oreilly.com/examples/9781565922242/blob/master/doc/cjk.inf

Table of Contents

mb_check_encoding — Checks whether strings are valid for the specified encoding
mb_chr — Returns a character by its Unicode code point value
mb_convert_case — Changes the case of a string
mb_convert_encoding — Converts a string from one character encoding to another
mb_convert_kana — Converts one "kana" to another ("zen-kaku", "han-kaku" and more)
mb_convert_variables — Converts the encoding of variables
mb_decode_mimeheader — Decodes a MIME header
mb_decode_numericentity — Decodes HTML numeric entities into characters
mb_detect_encoding — Detects an encoding
mb_detect_order — Gets/sets the encoding detection order
mb_encode_mimeheader — Encodes a string for a MIME header
mb_encode_numericentity — Encodes characters as HTML numeric references
mb_encoding_aliases — Gets the aliases of a known encoding type
mb_ereg_match — Regular expression match for multibyte strings
mb_ereg_replace_callback — Regular expression search and replace with multibyte support, using a callback function
mb_ereg_replace — Replaces string segments using regular expressions
mb_ereg_search_getpos — Returns the start position of the next regular expression match
mb_ereg_search_getregs — Gets the last multibyte string segment that matched the pattern
mb_ereg_search_init — Sets up the string and regular expression for a multibyte regular expression search
mb_ereg_search_pos — Returns the position and length of the string segment that matches the regular expression pattern
mb_ereg_search_regs — Returns the string segment matched by a multibyte regular expression
mb_ereg_search_setpos — Sets the start point of the regular expression search
mb_ereg_search — Multibyte regular expression search
mb_ereg — Regular expression match with multibyte support
mb_eregi_replace — Regular expression replace with multibyte support, ignoring case
mb_eregi — Case-insensitive regular expression match with multibyte support
mb_get_info — Gets the internal settings of the mbstring extension
mb_http_input — Detects the HTTP input character encoding
mb_http_output — Gets/sets the output encoding
mb_internal_encoding — Gets/sets the internal encoding
mb_language — Sets/gets the current language
mb_list_encodings — Returns an array of all supported encodings
mb_ord — Gets the Unicode code point of a character
mb_output_handler — Output handling callback function
mb_parse_str — Parses HTTP GET/POST/COOKIE data and sets global variables
mb_preferred_mime_name — Gets the MIME charset name
mb_regex_encoding — Sets/gets the character encoding for multibyte regular expressions
mb_regex_set_options — Gets and sets the options of the multibyte regular expression functions
mb_scrub — Replaces ill-formed byte sequences with the substitute character
mb_send_mail — Sends encoded mail
mb_split — Splits a string into an array using a multibyte regular expression
mb_str_pad — Pads a multibyte string to a certain length with another multibyte string
mb_str_split — Given a multibyte string, returns an array of its characters
mb_strcut — Gets part of a string
mb_strimwidth — Truncates a string
mb_stripos — Finds the position of the first occurrence of a string within another, case-insensitive
mb_stristr — Finds the first occurrence of a string within another, case-insensitive
mb_strlen — Returns the length of a string
mb_strpos — Finds the position of the first occurrence of a string within another
mb_strrchr — Finds the last occurrence of a character of a string within another
mb_strrichr — Finds the last occurrence of a character of a string within another, case-insensitive
mb_strripos — Finds the position of the last occurrence of a string within another, case-insensitive
mb_strrpos — Finds the position of the last occurrence of a string within another
mb_strstr — Finds the first occurrence of a string within another
mb_strtolower — Converts all characters to lowercase
mb_strtoupper — Converts all characters to uppercase
mb_strwidth — Returns the width of a string
mb_substitute_character — Sets/gets the substitution character
mb_substr_count — Counts the number of occurrences of a substring
mb_substr — Gets part of a string

User Contributed Notes 35 notes

deceze at gmail dot com ¶

11 years ago

Please note that all the discussion about mb_str_replace in the comments is pretty pointless. str_replace works just fine with multibyte strings:

<?php

$string  = '漢字はユニコード';
$needle  = 'は';
$replace = 'Foo';

echo str_replace($needle, $replace, $string);
// outputs: 漢字Fooユニコード

?>

The usual problem is that the string is evaluated as a binary string, meaning PHP is not aware of encodings at all. Problems arise when a value comes "from outside" (a database, a POST request) and the needle and the haystack are not in the same encoding. That typically means the source code is not saved in the same encoding as the data you receive "from outside". The binary representations then don't match and nothing is replaced.
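
The mismatch described above can be reproduced in a few lines. This is a minimal sketch, assuming the incoming value is ISO-8859-1 and the needle is UTF-8; the sample word and variable names are illustrative only. Byte escapes are used so the result does not depend on how the file is saved.

```php
<?php
// Needle as it would appear in a UTF-8 source file: "é" is the two bytes C3 A9.
$needle = "\xC3\xA9";

// Haystack as it might arrive "from outside" in ISO-8859-1: "é" is the single byte E9.
$haystack = "caf\xE9";

// The byte sequences differ, so str_replace() finds nothing to replace.
$unchanged = str_replace($needle, 'e', $haystack);
var_dump($unchanged === $haystack); // bool(true)

// Convert the haystack to the needle's encoding first, then replace.
$haystackUtf8 = mb_convert_encoding($haystack, 'UTF-8', 'ISO-8859-1');
echo str_replace($needle, 'e', $haystackUtf8), "\n"; // cafe
```

Normalising all input to one encoding at the boundary (here UTF-8) keeps every later byte-level comparison meaningful.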

Eugene Murai ¶

19 years ago

PHP can input and output Unicode, but in a slightly different way from what Microsoft means: when Microsoft says "Unicode", it implicitly means little-endian UTF-16 with a BOM (FF FE = chr(255).chr(254)), whereas PHP's "UTF-16" means big-endian, and mb_convert_encoding() does not prepend a BOM to its output. For this reason, PHP does not seem to be able to output a Unicode CSV file for Microsoft Excel directly. Solving this problem is quite simple: just put the BOM in front of a UTF-16LE string.

Example:

$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');

Hayley Watson ¶

5 years ago

SOME multibyte encodings can safely be used in str_replace() and the like, others cannot. It's not enough to ensure that all the strings involved use the same encoding: obviously they have to, but it's not enough. It has to be the right sort of encoding.

UTF-8 is one of the safe ones, because it was designed to be unambiguous about where each encoded character begins and ends in the string of bytes that makes up the encoded text. Some encodings are not safe: the last bytes of one character in a text followed by the first bytes of the next character may together make a valid character. str_replace() knows nothing about "characters", "character encodings" or "encoded text". It only knows about the string of bytes. To str_replace(), two adjacent characters with two-byte encodings just looks like a sequence of four bytes and it's not going to know it shouldn't try to match the middle two bytes.

While real-world examples can be found of str_replace() mangling text, it can be illustrated by using the HTML-ENTITIES encoding. It's not one of the safe ones. All of the strings being passed to str_replace() are valid HTML-ENTITIES-encoded text so the "all inputs use the same encoding" rule is satisfied.

The text is "x<y". It is represented by the byte string [78 26 6c 74 3b 79]. Note that the text has three characters, but the string has six bytes.

<?php

$string = 'x&lt;y';
mb_internal_encoding('HTML-ENTITIES');

echo "Text length: ", mb_strlen($string), "\tString length: ", strlen($string), " ... ", $string, "\n";
// Three characters, six bytes; the text reads "x<y".

$newstring = str_replace('l', 'g', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " ... ", $newstring, "\n";
// Three characters, six bytes, but now the text reads "x>y"; the wrong characters have changed.

$newstring = str_replace(';', ':', $string);
echo "Text length: ", mb_strlen($newstring), "\tString length: ", strlen($newstring), " ... ", $newstring, "\n";
// Now even the length of the text is wrong and the text is trashed.

?>

Even though neither 'l' nor ';' appear in the text "x<y", str_replace() still found and changed bytes. In one case, it changed the text to "x>y" and in the other it broke the encoding completely.

One more reason to use UTF-8 if you can, I guess.
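
A real-world instance of the same problem, offered as a hedged sketch: in Shift_JIS the second byte of some characters equals the ASCII backslash (0x5C). The character ソ (U+30BD) is picked here as an assumed example; its Shift_JIS bytes are 83 5C.

```php
<?php
$utf8 = "\xE3\x82\xBD";                               // "ソ" in UTF-8
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');  // the same character in Shift_JIS

echo bin2hex($sjis), "\n";                            // 835c

// A byte-level replace of backslashes hits the trail byte of the character.
$mangled = str_replace('\\', '/', $sjis);
echo bin2hex($mangled), "\n";                         // 832f

// The UTF-8 form is untouched: no byte of a multibyte UTF-8 sequence is ASCII.
var_dump(str_replace('\\', '/', $utf8) === $utf8);    // bool(true)
```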

mdoocy at u dot washington dot edu ¶

17 years ago

Note that some of the multi-byte functions run in O(n) time, rather than constant time as is the case for their single-byte equivalents. This includes any functionality requiring access at a specific index, since random access is not possible in a string whose number of bytes will not necessarily match the number of characters. Affected functions include: mb_substr(), mb_strstr(), mb_strcut(), mb_strpos(), etc.
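
When every character of a string is needed, splitting once avoids repeated index lookups. A minimal sketch, assuming PHP 7.4 or later (for mb_str_split()) and UTF-8 input; the sample string is illustrative.

```php
<?php
$text = "h\xC3\xA9llo"; // "héllo" in UTF-8

// Each mb_substr($text, $i, 1) call may have to scan from the start of the
// string to find character $i, so the loop as a whole can be quadratic.
$slow = [];
for ($i = 0, $n = mb_strlen($text, 'UTF-8'); $i < $n; $i++) {
    $slow[] = mb_substr($text, $i, 1, 'UTF-8');
}

// Splitting once walks the string a single time.
$fast = mb_str_split($text, 1, 'UTF-8');

var_dump($slow === $fast); // bool(true)
echo count($fast), "\n";   // 5
```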

mitgath at gmail dot com ¶

15 years ago

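According to http://bugs.php.net/bug.php?id=21317, mb_str_pad() is missing. Here is a substitute for versions that lack it:

<?php
// Define the fallback only where mb_str_pad() is not already available.
if (!function_exists('mb_str_pad')) {
    function mb_str_pad($input, $pad_length, $pad_string = ' ', $pad_style = STR_PAD_RIGHT, $encoding = "UTF-8") {
        // str_pad() counts bytes, so add the byte/character difference to the target length.
        // This only works when $pad_string consists of single-byte characters.
        return str_pad($input,
            strlen($input) - mb_strlen($input, $encoding) + $pad_length, $pad_string, $pad_style);
    }
}
?>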

treilor at gmail dot com ¶

9 years ago

A small note for those who follow rawsrc at gmail dot com's advice: mb_split() uses regular expressions, so it may make more sense to use the built-in function mb_ereg_replace().

Anonymous ¶

10 years ago

Yet another single-line mb_trim() function

<?php
function mb_trim($string, $trim_chars = '\s'){
    return preg_replace('/^['.$trim_chars.']*(?U)(.*)['.$trim_chars.']*$/u', '\\1',$string);
}
$string = '           "some text."      ';
echo mb_trim($string, '\s".');
//some text
?>

peter kehl ¶

18 years ago

UTF-16LE solution for CSV for Excel by Eugene Murai works well:
$unicode_str_for_Excel = chr(255).chr(254).mb_convert_encoding( $utf8_str, 'UTF-16LE', 'UTF-8');

However, Excel on Mac OS X then doesn't identify the columns properly and puts each whole row in its own cell. To fix that, use the TAB character ("\t") as the CSV delimiter rather than a comma or colon.

You may also want to use HTTP encoding header, such as
header( "Content-type: application/vnd.ms-excel; charset=UTF-16LE" );
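
Putting the two notes together, a tab-delimited export could be built roughly as follows. This is a sketch under assumptions: the rows are hypothetical sample data, cells are assumed not to contain tabs or line breaks, and CRLF line endings are a guess at what Excel prefers.

```php
<?php
// Hypothetical sample rows; real data would come from the application.
$rows = [
    ['Name', 'City'],
    ["Zo\xC3\xAB", "\xE6\x9D\xB1\xE4\xBA\xAC"], // "Zoë", "東京" in UTF-8
];

// Join cells with TAB and rows with CRLF (naive: no quoting of cell contents).
$lines = [];
foreach ($rows as $row) {
    $lines[] = implode("\t", $row);
}
$utf8 = implode("\r\n", $lines) . "\r\n";

// Prepend the little-endian BOM (FF FE) to the UTF-16LE bytes.
$excel = "\xFF\xFE" . mb_convert_encoding($utf8, 'UTF-16LE', 'UTF-8');

echo bin2hex(substr($excel, 0, 2)), "\n"; // fffe
```

The resulting $excel string can then be sent after the header() call shown above or written to a file.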

roydukkey at roydukkey dot com ¶

14 years ago

This would be one way to create a multibyte substr_replace function:

<?php
function mb_substr_replace($output, $replace, $posOpen, $posClose) {
    return mb_substr($output, 0, $posOpen).$replace.mb_substr($output, $posClose+1);
}
?>

php at kamiware dot org ¶

7 years ago

str_replace is NOT multi-byte safe.

The Ukrainian word відео is corrupted by the following code:

$rubishcharacters='[#|\[{}\]´`≠,;.:-\\_<>=*+"\'?()!§$&%';
$searchstring='відео';

$result = str_replace(str_split($rubishcharacters), ' ', $searchstring);
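
The corruption comes from str_split(), which cuts the multibyte characters in the list (´, ≠, §) into single bytes; some of those bytes also occur inside the Cyrillic letters. A hedged sketch of one fix, assuming PHP 7.4 or later and a source file saved as UTF-8, is to split the list by character with mb_str_split(). The character list is shortened here for readability.

```php
<?php
$rubbish = '#|{}´`≠,;.:-_<>=*+"?()!§$&%';
$word    = 'відео';

// Byte-wise split: "´" (C2 B4) contributes the byte B4, which is also the
// second byte of "д" (D0 B4), so the word is damaged.
$broken = str_replace(str_split($rubbish), ' ', $word);
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)

// Character-wise split: every needle is a whole character.
$intact = str_replace(mb_str_split($rubbish, 1, 'UTF-8'), ' ', $word);
var_dump($intact === $word); // bool(true)
```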

Ben XO ¶

15 years ago

PHP5 has no mb_trim(), so here's one I made. It works just like trim(), but with the added bonus of PCRE character classes (including, of course, all the useful Unicode ones such as \pZ).

Unlike other approaches that I've seen to this problem, I wanted to emulate the full functionality of trim(), in particular the ability to customise the character list.

<?php
    /**
     * Trim characters from either (or both) ends of a string in a way that is
     * multibyte-friendly.
     *
     * Mostly, this behaves exactly like trim() would: for example supplying 'abc' as
     * the charlist will trim all 'a', 'b' and 'c' chars from the string, with, of
     * course, the added bonus that you can put unicode characters in the charlist.
     *
     * We are using a PCRE character-class to do the trimming in a unicode-aware
     * way, so we must escape ^, \, - and ] which have special meanings here.
     * As you would expect, a single \ in the charlist is interpreted as
     * "trim backslashes" (and duly escaped into a double-\ ). Under most circumstances
     * you can ignore this detail.
     *
     * As a bonus, however, we also allow PCRE special character-classes (such as '\s')
     * because they can be extremely useful when dealing with UCS. '\pZ', for example,
     * matches every 'separator' character defined in Unicode, including non-breaking
     * and zero-width spaces.
     *
     * It doesn't make sense to have two or more of the same character in a character
     * class, therefore we interpret a double \ in the character list to mean a
     * single \ in the regex, allowing you to safely mix normal characters with PCRE
     * special classes.
     *
     * *Be careful* when using this bonus feature, as PHP also interprets backslashes
     * as escape characters before they are even seen by the regex. Therefore, to
     * specify '\\s' in the regex (which will be converted to the special character
     * class '\s' for trimming), you will usually have to put *4* backslashes in the
     * PHP code - as you can see from the default value of $charlist.
     *
     * @param string $string
     * @param string $charlist list of characters to remove from the ends of this string.
     * @param boolean $ltrim trim the left?
     * @param boolean $rtrim trim the right?
     * @return string
     */
    function mb_trim($string, $charlist='\\\\s', $ltrim=true, $rtrim=true)
    {
        $both_ends = $ltrim && $rtrim;

        $char_class_inner = preg_replace(
            array( '/[\^\-\]\\\]/S', '/\\\{4}/S' ),
            array( '\\\\\\0', '\\' ),
            $charlist
        );

        $work_horse = '[' . $char_class_inner . ']+';
        $ltrim && $left_pattern = '^' . $work_horse;
        $rtrim && $right_pattern = $work_horse . '$';

        if($both_ends)
        {
            $pattern_middle = $left_pattern . '|' . $right_pattern;
        }
        elseif($ltrim)
        {
            $pattern_middle = $left_pattern;
        }
        else
        {
            $pattern_middle = $right_pattern;
        }

        // The original had a stray closing parenthesis here, which was a syntax error.
        return preg_replace("/$pattern_middle/usSD", '', $string);
    }
?>

Daniel Rhodes ¶

10 years ago

Here's a cheap and cheeky function to remove leading and trailing *punctuation* (or more specifically "non-word characters") from a UTF-8 string in whatever language. (At least it works well enough for Japanese and English.)

/**
 * Trim singlebyte and multibyte punctuation from the start and end of a string
 * 
 * @author Daniel Rhodes
 * @note we want the first non-word grabbing to be greedy but then
 * @note we want the dot-star grabbing (before the last non-word grabbing)
 * @note to be ungreedy
 * 
 * @param string $string input string in UTF-8
 * @return string as $string but with leading and trailing punctuation removed
 */
function mb_punctuation_trim($string)
{
    preg_match('/^[^\w]{0,}(.*?)[^\w]{0,}$/iu', $string, $matches); //case-'i'nsensitive and 'u'ngreedy
    
    if(count($matches) < 2)
    {
        //some strange error so just return the original input
        return $string;
    }
    
    return $matches[1];
}

Hope you like it!

mattr at telebody dot com ¶

9 years ago

A brief note on Daniel Rhodes' mb_punctuation_trim().
The regular expression modifier u does not mean ungreedy, rather it means the pattern is in UTF-8 encoding. Instead the U modifier should be used to get ungreedy behavior. (I have not otherwise tested his code.)
See http://php.net/manual/en/reference.pcre.pattern.modifiers.php
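
The difference between the two modifiers can be checked directly. A small sketch with illustrative strings: U inverts greediness, while u makes PCRE treat the pattern and subject as UTF-8.

```php
<?php
$s = 'aXbXc';

preg_match('/a.*X/', $s, $greedy);
echo $greedy[0], "\n";   // aXbX

preg_match('/a.*X/U', $s, $ungreedy);
echo $ungreedy[0], "\n"; // aX

$e = "\xC3\xA9"; // "é", two bytes in UTF-8
var_dump(preg_match('/^.$/', $e));  // int(0): "." matches one byte, the string has two
var_dump(preg_match('/^.$/u', $e)); // int(1): "." matches one UTF-8 character
```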

abidul dot rmdn at gmail dot com ¶

4 years ago

Having to migrate to the mb_ functions can be a bit of a pain if you have a big project. It took us a while at my company, but then we wrote a small script that makes it really easy to migrate to the mb_ functions, and explained it in a short blog post:
https://link.medium.com/25w1LronCX

rr_news at live dot de ¶

7 years ago

The suggestion from "mt at mediamedics dot nl" is not as bad as the down votes indicate. There is only one small bug, which is easily fixed.
The head of the "for" loop needs to be changed by replacing $i + $split_length with $i += $split_length.

Here is the full working code, with an additional check to verify that the function doesn't already exist:

<?php
if ( !function_exists('mb_str_split') )
{
    function mb_str_split($string, $split_length = 1)
    {
        mb_internal_encoding('UTF-8'); 
        mb_regex_encoding('UTF-8');  

        $split_length = ($split_length <= 0) ? 1 : $split_length;

        $mb_strlen = mb_strlen($string, 'utf-8');

        $array = array();

        for($i = 0; $i < $mb_strlen; $i += $split_length)
        {
            $array[] = mb_substr($string, $i, $split_length);
        }

        return $array;
    }
}
?>

rawsrc at gmail dot com ¶

12 years ago

Hi,

For those who are looking for mb_str_replace, here's a simple function:
<?php
function mb_str_replace($needle, $replacement, $haystack) {
   return implode($replacement, mb_split($needle, $haystack));
}
?>
I haven't found a simpler way to proceed :-)

v dot r dot sanaty at gmail dot com ¶

6 years ago

The multibyte version of substr_replace function:
(Inspired by roydukkey's note with some corrections)

function mb_substr_replace($string, $replacement, $start, $length){
    return mb_substr($string, 0, $start).$replacement.mb_substr($string, $start+$length);
}
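
A short usage sketch comparing character offsets with byte offsets. The function is repeated under a hypothetical name (demo_mb_substr_replace) with an explicit 'UTF-8' encoding, which is an assumption added so the demo does not depend on the internal encoding setting.

```php
<?php
// Same logic as the function above, under a hypothetical name.
function demo_mb_substr_replace($string, $replacement, $start, $length) {
    return mb_substr($string, 0, $start, 'UTF-8')
        . $replacement
        . mb_substr($string, $start + $length, null, 'UTF-8');
}

$s = "na\xC3\xAFve"; // "naïve" in UTF-8

// Character offsets: replaces the whole "ï".
echo demo_mb_substr_replace($s, 'i', 2, 1), "\n"; // naive

// Byte offsets: replaces only the first byte of "ï" and leaves a stray byte.
$bytewise = substr_replace($s, 'i', 2, 1);
var_dump(mb_check_encoding($bytewise, 'UTF-8')); // bool(false)
```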

sakai at d4k dot net ¶

14 years ago

I hope this mb_str_replace will work for arrays.  Please use mb_internal_encoding() beforehand, if you need to change the encoding.

Thanks to marc at ermshaus dot org for the original.

<?php

if(!function_exists('mb_str_replace')) {

    function mb_str_replace($search, $replace, $subject) {

        if(is_array($subject)) {
            $ret = array();
            foreach($subject as $key => $val) {
                $ret[$key] = mb_str_replace($search, $replace, $val);
            }
            return $ret;
        }

        foreach((array) $search as $key => $s) {
            if($s == '') {
                continue;
            }
            $r = !is_array($replace) ? $replace : (array_key_exists($key, $replace) ? $replace[$key] : '');
            $pos = mb_strpos($subject, $s);
            while($pos !== false) {
                $subject = mb_substr($subject, 0, $pos) . $r . mb_substr($subject, $pos + mb_strlen($s));
                $pos = mb_strpos($subject, $s, $pos + mb_strlen($r));
            }
        }

        return $subject;

    }

}

?>

nzkiwi at NOSPAMmte dot biglobe dot ne dot jp ¶

19 years ago

A friend has pointed out that the entry "mbstring.http_input PHP_INI_ALL" in Table 1 on the mbstring page appears to be wrong: above Example 4 it says that "There is no way to control HTTP input character conversion from PHP script. To disable HTTP input character conversion, it has to be done in php.ini".

Also, the table shows the defaults for old PHP versions:

;; Disable HTTP Input conversion
mbstring.http_input = pass

*BUT* (for PHP 4.3.0 or higher):

;; Disable HTTP Input conversion
mbstring.encoding_translation = Off

daniel at softel dot jp ¶

17 years ago

Note that although "multi-byte" hints at total internationalization, the mb_ API was designed by a Japanese person to support the Japanese language.

Some of the functions, for example mb_convert_kana(), make absolutely no sense outside of a Japanese language environment.

It should perhaps be considered "lucky" if the functions work with non-Japanese multi-byte languages.

I don't mean any disrespect to the mb_ API because I'm using it everyday and I appreciate its usefulness, but maybe a better name would be the jp_ API.

efesar ¶

13 years ago

This small mb_trim function works for me.

<?php
function mb_trim( $string )
{
    $string = preg_replace( "/(^\s+)|(\s+$)/us", "", $string );

    return $string;
}
?>

johannesponader at dontspamme dot googlemail dot co ¶

13 years ago

Please note that when migrating code to handle UTF-8 encoding, not only the functions mentioned here are useful, but also the function htmlentities() has to be changed to htmlentities($var, ENT_COMPAT, "UTF-8") or similar. I didn't scan the manual for it, but there could be some more functions that need adjustments like this.
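
The effect of the encoding argument can be seen in a short sketch; the sample string is illustrative and written with byte escapes.

```php
<?php
$var = "caf\xC3\xA9"; // "café" in UTF-8

// Telling htmlentities() the real encoding yields the expected entity.
echo htmlentities($var, ENT_COMPAT, 'UTF-8'), "\n";      // caf&eacute;

// Treating the same bytes as ISO-8859-1 turns each byte into its own entity.
echo htmlentities($var, ENT_COMPAT, 'ISO-8859-1'), "\n"; // caf&Atilde;&copy;
```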

marc at ermshaus dot org ¶

15 years ago

A small correction to patrick at hexane dot org's mb_str_replace function. The original function does not work as intended in case $replacement contains $needle.

<?php
function mb_str_replace($needle, $replacement, $haystack)
{
    $needle_len = mb_strlen($needle);
    $replacement_len = mb_strlen($replacement);
    $pos = mb_strpos($haystack, $needle);
    while ($pos !== false)
    {
        $haystack = mb_substr($haystack, 0, $pos) . $replacement
                . mb_substr($haystack, $pos + $needle_len);
        $pos = mb_strpos($haystack, $needle, $pos + $replacement_len);
    }
    return $haystack;
}
?>

patrick at hexane dot org ¶

15 years ago

I wonder why there isn't a mb_str_replace().  Here's one for now:

function mb_str_replace( $needle, $replacement, $haystack ) {
  $needle_len = mb_strlen($needle);
  $pos = mb_strpos( $haystack, $needle);
  while (!($pos ===false)) {
    $front = mb_substr( $haystack, 0, $pos );
    $back  = mb_substr( $haystack, $pos + $needle_len);
    $haystack = $front.$replacement.$back;
    $pos = mb_strpos( $haystack, $needle);
  }
  return $haystack;
}

chris at maedata dot com ¶

17 years ago

The opposite of what Eugene Murai wrote in a previous comment is true when importing/uploading a file. For instance, if you export an Excel spreadsheet using the Save As Unicode Text option, you can use the following to convert it to UTF-8 after uploading:

//Convert file to UTF-8 in case Windows mucked it up
$file = explode( "\n", mb_convert_encoding( trim( file_get_contents( $_FILES['file']['tmp_name'] ) ), 'UTF-8', 'UTF-16' ) );

pdezwart .at. snocap ¶

17 years ago

If you are trying to emulate the UnicodeEncoding.Unicode.GetBytes() function in .NET, the encoding you want to use is: UCS-2LE

Anonymous ¶

18 years ago

Get the octet size of a string when mbstring.func_overload is set to 2:

<?php
function str_sizeof($string) {
    return count(preg_split("`.`", $string)) - 1 ;
}
?>

In answer to peter albertsson: once you have the octet size of your data, you can access each octet with something like
$string[0] ... $string[$size-1], since the [] operator is not multibyte-aware.

hayk at mail dot ru ¶

17 years ago

Since PHP 5.1.0 and PHP 4.4.2 there is an Armenian ArmSCII-8 (ArmSCII-8, ArmSCII8, ARMSCII-8, ARMSCII8) encoding available.

mt at mediamedics dot nl ¶

14 years ago

A multibyte one-to-one alternative for the str_split function (http://php.net/manual/en/function.str-split.php):

<?php
    function mb_str_split($string, $split_length = 1){
            
        mb_internal_encoding('UTF-8'); 
        mb_regex_encoding('UTF-8');  
        
        $split_length = ($split_length <= 0) ? 1 : $split_length;
        
        $mb_strlen = mb_strlen($string, 'utf-8');
        
        $array = array();
                
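        // Bug: "$i + $split_length" never changes $i, so this loop never ends; it should be "$i += $split_length".
        for($i = 0; $i < $mb_strlen; $i + $split_length){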
        
            $array[] = mb_substr($string, $i, $split_length); 
        }

        return $array;
    
    }
?>

peter dot albertsson at spray dot se ¶

18 years ago

Setting mbstring.func_overload = 2 may break your applications that deal with binary data.

After having set mbstring.func_overload = 2 and  mbstring.internal_encoding = UTF-8 I can't even read a binary file and print/echo it to output without corrupting it.

peter AT(no spam) dezzignz dot com ¶

14 years ago

The function trim() has not failed me so far in my multibyte applications, but in case one needs a truly multibyte function, here it is. The nice thing is that the character to remove can be whitespace or any other specified character, even a multibyte character.

<?php

// multibyte string split

function mbStringToArray ($str) {
    if (empty($str)) return false;
    $len = mb_strlen($str);
    $array = array();
    for ($i = 0; $i < $len; $i++) {
        $array[] = mb_substr($str, $i, 1);
        }
    return $array;
    }

// removes $rem at both ends

function mb_trim ($str, $rem = ' ') {
    if (empty($str)) return false;
    // convert to array
    $arr = mbStringToArray($str);
    $len = count($arr);
    // left side
    for ($i = 0; $i < $len; $i++) {
        if ($arr[$i] === $rem) $arr[$i] = '';
        else break;
        }
    // right side
    for ($i = $len-1; $i >= 0; $i--) {
        if ($arr[$i] === $rem) $arr[$i] = '';
        else break;
        }
    // convert to string
    return implode ('', $arr);
    }

?>

Aardvark ¶

18 years ago

Since not all hosted services currently support the multi-byte function set, it may still be necessary to process Unicode strings using standard single-byte functions.  The function at the following link - http://www.kanolife.com/escape/2006/03/php-unicode-processing.html - shows by example how to do this.  While this only covers UTF-8, the standard PHP function "iconv" allows conversion into and out of UTF-8 if strings need to be input or output in other encodings.

motin at demomusic dot nu ¶

17 years ago

As peter dot albertsson at spray dot se already pointed out, overloading strlen may break code that handles binary data and relies upon strlen for bytelengths. 

The problem occurs when a file is filled with a string using fwrite in the following manner:

$len = strlen($data);
fwrite($fp, $data, $len);

fwrite takes the number of bytes as its third parameter, but with overloading enabled strlen() behaves like mb_strlen() and returns the number of characters in the string. Since multibyte characters can each be more than one byte long, the last characters of $data never get written to the file.

After hours of investigating why PEAR::Cache_Lite didn't work - the above is what I found. 

I made an attempt at using single byte functions, but it doesn't work. Posting here anyway in case it helps someone else:

/**
* PHP single-byte functions simulation (unsuccessful)
* 
* Usage: sb_string(functionname, arg1, arg2, etc);
* Example: sb_string("strlen", "tuöéä"); returns 8 (should...)
*/
function sb_string() {

  $arguments = func_get_args(); 

  $func_overloading = ini_get("mbstring.func_overload");

  ini_set("mbstring.func_overload", 0);

  $ret = call_user_func_array(array_shift($arguments), $arguments);

  ini_set("mbstring.func_overload", $func_overloading);

  return $ret;

}
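
As a hedged aside: where mbstring is available, a byte length can be obtained without calling strlen() at all by asking mb_strlen() for the '8bit' encoding. The sketch below only illustrates the character/byte difference behind the truncated write; it has not been tested together with func_overload itself.

```php
<?php
$data = "tu\xC3\xB6\xC3\xA9\xC3\xA4"; // "tuöéä" in UTF-8

$chars = mb_strlen($data, 'UTF-8'); // counts characters
$bytes = mb_strlen($data, '8bit');  // counts bytes

echo $chars, "\n"; // 5
echo $bytes, "\n"; // 8

// Writing with the character count as the length would drop the last 3 bytes.
echo $bytes - $chars, "\n"; // 3
```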
