C++ split Chinese string 分割中文 #27

Shellbye · 2018-08-13T06:54:03Z

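Compared with the convenience of Python, C++ is short on everyday string operations. A recent project needed a Chinese string split into individual characters. As a C++ newcomer I searched online for a long time without finding anything, and only by luck came across this SO (Stack Overflow) Q&A, which finally pointed me in the right direction.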

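#include <iostream>
#include <string>
#include <vector>

// Splits a UTF-8 encoded string into one std::string per code point.
std::vector<std::string> split_chinese(const std::string& s) {
    std::vector<std::string> t;
    for (size_t i = 0; i < s.length();)
    {
        size_t cplen = 1;
        // The lead-byte checks below follow the table at
        // https://en.wikipedia.org/wiki/UTF-8#Description
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        if ((lead & 0xf8) == 0xf0)      // mask 11111000, lead 11110xxx: 4 bytes
            cplen = 4;
        else if ((lead & 0xf0) == 0xe0) // mask 11110000, lead 1110xxxx: 3 bytes
            cplen = 3;
        else if ((lead & 0xe0) == 0xc0) // mask 11100000, lead 110xxxxx: 2 bytes
            cplen = 2;
        // Fall back to a single byte if the sequence is cut off at the end.
        if ((i + cplen) > s.length())
            cplen = 1;
        t.push_back(s.substr(i, cplen));
        i += cplen;
    }
    return t;
}

int main()
{
    // Assumes this literal is stored as UTF-8 bytes.
    std::string s = "这是一组中文";
    std::vector<std::string> t = split_chinese(s);
    for (const auto& a : t) {
        std::cout << a << std::endl;
    }
    return 0;
}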

Shellbye · 2023-01-28T05:18:37Z

Feel free to use it.

…

In reply to the emailed question (Saturday, 2023-01-28, 12:19): "Hello. May I ask: if I would like to use this content, which license would you release it under?"

hpc203 · 2023-12-30T13:37:09Z

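This program has a bug. I changed the input to "杰尼龟".

This time the program's output came out wrong. [output screenshot not included]

That was the result when compiled and run on Windows 10. On Ubuntu the output is correct. It seems the split_chinese function only works on Linux.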
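A possible explanation, offered here as an assumption rather than something confirmed in this thread: split_chinese expects UTF-8 input, while a Windows toolchain may store the string literal in a legacy code page such as GBK, so the lead-byte tests no longer line up with character boundaries. The sketch below is a hypothetical helper, `looks_like_utf8` (not part of the original post), that can be used to check whether the bytes reaching split_chinese have the shape of UTF-8 at all. It only checks lead and continuation bytes; overlong forms and surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

// Hypothetical diagnostic helper: returns true when every byte sequence in s
// is shaped like UTF-8 (a lead byte followed by the right number of
// continuation bytes of the form 10xxxxxx).
bool looks_like_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t cplen = 0;
        if (lead < 0x80)                cplen = 1; // 0xxxxxxx (ASCII)
        else if ((lead & 0xe0) == 0xc0) cplen = 2; // 110xxxxx
        else if ((lead & 0xf0) == 0xe0) cplen = 3; // 1110xxxx
        else if ((lead & 0xf8) == 0xf0) cplen = 4; // 11110xxx
        else return false;                         // stray continuation or invalid lead
        if (i + cplen > s.size()) return false;    // sequence cut off at the end
        for (std::size_t k = 1; k < cplen; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xc0) != 0x80) return false;  // not a continuation byte
        }
        i += cplen;
    }
    return true;
}
```

Calling it on the literal before splitting, for example `looks_like_utf8("杰尼龟")`, should return true when the literal really is UTF-8. If it returns false on Windows, the input encoding, not the splitting loop, is the likely culprit.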

okideal · 2024-03-11T02:03:04Z

On Win10 it works fine once converted to GBK encoding.
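One reading of this comment is that the text ends up GBK-encoded on Windows. If the bytes really are GBK, a splitter would have to follow GBK's layout instead of UTF-8's: bytes below 0x80 are single-byte ASCII, and a lead byte in the range 0x81 to 0xFE starts a two-byte character. The sketch below is a hypothetical `split_gbk` written under that assumption; it does not validate the trail byte and does not handle the four-byte sequences of GB18030.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical GBK counterpart to split_chinese: splits a GBK-encoded string
// into one std::string per character (1 byte for ASCII, 2 bytes otherwise).
std::vector<std::string> split_gbk(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // A byte in 0x81..0xFE is treated as the lead byte of a two-byte character.
        std::size_t len = (b >= 0x81 && b <= 0xfe) ? 2 : 1;
        if (i + len > s.size())
            len = 1; // sequence cut off at the end: emit the lone byte
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

For instance, the GBK bytes C4 E3 BA C3 (the encoding of "你好") would be split into the two pairs C4 E3 and BA C3.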

Shellbye added the C/C++ label Aug 13, 2018

Shellbye mentioned this issue Aug 31, 2018

C++版本汉字转拼音 Chinese character to pinyin #31

Open
