C++ split Chinese string 分割中文 #27

Shellbye opened this issue Aug 13, 2018 · 3 comments

Shellbye commented Aug 13, 2018

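Compared with the convenience of Python, C++ really does lack a lot of common operations. A recent project required splitting a Chinese string into characters, and as a C++ newcomer I searched online for a long time without finding anything. In the end I stumbled on this Stack Overflow Q&A, which finally pointed me in the right direction.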

#include <iostream>
#include <string>
#include <vector>

// Splits a UTF-8 encoded string into one substring per code point.
std::vector<std::string> split_chinese(const std::string& s) {
    std::vector<std::string> t;
    for (size_t i = 0; i < s.length();)
    {
        size_t cplen = 1;
        // The lead byte determines the sequence length; see
        // https://en.wikipedia.org/wiki/UTF-8#Description
        if ((s[i] & 0xf8) == 0xf0)      // 11110xxx: 4-byte sequence
            cplen = 4;
        else if ((s[i] & 0xf0) == 0xe0) // 1110xxxx: 3-byte sequence
            cplen = 3;
        else if ((s[i] & 0xe0) == 0xc0) // 110xxxxx: 2-byte sequence
            cplen = 2;
        if ((i + cplen) > s.length())   // truncated sequence: emit a single byte
            cplen = 1;
        t.push_back(s.substr(i, cplen));
        i += cplen;
    }
    return t;
}

int main()
{
    std::string s = "这是一组中文"; // split_chinese assumes these bytes are UTF-8
    std::vector<std::string> t = split_chinese(s);
    for (const auto& a : t) {
        std::cout << a << '\n';
    }
    return 0;
}

Shellbye commented Jan 28, 2023 via email

hpc203 commented Dec 30, 2023

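This program has a bug. I changed the input to "杰尼龟":
[image]
and the program then printed this:
[image]

That was the result of compiling and running on Windows 10; on Ubuntu the output is correct. It seems the split_chinese function only works on Linux.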
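
A possible explanation, offered as an assumption since the screenshots are not available: split_chinese expects UTF-8 bytes, but on Windows the string literal may be compiled in the local code page (GBK on a Chinese-locale system) unless the compiler is told otherwise, for example with MSVC's /utf-8 option. Under that assumption, "这是一组中文" would split correctly only by coincidence, because each of its GBK lead bytes happens to match the UTF-8 two-byte pattern 110xxxxx, while "杰尼龟" does not. The sketch below uses a hypothetical helper, looks_like_utf8, to check whether a byte string is structurally valid UTF-8 before it is split. The byte values in the comments are what UTF-8 and GBK are assumed to produce for "杰尼龟".

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of the original post.
// Returns true if s consists only of structurally well-formed UTF-8
// sequences. Simplified: it checks lead and continuation bit patterns
// but does not reject overlong encodings or surrogate code points.
//
// Assumed encodings of "杰尼龟":
//   UTF-8: E6 9D B0  E5 B0 BC  E9 BE 9F   -> accepted
//   GBK:   BD DC  C4 E1  B9 EA            -> rejected (0xBD is not a lead byte)
bool looks_like_utf8(const std::string& s) {
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        if (lead < 0x80)                len = 1;  // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) len = 2;  // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) len = 3;  // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) len = 4;  // 11110xxx
        else return false;                        // stray continuation or invalid lead
        if (i + len > s.size()) return false;     // sequence cut off at end of input
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false; // continuation must be 10xxxxxx
        }
        i += len;
    }
    return true;
}
```

Calling looks_like_utf8(s) before split_chinese(s) and reporting an error when it returns false would make an encoding mismatch visible instead of printing garbled pieces.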

okideal commented Mar 11, 2024

On Windows 10 it works fine after converting to GBK encoding.
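
If the text is kept in GBK rather than UTF-8, the splitting rule changes, because GBK marks a two-byte character with a lead byte in the range 0x81 to 0xFE instead of the UTF-8 bit patterns. The following is a minimal sketch under that assumption. The name split_gbk is hypothetical and does not come from this thread, and GB18030's four-byte sequences are deliberately not handled.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not part of the original post.
// Splits GBK-encoded bytes into one substring per character, assuming only
// GBK's one- and two-byte forms: a lead byte in 0x81..0xFE starts a two-byte
// character, and any other byte stands alone. Trail bytes are not validated.
std::vector<std::string> split_gbk(const std::string& s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = (lead >= 0x81 && lead <= 0xFE) ? 2 : 1;
        if (i + len > s.size())  // truncated pair: emit the lone byte
            len = 1;
        out.push_back(s.substr(i, len));
        i += len;
    }
    return out;
}
```

With the GBK bytes assumed for "杰尼龟" (BD DC, C4 E1, B9 EA), split_gbk returns three two-byte pieces, whereas the UTF-8 rule would misread those lead bytes.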
