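Regular Expressions

Core Components

The C++ regular expression library is built from the following core components:

std::regex: an object that holds a compiled regular expression. "Compiling" a pattern string into a std::regex once lets every later match reuse the parsed form instead of re-parsing the pattern.
std::smatch (or std::cmatch, std::wsmatch, and so on): a container that stores the results of a match.

smatch is used with std::string.
cmatch is used with C-style strings (const char*).
wsmatch is used with std::wstring.
All of them are type aliases for specializations of std::match_results. smatch is the most commonly used.

std::ssub_match: the element type of an smatch container. Each element represents either the full match or a single capture group (sub-match).
Algorithm functions:

std::regex_match(): tries to match the entire input sequence against the regular expression.
std::regex_search(): searches the input sequence for the first subsequence that matches the regular expression.
std::regex_replace(): searches for and replaces every subsequence that matches the regular expression.

Iterators:

std::sregex_iterator: iterates over all non-overlapping matches in an input string.
std::sregex_token_iterator: a more flexible iterator for splitting and extraction tasks, such as tokenizing on a delimiter or pulling out specific capture groups.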

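Basic Usage Steps

Using regular expressions in C++ usually follows these steps:

Include the header with #include <regex>.
Write the regular expression pattern as a string.
Create a std::regex object to "compile" the pattern.
Prepare the target string to match against.
Pick the appropriate algorithm function (regex_match, regex_search, or regex_replace) and run it.
If the match succeeds, read the results from the std::smatch object.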

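1. `std::regex_match`: Full Match

regex_match checks whether a regular expression matches a string in its entirety. If the string contains any characters beyond what the pattern matches, it returns false.

Use cases: validating input formats such as email addresses, phone numbers, and dates.

Example code: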

#include <iostream>
#include <string>
#include <regex>
 
void check_date_format(const std::string& date_str) {
    // Match dates in YYYY-MM-DD format.
    // R"(...)" is a raw string literal, so backslashes do not need to be doubled.
    // \d{4} - matches four digits
    // -     - matches a literal -
    // \d{2} - matches two digits
    std::regex date_pattern(R"(\d{4}-\d{2}-\d{2})");
 
    if (std::regex_match(date_str, date_pattern)) {
        std::cout << "'" << date_str << "' is a valid date format." << std::endl;
    } else {
        std::cout << "'" << date_str << "' is NOT a valid date format." << std::endl;
    }
}
 
int main() {
    check_date_format("2023-10-27");      // succeeds
    check_date_format("2023-10-27 extra"); // fails because of the trailing " extra"
    check_date_format("invalid-date");     // fails
    return 0;
}

Output:

'2023-10-27' is a valid date format.
'2023-10-27 extra' is NOT a valid date format.
'invalid-date' is NOT a valid date format.

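2. `std::regex_search`: Searching for a Substring

regex_search finds the first substring of a string that matches a regular expression. It is used more often than regex_match because it can locate a pattern anywhere inside a larger body of text.

Use cases: extracting error messages from a log, extracting links from HTML, and so on.

Example code (extracting capture groups):

The smatch object holds the details of the match. If the search succeeds:

match[0] holds the entire matched substring.
match[1] holds the contents of the first capture group (...).
match[2] holds the contents of the second capture group (...), and so on.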

#include <iostream>
#include <string>
#include <regex>
 
void find_name_and_age(const std::string& text) {
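    // Capture groups are written with parentheses ().
    // (\w+)   : captures one or more word characters (letters, digits, underscore) as the name
    // is      : matches the literal text " is "
    // (\d+)   : captures one or more digits as the age
    std::regex pattern(R"((\w+) is (\d+))");
    std::smatch match_result; // stores the match results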
 
    if (std::regex_search(text, match_result, pattern)) {
        std::cout << "Found a match in: '" << text << "'" << std::endl;
        std::cout << "Full match (match[0]): " << match_result[0].str() << std::endl;
        std::cout << "Name (group 1, match[1]): " << match_result[1].str() << std::endl;
        std::cout << "Age (group 2, match[2]): " << match_result[2].str() << std::endl;
        
        // match_result.size() returns the number of capture groups + 1 (the full match is included)
        std::cout << "Total captured groups + full match: " << match_result.size() << std::endl;
    } else {
        std::cout << "No match found in: '" << text << "'" << std::endl;
    }
}
 
int main() {
    find_name_and_age("My cat Tom is 3 years old.");
    std::cout << "--------------------" << std::endl;
    find_name_and_age("Alice is 25 and Bob is 30."); // only the first match is found
    return 0;
}

Output:

Found a match in: 'My cat Tom is 3 years old.'
Full match (match[0]): Tom is 3
Name (group 1, match[1]): Tom
Age (group 2, match[2]): 3
Total captured groups + full match: 3
--------------------
Found a match in: 'Alice is 25 and Bob is 30.'
Full match (match[0]): Alice is 25
Name (group 1, match[1]): Alice
Age (group 2, match[2]): 25
Total captured groups + full match: 3

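3. `std::sregex_iterator`: Iterating over All Matches

To find every matching substring in a string rather than just the first one, use an iterator.

Use cases: extracting all URLs or email addresses from a piece of text.

Example code: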

#include <iostream>     // std::cout, for printing the matched numbers
#include <string>       // std::string
#include <regex>        // std::regex, std::sregex_iterator, std::smatch
#include <iterator>     // general iterator support

// find_all_numbers: finds and prints every integer in the input text
// (runs of digits delimited by word boundaries).
void find_all_numbers(const std::string& text) {
    // One or more digits with a word boundary on each side, so digits
    // embedded in a longer word such as "abc123" are not matched.
    std::regex num_pattern(R"(\b\d+\b)");

    // Iterator positioned at the first match in [text.begin(), text.end()).
    auto words_begin = std::sregex_iterator(text.begin(), text.end(), num_pattern);

    // A default-constructed iterator serves as the end-of-sequence marker.
    auto words_end = std::sregex_iterator();

    // Print the original string.
    std::cout << "Found numbers in '" << text << "':" << std::endl;

    // Visit every match.
    for (auto i = words_begin; i != words_end; ++i) {
        // Dereferencing yields the current match as a std::smatch.
        const std::smatch &match = *i;

        // Print the matched text (the integer).
        std::cout << "  " << match.str() << std::endl;
    }
}

// Program entry point.
int main() {
    // Call the function with the string to search.
    find_all_numbers("There are 3 apples, 10 oranges, and 123 bananas.");

    // Return 0 to indicate success.
    return 0;
}

Output:

Found numbers in 'There are 3 apples, 10 oranges, and 123 bananas.':
  3
  10
  123

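4. `std::regex_replace`: Search and Replace

regex_replace finds every match of a pattern in a text and replaces it according to a format string.

Use cases: data cleaning, format conversion, masking sensitive information.

The replacement string can use the following escape sequences to refer to parts of the match:

$&: the entire matched substring.
$1, $2, ...: the contents of the Nth capture group.
$`: everything before the match.
$': everything after the match.
$$: a literal $.

Example code: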

#include <iostream>
#include <string>
#include <regex>
 
void reformat_date(const std::string& text) {
    // Capture the year, month, and day.
    std::regex date_pattern(R"((\d{4})-(\d{2})-(\d{2}))");

    // Replacement format: turn YYYY-MM-DD into MM/DD/YYYY.
    // $2 is the second capture group (month), $3 the third (day), $1 the first (year).
    const std::string replacement_format = "$2/$3/$1";
 
    std::string result = std::regex_replace(text, date_pattern, replacement_format);
 
    std::cout << "Original: " << text << std::endl;
    std::cout << "Replaced: " << result << std::endl;
}
 
int main() {
    reformat_date("Today's date is 2023-10-27, tomorrow is 2023-10-28.");
    return 0;
}

Output:

Original: Today's date is 2023-10-27, tomorrow is 2023-10-28.
Replaced: Today's date is 10/27/2023, tomorrow is 10/28/2023.

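Error Handling and Performance

Error Handling

If a regular expression is syntactically invalid, the std::regex constructor throws a std::regex_error exception. In production code, wrap the construction in a try-catch block, especially when the pattern comes from user input or configuration.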

try {
    std::regex bad_pattern("([a-z]"); // unbalanced parenthesis: an invalid regular expression
} catch (const std::regex_error& e) {
    std::cerr << "Regex error: " << e.what() << std::endl;
    std::cerr << "Error code: " << e.code() << std::endl;
}

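Performance Considerations

Constructing a std::regex object is expensive because the pattern has to be parsed and compiled. If a regular expression is used many times (for example, inside a loop), create the std::regex object once outside the loop and reuse it.

// Good practice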
std::regex r(R"(\w+)");
for (const auto& s : my_string_list) {
    if (std::regex_search(s, r)) {
        // ...
    }
}
 
// Bad practice (the regex is reconstructed on every iteration)
for (const auto& s : my_string_list) {
    if (std::regex_search(s, std::regex(R"(\w+)"))) { // very inefficient!
        // ...
    }
}

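Regular Expression Syntax Options

The std::regex constructor accepts a second argument that selects the grammar. The most common choice is std::regex::ECMAScript, which is also the default. The others are basic, extended, awk, grep, and egrep. ECMAScript (essentially the JavaScript regular expression syntax, with some modifications) is expressive and widely known, so it is the recommended choice unless you specifically need POSIX-style syntax.

Another useful flag is std::regex::icase, which makes matching case-insensitive.

// Case-insensitive match for "error"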
std::regex case_insensitive_pattern("error", std::regex::icase);
 
std::string text = "An Error occurred, but it's not a big ERROR.";
if (std::regex_search(text, case_insensitive_pattern)) {
    std::cout << "Found 'error' (case-insensitive)." << std::endl;
}

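Summary

| Task | C++ component/function | Key point |
| --- | --- | --- |
| Define a pattern | `std::regex` | Expensive to construct; reuse it |
| Store results | `std::smatch` | `m[0]` is the full match, `m[1]` is the first capture group |
| Full match | `std::regex_match` | The entire string must match; used for validation |
| Search for a substring | `std::regex_search` | Finds the first match; used for extraction |
| Find all matches | `std::sregex_iterator` | Loop over all non-overlapping matches |
| Replace | `std::regex_replace` | Refer to capture groups with `$1`, `$2`, and so on |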
