当前位置：锐英源 / 开源技术 / C++行业技术开源 / C++多线程计数基准测试、C++性能实战最佳推荐

服务方向

人工智能数据处理
人工智能培训
kaldi数据准备
小语种语音识别
语音识别标注
语音识别系统
语音识别转文字
kaldi开发技术服务
软件开发: 运动控制卡上位机; 机械加工软件; 量化预测; 股票离线分析软件
软件开发培训: Java 安卓移动开发; VC++; C#软件; 汇编和破解; 驱动开发

联系方式

固话：0371-63888850
手机：138-0381-0136
Q Q：396806883
微信：ryysoft

C++多线程计数基准测试、C++性能实战最佳推荐

本文是翻译内容，如果和原作者意见不一致，请联系锐英源软件。

C++是为性能存在的语言，本文用几个技巧让性能逐级提升，从复杂互斥到简单互斥，再用分片降低互斥使用次数，从STL默认并行函数到自己编写并行函数，特别是最后一步，自己编写并行函数，实在太牛了，太让人佩服了。需要完整源代码请找我，大家也可以自己下载。计数任务可以扩展为常用任务，所以学会本文会提升一些常见并行多线程架构的性能。

介绍
本文的基准测试代码来自 VC++：Windows 上的 30 个多线程错误文章，但它们分散在不同的部分中。这篇文章是为了在一篇文章中给予他们应有的关注。除非另有说明，否则 C++17 std::for_each parallel 用于调用所有逻辑内核上的线程。计数基准测试是在 Windows 7 上具有 12 个逻辑内核的 Intel i11 上完成的。Ubuntu 24.04 基准测试结果显示在最后一部分。

互斥
第一个示例涉及使用 C++11 mutex同步来自多个线程的计数。

void inc_mutex()
{
timer watch;
watch.start("inc mutex");

int count = 0;
std::mutex mut;

std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&mut, &count](size_t num) -> void
{
if (is_valid((int)(num)))
{
std::lock_guard<std::mutex> guard(mut);
++count;
}
});

watch.stop();

std::cout << "total count:" << count << std::endl;
}
Vec随机初始化，直到 VEC_SIZE为100000000。 rand()是注入不确定性，以阻止编译器在编译时计算结果并优化代码。

const size_t VEC_SIZE = 100000000U;

void init_vector()
{
srand(time(NULL));
for (size_t i = 0; i < VEC_SIZE; ++i)
Vec.push_back(rand() % 20);
}

bool is_valid(int n)
{
return (n % 2 == 0);
}
mutex基准测试结果如下所示。

Windows Benchmark Results
=========================
inc mutex: 3136ms

原子计数
由于我们只处理一个整数变量count，因此我们可以放弃mutex并使用 C++11 atomic整数处理count 来提高性能。

void inc_atomic()
{
timer watch;
watch.start("inc atomic");

std::atomic<int> count = 0;

std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
++count;
});

watch.stop();

std::cout << "total count:" << count << std::endl;
}
和 integer 的基准测试结果如下所示。我们有超过 66% 的改进。mutexatomic

Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms

Windows InterlockedIncrement
如果您在 Windows 上使用 C++11 之前的编译器，则可以使用 Windows的 InterlockedIncrement来实现相同的性能增益。

#ifdef _WIN32
void inc_winInterlocked()
{
timer watch;
watch.start("inc win Interlocked");

LONG count = 0;

std::for_each(std::execution::par, Vec.begin(), Vec.end(),
[&count](size_t num) -> void
{
if (is_valid((int)(num)))
::InterlockedIncrement(&count);
});

watch.stop();

std::cout << "total count:" << count << std::endl;
}
#endif
的基准测试结果， integer 和如下所示。mutexatomicInterlockedIncrement

Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms

无锁定
对于这个无锁基准测试，我们将使用我编写的 loop::parallel_for，而不是使用 C++17 并行std::for_each 。loop::parallel_for函数声明如下。当numOfThreads 为零时，不会生成线程。当numOfThreads为 1 时，所有工作都在当前线程中完成。当numOfThreads 是 -1时，线程数等于逻辑内核数。 numOfThreads可以设置为任何其他数字。经验法则是设置小于 logical cores。

template<typename Func>
void parallel_for(int numOfThreads, size_t beginIndex, size_t endIndex, Func f)
Func签名如下。 threadID并且index使我们能够消除同步。threadID的值是0从到 numOfThreads- 1。index的值是从beginIndex 到 endIndex - 1。

void Func(int threadID, size index);
我们创建一个CountStruct来存储每个线程的count. buf成员是为了防止虚假分享。在所有线程结束后，total将从所有CountStruct 累加.

struct CountStruct
{
CountStruct() : Count(0)
{
memset((void*)buf, 0, sizeof(buf));
}
int Count;
char buf[128]; // to prevent false sharing
};

void inc_no_lock()
{
int threads = std::thread::hardware_concurrency();
std::vector<CountStruct> count(threads);

timer watch;
watch.start("inc no lock");

loop::parallel_for(threads, (size_t)(0), (size_t)(VEC_SIZE),
[&count](int threadIndex, int index) -> void
{
if (is_valid(Vec[index]))
++(count[threadIndex].Count);
});

int total = 0;
for (auto& st : count)
total += st.Count;

watch.stop();

std::cout << "total count:" << total << std::endl;
}
无锁的基准如下。

Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms

单线程
在本节中，我们使用普通 for循环来对单线程性能进行基准测试。注意：此for循环没有上一个代码基准测试的线程启动开销。

C++
void inc_single_thread()
{
timer watch;
watch.start("inc single_thread");

int count = 0;

for (size_t i = 0; i < Vec.size(); ++i)
{
if (is_valid(Vec[i]))
++count;
}

watch.stop();

std::cout << "total count:" << count << std::endl;
}
将显示单线程的基准测试结果。

Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms

每个核心一个锁
在本节中，我们使用 C++17 并行std::for_each，但我们放弃了std::for_each并行的便利性，根据逻辑内核的数量生成线程，在每个线程内部，我们计算工作分工并手动进行循环以将计数存储在本地temp_count，在每个线程的末尾，我们在累积temp_count 到count之前获得一个mutex。注意：获取mutex的数量正好是逻辑内核的数量。

void inc_less_mutex_lock()
{
timer watch;
watch.start("inc less mutex lock");

const size_t threads = std::thread::hardware_concurrency();
std::vector<size_t> vecIndex;
for (size_t i = 0; i < threads; ++i)
vecIndex.push_back(i);

int count = 0;
std::mutex mut;

std::for_each(std::execution::par, vecIndex.begin(), vecIndex.end(),
[&mut, &count, threads](size_t index) -> void
{
size_t thunk = Vec.size() / threads;
size_t start = thunk * index;
size_t end = start + thunk;
if (index == (threads - 1))
{
size_t remainder = Vec.size() % threads;
end += remainder;
}
int temp_count = 0;
for (int i = start; i < end; ++i)
{
if (is_valid((int)(Vec[i])))
{
++temp_count;
}
}
{
std::lock_guard<std::mutex> guard(mut);
count += temp_count;
}
});

watch.stop();

std::cout << "total count:" << count << std::endl;
}
非常令人惊讶的是，每个逻辑内核获取mutex（在我的 Intel i7 870 案例中为 12 个内核）在 Windows 上取得了最好的结果。

Windows Benchmark Results
=========================
inc mutex: 3136ms
inc atomic: 1007ms
inc win Interlocked: 1005ms
inc no lock: 73ms
inc single_thread: 86ms
inc less mutex lock: 33ms

WSL 上 Ubuntu 24.04 的基准测试结果
本节列出了 Windows 11 WSL、Ubuntu 24.04 的 G++ 13.2 和 Clang 18.1.3 的基准测试结果。CPU 与具有逻辑内核的 Intel i7 870 12 相同。两个编译器的结果相似。采用无锁、按内核采用mutex和单线程产生了类似的性能。请注意，对于正常的工作量，比如计算质数，工作所花费的时间使写入count的时间相形见绌。尽管 count 性能特点相似，但与单线程相比，其他两个的优势是我们可以在线程中划分工作负载。

Ubuntu G++ Benchmark Results
============================
inc mutex: 719ms
inc atomic: 473ms
inc no lock: 51ms
inc single_thread: 48ms
inc less mutex lock: 54ms

Ubuntu Clang Benchmark Results
==============================
inc mutex: 745ms
inc atomic: 463ms
inc no lock: 54ms
inc single_thread: 52ms
inc less mutex lock: 58ms
以下是在 G++ 和 Clang 下编译代码的命令。请记住将和复制到您的 Ubuntu 文件夹。在非 Windows 平台上编译时，Windows 基准测试被排除在外。parallel_for_each.htimer.hInterlockedIncrement

g++ MultithreadedCount.cpp -O3 -std=c++17
clang++ MultithreadedCount.cpp -O3 -std=c++17

友情链接

CAD培训 / ERP管理系统