Matrix Multiplication on SMP

2014-11-24 03:20:59

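Parallel matrix multiplication is almost always done with Cannon's algorithm, but on a shared-memory machine I don't think Cannon's algorithm has any advantage.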

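Cannon's algorithm increases parallelism while keeping only a single global copy of each variable. With shared memory there is no cost for copying variables, so a plain strip (row-band) partitioning can be used directly. This avoids the barrier overhead between iterations and improves efficiency.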
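To make the strip partitioning concrete, here is a minimal sketch that uses std::thread instead of the OpenMP used in the program further down. The names `multiplyBand` and `multiplyStriped`, the flat row-major storage, and the worker count are assumptions made for illustration. Each worker owns a contiguous band of rows of C and only reads A and B, so the single synchronization point is the final join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Row-major n x n matrix in a flat vector (an assumption of this sketch).
using Matrix = std::vector<double>;

// Compute rows [rowBegin, rowEnd) of C = A * B. Only this band of C is written.
void multiplyBand(const Matrix& a, const Matrix& b, Matrix& c,
                  std::size_t n, std::size_t rowBegin, std::size_t rowEnd)
{
    for (std::size_t i = rowBegin; i < rowEnd; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}

// Split the rows of C into contiguous bands, one band per worker thread.
Matrix multiplyStriped(const Matrix& a, const Matrix& b,
                       std::size_t n, std::size_t workers)
{
    assert(a.size() == n * n && b.size() == n * n && workers > 0);
    Matrix c(n * n, 0.0);
    std::vector<std::thread> pool;
    const std::size_t band = (n + workers - 1) / workers;  // rows per band, rounded up
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = std::min(n, w * band);
        const std::size_t end = std::min(n, begin + band);
        if (begin == end) break;  // more workers than rows
        pool.emplace_back(multiplyBand, std::cref(a), std::cref(b),
                          std::ref(c), n, begin, end);
    }
    for (std::thread& t : pool) t.join();  // the only synchronization point
    return c;
}
```

Calling `multiplyStriped(a, b, n, 4)` would split the rows into four bands. Because the bands never overlap, no locking is needed while the workers run.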

Implementing matrix multiplication on SMP


[cpp]
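#include "stdafx.h"
#include "matrixOperation.h"
#include <omp.h>     // needed only if omp_set_nested is enabled
#include <ctime>     // time
#include <iostream>  // cout, cin, endl
using namespace std;

int _tmain(int argc, _TCHAR* argv[])
{
    const int size=5000;
    double **a,**b,**c;
    // allocate the three matrices as arrays of row pointers
    a=new double*[size];
    b=new double*[size];
    c=new double*[size];
    for(int i=0;i<size;i++)
    {
        a[i]=new double[size];
        b[i]=new double[size];
        c[i]=new double[size];
    }
    cout<<"mem set"<<endl;
    //read file: load the input matrices a and b here (the reader calls are not shown)
    //for more cache hits
    //transpose b so that the data needed lies in one cache block
    matrixTransposition(b,size);
    cout<<"data prepared"<<endl;
    long start=time(0);
    // omp_set_nested(true);
    //each outer iteration computes one full row of c, so no two threads write the same element
    #pragma omp parallel for num_threads(16) schedule(dynamic)
    for(int i=0;i<size;i++)
    {
        // #pragma omp parallel for firstprivate(i) num_threads(4)
        for(int j=0;j<size;j++)
        {
            c[i][j]=0;
            for(int k=0;k<size;k++)
            {
                c[i][j]+=a[i][k]*b[j][k];//b is transposed, so this differs from the textbook a[i][k]*b[k][j]
            }
        }
        cout<<".";//progress marker, one dot per finished row
    }
    long end=time(0);
    cout<<end-start<<endl;//elapsed computation time in seconds
    writeMatrix("out",c,size);
    for(int i=0;i<size;i++)
    {
        delete[] a[i];
        delete[] b[i];
        delete[] c[i];
    }
    delete[] a;
    delete[] b;
    delete[] c;
    cin>>start;//wait for input so the console window stays open
    return 0;
}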
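
The helper matrixTransposition comes from matrixOperation.h, which is not shown here. As a rough sketch of what such a helper might look like for a square matrix stored as an array of row pointers, the hypothetical `transposeInPlace` below swaps every pair of elements across the diagonal. The hypothetical `dotRows` shows why this helps: after the transpose, element (i, j) of the product is the dot product of row i of a and row j of the transposed b, so both operands are scanned through contiguous memory.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for matrixTransposition: swap m[i][j] with m[j][i]
// for every pair above the diagonal of a square size x size matrix.
void transposeInPlace(double** m, int size)
{
    assert(size >= 0);
    for (int i = 0; i < size; ++i) {
        for (int j = i + 1; j < size; ++j) {
            std::swap(m[i][j], m[j][i]);
        }
    }
}

// Dot product of two rows of equal length; both are read sequentially.
double dotRows(const double* x, const double* y, int size)
{
    double sum = 0.0;
    for (int k = 0; k < size; ++k) {
        sum += x[k] * y[k];
    }
    return sum;
}
```

With these two helpers, the inner loop of the program corresponds to `c[i][j] = dotRows(a[i], b[j], size)` once b has been transposed.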

On an i7 2600 processor, the parameters above worked well for multiplying two 5000×5000 matrices: the pure computation time was about 126 seconds.
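
time(0) only has a resolution of one second. That is enough for a run of about two minutes, but too coarse when comparing thread counts or schedules on smaller matrices. A possible alternative, sketched here with std::chrono::steady_clock, is a small wrapper that times any callable. The name `secondsTaken` is hypothetical.

```cpp
#include <chrono>

// Run work() once and return the wall-clock time it took, in seconds.
template <typename Work>
double secondsTaken(Work&& work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

The multiplication loop could be wrapped in a lambda and passed to `secondsTaken`, which would report fractions of a second instead of whole seconds.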