前言:参加完ASC,回家两天,看到HPCGame开赛心痒痒

A.欢迎参赛!

这个题就是介绍一下比赛的相关内容,直接提交后就会返回提供的集群账号密码

B.流量席卷土豆

流量席卷土豆.mhtml

这个题挺有意思的,刚开始看到题的时候说实话我人有点慌了(好多字)冷静下来认真读题后发现,就是一个作业提交和软件使用而已。

解题步骤如下:

1、srun提交命令提取 SSH 流量:

(赛题中提供的路径有问题,找了一下还有一层子目录pcaps)

1
srun -p C064M0256G -N4 --ntasks-per-node=4 bash -c 'tshark -r /lustre/shared_data/potato_kingdom_univ_trad_cluster/pcaps/$SLURM_PROCID.pcap -Y ssh -w $SLURM_PROCID.ssh.pcap'

当前路径就会得到如下文件

image-20240123134122253

2、合并 PCAP 文件:

1
mergecap -w merged.pcap *.ssh.pcap
image-20240123134322241

3、破解密码:

1
quantum-cracker merged.pcap
image-20240123134517990

4、获取 Slurm JobID

1
sacct -u $(whoami) --format JobID | tail
image-20240123134612339

取最后一个的整数,提交即可。

C.简单的编译

简单编译.mhtml

这道题就是写Makefile和CMakeLists.txt

1
2
3
4
5
6
7
8
9
在本题中,你需要写一个简单的Makefile文件,和一个简单的CMakeLists.txt文件,来编译 Handout 中所提供的三个简单程序。

其中,hello_cuda.cu是一个简单的cuda程序,hello_mpi.cpp是一个简单的mpi程序,hello_omp.cpp是一个简单的 OpenMP 程序。它们都做了同一个简单的事情:从文件中读取一个向量并求和。

你需要上传Makefile和CMakeLists.txt文件。我们会根据以下策略来评测你所写的配置文件的正确性。

对于Makefile文件,我们会在项目根目录下执行make命令。然后在项目根目录下检查程序是否被生成,并运行以检测正确性。
对于CMakeLists.txt文件,我们会在项目根目录下执行mkdir build; cd build; cmake ..; make。然后我们会在build目录下检查程序是否被正确生成,并运行以检测正确性。
对于所有类型的文件,hello_cuda.cu所编译出的文件名应为hello_cuda;hello_mpi.cpp所编译出的文件名应为hello_mpi;hello_omp.cpp所编译出的文件名应为hello_omp。

集群GPU节点连不上,用记事板手搓了一下,提交上去Makefile一直报错,

查了一下是tab缩进的问题,后来在终端里面用vim编写,复制,完美解决

(长记性了,用vim来编辑才能正常识别tab)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
.PHONY: all clean

all: hello_cuda hello_mpi hello_omp

hello_cuda: hello_cuda.cu
nvcc -o hello_cuda hello_cuda.cu

hello_mpi: hello_mpi.cpp
mpic++ -o hello_mpi hello_mpi.cpp

hello_omp: hello_omp.cpp
g++ -fopenmp -o hello_omp hello_omp.cpp

clean:
rm -f hello_cuda hello_mpi hello_omp
1
2
3
4
5
6
7
8
9
10
11
12
13
cmake_minimum_required(VERSION 3.10)
project(test)

enable_language(CUDA)
add_executable(hello_cuda hello_cuda.cu)

find_package(MPI REQUIRED)
add_executable(hello_mpi hello_mpi.cpp)
target_link_libraries(hello_mpi MPI::MPI_CXX)

find_package(OpenMP REQUIRED)
add_executable(hello_omp hello_omp.cpp)
target_link_libraries(hello_omp OpenMP::OpenMP_CXX)

D.小北问答 Classic

小北问答 Classic.mhtml

image-20240123135107256
1
答案:{"t1":"65.396","t2.1":"105.26","t2.2":"107.14","t3":"iv. GCC","t4.1":"ii. 进程","t4.2":"i. 线程","t4.3":"i. 线程","t5.1":"8","t5.2":"i. 在 CPU 中设计独立的 AVX512 运算单元,但有可能导致调用 AVX512 指令集时因功耗过大而降频","t5.3":"ii. 复用两个 AVX2 运算单元执行 AVX512 运算","t6":"iv. OpenGL","t8":"iv. PCIe","t9":"ii. Slurm","t10":"i. 缓存未命中","t7":"iii,iv"}

E.齐心协力

齐心协力.mhtml

这道题卡了两天,Ray集群老是出错,后来偶然交了一发过了

后来发现提示里有放置组的概念,学习了一下,改了一下代码,111.6/120分,满足了

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import ray
import numpy as np
import os

@ray.remote(num_cpus=1)
class Counter:
def __init__(self, flag_index):
self.weight_matrix = np.load(f"weights/weight_{flag_index}.npy")

def process(self, input_matrix):
return np.maximum(0, np.dot(input_matrix, self.weight_matrix))

if __name__ == "__main__":
ray.init(address=os.environ['RAY_CLUSTER_ADDR'])

# 创建放置组,每组有 4 个 CPU
placement_groups = [ray.util.placement_group([{"CPU": 4}], strategy="STRICT_PACK") for _ in range(4)]
ray.get([pg.ready() for pg in placement_groups]) # 等待放置组就绪

if not os.path.exists("outputs"):
os.makedirs("outputs")

# 在每个放置组中创建 Counter 实例
counters = [[Counter.options(placement_group=placement_groups[i], placement_group_bundle_index=0).remote(j) for j in range(4)] for i in range(4)]

output_results = []
for i in range(100):
input_matrix = np.load(f"inputs/input_{i}.npy")
result = input_matrix
for counter in counters[i % 4]:
result = counter.process.remote(result)
output_results.append(result)

# 收集和保存结果
for i, result in enumerate(output_results):
output = ray.get(result)
np.save(f"outputs/output_{i}.npy", output)

F.高性能数据校验

高性能数据校验.mhtml

这道题一开始随便开了个MPI竟然就有80多分,而且我题都没认真看,后来随便改改卡到了90(没错,是卡bug到了90)题目重测后直接回归0分,着急!

仔细读了一下题,看到提升里的SHA512函数,就想着用SHA512替换原来的一堆函数(没有认真去看那几个函数),然后就陷入了死循环里,SHA512没法进行并行计算,第i个块的计算依赖于第i-1个。

后来偶然的仔细看了一下代码,发baseline代码中的函数完全解决了上面的问题,可以拆开做,只需要用更多的数组来记录一下就行了image-20240125213452456

修改了一下代码:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
void checksum(uint8_t *data, size_t len, uint8_t *obuf) {
int num_block = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;

EVP_MD *sha512 = EVP_MD_fetch(nullptr, "SHA512", nullptr);
EVP_MD_CTX *ctx[num_block];
for(int i=0; i<num_block; i++)//多创建一点ctx让记录数据和校验位分开计算
{
ctx[i]=EVP_MD_CTX_new();
EVP_DigestInit_ex(ctx[i], sha512, nullptr);
}
uint8_t prev_md[SHA512_DIGEST_LENGTH];
SHA512(nullptr, 0, prev_md);
for (int i = 0; i < num_block; i++) {
uint8_t buffer[BLOCK_SIZE]{};
EVP_DigestInit_ex(ctx[i], sha512, nullptr);
std::memcpy(buffer, data + i * BLOCK_SIZE, std::min(BLOCK_SIZE, len - i * BLOCK_SIZE));
EVP_DigestUpdate(ctx[i], buffer, BLOCK_SIZE);
}
for(int i=0;i<num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len);
EVP_MD_CTX_free(ctx[i]);
}
std::memcpy(obuf, prev_md, SHA512_DIGEST_LENGTH);
EVP_MD_free(sha512);
}

就这一改,run了一下,成了!思路正确!测试了一下几个for循环的耗时,中间数据处理的循环占比最高image-20240125213808760

那就没问题了,开始修改MPI并行程序:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
#include <algorithm>
#include <chrono>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mpi.h>
#include <openssl/evp.h>
#include <openssl/sha.h>

namespace fs = std::filesystem;

constexpr size_t BLOCK_SIZE = 1024 * 1024;

void print_checksum(std::ostream &os, uint8_t *md, size_t len);

int main(int argc, char *argv[]) {

MPI_Init(&argc, &argv);

int rank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

if (rank == 0) {
if (argc < 3) {
std::cout << "Usage: " << argv[0] << " <input_file> <output_file>"
<< std::endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
fs::path input_path = argv[1];
fs::path output_path = argv[2];

auto total_begin_time = std::chrono::high_resolution_clock::now();

auto file_size = fs::file_size(input_path);
std::cout << input_path << " size: " << file_size << std::endl;
int num_block = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
int proc_num_block = num_block / 8;
auto len = file_size / 8;
uint8_t *buffer = nullptr;
if (file_size != 0) {
buffer = new uint8_t[len];
std::ifstream istrm(input_path, std::ios::binary);
istrm.seekg(rank*len);
istrm.read(reinterpret_cast<char *>(buffer), len);
}

auto begin_time = std::chrono::high_resolution_clock::now();

EVP_MD *sha512 = EVP_MD_fetch(nullptr, "SHA512", nullptr);
EVP_MD_CTX *ctx[proc_num_block];
for(int i=0; i<proc_num_block; i++)
{
ctx[i]=EVP_MD_CTX_new();
EVP_DigestInit_ex(ctx[i], sha512, nullptr);
}

uint8_t prev_md[SHA512_DIGEST_LENGTH];

for (int i = 0; i < proc_num_block; i++) {
uint8_t buffer_temp[BLOCK_SIZE]{};
std::memcpy(buffer_temp, buffer + i * BLOCK_SIZE, std::min(BLOCK_SIZE, len - i * BLOCK_SIZE));
EVP_DigestUpdate(ctx[i], buffer_temp, BLOCK_SIZE);
}
delete[] buffer;

if(rank==0){
SHA512(nullptr, 0, prev_md);
for(int i=0;i<proc_num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len_ex = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len_ex);
EVP_MD_CTX_free(ctx[i]);
}
MPI_Send(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, 1, 0, MPI_COMM_WORLD);
}else if(rank>0 && rank<7)
{
MPI_Recv(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(int i=0;i<proc_num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len_ex = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len_ex);
EVP_MD_CTX_free(ctx[i]);
}
MPI_Send(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, rank + 1, 0, MPI_COMM_WORLD);
}else if(rank==7){
MPI_Recv(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(int i=0;i<proc_num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len_ex = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len_ex);
EVP_MD_CTX_free(ctx[i]);
}
}

EVP_MD_free(sha512);

if(rank==7)
{
auto end_time = std::chrono::high_resolution_clock::now();

// print debug information
std::cout << "checksum: ";
print_checksum(std::cout, prev_md, SHA512_DIGEST_LENGTH);
std::cout << std::endl;

auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
end_time - begin_time);

std::cout << "checksum time cost: " << std::dec << duration.count() << "ms"
<< std::endl;

// write checksum to output file
std::ofstream output_file(output_path);

print_checksum(output_file, prev_md, SHA512_DIGEST_LENGTH);

auto total_end_time = std::chrono::high_resolution_clock::now();

auto total_duration = std::chrono::duration_cast<std::chrono::milliseconds>(
total_end_time - total_begin_time);

std::cout << "total time cost: " << total_duration.count() << "ms"
<< std::endl;

}

MPI_Finalize();
return 0;
}

void print_checksum(std::ostream &os, uint8_t *md, size_t len) {
for (int i = 0; i < len; i++) {
os << std::setw(2) << std::setfill('0') << std::hex
<< static_cast<int>(md[i]);
}
}

效果不算特别好,90/120分,检查了一下,发现读取占用时间太长,

偶然想到可以不用一次性读入全部数据,让IO和CPU形成流水线即可

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
#include <algorithm>
#include <chrono>
#include <cstring>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <mpi.h>
#include <openssl/evp.h>
#include <openssl/sha.h>

namespace fs = std::filesystem;

constexpr size_t BLOCK_SIZE = 1024 * 1024;

void print_checksum(std::ostream &os, uint8_t *md, size_t len);

int main(int argc, char *argv[]) {

MPI_Init(&argc, &argv);

int rank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

if (rank == 0) {
if (argc < 3) {
std::cout << "Usage: " << argv[0] << " <input_file> <output_file>"
<< std::endl;
MPI_Abort(MPI_COMM_WORLD, 1);
}
}
fs::path input_path = argv[1];
fs::path output_path = argv[2];

auto total_begin_time = std::chrono::high_resolution_clock::now();

auto file_size = fs::file_size(input_path);
std::cout << input_path << " size: " << file_size << std::endl;
int num_block = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
int proc_num_block = num_block / 8;
auto len = file_size / 8;

auto begin_time = std::chrono::high_resolution_clock::now();

EVP_MD *sha512 = EVP_MD_fetch(nullptr, "SHA512", nullptr);
EVP_MD_CTX *ctx[proc_num_block];
for(int i=0; i<proc_num_block; i++)
{
ctx[i]=EVP_MD_CTX_new();
EVP_DigestInit_ex(ctx[i], sha512, nullptr);
}

uint8_t prev_md[SHA512_DIGEST_LENGTH];
std::ifstream istrm(input_path, std::ios::binary);
uint8_t buffer_temp[BLOCK_SIZE];
for (int i = 0; i < proc_num_block; i++) {
istrm.seekg(rank*len + i*BLOCK_SIZE);
istrm.read(reinterpret_cast<char *>(buffer_temp),BLOCK_SIZE);
//IO和CPU流水线
EVP_DigestUpdate(ctx[i], buffer_temp, BLOCK_SIZE);
}

if(rank==0){
SHA512(nullptr, 0, prev_md);
for(int i=0;i<proc_num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len_ex = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len_ex);
EVP_MD_CTX_free(ctx[i]);
}
MPI_Send(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, 1, 0, MPI_COMM_WORLD);
}else if(rank>0 && rank<7)
{
MPI_Recv(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(int i=0;i<proc_num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len_ex = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len_ex);
EVP_MD_CTX_free(ctx[i]);
}
MPI_Send(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, rank + 1, 0, MPI_COMM_WORLD);
}else if(rank==7){
MPI_Recv(prev_md, SHA512_DIGEST_LENGTH, MPI_UINT8_T, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
for(int i=0;i<proc_num_block;i++)
{
EVP_DigestUpdate(ctx[i], prev_md, SHA512_DIGEST_LENGTH);
unsigned int len_ex = 0;
EVP_DigestFinal_ex(ctx[i], prev_md, &len_ex);
EVP_MD_CTX_free(ctx[i]);
}
}

EVP_MD_free(sha512);

if(rank==7)
{
auto end_time = std::chrono::high_resolution_clock::now();

// print debug information
std::cout << "checksum: ";
print_checksum(std::cout, prev_md, SHA512_DIGEST_LENGTH);
std::cout << std::endl;

auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(
end_time - begin_time);

std::cout << "checksum time cost: " << std::dec << duration.count() << "ms"
<< std::endl;

// write checksum to output file
std::ofstream output_file(output_path);

print_checksum(output_file, prev_md, SHA512_DIGEST_LENGTH);

auto total_end_time = std::chrono::high_resolution_clock::now();

auto total_duration = std::chrono::duration_cast<std::chrono::milliseconds>(
total_end_time - total_begin_time);

std::cout << "total time cost: " << total_duration.count() << "ms"
<< std::endl;

}
MPI_Finalize();
return 0;
}

void print_checksum(std::ostream &os, uint8_t *md, size_t len) {
for (int i = 0; i < len; i++) {
os << std::setw(2) << std::setfill('0') << std::hex
<< static_cast<int>(md[i]);
}
}

速度得到了非常显著的提升,get满分

H.矩阵乘法

矩阵乘法.mhtml

经典题目,同样的配方,用AVX512+Openmp+分块计算基本就能有不错的效果,不过我也只拿到了73/100分(菜就该多练),直接把printf去掉了都拿不到高分┭┮﹏┭┮

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
#include <iostream>
#include <chrono>
#include <immintrin.h>
#include <omp.h>
#include <cmath>
#define BLOCKSIZE 128
#define AVX_F_CAPACITY 8

void mul(double* a, double* b, double* c, uint64_t n1, uint64_t n2, uint64_t n3) {
#pragma omp parallel for
for (uint64_t i = 0; i < n1; i+=BLOCKSIZE) {
for (uint64_t j = 0; j < n2; j+=BLOCKSIZE) {
for (uint64_t k = 0; k < n3; k+=BLOCKSIZE) {

for(uint64_t ii=i; ii<i+BLOCKSIZE; ii+=AVX_F_CAPACITY)
{
for(uint64_t kk=k; kk<k+BLOCKSIZE; kk+=16)
{
__m512d vc0,vc1,vc2,vc3,vc4,vc5,vc6,vc7,vc8,vc9,vc10,vc11,vc12,vc13,vc14,vc15,vb,vb1;
vc0 = _mm512_load_pd(&c[ii*n3+kk]);
vc8 = _mm512_load_pd(&c[ii*n3+kk+8]);
vc1 = _mm512_load_pd(&c[(ii+1)*n3+kk]);
vc9 = _mm512_load_pd(&c[(ii+1)*n3+kk+8]);
vc2 = _mm512_load_pd(&c[(ii+2)*n3+kk]);
vc10 = _mm512_load_pd(&c[(ii+2)*n3+kk+8]);
vc3 = _mm512_load_pd(&c[(ii+3)*n3+kk]);
vc11 = _mm512_load_pd(&c[(ii+3)*n3+kk+8]);
vc4 = _mm512_load_pd(&c[(ii+4)*n3+kk]);
vc12 = _mm512_load_pd(&c[(ii+4)*n3+kk+8]);
vc5 = _mm512_load_pd(&c[(ii+5)*n3+kk]);
vc13 = _mm512_load_pd(&c[(ii+5)*n3+kk+8]);
vc6 = _mm512_load_pd(&c[(ii+6)*n3+kk]);
vc14 = _mm512_load_pd(&c[(ii+6)*n3+kk+8]);
vc7 = _mm512_load_pd(&c[(ii+7)*n3+kk]);
vc15 = _mm512_load_pd(&c[(ii+7)*n3+kk+8]);

for(uint64_t jj=j; jj<j+BLOCKSIZE; jj+=AVX_F_CAPACITY)
{
vb=_mm512_load_pd(&b[jj*n3 + kk]);
vb1=_mm512_load_pd(&b[jj*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj]),vb1,vc15);


vb=_mm512_load_pd(&b[(jj+1)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+1)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+1]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+1]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+1]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+1]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+1]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+1]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+1]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+1]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+1]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+1]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+1]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+1]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+1]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+1]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+1]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+1]),vb1,vc15);

vb=_mm512_load_pd(&b[(jj+2)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+2)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+2]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+2]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+2]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+2]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+2]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+2]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+2]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+2]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+2]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+2]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+2]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+2]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+2]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+2]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+2]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+2]),vb1,vc15);

vb=_mm512_load_pd(&b[(jj+3)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+3)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+3]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+3]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+3]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+3]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+3]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+3]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+3]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+3]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+3]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+3]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+3]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+3]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+3]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+3]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+3]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+3]),vb1,vc15);

vb=_mm512_load_pd(&b[(jj+4)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+4)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+4]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+4]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+4]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+4]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+4]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+4]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+4]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+4]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+4]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+4]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+4]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+4]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+4]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+4]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+4]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+4]),vb1,vc15);

vb=_mm512_load_pd(&b[(jj+5)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+5)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+5]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+5]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+5]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+5]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+5]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+5]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+5]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+5]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+5]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+5]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+5]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+5]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+5]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+5]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+5]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+5]),vb1,vc15);


vb=_mm512_load_pd(&b[(jj+6)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+6)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+6]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+6]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+6]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+6]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+6]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+6]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+6]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+6]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+6]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+6]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+6]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+6]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+6]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+6]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+6]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+6]),vb1,vc15);

vb=_mm512_load_pd(&b[(jj+7)*n3 + kk]);
vb1=_mm512_load_pd(&b[(jj+7)*n3 + kk+8]);
vc0 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+7]),vb,vc0);
vc8 = _mm512_fmadd_pd(_mm512_set1_pd(a[ii*n2+jj+7]),vb1,vc8);
vc1 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+7]),vb,vc1);
vc9 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+1)*n2+jj+7]),vb1,vc9);
vc2 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+7]),vb,vc2);
vc10 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+2)*n2+jj+7]),vb1,vc10);
vc3 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+7]),vb,vc3);
vc11 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+3)*n2+jj+7]),vb1,vc11);
vc4 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+7]),vb,vc4);
vc12 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+4)*n2+jj+7]),vb1,vc12);
vc5 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+7]),vb,vc5);
vc13 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+5)*n2+jj+7]),vb1,vc13);
vc6 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+7]),vb,vc6);
vc14 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+6)*n2+jj+7]),vb1,vc14);
vc7 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+7]),vb,vc7);
vc15 = _mm512_fmadd_pd(_mm512_set1_pd(a[(ii+7)*n2+jj+7]),vb1,vc15);
}
_mm512_store_pd(&c[ii*n3 + kk],vc0);
_mm512_store_pd(&c[ii*n3 + kk+8],vc8);
_mm512_store_pd(&c[(ii+1)*n3 + kk],vc1);
_mm512_store_pd(&c[(ii+1)*n3 + kk+8],vc9);
_mm512_store_pd(&c[(ii+2)*n3 + kk],vc2);
_mm512_store_pd(&c[(ii+2)*n3 + kk+8],vc10);
_mm512_store_pd(&c[(ii+3)*n3 + kk],vc3);
_mm512_store_pd(&c[(ii+3)*n3 + kk+8],vc11);
_mm512_store_pd(&c[(ii+4)*n3 + kk],vc4);
_mm512_store_pd(&c[(ii+4)*n3 + kk+8],vc12);
_mm512_store_pd(&c[(ii+5)*n3 + kk],vc5);
_mm512_store_pd(&c[(ii+5)*n3 + kk+8],vc13);
_mm512_store_pd(&c[(ii+6)*n3 + kk],vc6);
_mm512_store_pd(&c[(ii+6)*n3 + kk+8],vc14);
_mm512_store_pd(&c[(ii+7)*n3 + kk],vc7);
_mm512_store_pd(&c[(ii+7)*n3 + kk+8],vc15);
}
}

}
}
}
}
int main() {
uint64_t n1, n2, n3;
FILE* fi;

fi = fopen("conf.data", "rb");
fread(&n1, 1, 8, fi);
fread(&n2, 1, 8, fi);
fread(&n3, 1, 8, fi);

double* a = (double*)_mm_malloc(n1 * n2 * 8,64);
double* b = (double*)_mm_malloc(n2 * n3 * 8,64);
double* c = (double*)_mm_malloc(n1 * n3 * 8,64);

fread(a, 1, n1 * n2 * 8, fi);
fread(b, 1, n2 * n3 * 8, fi);
fclose(fi);

for (uint64_t i = 0; i < n1; i++) {
for (uint64_t k = 0; k < n3; k++) {
c[i * n3 + k] = 0;
}
}


auto t1 = std::chrono::steady_clock::now();
mul(a, b, c, n1, n2, n3);
auto t2 = std::chrono::steady_clock::now();
int d1 = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
printf("%d\n", d1);
// printf("c:%.4lf\n",c[202]);

fi = fopen("out.data", "wb");
fwrite(c, 1, n1 * n3 * 8, fi);
fclose(fi);

return 0;
}

I.logistic 方程

Logistic方程.mhtml

这题卡了一下午,分数一直是20多分,实在看不下去

一开始只是对it、itv函数进行简单的AVX和openmp处理,出来的结果实在不让人满意,后来经过对函数结构的观察,发现两个循环可以换位置,于是交换了一下,再在内部开openmp,确实还是差点意思,想到这道题反复强调向量化,就尝试将迭代次数减少一部分,放进内部循环,此时发现效果还可以。代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
#define N 8
double count_itv(double r, double x) {
return r * x * (1.0 - x);
}

void itv(double r, double* x, int64_t n, int64_t itn) {
#pragma omp parallel
{
for (int64_t j = 0; j < itn/N; j++) {
#pragma omp for
for (int64_t i = 0; i < n; i++) {
for (int64_t k = 0; k <N; k++) {
x[i] = count_itv(r, x[i]);
}
}
}
}
}

数字16是经过反复尝试,调整出来的最好结果,因为itn=32768,所以必须选择可以除尽的数

image-20240124010147362

还差4分,难受,改了一下手动向量化,代码如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
__m512d avx_one = _mm512_set1_pd(1.0);
__m512d avx_r = _mm512_set1_pd(r);
#pragma omp parallel
{
for (short i=0; i<itn/N; ++i) {
#pragma omp for
for (int64_t j=0; j<n; j+=8) {
__m512d avx_x = _mm512_load_pd(&x[j]);
for (short k=0; k<N; ++k) {
__m512d avx_tmp = _mm512_mul_pd(avx_r, avx_x);
avx_x = _mm512_mul_pd(avx_tmp, _mm512_sub_pd(avx_one, avx_x));
}
_mm512_store_pd(&x[j], avx_x);
}
}
}
image-20240124013348557

额(⊙o⊙)…毁灭吧,我累了

后期突然注意到,openmp开错了,这道题还是openmp+向量化,但是我神奇般的过了...卡了个bug

来自一位学长的修正:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
#include <iostream>
#include <chrono>
#include <omp.h>
#include <immintrin.h>

// Define macro for aligning memory to 64 bytes for AVX-512
#define ALIGNMENT 64

double it(double r, double x, int64_t itn) {
for (int64_t i = 0; i < itn; i++) {
x = r * x * (1.0 - x);
}
return x;
}

void itv(double r, double* x, int64_t n, int64_t itn) {
__builtin_assume_aligned(x, 64);
__m512d r_vec=_mm512_set1_pd(r);
__m512d vec_1=_mm512_set1_pd(1.0);
#pragma omp parallel for collapse(1)
#pragma vector aligned
for (int64_t i = 0; i < n; i += 64) {
__m512d x_vec1 = _mm512_loadu_pd(&x[i]);
__m512d x_vec2 = _mm512_loadu_pd(&x[i+8]);
__m512d x_vec3 = _mm512_loadu_pd(&x[i+16]);
__m512d x_vec4 = _mm512_loadu_pd(&x[i+24]);
__m512d x_vec5 = _mm512_loadu_pd(&x[i+32]);
__m512d x_vec6 = _mm512_loadu_pd(&x[i+40]);
__m512d x_vec7 = _mm512_loadu_pd(&x[i+48]);
__m512d x_vec8 = _mm512_loadu_pd(&x[i+56]);
for (int64_t j = 0; j < itn; j++) {
x_vec1 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec1),
_mm512_sub_pd(vec_1, x_vec1));
x_vec2 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec2),
_mm512_sub_pd(vec_1, x_vec2));
x_vec3 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec3),
_mm512_sub_pd(vec_1, x_vec3));
x_vec4 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec4),
_mm512_sub_pd(vec_1, x_vec4));
x_vec5 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec5),
_mm512_sub_pd(vec_1, x_vec5));
x_vec6 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec6),
_mm512_sub_pd(vec_1, x_vec6));
x_vec7 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec7),
_mm512_sub_pd(vec_1, x_vec7));
x_vec8 = _mm512_mul_pd(_mm512_mul_pd(r_vec, x_vec8),
_mm512_sub_pd(vec_1, x_vec8));
}
_mm512_storeu_pd(&x[i], x_vec1);
_mm512_storeu_pd(&x[i+8], x_vec2);
_mm512_storeu_pd(&x[i+16], x_vec3);
_mm512_storeu_pd(&x[i+24], x_vec4);
_mm512_storeu_pd(&x[i+32], x_vec5);
_mm512_storeu_pd(&x[i+40], x_vec6);
_mm512_storeu_pd(&x[i+48], x_vec7);
_mm512_storeu_pd(&x[i+56], x_vec8);
}
}

int main(){
FILE* fi;
fi = fopen("conf.data", "rb");

int64_t itn;
double r;
int64_t n;
double* x;

fread(&itn, 1, 8, fi);
fread(&r, 1, 8, fi);
fread(&n, 1, 8, fi);

// Allocate aligned memory for better AVX-512 performance
x = (double*)_mm_malloc(n * 8, ALIGNMENT);

fread(x, 1, n * 8, fi);
fclose(fi);

auto t1 = std::chrono::steady_clock::now();

// Use OpenMP for parallelization and AVX-512 for vectorization
itv(r, x, n, itn);

auto t2 = std::chrono::steady_clock::now();
int d1 = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
printf("%d\n", d1);

fi = fopen("out.data", "wb");
fwrite(x, 1, n * 8, fi);
fclose(fi);

// Free aligned memory
_mm_free(x);

return 0;
}

J.H-66

H-66.mhtml

先用了gprof进行了热点分析,结果如下:

image-20240124102831082

可以看到,大部分时间都是花费在了getsp函数上,act函数也是个小热点,

先拿getsp开刀,

1月29日:“遗憾,玩不动这道题”

L.洪水困兽

洪水困兽.mhtml

签到送分题,OpenMP直接过

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
#include <array>
#include <fstream>
#include <iostream>
#include <omp.h>
#include <vector>
#include <cmath>
#include <tuple>

using std::vector, std::array, std::tuple, std::string;

void particle2grid(int resolution, int numparticle,
const vector<double> &particle_position,
const vector<double> &particle_velocity,
vector<double> &velocityu, vector<double> &velocityv,
vector<double> &weightu, vector<double> &weightv) {
double grid_spacing = 1.0 / resolution;
double inv_grid_spacing = 1.0 / grid_spacing;
auto get_frac = [&inv_grid_spacing](double x, double y) {
int xidx = floor(x * inv_grid_spacing);
int yidx = floor(y * inv_grid_spacing);
double fracx = x * inv_grid_spacing - xidx;
double fracy = y * inv_grid_spacing - yidx;
return tuple(array<int, 2>{xidx, yidx},
array<double, 4>{fracx * fracy, (1 - fracx) * fracy,
fracx * (1 - fracy),
(1 - fracx) * (1 - fracy)});
};

#pragma omp parallel for
for (int i = 0; i < numparticle; i++) {
array<int, 4> offsetx = {0, 1, 0, 1};
array<int, 4> offsety = {0, 0, 1, 1};

auto [idxu, fracu] =
get_frac(particle_position[i * 2 + 0],
particle_position[i * 2 + 1] - 0.5 * grid_spacing);
auto [idxv, fracv] =
get_frac(particle_position[i * 2 + 0] - 0.5 * grid_spacing,
particle_position[i * 2 + 1]);

for (int j = 0; j < 4; j++) {
int tmpidx_u = (idxu[0] + offsetx[j]) * resolution + (idxu[1] + offsety[j]);
int tmpidx_v = (idxv[0] + offsetx[j]) * (resolution + 1) + (idxv[1] + offsety[j]);
//关键点就在这里,知道有这个操作就能签到成功,但看到有人手写锁好像也成功了
#pragma omp atomic
velocityu[tmpidx_u] += particle_velocity[i * 2 + 0] * fracu[j];

#pragma omp atomic
weightu[tmpidx_u] += fracu[j];

#pragma omp atomic
velocityv[tmpidx_v] += particle_velocity[i * 2 + 1] * fracv[j];

#pragma omp atomic
weightv[tmpidx_v] += fracv[j];
}
}
}

int main(int argc, char *argv[]) {
if (argc < 2) {
printf("Usage: %s inputfile\n", argv[0]);
return -1;
}

string inputfile(argv[1]);
std::ifstream fin(inputfile, std::ios::binary);
if (!fin) {
printf("Error opening file");
return -1;
}

int resolution;
int numparticle;
vector<double> particle_position;
vector<double> particle_velocity;

fin.read((char *)(&resolution), sizeof(int));
fin.read((char *)(&numparticle), sizeof(int));

particle_position.resize(numparticle * 2);
particle_velocity.resize(numparticle * 2);

printf("resolution: %d\n", resolution);
printf("numparticle: %d\n", numparticle);

fin.read((char *)(particle_position.data()),
sizeof(double) * particle_position.size());
fin.read((char *)(particle_velocity.data()),
sizeof(double) * particle_velocity.size());

vector<double> velocityu((resolution + 1) * resolution, 0.0);
vector<double> velocityv((resolution + 1) * resolution, 0.0);
vector<double> weightu((resolution + 1) * resolution, 0.0);
vector<double> weightv((resolution + 1) * resolution, 0.0);


string outputfile;

particle2grid(resolution, numparticle, particle_position,
particle_velocity, velocityu, velocityv, weightu,
weightv);
outputfile = "output.dat";

std::ofstream fout(outputfile, std::ios::binary);
if (!fout) {
printf("Error output file");
return -1;
}
fout.write((char *)(&resolution), sizeof(int));
fout.write(reinterpret_cast<char *>(velocityu.data()),
sizeof(double) * velocityu.size());
fout.write(reinterpret_cast<char *>(velocityv.data()),
sizeof(double) * velocityv.size());
fout.write(reinterpret_cast<char *>(weightu.data()),
sizeof(double) * weightu.size());
fout.write(reinterpret_cast<char *>(weightv.data()),
sizeof(double) * weightv.size());

return 0;
}

其余题:

光之游戏.mhtml

3D生命游戏.mhtml

RISC-V的题没有放