In the paper, this is the content of Section 3.2.2:

3.2.2. Efficient Implementation of Cross-Node All-to-All Communication

In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink. NVLink offers a bandwidth of 160 GB/s, roughly 3.2 times that of IB (50 GB/s). To effectively leverage the different bandwidths of IB and NVLink, we limit each token to be dispatched to at most 4 nodes, thereby reducing IB traffic. For each token, when its routing decision is made, it will first be transmitted via IB to the GPUs with the same in-node index on its target nodes. Once it reaches the target nodes, we will endeavor to ensure that it is instantaneously forwarded via NVLink to specific GPUs that host their target experts, without being blocked by subsequently arriving tokens. In this way, communications via IB and NVLink are fully overlapped, and each token can efficiently select an average of 3.2 experts per node without incurring additional overhead from NVLink. This implies that, although DeepSeek-V3 selects only 8 routed experts in practice, it can scale up this number to a maximum of 13 experts (4 nodes × 3.2 experts/node) while preserving the same communication cost. Overall, under such a communication strategy, only 20 SMs are sufficient to fully utilize the bandwidths of IB and NVLink.
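
To make the two-hop route and the "13 experts" figure concrete, here is a minimal Python sketch. It is only an illustration of the paragraph above, not the paper's kernel code: the names `Gpu`, `two_hop_path`, `experts_per_node_at_equal_cost` and `max_experts_at_equal_cost` are hypothetical, and the only numbers used are the ones quoted in the text (160 GB/s, 50 GB/s, 4 nodes).

```python
from __future__ import annotations

from dataclasses import dataclass

NVLINK_GBPS = 160.0      # intra-node bandwidth quoted in the text (GB/s)
IB_GBPS = 50.0           # cross-node bandwidth quoted in the text (GB/s)
MAX_NODES_PER_TOKEN = 4  # dispatch limit quoted in the text


@dataclass(frozen=True)
class Gpu:
    """Hypothetical GPU address: which node it is on and its in-node index."""
    node: int
    local: int


def two_hop_path(src: Gpu, dst: Gpu) -> list[tuple[str, Gpu]]:
    """List the hops a token takes from src to the GPU hosting its expert."""
    hops: list[tuple[str, Gpu]] = []
    current = src
    if dst.node != src.node:
        # Hop 1: IB to the GPU with the same in-node index on the target node.
        current = Gpu(dst.node, src.local)
        hops.append(("IB", current))
    if current.local != dst.local:
        # Hop 2: NVLink to the GPU that actually hosts the target expert.
        current = Gpu(dst.node, dst.local)
        hops.append(("NVLink", current))
    return hops


def experts_per_node_at_equal_cost() -> float:
    """NVLink forwards that fit in the time of one IB transfer (the 3.2 ratio)."""
    return NVLINK_GBPS / IB_GBPS


def max_experts_at_equal_cost() -> float:
    """4 nodes x 3.2 experts/node = 12.8, which the paper rounds to 13."""
    return MAX_NODES_PER_TOKEN * experts_per_node_at_equal_cost()


if __name__ == "__main__":
    print(two_hop_path(Gpu(node=0, local=3), Gpu(node=2, local=5)))
    print(experts_per_node_at_equal_cost(), max_experts_at_equal_cost())
```

The point of the sketch is that a token crosses IB at most once per target node; any further fan-out to experts on that node rides on the faster NVLink, which is why the node limit, rather than the expert count, bounds the IB traffic.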

In detail, we employ the warp specialization technique (Bauer et al., 2014) and partition 20 SMs into 10 communication channels. During the dispatching process, (1) IB sending, (2) IB-to-NVLink forwarding, and (3) NVLink receiving are handled by respective warps. The number of warps allocated to each communication task is dynamically adjusted according to the actual workload across all SMs. Similarly, during the combining process, (1) NVLink sending, (2) NVLink-to-IB forwarding and accumulation, and (3) IB receiving and accumulation are also handled by dynamically adjusted warps. In addition, both dispatching and combining kernels overlap with the computation stream, so we also consider their impact on other SM computation kernels. Specifically, we employ customized PTX (Parallel Thread Execution) instructions and auto-tune the communication chunk size, which significantly reduces the use of the L2 cache and the interference to other SMs.
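
The paper does not say how the warp counts are adjusted, so the following Python sketch is only one plausible reading: split a fixed warp budget across the three dispatch tasks in proportion to their pending work, with at least one warp each. The function `allocate_warps`, the task names, the workload numbers and the budget of 13 warps are all assumptions made for illustration; only the 20 SMs and 10 channels come from the text.

```python
from __future__ import annotations

NUM_SMS = 20       # SMs dedicated to communication (from the text)
NUM_CHANNELS = 10  # communication channels (from the text)
SMS_PER_CHANNEL = NUM_SMS // NUM_CHANNELS  # 2 SMs per channel


def allocate_warps(workload: dict[str, int], total_warps: int) -> dict[str, int]:
    """Hypothetical policy: share warps proportionally to workload, min 1 each."""
    tasks = list(workload)
    if total_warps < len(tasks):
        raise ValueError("need at least one warp per task")
    # Reserve one warp per task so that no stage of the pipeline stalls.
    alloc = {t: 1 for t in tasks}
    spare = total_warps - len(tasks)
    total = sum(workload.values())
    if total == 0:
        shares = {t: spare / len(tasks) for t in tasks}
    else:
        shares = {t: spare * workload[t] / total for t in tasks}
    for t in tasks:
        alloc[t] += int(shares[t])
    # Give leftover warps to the tasks with the largest fractional share.
    leftover = total_warps - sum(alloc.values())
    by_remainder = sorted(tasks, key=lambda t: shares[t] - int(shares[t]), reverse=True)
    for t in by_remainder[:leftover]:
        alloc[t] += 1
    return alloc


if __name__ == "__main__":
    # Made-up workload for the three dispatch tasks named in the text.
    pending = {"ib_send": 6, "ib_to_nvlink_forward": 3, "nvlink_receive": 1}
    print(SMS_PER_CHANNEL, allocate_warps(pending, total_warps=13))
```

In a real kernel each warp would branch on its role at launch, which is what warp specialization means; the sketch only models the bookkeeping of how many warps take each role.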
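
Put simply, the goal is efficient cross-node all-to-all communication. The underlying problem is essentially the same as the two-machine direct-copy scenario described in Tang Jiashan's (唐家山) blog post. Generally, GPUs inside one machine talk over NVLink, while GPUs on different machines rely on the IB network. Since NVLink is about 3.2 times as fast as IB, some optimization is needed to arrive at a better transfer strategy. What the paper describes is a complete package of such measures.

As I understand it, PTX is used here to customize thread execution more precisely, which reduces the interference between the allocation and the transfer of communication chunks.

The aim is not to bypass CUDA; on the contrary, it is to make CUDA run more efficiently.

By analogy: suppose you found that when the NIC driver copies certain memory blocks, it serializes with the application's threads and efficiency drops. You would then skip the interface the operating system defines for the NIC driver and optimize directly with the instruction set the NIC supports.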