Flash Attention: Revolutionizing Transformer Efficiency

This implementation, while straightforward, suffers from the inefficiencies discussed earlier: the scores tensor, with shape (batch_size, seq_len, seq_len), can become prohibitively large for long sequences.
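To get a sense of scale, here is a rough back-of-the-envelope calculation (the batch size and sequence length are illustrative, not taken from the article):

# Size of the naive scores tensor of shape (batch_size, seq_len, seq_len)
# in fp32, for batch_size=8 and seq_len=16384
batch_size, seq_len, bytes_per_float = 8, 16384, 4
print(batch_size * seq_len ** 2 * bytes_per_float / 1e9)  # ~8.6 GB for the scores alone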
Enter Flash Attention
Flash Attention, introduced in a 2022 paper by Tri Dao and colleagues, is an approach to computing attention that dramatically reduces memory usage and improves computational efficiency. The key ideas behind Flash Attention are:
- Tiling: Split the large attention matrix into smaller tiles that fit in fast on-chip SRAM.
- Recomputation: Instead of storing the entire attention matrix, recompute parts of it as needed during the backward pass.
- IO-aware implementation: Organize the algorithm to minimize data movement between the different levels of the GPU memory hierarchy.
The Flash Attention Algorithm
Flash Attention essentially rethinks how the attention mechanism is computed. Instead of computing the entire attention matrix at once, it processes the computation in blocks, exploiting the memory hierarchy of modern GPUs.
At a high level, the algorithm works as follows:
- Inputs: the matrices Q, K, V in HBM (high-bandwidth memory), and on-chip SRAM of size M.
- Block sizes are chosen based on the available SRAM.
- The output matrix O and the auxiliary vectors l and m are initialized.
- The algorithm splits the input matrices into blocks that fit in SRAM.
- It processes these blocks in two nested loops:
  - The outer loop loads blocks of K and V.
  - The inner loop loads blocks of Q and performs the computation.
- The on-chip computation includes the matrix multiplications, the softmax, and the output update.
- After each block is processed, the results are written back to HBM.
This block-wise computation lets Flash Attention keep its memory footprint small while still computing exact attention.
The Math Behind Flash Attention
The key to making Flash Attention work is a mathematical trick that allows the softmax to be computed block by block. The paper relies on two important identities:
- Softmax decomposition:
softmax(x) = exp(x - m) / Σexp(x - m)
where m is the maximum value of x.
- Softmax merging:
softmax(x ∪ y) = [softmax(x) * l_x * e^(m_x - m), softmax(y) * l_y * e^(m_y - m)] / l
where m = max(m_x, m_y), l_x and l_y are the sums of exponentials for each block, and l = l_x * e^(m_x - m) + l_y * e^(m_y - m).
These identities let Flash Attention compute partial softmax results for each block and then combine them correctly into the final result, tracking only a running maximum m and a running sum of exponentials l per row.
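The identities are easy to verify numerically. The following small check (written for this discussion, not taken from the paper) reconstructs the softmax of a concatenated vector from the per-block statistics m and l:

import torch

x = torch.randn(6)
y = torch.randn(6)
full = torch.softmax(torch.cat([x, y]), dim=0)

# Per-block statistics: maximum m and sum of exponentials l
m_x, m_y = x.max(), y.max()
l_x = torch.exp(x - m_x).sum()
l_y = torch.exp(y - m_y).sum()

# Merge: rescale each block to the shared maximum m and renormalize
m = torch.maximum(m_x, m_y)
l = l_x * torch.exp(m_x - m) + l_y * torch.exp(m_y - m)
merged = torch.cat([torch.exp(x - m), torch.exp(y - m)]) / l

print(torch.allclose(full, merged))  # expected: True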
Implementation Details
Let's walk through a simplified implementation of Flash Attention to illustrate its core concepts:
import torch

def flash_attention(Q, K, V, block_size=256):
    batch_size, seq_len, d_model = Q.shape

    # Initialize output and running statistics (per query row):
    # L holds the running sum of exponentials, M the running maximum score
    O = torch.zeros_like(Q)
    L = torch.zeros((batch_size, seq_len, 1), device=Q.device, dtype=Q.dtype)
    M = torch.full((batch_size, seq_len, 1), float('-inf'), device=Q.device, dtype=Q.dtype)

    for i in range(0, seq_len, block_size):
        Q_block = Q[:, i:i+block_size, :]
        for j in range(0, seq_len, block_size):
            K_block = K[:, j:j+block_size, :]
            V_block = V[:, j:j+block_size, :]

            # Compute attention scores for this block
            S_block = torch.matmul(Q_block, K_block.transpose(-2, -1)) / (d_model ** 0.5)

            # Update the running maximum
            M_new = torch.maximum(M[:, i:i+block_size], S_block.max(dim=-1, keepdim=True).values)

            # Exponentials of the new block, and the rescaling factor for the old statistics
            exp_S = torch.exp(S_block - M_new)
            exp_M_diff = torch.exp(M[:, i:i+block_size] - M_new)

            # Update the running sum of exponentials
            L_new = exp_M_diff * L[:, i:i+block_size] + exp_S.sum(dim=-1, keepdim=True)

            # Rescale the previously accumulated output (undoing its old
            # normalization), add this block's contribution, and renormalize
            O[:, i:i+block_size] = (
                exp_M_diff * L[:, i:i+block_size] * O[:, i:i+block_size]
                + torch.matmul(exp_S, V_block)
            ) / L_new

            # Store the updated running statistics
            L[:, i:i+block_size] = L_new
            M[:, i:i+block_size] = M_new

    return O
This implementation is simplified but captures the essence of Flash Attention: it processes the inputs in blocks and maintains running statistics (M and L) so that the softmax is computed correctly across all blocks.
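A quick way to sanity-check the sketch above is to compare it against a direct attention computation (the shapes and tolerance below are illustrative, not from the original article):

torch.manual_seed(0)
Q = torch.randn(2, 512, 64)
K = torch.randn(2, 512, 64)
V = torch.randn(2, 512, 64)

# Reference: standard attention with the full scores matrix materialized
ref = torch.softmax(torch.matmul(Q, K.transpose(-2, -1)) / (64 ** 0.5), dim=-1) @ V

out = flash_attention(Q, K, V, block_size=128)
print(torch.allclose(ref, out, atol=1e-5))  # expected: True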
The Impact of Flash Attention
The introduction of Flash Attention has had a major impact on machine learning, particularly for large language models and long-context applications. Its main benefits are:
- Reduced memory usage: Flash Attention reduces the memory complexity from O(N^2) to O(N), where N is the sequence length. This makes it possible to process much longer sequences on the same hardware.
- Faster computation: By minimizing data movement and making better use of the GPU's compute capability, Flash Attention delivers significant speedups. The authors report training GPT-2 up to 3x faster than with the standard attention implementation.
- Exact computation: Unlike some other attention optimization techniques, Flash Attention computes exact attention, not an approximation.
- Scalability: The reduced memory footprint makes it possible to scale to much longer sequences, up to millions of tokens.
Real-World Impact
Flash Attention's impact extends well beyond academic research; it has been rapidly adopted by many popular machine learning libraries and models:
- Hugging Face Transformers: The popular Transformers library integrates Flash Attention, letting users take advantage of it with little effort.
- GPT-4 and beyond: While unconfirmed, it is speculated that advanced language models such as GPT-4 may use techniques similar to Flash Attention to handle long contexts.
- Long-context models: Flash Attention has enabled a new generation of models that can handle extremely long contexts, such as models that process entire books.
Flash Attention: Recent Advances
FlashAttention-2
Building on the success of the original Flash Attention, the same team introduced FlashAttention-2 in 2023. This updated version brings several improvements:
- Further optimization: FlashAttention-2 achieves even better GPU utilization, reaching up to about 70% of the theoretical peak FLOPS on A100 GPUs.
- Improved backward pass: The backward pass is optimized to run nearly as fast as the forward pass, significantly speeding up training.
- Support for more attention variants: FlashAttention-2 extends support to attention variants such as grouped-query attention and multi-query attention (sketched below).
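As a reminder of what the grouped-query layout looks like, here is a minimal sketch (illustrative shapes; it uses PyTorch's scaled_dot_product_attention rather than FlashAttention-2 itself) in which several query heads share each key/value head:

import torch
import torch.nn.functional as F

batch, seq, n_q_heads, n_kv_heads, head_dim = 2, 128, 8, 2, 64
q = torch.randn(batch, n_q_heads, seq, head_dim)
k = torch.randn(batch, n_kv_heads, seq, head_dim)
v = torch.randn(batch, n_kv_heads, seq, head_dim)

# Each group of n_q_heads // n_kv_heads query heads attends to one KV head,
# so K and V are simply repeated across the grouped query heads
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)

out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])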
FlashAttention-3
Released in 2024, FlashAttention-3 represents the latest advance in this line of research. It introduces several new techniques to further improve performance:
- Asynchronous computation: exploiting the asynchrony of newer GPU instructions to overlap different parts of the computation.
- FP8 support: using low-precision FP8 arithmetic for even faster processing.
- Incoherent processing: a technique that reduces quantization error when low-precision formats are used.
Below is a simplified sketch of how FlashAttention-3 might exploit asynchronous computation:
import torch

def flash_attention_3(Q, K, V, block_size=256):
    # Sketch only: PyTorch's autocast has no FP8 mode, so float16 stands in
    # for the low-precision path; real FP8 attention kernels are hand-written
    # and hardware-specific (e.g. Hopper tensor cores).
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        # ... (similar to previous implementation)

        # Asynchronous computation example: launch the score GEMM on a
        # side stream so it can overlap with work on the default stream
        side_stream = torch.cuda.Stream()
        with torch.cuda.stream(side_stream):
            S_block = torch.matmul(Q_block, K_block.transpose(-2, -1)) / (d_model ** 0.5)

        # Meanwhile, on the default stream:
        # prepare for the softmax computation

        # Synchronize streams before consuming S_block
        torch.cuda.synchronize()

        # Continue with softmax and output computation
        # ...
    return O
This snippet illustrates the idea of combining asynchronous execution with lower-precision arithmetic. Note that it is a highly simplified example; the actual FlashAttention-3 implementation is far more complex and hardware-specific.
Implementing Flash Attention in Your Projects
If you are interested in using Flash Attention in your own projects, you have several options:
- Use existing libraries: Many popular libraries, such as Hugging Face Transformers, now include Flash Attention implementations. Often, updating to the latest version and enabling the appropriate flag is all that is needed.
- Custom implementation: For finer-grained control or unusual use cases, you can implement Flash Attention yourself. The xformers library provides an excellent reference implementation.
- Hardware-specific optimizations: If you are targeting specific hardware (such as NVIDIA H100 GPUs), consider taking advantage of hardware-specific features to maximize performance.
Here is an example of enabling Flash Attention with the Hugging Face Transformers library:
import torch
from transformers import AutoModel

# Load the model with Flash Attention enabled. This requires the flash-attn
# package, a supported GPU, half-precision weights, and a model architecture
# for which Transformers provides a FlashAttention-2 integration.
model = AutoModel.from_pretrained(
    "bert-base-uncased",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
)

# Use the model as usual
# ...
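If you prefer the xformers route mentioned above, the call might look like the following sketch (it assumes a CUDA GPU, half-precision inputs, and the xformers package; memory_efficient_attention is xformers' fused exact-attention operator):

import torch
from xformers.ops import memory_efficient_attention

# Shapes are (batch, seq_len, num_heads, head_dim)
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)

out = memory_efficient_attention(q, k, v)  # exact attention, without materializing the full scores matrix
print(out.shape)  # torch.Size([2, 1024, 8, 64])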
Challenges and Future Directions
Flash Attention has made great strides in improving the efficiency of the attention mechanism, but challenges and open research directions remain:
- Hardware specificity: Current implementations are often tuned for particular GPU architectures. Generalizing these optimizations across diverse hardware remains a challenge.
- Integration with other techniques: Combining Flash Attention with other optimizations such as pruning, quantization, and model compression is an active area of research.
- Extension to other domains: Flash Attention has been highly successful in NLP, and work is ongoing to bring its benefits to other areas such as computer vision and multimodal models.
- Theoretical understanding: A deeper theoretical understanding of why Flash Attention works so well could enable even stronger optimizations.
Conclusion
Flash Attention shows that by cleverly exploiting the GPU memory hierarchy and applying a few mathematical tricks, both speed and memory usage can be improved dramatically without sacrificing accuracy.
As this article has shown, the impact of Flash Attention goes far beyond a single optimization technique: it has enabled the development of more powerful and efficient models.