**Summary:** This article covers three parallel computing methods used for LLM inference. Data parallelism replicates the model on several devices and processes different batches in parallel to gain speed. Model parallelism splits the model across devices so that a model too large for one device's memory can still run. Pipeline parallelism schedules micro-batches so that the GPUs compute concurrently, which raises utilization. The article compares the three in terms of memory footprint, throughput, and performance, and argues that the strategy must be chosen according to model size and hardware limits, and that combining strategies unlocks further scaling.

This article analyzes how Data Parallelism, Model Parallelism, and Pipeline Parallelism are implemented in an inference engine and discusses how each affects memory footprint, throughput, and the overall performance trade-off.

## Data Parallel

In the data parallel paradigm, the dataset is partitioned across several compute devices while every device keeps a complete copy of the model. The method is especially efficient when the model fits comfortably in the memory of a single device. Because different batches are processed in parallel, inference speed can in theory scale by a factor equal to the number of available devices. Once the model is too large to fit on a single GPU, however, this strategy is no longer feasible.

*Figure 1: Data parallel diagram (credits: eraser.io)*

With several devices, the dataset has to be split evenly. With 100 samples and 2 GPUs, for example, the ideal split is 50/50. To keep the split random and avoid bias, the indices are usually shuffled first and then divided equally among the GPUs. The implementation follows.

```python
# Dataset file: dataset.py
from random import Random

from torch.utils.data import DataLoader


class Partition:
    """Standard PyTorch-style dataset (defines __len__ and __getitem__)."""

    def __init__(self, data, index):
        self.data = data    # entire dataset (list)
        self.index = index  # indices (list): different for each device

    def __len__(self):
        # A partition is one chunk of the data, not the entire dataset
        return len(self.index)

    def __getitem__(self, index):
        data_idx = self.index[index]
        return self.data[data_idx]


class DataPartitioner:
    def __init__(self, data, sizes=(0.5, 0.5), seed=1234):
        self.data = data
        self.partitions = []
        rng = Random()
        rng.seed(seed)
        # Get the length of the entire dataset
        data_len = len(data)
        indices = list(range(data_len))
        # Shuffle the indices
        rng.shuffle(indices)
        # Partition the indices for the devices
        start_idx = 0
        for size in sizes:
            part_len = int(size * data_len)
            self.partitions.append(indices[start_idx:start_idx + part_len])
            start_idx += part_len

    def use(self, partition):
        # Dataset plus the list of indices for the requested partition
        return Partition(self.data, self.partitions[partition])


def partition_dataset(rank, world_size, dataset, batch_size=128, collate_fn=None):
    partitioned_batch_size = batch_size // world_size
    # world_size: the number of devices
    sizes = [1 / world_size for _ in range(world_size)]
    partitioner = DataPartitioner(dataset, sizes=sizes)
    partition = partitioner.use(rank)
    # Wrap this device's partition in a dataloader
    dataloader = DataLoader(
        partition, batch_size=partitioned_batch_size, collate_fn=collate_fn
    )
    return dataloader
```

The logic is straightforward. If `world_size` and `rank` are unfamiliar, read them as follows:

- `world_size` is the total number of devices taking part in training, for example the number of GPUs.
- `rank` identifies one specific device among them.

With 4 GPUs, `world_size` is 4 and `rank` takes the values 0 to 3, one per GPU.

Large-scale training needs two core components: multi-process concurrency (`torch.multiprocessing`) and efficient communication between devices (`torch.distributed`). A question to think about before reading on: why is communication needed at all?

The distributed data loader is ready, so the training loop comes next.

```python
# imports
import argparse
import os
import time
from functools import partial

import datasets  # Hugging Face datasets
import numpy as np
import torch
import torch.distributed as dist
import tqdm
from torch.multiprocessing import Process
from torch.utils.data import DataLoader
from transformers import AutoConfig, GPT2LMHeadModel

from dataset import partition_dataset  # the dataloader that we wrote
# You can write your own tokenizer, train and generate functions
from utils import get_tokenizer, collate_batch


def average_gradients(model):
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Communication overhead across GPUs
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size


def train(model, optimizer, examples, batch_size, collate_fn, desc,
          rank=0, average_gradients_fn=None):
    model.train()
    tokens_per_sec = []
    tokens_num = []
    for i, batch in enumerate(prog_bar := tqdm.tqdm(examples, desc=f"Training ({desc})")):
        t0 = time.time()
        optimizer.zero_grad()
        logits = model(input_ids=batch["input_ids"]).logits
        loss = torch.nn.functional.cross_entropy(
            input=logits.reshape((-1, logits.shape[-1])),
            target=batch["labels"].reshape(-1),
            reduction="none")
        loss = (torch.sum(loss * batch["label_token_weights"].reshape(-1))
                / torch.sum(batch["label_token_weights"]))
        loss.backward()  # calculates the gradients
        if average_gradients_fn is not None:
            average_gradients_fn(model)
        optimizer.step()  # updates the weights
        batch_time = time.time() - t0
        tokens = np.prod(batch["input_ids"].shape)
        tokens_per_sec.append(tokens / batch_time)
        tokens_num.append(tokens)
        prog_bar.set_postfix(tokens_per_sec=tokens / batch_time, loss=loss.item())
    return np.mean(tokens_per_sec), tokens_num


def setup(rank, world_size, backend):
    # Sets up the communication between multiple devices (GPUs)
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "33333"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)


# Each process runs this function concurrently; rank is the id of the process
def run_dp(rank, world_size, backend, dataset_name, model_max_length,
           n_epochs, batch_size, learning_rate):
    setup(rank, world_size, backend)
    config = AutoConfig.from_pretrained("gpt2")
    # Every process loads a full model copy onto its own device, hence `rank`
    model = GPT2LMHeadModel(config=config).to(rank)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

    # We use a German (Deutsch) to English translation dataset
    dataset = {
        split: datasets.load_dataset(dataset_name, split=split)["translation"]
        for split in ["train", "validation", "test"]
    }
    src_key, tgt_key = "de", "en"
    dataset["train"] = dataset["train"][:5000]
    dataset["validation"] = dataset["validation"][:1000]
    dataset["test"] = dataset["test"][:100]

    # tokenization
    tokenizer = get_tokenizer(examples=dataset["train"], vocab_size=config.vocab_size,
                              src_key=src_key, tgt_key=tgt_key)
    # collate function: partial pre-fills some of the arguments
    collate_fn = partial(collate_batch, src_key=src_key, tgt_key=tgt_key,
                         tokenizer=tokenizer, model_max_length=model_max_length,
                         device=rank)
    train_loader = partition_dataset(rank, world_size, dataset["train"],
                                     batch_size=batch_size, collate_fn=collate_fn)
    val_loader = DataLoader(dataset["validation"], batch_size=batch_size,
                            shuffle=False, collate_fn=collate_fn)
    test_loader = DataLoader(dataset["test"], batch_size=batch_size,
                             shuffle=False, collate_fn=collate_fn)

    total_time = []
    total_tokens_per_sec = []
    for epoch_idx in range(n_epochs):
        desc = f"rank {rank}, epoch {epoch_idx}"
        start = time.time()
        avg_tokens_per_sec, _ = train(
            model=model, optimizer=optimizer, examples=train_loader,
            batch_size=batch_size, collate_fn=collate_fn, desc=desc,
            rank=rank, average_gradients_fn=average_gradients)
        end = time.time()
        total_time.append(end - start)
        total_tokens_per_sec.append(avg_tokens_per_sec)


if __name__ == "__main__":
    import torch.multiprocessing as mp
    mp.set_start_method("spawn", force=True)

    parser = argparse.ArgumentParser()
    parser.add_argument("--pytest", type=bool, default=False)
    parser.add_argument("--dataset", type=str)
    parser.add_argument("--model_max_length", type=int, default=128)
    parser.add_argument("--n_epochs", type=int, default=10)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--learning_rate", type=float, default=1e-4)
    parser.add_argument("--world_size", type=int, default=2)
    args = parser.parse_args()

    backend = "nccl"  # for CPU choose "gloo"
    processes = []
    for rank in range(args.world_size):
        p = Process(
            target=run_dp,
            args=(rank, args.world_size, backend, args.dataset,
                  args.model_max_length, args.n_epochs, args.batch_size,
                  args.learning_rate)
        )
        p.start()
        processes.append(p)

    # Wait for all processes to finish
    for p in processes:
        p.join()
```

Before the model weights are updated, the gradients on all devices are averaged so that the replicas stay consistent. At every training step, each GPU therefore holds an identical copy of the model.

![Figure 2: All reduce method for averaging gradients](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*Zp-hjHOABglU6Fu9obNhxA.jpeg)
*Figure 2: All-reduce method for averaging gradients*

Now a quick experiment. I have two H100 (80 GB) GPUs at hand, which is overkill for a model as small as GPT-2. But as the saying goes, "if all you have is a bazooka, you swat mosquitoes with it." So this article measures the training throughput and training time obtained with that configuration.

![Figure 3: Throughput comparison during training for single and2 GPUs](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*5mrZxaeab_MF-CScRnM3wg.jpeg)
*Figure 3: Throughput comparison during training for one and two GPUs*

The results show that the throughput of the two-GPU setup (tokens processed during training) is almost 2x that of a single GPU, but the speedup comes at the cost of doubling the hardware.

![Figure 4: Training time comparison for single and2 GPUs](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*Mrn70ZsIt63vydyh1h-i4Q.jpeg)
*Figure 4: Training time comparison for one and two GPUs*

The average training time per epoch was expected to be about half that of the single-GPU setup, but repeated experiments showed a slightly different picture: most epochs were clearly faster, yet 1 or 2 epochs always fluctuated noticeably. Such spikes may come from several overheads, for example the data loader's latency in fetching the next batch, a Python garbage collection run, or synchronization delays from NCCL cross-device communication.

## Model Parallel

The core idea of model parallelism is to split the model itself across several devices. With a 12-layer model and 2 GPUs, for example, the first 6 layers can be placed on GPU 0 and the last 6 on GPU 1. Each device handles only part of the forward and backward pass, which makes it possible to train very large models that a single GPU cannot hold.

![Figure 5: Model parallel workflow diagram](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*LDTJolRGnYE9fcQ2d57EMg.png)
*Figure 5: Model parallel workflow diagram*

In traditional model parallelism, execution is **sequential**: at any moment only one GPU is actively computing. Even with 4 GPUs and the model split across them, only one GPU computes while the others sit idle. GPU utilization is therefore low, and the benefit diminishes as devices are added.

```python
import math

import torch


def get_device_map(n_layers, devices):
    """Returns a dictionary of layers distributed evenly across all devices."""
    layers = list(range(n_layers))
    n_blocks = int(math.ceil(n_layers / len(devices)))
    layers_list = [layers[i : i + n_blocks] for i in range(0, n_layers, n_blocks)]
    return dict(zip(devices, layers_list))


# Method of the GPT-2 model class (self.h is the list of transformer blocks)
def parallelize(self, device_map=None):
    """Distribute the model layers across the devices based on the device_map."""
    self.device_map = (
        get_device_map(len(self.h), range(torch.cuda.device_count()))
        if device_map is None else device_map
    )
    self.model_parallel = True
    self.first_device = (
        "cpu" if "cpu" in self.device_map.keys()
        else "cuda:" + str(min(self.device_map.keys()))
    )
    self.last_device = "cuda:" + str(max(self.device_map.keys()))
    # Embeddings go to the first device
    self.wte = self.wte.to(self.first_device)
    self.wpe = self.wpe.to(self.first_device)
    # Load the blocks onto their devices
    for k, v in self.device_map.items():
        for block in v:
            cuda_device = "cuda:" + str(k)
            self.h[block] = self.h[block].to(cuda_device)
    # ln_f goes to the last device
    self.ln_f = self.ln_f.to(self.last_device)
```

To keep the explanation simple, take GPT-2 as the example: the model has 12 layers, spread evenly over 2 GPUs.

*Table 1: How the layers are distributed across the GPUs*

- `wte`, `wpe` → `"cuda:0"`
- Iteration 1: `k=0, v=[0,1,2,3,4,5]`
  - `self.h[0].to("cuda:0")`
  - `self.h[1].to("cuda:0")`
  - `self.h[2].to("cuda:0")`
  - `self.h[3].to("cuda:0")`
  - `self.h[4].to("cuda:0")`
  - `self.h[5].to("cuda:0")`
- Iteration 2: `k=1, v=[6,7,8,9,10,11]`
  - `self.h[6].to("cuda:1")`
  - `self.h[7].to("cuda:1")`
  - `self.h[8].to("cuda:1")`
  - `self.h[9].to("cuda:1")`
  - `self.h[10].to("cuda:1")`
  - `self.h[11].to("cuda:1")`
- `ln_f`, `lm_head` → `"cuda:1"`

For further validation, we ran an experiment with the 7B-parameter LLaMA-2. Model parallelism was run with the same configuration on 2 and on 4 V100 (16 GB) GPUs to assess performance and scalability.

*Table 2: LLaMA-2-7B model parallel parameters and results*

The experiment confirms once more that only one GPU is active at any moment. Data flows through the GPUs one after another like water through a pipe: adding GPUs does not make the water flow faster, it only cuts the pipe into more segments. It does, however, make it possible to train very large models that could not be loaded on a single card.

## Pipeline Parallel

Pipeline parallelism distributes a deep learning model over several GPUs by dividing it into sequential stages and letting different GPUs work in parallel on different micro-batches of the input batch. Unlike traditional model parallelism, where only one GPU computes at a time, pipeline parallelism lets every GPU own one part of the model and work on a different micro-batch concurrently.

Concretely, suppose the model has 32 layers and 4 GPUs are available, so the model is cut into segments of 8 layers each. When a batch of input data arrives, it is first split further into micro-batches. The first micro-batch starts on GPU 0 (layers 0–7). As soon as it finishes, the intermediate activations are passed to GPU 1 (layers 8–15) while GPU 0 starts on the second micro-batch. In the same way, GPU 2 processes the first micro-batch as soon as it receives GPU 1's output, and GPU 3 continues from GPU 2's result, forming a continuous data flow.

After a short warm-up phase (the idle time it causes is known as the pipeline bubble), all GPUs stay active, each working on a different micro-batch at any given moment. Compared with model parallelism, this overlapped execution raises GPU utilization considerably. It resembles an assembly line: each stage (GPU) concentrates on one task, and once the pipeline is full, efficiency improves substantially.

*Figure 6: Model parallel vs. pipeline parallel*

The scheduler decides, for every clock cycle, which GPU processes which micro-batch.

```python
from typing import Iterable, List, Tuple


def _clock_cycles(num_batches: int, num_partitions: int) -> Iterable[List[Tuple[int, int]]]:
    """Key insight: batch i is at partition j when clock = i + j."""
    total_clocks = num_batches + num_partitions - 1  # fill + run + drain
    for clock in range(total_clocks):
        schedule = []
        for i in range(num_batches):
            j = clock - i  # which partition is batch i at?
            if 0 <= j < num_partitions:  # is it a valid partition?
                schedule.append((i, j))  # (batch_idx, partition_idx)
        yield schedule
```

Example with 3 batches and 3 partitions:

- Clock 0: `[(0,0)]` → GPU 0 processes batch 0
- Clock 1: `[(0,1), (1,0)]` → GPU 0 processes batch 1, GPU 1 processes batch 0
- Clock 2: `[(0,2), (1,1), (2,0)]` → all 3 GPUs are busy
- Clock 3: `[(1,2), (2,1)]` → GPU 1 processes batch 2, GPU 2 processes batch 1
- Clock 4: `[(2,2)]` → GPU 2 processes batch 2

```python
def forward(self, x):
    # Assumes self.partitions (the stages) and self.devices (one device per
    # stage) are set in __init__, which the excerpt does not show
    # 1. SPLIT: divide the batch into micro-batches
    batches = list(x.split(self.split_size, dim=0))
    # 2. SCHEDULE: generate the clock cycle schedule
    schedules = _clock_cycles(len(batches), len(self.partitions))
    # 3. EXECUTE: process each clock cycle
    for schedule in schedules:
        self.compute(batches, schedule)  # this runs the GPUs in parallel
    # 4. COMBINE: concatenate the results
    output = torch.cat(batches, dim=0)
    return output.to(self.devices[-1])
```

```python
def compute(self, batches, schedule):
    # Task and the in/out queues belong to the worker-thread module (not shown)
    # PHASE 1: submit ALL tasks in parallel (non-blocking)
    for batch_idx, partition_idx in schedule:
        batch = batches[batch_idx].to(self.devices[partition_idx])
        partition = self.partitions[partition_idx]

        # Default arguments bind the current batch and partition to the task
        def compute_fn(batch=batch, partition=partition):
            return partition(batch)  # run this stage's layers on its GPU

        task = Task(compute_fn)
        self.in_queues[partition_idx].put(task)  # send to the worker thread
    # PHASE 2: collect ALL results
    for batch_idx, partition_idx in schedule:
        success, result = self.out_queues[partition_idx].get()  # wait for the result
        batches[batch_idx] = result  # store the result for the next stage
```

Why not use processes, as data parallelism does? Pipeline parallelism needs frequent, tightly coupled coordination between GPUs in every clock cycle. Threads with shared-memory queues add very little overhead, whereas separate processes would introduce too much communication latency.

Data parallelism, model parallelism, and pipeline parallelism each offer a different way to scale deep learning workloads across several GPUs, and their trade-offs determine where each one fits.

*Table 3: Comparison of data parallelism, model parallelism, and pipeline parallelism*

In the end, no single strategy fits every case. If the model fits in memory, data parallelism is usually the most direct choice. If the model is very large, model parallelism or pipeline parallelism becomes necessary. Combining strategies, such as pipeline parallelism with data parallelism, unlocks further scaling. The key is to understand the model size, the hardware limits, and the trade-offs each method makes in memory, communication, and performance.
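To make the utilization gap between sequential model parallelism and pipeline parallelism concrete, the following minimal sketch counts busy GPU slots under the clock-cycle rule used by the scheduler (batch `i` is at partition `j` when `clock = i + j`). The helper names `clock_cycles` and `pipeline_utilization` are hypothetical and introduced only for this illustration. The estimate assumes that every stage takes the same time per micro-batch and ignores the backward pass and communication cost, so it is an upper bound rather than a measurement.

```python
def clock_cycles(num_batches, num_partitions):
    """Yield, for each clock, the (batch_idx, partition_idx) pairs that run together."""
    for clock in range(num_batches + num_partitions - 1):
        yield [
            (i, clock - i)
            for i in range(num_batches)
            if 0 <= clock - i < num_partitions
        ]


def pipeline_utilization(num_batches, num_partitions):
    """Fraction of (clock, GPU) slots that do useful work.

    Assumption: all stages take equal time; communication is free.
    """
    schedule = list(clock_cycles(num_batches, num_partitions))
    busy_slots = sum(len(step) for step in schedule)
    total_slots = len(schedule) * num_partitions
    return busy_slots / total_slots


if __name__ == "__main__":
    # One micro-batch on 4 GPUs behaves like sequential model parallelism
    print(pipeline_utilization(1, 4))
    # More micro-batches shrink the share of idle fill and drain slots
    print(pipeline_utilization(32, 4))
```

Under these assumptions the utilization equals `m / (m + n - 1)` for `m` micro-batches and `n` partitions. A single micro-batch gives `1 / n`, which matches the sequential model parallel case, and increasing the number of micro-batches pushes the value toward 1.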
