Lmst

Andrej Karpathy (@karpathy)

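Reports that enabling FP8 training improved "time to GPT-2" by 4.3%, bringing it down to 2.91 hours, and that at 8×H100 spot-instance prices, reproducing GPT-2 costs roughly $20. Recalls the past controversy over GPT-2's release and emphasizes today's economics and performance gains.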

https://x.com/karpathy/status/2018804068874064198

#fp8 #training #gpt2 #h100 #optimization
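
A quick back-of-the-envelope check of the figures in the post above. This is only a sketch: the hourly rates are derived from the quoted $20 and 2.91 hours, not stated in the post, and it assumes "4.3% improvement" means wall-clock time fell by 4.3%.

```python
# Figures quoted in the post.
HOURS_TO_GPT2 = 2.91    # wall-clock hours with FP8 enabled
TOTAL_COST_USD = 20.0   # approximate total cost quoted
NUM_GPUS = 8            # 8×H100 node
IMPROVEMENT = 0.043     # 4.3% improvement attributed to FP8


def implied_rates(hours, total_cost, num_gpus):
    """Return (node $/hour, per-GPU $/hour) implied by a total cost and duration."""
    node_rate = total_cost / hours
    return node_rate, node_rate / num_gpus


def baseline_hours(hours, improvement):
    """Duration before the improvement.

    Assumption: the improvement is a relative reduction in wall-clock time.
    """
    return hours / (1.0 - improvement)


node_rate, gpu_rate = implied_rates(HOURS_TO_GPT2, TOTAL_COST_USD, NUM_GPUS)
before = baseline_hours(HOURS_TO_GPT2, IMPROVEMENT)

print(f"implied node price: ${node_rate:.2f}/hour")
print(f"implied GPU price:  ${gpu_rate:.2f}/GPU-hour")
print(f"implied pre-FP8 time: {before:.2f} hours")
```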

YES SUCCEEDED!!!

Just rendered an image at 944×1152 (slightly above 1024×1024) using Flux1-Schnell-FP8 on my 6700 XT, and it works! (Image 1 is the Real-ESRGAN 2× upscaled version)

Workflow 1: Sampling (Image 2)

Prompt executed → UNet generates the latent

Step 1 (model load + latent generation) took 419 seconds

Output: Latent tensor saved to disk

Workflow 2: VAE Decode (Image 3)

Latent loaded → VAE decodes the image

Duration: 7.5 seconds

Advantage: UNet doesn’t need to stay in VRAM → VRAM freed, even on 12 GB GPUs

The problem with the stock LoadLatent Node

Dropdown only shows files if they were produced / annotated by a previous SaveLatent Node

Node is designed to pass latents inside a graph, not load arbitrary files from disk

Purpose: prevents accidentally loading wrong files

Workaround (Image 4)

Edited /ComfyUI/nodes.py, class LoadLatent

Hardcoded latent path → Node now loads directly from disk

Result: Workflow 2 runs instantly, UNet can be unloaded
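
A less brittle alternative to a hardcoded path might be a small helper that looks up latent files in a directory of your choice. This is only a sketch, not the edit described above: `list_latent_files` and `newest_latent` are hypothetical names, the `.latent` suffix is an assumption about what SaveLatent writes, and wiring the result into the node is left out.

```python
import os


def list_latent_files(directory, suffix=".latent"):
    """Return sorted full paths of latent files found directly in `directory`.

    Assumption: saved latents carry the `.latent` suffix.
    """
    paths = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path) and name.endswith(suffix):
            paths.append(path)
    return paths


def newest_latent(directory, suffix=".latent"):
    """Return the most recently modified latent file in `directory`."""
    files = list_latent_files(directory, suffix)
    if not files:
        raise FileNotFoundError(f"no {suffix} files in {directory}")
    return max(files, key=os.path.getmtime)
```

With something like this, Workflow 2 could pick up whatever Workflow 1 wrote last instead of one fixed file.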

Timing

Step 1 (model load + latent generation): 419 s

Step 2 (VAE decode): 7.5 s

Result: High-res images on a 12 GB RDNA2 GPU are now possible on Flux1-Schnell-FP8 without ComfyUI crashing! (Image 5 is the original output)

This might actually become my new Flux workflow: render quick 512×512 previews first (which works perfectly on RDNA2 GPUs), sort out the good ones, extract the seed from the PNG metadata, and then re-render only the selected images with the same seed using the split workflow at higher resolutions. This way, high-resolution Flux1-Schnell-FP8 renders become possible on 12 GB RDNA2 GPUs :D
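
The seed-extraction step could be scripted roughly like this. It is a sketch under assumptions: that the preview PNG carries the graph as JSON in a `tEXt` chunk named `prompt`, and that the sampler input is called `seed` or `noise_seed`. The function names are made up for illustration, and chunk CRCs are not verified.

```python
import json
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_text_chunks(data):
    """Return {keyword: text} for every tEXt chunk in raw PNG bytes."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    chunks = {}
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"tEXt":
            keyword, _, text = body.partition(b"\x00")
            chunks[keyword.decode("latin-1")] = text.decode("latin-1")
        if ctype == b"IEND":
            break
        pos += 12 + length
    return chunks


def find_seeds(prompt_json, keys=("seed", "noise_seed")):
    """Return (node_id, input_name, value) for every integer seed input.

    Assumption: the JSON maps node ids to objects with an "inputs" dict.
    """
    graph = json.loads(prompt_json)
    seeds = []
    for node_id, node in graph.items():
        if not isinstance(node, dict):
            continue
        inputs = node.get("inputs", {})
        for key in keys:
            value = inputs.get(key)
            # Seeds wired in from another node are stored as links, not ints.
            if isinstance(value, int):
                seeds.append((node_id, key, value))
    return seeds
```

Usage would be along the lines of reading the file with `open(path, "rb").read()`, passing the bytes to `read_text_chunks`, and feeding the `prompt` entry to `find_seeds`.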

Question at the end: Has anyone ever done this before? Because I have no clue xD

#ComfyUI #flux #Flux1SchnellFP8 #FP8 #AMD #RDNA2 #VAE #AIArt #Pixelfed #HighResolution #GPUOptimization #LatentWorkflow #AIWorkflow #AIHacks #RealESRGAN #Upscale #AIExperiment #CreativeAI #DigitalArt #AICommunity #python #linux #opensource #foss

via #Microsoft : Maia 200: The AI accelerator built for inference

https://ift.tt/CjP7Rem
#Maia200 #AIinference #AIAcelerator #MaiaSDK #Azure #Microsoft #Foundry #Copilot #OpenAI #GPT5 #LLM #SyntheticData #ReinforcementLearning #FP8 #FP4 #TSMC3nm #HBM3e #Datacenter #CloudCompu…

🚀 FP8 has been backported to the RTX 3090, no H100 required! By skipping the conversion to fp16 in global memory, it saves a significant amount of VRAM, although compute performance drops slightly. A torch extension is included, so you can try it in your own workflow right away. #AI #MachineLearning #FP8 #RTX3090 #CUDA #DeepLearning #AI_Vietnam #CôngNghệ

https://www.reddit.com/r/LocalLLaMA/comments/1qn0dl8/backporting_fp8_to_the_rtx_3090_no_h100_required/
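
To illustrate what keeping weights as FP8 in memory involves, here is a pure-Python sketch of decoding one FP8 byte into an ordinary float. It assumes the E4M3 variant (1 sign bit, 4 exponent bits with bias 7, 3 mantissa bits, no infinities); the post does not say which FP8 variant the backport uses, and the real work would happen inside a GPU kernel, not in Python.

```python
def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (assumed variant) into a Python float."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a value in 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        return float("nan")  # the only NaN pattern; this variant has no inf
    if exponent == 0:
        # Subnormal: no implicit leading 1, fixed exponent of 1 - bias.
        return sign * (mantissa / 8.0) * 2.0 ** -6
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def decode_buffer(raw):
    """Decode a bytes object of E4M3 values; storage is 1 byte per value."""
    return [decode_e4m3(b) for b in raw]


print(decode_buffer(bytes([0x38, 0xC0, 0x7E])))
```

Each value occupies one byte here versus two for fp16, which is where the VRAM saving would come from.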

Brie Wensleydale (@SlipperyGem)

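An opinion comparing Qwen Image 2512 (BF16, GGUF format) with Flux Klein 9B (FP8). The author prefers Qwen's rendering and accuracy and finds several problems on the Flux Klein side. The post asks about the differences in quality and stability between the two models and formats.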

https://x.com/SlipperyGem/status/2013621827369869503

#qwen #flux #gguf #bf16 #fp8

Speed up older GPUs with a software FP8 solution! 🚀

A developer has just released a solution that emulates the FP8 format in software (using Triton kernels) for GPUs without hardware support, such as the RTX 30/20 series.

🔥 Results:
- 3× speedup on memory-bandwidth-bound workloads (GEMV, FlashAttention).
- Works on any older GPU.
- Optimizes the packing of low-precision data into FP32.

#AI #GPU #FP8 #MachineLearning #DeepLearning #CongNghe #PhanMem #Triton

https:/
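
One plausible reading of the packing point above is that four FP8 bytes share a single 32-bit word, so one 32-bit load brings in four values. The sketch below shows only that bit-level packing on unsigned 32-bit integers; the little-endian lane order and the function names are assumptions, and the project's actual Triton kernels may work differently.

```python
import struct


def pack4(b0, b1, b2, b3):
    """Pack four 8-bit values into one 32-bit word (b0 in the lowest byte)."""
    return struct.unpack("<I", bytes([b0, b1, b2, b3]))[0]


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes, lowest byte first."""
    return tuple(struct.pack("<I", word))


word = pack4(0x38, 0xC0, 0x01, 0x7E)
print(hex(word), unpack4(word))
```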

SGLang has just solved FP8 stability for RL training, finding that the problem lay in the quantization step. This is a big step forward for RLHF and local RL fine-tuning, and it simplifies the use of mixed precision.
#SGLang #FP8 #RLTraining #Quantization #AI #MachineLearning #HuấnLuyệnRL #TríTuệNhânTạo #HọcMáy

https://www.reddit.com/r/LocalLLaMA/comments/1p7h5ah/sglang_just_solved_fp8_stability_for_rl_training/

Great news for local LLM enthusiasts! You can now run FP8 reinforcement learning right on your own PC with as little as 5 GB of VRAM. It is faster and uses less VRAM than BF16/FP16. Try it now on RTX 40/50 series cards!
#LocalLLM #AI #MachineLearning #hocmay #trituenhantao #fp8 #reinforcementlearning

https://www.reddit.com/r/LocalLLaMA/comments/1p6k0h2/you_can_now_do_fp8_reinforcement_learning_locally/

FP8-Flow-MoE: A Casting-Free FP8 Recipe without Double Quantization Error

#FP8 #Precision

https://hgpu.org/?p=30341

🐢 Breaking news: A team of 🧙‍♂️ #wizards has magically discovered that #AMD #GPUs can handle something called "Matrix #Core Programming" with a little pixie dust called #FP16, #FP8, and #FP4. Who knew? 🤯 Get ready to revolutionize the universe... or just your local coffee shop's spreadsheet calculations. ☕📈
https://salykova.github.io/matrix-cores-cdna #Matrix #Programming #HackerNews #ngated

Fp8 runs ~100 tflops faster when the kernel name has "cutlass" in it

https://github.com/triton-lang/triton/pull/7298

#HackerNews #Fp8 #cutlass #tflops #performance #optimization #HackerNews #triton

The Qwen3-Next-80B-A3B model has been officially quantized to FP8, reducing its size and speeding up inference. This is an important step forward in optimizing large language models! 🤖✨

#AI #TríTuệNhânTạo #Qwen #LượngTửHóa #FP8 #MachineLearning #HọcMáy

https://www.reddit.com/r/LocalLLaMA/comments/1nnhlx5/official_fp8quantizion_of_qwen3next80ba3b/

Small numbers, big possibilities: the role of floating point in AI

Floating-point numbers underlie the vast majority of computing, especially in artificial intelligence (AI) and machine learning. They let models process data efficiently, balancing precision against computation speed. Advances in computing call for new formats that optimize memory use and speed up computation without significant loss of precision. One promising format is FP8, an 8-bit floating-point format that can improve compute performance and reduce power consumption.

https://habr.com/ru/companies/itglobalcom/articles/934910/

#fp8 #ai #ieee #квантование #машинное_обучение #обработка_данных #nvidia #amd #intel #ocp

FP8 is ~100 tflops faster when the kernel name has "cutlass" in it

https://twitter.com/cis_female/status/1943069934332055912

#HackerNews #FP8 #tflops #cutlass #performance #optimization #AI

#JackDongarra Makes a Stand for Traditional #HPC: "US still doesn’t have a clear, long-term plan for what comes next.... U.S. risks falling behind."

Challenges to high-performance computing threaten #US #innovation

The #AI boom has led chip makers to focus on #FP16 and #FP8, not the #FP64 used by scientific research. If chip companies stop making the parts that #scientists need, then it could become harder to do important research.
https://theconversation.com/challenges-to-high-performance-computing-threaten-us-innovation-255188

Meet DeepSeek-V3 — the 671 billion parameter beast that’s making OpenAI and Anthropic nervous 👀


🧠 It’s:
✔ Faster
✔ Cheaper ($5.6M training vs $60M+)
✔ More accurate on key tasks like coding, math, and comprehension
✔ Open-source + MIT licensed
✔ Deployable across NVIDIA, AMD & Huawei

📊 Performance Highlights:
🔹 MMLU: 88.5%
🔹 HumanEval: 82.6%
🔹 DROP: 91.6
🔹 MATH-500: 90.2%
🔹 Chinese C-Eval: 86.5%

But wait... ⚠️

🚨 Your data goes to Chinese servers.
🚨 It dodges politically sensitive questions.
🚨 It’s already being banned by gov agencies for “privacy risks.”

So is it the best LLM of 2025 or a privacy nightmare?

📥 Read the full analysis report here → https://deepseekagi.org/deepseek-v3-architecture/

💬 Drop your thoughts in the comments 👇
#DeepSeekV3 #AIRevolution #GPT4 #Claude3 #OpenSourceAI #AIComparison #MoE #FP8 #FutureTech #FacebookAI #LLMBattle

Triple bird 🐦‍⬛
#birds #vsco #googlepixel #fp8 #fujipro800z

🧐 Welcome to the thrilling world of "#DeepSeek," where they unleash their groundbreaking #FP8 #GEMM #Kernels, as if these buzzwords mean anything to normal humans. 🤖✨ Now you too can revel in the #excitement of "#fine-grained #scaling," because who doesn't dream of spending their weekends scaling kernels? 🎉 #GitHub's #navigation menu is undoubtedly the real star here, stealing the show with its riveting toggle action. 🚀
https://github.com/deepseek-ai/DeepGEMM #tech #HackerNews #ngated

DeepSeek Open Sources DeepGEMM: Clean and efficient FP8 GEMM kernels — https://github.com/deepseek-ai/DeepGEMM
#HackerNews #DeepSeek #DeepGEMM #FP8 #AI #Kernels #OpenSource

FP32, FP16, BF16 and FP8: making sense of the main floating-point types

Hello, Habr! Today let's talk about how modern GPU computing has become more flexible and efficient thanks to the various floating-point formats (FP64, FP32, FP16, BFLOAT16 and FP8). These formats are not just numbers: each one has a specific area of application. Different tasks prioritize either speed or precision, and choosing the right floating-point type helps optimize resources. Let's work through examples and see which tasks each of these formats suits best.

https://habr.com/ru/companies/serverflow/articles/847068/

#FP16 #fp32 #FP64 #BF16 #floating_point #плавающая_запятая #fp8 #числа_с_плавающей_запятой #формат_с_плавающей_запятой
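
For a rough feel of how the formats in the post above differ, this sketch derives the largest finite value and the spacing just above 1.0 from each format's bit layout. The layouts for FP64, FP32, FP16 and BF16 are the standard ones. The two FP8 variants (E4M3 and E5M2) are an assumption on my part, since the teaser only says "FP8", and `FloatFormat` is a made-up helper class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatFormat:
    name: str
    exp_bits: int
    man_bits: int
    # True: the top exponent is reserved for inf/NaN (IEEE style).
    # False: only the all-ones pattern is NaN (assumed for FP8 E4M3).
    ieee_specials: bool = True

    @property
    def total_bits(self):
        return 1 + self.exp_bits + self.man_bits

    @property
    def bias(self):
        return 2 ** (self.exp_bits - 1) - 1

    @property
    def max_finite(self):
        top = 2 ** self.exp_bits - 1
        if self.ieee_specials:
            return (2 - 2.0 ** -self.man_bits) * 2.0 ** (top - 1 - self.bias)
        # Top exponent is usable, but the all-ones mantissa is taken by NaN.
        return (2 - 2.0 ** (1 - self.man_bits)) * 2.0 ** (top - self.bias)

    @property
    def epsilon(self):
        """Gap between 1.0 and the next representable value."""
        return 2.0 ** -self.man_bits


FORMATS = [
    FloatFormat("FP64", 11, 52),
    FloatFormat("FP32", 8, 23),
    FloatFormat("FP16", 5, 10),
    FloatFormat("BF16", 8, 7),
    FloatFormat("FP8 E5M2", 5, 2),
    FloatFormat("FP8 E4M3", 4, 3, ieee_specials=False),
]

for f in FORMATS:
    print(f"{f.name:9} {f.total_bits:2} bits  max={f.max_finite:.4g}  eps={f.epsilon:.3g}")
```

The trade-off shows up directly: BF16 keeps FP32's exponent range but far fewer mantissa bits, while FP16 and the FP8 variants give up range as well.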

#FP8
