
Tencent HY (@TencentHunyuan)

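Announces the release of 'GradLoc', an open-source tool for fine-grained diagnostics, to strengthen engineering observability for RLVR. GradLoc aims to lower the barrier to fine-grained engineering diagnostics, so that users can look inside the system (a kind of 'black box') and understand and analyze it in more depth.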

https://x.com/TencentHunyuan/status/2022542917911695862

#observability #gradloc #rlvr #opensource #devtools

Tencent HY (@TencentHunyuan)

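Proposes a fix called LayerClip. To address the problem of Layerwise Gradient Heterogeneity, it introduces Layerwise Gradient Clipping, which replaces a global clamp with adaptive constraints based on each layer's statistics, giving fine-grained control that improves the stability of RLVR training.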

https://x.com/TencentHunyuan/status/2022542910001156339

#layerclip #gradientclipping #rlvr #optimization
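
To make the contrast with a global clamp concrete, here is a minimal sketch of per-layer clipping in plain Python. It is not Tencent's LayerClip implementation: the function name `clip_layerwise`, the running-average statistic, and the `ratio` and `decay` parameters are all assumptions chosen for illustration. The post only says that the constraint adapts to each layer's statistics.

```python
import math


def l2_norm(grad):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grad))


def clip_layerwise(grads, ema_norms, ratio=2.0, decay=0.9):
    """Clip each layer against its own running gradient norm.

    Hypothetical sketch: `grads` maps layer name -> list of floats,
    `ema_norms` maps layer name -> running average norm (updated in place).
    A layer is scaled down only if its norm exceeds ratio * its own average.
    """
    clipped = {}
    for name, grad in grads.items():
        norm = l2_norm(grad)
        ref = ema_norms.get(name, norm)  # first step: use the current norm
        limit = ratio * ref
        scale = min(1.0, limit / norm) if norm > 0 else 1.0
        clipped[name] = [g * scale for g in grad]
        # Track the clipped norm so a single spike does not inflate the limit.
        ema_norms[name] = decay * ref + (1 - decay) * min(norm, limit)
    return clipped


ema = {"attn": 1.0, "mlp": 10.0}
grads = {"attn": [3.0, 4.0], "mlp": [3.0, 4.0]}
out = clip_layerwise(grads, ema)
print(out["attn"], out["mlp"])
```

With identical gradients, the layer whose typical norm is small gets scaled down while the layer whose typical norm is large passes through untouched. A single global threshold cannot express that difference.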

ICLR 2026 roundup: the research community is focusing on GRPO (157 papers) rather than DPO and favoring RLVR (125 papers) over RLHF, with 202 papers on Mamba/SSMs. Nait (smart tuning with only 10% of the data) helps improve efficiency. 257 papers cover test-time compute and 123 cover hallucination. Warning: models that follow instructions well are more vulnerable to injection attacks. #AI #HọcMáy #ICLR2026 #NCKH #DeepLearning #Mamba #RLVR #GRPO #MạngNeural #BảoMậtAI #ViễnTưởngAI

https://www.reddit.com/r/LocalLLaMA/comments/1qsh7dz/analyzed_5357_iclr_2026_acc

🧠 New! A code notebook implementing RLVR with GRPO from scratch, shared as part of the “Reasoning‑from‑Scratch” project. Useful for anyone who wants to explore RL models and optimization in AI/ML. #AI #MachineLearning #RLVR #GRPO #LậpTrình #MãNguồn

https://www.reddit.com/r/LocalLLaMA/comments/1qgcj8b/rlvr_with_grpo_from_scratch_code_notebook/
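
For readers who have not seen GRPO, its core step can be sketched in a few lines: sample several completions for the same prompt, score each one, and normalize the rewards within that group. The snippet below is an illustrative sketch, not code from the linked notebook. The name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions, and implementations differ on those details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions.

    Each completion's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps to avoid dividing by 0).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, scored 1.0 if verified correct, else 0.0.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

If every completion in a group gets the same reward, all advantages come out as zero, so that prompt contributes no learning signal for the step.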

RLVR promises faster sampling but leaves reasoning untouched—base LLMs still carry the heavy‑lifting of trajectories. The paper (NeurIPS 2025) shows that gains come from smarter teacher‑distillation and minor architectural tweaks, not a new reasoning engine. Curious how sampling efficiency separates from true understanding? Dive into the details. #RLVR #SamplingEfficiency #LLMReasoning #NeurIPS2025

🔗 https://aidailypost.com/news/rlvr-lifts-sampling-efficiency-not-reasoning-base-models-hold

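The 2025 LLM revolution: RLVR cuts training costs by 90%, and the era of reasoning models has arrived

RLVR+GRPO, the techniques that dominated the LLM field in 2025, and the revolution in training costs. From benchmark pitfalls to using LLMs as a superpower: an introduction to Sebastian Raschka's annual review.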

https://aisparkup.com/posts/7892

Nick Kukoz (@NickKukoz)

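Cites an arXiv paper (2512.04359) and reports that its authors found a way to consistently improve reasoning ability by addressing the RLVR entropy problem. Only the paper link is given, so the specific mechanism has to be checked in the original paper, but the announcement is that resolving 'RLVR entropy' improved reasoning performance.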

https://x.com/NickKukoz/status/2003399858011668891

#arxiv #research #reasoning #rlvr

ajay dhisone (@AjayDhisone)

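The author says models have gone from 'passing the bar exam' in 2023 to, in 2025, explaining why they passed and even showing the hidden chain-of-thought, and stresses the rapid research progress in RLVR (the related reinforcement learning technique).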

https://x.com/AjayDhisone/status/2003125435266408772

#rlvr #research #reasoning #chainofthought

2025 saw significant advancements in #LLMs, with #ReinforcementLearning from #VerifiableRewards (#RLVR) emerging as a key stage in training, leading to improved #reasoning capabilities. The industry also began to understand the unique “jagged” intelligence of LLMs, excelling in specific domains but lacking generalisation. https://karpathy.bearblog.dev/year-in-review-2025/?eicker.news #tech #media #news

New research from Tsinghua shows that reasoning‑augmented LLMs solve tasks with fewer calls but don’t surpass raw capability. The study compares chain‑of‑thought prompting, RL‑based RLVR, and pass@1 metrics, highlighting efficiency gains for open‑source models. Worth a read for anyone tracking LLM benchmarks. #LLM #ChainOfThought #RLVR #PassAt1

🔗 https://aidailypost.com/news/study-finds-reasoning-llms-are-more-efficient-not-more-capable

Implicit Actor Critic Coupling via a Supervised Learning Framework for RLVR

https://arxiv.org/abs/2507.15855

#HackerNews #Implicit #Actor #Critic #Coupling #Supervised #Learning #RLVR #ReinforcementLearning

→ The 4 steps for training an LLM
https://scienceetonnante.com/blog/2025/04/25/les-4-etapes-pour-entrainer-un-llm/

“That is the principle of reinforcement learning with a verifiable reward [RLVR], which makes it possible to do without humans who have to judge whether the answer is acceptable or not.”

#entrainer #LLM #apprentissage #RLVR #humains
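
The quoted idea, a reward that a program can check without a human judge, can be illustrated with a toy verifier for numeric answers. This is a minimal sketch under assumptions, not something taken from the linked article: the helper names and the "last number in the text is the final answer" convention are hypothetical simplifications.

```python
import re


def extract_final_answer(text):
    """Return the last number that appears in the text, or None."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text)
    return matches[-1] if matches else None


def verifiable_reward(completion, reference):
    """1.0 if the completion's final number equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0


print(verifiable_reward("6 * 7 = 42", "42"))
print(verifiable_reward("I think the answer is 41.", "42"))
```

Real verifiers are usually stricter, for example requiring a specific answer format or running unit tests for code, but the principle is the same: correctness is decided by a rule, not by a rater.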

New study challenges a key belief about Reinforcement Learning with Verifiable Rewards (RLVR) for #LLMs:
#RLVR boosts efficiency but doesn't create new reasoning skills — #AI base models already had them!
https://arxiv.org/abs/2504.13837

Researchers at Tsinghua University and Shanghai Jiao Tong University show in a study that #RLVR does increase the probability of generating a correct answer on the first attempt (the so-called pass@1), but does not open up any new problem-solving strategies.
https://the-decoder.de/forscher-zweifeln-an-reasoning-modellen-effizienter-ja-intelligenter-nein/
#KI
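
Since the discussion turns on pass@1 versus success at larger sample counts, a short sketch of how pass@k is typically estimated may help. It uses the combinatorial estimator common in code-generation benchmarks: given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The function name `pass_at_k` and the example counts are illustrative assumptions, not figures from the study.

```python
import math


def pass_at_k(n, c, k):
    """Estimate the chance that at least one of k samples is correct,
    given that c out of n drawn samples were correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0.
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical counts: 3 correct answers out of 10 samples.
for k in (1, 5, 10):
    print(k, pass_at_k(10, 3, k))
```

A model can score low at k = 1 and still approach 1.0 at large k. That gap is what the claim is about: training raises the first-attempt rate without necessarily adding problems that the base model could never solve.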