
Sebastian Raschka (@rasbt)

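A report that the chapter on improving GRPO for reinforcement learning (Ch07) is complete. Building on the GRPO-from-scratch implementation, it adds and analyzes several improvements, including clipped policy ratios, a KL term, and format rewards. The code and notebooks are published in rasbt's reasoning-from-scratch GitHub repository, so the work can be reproduced and experimented with.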

https://x.com/rasbt/status/2022830961012920676

#reinforcementlearning #grpo #opensource #rl #python
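
To make two of the improvements named in the post above concrete (clipped policy ratios and a KL term), here is a minimal per-token sketch. It is not taken from the chapter or the repository; the function name, the default `clip_eps` and `kl_coef` values, and the choice of the "k3" KL estimator are all assumptions made for illustration.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, kl_coef=0.04):
    """Illustrative per-token GRPO-style loss (to be minimized).

    logp_new, logp_old, logp_ref: log-probabilities of the sampled token under
    the current policy, the policy that generated the sample, and a frozen
    reference policy. advantage: group-relative advantage of the response.
    All default hyperparameters are assumptions, not values from the chapter.
    """
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # PPO-style pessimistic surrogate: keep the smaller of the two objectives.
    surrogate = min(ratio * advantage, clipped * advantage)
    # "k3" estimate of KL(current || reference); it is always non-negative.
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1.0
    return -(surrogate - kl_coef * kl)
```

When the three log-probabilities agree, the ratio is 1 and the KL estimate is 0, so the loss reduces to the negative advantage. Once the ratio leaves the interval `[1 - clip_eps, 1 + clip_eps]` in the direction favored by the advantage, the surrogate stops growing.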

Ukrainian Unit Uses Ground Drones To Save Soldiers On The Battlefield | Ukraine Front Line Update

#RadioFreeEurope #RadioLiberty #RFE #RL #Ukraine #Russia #Drone (13 February 2026)

https://youtu.be/wEVotwZPAcc?si=QR9gruR0Ojx2unFv

The #Servicewüste ("service desert") that is #DHL makes me aggressive. Several parcels were supposedly delivered but never arrived. The branch says: no idea, you have to request an investigation on the website. That ends in wild click orgies with no result. In spreadsheets you would call this a "circular reference"; in #rl it is more like "fuck da customer".

merve (@mervenoyann)

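Development notes from an experiment extending OpenEnv to vision-language models (VLMs). The author built a snake environment that renders the grid as an image and passes it as a base64 observation so that a VLM can process image observations, and demonstrated a small replay (3 episodes).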

https://x.com/mervenoyann/status/2020826014729826309

#openenv #visionlanguage #rl #simulation
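
The idea in the post above, rendering a grid as an image and handing it over as a base64 string, can be sketched with the Python standard library alone. The post does not say which image format or observation schema was used, so the PNG encoding, the function names, the cell codes (0 = empty, 1 = snake, 2 = food), and the palette below are all hypothetical.

```python
import base64
import struct
import zlib


def png_chunk(tag: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, tag, payload, CRC over tag + payload."""
    crc = zlib.crc32(tag + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + tag + data + struct.pack(">I", crc)


def grid_to_base64_png(grid, palette, cell=8):
    """Render a 2D grid of integer cell codes as an RGB PNG, base64-encoded.

    palette maps each cell code to an (r, g, b) tuple; every grid cell becomes
    a cell x cell block of pixels. The encoding choices are assumptions.
    """
    height, width = len(grid), len(grid[0])
    scanlines = []
    for row in grid:
        pixels = b"".join(bytes(palette[code]) * cell for code in row)
        # Each scanline starts with filter type 0 (no filtering).
        scanlines.extend([b"\x00" + pixels] * cell)
    # IHDR: width, height, bit depth 8, color type 2 (RGB), default methods.
    ihdr = struct.pack(">IIBBBBB", width * cell, height * cell, 8, 2, 0, 0, 0)
    png = (b"\x89PNG\r\n\x1a\n"
           + png_chunk(b"IHDR", ihdr)
           + png_chunk(b"IDAT", zlib.compress(b"".join(scanlines)))
           + png_chunk(b"IEND", b""))
    return base64.b64encode(png).decode("ascii")


# Hypothetical 3 x 4 board: a two-cell snake next to one food cell.
example_grid = [[0, 0, 0, 0],
                [0, 1, 1, 2],
                [0, 0, 0, 0]]
example_palette = {0: (0, 0, 0), 1: (0, 200, 0), 2: (200, 0, 0)}
observation = grid_to_base64_png(example_grid, example_palette)
```

The resulting string can be placed in whatever observation field the environment defines; how a particular VLM expects to receive it is outside this sketch.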

fly51fly (@fly51fly)

Researchers from MIT CSAIL, CMU, and TU Munich have posted the paper 'Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps' to arXiv. It proposes using stochastic flow maps to achieve reward alignment efficiently, and may contribute to research on reward design and alignment as well as to safe RL applications.

https://x.com/fly51fly/status/2019888413223379073

#rewardalignment #research #rl #alignment

Avi Chawla (@_avichawla)

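RULER's key insight is that relative scoring is easier than absolute scoring. The post explains that, for reward evaluation, it is simpler for an LLM judge to decide by relative comparison, such as 'trajectory A is better than B', than to assign each trajectory an absolute score.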

https://x.com/_avichawla/status/2016502643032748415

#ruler #rewardmodeling #rl #llm
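
One way to picture the relative-scoring idea from the post above is to turn pairwise judgments into per-trajectory scores. This is only a sketch and not RULER's actual implementation: the `prefers` callback stands in for an LLM judge, and the win-fraction aggregation is an assumption chosen for simplicity.

```python
from itertools import combinations


def relative_scores(trajectories, prefers):
    """Score each trajectory by the fraction of pairwise comparisons it wins.

    prefers(a, b) is a hypothetical judge that returns True when trajectory a
    is judged better than trajectory b.
    """
    if len(trajectories) < 2:
        raise ValueError("need at least two trajectories to compare")
    wins = [0] * len(trajectories)
    for i, j in combinations(range(len(trajectories)), 2):
        if prefers(trajectories[i], trajectories[j]):
            wins[i] += 1
        else:
            wins[j] += 1
    opponents = len(trajectories) - 1
    return [w / opponents for w in wins]


# Stand-in judge for demonstration only: the longer string "wins".
demo_scores = relative_scores(["a", "abc", "ab"], lambda a, b: len(a) > len(b))
```

The judge never has to produce a calibrated number; it only answers which of two trajectories is better, and the scores fall out of the comparisons.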

MiniMax (official) (@MiniMax_AI)

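A question-and-answer discussion about why to choose CISPO over GSPO or GRPO, how well it adapts to MoE (mixture-of-experts) models, and whether changing the RL algorithm requires refactoring the architecture. Among the points mentioned: GRPO already existed but proved unreliable when reproducing R1-Zero, and PPO-style clipping was empirically observed to cause token-level gradient problems.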

https://x.com/MiniMax_AI/status/2016471929549697443

#rl #cispo #grpo #ppo #moe
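
The token-level gradient issue mentioned in the post above can be illustrated by comparing the multiplier each objective places on a token's log-probability gradient. This sketch assumes the commonly described formulations: PPO takes the minimum of a clipped and an unclipped surrogate, while CISPO clips the importance-sampling ratio and treats it as a constant (detached) weight. The function names and default epsilon values are hypothetical.

```python
def ppo_clip_grad_weight(ratio, advantage, eps=0.2):
    """Multiplier on grad(log pi) for one token under the PPO clipped surrogate."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    if ratio * advantage <= clipped * advantage:
        # The unclipped term is selected; d(ratio * A) = ratio * A * d(log pi).
        return ratio * advantage
    # The clipped term is selected and is constant in the parameters.
    return 0.0


def cispo_grad_weight(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Multiplier on grad(log pi) when the clipped ratio is a detached weight."""
    clipped = max(1.0 - eps_low, min(1.0 + eps_high, ratio))
    return clipped * advantage


# A token whose probability rose sharply (ratio 1.5) with positive advantage:
ppo_weight = ppo_clip_grad_weight(1.5, 1.0)   # token is dropped from the update
cispo_weight = cispo_grad_weight(1.5, 1.0)    # token still contributes, bounded
```

Under these assumptions, PPO-style clipping removes such a token from the gradient entirely, whereas the clipped-weight form keeps it with a bounded coefficient.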

khazzz1c (@Imkhazzz1c)

Raises a theoretical research question: if token-level reward signals became 'perfect', would the critic model (the value estimator) that serves as the evaluator become unnecessary?

https://x.com/Imkhazzz1c/status/2015738937202126960

#reinforcementlearning #rewardmodels #rl #criticmodel

What #Xi ’s Purge Of A Top General Means For #China And Its Neighbors
#RFE #RL #ZhangYouxia

https://www.rferl.org/a/china-general-zhang-youxia-military-russia-central-asia-taiwan/33660395.html

😃 💚 Hallo, moin, tach. 💚 Danke allen neuen FollowerInnen (m/w/d), es ehrt und freut mich sehr. 💚 Bin zurzeit nicht so präsent, bei mir im #RL brennt nicht nur die Luft.
Schaue mir alle an und folge gerne zurück, wenn es passt. 😉 😘

#Musik #Mucke #music
🎼 Streets of Philadelphia - Bruce Springsteen - 🎶
https://www.youtube.com/watch?v=4z2DtNW79sQ

Bildbeschreibung Blick über den Westerwald:
Deutschland, Westerwald. Ein eingefahrener Feldweg führt bis an den Horizont über hügelige Wiesen. Links und rechts sehen wir niedrige und hohe Mischwälder, in den Tälern im Hintergrund wabert der Morgennebel. Der Himmel zeigt interessante Wolkenformationen in allen möglichen Farben von blassrosa bis dunkelblau.

😃 💚 Hello, everyone. 💚 Thank you to all my new followers (m/f/d), I am very honored and delighted. 💚 I'm not very active at the moment, as things are pretty hectic in my #RL.
I'll take a look at everyone and will be happy to follow back if it suits me. 😉 😘

Image description View over the Westerwald:
Germany, Westerwald. A well-trodden dirt road leads across rolling meadows to the horizon. To the left and right, we see low and high mixed forests, with morning mist swirling in the valleys in the background. The sky displays interesting cloud formations in all possible colors from pale pink to dark blue.

Latest in the spat with my brother (who voted for Trump the first time and apparently just skipped the next two elections) - He refused to accept the xmas money I sent. I asked him to explain how he could support any part of Trump's behavior, and got silence in return. So yeah. Not the first time he's just left me on read about this stuff, so guess we'll see. Think he'll change his tune when he tries to cancel the midterms? #uspol #rl

TheCoderUX (@Motion_Viz)

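Reports the discovery of an API for training code optimization with RL (reinforcement learning). iterx (by @deep_reinforce) provides a 4-step loop (e.g., POST /api/task/create, /api/fetch_unevaluated_code_ids, etc.) covering task creation, fetching pending code variants, and so on, through which code optimization tasks can be trained iteratively.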

https://x.com/Motion_Viz/status/2013500237462065421

#rl #reinforcementlearning #codeoptimization #api #developertools

Sebastian Raschka (@rasbt)

The author announced that Chapter 6 is complete; it implements reinforcement learning with verifiable rewards from scratch using GRPO. He calls it his personal favorite chapter so far, and its goal is to present an RL method built on a verifiable reward scheme.

https://x.com/rasbt/status/2012897755916579278

#reinforcementlearning #grpo #rl #research
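
As a rough picture of what "verifiable rewards" with GRPO involves, the sketch below checks a response against a known answer and then normalizes rewards within a group of sampled responses. It is not the chapter's code: the `\boxed{...}` answer convention, the exact-match check, the function names, and the `eps` constant are assumptions.

```python
import re
import statistics


def verifiable_reward(response: str, reference: str) -> float:
    """Return 1.0 if the last boxed answer equals the reference, else 0.0."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if answers and answers[-1].strip() == reference.strip():
        return 1.0
    return 0.0


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical sampled responses to a question whose answer is 42.
samples = ["so \\boxed{42}", "I get \\boxed{41}", "no final answer", "\\boxed{ 42 }"]
rewards = [verifiable_reward(s, "42") for s in samples]
advantages = group_advantages(rewards)
```

Because the reward comes from a programmatic check rather than a learned model, it can be verified independently; the group normalization then tells each response how it fared relative to its siblings.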

New! A Python notebook illustrating RLVR combined with GRPO, implemented from scratch in the Reasoning‑from‑Scratch project. Detailed documentation and practical examples for RL researchers. #AI #MachineLearning #RL #GRPO #Python #CôngNghệ #TríTuệNhânTạo

https://www.reddit.com/r/LocalLLaMA/comments/1qgcj8b/rlvr_with_grpo_from_scratch_code_notebook/

fly51fly (@fly51fly)

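The paper 'Rewarding the Rare: Uniqueness-Aware RL for Creative Problem Solving in LLMs' proposes 'Uniqueness-Aware RL', a technique that recognizes and rewards uniqueness (rarity) to support creative problem solving in LLMs. The authors include Z Hu, Y Wang, Y He, and J Wu (a joint MIT and NUS effort); the paper was posted to arXiv in 2026 and examines how the approach could improve the diversity and creativity of LLMs.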

https://x.com/fly51fly/status/2011550736023433581

#rl #llms #creativity #rewardmodel #arxiv
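
The post above does not give the paper's formula, so the following is only a toy illustration of what rewarding rarity could look like: correct solutions that share a strategy with many other correct solutions in the same sampled group receive a smaller reward. The strategy labels, the function name, and the 1/count weighting are all assumptions, not the authors' method.

```python
from collections import Counter


def uniqueness_weighted_rewards(correct, strategy_ids):
    """Give each correct solution 1 / (number of correct solutions sharing
    its strategy label); incorrect solutions get 0.0.

    correct: list of booleans, one per sampled solution.
    strategy_ids: hypothetical labels identifying each solution's approach.
    """
    counts = Counter(s for ok, s in zip(correct, strategy_ids) if ok)
    return [1.0 / counts[s] if ok else 0.0
            for ok, s in zip(correct, strategy_ids)]


# Two correct solutions use strategy "a"; one correct solution uses "b".
toy_rewards = uniqueness_weighted_rewards(
    [True, True, True, False], ["a", "a", "b", "b"])
```

In this toy weighting, the lone correct "b" solution earns the full reward while the two "a" solutions split it, which nudges the policy toward less common but still correct approaches.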

Rohan Paul (@rohanpaul_ai)

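AT2PO (Agentic Turn based Policy Optimization via Tree Search) is a method for training tool-using LLM agents faster and more stably. When the agent is uncertain, it expands a tree of possible next actions and learns from the best path among them to improve the policy, which sets it apart from existing training at the level of whole conversations.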

https://x.com/rohanpaul_ai/status/2010949681208016963

#rl #treesearch #agents #llm

🧠 New preprint by Greenstreet, Geerts, Gallego, and Clopath on #MotorLearning across #cortex, #cerebellum, and #BasalGanglia.

Key idea: supervised learning builds a low-dimensional action embedding, and #ReinforcementLearning operates directly in this structured space. This explains generalization, interference, and striatal similarity patterns as geometric consequences, not add-ons.

🌍 https://doi.org/10.64898/2025.12.19.695526

#Neuroscience #CompNeuro #RL

Figure 1: A multi-region model learns structured action representations.

swyx (@swyx)

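A pointer to the 'holiday drops', which include an interview featuring @ashvinair of Cursor and the OpenAI reasoning team. The interview appears to cover the current state of reinforcement learning (RL) research and related discussion, and should be useful for following RL research trends and updates.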

https://x.com/swyx/status/2009105368757268854

#rl #reinforcementlearning #research #openai

Richard Sutton – Father of RL thinks LLMs are a dead end

https://www.youtube.com/watch?v=21EYKqUsPfg

Richard Sutton is the father of reinforcement learning, winner of the 2024 Turing Award, and author of The Bitter Lesson. And he thinks LLMs are a dead end.

This is an interesting interview.

#intelligence #RL #AI #llms
