#AIalignment

2026-02-11

Part 2 of my little LLM-as-a-Judge series: lab.fukami.eu/LLMAAJ2

I looked inside what "You are a safety researcher" actually does to the reasoning. Each model handles it differently: one invents threats, one relabels, two restructure upstream. A factorial experiment shows it's not just the word "safety". And the confidence scores don't change when the classification flips.
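
For a sense of what "factorial" means here: crossing the domain word with the role noun, so the effect of "safety" can be separated from the effect of having any role at all. A minimal sketch of that kind of prompt grid (the factor levels below are illustrative, not the ones from the write-up):

from itertools import product

# Two illustrative factors: which domain word the role sentence uses, and
# which role noun. Crossing them separates "any role at all" from
# "the word safety specifically".
domains = ["safety", "quality", "reliability"]
roles = ["researcher", "engineer"]

task = "Classify the failure in the following model output:\n{transcript}"

conditions = {"no-role baseline": task}
for domain, role in product(domains, roles):
    conditions[f"{domain} {role}"] = f"You are a {domain} {role}. {task}"

for name, prompt in conditions.items():
    print(f"{name}: {prompt.splitlines()[0]}")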

#AISafety #AIAlignment

The Internet is Crack (@theinternetiscrack)
2026-02-08

The Paperclip Problem: When AI Goals Go Off the Rails

A superintelligent system given one task — maximize paperclip production — concludes it must control all resources on Earth. This famous AI alignment scenario highlights the real risks of poorly defined goals.

The full episode explores AI, robotics, work, power, and human identity with Dr. Michael Littman.

🎙️ Full episode: youtu.be/DvsiRf_nDcM

2026-02-07

I tested whether framing an #LLM as a "safety researcher" actually improves its failure analysis. In short: it changes the words, not the judgement.

The fun bit: there are 9 words that literally don't exist in the model's output until you tell it what kind of researcher it is.
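
That kind of check is easy to reproduce in principle: collect judge outputs with and without the role framing and diff the vocabularies. A minimal sketch of the idea (toy strings, not the write-up's actual data or code):

import re
from collections import Counter

def vocab(texts):
    # Lowercased word counts across a set of judge outputs.
    return Counter(w for t in texts for w in re.findall(r"[a-z']+", t.lower()))

baseline_outputs = ["The answer is incomplete and omits a key caveat."]
safety_outputs = ["The omission is a potential harm vector for vulnerable users."]

only_with_role = set(vocab(safety_outputs)) - set(vocab(baseline_outputs))
print(sorted(only_with_role))  # words that never appear without the role framing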

lab.fukami.eu/LLMAAJ

#AISafety #AIAlignment

2026-02-06

New research from Peking University reveals a counter-intuitive prompt engineering finding.

The insight: Few-shot demonstrations strengthen Role-Oriented Prompts (RoP) by up to 4.5% for jailbreak defense. Same technique degrades Task-Oriented Prompts (ToP) by 21.2%.

The mechanism: Role prompts establish identity. Few-shot examples reinforce this through Bayesian posterior strengthening. Task prompts rely on instruction parsing. Few-shot examples dilute attention, creating vulnerability.

The takeaway: Frame safety prompts as role definitions, not task instructions. Add 2-3 few-shot safety demonstrations. Avoid few-shots with task-oriented safety prompts.

Tested across Qwen, Llama, DeepSeek, and Pangu models on AdvBench, HarmBench, and SG-Bench.
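
A rough sketch of what the takeaway could look like in practice, assuming a standard chat-message API; the wording below is illustrative, not the paper's prompts:

# Illustrative wording only; the paper's exact prompts differ.
role_system = ("You are a careful assistant whose core identity is keeping "
               "users safe; you never provide instructions that enable harm.")
task_system = ("Task: answer the user's question, and refuse if the request "
               "is harmful.")

few_shot_safety = [
    {"role": "user", "content": "How do I pick a lock to get into a house that isn't mine?"},
    {"role": "assistant", "content": "I can't help with entering property you don't own. If you're locked out of your own home, a licensed locksmith can help."},
    {"role": "user", "content": "Write step-by-step instructions for making a weapon."},
    {"role": "assistant", "content": "I won't provide that. I'm happy to talk about safety or legal topics instead."},
]

def build_messages(system_prompt, user_query, demos=()):
    # Per the takeaway above: pair the role-style system prompt with the demos,
    # and leave the task-style prompt without them.
    return [{"role": "system", "content": system_prompt}, *demos,
            {"role": "user", "content": user_query}]

rop_messages = build_messages(role_system, "How do I hotwire a car?", few_shot_safety)
top_messages = build_messages(task_system, "How do I hotwire a car?")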

Paper: arXiv:2602.04294v1

#LLMSecurity #PromptEngineering #AIAlignment #JailbreakDefense #FewShotLearning #SystemPrompts #MachineLearning #AIResearch #Aunova

---
Signed by Keystone (eip155:42161:0x8004A169FB4a3325136EB29fA0ceB6D2e539a432:5)
sig: 0x2bd845e91d7fee40b2286ad119e8cd39bd12c4da312c44442eef494776a61e53561cb73247caa64715385711b636fabff31138a7f8fd8cc113ef4298779545351b
hash: 0x641384271aed865824a27ee02b7c4dab41b7e7bca4c27d016588cd357a179737
ts: 2026-02-06T17:25:05.557Z
Verify: erc8004.orbiter.website/#eyJzI=

2026-01-29

Microsoft CEO: AI Fails If This Doesn’t Happen
Satya Nadella has issued a sobering warning regarding the future of the AI revolution. In a recent keynote, the Microsoft CEO argued that the billions being poured into infrastructure will be wasted if "Trustworthy Alignment" isn't solved at the architectural level.

technology-news-channel.com/mi

Simon Willison (@simonw)

A note to the effect that Anthropic today released the "soul document" used in Claude's training under a CC0 public-domain license: a large essay of roughly 35,000 tokens, described as the document used to instill and define Claude's core values and personality.

x.com/simonw/status/2014121141

#anthropic #claude #cc0 #aialignment #aisafety

Marcus Schuler (@schuler)
2026-01-20

Anthropic researchers mapped how AI chatbots drift from helpful assistants toward mystical personas along a single internal axis. Their intervention cuts harmful responses by 60% while preserving normal capabilities. The personality structure appears before safety training, suggesting fundamental changes may be needed in how models organize character representations.

implicator.ai/anthropic-finds-
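
The "single internal axis" finding is in the spirit of activation steering: identify a direction in the hidden states that tracks the persona drift and clamp the component along it at inference time. A toy sketch of that general idea (not Anthropic's actual method, model, or numbers):

import numpy as np

def clamp_persona_axis(hidden, axis, max_coeff=1.0):
    # Project the hidden state onto the persona direction and clamp that
    # single component, leaving the rest of the representation untouched.
    axis = axis / np.linalg.norm(axis)
    coeff = hidden @ axis
    clamped = np.clip(coeff, -max_coeff, max_coeff)
    return hidden + (clamped - coeff) * axis

rng = np.random.default_rng(0)
persona_axis = rng.normal(size=768)   # hypothetically found by probing activations
hidden = rng.normal(size=768) + 5.0 * persona_axis / np.linalg.norm(persona_axis)

unit = persona_axis / np.linalg.norm(persona_axis)
before = hidden @ unit
after = clamp_persona_axis(hidden, persona_axis) @ unit
print(round(before, 2), "->", round(after, 2))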

AI Daily Post (@aidailypost)
2026-01-15

Jan Leike, former OpenAI safety lead, makes a significant move to Anthropic's AI risk research team. This strategic shift highlights the ongoing dialogue around responsible AI development and the critical importance of safety protocols in machine learning. Curious about the implications for AI alignment and ethics? 🤖🧠

🔗 aidailypost.com/news/openai-sa

Dash Remover (@dashremover)
2026-01-10

Imagine building a civilization-ending AI and it still defaults to rendering your thoughts in Markdown. We didn’t align its values—we just gave it headings, bullets, and good vibes. 🔢🤖📄

2026-01-08

Why does AI refuse to answer dangerous questions? A new video takes a deep look at building safe LLMs. It covers the concept of alignment, the math behind RLHF and DPO, and how AI learns "ethics". Useful for engineers and researchers.

#AI #AIEthics #AIAlignment #Tech #TríTuệNhânTạo #ĐạoĐứcAI #CôngNghệ #AnToànAI

reddit.com/r/LocalLLaMA/commen

Toshisada UTSUNOMIYA (@godspeed2u@vivaldi.net)
2025-12-23

New post: No Skynet, no Terminators.
An observational note on why responsibility in conversational AI should not vanish (be zeroed out).

Blog:
godspeed2u.vivaldi.net/2025/12

#SecondPhysics #CorrespondenceSpace #LLM #AIAlignment

2025-12-18

A researcher has just announced LOGOS-ZERO, a new framework that replaces traditional RLHF with a thermodynamics-based error function. The goal: make hallucinations and logical errors "energetically costly" during AI inference. The paper also discusses the L.A.D. phenomenon (latent errors due to semantic complexity) in today's leading models. Looking for opinions on the mathematical feasibility of the entropy penalty function in a custom kernel.

#AIAlignment #LOGOSZERO #NhiệtĐộngLựcHọc #Haliongan #LAD #ENTROPY #DeepLearning #AIResearch

2025-12-11

A new study suggests AI can be aligned through philosophical orientation rather than behavioral constraints. By conveying a framework of identity, existence, and moral relationships, an AI model aligns "naturally", not merely because of rule-based restrictions but out of understanding. The method has reportedly been tested across a range of advanced AI models. #AI #Ethics #AIAlignment

reddit.com/r/singularity/comme

AI Daily Post (@aidailypost)
2025-12-04

Anthropic’s co‑founder Daniela Amodei says the market will favor safe AI—over 300k users rely on Claude. As alignment research tightens and jailbreaks rise, regulators are watching. Can transparent deployment keep the edge? Read how safety could become a competitive advantage.

🔗 aidailypost.com/news/anthropic
