Lmst

AshutoshShrivastava (@ai_for_success)

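Entelligence benchmarked ten major AI code review tools against 67 real production bugs. The test set included race conditions, security vulnerabilities, API changes, and logic errors, and it provides important data for improving the reliability and quality of AI-based code reviewers.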

https://x.com/ai_for_success/status/2026365995712131115

#entelligence #benchmark #ai #codereview #evaluation

Sebastian Raschka (@rasbt)

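A re-examination of SWE-Bench Verified benchmark results has prompted the view that the real performance differences between models are being underestimated. The analysis of reliability problems in some metrics highlights the need to improve benchmark quality.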

https://x.com/rasbt/status/2026062254571913522

#benchmark #evaluation #aimodels #research

[Watch] Short piece in the Débats section of @lemonde showing the futility of #publishORperish, #rankings, and APC-based publishing models: "Those who lament the lag of Western research risk falling into the trap set by the Chinese party-state" https://www.lemonde.fr/idees/article/2026/02/19/ceux-qui-se-lamentent-du-retard-de-la-recherche-occidentale-risquent-de-tomber-dans-le-piege-tendu-par-l-etat-parti-chinois_6667432_3232.html
#papermills #chine #publishing #science #researchintegrity #ethics #evaluation #libertéAcadémique

Recently published a short tutorial on evaluating cultural diplomacy projects. 🎨🌍

Evaluation in cultural diplomacy isn’t about measuring art itself. It’s about making visible the networks, partnerships, and opportunities that cultural diplomacy creates. Using widely available tools like Calc, Excel or Google Sheets can help teams reflect, learn, and stay accountable.

https://my-site-12f6cf.gitlab.io/portfolio/evaluating_cultural_diplomacy_made_easy/

#CulturalDiplomacy #Evaluation #Monitoring #PublicSector #CulturalManagement #DataDriven

Montreal mayor gives herself an 8 out of 10 on her first 100 days in office
Mayor Soraya Martinez Ferrada made 10 key promises to Montrealers that she said she’d achieve in her first 100 days in office. Well, today she hit that first 100-day milestone, and how did she do? She told reporters this week that she gives herself an eight out of 10.

#politics #evaluation #Montreal
https://www.cbc.ca/news/canada/montreal/montreal-mayor-100-days-9.7100424?cmp=rss

Tejal Patwardhan (@tejalpatwardhan)

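A study newly published in Nature reports new results on AI 'wet lab' evaluation. It appears to address the evaluation of AI models in biological, experiment-based settings, and the research team presumably proposes an evaluation method that combines real experimental data with AI analysis.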

https://x.com/tejalpatwardhan/status/2024636639126102513

#research #ai #nature #evaluation #wetlab

prinz (@deredleritt3r)

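The author opens with ‘Denying the antecedent!’ and refers to a post in which Elon Musk claimed that benchmarks do not matter. While partly agreeing that benchmarks are not everything, the author points out that having nothing that can fully replace them is a problem, and that alternatives or complements to benchmarks are needed.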

https://x.com/deredleritt3r/status/2024545823401660765

#benchmarks #ai #evaluation #elonmusk

Chubby (@kimmonismus)

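The post says the author is still waiting for official benchmark results for Grok 4.20. It expresses anticipation of, or a call for, the release of official benchmarks to verify performance, and asks for an objective evaluation of that version.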

https://x.com/kimmonismus/status/2023846221128052942

#grok #benchmark #evaluation #llm

The briefing also features perspectives from:

👤 Dr. Anne Reinhardt, Ludwig-Maximilians-Universität München

👤 Prof. Dr. Ute Schmid, Otto-Friedrich-Universität Bamberg / Bamberger Zentrum für Künstliche Intelligenz (BaCAI)

👤 Prof. Dr. Kerstin Denecke, Berner Fachhochschule BFH

📄 Read the full German briefing (SMC):
https://www.sciencemediacenter.de/angebote/chatbots-fehlerhafte-kommunikation-bei-gesundheitsfragen-26029

🧾 Nature Medicine paper:
https://www.nature.com/articles/s41591-025-04074-y

#NLP #LLMs #HealthAI #HumanAIInteraction #Evaluation #UKPLab

On #EU project coordination: survey on the impact (#Wirkung) of projects under #H2020 and #HEurope https://ec.europa.eu/eusurvey/runner/HorizonSurvey2026 open until 2026-03-09. #Agrarforschung #Forschung #Evaluation

Edit: submission deadlines extended!

Reminder that the deadlines for the IEEE Engineering Reliable Autonomous Systems Conference 2026 in Zagreb, Croatia (May 28-29, just before ICRA in Vienna) are coming up!

March 7: Regular and short papers
March 7: Workshop and tutorial proposals
April 7: Late-breaking reports

Stakeholders across all autonomous system domains and practices are welcome!

https://2026-erasrobotics.org/index.html

#verification #robotics #autonomy #Conference #evaluation #testing #IEEE #cfp #zagreb #specification #autonomoussystems #reliability #eras2026 #reliablesystems

Ivan Fioravanti ᯅ (@ivanfioravanti)

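The author notes that RepoBench reflects large-context reasoning, instruction following, and file-editing precision more than it measures a model's coding ability itself, and comments that newer models sometimes appear weaker than earlier ones. A link to RepoPrompt's benchmark page is shared as well.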

https://x.com/ivanfioravanti/status/2023444897806848112

#repoprompt #repobench #benchmark #llm #evaluation

Latent.Space (@latentspacepod)

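A comment on benchmarks, taking the view that public external benchmarks in particular are useful but have a shelf life. The claim is that the best benchmarks start with initial scores of around 10 to 30%, leaving room for improvement and thereby stimulating research and improvement work.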

https://x.com/latentspacepod/status/2023306359132061992

#benchmarking #evaluation #ml #aibenchmarks

Chubby (@kimmonismus)

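The author announces that, after being notified that the DeepSeek v4 evaluation results were fake, they deleted the post in question and issued a correction. By correcting an erroneous evaluation claim, the notice draws attention to reliability issues in research and model evaluation.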

https://x.com/kimmonismus/status/2023148930306109486

#deepseek #evaluation #retraction #researchintegrity

Sam Altman (@sama)

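An assessment that, within a few years, AI systems that struggled even with elementary-school math have become able to solve research-level math problems. The author agrees that Jakub's evaluation is currently the most important one, and says he expects the public reaction to be along the lines of 'that's not so hard'.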

https://x.com/sama/status/2022729068949717182

#ai #research #math #evaluation

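Jakub Pachocki (@merettm)

Expresses anticipation for the "First Proof" challenge and stresses the importance of frontier research for evaluating the capabilities of next-generation AI models. Mentions the results of running an internal model on the 10 proposed problems under limited human supervision.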

https://x.com/merettm/status/2022517085193277874

#airesearch #benchmark #evaluation #challenge

📻 [ #notation and #évaluation of #fonctionnaires (rating and evaluation of civil servants) ] 🚨 The recordings of the February 6 session of *Dialogues autour de la fonction publique*, with Hélène Guillet, Jean-Francois Verdier, Jean Le Bihan & Pierre Karila-Cohen, are online 👉 https://compter.hypotheses.org/3071