Lmst

Deep Research without Deep Pockets.

I pulled apart the premium “Deep Research” tools to see what they actually do, then built the same workflow locally without needing a huge model or huge spend.

The trick: make the pipeline do the hard work (search + reduce + evidence), so the LLM mostly just writes.

Part 5 of the DocSummarizer series: https://www.mostlylucid.net/blog/doomsummarizer-deep-research

What’s your best technique for reducing “model made it up” without just throwing a bigger model at it?
#rag #llm #deepresearch #ai #lucene

The upcoming #OpenSearchCon Europe will have a designated "Search & Apache Lucene" track.
If you're involved in the Apache #Lucene project, or the broader search and relevancy ecosystem, I encourage you to consider submitting a talk proposal to share your experience.
The conference will take place 16-17 April in Prague.
The CFP is open until 18th January: https://events.linuxfoundation.org/opensearchcon-europe/program/cfp/

@OpenSearchProject @theasf #OpenSearch #search

@madduci @Lilith

That is essentially #elasticsearch, and that in turn is a wrapper, aka a UI, around #lucene

@Lilith

#solr and #lucene have essentially been around since 2004.

Speaking of performance, I knew my plugin wouldn't be fast for any large repo - variously it indexed the repo into lucene then walked different branches to find containment. It's been too long for me to be specific about O-ness but it was definitely viable for small repos.

#jira #git #lucene

@parttimenerd That's an interesting approach, thanks a lot for sharing!

I also toyed with a similar idea a while back: https://binjr.eu/blog/2023/08/new-data-adapter-jdk-flight-recorder/
With that said, there are some differences in the approach I took over the one you discussed in your post.
For one, I opted to use an inverted index (#Lucene) instead of a relational DB as my backend, which comes with its own trade-offs, like offering a query language that is somewhat easier to use, but not nearly as powerful.
The other main difference is that the route I used to get there is kinda the opposite of the one you took: while you went from the backend working your way up to the UI, I very much started there (as I already had it) and worked my way down.
Doing things this way around meant that I could benefit immediately from the UI features that were there already (which was the whole point, of course), but it makes integrating new ones that don't fit so naturally with the rest of the tool much more time-consuming...

At any rate, I would love to hear your thoughts if you find the time to give it a try!
(you can get it here: https://github.com/binjr/binjr/releases)

🚀 Introducing Lucene-on-Faiss

⚡ 2x boost in search throughput
💡 Decrease memory limitations

📖 Blog: https://opensearch.org/blog/lucene-on-faiss-powering-opensearchs-high-performance-memory-efficient-vector-search/?ajs_aid=d47608d2-1716-4230-91b0-66101998e898

#OpenSearch #VectorSearch #Lucene #Faiss #AI #GenerativeAI #ANN #SearchTech

SQL vs NoSQL: Choosing the right database for your project

One of the most fundamental and critical decisions when building a modern application is the choice of data storage technology.

#DST #DSTGlobal #ДСТ #ДСТГлобал #DSTplatform #ДСТПлатформ #базаданных #SQL #NoSQL #РСУБД #СУБД #PostgreSQL #Redis #MongoDB #JSON #BJSON #WordPress #Drupal #DLE #BigData #Oracle #Database #Microsoft #SQLServer #ACID #Cassandra #Elasticsearch #Apache #Lucene

Source: https://dstglobal.ru/club/1101-sql-vs-nosql-vybor-podhodjaschei-bazy-dannyh-dlja-vashego-proekta


#INTERLIS made easy, part 52 - News from the Modelfinder: https://blog.sogeo.services/blog/2025/07/21/interlis-leicht-gemacht-number-52.html #Java #SpringBoot #GraalVM #jte #htmx #Lucene

Devoxx Poland is just a couple of days away!
Join my talk Wednesday at the Data & AI track to learn about the #OpenSearch project, and how it can provide you search, analytics, observability and vector database capabilities, all #opensource @linuxfoundation
👉 https://devoxx.pl/talk-details/?id=8605

#data #ai #developers #search #analytics #vectordb #observability #lucene #devoxx #devoxxpl #DevoxxPoland

#OpenSearch 3.0 is out! 🍾 🥳
After 3 years of 2.x, it's time for the next leap, which brings major upgrades to performance, data management, #vectorDB functionality, and much more.
📈 Upgrade to Apache #Lucene 10 and #JDK 21+
📈 Pull-based ingestion for streaming data, with support for Apache #Kafka and Amazon #Kinesis
📈 Power agentic #AI with native #MCP support
📈 Investigate logs with expanded PPL query tools, backed by Apache #Calcite

Check out @OpenSearchProject blog:
https://opensearch.org/blog/unveiling-opensearch-3-0/

If you are curious about the inner workings of #cassandra, #debezium, #druid, #elasticsearch, #lucene, #kafka, #neo4j, or #spark then check out https://glennengstrand.info/software/opensource/analysis which presents a static code analysis of these eight open source giants.

nvidia GTC is coming to the bay area next week. we'll be there with a
* talk about bringing #lucene to the GPU
* a "guess that prompt" meetup between galileo + UnstructuredIO + elastic. join us to outsmart AI ;)
https://lu.ma/guess-that-prompt

A shard is a #lucene instance that runs on a node, that's part of a cluster, and is replicated for fault tolerance.

If that didn't make 100% sense - we now have a 10 minute video explaining #elastic infrastructure.

(it's basically a super efficient library with librarians on roller skates?)

https://www.youtube.com/watch?v=sAySPSyL2qE

#lucene 9 end of life cleanup 💥
https://github.com/apache/lucene/pull/13882

#BSI WID-SEC-2024-3313: [NEW] [high] #Apache #Lucene: Vulnerability allows code execution

A remote, anonymous attacker can exploit a vulnerability in Apache Lucene to execute arbitrary code.

https://wid.cert-bund.de/portal/wid/securityadvisory?name=WID-SEC-2024-3313

Just blogged about "How to sort items by a #custom date in an #Umbraco v13+ #Examine Index".

https://www.nurhakkaya.com/2024/10/how-to-sort-items-by-custom-date-in.html

#lucene #search #umbraco #examine #opensource #hacktoberfest #contribution

A deep dive into Apache Lucene: index architecture, search execution, and data replication

This is a translation of an article from my blog about the architecture of Apache Lucene, one of the best-known libraries implementing a search index. Elasticsearch and Solr, the widely known scalable search solutions, use this library under the hood. I work on building search solutions for e-commerce and run into this library constantly in my daily work. Apache Lucene implements most of the functionality needed to build a search engine: it starts with tokenization, which extracts the canonical forms of words as tokens, continues with a full implementation of an inverted index, and ends with near-real-time segment replication. The number of practically useful features implemented over the library's two decades of existence is enormous. The library integrates knowledge from linguistics, mathematics, and computer science.

Inverted index

Apache Lucene implements an inverted index architecture. At the implementation level, a logical index contains a collection of immutable segments stored as files in the file system. Each segment is itself an inverted index. Such an index is a dictionary data structure with terms as keys and postings as values. A posting is a list of document identifiers together with the number of occurrences of the term in each document. This dictionary uses Finite State Transducers (FST) [1] for term lookup, which can be thought of as something similar to sorted skip lists [2]. This sorted navigation map is the cornerstone of efficient search over huge volumes of documents.

Lucene is also very memory-efficient. Among other algorithms, it uses delta encoding to compress the document identifiers in postings [3]. Put simply, the idea behind this compression is to sort the list of integers and store the deltas between them. This also improves disk I/O performance.
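
To make the two ideas in the post above concrete (a term dictionary whose values are postings, and storing sorted document IDs as gaps), here is a minimal Python sketch. It is a toy model only: the function names `build_inverted_index`, `delta_encode`, and `delta_decode` are invented for illustration, a plain dict stands in for the FST-based term dictionary, and the whitespace tokenizer is an assumption. It does not reproduce Lucene's actual on-disk format.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings.

    Toy model: a plain dict replaces the FST term dictionary, and the
    tokenizer is a naive lowercase whitespace split (an assumption).
    """
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(freqs.items()) for term, freqs in index.items()}


def delta_encode(doc_ids):
    """Sort the IDs, keep the first one, then store the gap to each next ID."""
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def delta_decode(gaps):
    """Rebuild the sorted ID list by taking a running sum of the gaps."""
    ids, total = [], 0
    for gap in gaps:
        total += gap
        ids.append(total)
    return ids


docs = [
    "lucene builds an inverted index",
    "an index maps terms to documents",
    "lucene stores segments as files",
]
index = build_inverted_index(docs)
print(index["lucene"])   # postings for the term "lucene"

doc_ids = [3, 7, 8, 15, 1000]
gaps = delta_encode(doc_ids)
print(gaps)              # small gaps instead of large absolute IDs
print(delta_decode(gaps) == doc_ids)
```

The gaps are usually much smaller than the absolute IDs, so they fit in fewer bits once a variable-length or bit-packed integer encoding is applied on top, which is where the space and I/O savings come from.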

https://habr.com/ru/articles/852666/

#lucene #поиск #поисковые_системы #лингвистика

#lucene 10 is out: https://lucene.apache.org/core/corenews.html#apache-lucenetm-1000-available
and the #elasticsearch upgrade is already kicking off, starting the countdown for version 9: https://github.com/elastic/elasticsearch/pull/114741