I wouldn't trust an LLM not to be generating based upon other already-published unencoded stuff.

A less expensive, and far more trustworthy, way to decode it is to just pipe the encoded body through `gbase64 -d` and then `iconv -f CP1252`.
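
For anyone without those tools to hand, the same two steps can be sketched with Python's standard library. This is only an illustration, assuming the body really is base64-encoded CP1252 text; the function name `decode_body` and the sample string are made up for the example.

```python
import base64


def decode_body(encoded: str) -> str:
    """Base64-decode the body, then interpret the bytes as Windows-1252."""
    raw = base64.b64decode(encoded)
    return raw.decode("cp1252")


# Hypothetical sample: round-trip a short CP1252 string.
sample = base64.b64encode("café £5".encode("cp1252")).decode("ascii")
print(decode_body(sample))
```

Like the shell pipeline, this is deterministic: the output depends only on the encoded bytes, not on anything already published.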

#PeterMandelson #UKPolitics #EpsteinFiles #TextProcessing #AIs #LLMs

Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):

https://web.stanford.edu/~jurafsky/slp3/

#NLP #TextProcessing #AI #Algorithms

sentencex - by Wikimedia:

https://github.com/wikimedia/sentencex

A sentence segmentation library with wide language support optimized for speed and utility.

Written in #Rust.

Bindings are available for #Python, #NodeJS and #WASM

Might be useful for my #SpeechToText system! 👀

#NLP #TextProcessing #Segmentation #RustLang

#APLQuest 2013-03: Write a function that returns the number of words in the given character scalar or vector (see https://apl.quest/2013/3/ to test your solution and view ours). #APL #WordCount #TextProcessing
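
Not an APL answer, but for comparison, a rough Python sketch of the same task. It assumes a word is any run of non-whitespace characters, which may not match the problem's exact rules; `count_words` is a name chosen for this example.

```python
def count_words(text: str) -> int:
    """Count runs of non-whitespace characters."""
    # str.split() with no argument drops leading, trailing and repeated whitespace.
    return len(text.split())


print(count_words("  hello   world "))
```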

LLMs are getting better at character-level text manipulation

https://blog.burkert.me/posts/llm_evolution_character_manipulation/

#HackerNews #LLMs #CharacterManipulation #TextProcessing #AIInnovation #MachineLearning

The palindrome problem – Unicode edition

https://wiesmann.codiferes.net/wordpress/archives/41500

#C++ #CodePoints #GraphemeClusters #java #Javascript #ProgrammingLanguage #Python #Swift #TextProcessing #Unicode
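
To make the code point versus grapheme cluster distinction in those tags concrete, here is a small Python sketch (not taken from the linked post). It assumes NFC normalization is enough for the sample word; that holds only when a precomposed character exists, and full grapheme cluster segmentation needs more than the standard library offers. Both function names are invented for the example.

```python
import unicodedata


def is_palindrome_codepoints(text: str) -> bool:
    """Naive check: reverse the string code point by code point."""
    return text == text[::-1]


def is_palindrome_nfc(text: str) -> bool:
    """Compose characters (NFC) first, then compare code points."""
    composed = unicodedata.normalize("NFC", text)
    return composed == composed[::-1]


# "été" written with combining acute accents (e + U+0301).
decomposed = "e\u0301te\u0301"
print(is_palindrome_codepoints(decomposed))  # False: accents detach when reversed
print(is_palindrome_nfc(decomposed))         # True: each é becomes one code point
```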

Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.

F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA

#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique

Slide from Information Service Engineering 2025, Lecture 02, Natural Language Processing 01, A Brief History of NLP, NLP timeline. The timeline is located in the middle of the slide from top to bottom. The pointer on the timeline indicates 1990s. On the left, the formula for the conditional probability of a word following a given series of words is shown. Below, an AI generated portrait of William Shakespeare is displayed with 4 speech bubbles, representing artificially generated text based on 1-grams, 2-grams, 3-grams and 4-grams. The 4-grams text example looks a lot like original Shakespeare text. On the right side, the following text is displayed:
N-grams for statistical language modeling were introduced and popularised by Frederick Jelinek and Stanley F. Chen from IBM Thomas J. Watson Research Center, who developed efficient algorithms and techniques for estimating n-gram probabilities from large text corpora for speech recognition and machine translation.

Bibliographical reference:
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA.
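
The conditional probability the slide describes can be illustrated for the simplest case, bigrams. This is a minimal sketch using a plain maximum-likelihood estimate with no smoothing, not the estimation techniques from the cited book; the function name and the toy sentence are assumptions for the example.

```python
from collections import Counter


def bigram_probabilities(tokens: list[str]) -> dict[tuple[str, str], float]:
    """Maximum-likelihood estimate P(w2 | w1) = count(w1, w2) / count(w1, *)."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    # How often each word occurs as the first element of a bigram.
    contexts = Counter(first for first, _ in bigrams.elements())
    return {pair: n / contexts[pair[0]] for pair, n in bigrams.items()}


tokens = "to be or not to be".split()
probs = bigram_probabilities(tokens)
print(probs[("to", "be")])  # 1.0: "to" is always followed by "be" here
```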

#TIL you can pass variables to an #awk script with the option -v. This is useful, for example, when you want to include the file name in the output:

```
find . -type f -iname '*.csv' -exec awk -F, -v filename={} '{print filename, $2}' {} \;
```

Even though seemingly awkward at first glance, #awk is definitely one of the most versatile and useful tools on #linux.

#bash #commandline #shell #programming #textprocessing

🚀 Behold the epic tale of Janet's #PEG #module, where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of #parsing magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
https://bakpakin.com/writing/how-janets-peg-works.html #Janet #readability #textprocessing #regex #HackerNews #ngated

Once again, keyword matching to the rescue…

#textprocessing

https://www.oregonlive.com/nation/2025/03/photo-of-enola-gay-aircraft-among-26000-images-flagged-for-removal-in-pentagons-dei-purge.html

🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows

At #DHd2025, Nina Rastinger explores how well #AI handles abbreviations & NER:

✅ NER works well, even with small, low-cost models
❌ Abbreviations are tricky—costs & resource demands skyrocket
🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive
Balancing accuracy & efficiency in text processing remains a challenge! ⚖️

#AI #NER #TextProcessing #DigitalHumanities

Nina Rastinger at Panel More than Chatbots: Multimodal Large Language Models in Humanities Workflows #dhd2025

Master regular expressions for efficient substring extraction in Python! Learn greedy vs. non-greedy matching & capturing groups for precise results. #Regex #Python #StringExtraction #Tutorial #Programming #TextProcessing
https://tech-champion.com/data-science/regex-string-extraction-in-python-mastering-regular-expressions-for-substring-search
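
A quick illustration of the greedy versus non-greedy difference with capturing groups, using Python's `re` module. The sample string is invented for the example and is not taken from the linked tutorial.

```python
import re

html = "<b>first</b> and <b>second</b>"

# Greedy: .* grabs as much as possible, so the group runs to the last </b>.
greedy = re.search(r"<b>(.*)</b>", html).group(1)

# Non-greedy: .*? stops at the first </b> it can.
lazy = re.search(r"<b>(.*?)</b>", html).group(1)

# findall returns the captured group for every match.
all_lazy = re.findall(r"<b>(.*?)</b>", html)

print(greedy)    # first</b> and <b>second
print(lazy)      # first
print(all_lazy)  # ['first', 'second']
```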

Ummmm

#linuxadmin #emacs #tool #opensource #lisp #textprocessing

https://sites.google.com/site/steveyegge2/the-emacs-problem

New at PragProg

Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant quantifiers, lookbehind, and nondeterministic finite automata to write efficient and elegant regexes with ease.

In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of regular expressions.

http://pragprog.com/titles/d-snrem

@staffannoteberg

#regularexpressions #patternmatching #regex #regexp #textprocessing

Maybe I’m just growing really old. Today I stumbled across a GitHub repository with a few hundred lines of Python that could be one or two awk, sed or grep one-liners.
Seriously, if you are using Linux or one of the BSDs, learn how to use the standard text utilities that come with the OS.
In modern times, jq should be added to the traditional list.

#text #python #unix #linux #BSD #textprocessing

Discovered a neat new tool last week: https://github.com/wr7/refold

It's similar to `fmt` and `fold` except that it automatically handles prefixes. Vim/Neovim `gq` can do this out of the box but fails (for me at least) when multiple prefixes are present, such as a Markdown block-quote inside Rust comments. E.g.

```
// > Some quoted text
// > to reflow.
```

`refold` handles this.
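
The general idea of prefix-aware reflowing can be sketched in a few lines of Python. This is not how `refold` is implemented, just an illustration: it assumes every line of the block shares the prefix found on the first line, and the prefix pattern (`//`, `#`, `>`) and the `reflow` name are choices made for the example.

```python
import re
import textwrap

# One or more comment or quote markers, each optionally followed by a space.
PREFIX_RE = re.compile(r"^(?:\s*(?://|#|>)\s?)+")


def reflow(block: str, width: int = 40) -> str:
    """Strip the shared prefix, rewrap the text, and put the prefix back."""
    lines = block.splitlines()
    if not lines:
        return ""
    match = PREFIX_RE.match(lines[0])
    prefix = match.group(0) if match else ""
    text = " ".join(
        line[len(prefix):] if line.startswith(prefix) else line
        for line in lines
    )
    return textwrap.fill(
        text, width=width, initial_indent=prefix, subsequent_indent=prefix
    )


print(reflow("// > Some quoted text\n// > to reflow.", width=80))
```

With a narrower `width`, each wrapped line still starts with the same `// > ` prefix.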

#Rust #RustLang #TextProcessing #TextManipulation #TextEditor

Getting ready to run an online introductory XSLT course for people writing or maintaining stylesheets.

#XSLT #XML #Schematron #XSpec #declarative #functionalProgramming #textProcessing #digitalHumanities #JATS

Understanding the Flux Framework: A Comprehensive Guide

https://zurl.co/xJYU

#FluxFramework
#AIArchitecture
#MachineLearning
#NeuralNetworks
#Multimodal
#TextProcessing
#ImageGeneration
#DeepLearning
#AIResearch
#TechInnovation

Hello!

I am pleased to announce a new version of my "CLI text processing with GNU Coreutils" ebook. This ebook will help you learn 20+ specialized text processing commands provided by the coreutils package.

Links:

* Free PDF/EPUB: https://learnbyexample.gumroad.com/l/cli_coreutils (till 10-Apr-2024)
* Web version: https://learnbyexample.github.io/cli_text_processing_coreutils/
* Markdown source, exercise solutions, etc: https://github.com/learnbyexample/cli_text_processing_coreutils
* Short video about the book: https://youtu.be/oCnJLu_PUbY
* Interactive TUI app: https://github.com/learnbyexample/TUI-apps/tree/main/CLI-Exercises (includes some coreutils exercises)

I would highly appreciate it if you'd let me know how you felt about this book. It could be anything from a simple thank you, pointing out a typo, mistakes in code snippets, which aspects of the book worked for you (or didn't!) and so on. Reader feedback is essential and especially so for self-published authors.

Happy learning :)

#linux #cli #coreutils #textprocessing

Cover image of my "CLI text processing with GNU Coreutils" ebook, featuring a cute panda.

Highlights that there are 200+ examples and 100+ exercises.

An old blog post I wrote, which I had nearly forgotten. «Falsehoods programmers believe about text»

https://wiesmann.codiferes.net/wordpress/archives/30296

#unicode #ansi #textprocessing
