#TextProcessing

2026-02-05

@janfrode

I wouldn't trust an LLM not to be generating based upon other already-published unencoded stuff.

A less expensive, and far more trustworthy, way to decode it is to just pipe the encoded body through gbase64 -d and then iconv -f CP1252.
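
For anyone without those tools to hand, the same two steps can be sketched in Python. This is a minimal sketch, assuming the body is plain base64 and the decoded bytes really are CP1252; `decode_body` is a made-up helper name, not part of any tool mentioned here.

```python
import base64


def decode_body(encoded: str) -> str:
    """Base64-decode the text, then interpret the resulting bytes as CP1252."""
    raw = base64.b64decode(encoded)  # the `gbase64 -d` step
    return raw.decode("cp1252")      # the `iconv -f CP1252` step


# "Y2Fm6Q==" is the base64 form of the CP1252 bytes for "café".
print(decode_body("Y2Fm6Q=="))  # café
```

By default `base64.b64decode` skips characters outside the base64 alphabet, so line breaks in a wrapped mail body should not need stripping first.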

#PeterMandelson #UKPolitics #EpsteinFiles #TextProcessing #AIs #LLMs

Jan :rust: :ferris:janriemer@floss.social
2026-01-11

Speech and Language Processing (3rd ed. draft) - by Dan Jurafsky and James H. Martin (Stanford):

web.stanford.edu/~jurafsky/slp

#NLP #TextProcessing #AI #Algorithms

Jan :rust: :ferris:janriemer@floss.social
2025-11-01

sentencex - by Wikimedia:

github.com/wikimedia/sentencex

A sentence segmentation library with wide language support optimized for speed and utility.

Written in #Rust.

Bindings are available for #Python, #NodeJS and #WASM

Might be useful for my #SpeechToText system! 👀

#NLP #TextProcessing #Segmentation #RustLang

Dyalogdyalog
2025-10-20

2013-03: Write a function that returns the number of words in the given character scalar or vector (see apl.quest/2013/3/ to test your solution and view ours).
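
For comparison outside APL, here is a rough Python sketch of the same task. It assumes that words are runs of non-space characters separated by one or more spaces, which is my reading of the problem rather than its official statement; `count_words` is a hypothetical name.

```python
def count_words(text: str) -> int:
    """Count runs of non-space characters, ignoring repeated or edge spaces."""
    # Splitting on a single space leaves empty strings wherever spaces repeat,
    # so only the non-empty pieces are counted.
    return sum(1 for piece in text.split(" ") if piece)


print(count_words("Testing one, two, three"))  # 4
print(count_words("   "))                      # 0
```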

2025-05-09

Building on the 90s, statistical n-gram language models, trained on vast text collections, became the backbone of NLP research. They fueled advancements in nearly all NLP techniques of the era, laying the groundwork for today's AI.

F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA

#NLP #LanguageModels #HistoryOfAI #TextProcessing #AI #historyofscience #ISE2025 @fizise @fiz_karlsruhe @tabea @enorouzi @sourisnumerique

Slide from Information Service Engineering 2025, Lecture 02, Natural Language Processing 01, A Brief History of NLP, NLP timeline. The timeline is located in the middle of the slide from top to bottom. The pointer on the timeline indicates the 1990s. On the left, the formula for the conditional probability of a word following a given series of words is shown. Below, an AI-generated portrait of William Shakespeare is displayed with 4 speech bubbles, representing artificially generated text based on 1-grams, 2-grams, 3-grams and 4-grams. The 4-gram text example looks a lot like original Shakespeare text. On the right side the following text is displayed:
N-grams for statistical language modeling were introduced and popularised by Frederick Jelinek and Stanley F. Chen from IBM Thomas J. Watson Research Center, who developed efficient algorithms and techniques for estimating n-gram probabilities from large text corpora for speech recognition and machine translation.

Bibliographical reference:
F. Jelinek (1997), Statistical Methods for Speech Recognition, MIT Press, Cambridge, MA.
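
The conditional probability mentioned on the slide can be illustrated with a toy bigram model. This is only a sketch under simple assumptions (whitespace tokenisation, plain maximum-likelihood counts, no smoothing); it is not taken from the lecture, and `bigram_probabilities` is a hypothetical name.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1 used as a context)."""
    context_counts = Counter(tokens[:-1])           # every token that has a successor
    pair_counts = Counter(zip(tokens, tokens[1:]))  # adjacent word pairs
    return {
        pair: count / context_counts[pair[0]]
        for pair, count in pair_counts.items()
    }


tokens = "the cat sat on the mat".split()
probs = bigram_probabilities(tokens)
print(probs[("the", "cat")])  # 0.5, because "the" is followed once by "cat" and once by "mat"
```

Longer n-grams condition on more preceding words in the same way, which is why the 4-gram sample on the slide reads closest to the original text.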
Stephandurchaus
2025-04-19

You can pass variables to an awk script with the -v option. This is useful, for example, when you want to include the file name in the output:

```
find . -type f -iname '*.csv' -exec awk -F, -v filename={} '{print filename, $2}' {} \;
```
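
For readers more at home in Python, roughly the same result can be sketched as below. It assumes comma-separated files and uses the `csv` module, so quoted fields containing commas are handled differently from awk's plain `-F,` split; `second_columns` is a hypothetical helper name.

```python
import csv
from pathlib import Path


def second_columns(root):
    """Yield (file name, second field) for each row of each *.csv file under root."""
    for path in sorted(Path(root).rglob("*")):
        # Case-insensitive match on the extension, like find's -iname.
        if path.is_file() and path.suffix.lower() == ".csv":
            with path.open(newline="") as handle:
                for row in csv.reader(handle):
                    # Rows with fewer than two fields give an empty value, as $2 does in awk.
                    yield str(path), (row[1] if len(row) > 1 else "")
```

Looping over `second_columns(".")` and printing each pair gives output comparable to the one-liner above.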

Even though seemingly awkward at first glance, awk is definitely one of the most versatile and useful tools.

N-gated Hacker Newsngate
2025-04-14

🚀 Behold the epic tale of Janet's , where the author heroically excludes regular expressions like they're yesterday's news. 💥 Marvel at the labyrinth of magic that claims to be more readable, but only if you have a PhD in arcane text processing. 📜✨
bakpakin.com/writing/how-janet

Holle Medinghmeding
2025-03-06

🔠 Panel: More than Chatbots: Multimodal Large Language Models in Humanities Workflows

At #dhd2025, Nina Rastinger explores how well GPT handles abbreviations & NER:

✅ NER works well, even with small, low-cost models
❌ Abbreviations are tricky—costs & resource demands skyrocket
🚀 GPT o1 improves performance, even on abbreviations, but remains resource-intensive
Balancing accuracy & efficiency in text processing remains a challenge! ⚖️

Nina Rastinger at Panel More than Chatbots: Multimodal Large Language Models in Humanities Workflows #dhd2025
2025-02-11

Master regular expressions for efficient substring extraction in Python! Learn greedy vs. non-greedy matching & capturing groups for precise results.
tech-champion.com/data-science
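
A small illustration of the three ideas named above, using Python's standard `re` module. The sample string is invented for the example and is not taken from the linked article.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ grabs as much as possible, so one match spans the whole string.
greedy = re.findall(r"<.+>", text)
print(greedy)  # ['<b>bold</b> and <i>italic</i>']

# Non-greedy: .+? stops at the first closing bracket, giving one match per tag.
lazy = re.findall(r"<.+?>", text)
print(lazy)  # ['<b>', '</b>', '<i>', '</i>']

# Capturing groups: pull out the tag name and its contents; \1 refers back to group 1.
pairs = re.findall(r"<(\w+)>(.*?)</\1>", text)
print(pairs)  # [('b', 'bold'), ('i', 'italic')]
```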

Pragmatic Bookshelf 📚pragprog@techhub.social
2025-01-23

New at PragProg

Staffan Nöteberg helps you really understand how the machinery works under the hood. Learn advanced tools like reluctant quantifiers, lookbehind and nondeterministic finite automata to write efficient and elegant regexes with ease.

In this illustrated guide, you gain precisely that understanding, even with no prior knowledge of regular expressions.

pragprog.com/titles/d-snrem

@staffannoteberg

#regularexpressions #patternmatching #regex #regexp #textprocessing

Grumpy Old Techie 🕊️grumpyoldtechie@hostux.social
2024-10-21

Maybe I’m just growing really old. Today I stumbled across a GitHub repository with a few hundred lines of Python that could be replaced by one or two awk, sed or grep one-liners.
Seriously, if you are using Linux or one of the BSDs, learn how to use the standard text utilities that come with the OS.
In modern times jq should be added to the traditional list.

#text #python #unix #linux #BSD #textprocessing

Discovered a neat new tool last week: github.com/wr7/refold

It's similar to `fmt` and `fold` except that it automatically handles prefixes. Vim/Neovim `gq` can do this out of the box but fails (for me at least) when multiple prefixes are present, such as a Markdown block-quote inside Rust comments. E.g.

```
// > Some quoted text
// > to reflow.
```

`refold` handles this.
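
To make the idea concrete, here is a rough Python sketch of prefix-aware reflowing. It is not how `refold` is implemented; it assumes every line in the block shares the same prefix, recognises only `//`, `#` and `>` markers, and the names `PREFIX` and `reflow` are made up for the example.

```python
import re
import textwrap

# One or more comment/quote markers, each optionally followed by a single space.
PREFIX = re.compile(r"^(?:\s*(?://|#|>)\s?)+")


def reflow(lines, width=40):
    """Re-wrap a block of lines to `width`, repeating the prefix of the first line."""
    match = PREFIX.match(lines[0])
    prefix = match.group(0) if match else ""
    # Strip the shared prefix where present, then join everything into one paragraph.
    parts = [
        (line[len(prefix):] if line.startswith(prefix) else line).strip()
        for line in lines
    ]
    body = " ".join(part for part in parts if part)
    return textwrap.wrap(
        body, width=width, initial_indent=prefix, subsequent_indent=prefix
    )


block = ["// > Some quoted text", "// > to reflow."]
print(reflow(block, width=20))  # ['// > Some quoted', '// > text to reflow.']
```

Both markers, the Rust comment and the Markdown block-quote, are treated as one combined prefix and repeated on every wrapped line.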

#Rust #RustLang #TextProcessing #TextManipulation #TextEditor

2024-08-24

Getting ready to run an online introductory XSLT course for people writing or maintaining stylesheets.

#XSLT #XML #Schematron #XSpec #declarative #functionalProgramming #textProcessing #digitalHumanities #JATS

Hello!

I am pleased to announce a new version of my "CLI text processing with GNU Coreutils" ebook. This ebook will help you learn 20+ specialized text processing commands provided by the coreutils package.

Links:

* Free PDF/EPUB: learnbyexample.gumroad.com/l/c (till 10-Apr-2024)
* Web version: learnbyexample.github.io/cli_t
* Markdown source, exercise solutions, etc: github.com/learnbyexample/cli_
* Short video about the book: youtu.be/oCnJLu_PUbY
* Interactive TUI app: github.com/learnbyexample/TUI- (includes some coreutils exercises)

I would highly appreciate it if you'd let me know how you felt about this book. It could be anything from a simple thank you, pointing out a typo, mistakes in code snippets, which aspects of the book worked for you (or didn't!) and so on. Reader feedback is essential and especially so for self-published authors.

Happy learning :)

#linux #cli #coreutils #textprocessing

Cover image of my "CLI text processing with GNU Coreutils" ebook featuring a cute panda.

Highlights that there are 200+ examples and 100+ exercises.
🔏 Matthias Wiesmannthias
2024-03-01

An old blog post I wrote, which I had nearly forgotten. «Falsehoods programmers believe about text»

wiesmann.codiferes.net/wordpre
