
What's your favorite Python SEO crawler?

According to The Google, there seems to be one dominant option.

python3 -m pip install advertools

If you dig deeper, you'll find two other crawlers:
- One for status codes and all response headers
- One for downloading images from a list of URLs
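
A rough sketch of how the three crawlers might be called. The URLs, the output names, and the jl_path helper are made up for illustration, and the crawl_images arguments are an assumption to check against the docs for your installed version.

```python
from pathlib import Path


def jl_path(name: str) -> str:
    """Return an output file name with the .jl (jsonlines) extension."""
    return str(Path(name).with_suffix(".jl"))


def run_crawlers(urls: list[str]) -> None:
    # Imported here so the helper above works without advertools installed.
    import advertools as adv

    # Full SEO crawl of the listed pages only (no link following).
    adv.crawl(urls, jl_path("pages"), follow_links=False)
    # Status codes and response headers only.
    adv.crawl_headers(urls, jl_path("headers"))
    # Download images found on the given pages (assumed signature).
    adv.crawl_images(urls, output_dir="images")
```

Calling run_crawlers(["https://example.com"]) would then write pages.jl, headers.jl, and an images folder.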

#advertools
#SEO
#DataScience
#Python

Using Python for doing SEO, versus using Python to develop software/tools for SEO.

The first activity is called SEO.

The second one is called software development.

Important difference.

You can use Python for crawling (try one of the #advertools crawlers), analyzing log files (also advertools), XML sitemaps (yes, yes, advertools), running bulk robots.txt tests, weighted n-grams, and much more. These are SEO tasks. Running them in bulk, using a programming language and its powerful

1/4

Using a proxy while crawling

This is another feature of using the meta parameter while crawling with #advertools.

It's as simple as providing a proxy URL.
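
A minimal sketch, assuming the "proxy" key (Scrapy's standard request meta key) is passed through as-is by the crawl function. The proxy URL and both helper names are placeholders, not part of advertools.

```python
def proxy_meta(proxy_url: str) -> dict:
    """Build the meta dictionary that routes requests through a proxy."""
    if not proxy_url.startswith(("http://", "https://")):
        raise ValueError("proxy_url should start with http:// or https://")
    return {"proxy": proxy_url}


def crawl_with_proxy(urls: list[str], output_file: str, proxy_url: str) -> None:
    # Imported lazily; the meta parameter needs advertools>=0.16.
    import advertools as adv

    adv.crawl(urls, output_file, meta=proxy_meta(proxy_url))


# Placeholder proxy address, replace with your own.
meta = proxy_meta("http://user:password@proxy.example.com:8080")
```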

There is also a link to using rotating proxies, if you're interested:

https://bit.ly/3SXh8b8

#crawling #scraping #scrapy #proxy

Happy to share a new release of #advertools v0.16

This release adds a new parameter "meta" to the crawl function.

Options to use it:

🔵 Set arbitrary metadata about the crawl
🔵 Set custom request headers per URL
🔵 Limited support for crawling some JavaScript websites
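
A hedged sketch of the first two options. The "purpose" key is an arbitrary example, and "custom_headers" is assumed to be the special key for per-URL headers, so please check the linked docs before relying on it. The URL and header values are made up.

```python
def build_meta(purpose: str, headers_by_url: dict[str, dict[str, str]]) -> dict:
    """Combine free-form crawl metadata with per-URL request headers."""
    return {
        "purpose": purpose,  # arbitrary metadata, the key name is illustrative
        "custom_headers": headers_by_url,  # assumed special key, see the docs
    }


headers_by_url = {"https://example.com/": {"Accept-Language": "en"}}
meta = build_meta("pre-launch check", headers_by_url)

# With advertools>=0.16 installed, the crawl call would look like:
# adv.crawl(list(headers_by_url), "output.jl", meta=meta)
```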

Details and example code:

https://bit.ly/3SXh8b8

#SEO #crawling #scraping #python #DataScience #advertools #scrapy

Evergreen crawling: tracking updated content

One of the things you can do is focus on a certain set of pages, and check key changes.

In this example, you'll see when and which sponsors were added to the sponsors page on BrightonSEO's website. This is based on parsed data from the page's JSON-LD.
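
One way to sketch the idea, assuming each crawl gives you a date and the set of names parsed from the JSON-LD. The sponsor names and dates are invented, and first_seen is a hypothetical helper.

```python
from datetime import date


def first_seen(snapshots: dict[date, set[str]]) -> dict[str, date]:
    """Map each item to the earliest crawl date on which it appeared."""
    seen: dict[str, date] = {}
    for day in sorted(snapshots):
        for name in snapshots[day]:
            # setdefault keeps the earliest date because days are sorted.
            seen.setdefault(name, day)
    return seen


# Made-up values standing in for names parsed from each crawl's JSON-LD.
snapshots = {
    date(2024, 1, 1): {"Sponsor A", "Sponsor B"},
    date(2024, 2, 1): {"Sponsor A", "Sponsor B", "Sponsor C"},
}
added = first_seen(snapshots)
```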

#advertools #crawling #scraping #SEO

Evergreen crawling with XML sitemaps (status check)

This is how things will look.
All URLs in the sitemap get crawled the first time.

Then a relatively tiny number of URLs is crawled (new URLs in the sitemap, & the URLs with a changed lastmod).

URLs are saved to the same crawl file to immediately compare what changed, which URLs are frequently updated, which ones were recently introduced, & when.

https://bit.ly/4fnhiSO

#advertools #DataScience #Python #crawling #scraping #SEO

Evergreen crawling with XML sitemaps (updated to run in bulk)

The script now takes a list of (XML sitemap URL, website name) tuples and runs through them all, creating dynamic file names, for example:

{name}_sitemap.csv
{name}_crawl.jl
{name}_errors.txt

Now you just need to add a URL and a name to add a new website to the process, instead of creating new specific files from scratch.
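
A small sketch of the naming step only. The sitemap URLs and site names are placeholders, and output_files is a hypothetical helper, not the script's actual code.

```python
def output_files(name: str) -> dict[str, str]:
    """Return the three per-site file names built from the website name."""
    return {
        "sitemap": f"{name}_sitemap.csv",
        "crawl": f"{name}_crawl.jl",
        "errors": f"{name}_errors.txt",
    }


# Adding a website means adding one (sitemap URL, name) tuple here.
sites = [
    ("https://example.com/sitemap.xml", "example"),
    ("https://example.org/sitemap.xml", "example_org"),
]

plan = {name: (sitemap_url, output_files(name)) for sitemap_url, name in sites}
```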

https://bit.ly/4fnhiSO

#advertools #crawling #sitemaps #scraping #SEO #DataScience #Python

Evergreen crawling using XML sitemaps

Crawl once then only crawl new &/or modified URLs

1. Download a sitemap
2. Crawl its URLs
3. Save it to a CSV file
4. After a month/week/day download the same sitemap again
5. Find URLs to crawl
A. New URLs not found in the last_sitemap
B. URLs that exist in both, but with a different lastmod
6. Crawl the new URLs
7. Save the current_sitemap and overwrite the last_sitemap
8. repeat
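
Step 5 can be sketched like this, assuming each sitemap is reduced to a URL-to-lastmod mapping (for example from the loc and lastmod columns of the sitemap DataFrame). The URLs and dates are made up.

```python
def urls_to_crawl(last: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return new URLs first, then URLs whose lastmod changed."""
    # A. New URLs not found in the last sitemap.
    new = [url for url in current if url not in last]
    # B. URLs that exist in both, but with a different lastmod.
    changed = [url for url in current if url in last and current[url] != last[url]]
    return new + changed


last_sitemap = {
    "https://example.com/a": "2024-01-01",
    "https://example.com/b": "2024-01-01",
}
current_sitemap = {
    "https://example.com/a": "2024-01-01",
    "https://example.com/b": "2024-02-01",
    "https://example.com/c": "2024-02-02",
}
to_crawl = urls_to_crawl(last_sitemap, current_sitemap)
```

URLs that disappeared from the sitemap are ignored here. You might want to track those separately.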

https://bit.ly/4fnhiSO

#advertools #SEO #crawling #scraping #Python

I'm liking the new default body text selector in #advertools

How many XPath/CSS selectors would be needed to extract the content on these pages, plus the many other templates on the same website?
It automatically excludes header, footer, and nav elements.

This is key in extracting the main text of (sub)category pages which might not have the same template across the website.

I'd love to know if you try it and/or have suggestions or issues.

#crawling #scraping #SEO #DataScience #Python

XML sitemaps: How to set custom request headers while fetching

Three examples are demonstrated:

🔵 Setting a custom User-agent
🔵 Fetching only if the sitemap's ETag was changed
🔵 Fetching only if the sitemap was modified since the Last-Modified response header
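
A sketch of building those headers, assuming you saved the ETag and Last-Modified values from a previous response. The conditional_headers and fetch_sitemap helpers are hypothetical. The request_headers parameter of sitemap_to_df is the one this feature added.

```python
from typing import Optional


def conditional_headers(
    user_agent: Optional[str] = None,
    etag: Optional[str] = None,
    last_modified: Optional[str] = None,
) -> dict[str, str]:
    """Build request headers, skipping any value that was not supplied."""
    headers = {}
    if user_agent:
        headers["User-agent"] = user_agent
    if etag:
        # The server can answer 304 if the ETag is unchanged.
        headers["If-None-Match"] = etag
    if last_modified:
        # The server can answer 304 if not modified since this date.
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_sitemap(sitemap_url: str, **header_values):
    # Imported lazily; needs advertools>=0.15.
    import advertools as adv

    headers = conditional_headers(**header_values)
    return adv.sitemap_to_df(sitemap_url, request_headers=headers)
```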

Let me know if you have other headers that might be interesting to use.

https://bit.ly/3WbCAd6

#advertools #SEO #DataScience #Python

Comparing crawls

A guide on how to compare any element (text or numeric) across two crawls

🔵 Select an element to compare (h1, size, etc)
🔵 Get URLs where the element has changed
🔵 Calculate the similarity ratio between changed elements
🔵 Get the matching blocks (sub-strings)
🔵 Filter URLs where a numeric value has changed by X%
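
The similarity and matching-block steps can be sketched with difflib from the standard library, which is an assumption about tooling and not necessarily what the guide uses. The h1 values and page sizes below are invented.

```python
from difflib import SequenceMatcher


def similarity(old: str, new: str) -> float:
    """Similarity ratio between two strings, from 0 to 1."""
    return SequenceMatcher(None, old, new).ratio()


def matching_blocks(old: str, new: str) -> list[str]:
    """Sub-strings that appear, in order, in both strings."""
    matcher = SequenceMatcher(None, old, new)
    return [old[m.a : m.a + m.size] for m in matcher.get_matching_blocks() if m.size]


def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new (old must not be zero)."""
    return (new - old) * 100 / old


old_h1, new_h1 = "Blue widgets for sale", "Blue gadgets for sale"
ratio = similarity(old_h1, new_h1)
blocks = matching_blocks(old_h1, new_h1)

# Keep URLs whose page size changed by at least 10% in either direction.
sizes = {"/a": (1000, 1200), "/b": (1000, 1010)}
big_changes = [url for url, (old, new) in sizes.items() if abs(pct_change(old, new)) >= 10]
```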

Full details and code here:

https://bit.ly/3SeWVx7

#crawling #scraping #Python #analytics #DataScience #advertools #SEO

#advertools v0.15 is out! New features:

🔵 Supply any custom request headers when fetching XML sitemaps (User-agent, If-None-Match, If-Modified-Since, etc.), contributed by Joe Joiner. Thanks!

🔵 Compare crawls (adv.crawlytics.compare). Supply two crawl DataFrames, as well as a column name. You get the values that are different, and if numeric, by how much, absolute and percentage.

1/3

Testing new release v0.15.0

#advertools

A guide on how to audit and analyze internal links on a website:

🔵 Crawl the site
🔵 Map URLs to their links
🔵 Create a directed graph to represent the links
🔵 Score nodes (URLs) with various metrics (in-links, out-links, degree centrality, pagerank)
🔵 Cluster, summarize, and visualize URLs by internal pagerank
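
To show the scoring idea without extra dependencies, here is a plain power-iteration PageRank standing in for a graph library like networkx. The tiny link map is made up, and the damping value of 0.85 is the conventional default, not something taken from the guide.

```python
from collections import Counter


def pagerank(links: dict[str, list[str]], damping: float = 0.85, iterations: int = 100) -> dict[str, float]:
    """Score pages by internal links using simple power iteration."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Map of each URL to the internal URLs it links to (made-up site).
links = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/"], "/c": ["/"]}

out_links = {url: len(targets) for url, targets in links.items()}
in_links = Counter(t for targets in links.values() for t in targets)
ranks = pagerank(links)
```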

https://bit.ly/3xYT8ge

#advertools #scikitlearn #networkx #links

Updated guide on auditing external links:

🔵 Count domains most linked to
🔵 Interactive status code chart
🔵 Find and locate broken links
🔵 Full redirect report
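
A standard-library sketch of the first and third items, assuming you already have the list of link URLs and a status code per URL. The links, the statuses, and both helper names are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def external_domain_counts(links: list[str], site_domain: str) -> Counter:
    """Count links per domain, skipping relative and same-site links."""
    domains = (urlsplit(link).netloc.lower() for link in links)
    return Counter(
        domain
        for domain in domains
        if domain and domain != site_domain and not domain.endswith("." + site_domain)
    )


def broken_links(status_by_url: dict[str, int]) -> list[str]:
    """URLs that responded with a 4xx or 5xx status code."""
    return [url for url, status in status_by_url.items() if status >= 400]


links = [
    "https://twitter.com/x",
    "https://www.example.com/page",
    "https://example.com/about",
    "https://twitter.com/y",
    "https://github.com/z",
    "/relative",
]
counts = external_domain_counts(links, "example.com")
broken = broken_links({"https://twitter.com/x": 200, "https://github.com/z": 404})
```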

https://bit.ly/4cQke86

#SEO #DataScience #advertools #scraping #crawling #Python

Just wrote an explanation of the advertools crawl file.

It's for you if you use it for crawling & would like to know in more detail how it is structured, what kinds of columns are available, & how they relate to one another.

Some columns are independent, containing a single value.
Some columns contain multiple values of the same tag on the same page.
Some columns form part of a whole.
It's all one jsonlines (.jl) file.
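
A small sketch of reading such a file with the standard library. The two sample rows are invented, and "@@" is the separator advertools uses to join multiple values of the same tag in one column.

```python
import json

SEPARATOR = "@@"  # joins multiple values of the same tag on one page


def read_jsonlines(text: str) -> list[dict]:
    """Parse jsonlines text: one JSON object (one crawled page) per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def split_multi(value) -> list[str]:
    """Split a multi-value column, returning [] when the page has none."""
    return value.split(SEPARATOR) if value else []


# Made-up rows; pages can have different sets of columns.
sample = (
    '{"url": "https://example.com/", "title": "Home", "h2": "Products@@Pricing@@Contact"}\n'
    '{"url": "https://example.com/about", "title": "About"}\n'
)
rows = read_jsonlines(sample)
h2_per_page = {row["url"]: split_multi(row.get("h2")) for row in rows}
```

In practice pandas.read_json(path, lines=True) is the more common way to load the file into a DataFrame.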

https://bit.ly/3Lbkq65

#SEO #DataScience #advertools #scraping #crawling #Python

Day 100 of #100DaysOfCode
ZERO! 🎊🎉🥳

Create a crawl + LLM content evaluation app

🔵 Enter up to 10 URLs
🔵 The app crawls them, extracts the title and body text, and evaluates them
🔵 Evaluation is done by asking #Google's helpful content guidelines questions, answered by #ChatGPT
🔵 Export/copy URL, title, body text & evaluations

Now that the 100 days are over, and the addiction is developed, I have no guarantees!

https://www.youtube.com/watch?v=GOFJXRALDw4

#SEO #DataScience #Python #crawling #scraping #advertools #AI

Day 99 of #100DaysOfCode
ONE!

Make a minor #advertools release using the new body text selector:

🔵 To be customizable later
🔵 A step toward cleaner automated content evaluation with LLMs
🔵 More useful than I thought
🔵 Pages that have many different selectors
🔵 Websites with many different templates

Check the example screenshots. Otherwise possible but tedious.

🔵 Bonus: use it with any crawler

pip install "advertools>=0.14.3"

#SEO #DataScience #Python #crawling #scraping #advertools

Day 98 of #100DaysOfCode
TWO!

Create a function that generates an XPath expression for extracting the main body text from a webpage:

🔵 Ignore tags that don't typically contain text: nav, footer, img, video, etc
🔵 Include tags that typically do: b, h1, h2, li, etc
🔵 All within <body>
🔵 Ability to customize: maybe an iframe has content that you consider part of the main content?
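
A simplified sketch of the idea, not the actual advertools implementation. The tag lists are short illustrative defaults, and body_text_xpath is a hypothetical name.

```python
DEFAULT_INCLUDE = ("b", "h1", "h2", "li", "p")
DEFAULT_EXCLUDE = ("nav", "footer", "header")


def body_text_xpath(include=DEFAULT_INCLUDE, exclude=DEFAULT_EXCLUDE) -> str:
    """Build an XPath 1.0 expression for text inside selected <body> tags."""
    # Keep elements whose tag is one of the included names.
    included = " or ".join(f"self::{tag}" for tag in include)
    expression = f"//body//*[{included}]"
    if exclude:
        # Drop elements that are, or sit inside, an excluded tag.
        excluded = " or ".join(f"ancestor-or-self::{tag}" for tag in exclude)
        expression += f"[not({excluded})]"
    return expression + "/text()"


default_xpath = body_text_xpath()
# Customizing: treat iframe content as part of the main content.
custom_xpath = body_text_xpath(include=DEFAULT_INCLUDE + ("iframe",))
```

The resulting string can be passed to any XPath-capable parser or crawler.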

https://bit.ly/4cfwizT

#SEO #DataScience #Python #crawling #scraping #advertools

Happy to announce a new cohort for my course: Data Science with Python for SEO
🔵 For absolute beginners
🔵 Cohort starts July 15, 2024
🔵 Make a leap in your data skills
🔵 Run, automate, and scale many SEO tasks with Python like crawling, analyzing XML sitemaps, text/keyword analysis
🔵 In depth intro to data manipulation and visualization skills
🔵 Get started with #advertools #pandas and #plotly

https://bit.ly/dsseo-course
1/

#advertools
