ScrapingBee to Google Sheets, full sitemap URLs
You grab a sitemap, paste a few URLs into a sheet, and call it done. Then you realize the site has five sitemap files, two of them are gzipped, and robots.txt points to a different index than sitemap.xml.
This is the kind of mess that slows down SEO leads, bogs down marketing ops, and frankly annoys agency teams trying to run audits at scale. With this ScrapingBee-to-Google-Sheets automation, you send one domain and get a growing “all URLs” sheet you can actually trust.
Below is the exact workflow, what it fixes, and what you’ll need to get every sitemap URL into Google Sheets without babysitting the process.
How This Automation Works
The full n8n workflow, from trigger to final output:
n8n Workflow Template: ScrapingBee to Google Sheets, full sitemap URLs
```mermaid
flowchart LR
subgraph sg0["Flow 1"]
direction LR
n0@{ icon: "mdi:cog", form: "rounded", label: "Parse XML as JSON", pos: "b", h: 48 }
n1@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Check Binary Payload", pos: "b", h: 48 }
n2["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/webhook.dark.svg' width='40' height='40' /></div><br/>Inbound Domain Hook"]
n3@{ icon: "mdi:web", form: "rounded", label: "Fetch robots.txt", pos: "b", h: 48 }
n4@{ icon: "mdi:web", form: "rounded", label: "Retrieve sitemap.xml", pos: "b", h: 48 }
n5@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Detect sitemap refs", pos: "b", h: 48 }
n6@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Check for GZ archive", pos: "b", h: 48 }
n7@{ icon: "mdi:cog", form: "rounded", label: "Unzip GZ content", pos: "b", h: 48 }
n8["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Map Binary Data Key"]
n9["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Collect Non-XML URLs"]
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Gather XML URLs"]
n11@{ icon: "mdi:web", form: "rounded", label: "Fetch XML Resource", pos: "b", h: 48 }
n12@{ icon: "mdi:database", form: "rounded", label: "Append URLs to Sheet", pos: "b", h: 48 }
n11 --> n1
n2 --> n3
n10 --> n11
n6 --> n7
n6 --> n0
n7 --> n8
n9 --> n12
n1 --> n6
n1 --> n9
n1 --> n10
n3 --> n5
n3 --> n4
n4 --> n1
n0 --> n9
n0 --> n10
n5 --> n10
n5 --> n4
n8 --> n0
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n1,n5,n6 decision
class n12 database
class n3,n4,n11 api
class n8,n9,n10 code
class n2 trigger
```
The Problem: Sitemaps Are Never “Just One File”
Pulling sitemap URLs sounds simple until you do it for real websites. You download sitemap.xml, it’s actually a sitemap index that links to ten more XMLs, and some of those link to more. Then there’s robots.txt, which may list a different sitemap than the one you guessed. Add gzipped sitemaps (sitemap.xml.gz) and you’ve got a workflow that breaks the moment you try to do it manually. You waste an afternoon, still miss pages, and you don’t feel confident enough to base an audit on the list you collected.
It adds up fast. Here’s where it breaks down.
- You end up copying partial URL lists into Google Sheets, then realizing you missed entire sitemap branches.
- Compressed sitemap files force extra steps, and it’s easy to skip them when you’re moving quickly.
- Sites that block basic requests make “quick scripts” unreliable, which means you rerun work and still doubt the output.
- When sitemaps update, your spreadsheet becomes stale unless someone repeats the whole process again.
The Solution: Send a Domain, Get Every URL in a Sheet
This n8n workflow starts with a simple webhook call that includes the domain (like domain=www.example.com). From there, it uses ScrapingBee to fetch robots.txt, checks for sitemap references, and falls back to sitemap.xml when robots.txt doesn’t help. Each XML resource it discovers gets fetched and inspected, including tricky cases like binary payloads and gzipped archives. The workflow extracts two things: the “real” website URLs you want to audit, and any additional XML sitemap links that should be followed next. Those XML links feed back into the same fetch-and-parse loop until there are no more sitemap files to chase, and every discovered page URL gets appended into a Google Sheet.
In short: you send a GET request to the webhook with a domain; ScrapingBee pulls robots.txt and sitemap.xml; the workflow recursively collects sitemap links and page links (including .gz sitemaps); and every non-XML URL is appended to a single links column in Google Sheets, ready to filter, dedupe, and audit.
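To make that recursion concrete, here is a standalone sketch of the same loop in plain Node.js. It skips ScrapingBee, has no retries, and uses a regex instead of a real XML parser, so treat it as an illustration of the fetch-follow-append logic rather than the production path; the crawlSitemaps helper and its queue are an assumption, not part of the template.

```javascript
// Minimal sketch of the recursive sitemap crawl (Node.js 18+ for global fetch).
const { gunzipSync } = require("node:zlib");

async function crawlSitemaps(domain) {
  const pageUrls = new Set();
  const queue = [`https://${domain}/sitemap.xml`];
  const seen = new Set(queue);

  while (queue.length > 0) {
    const url = queue.shift();
    const res = await fetch(url);
    if (!res.ok) continue;

    // Gzipped sitemaps (.xml.gz) arrive as binary and must be unzipped first,
    // mirroring the workflow's Check for GZ archive / Unzip GZ content branch.
    const xml = url.endsWith(".gz")
      ? gunzipSync(Buffer.from(await res.arrayBuffer())).toString("utf8")
      : await res.text();

    // Every <loc> entry is either another sitemap (queue it) or a page URL.
    for (const [, loc] of xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)) {
      if (/\.xml(\.gz)?$/.test(loc)) {
        if (!seen.has(loc)) { seen.add(loc); queue.push(loc); }
      } else {
        pageUrls.add(loc);
      }
    }
  }
  return [...pageUrls];
}

crawlSitemaps("www.example.com").then((urls) => console.log(urls.length, "page URLs"));
```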
What You Get: Automation vs. Results
| What This Workflow Automates | Results You’ll Get |
|---|---|
| Fetching robots.txt and sitemap.xml through ScrapingBee | One webhook call per domain instead of manual downloads |
| Recursively following nested sitemap indexes | No missed sitemap branches or partial URL lists |
| Detecting and unzipping .gz sitemap archives | Compressed sitemaps handled without extra manual steps |
| Appending every page URL to Google Sheets as it’s found | A single links column you can filter, dedupe, and audit |
Example: What This Looks Like
Say you’re auditing a mid-sized ecommerce site with 12 sitemap files, and each one takes about 10 minutes to download, open, extract, and paste into a sheet. That’s roughly 2 hours, and that’s before you notice three of them are .gz archives. With this workflow, you send one webhook request (a minute), let n8n and ScrapingBee process and recurse for maybe 20 minutes, and your Google Sheet fills as it goes. You still review the list, but you’re not doing the grunt work.
What You’ll Need
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- ScrapingBee for fetching robots.txt and sitemaps reliably.
- Google Sheets to store the final list of URLs.
- ScrapingBee API key (get it from your ScrapingBee dashboard).
Skill level: Intermediate. You’ll paste credentials, create a sheet column named links, and test a webhook call.
Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).
How It Works
A webhook starts the run. You call the workflow URL with a single query parameter (domain=…). That’s it. No CSV uploads, no manual inputs beyond the domain.
Sitemaps get discovered the way search engines expect. The workflow fetches robots.txt via ScrapingBee, checks for sitemap references, and then retrieves sitemap.xml if needed. This matters because many sites don’t put their “real” sitemap where you’d assume.
Each XML resource is inspected and parsed. The workflow checks if the response is text or binary, detects .gz archives, unzips when required, and parses XML into structured data so URLs can be extracted consistently.
URLs are split into two streams. Normal page URLs get appended to Google Sheets immediately, while newly found XML sitemap links get fed back into the workflow to be scraped next. That loop continues until there are no more XML links left to fetch.
You can easily modify the destination from Google Sheets to Excel 365 or a database based on your needs. See the full implementation guide below for customization options.
Step-by-Step Implementation Guide
Step 1: Configure the Webhook Trigger
Set up the inbound webhook that accepts a domain and initiates sitemap discovery.
- Add the Inbound Domain Hook node as the trigger.
- Set Path to `1da30868-fbca-4e8e-8580-485afb3fd956`.
- Plan to call the webhook with a query parameter named domain (e.g., `?domain=example.com`).
⚠️ Common Pitfall: The workflow expects `$('Inbound Domain Hook').item.json.query.domain`. If you pass the domain in the body instead of the query string, the URLs in Fetch robots.txt and Retrieve sitemap.xml will be empty.
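For a quick sanity check from outside n8n, the test call can be as simple as the sketch below; YOUR-N8N-HOST is a placeholder for your instance’s base URL, and the path matches the one set above.

```javascript
// Hypothetical test call: replace YOUR-N8N-HOST with your n8n base URL.
// While the workflow is inactive, n8n serves the same path under
// /webhook-test/ instead of /webhook/.
// The domain must travel as a query parameter, since downstream nodes
// read $('Inbound Domain Hook').item.json.query.domain.
const webhook =
  "https://YOUR-N8N-HOST/webhook/1da30868-fbca-4e8e-8580-485afb3fd956";
fetch(`${webhook}?domain=www.example.com`).then((res) =>
  console.log("webhook responded with", res.status)
);
```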
Step 2: Connect ScrapingBee and Configure Sitemap Discovery
Pull robots.txt and sitemap.xml using ScrapingBee to discover sitemap references.
- In Fetch robots.txt, set URL to `=https://{{ $('Inbound Domain Hook').item.json.query.domain }}/robots.txt`.
- Credential Required: Connect your ScrapingBeeApi credentials in Fetch robots.txt.
- In Retrieve sitemap.xml, set URL to `=https://{{ $('Inbound Domain Hook').item.json.query.domain }}/sitemap.xml`.
- Credential Required: Connect your ScrapingBeeApi credentials in Retrieve sitemap.xml and Fetch XML Resource (used for sitemap URLs discovered later).
- Keep Detect sitemap refs as-is; it checks `={{ $json.data }}` for the string `Sitemap:` to locate sitemap references.

The same ScrapingBee credential covers both /robots.txt and /sitemap.xml without changing any node settings.
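If it helps to see what the two discovery fetches boil down to, here is a rough approximation outside n8n using ScrapingBee’s HTTP API; the api_key and url parameters come from their public API, and render_js stays off since robots.txt and sitemaps are plain text.

```javascript
// Approximation of the two discovery fetches via ScrapingBee's HTTP API.
// SCRAPINGBEE_API_KEY is assumed to be set in your environment.
const domain = "www.example.com";

for (const path of ["/robots.txt", "/sitemap.xml"]) {
  const api = new URL("https://app.scrapingbee.com/api/v1/");
  api.searchParams.set("api_key", process.env.SCRAPINGBEE_API_KEY);
  api.searchParams.set("url", `https://${domain}${path}`);
  api.searchParams.set("render_js", "false"); // plain text/XML, no JS rendering

  fetch(api).then((res) => res.text()).then((body) => {
    // robots.txt references look like "Sitemap: https://.../sitemap.xml",
    // which is exactly the string Detect sitemap refs checks for.
    console.log(path, body.includes("Sitemap:") ? "has sitemap refs" : "no refs");
  });
}
```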
Step 3: Set Up Binary Handling and XML Parsing
Handle binary sitemap responses, including gzip archives, before parsing XML.
- In Check Binary Payload, keep the condition `={{ $binary }}` is not empty to gate binary processing.
- In Check for GZ archive, keep Left Value as `={{ $binary.data.fileExtension }}` and Right Value as `gz`.
- Use Unzip GZ content to decompress gzip sitemaps before parsing.
- In Map Binary Data Key, keep the JavaScript that remaps `file_0` to `data` so Parse XML as JSON can read it.
- In Parse XML as JSON, set Operation to `xml` and Binary Property Name to `=data`.
⚠️ Common Pitfall: If Map Binary Data Key is removed, Parse XML as JSON will not find `data` and parsing will fail for gzipped sitemaps.
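As a reference for what that remap looks like, here is a minimal Code-node-style sketch. The file_0-to-data rename comes from the workflow description above; the node mode (“Run Once for Each Item”) is an assumption.

```javascript
// Sketch of a Map Binary Data Key Code node ("Run Once for Each Item" assumed).
// Unzip GZ content emits its output under binary.file_0, but Parse XML as JSON
// is configured to read binary.data, so the binary key gets renamed.
const item = $input.item;
if (item.binary && item.binary.file_0) {
  item.binary.data = item.binary.file_0;
  delete item.binary.file_0;
}
return item;
```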
Step 4: Configure URL Extraction and Parallel Branching
Split parsed XML into non-XML URLs for output and XML URLs for recursive fetching.
- Keep the Collect Non-XML URLs code to emit items with `{ "link": "..." }` while excluding `.xml`, `.xml.gz`, `www.sitemaps.org`, and `www.w3.org`.
- Keep the Gather XML URLs code to emit items with `{ "xml": "..." }` for nested sitemap crawling.
- Parse XML as JSON outputs to both Collect Non-XML URLs and Gather XML URLs in parallel.
- Confirm Gather XML URLs connects to Fetch XML Resource to fetch nested sitemap XML via `={{ $json.xml }}`.
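The splitting logic amounts to something like the sketch below. The flat locs array is a simplification (the real nodes walk the Parse XML as JSON output structure), and in the actual workflow each filter lives in its own node.

```javascript
// Simplified view of the URL-splitting logic (n8n "Run Once for All Items").
// Assumes the <loc> values have already been flattened into json.locs.
const EXCLUDE = ["www.sitemaps.org", "www.w3.org"];
const locs = $input.all().flatMap((item) => item.json.locs ?? []);

// Collect Non-XML URLs: page links headed for the Google Sheet.
const pages = locs
  .filter((u) => !/\.xml(\.gz)?$/.test(u) && !EXCLUDE.some((d) => u.includes(d)))
  .map((u) => ({ json: { link: u } }));

// Gather XML URLs: nested sitemaps fed back into Fetch XML Resource.
const sitemaps = locs
  .filter((u) => /\.xml(\.gz)?$/.test(u))
  .map((u) => ({ json: { xml: u } }));

return [...pages, ...sitemaps];
```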
Step 5: Configure Output to Google Sheets
Append discovered non-XML URLs to a spreadsheet for storage and analysis.
- Add Append URLs to Sheet after Collect Non-XML URLs.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Append URLs to Sheet.
- Set Operation to `append` and enable Use Append.
- Set Document to `[YOUR_ID]` and Sheet to `Sheet1` (gid=0).
- In Columns, map links to `={{ $json.link }}`.
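If duplicate links across sitemaps bother you, one option (an addition, not part of the shipped template) is a small dedupe Code node between Collect Non-XML URLs and Append URLs to Sheet:

```javascript
// Optional dedupe Code node ("Run Once for All Items") -- not part of the
// template as shipped. Drops repeat links within a single run before they
// reach the Google Sheets append.
const seen = new Set();
return $input.all().filter((item) => {
  if (seen.has(item.json.link)) return false;
  seen.add(item.json.link);
  return true;
});
```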
Step 6: Test and Activate Your Workflow
Validate execution with a sample domain and then enable the workflow for production.
- Click Execute Workflow and trigger Inbound Domain Hook with a test URL like `?domain=example.com`.
- Verify Fetch robots.txt and Retrieve sitemap.xml return content, and that Check Binary Payload routes correctly.
- Confirm Append URLs to Sheet adds rows with URL values in the links column.
- Once successful, toggle the workflow to Active for live webhook use.
Common Gotchas
- ScrapingBee credentials can expire or need specific permissions. If things break, check your ScrapingBee dashboard API key status and plan limits first.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Google Sheets appends can silently go sideways if the sheet doesn’t have a links column or the Google account loses access. Check the Google Sheets node connection and the target sheet ID.
Frequently Asked Questions
How long does this take to set up?

About 30 minutes if your ScrapingBee key and Google Sheet are ready.

Do I need to know how to code?

No. You’ll mostly connect accounts and paste in the webhook URL for testing.

Is n8n free to use?

Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in ScrapingBee API costs based on your usage.

Should I run this on n8n Cloud or self-host?

Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.

Can I send the URLs somewhere other than Google Sheets?

Yes, but you’ll swap the destination node. Replace the “Append URLs to Sheet” Google Sheets step with Microsoft Excel 365 (or a database) and keep the rest of the sitemap discovery logic the same. Common customizations include writing to multiple tabs, deduplicating before insert, and storing extra columns like the source sitemap file.

What if ScrapingBee requests start failing?

Usually it’s an invalid or expired API key, so regenerate it in ScrapingBee and update the credential in n8n. It can also be a plan limit issue if the site has a lot of XML files, or a blocked request that needs ScrapingBee’s premium/stealth proxy options enabled on the “Fetch robots.txt”, “Retrieve sitemap.xml”, or “Fetch XML Resource” nodes. If failures happen only on .gz files, check the gzip detection and unzip branch, because a binary payload can be misread if the content-type is unusual.

How many URLs can this workflow handle?

It depends more on your memory and API limits than a hard URL cap. On n8n Cloud Starter, you’re limited by monthly executions, which is fine for periodic audits but not ideal for nonstop crawling. If you self-host, there’s no execution limit, and capacity mostly comes down to server RAM and how heavy the sitemaps are. For very large sitemap indexes, run it during off-hours and consider writing incremental results (which this workflow already does by appending as it finds URLs).

Is n8n better than Zapier or Make for this?

For recursive sitemap scraping, yes, most of the time. Zapier and Make are great for linear “A to B” workflows, but they get awkward when you need loops, branching logic for gz files, and “keep going until there are no more XML links.” n8n handles that structure cleanly, and you can self-host for high-volume runs without paying per tiny step. If your use case is just “fetch one sitemap and write it to a sheet,” Zapier can be simpler. Talk to an automation expert if you want a quick recommendation based on your volume.
Once this is running, “get me the full sitemap URL list” stops being a half-day task. The workflow does the chasing and parsing so you can get back to the audit.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.