ScrapingBee to Google Sheets, full sitemap URLs
You grab a sitemap, paste a few URLs into a sheet, and call it done. Then you realize the site has five sitemap files, two of them are gzipped, and robots.txt points to a different index than sitemap.xml.
This is the kind of mess that slows down SEO leads, bogs down marketing ops, and frankly annoys agency teams trying to run audits at scale. With this ScrapingBee-to-Google-Sheets automation, you send one domain and get a growing “all URLs” sheet you can actually trust.
Below is the exact workflow, what it fixes, and what you’ll need to get every sitemap URL into Google Sheets without babysitting the process.
How This Automation Works
The full n8n workflow, from trigger to final output:
n8n Workflow Template: ScrapingBee to Google Sheets, full sitemap URLs
```mermaid
flowchart LR
subgraph sg0["Flow 1"]
direction LR
n0@{ icon: "mdi:cog", form: "rounded", label: "Parse XML as JSON", pos: "b", h: 48 }
n1@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Check Binary Payload", pos: "b", h: 48 }
n2["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/webhook.dark.svg' width='40' height='40' /></div><br/>Inbound Domain Hook"]
n3@{ icon: "mdi:web", form: "rounded", label: "Fetch robots.txt", pos: "b", h: 48 }
n4@{ icon: "mdi:web", form: "rounded", label: "Retrieve sitemap.xml", pos: "b", h: 48 }
n5@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Detect sitemap refs", pos: "b", h: 48 }
n6@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Check for GZ archive", pos: "b", h: 48 }
n7@{ icon: "mdi:cog", form: "rounded", label: "Unzip GZ content", pos: "b", h: 48 }
n8["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Map Binary Data Key"]
n9["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Collect Non-XML URLs"]
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Gather XML URLs"]
n11@{ icon: "mdi:web", form: "rounded", label: "Fetch XML Resource", pos: "b", h: 48 }
n12@{ icon: "mdi:database", form: "rounded", label: "Append URLs to Sheet", pos: "b", h: 48 }
n11 --> n1
n2 --> n3
n10 --> n11
n6 --> n7
n6 --> n0
n7 --> n8
n9 --> n12
n1 --> n6
n1 --> n9
n1 --> n10
n3 --> n5
n3 --> n4
n4 --> n1
n0 --> n9
n0 --> n10
n5 --> n10
n5 --> n4
n8 --> n0
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n1,n5,n6 decision
class n12 database
class n3,n4,n11 api
class n8,n9,n10 code
class n2 trigger
```
The Problem: Sitemaps Are Never “Just One File”
Pulling sitemap URLs sounds simple until you do it for real websites. You download sitemap.xml, it’s actually a sitemap index that links to ten more XMLs, and some of those link to more. Then there’s robots.txt, which may list a different sitemap than the one you guessed. Add gzipped sitemaps (sitemap.xml.gz) and you’ve got a workflow that breaks the moment you try to do it manually. You waste an afternoon, still miss pages, and you don’t feel confident enough to base an audit on the list you collected.
It adds up fast. Here’s where it breaks down.
- You end up copying partial URL lists into Google Sheets, then realizing you missed entire sitemap branches.
- Compressed sitemap files force extra steps, and it’s easy to skip them when you’re moving quickly.
- Sites that block basic requests make “quick scripts” unreliable, which means you rerun work and still doubt the output.
- When sitemaps update, your spreadsheet becomes stale unless someone repeats the whole process again.
The Solution: Send a Domain, Get Every URL in a Sheet
This n8n workflow starts with a simple webhook call that includes the domain (like domain=www.example.com). From there, it uses ScrapingBee to fetch robots.txt, checks for sitemap references, and falls back to sitemap.xml when robots.txt doesn’t help. Each XML resource it discovers gets fetched and inspected, including tricky cases like binary payloads and gzipped archives. The workflow extracts two things: the “real” website URLs you want to audit, and any additional XML sitemap links that should be followed next. Those XML links feed back into the same fetch-and-parse loop until there are no more sitemap files to chase, and every discovered page URL gets appended into a Google Sheet.
In short: you send a GET request to the webhook with a domain; ScrapingBee pulls robots.txt and sitemap.xml; the workflow recursively collects sitemap links and page links (including .gz sitemaps); and every non-XML URL is appended to a single links column in Google Sheets, ready to filter, dedupe, and audit.
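To make that recursion concrete, here is a standalone sketch of the same loop in plain Node.js. It skips ScrapingBee, has no retries, and uses a regex instead of a real XML parser, so treat it as an illustration of the fetch-follow-append logic rather than the production path; the crawlSitemaps helper and its queue are an assumption, not part of the template.

```javascript
// Minimal sketch of the recursive sitemap crawl (Node.js 18+ for global fetch).
const { gunzipSync } = require("node:zlib");

async function crawlSitemaps(domain) {
  const pageUrls = new Set();
  const queue = [`https://${domain}/sitemap.xml`];
  const seen = new Set(queue);

  while (queue.length > 0) {
    const url = queue.shift();
    const res = await fetch(url);
    if (!res.ok) continue;

    // Gzipped sitemaps (.xml.gz) arrive as binary and must be unzipped first,
    // mirroring the workflow's Check for GZ archive / Unzip GZ content branch.
    const xml = url.endsWith(".gz")
      ? gunzipSync(Buffer.from(await res.arrayBuffer())).toString("utf8")
      : await res.text();

    // Every <loc> entry is either another sitemap (queue it) or a page URL.
    for (const [, loc] of xml.matchAll(/<loc>\s*([^<\s]+)\s*<\/loc>/g)) {
      if (/\.xml(\.gz)?$/.test(loc)) {
        if (!seen.has(loc)) { seen.add(loc); queue.push(loc); }
      } else {
        pageUrls.add(loc);
      }
    }
  }
  return [...pageUrls];
}

crawlSitemaps("www.example.com").then((urls) => console.log(urls.length, "page URLs"));
```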
What You Get: Automation vs. Results
| What This Workflow Automates | Results You’ll Get |
|---|---|
| Fetching robots.txt and sitemap.xml through ScrapingBee | One webhook call per domain instead of manual downloads |
| Recursively following nested sitemap indexes | No missed sitemap branches or partial URL lists |
| Detecting and unzipping .gz sitemap archives | Compressed sitemaps handled without extra manual steps |
| Appending every page URL to Google Sheets as it’s found | A single links column you can filter, dedupe, and audit |
Example: What This Looks Like
Say you’re auditing a mid-sized ecommerce site with 12 sitemap files, and each one takes about 10 minutes to download, open, extract, and paste into a sheet. That’s roughly 2 hours, and that’s before you notice three of them are .gz archives. With this workflow, you send one webhook request (a minute), let n8n and ScrapingBee process and recurse for maybe 20 minutes, and your Google Sheet fills as it goes. You still review the list, but you’re not doing the grunt work.
What You’ll Need
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- ScrapingBee for fetching robots.txt and sitemaps reliably.
- Google Sheets to store the final list of URLs.
- ScrapingBee API key (get it from your ScrapingBee dashboard).
Skill level: Intermediate. You’ll paste credentials, create a sheet column named links, and test a webhook call.
Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).
How It Works
A webhook starts the run. You call the workflow URL with a single query parameter (domain=…). That’s it. No CSV uploads, no manual inputs beyond the domain.
Sitemaps get discovered the way search engines expect. The workflow fetches robots.txt via ScrapingBee, checks for sitemap references, and then retrieves sitemap.xml if needed. This matters because many sites don’t put their “real” sitemap where you’d assume.
Each XML resource is inspected and parsed. The workflow checks if the response is text or binary, detects .gz archives, unzips when required, and parses XML into structured data so URLs can be extracted consistently.
URLs are split into two streams. Normal page URLs get appended to Google Sheets immediately, while newly found XML sitemap links get fed back into the workflow to be scraped next. That loop continues until there are no more XML links left to fetch.
You can easily modify the destination from Google Sheets to Excel 365 or a database based on your needs. See the full implementation guide below for customization options.
Step-by-Step Implementation Guide
Step 1: Configure the Webhook Trigger
Set up the inbound webhook that accepts a domain and initiates sitemap discovery.
- Add the Inbound Domain Hook node as the trigger.
- Set Path to `1da30868-fbca-4e8e-8580-485afb3fd956`.
- Plan to call the webhook with a query parameter named domain (e.g., `?domain=example.com`).
⚠️ Common Pitfall: The workflow expects `$('Inbound Domain Hook').item.json.query.domain`. If you pass the domain in the body instead of the query string, the URLs in Fetch robots.txt and Retrieve sitemap.xml will be empty.
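For a quick sanity check from outside n8n, the test call can be as simple as the sketch below; YOUR-N8N-HOST is a placeholder for your instance’s base URL, and the path matches the one set above.

```javascript
// Hypothetical test call: replace YOUR-N8N-HOST with your n8n base URL.
// While the workflow is inactive, n8n serves the same path under
// /webhook-test/ instead of /webhook/.
// The domain must travel as a query parameter, since downstream nodes
// read $('Inbound Domain Hook').item.json.query.domain.
const webhook =
  "https://YOUR-N8N-HOST/webhook/1da30868-fbca-4e8e-8580-485afb3fd956";
fetch(`${webhook}?domain=www.example.com`).then((res) =>
  console.log("webhook responded with", res.status)
);
```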
Step 2: Connect ScrapingBee and Configure Sitemap Discovery
Pull robots.txt and sitemap.xml using ScrapingBee to discover sitemap references.
- In Fetch robots.txt, set URL to `=https://{{ $('Inbound Domain Hook').item.json.query.domain }}/robots.txt`.
- Credential Required: Connect your ScrapingBeeApi credentials in Fetch robots.txt.
- In Retrieve sitemap.xml, set URL to `=https://{{ $('Inbound Domain Hook').item.json.query.domain }}/sitemap.xml`.
- Credential Required: Connect your ScrapingBeeApi credentials in Retrieve sitemap.xml and Fetch XML Resource (used for sitemap URLs discovered later).
- Keep Detect sitemap refs as-is; it checks `={{ $json.data }}` for the string `Sitemap:` to locate sitemap references.

The same ScrapingBee credential covers both /robots.txt and /sitemap.xml without changing any node settings.
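If it helps to see what the two discovery fetches boil down to, here is a rough approximation outside n8n using ScrapingBee’s HTTP API; the api_key and url parameters come from their public API, and render_js stays off since robots.txt and sitemaps are plain text.

```javascript
// Approximation of the two discovery fetches via ScrapingBee's HTTP API.
// SCRAPINGBEE_API_KEY is assumed to be set in your environment.
const domain = "www.example.com";

for (const path of ["/robots.txt", "/sitemap.xml"]) {
  const api = new URL("https://app.scrapingbee.com/api/v1/");
  api.searchParams.set("api_key", process.env.SCRAPINGBEE_API_KEY);
  api.searchParams.set("url", `https://${domain}${path}`);
  api.searchParams.set("render_js", "false"); // plain text/XML, no JS rendering

  fetch(api).then((res) => res.text()).then((body) => {
    // robots.txt references look like "Sitemap: https://.../sitemap.xml",
    // which is exactly the string Detect sitemap refs checks for.
    console.log(path, body.includes("Sitemap:") ? "has sitemap refs" : "no refs");
  });
}
```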
Step 3: Set Up Binary Handling and XML Parsing
Handle binary sitemap responses, including gzip archives, before parsing XML.
- In Check Binary Payload, keep the condition `={{ $binary }}` is not empty to gate binary processing.
- In Check for GZ archive, keep Left Value as `={{ $binary.data.fileExtension }}` and Right Value as `gz`.
- Use Unzip GZ content to decompress gzip sitemaps before parsing.
- In Map Binary Data Key, keep the JavaScript that remaps `file_0` to `data` so Parse XML as JSON can read it.
- In Parse XML as JSON, set Operation to `xml` and Binary Property Name to `=data`.
⚠️ Common Pitfall: If Map Binary Data Key is removed, Parse XML as JSON will not find `data` and parsing will fail for gzipped sitemaps.
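As a reference for what that remap looks like, here is a minimal Code-node-style sketch. The file_0-to-data rename comes from the workflow description above; the node mode (“Run Once for Each Item”) is an assumption.

```javascript
// Sketch of a Map Binary Data Key Code node ("Run Once for Each Item" assumed).
// Unzip GZ content emits its output under binary.file_0, but Parse XML as JSON
// is configured to read binary.data, so the binary key gets renamed.
const item = $input.item;
if (item.binary && item.binary.file_0) {
  item.binary.data = item.binary.file_0;
  delete item.binary.file_0;
}
return item;
```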
Step 4: Configure URL Extraction and Parallel Branching
Split parsed XML into non-XML URLs for output and XML URLs for recursive fetching.
- Keep the Collect Non-XML URLs code to emit items with `{ "link": "..." }` while excluding `.xml`, `.xml.gz`, `www.sitemaps.org`, and `www.w3.org`.
- Keep the Gather XML URLs code to emit items with `{ "xml": "..." }` for nested sitemap crawling.
- Parse XML as JSON outputs to both Collect Non-XML URLs and Gather XML URLs in parallel.
- Confirm Gather XML URLs connects to Fetch XML Resource to fetch nested sitemap XML via `={{ $json.xml }}`.
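The splitting logic amounts to something like the sketch below. The flat locs array is a simplification (the real nodes walk the Parse XML as JSON output structure), and in the actual workflow each filter lives in its own node.

```javascript
// Simplified view of the URL-splitting logic (n8n "Run Once for All Items").
// Assumes the <loc> values have already been flattened into json.locs.
const EXCLUDE = ["www.sitemaps.org", "www.w3.org"];
const locs = $input.all().flatMap((item) => item.json.locs ?? []);

// Collect Non-XML URLs: page links headed for the Google Sheet.
const pages = locs
  .filter((u) => !/\.xml(\.gz)?$/.test(u) && !EXCLUDE.some((d) => u.includes(d)))
  .map((u) => ({ json: { link: u } }));

// Gather XML URLs: nested sitemaps fed back into Fetch XML Resource.
const sitemaps = locs
  .filter((u) => /\.xml(\.gz)?$/.test(u))
  .map((u) => ({ json: { xml: u } }));

return [...pages, ...sitemaps];
```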
Step 5: Configure Output to Google Sheets
Append discovered non-XML URLs to a spreadsheet for storage and analysis.
- Add Append URLs to Sheet after Collect Non-XML URLs.
- Credential Required: Connect your googleSheetsOAuth2Api credentials in Append URLs to Sheet.
- Set Operation to `append` and enable Use Append.
- Set Document to `[YOUR_ID]` and Sheet to `Sheet1` (gid=0).
- In Columns, map links to `={{ $json.link }}`.
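If duplicate links across sitemaps bother you, one option (an addition, not part of the shipped template) is a small dedupe Code node between Collect Non-XML URLs and Append URLs to Sheet:

```javascript
// Optional dedupe Code node ("Run Once for All Items") -- not part of the
// template as shipped. Drops repeat links within a single run before they
// reach the Google Sheets append.
const seen = new Set();
return $input.all().filter((item) => {
  if (seen.has(item.json.link)) return false;
  seen.add(item.json.link);
  return true;
});
```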
Step 6: Test and Activate Your Workflow
Validate execution with a sample domain and then enable the workflow for production.
- Click Execute Workflow and trigger Inbound Domain Hook with a test URL like `?domain=example.com`.
- Verify Fetch robots.txt and Retrieve sitemap.xml return content, and that Check Binary Payload routes correctly.
- Confirm Append URLs to Sheet adds rows with URL values in the links column.
- Once successful, toggle the workflow to Active for live webhook use.
Common Gotchas
- ScrapingBee credentials can expire or need specific permissions. If things break, check your ScrapingBee dashboard API key status and plan limits first.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Google Sheets appends can silently go sideways if the sheet doesn’t have a links column or the Google account loses access. Check the Google Sheets node connection and the target sheet ID.
Frequently Asked Questions
How long does this take to set up?

About 30 minutes if your ScrapingBee key and Google Sheet are ready.

Do I need to know how to code?

No. You’ll mostly connect accounts and paste in the webhook URL for testing.

Is n8n free to use?

Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in ScrapingBee API costs based on your usage.

Should I run this on n8n Cloud or self-host?

Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.

Can I send the URLs somewhere other than Google Sheets?

Yes, but you’ll swap the destination node. Replace the “Append URLs to Sheet” Google Sheets step with Microsoft Excel 365 (or a database) and keep the rest of the sitemap discovery logic the same. Common customizations include writing to multiple tabs, deduplicating before insert, and storing extra columns like the source sitemap file.

What if ScrapingBee requests start failing?

Usually it’s an invalid or expired API key, so regenerate it in ScrapingBee and update the credential in n8n. It can also be a plan limit issue if the site has a lot of XML files, or a blocked request that needs ScrapingBee’s premium/stealth proxy options enabled on the “Fetch robots.txt”, “Retrieve sitemap.xml”, or “Fetch XML Resource” nodes. If failures happen only on .gz files, check the gzip detection and unzip branch, because a binary payload can be misread if the content-type is unusual.

How many URLs can this workflow handle?

It depends more on your memory and API limits than a hard URL cap. On n8n Cloud Starter, you’re limited by monthly executions, which is fine for periodic audits but not ideal for nonstop crawling. If you self-host, there’s no execution limit, and capacity mostly comes down to server RAM and how heavy the sitemaps are. For very large sitemap indexes, run it during off-hours and consider writing incremental results (which this workflow already does by appending as it finds URLs).

Is n8n better than Zapier or Make for this?

For recursive sitemap scraping, yes, most of the time. Zapier and Make are great for linear “A to B” workflows, but they get awkward when you need loops, branching logic for gz files, and “keep going until there are no more XML links.” n8n handles that structure cleanly, and you can self-host for high-volume runs without paying per tiny step. If your use case is just “fetch one sitemap and write it to a sheet,” Zapier can be simpler. Talk to an automation expert if you want a quick recommendation based on your volume.
Once this is running, “get me the full sitemap URL list” stops being a half-day task. The workflow does the chasing and parsing so you can get back to the audit.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.