Scrapeless to Qdrant, pages become searchable data
Copying chunks of web pages into docs or spreadsheets feels quick. Then you try to find that one quote later, realize you lost the source URL, and the “research” turns into a messy pile.
Marketing leads doing competitive research get hit hardest, but agency owners building client knowledge bases and ops folks maintaining internal wikis run into the same wall. This Scrapeless-to-Qdrant automation turns web pages into searchable data, so you can retrieve answers in seconds instead of re-reading tabs for an hour.
You’ll see how the workflow pulls reliable HTML, cleans it with AI, creates embeddings, and stores everything in Qdrant with status webhooks for visibility.
How This Automation Works
The full n8n workflow, from trigger to final output:
n8n Workflow Template: Scrapeless to Qdrant, pages become searchable data
flowchart LR
subgraph sg0["When clicking 'Test workflow' Flow"]
direction LR
n0@{ icon: "mdi:play-circle", form: "rounded", label: "When clicking 'Test workflow'", pos: "b", h: 48 }
n1@{ icon: "mdi:swap-vertical", form: "rounded", label: "Set Fields - URL and Webhook..", pos: "b", h: 48 }
n2["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Scrapeless Web Request"]
n3["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Format Claude Output"]
n4["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Check Collection Exists"]
n5@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Collection Exists Check", pos: "b", h: 48 }
n6["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Create Qdrant Collection"]
n7["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Claude Data extractor"]
n8["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Ollama Embeddings"]
n9["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Qdrant Vector store"]
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Claude AI Agent"]
n11["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Webhook for structured AI ag.."]
n12["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Export data webhook"]
n13["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>AI Data Checker"]
n13 --> n12
n10 --> n3
n8 --> n9
n9 --> n11
n3 --> n8
n7 --> n10
n2 --> n13
n2 --> n7
n4 --> n5
n5 --> n1
n5 --> n6
n6 --> n1
n0 --> n4
n1 --> n2
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n0 trigger
class n5 decision
class n2,n4,n6 api
class n3,n7,n8,n9,n10,n11,n12,n13 code
classDef customIcon fill:none,stroke:none
class n2,n3,n4,n6,n7,n8,n9,n10,n11,n12,n13 customIcon
The Problem: Web Research Doesn’t Stay Searchable
Web pages are a terrible long-term storage format for anything you need to reuse. They change, load weirdly, block scrapers, and bury the one key sentence you actually cared about. So teams do the “manual safety net”: copy, paste, tidy, label, and hope future-you remembers where it came from. It’s not just time. It’s mental load, duplicated work, and mistakes that quietly spread into briefs, decks, and client recommendations.
It adds up fast. Here’s where it breaks down in real life.
- You lose context when the source URL, page title, and timestamp aren’t stored alongside the notes.
- JavaScript-heavy sites and anti-bot protections make “quick scraping” unreliable, so the workflow dies right at the start.
- Even when you capture the page, you end up with unstructured text, which means search is basically Ctrl+F and luck.
- Teams re-collect the same info every month because nothing is indexed in a way an agent or teammate can query.
The Solution: Turn Web Pages Into a Reusable Vector Library
This workflow creates a clean pipeline from “URL I need to understand” to “searchable knowledge I can reuse.” It starts when you manually launch the workflow in n8n (perfect for ad hoc research sprints), then assigns a target URL and a webhook destination for updates. Scrapeless fetches the page HTML in a way that handles tough sites more reliably than basic scrapers. From there, an AI validation step checks that you actually got meaningful content, and an AI extraction script turns messy HTML into structured JSON you can trust.
Next, the workflow enhances and formats the extracted content, then generates embeddings using a local Ollama model (all-minilm). Finally, it checks Qdrant for the right collection, creates it if needed, saves the vectors, and notifies you via webhook when the run completes or fails. You end up with a searchable vector database that’s ready for semantic search, internal tools, or RAG agents.
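To make “searchable knowledge” concrete, here is a minimal Python sketch of the kind of point this pipeline ends up storing in Qdrant: the embedding plus the source context (URL, title, capture time) that manual copy-paste loses. The payload field names are illustrative, not the template’s exact schema, and the 384-dimension vector assumes the all-minilm embedding model.

```python
import time
import uuid

def build_point(vector: list[float], text: str, url: str, title: str) -> dict:
    """Shape one Qdrant point: the embedding plus the context that
    copy-paste research usually loses (source URL, title, capture time)."""
    return {
        "id": str(uuid.uuid4()),
        "vector": vector,
        "payload": {
            "text": text,
            "source_url": url,
            "title": title,
            "captured_at": int(time.time()),
        },
    }

# Illustrative usage with a dummy 384-dim vector (all-minilm's output size):
point = build_point([0.1] * 384, "Example chunk", "https://news.ycombinator.com/", "Hacker News")
```

Because the URL and title travel with every vector, search results can always cite their source.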
What You Get: Automation vs. Results
| What This Workflow Automates | Results You’ll Get |
|---|---|
| Fetching HTML from JavaScript-heavy, bot-protected pages via Scrapeless | Reliable captures instead of dead scrapes |
| AI validation and extraction of messy HTML into structured JSON | Clean, labeled content with source context intact |
| Embedding generation with a local Ollama model and storage in Qdrant | A searchable vector library ready for semantic search and RAG |
| Status webhooks on completion or failure | Visibility into every run without babysitting |
Example: What This Looks Like
Say you collect 20 competitor pages every week. Manually, you might spend about 10 minutes per page to copy sections, clean them up, and label sources, which is roughly 3 hours. With this workflow, you paste a URL once and let the pipeline run: a minute to launch it, a few minutes to scrape and extract, then vectors land in Qdrant automatically. You still review the output, but the repetitive work largely disappears.
What You’ll Need
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Scrapeless to fetch reliable web page HTML
- Qdrant to store and search vector embeddings
- Scrapeless API token (get it from your Scrapeless dashboard)
Skill level: Intermediate. You’ll paste API keys, run Qdrant/Ollama, and edit a couple of fields like collection name and webhook URL.
Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).
How It Works
Manual launch with a target URL. You start the workflow, and it sets the URL you want to capture plus a webhook destination for updates.
Reliable extraction from the open web. Scrapeless fetches the HTML, which helps when sites are JavaScript-heavy or try to block basic scrapers. A content validator then checks that the scrape returned real content, not an error page or empty shell.
AI turns HTML into structured data. The extraction script and enhancement agent reshape the raw page into clean JSON and readable text, so you’re not embedding random navigation menus and cookie banners. Honestly, this is where most DIY pipelines fail.
Vectors get stored (and you get notified). Ollama generates embeddings locally, Qdrant stores them in the right collection (creating it if it doesn’t exist), and a webhook notifier sends completion or error status so you’re not guessing.
You can easily modify the extraction schema to capture different fields based on your needs. See the full implementation guide below for customization options.
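In practice, changing the extraction schema means changing the JSON structure you ask Claude to return. A hedged sketch of that idea, where the schema fields are illustrative placeholders rather than the template’s exact schema:

```python
import json

# Illustrative schema: swap these fields for whatever your research needs
# (pricing, feature lists, author names, etc.).
EXTRACTION_SCHEMA = {
    "title": "string",
    "summary": "string",
    "key_points": ["string"],
    "published_date": "string or null",
}

def extraction_prompt(html: str) -> str:
    """Ask the model for JSON matching the schema, and nothing else."""
    return (
        "Extract the main content from this HTML and return ONLY JSON "
        f"matching this schema:\n{json.dumps(EXTRACTION_SCHEMA, indent=2)}\n\n"
        f"HTML:\n{html[:20000]}"  # truncate so long pages fit the context window
    )
```

Keeping the schema explicit in the prompt is what lets the downstream formatter parse the output reliably.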
Step-by-Step Implementation Guide
Step 1: Configure the Manual Trigger
Start the workflow with a manual trigger so you can test the full pipeline on demand.
- Add the Manual Launch Trigger node as your workflow trigger.
- Leave all settings at their defaults in Manual Launch Trigger.
- Connect Manual Launch Trigger to Verify Collection Presence.
Step 2: Connect Qdrant Collection Checks
This step ensures the vector collection exists before you send any embeddings.
- Open Verify Collection Presence and set URL to `http://localhost:6333/collections/hacker-news`.
- In Verify Collection Presence, keep Send Headers enabled and set header Content-Type to `application/json`.
- Configure Collection Presence Gate with the condition Left Value set to `{{ $node['Verify Collection Presence'].json.result ? $node['Verify Collection Presence'].json.status : 'not_found' }}` and Right Value set to `ok`.
- From Collection Presence Gate, connect the true branch to Assign URL & Webhook and the false branch to Provision Qdrant Collection.
- In Provision Qdrant Collection, set URL to `http://localhost:6333/collections/hacker-news` and Method to `PUT`.
- Connect Provision Qdrant Collection to Assign URL & Webhook.
⚠️ Common Pitfall: If Qdrant isn’t running on localhost:6333, this gate will always fail. Update the URLs in both Verify Collection Presence and Provision Qdrant Collection if your Qdrant host is different.
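Outside n8n, the same check-then-create gate looks like this. A minimal sketch using only the standard library: the 384 vector size assumes the all-minilm model, and Cosine distance is a common default rather than something the template confirms.

```python
import json
import urllib.request

QDRANT = "http://localhost:6333"   # change if your Qdrant lives elsewhere
COLLECTION = "hacker-news"
VECTOR_SIZE = 384                  # all-minilm output size; verify against your model

def collection_missing(check: dict) -> bool:
    """Mirror the gate condition: the collection exists only when the
    check call returned a result with status 'ok'."""
    status = check.get("status") if check.get("result") else "not_found"
    return status != "ok"

def ensure_collection() -> None:
    try:
        with urllib.request.urlopen(f"{QDRANT}/collections/{COLLECTION}") as r:
            check = json.load(r)
    except OSError:                # 404 or connection refused: treat as missing
        check = {}
    if collection_missing(check):
        body = json.dumps({"vectors": {"size": VECTOR_SIZE, "distance": "Cosine"}}).encode()
        req = urllib.request.Request(
            f"{QDRANT}/collections/{COLLECTION}",
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

The PUT body matches Qdrant’s create-collection API; if you change embedding models, `VECTOR_SIZE` must change with it.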
Step 3: Connect the Scraping Request
Configure the external scraping call that fetches the web page HTML.
- Open Assign URL & Webhook and keep the default settings (the node acts as a placeholder for future assignments).
- Open External Scrape Request and set URL to `https://api.scrapeless.com/api/v1/unlocker/request`.
- Set Method to `POST` and Specify Body to `json`.
- Set JSON Body to the provided payload (includes `"url": "https://news.ycombinator.com/"` and rendering options).
- In External Scrape Request, set header x-api-token to your API key value (currently `[CONFIGURE_YOUR_API_KEY]`).
⚠️ Common Pitfall: Leaving [CONFIGURE_YOUR_API_KEY] in the header will cause the scrape to fail. Replace it with a valid Scrapeless API token.
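For reference, the request the node sends can be sketched like this. Only the endpoint, the `url` field, and the `x-api-token` header come from the article; the rendering options live in the template’s provided payload and are not reproduced here.

```python
import json
import urllib.request

SCRAPELESS_URL = "https://api.scrapeless.com/api/v1/unlocker/request"

def scrape_request(url: str, api_token: str) -> urllib.request.Request:
    """Build the POST the workflow sends to Scrapeless. Check the template
    for the full JSON body, which adds rendering options to `url`."""
    body = {"url": url}  # the template's payload includes rendering options here
    return urllib.request.Request(
        SCRAPELESS_URL,
        data=json.dumps(body).encode(),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-api-token": api_token,  # never ship [CONFIGURE_YOUR_API_KEY]
        },
    )
```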
Step 4: Set Up the AI Extraction and Validation Path
The workflow branches after scraping to handle AI validation and AI-enhanced extraction. External Scrape Request outputs to both AI Content Validator and Claude Extraction Script in parallel.
- Open AI Content Validator and ensure the Claude API call uses the `https://api.anthropic.com/v1/messages` endpoint with header `x-api-key` set to your Anthropic key (currently `[CONFIGURE_YOUR_API_KEY]`).
- Connect AI Content Validator to Export Data Webhook to output a formatted file of extracted data.
- Open Claude Extraction Script and confirm it uses the same Anthropic endpoint and `x-api-key` header with your key.
- Connect Claude Extraction Script to AI Enhancement Agent, then to Claude Output Formatter to parse and normalize the JSON output.
⚠️ Common Pitfall: The Claude API key must be set in AI Content Validator, Claude Extraction Script, and AI Enhancement Agent (each uses [CONFIGURE_YOUR_API_KEY] in the request headers).
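Two pieces of this branch are worth seeing in code: the Anthropic Messages API request body, and the fence-stripping that an output formatter has to do because Claude often wraps JSON in markdown. A sketch, with the model name as a placeholder for whichever Claude model the template configures:

```python
import json

def claude_body(prompt: str, model: str = "claude-sonnet-4-5") -> dict:
    """Request body for POST https://api.anthropic.com/v1/messages
    (send with headers x-api-key and anthropic-version)."""
    return {
        "model": model,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }

def parse_claude_json(response_text: str) -> dict:
    """Strip a markdown code fence if the model added one, then parse.
    This mirrors what the output-formatter step must handle."""
    text = response_text.strip()
    if text.startswith("`"):          # fenced output, e.g. a ```json block
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)
```

The model’s reply text itself sits at `response["content"][0]["text"]` in the Messages API response, which is what you feed to `parse_claude_json`.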
Step 5: Vectorize and Store in Qdrant
Transform the extracted content into embeddings and save them into Qdrant.
- In Claude Output Formatter, keep the default JavaScript logic to parse and structure Claude output.
- Open Ollama Vectorizer and ensure the embeddings endpoint is set to `http://127.0.0.1:11434/api/embeddings` with model `all-minilm`.
- Connect Ollama Vectorizer to Qdrant Vector Saver to store vectors.
- In Qdrant Vector Saver, verify the storage endpoint `http://127.0.0.1:6333/collections/hacker-news/points` is correct for your Qdrant instance.
- Connect Qdrant Vector Saver to Webhook Status Notifier to emit success/error notifications.
⚠️ Common Pitfall: If Ollama is not running locally, Ollama Vectorizer will fail. Update the URL if your Ollama host differs.
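The embed-and-store hop boils down to two JSON bodies: one for Ollama’s embeddings endpoint and one for Qdrant’s points upsert. A minimal sketch (the network wiring at the bottom is commented out because it requires both services running locally):

```python
import json
import urllib.request

OLLAMA = "http://127.0.0.1:11434/api/embeddings"
QDRANT_POINTS = "http://127.0.0.1:6333/collections/hacker-news/points"

def embedding_body(text: str) -> dict:
    """Ollama's embeddings endpoint takes a model and a prompt and
    responds with {"embedding": [...]}."""
    return {"model": "all-minilm", "prompt": text}

def upsert_body(point_id: int, vector: list[float], payload: dict) -> dict:
    """Qdrant upsert shape for PUT .../points."""
    return {"points": [{"id": point_id, "vector": vector, "payload": payload}]}

def post_json(url: str, body: dict, method: str = "POST") -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(), method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

# Wiring, with Ollama and Qdrant running locally:
# vec = post_json(OLLAMA, embedding_body("some extracted text"))["embedding"]
# post_json(QDRANT_POINTS, upsert_body(1, vec, {"source_url": "..."}), method="PUT")
```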
Step 6: Configure Output Webhooks
Send results to external platforms for visibility and archival.
- In Export Data Webhook, add webhook URLs for `discord`, `slack`, `linear`, `teams`, or `telegram` if you want file outputs.
- In Webhook Status Notifier, add webhook URLs for `discord`, `slack`, `teams`, `telegram`, or `custom` to receive status alerts.
- Leave any unused webhook fields blank to skip notifications for those platforms.
Step 7: Test and Activate Your Workflow
Run the workflow manually to validate all branches, then activate for production use.
- Click Execute Workflow on Manual Launch Trigger to run the full pipeline.
- Confirm that External Scrape Request completes and that AI Content Validator and Claude Extraction Script run in parallel.
- Check that Ollama Vectorizer returns a `vector` and Qdrant Vector Saver reports `success: true`.
- Verify notifications in Webhook Status Notifier and exported data outputs in Export Data Webhook.
- When satisfied, switch the workflow to Active to use it in production.
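A quick way to confirm the stored data is actually retrievable: embed a test query with the same all-minilm model, then POST a search body like the one sketched here to `http://127.0.0.1:6333/collections/hacker-news/points/search`.

```python
def search_body(vector: list[float], limit: int = 3) -> dict:
    """Qdrant nearest-neighbour search body; with_payload returns the
    stored source metadata (URL, title, text) alongside each hit."""
    return {"vector": vector, "limit": limit, "with_payload": True}
```

The query vector must come from the same embedding model used at write time, or scores will be meaningless.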
Common Gotchas
- Scrapeless credentials can expire or need specific permissions. If things break, check your Scrapeless API token in the n8n HTTP Request node first.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Qdrant collection names and dimensions must match what you’re storing. If vectors fail to insert, check the collection settings and confirm your Ollama embedding model output matches.
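The dimension mismatch in the last gotcha surfaces as an opaque Qdrant 400. A small guard like this sketch, run before the upsert, turns it into an actionable error:

```python
def check_dims(embedding: list[float], collection_size: int) -> None:
    """Fail fast with a clear message instead of a cryptic Qdrant 400."""
    if len(embedding) != collection_size:
        raise ValueError(
            f"Embedding has {len(embedding)} dims but the collection "
            f"expects {collection_size}; model and collection must match."
        )
```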
Frequently Asked Questions
How long does setup take?
Plan for about 60 minutes if Qdrant and Ollama aren’t installed yet.
Do I need to know how to code?
No. You’ll mostly paste credentials and edit a few fields like URL, collection name, and webhook destination.
Is n8n free to use?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in Scrapeless usage and any AI model costs if you run hosted LLMs.
Where should I host n8n?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I use this for PDFs instead of web pages?
Yes, but you’ll swap the “External Scrape Request” step for a file ingestion path. If your files live in Drive, use a Google Drive Trigger to grab the PDF, then read it with Read PDF (or Read Binary File if needed) before sending the extracted text into the same extraction, embedding, and Qdrant save steps. Common tweaks include adding metadata like document type, client name, and a stable file URL.
What should I check if a scrape fails?
Most of the time it’s an API token issue. Regenerate your Scrapeless token, update it in the HTTP Request node, and re-run a single URL to confirm. If the scrape returns a “success” but the content is empty, the site may require different Scrapeless settings (render mode, headers, or geo). Also check your webhook notifier output, because failures sometimes show up there first.
How many pages can I process?
On self-hosted n8n, there’s no hard execution limit, so capacity mostly depends on your server and Scrapeless limits.
Is n8n a better fit than Zapier or Make for this?
Often, yes, because this pipeline needs branching logic (like “create the Qdrant collection only if it’s missing”), code steps, and tight control over the payload you store. n8n also lets you self-host, which matters when you’re processing lots of pages and don’t want every run billed as a premium task. Zapier or Make can still work if you keep it simple, but you’ll usually end up fighting limitations once you add validation, formatting, and embeddings. If you’re unsure, decide based on where you want the data to live and how much you’ll scale. Talk to an automation expert and we’ll map the cleanest option.
Once your pages land in Qdrant, they stop being “stuff you read” and become “stuff you can use.” Set it up once, then let search do the heavy lifting.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.