Scrapeless to Qdrant, pages become searchable data
Copying chunks of web pages into docs or spreadsheets feels quick. Then you try to find that one quote later, realize you lost the source URL, and the “research” turns into a messy pile.
Marketing leads doing competitive research get hit hardest, but agency owners building client knowledge bases and ops folks maintaining internal wikis run into the same wall. This Scrapeless-to-Qdrant automation turns web pages into searchable data, so you can retrieve answers in seconds instead of re-reading tabs for an hour.
You’ll see how the workflow pulls reliable HTML, cleans it with AI, creates embeddings, and stores everything in Qdrant with status webhooks for visibility.
How This Automation Works
The full n8n workflow, from trigger to final output:
n8n Workflow Template: Scrapeless to Qdrant, pages become searchable data
flowchart LR
subgraph sg0["When clicking 'Test workflow' Flow"]
direction LR
n0@{ icon: "mdi:play-circle", form: "rounded", label: "When clicking 'Test workflow'", pos: "b", h: 48 }
n1@{ icon: "mdi:swap-vertical", form: "rounded", label: "Set Fields - URL and Webhook..", pos: "b", h: 48 }
n2["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Scrapeless Web Request"]
n3["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Format Claude Output"]
n4["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Check Collection Exists"]
n5@{ icon: "mdi:swap-horizontal", form: "rounded", label: "Collection Exists Check", pos: "b", h: 48 }
n6["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/httprequest.dark.svg' width='40' height='40' /></div><br/>Create Qdrant Collection"]
n7["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Claude Data extractor"]
n8["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Ollama Embeddings"]
n9["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Qdrant Vector store"]
n10["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Claude AI Agent"]
n11["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Webhook for structured AI ag.."]
n12["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>Export data webhook"]
n13["<div style='background:#f5f5f5;padding:10px;border-radius:8px;display:inline-block;border:1px solid #e0e0e0'><img src='https://flowpast.com/wp-content/uploads/n8n-workflow-icons/code.svg' width='40' height='40' /></div><br/>AI Data Checker"]
n13 --> n12
n10 --> n3
n8 --> n9
n9 --> n11
n3 --> n8
n7 --> n10
n2 --> n13
n2 --> n7
n4 --> n5
n5 --> n1
n5 --> n6
n6 --> n1
n0 --> n4
n1 --> n2
end
%% Styling
classDef trigger fill:#e8f5e9,stroke:#388e3c,stroke-width:2px
classDef ai fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
classDef aiModel fill:#e8eaf6,stroke:#3f51b5,stroke-width:2px
classDef decision fill:#fff8e1,stroke:#f9a825,stroke-width:2px
classDef database fill:#fce4ec,stroke:#c2185b,stroke-width:2px
classDef api fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef code fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px
classDef disabled stroke-dasharray: 5 5,opacity: 0.5
class n0 trigger
class n5 decision
class n2,n4,n6 api
class n3,n7,n8,n9,n10,n11,n12,n13 code
classDef customIcon fill:none,stroke:none
class n2,n3,n4,n6,n7,n8,n9,n10,n11,n12,n13 customIcon
The Problem: Web Research Doesn’t Stay Searchable
Web pages are a terrible long-term storage format for anything you need to reuse. They change, load weirdly, block scrapers, and bury the one key sentence you actually cared about. So teams do the “manual safety net”: copy, paste, tidy, label, and hope future-you remembers where it came from. It’s not just time. It’s mental load, duplicated work, and mistakes that quietly spread into briefs, decks, and client recommendations.
It adds up fast. Here’s where it breaks down in real life.
- You lose context when the source URL, page title, and timestamp aren’t stored alongside the notes.
- JavaScript-heavy sites and anti-bot protections make “quick scraping” unreliable, so the workflow dies right at the start.
- Even when you capture the page, you end up with unstructured text, which means search is basically Ctrl+F and luck.
- Teams re-collect the same info every month because nothing is indexed in a way an agent or teammate can query.
The Solution: Turn Web Pages Into a Reusable Vector Library
This workflow creates a clean pipeline from “URL I need to understand” to “searchable knowledge I can reuse.” It starts when you manually launch the workflow in n8n (perfect for ad hoc research sprints), then assigns a target URL and a webhook destination for updates. Scrapeless fetches the page HTML in a way that handles tough sites more reliably than basic scrapers. From there, an AI validation step checks that you actually got meaningful content, and an AI extraction script turns messy HTML into structured JSON you can trust.
Next, the workflow enhances and formats the extracted content, then generates embeddings using a local Ollama model (all-minilm). Finally, it checks Qdrant for the right collection, creates it if needed, saves the vectors, and notifies you via webhook when the run completes or fails. You end up with a searchable vector database that’s ready for semantic search, internal tools, or RAG agents.
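To make “searchable knowledge” concrete, here is a minimal Python sketch of the kind of point this pipeline ends up storing in Qdrant: the embedding plus the source context (URL, title, capture time) that manual copy-paste loses. The payload field names are illustrative, not the template’s exact schema, and the 384-dimension vector assumes the all-minilm embedding model.

```python
import time
import uuid

def build_point(vector: list[float], text: str, url: str, title: str) -> dict:
    """Shape one Qdrant point: the embedding plus the context that
    copy-paste research usually loses (source URL, title, capture time)."""
    return {
        "id": str(uuid.uuid4()),
        "vector": vector,
        "payload": {
            "text": text,
            "source_url": url,
            "title": title,
            "captured_at": int(time.time()),
        },
    }

# Illustrative usage with a dummy 384-dim vector (all-minilm's output size):
point = build_point([0.1] * 384, "Example chunk", "https://news.ycombinator.com/", "Hacker News")
```

Because the URL and title travel with every vector, search results can always cite their source.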
What You Get: Automation vs. Results
| What This Workflow Automates | Results You’ll Get |
|---|---|
| Fetching HTML from JavaScript-heavy, bot-protected pages via Scrapeless | Reliable captures instead of dead scrapes |
| AI validation and extraction of messy HTML into structured JSON | Clean, labeled content with source context intact |
| Embedding generation with a local Ollama model and storage in Qdrant | A searchable vector library ready for semantic search and RAG |
| Status webhooks on completion or failure | Visibility into every run without babysitting |
Example: What This Looks Like
Say you collect 20 competitor pages every week. Manually, you might spend about 10 minutes per page to copy sections, clean them up, and label sources, which is roughly 3 hours. With this workflow, you paste a URL once and let the pipeline run: a minute to launch it, a few minutes to scrape and extract, then vectors land in Qdrant automatically. You still review the output, but the repetitive work largely disappears.
What You’ll Need
- n8n instance (try n8n Cloud free)
- Self-hosting option if you prefer (Hostinger works well)
- Scrapeless to fetch reliable web page HTML
- Qdrant to store and search vector embeddings
- Scrapeless API token (get it from your Scrapeless dashboard)
Skill level: Intermediate. You’ll paste API keys, run Qdrant/Ollama, and edit a couple of fields like collection name and webhook URL.
Don’t want to set this up yourself? Talk to an automation expert (free 15-minute consultation).
How It Works
Manual launch with a target URL. You start the workflow, and it sets the URL you want to capture plus a webhook destination for updates.
Reliable extraction from the open web. Scrapeless fetches the HTML, which helps when sites are JavaScript-heavy or try to block basic scrapers. A content validator then checks that the scrape returned real content, not an error page or empty shell.
AI turns HTML into structured data. The extraction script and enhancement agent reshape the raw page into clean JSON and readable text, so you’re not embedding random navigation menus and cookie banners. Honestly, this is where most DIY pipelines fail.
Vectors get stored (and you get notified). Ollama generates embeddings locally, Qdrant stores them in the right collection (creating it if it doesn’t exist), and a webhook notifier sends completion or error status so you’re not guessing.
You can easily modify the extraction schema to capture different fields based on your needs. See the full implementation guide below for customization options.
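In practice, changing the extraction schema means changing the JSON structure you ask Claude to return. A hedged sketch of that idea, where the schema fields are illustrative placeholders rather than the template’s exact schema:

```python
import json

# Illustrative schema: swap these fields for whatever your research needs
# (pricing, feature lists, author names, etc.).
EXTRACTION_SCHEMA = {
    "title": "string",
    "summary": "string",
    "key_points": ["string"],
    "published_date": "string or null",
}

def extraction_prompt(html: str) -> str:
    """Ask the model for JSON matching the schema, and nothing else."""
    return (
        "Extract the main content from this HTML and return ONLY JSON "
        f"matching this schema:\n{json.dumps(EXTRACTION_SCHEMA, indent=2)}\n\n"
        f"HTML:\n{html[:20000]}"  # truncate so long pages fit the context window
    )
```

Keeping the schema explicit in the prompt is what lets the downstream formatter parse the output reliably.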
Step-by-Step Implementation Guide
Step 1: Configure the Manual Trigger
Start the workflow with a manual trigger so you can test the full pipeline on demand.
- Add the Manual Launch Trigger node as your workflow trigger.
- Leave all settings at their defaults in Manual Launch Trigger.
- Connect Manual Launch Trigger to Verify Collection Presence.
Step 2: Connect Qdrant Collection Checks
This step ensures the vector collection exists before you send any embeddings.
- Open Verify Collection Presence and set URL to `http://localhost:6333/collections/hacker-news`.
- In Verify Collection Presence, keep Send Headers enabled and set header Content-Type to `application/json`.
- Configure Collection Presence Gate with the condition Left Value set to `{{ $node['Verify Collection Presence'].json.result ? $node['Verify Collection Presence'].json.status : 'not_found' }}` and Right Value set to `ok`.
- From Collection Presence Gate, connect the true branch to Assign URL & Webhook and the false branch to Provision Qdrant Collection.
- In Provision Qdrant Collection, set URL to `http://localhost:6333/collections/hacker-news` and Method to `PUT`.
- Connect Provision Qdrant Collection to Assign URL & Webhook.
⚠️ Common Pitfall: If Qdrant isn’t running on localhost:6333, this gate will always fail. Update the URLs in both Verify Collection Presence and Provision Qdrant Collection if your Qdrant host is different.
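Outside n8n, the same check-then-create gate looks like this. A minimal sketch using only the standard library: the 384 vector size assumes the all-minilm model, and Cosine distance is a common default rather than something the template confirms.

```python
import json
import urllib.request

QDRANT = "http://localhost:6333"   # change if your Qdrant lives elsewhere
COLLECTION = "hacker-news"
VECTOR_SIZE = 384                  # all-minilm output size; verify against your model

def collection_missing(check: dict) -> bool:
    """Mirror the gate condition: the collection exists only when the
    check call returned a result with status 'ok'."""
    status = check.get("status") if check.get("result") else "not_found"
    return status != "ok"

def ensure_collection() -> None:
    try:
        with urllib.request.urlopen(f"{QDRANT}/collections/{COLLECTION}") as r:
            check = json.load(r)
    except OSError:                # 404 or connection refused: treat as missing
        check = {}
    if collection_missing(check):
        body = json.dumps({"vectors": {"size": VECTOR_SIZE, "distance": "Cosine"}}).encode()
        req = urllib.request.Request(
            f"{QDRANT}/collections/{COLLECTION}",
            data=body,
            method="PUT",
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)
```

The PUT body matches Qdrant’s create-collection API; if you change embedding models, `VECTOR_SIZE` must change with it.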
Step 3: Connect the Scraping Request
Configure the external scraping call that fetches the web page HTML.
- Open Assign URL & Webhook and keep the default settings (the node acts as a placeholder for future assignments).
- Open External Scrape Request and set URL to `https://api.scrapeless.com/api/v1/unlocker/request`.
- Set Method to `POST` and Specify Body to `json`.
- Set JSON Body to the provided payload (includes `"url": "https://news.ycombinator.com/"` and rendering options).
- In External Scrape Request, set header x-api-token to your API key value (currently `[CONFIGURE_YOUR_API_KEY]`).
⚠️ Common Pitfall: Leaving [CONFIGURE_YOUR_API_KEY] in the header will cause the scrape to fail. Replace it with a valid Scrapeless API token.
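For reference, the request the node sends can be sketched like this. Only the endpoint, the `url` field, and the `x-api-token` header come from the article; the rendering options live in the template’s provided payload and are not reproduced here.

```python
import json
import urllib.request

SCRAPELESS_URL = "https://api.scrapeless.com/api/v1/unlocker/request"

def scrape_request(url: str, api_token: str) -> urllib.request.Request:
    """Build the POST the workflow sends to Scrapeless. Check the template
    for the full JSON body, which adds rendering options to `url`."""
    body = {"url": url}  # the template's payload includes rendering options here
    return urllib.request.Request(
        SCRAPELESS_URL,
        data=json.dumps(body).encode(),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "x-api-token": api_token,  # never ship [CONFIGURE_YOUR_API_KEY]
        },
    )
```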
Step 4: Set Up the AI Extraction and Validation Path
The workflow branches after scraping to handle AI validation and AI-enhanced extraction. External Scrape Request outputs to both AI Content Validator and Claude Extraction Script in parallel.
- Open AI Content Validator and ensure the Claude API call uses the `https://api.anthropic.com/v1/messages` endpoint with header `x-api-key` set to your Anthropic key (currently `[CONFIGURE_YOUR_API_KEY]`).
- Connect AI Content Validator to Export Data Webhook to output a formatted file of extracted data.
- Open Claude Extraction Script and confirm it uses the same Anthropic endpoint and `x-api-key` header with your key.
- Connect Claude Extraction Script to AI Enhancement Agent, then to Claude Output Formatter to parse and normalize the JSON output.
⚠️ Common Pitfall: The Claude API key must be set in AI Content Validator, Claude Extraction Script, and AI Enhancement Agent (each uses [CONFIGURE_YOUR_API_KEY] in the request headers).
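Two pieces of this branch are worth seeing in code: the Anthropic Messages API request body, and the fence-stripping that an output formatter has to do because Claude often wraps JSON in markdown. A sketch, with the model name as a placeholder for whichever Claude model the template configures:

```python
import json

def claude_body(prompt: str, model: str = "claude-sonnet-4-5") -> dict:
    """Request body for POST https://api.anthropic.com/v1/messages
    (send with headers x-api-key and anthropic-version)."""
    return {
        "model": model,
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }

def parse_claude_json(response_text: str) -> dict:
    """Strip a markdown code fence if the model added one, then parse.
    This mirrors what the output-formatter step must handle."""
    text = response_text.strip()
    if text.startswith("`"):          # fenced output, e.g. a ```json block
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    return json.loads(text)
```

The model’s reply text itself sits at `response["content"][0]["text"]` in the Messages API response, which is what you feed to `parse_claude_json`.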
Step 5: Vectorize and Store in Qdrant
Transform the extracted content into embeddings and save them into Qdrant.
- In Claude Output Formatter, keep the default JavaScript logic to parse and structure Claude output.
- Open Ollama Vectorizer and ensure the embeddings endpoint is set to `http://127.0.0.1:11434/api/embeddings` with model `all-minilm`.
- Connect Ollama Vectorizer to Qdrant Vector Saver to store vectors.
- In Qdrant Vector Saver, verify the storage endpoint `http://127.0.0.1:6333/collections/hacker-news/points` is correct for your Qdrant instance.
- Connect Qdrant Vector Saver to Webhook Status Notifier to emit success/error notifications.
⚠️ Common Pitfall: If Ollama is not running locally, Ollama Vectorizer will fail. Update the URL if your Ollama host differs.
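The embed-and-store hop boils down to two JSON bodies: one for Ollama’s embeddings endpoint and one for Qdrant’s points upsert. A minimal sketch (the network wiring at the bottom is commented out because it requires both services running locally):

```python
import json
import urllib.request

OLLAMA = "http://127.0.0.1:11434/api/embeddings"
QDRANT_POINTS = "http://127.0.0.1:6333/collections/hacker-news/points"

def embedding_body(text: str) -> dict:
    """Ollama's embeddings endpoint takes a model and a prompt and
    responds with {"embedding": [...]}."""
    return {"model": "all-minilm", "prompt": text}

def upsert_body(point_id: int, vector: list[float], payload: dict) -> dict:
    """Qdrant upsert shape for PUT .../points."""
    return {"points": [{"id": point_id, "vector": vector, "payload": payload}]}

def post_json(url: str, body: dict, method: str = "POST") -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(body).encode(), method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as r:
        return json.load(r)

# Wiring, with Ollama and Qdrant running locally:
# vec = post_json(OLLAMA, embedding_body("some extracted text"))["embedding"]
# post_json(QDRANT_POINTS, upsert_body(1, vec, {"source_url": "..."}), method="PUT")
```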
Step 6: Configure Output Webhooks
Send results to external platforms for visibility and archival.
- In Export Data Webhook, add webhook URLs for `discord`, `slack`, `linear`, `teams`, or `telegram` if you want file outputs.
- In Webhook Status Notifier, add webhook URLs for `discord`, `slack`, `teams`, `telegram`, or `custom` to receive status alerts.
- Leave any unused webhook fields blank to skip notifications for those platforms.
Step 7: Test and Activate Your Workflow
Run the workflow manually to validate all branches, then activate for production use.
- Click Execute Workflow on Manual Launch Trigger to run the full pipeline.
- Confirm that External Scrape Request completes and that AI Content Validator and Claude Extraction Script run in parallel.
- Check that Ollama Vectorizer returns a `vector` and Qdrant Vector Saver reports `success: true`.
- Verify notifications in Webhook Status Notifier and exported data outputs in Export Data Webhook.
- When satisfied, switch the workflow to Active to use it in production.
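A quick way to confirm the stored data is actually retrievable: embed a test query with the same all-minilm model, then POST a search body like the one sketched here to `http://127.0.0.1:6333/collections/hacker-news/points/search`.

```python
def search_body(vector: list[float], limit: int = 3) -> dict:
    """Qdrant nearest-neighbour search body; with_payload returns the
    stored source metadata (URL, title, text) alongside each hit."""
    return {"vector": vector, "limit": limit, "with_payload": True}
```

The query vector must come from the same embedding model used at write time, or scores will be meaningless.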
Common Gotchas
- Scrapeless credentials can expire or need specific permissions. If things break, check your Scrapeless API token in the n8n HTTP Request node first.
- If you’re using Wait nodes or external rendering, processing times vary. Bump up the wait duration if downstream nodes fail on empty responses.
- Qdrant collection names and dimensions must match what you’re storing. If vectors fail to insert, check the collection settings and confirm your Ollama embedding model output matches.
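The dimension mismatch in the last gotcha surfaces as an opaque Qdrant 400. A small guard like this sketch, run before the upsert, turns it into an actionable error:

```python
def check_dims(embedding: list[float], collection_size: int) -> None:
    """Fail fast with a clear message instead of a cryptic Qdrant 400."""
    if len(embedding) != collection_size:
        raise ValueError(
            f"Embedding has {len(embedding)} dims but the collection "
            f"expects {collection_size}; model and collection must match."
        )
```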
Frequently Asked Questions
How long does setup take?
Plan for about 60 minutes if Qdrant and Ollama aren’t installed yet.
Do I need to know how to code?
No. You’ll mostly paste credentials and edit a few fields like URL, collection name, and webhook destination.
Is n8n free to use?
Yes. n8n has a free self-hosted option and a free trial on n8n Cloud. Cloud plans start at $20/month for higher volume. You’ll also need to factor in Scrapeless usage and any AI model costs if you run hosted LLMs.
Where should I host n8n?
Two options: n8n Cloud (managed, easiest setup) or self-hosting on a VPS. For self-hosting, Hostinger VPS is affordable and handles n8n well. Self-hosting gives you unlimited executions but requires basic server management.
Can I use this for PDFs instead of web pages?
Yes, but you’ll swap the “External Scrape Request” step for a file ingestion path. If your files live in Drive, use a Google Drive Trigger to grab the PDF, then read it with Read PDF (or Read Binary File if needed) before sending the extracted text into the same extraction, embedding, and Qdrant save steps. Common tweaks include adding metadata like document type, client name, and a stable file URL.
What should I check if a scrape fails?
Most of the time it’s an API token issue. Regenerate your Scrapeless token, update it in the HTTP Request node, and re-run a single URL to confirm. If the scrape returns a “success” but the content is empty, the site may require different Scrapeless settings (render mode, headers, or geo). Also check your webhook notifier output, because failures sometimes show up there first.
How many pages can I process?
On self-hosted n8n, there’s no hard execution limit, so capacity mostly depends on your server and Scrapeless limits.
Is n8n a better fit than Zapier or Make for this?
Often, yes, because this pipeline needs branching logic (like “create the Qdrant collection only if it’s missing”), code steps, and tight control over the payload you store. n8n also lets you self-host, which matters when you’re processing lots of pages and don’t want every run billed as a premium task. Zapier or Make can still work if you keep it simple, but you’ll usually end up fighting limitations once you add validation, formatting, and embeddings. If you’re unsure, decide based on where you want the data to live and how much you’ll scale. Talk to an automation expert and we’ll map the cleanest option.
Once your pages land in Qdrant, they stop being “stuff you read” and become “stuff you can use.” Set it up once, then let search do the heavy lifting.
Need Help Setting This Up?
Our automation experts can build and customize this workflow for your specific needs. Free 15-minute consultation—no commitment required.