
[RAG] [ingestion01] Advanced Ingestion Pipeline Example

news 2026/7/3 7:04:38

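1. Case Objective

This example demonstrates how to use LlamaIndex to build an advanced ingestion pipeline with the following features:

Redis caching, to avoid reprocessing identical content
Automatic insertion into a vector database
Custom text transformations
An optimized document processing flow

By working through this example, readers can learn how to build an efficient, extensible document processing pipeline suited to large-scale document processing and retrieval.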

2. Tech Stack and Core Dependencies

LlamaIndex

Redis

Weaviate

HuggingFace

Python

Jupyter Notebook

Core dependency packages:

llama-index-vector-stores-weaviate llama-index-embeddings-huggingface llama-index weaviate-client

These dependencies provide support for vector storage, embedding models, document processing, and caching.

3. Environment Setup

Step 1: Install the required dependencies

%pip install llama-index-vector-stores-weaviate
%pip install llama-index-embeddings-huggingface
%pip install llama-index
%pip install weaviate-client

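Step 2: Configure the Redis cache

from llama_index.core.ingestion import IngestionCache
from llama_index.core.ingestion.cache import RedisCache

# Cache transformation results in Redis under the "my_test_cache" collection.
ingest_cache = IngestionCache(
    cache=RedisCache.from_host_and_port(host="127.0.0.1", port=6379),
    collection="my_test_cache",
)

Note: make sure the Redis service is started and listening on 127.0.0.1:6379. Depending on the installed LlamaIndex version, the Redis store may instead ship in the separate llama-index-storage-kvstore-redis package, in which case the import would be `from llama_index.storage.kvstore.redis import RedisKVStore as RedisCache`.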
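Before running the pipeline, it can help to confirm that something is listening on the Redis port. The following is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful TCP connection does not prove that the listener is actually Redis.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Host and port match the RedisCache configuration in step 2.
    if is_port_open("127.0.0.1", 6379):
        print("Port 6379 is reachable")
    else:
        print("Nothing is listening on 127.0.0.1:6379; start Redis first")
```

A check like this fails fast with a readable message, instead of surfacing a connection error in the middle of an ingestion run.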

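Step 3: Configure the Weaviate vector database

import weaviate
from llama_index.vector_stores.weaviate import WeaviateVectorStore

# Authenticate against the Weaviate instance (weaviate-client v3 style API).
auth_config = weaviate.AuthApiKey(api_key="...")
client = weaviate.Client(url="https://...", auth_client_secret=auth_config)

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="CachingTest"
)

Note: replace the API key and URL with the real values for your Weaviate instance. This snippet uses the `weaviate.Client` interface from weaviate-client v3; newer client releases use a different connection API.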

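Step 4: Configure the text splitter and embedding model

from llama_index.core.node_parser import TokenTextSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Split documents into chunks of at most 512 tokens.
text_splitter = TokenTextSplitter(chunk_size=512)
# Embed each chunk with a small English BGE model from HuggingFace.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")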
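To make the role of `chunk_size` concrete, here is a rough sketch of fixed-size chunking with overlap. It is not how `TokenTextSplitter` is implemented: it splits on whitespace instead of using a real tokenizer, and `chunk_tokens` with its `overlap` parameter is a hypothetical helper introduced only for illustration.

```python
from typing import List


def chunk_tokens(text: str, chunk_size: int, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most chunk_size whitespace tokens.

    Consecutive chunks share `overlap` tokens so that context is not cut
    off abruptly at chunk boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    sample = "a b c d e f g h i j"
    # Prints ['a b c d', 'd e f g', 'g h i j']
    print(chunk_tokens(sample, chunk_size=4, overlap=1))
```

Smaller chunks give more precise retrieval hits but more embeddings to compute and store; the value 512 used above is a common middle ground rather than a requirement.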

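4. Implementation

4.1 Custom text transformer

First, create a custom text transformer that strips special characters from the documents:

import re

from llama_index.core.schema import TransformComponent


class TextCleaner(TransformComponent):
    """Remove every character that is not an ASCII letter, digit, or space."""

    def __call__(self, nodes, **kwargs):
        for node in nodes:
            node.text = re.sub(r"[^0-9A-Za-z ]", "", node.text)
        return nodes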
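It is worth seeing what this pattern actually removes. The sketch below applies the same regular expression to a plain string, outside LlamaIndex; `clean_text` is a hypothetical stand-alone function. Note that the pattern also drops punctuation, newlines, and every non-ASCII character, which may be more aggressive than intended for some corpora.

```python
import re

# Same character class as in TextCleaner: keep ASCII letters, digits, spaces.
PATTERN = re.compile(r"[^0-9A-Za-z ]")


def clean_text(text: str) -> str:
    """Remove every character that is not an ASCII letter, digit, or space."""
    return PATTERN.sub("", text)


if __name__ == "__main__":
    # Prints: Hello world Its 2024Next line
    print(clean_text("Hello, world! It's 2024.\nNext line."))
    # Non-ASCII letters are removed as well. Prints: caf
    print(clean_text("café"))
```

Because newlines are deleted rather than replaced by a space, words on either side of a line break are joined ("2024Next" above). Replacing matches with a space instead would avoid that, at the cost of extra whitespace.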

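4.2 Building the ingestion pipeline

Create an ingestion pipeline made up of several transformation steps:

from llama_index.core.extractors import TitleExtractor
from llama_index.core.ingestion import IngestionPipeline

pipeline = IngestionPipeline(
    transformations=[
        TextCleaner(),
        text_splitter,
        embed_model,
        TitleExtractor(),  # uses an LLM, so an LLM must be configured first
    ],
    vector_store=vector_store,  # nodes are inserted here automatically
    cache=ingest_cache,  # results of each step are cached in Redis
)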
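Conceptually, the pipeline feeds the output of each transformation into the next one, in list order. The following is a simplified sketch of that idea in plain Python, not LlamaIndex's implementation; `run_pipeline` and the two toy transformations are hypothetical.

```python
from typing import Callable, List

# A transformation takes a list of text items and returns a new list.
Transformation = Callable[[List[str]], List[str]]


def run_pipeline(items: List[str], transformations: List[Transformation]) -> List[str]:
    """Apply each transformation in order, passing results to the next step."""
    for transform in transformations:
        items = transform(items)
    return items


def to_lower(items: List[str]) -> List[str]:
    """Toy cleaner: lowercase every item."""
    return [item.lower() for item in items]


def split_on_period(items: List[str]) -> List[str]:
    """Toy splitter: break each item into sentences at periods."""
    return [part.strip() for item in items for part in item.split(".") if part.strip()]


if __name__ == "__main__":
    docs = ["Hello World. Second Sentence."]
    # Prints ['hello world', 'second sentence']
    print(run_pipeline(docs, [to_lower, split_on_period]))
```

Order matters in such a chain: a cleaner that strips punctuation before a punctuation-aware splitter runs leaves the splitter with fewer boundaries to work with, so it is worth deciding deliberately where a step like `TextCleaner` sits.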

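4.3 Loading documents and running the pipeline

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("../data/paul_graham/").load_data()
# Run all transformations; the resulting nodes are also written to the vector store.
nodes = pipeline.run(documents=documents)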

4.4 Creating a query engine from the vector store

import os

from llama_index.core import VectorStoreIndex

# The default query engine uses an OpenAI LLM to synthesize answers.
os.environ["OPENAI_API_KEY"] = "sk-..."

index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    embed_model=embed_model,
)
query_engine = index.as_query_engine()
print(query_engine.query("What did the author do growing up?"))

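4.5 Testing the cache

Rerun the pipeline with the same cache to test caching. Steps whose inputs are unchanged should be served from Redis instead of being recomputed:

pipeline = IngestionPipeline(
    transformations=[TextCleaner(), text_splitter, embed_model],
    cache=ingest_cache,
)
nodes = pipeline.run(documents=documents)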
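A cache can skip work because each step's result can be stored under a key derived from the input content and the transformation that produced it. Below is a minimal in-memory sketch of that idea, assuming a simple SHA-256 key; `CachedStep` and its plain dict are hypothetical stand-ins for `IngestionCache` backed by Redis, and LlamaIndex's actual key scheme may differ.

```python
import hashlib
from typing import Callable, Dict, List


class CachedStep:
    """Wrap a transformation so repeated inputs are served from a cache."""

    def __init__(
        self,
        name: str,
        func: Callable[[List[str]], List[str]],
        cache: Dict[str, List[str]],
    ) -> None:
        self.name = name
        self.func = func
        self.cache = cache
        self.calls = 0  # number of times func actually ran

    def _key(self, items: List[str]) -> str:
        # The key depends on both the step name and the exact input content.
        digest = hashlib.sha256()
        digest.update(self.name.encode("utf-8"))
        for item in items:
            digest.update(b"\x00")  # separator so item boundaries affect the key
            digest.update(item.encode("utf-8"))
        return digest.hexdigest()

    def __call__(self, items: List[str]) -> List[str]:
        key = self._key(items)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = self.func(items)
        # Return a copy so callers cannot mutate the cached result.
        return list(self.cache[key])


if __name__ == "__main__":
    shared_cache: Dict[str, List[str]] = {}
    upper = CachedStep("upper", lambda items: [s.upper() for s in items], shared_cache)
    docs = ["first document", "second document"]
    first = upper(docs)
    second = upper(docs)  # served from the cache, func is not called again
    # Prints: True 1
    print(first == second, upper.calls)
```

Changing either the input text or the step itself produces a different key, which is why editing a document or swapping a transformation triggers recomputation for that step only.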