江蘇省建設(shè)安全協(xié)會(huì)網(wǎng)站,手機(jī)網(wǎng)站制作建設(shè),關(guān)聯(lián)詞有哪些三年級(jí),南寧有名的seo費(fèi)用一、核心設(shè)計(jì)原則整頁為單 Chunk#xff1a;將單頁保險(xiǎn)文檔作為 1 個(gè)檢索單元#xff08;Chunk#xff09;#xff0c;保留內(nèi)容邏輯關(guān)聯(lián)性#xff1b; 元數(shù)據(jù)對齊#xff1a;文檔入庫的元數(shù)據(jù)字段與提問提取的元數(shù)據(jù)字段完全一致#xff0c;確保過濾檢索精準(zhǔn)#xff…一、核心設(shè)計(jì)原則整頁為單 Chunk將單頁保險(xiǎn)文檔作為 1 個(gè)檢索單元Chunk保留內(nèi)容邏輯關(guān)聯(lián)性元數(shù)據(jù)對齊文檔入庫的元數(shù)據(jù)字段與提問提取的元數(shù)據(jù)字段完全一致確保過濾檢索精準(zhǔn)混合檢索元數(shù)據(jù)過濾精準(zhǔn)定位 Chunk 向量 / 關(guān)鍵詞檢索匹配 Chunk 內(nèi)內(nèi)容兼顧精度與效率。二、流程總覽原始保險(xiǎn)文檔OCR文本→ 提取文檔元數(shù)據(jù) → 整頁Chunk元數(shù)據(jù)上傳至Dify知識(shí)庫 ↑ ↓ 用戶提問 → 提取提問元數(shù)據(jù) → 元數(shù)據(jù)過濾檢索Dify知識(shí)庫 → 獲取匹配Chunk → LLM生成回答三、第一步文檔元數(shù)據(jù)提取整頁 Chunk 入庫Dify1. 定義通用元數(shù)據(jù)字段保險(xiǎn)類文檔適配元數(shù)據(jù)字段字段類型說明通用化doc_type字符串文檔類型如 “保險(xiǎn)產(chǎn)品介紹”“保險(xiǎn)條款”issuer字符串發(fā)行機(jī)構(gòu)如 “保險(xiǎn)公司名稱”update_time字符串文檔更新時(shí)間如 “YYYY 年 MM 月”applicable_area數(shù)組適用地區(qū)如 [“香港”,“澳門”]supported_currencies數(shù)組支持的保單貨幣類型如 [“美元”,“港元”]core_tags數(shù)組核心檢索標(biāo)簽如 [“長期 IRR”,“回本周期”,“紅利權(quán)益”,“退保規(guī)則”]data_modules數(shù)組文檔包含的邏輯模塊如 [“產(chǎn)品基礎(chǔ)”,“收益案例”,“權(quán)益規(guī)則”,“條款約束”]key_numbers數(shù)組核心數(shù)值帶單位如 [“5 年繳費(fèi)”,“7% IRR”,“50 萬美元保費(fèi)”]2. 文檔元數(shù)據(jù)提取函數(shù)LLM 驅(qū)動(dòng)import os import json import requests from openai import OpenAI from dotenv import load_dotenv # 環(huán)境變量加載通用配置 load_dotenv() LLM_API_KEY os.getenv(LLM_API_KEY) DIFY_API_KEY os.getenv(DIFY_API_KEY) DIFY_BASE_URL os.getenv(DIFY_BASE_URL, https://api.dify.ai/v1) DIFY_KNOWLEDGE_BASE_ID os.getenv(DIFY_KNOWLEDGE_BASE_ID) # 初始化LLM客戶端通用適配OpenAI/國產(chǎn)模型 llm_client OpenAI(api_keyLLM_API_KEY) def extract_document_metadata(ocr_text): 通用函數(shù)從保險(xiǎn)文檔OCR文本中提取結(jié)構(gòu)化元數(shù)據(jù) :param ocr_text: 單頁保險(xiǎn)文檔的OCR文本動(dòng)態(tài)輸入 :return: 通用化元數(shù)據(jù)字典 prompt f # 任務(wù)提取保險(xiǎn)類文檔的RAG檢索專用元數(shù)據(jù)整頁為1個(gè)Chunk # 輸入文本 {ocr_text} # 提取規(guī)則 1. 嚴(yán)格基于文本內(nèi)容未提及的字段填空字符串/空數(shù)組不編造任何信息 2. doc_type提取文檔類型如保險(xiǎn)產(chǎn)品介紹、保險(xiǎn)條款 3. core_tags提取所有可用于檢索的核心關(guān)鍵詞如收益、權(quán)益、規(guī)則、繳費(fèi)方式 4. data_modules提取文檔包含的邏輯模塊從[產(chǎn)品基礎(chǔ),收益案例,權(quán)益規(guī)則,條款約束,提取規(guī)則,退保規(guī)則]中選擇 5. key_numbers提取所有帶單位的核心數(shù)值如年限、金額、百分比 6. 輸出僅保留標(biāo)準(zhǔn)JSON無解釋性文字、無換行。 # 輸出JSON格式 {{ doc_type: , issuer: , update_time: , applicable_area: [], supported_currencies: [], core_tags: [], data_modules: [], key_numbers: [] }} try: response llm_client.chat.completions.create( modelgpt-3.5-turbo, # 可替換為國產(chǎn)模型如通義千問、文心一言 messages[ {role: system, content: 你是保險(xiǎn)文檔元數(shù)據(jù)提取專家輸出僅符合格式的JSON}, {role: user, content: prompt} ], temperature0.0, # 無幻覺嚴(yán)格基于文本提取 response_format{type: json_object}, timeout10 ) metadata json.loads(response.choices[0].message.content) # 空值清洗確保格式統(tǒng)一 for key in metadata: if isinstance(metadata[key], list) and len(metadata[key]) 0: metadata[key] [] elif isinstance(metadata[key], str) and metadata[key].strip() : metadata[key] return metadata except Exception as e: print(f文檔元數(shù)據(jù)提取失敗{e}) # 返回空元數(shù)據(jù)兜底 return { doc_type: , issuer: , update_time: , applicable_area: [], supported_currencies: [], core_tags: [], data_modules: [], key_numbers: [] }3. 整頁 Chunk 上傳至 Dify 知識(shí)庫def upload_full_page_to_dify(full_page_text, metadata, doc_unique_id): 通用函數(shù)將整頁文檔作為1個(gè)Chunk上傳至Dify知識(shí)庫 :param full_page_text: 整頁OCR文本 :param metadata: 提取的文檔元數(shù)據(jù) :param doc_unique_id: 文檔唯一標(biāo)識(shí)如insurance_doc_001 url f{DIFY_BASE_URL}/knowledge_bases/{DIFY_KNOWLEDGE_BASE_ID}/documents/batch headers { Authorization: fBearer {DIFY_API_KEY}, Content-Type: application/json } # 構(gòu)造Dify上傳請求體單Chunk documents [ { content: full_page_text, # 整頁文本作為1個(gè)Chunk metadata: metadata, # 綁定通用元數(shù)據(jù) document_id: doc_unique_id, # 自定義唯一ID便于管理 name: f{metadata[doc_type]}_{doc_unique_id} # 文檔名稱 } ] payload { documents: documents, mode: overwrite # 可選append追加/overwrite覆蓋 } try: response requests.post(url, headersheaders, jsonpayload, timeout30) response.raise_for_status() print(f整頁Chunk上傳成功ID{doc_unique_id}) except Exception as e: print(fChunk上傳失敗{e}) if hasattr(e, response): print(f錯(cuò)誤詳情{e.response.text}) # 文檔入庫主函數(shù) def document_ingestion_pipeline(ocr_text, doc_unique_id): 文檔入庫流程提取元數(shù)據(jù) → 上傳Chunk :param ocr_text: 整頁OCR文本 :param doc_unique_id: 文檔唯一ID # 步驟1提取文檔元數(shù)據(jù) doc_metadata extract_document_metadata(ocr_text) # 步驟2上傳整頁Chunk元數(shù)據(jù) upload_full_page_to_dify(ocr_text, doc_metadata, doc_unique_id)四、第二步用戶提問元數(shù)據(jù)提取對齊文檔元數(shù)據(jù)1. 提問元數(shù)據(jù)提取函數(shù)字段與文檔元數(shù)據(jù)完全對齊def extract_query_metadata(user_query): 通用函數(shù)從用戶提問中提取Dify檢索用的元數(shù)據(jù)字段與文檔元數(shù)據(jù)對齊 :param user_query: 用戶原始提問口語化/精準(zhǔn)化均可 :return: 結(jié)構(gòu)化提問元數(shù)據(jù)用于Dify過濾檢索 prompt f # 任務(wù)從用戶提問中提取保險(xiǎn)類文檔RAG檢索的過濾元數(shù)據(jù) # 核心規(guī)則 1. 嚴(yán)格基于用戶提問內(nèi)容提取未提及的字段填空字符串/空數(shù)組不推測、不編造 2. 字段值需與保險(xiǎn)文檔元數(shù)據(jù)格式對齊如貨幣名稱、模塊名稱統(tǒng)一 3. doc_type提取用戶提問指向的文檔類型如保險(xiǎn)產(chǎn)品介紹 4. core_tags提取提問中的核心檢索關(guān)鍵詞如繳費(fèi)方式、權(quán)益、金額、年限 5. data_modules提取提問指向的邏輯模塊從[產(chǎn)品基礎(chǔ),收益案例,權(quán)益規(guī)則,條款約束]中選擇 6. key_numbers提取提問中的核心數(shù)值帶單位 7. 輸出僅保留標(biāo)準(zhǔn)JSON無其他內(nèi)容。 # 用戶提問 {user_query} # 輸出JSON格式 {{ doc_type: , issuer: , update_time: , applicable_area: [], supported_currencies: [], core_tags: [], data_modules: [], key_numbers: [] }} try: response llm_client.chat.completions.create( modelgpt-3.5-turbo, messages[ {role: system, content: 你是保險(xiǎn)提問元數(shù)據(jù)提取助手輸出僅符合格式的JSON}, {role: user, content: prompt} ], temperature0.0, response_format{type: json_object}, timeout10 ) query_metadata json.loads(response.choices[0].message.content) # 空值清洗 for key in query_metadata: if isinstance(query_metadata[key], list) and len(query_metadata[key]) 0: query_metadata[key] [] elif isinstance(query_metadata[key], str) and query_metadata[key].strip() : query_metadata[key] return query_metadata except Exception as e: print(f提問元數(shù)據(jù)提取失敗{e}) # 返回空元數(shù)據(jù)兜底 return { doc_type: , issuer: , update_time: , applicable_area: [], supported_currencies: [], core_tags: [], data_modules: [], key_numbers: [] }五、第三步元數(shù)據(jù)過濾檢索 LLM 生成回答1. Dify 知識(shí)庫檢索元數(shù)據(jù)過濾混合檢索def retrieve_from_dify(query_metadata, user_query): 通用函數(shù)調(diào)用Dify檢索API基于提問元數(shù)據(jù)過濾Chunk :param query_metadata: 提問元數(shù)據(jù) :param user_query: 用戶原始提問用于向量/關(guān)鍵詞檢索 :return: Dify檢索結(jié)果匹配的Chunk列表 # 構(gòu)造過濾條件僅保留非空字段減少無效過濾 filter_conditions {} for key, value in query_metadata.items(): if value ! and value ! []: filter_conditions[key] value # Dify檢索API參數(shù) url f{DIFY_BASE_URL}/knowledge_bases/{DIFY_KNOWLEDGE_BASE_ID}/retrieve headers { Authorization: fBearer {DIFY_API_KEY}, Content-Type: application/json } payload { query: user_query, # 用戶提問向量/關(guān)鍵詞檢索 top_k: 3, # 返回Top3匹配的Chunk filter: filter_conditions, # 元數(shù)據(jù)過濾條件對齊文檔元數(shù)據(jù) retrieval_mode: hybrid, # 混合檢索關(guān)鍵詞向量兼顧精度 score_threshold: 0.3 # 相似度閾值過濾低匹配結(jié)果 } try: response requests.post(url, headersheaders, jsonpayload, timeout20) response.raise_for_status() return response.json() except Exception as e: print(fDify檢索失敗{e}) if hasattr(e, response): print(f錯(cuò)誤詳情{e.response.text}) return None2. LLM 生成回答基于檢索到的 Chunkdef generate_answer(retrieve_result, user_query): 通用函數(shù)基于檢索到的Chunk生成精準(zhǔn)回答 :param retrieve_result: Dify檢索結(jié)果 :param user_query: 用戶原始提問 :return: 結(jié)構(gòu)化回答 # 無匹配結(jié)果兜底 if not retrieve_result or len(retrieve_result[documents]) 0: return 未檢索到與您的問題匹配的保險(xiǎn)文檔信息請調(diào)整提問關(guān)鍵詞。 # 提取檢索到的Chunk內(nèi)容整頁文本 retrieved_content .join([doc[content] for doc in retrieve_result[documents]]) # 生成回答的Prompt通用化無具體產(chǎn)品 answer_prompt f # 任務(wù)基于保險(xiǎn)文檔信息回答用戶問題 # 文檔信息 {retrieved_content} # 回答規(guī)則 1. 僅使用提供的文檔信息回答不編造任何內(nèi)容 2. 回答簡潔準(zhǔn)確聚焦用戶問題核心忽略無關(guān)信息 3. 若文檔中無明確答案明確說明“文檔中未提及相關(guān)信息” 4. 涉及數(shù)值/規(guī)則的需標(biāo)注“非保證”等文檔中的約束條件如有。 # 用戶問題 {user_query} try: response llm_client.chat.completions.create( modelgpt-3.5-turbo, messages[ {role: system, content: 你是專業(yè)的保險(xiǎn)文檔解答助手回答嚴(yán)格基于提供的信息}, {role: user, content: answer_prompt} ], temperature0.1 # 低隨機(jī)性確?；卮鹁珳?zhǔn) ) return response.choices[0].message.content except Exception as e: print(f回答生成失敗{e}) return 回答生成失敗請重試。六、第四步全流程串聯(lián)通用 RAG 問答管道def insurance_rag_qa_pipeline(user_query, ocr_textNone, doc_unique_idNone): 保險(xiǎn)類文檔RAG全流程 1. 若傳入OCR文本文檔ID先執(zhí)行文檔入庫 2. 提取提問元數(shù)據(jù) → 檢索 → 生成回答 # 可選文檔入庫首次上傳時(shí)執(zhí)行 if ocr_text and doc_unique_id: document_ingestion_pipeline(ocr_text, doc_unique_id) # 核心流程提問處理 → 檢索 → 回答 # 步驟1提取提問元數(shù)據(jù) query_metadata extract_query_metadata(user_query) # 步驟2Dify元數(shù)據(jù)過濾檢索 retrieve_result retrieve_from_dify(query_metadata, user_query) # 步驟3生成回答 final_answer generate_answer(retrieve_result, user_query) return final_answer # 全流程測試示例 if __name__ __main__: # 示例1文檔入庫首次上傳 sample_ocr_text 【保險(xiǎn)產(chǎn)品介紹】發(fā)行機(jī)構(gòu)XX保險(xiǎn)公司更新時(shí)間2024年8月支持貨幣美元、港元、歐元核心收益長期總內(nèi)部回報(bào)率預(yù)期超7%回本周期短至8年權(quán)益規(guī)則支持貨幣轉(zhuǎn)換、紅利鎖/解鎖、受保人變更條款約束實(shí)際收益非保證提取金額需符合保單規(guī)則 # 執(zhí)行入庫僅首次執(zhí)行 insurance_rag_qa_pipeline( user_query, # 提問為空僅執(zhí)行入庫 ocr_textsample_ocr_text, doc_unique_idinsurance_doc_001 ) # 示例2用戶提問檢索回答 user_query 美元保單的長期IRR是多少是否有保證 answer insurance_rag_qa_pipeline(user_queryuser_query) print( 最終回答 ) print(answer)七、通用化優(yōu)化建議適配所有保險(xiǎn)類文檔1. 元數(shù)據(jù)擴(kuò)展可根據(jù)實(shí)際需求新增通用字段如payment_methods繳費(fèi)方式數(shù)組如 [“5 年繳”,“10 年繳”,“整付”]right_types權(quán)益類型數(shù)組如 [“貨幣轉(zhuǎn)換”,“紅利解鎖”,“受保人變更”]constraint_tags約束標(biāo)簽數(shù)組如 [“非保證收益”,“退保條件限制”]。2. 適配國產(chǎn) LLM若不用 OpenAI替換llm_client為國產(chǎn)模型調(diào)用邏輯如通義千問、文心一言Prompt 模板完全通用# 通義千問適配示例 import dashscope dashscope.api_key os.getenv(DASHSCOPE_API_KEY) def extract_document_metadata(ocr_text): prompt ... # 保留原Prompt response dashscope.Generation.call( modelqwen-plus, messages[{role: user, content: prompt}], result_formatjson, temperature0.0 ) metadata json.loads(response.output.choices[0].message.content) return metadata3. 批量處理優(yōu)化def batch_ingestion(folder_path): 批量入庫文件夾中的保險(xiǎn)文檔 import glob for idx, file_path in enumerate(glob.glob(f{folder_path}/*.txt)): with open(file_path, r, encodingutf-8) as f: ocr_text f.read() doc_unique_id finsurance_doc_{idx:03d} document_ingestion_pipeline(ocr_text, doc_unique_id)4. Dify 檢索配置檢索模式選擇「混合檢索」兼顧元數(shù)據(jù)關(guān)鍵詞和向量相似度向量模型選擇支持長文本的模型如text-embedding-3-large、m3e-large過濾邏輯Dify 支持「數(shù)組包含匹配」如supported_currencies包含 “美元” 即匹配無需完全一致?？偨Y(jié)該方案實(shí)現(xiàn)了完全通用化的保險(xiǎn)類文檔 RAG 全流程文檔側(cè)整頁為 Chunk 提取通用元數(shù)據(jù)無需拆分適配任意保險(xiǎn)文檔提問側(cè)提取與文檔元數(shù)據(jù)對齊的檢索標(biāo)簽精準(zhǔn)過濾 Chunk檢索側(cè)元數(shù)據(jù)過濾混合檢索兼顧精度與效率回答側(cè)基于檢索結(jié)果生成精準(zhǔn)回答無編造、無冗余。全流程無具體產(chǎn)品名稱依賴可直接復(fù)用至各類保險(xiǎn)產(chǎn)品文檔的 RAG 系統(tǒng)開發(fā)。

97色伦色在线综合视频,无玛专区,18videosex性欧美黑色,日韩黄色电影免费在线观看,国产精品伦理一区二区三区,在线视频欧美日韩,亚洲欧美在线中文字幕不卡

江蘇省建設(shè)安全協(xié)會(huì)網(wǎng)站手機(jī)網(wǎng)站制作建設(shè)

三維宣傳片制作公司關(guān)鍵詞優(yōu)化網(wǎng)站排名

重慶網(wǎng)站建設(shè)seo公司哪家好外貿(mào)營銷信

鐵路網(wǎng)站建設(shè)論文房產(chǎn)網(wǎng)簽流程圖

免費(fèi)網(wǎng)站推廣app一般網(wǎng)站有哪幾部分構(gòu)成

五合一網(wǎng)站制作視頻教程?學(xué)校網(wǎng)站源碼

網(wǎng)站快速收錄技術(shù)seo基礎(chǔ)入門視頻教程

97色伦色在线综合视频,无玛专区,18videosex性欧美黑色,日韩黄色电影免费在线观看,国产精品伦理一区二区三区,在线视频欧美日韩,亚洲欧美在线中文字幕不卡

江蘇省建設(shè)安全協(xié)會(huì)網(wǎng)站手機(jī)網(wǎng)站制作建設(shè)

三維宣傳片制作公司關(guān)鍵詞優(yōu)化網(wǎng)站排名

重慶網(wǎng)站建設(shè)seo公司哪家好外貿(mào)營銷信

鐵路網(wǎng)站建設(shè)論文房產(chǎn)網(wǎng)簽流程圖

免費(fèi)網(wǎng)站推廣app一般網(wǎng)站有哪幾部分構(gòu)成

五合一網(wǎng)站制作視頻教程?學(xué)校 網(wǎng)站源碼

網(wǎng)站快速收錄技術(shù)seo基礎(chǔ)入門視頻教程

五合一網(wǎng)站制作視頻教程?學(xué)校網(wǎng)站源碼