TypeError: expected string or buffer - Langchain, OpenAI Embeddings

Problem background:

I am trying to create a RAG pipeline using product manuals in PDF, which are split, indexed, and stored in a Chroma vector store persisted on disk. When I call the function that classifies reviews using the document context, I get the error below:

from langchain import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain.embeddings import AzureOpenAIEmbeddings
from langchain.chat_models import AzureChatOpenAI
from langchain.vectorstores import Chroma

llm = AzureChatOpenAI(
    azure_deployment="ChatGPT-16K",
    openai_api_version="2023-05-15",
    azure_endpoint=endpoint,
    api_key=result["access_token"],
    temperature=0,
    seed=100,
)

embedding_model = AzureOpenAIEmbeddings(
    api_version="2023-05-15",
    azure_endpoint=endpoint,
    api_key=result["access_token"],
    azure_deployment="ada002",
)

vectordb = Chroma(
    persist_directory=vector_db_path,
    embedding_function=embedding_model,
    collection_name="product_manuals",
)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)


def classify(review_title, review_text, product_num):
    template = """You are a customer service AI Assistant that handles responses to negative product reviews. Use the context below and categorize {review_title} and {review_text} into defect, misuse or poor quality categories based only on provided context. If you don't know, say that you do not know, don't try to make up an answer. Respond back with an answer in the following format:

    poor quality
    misuse
    defect

    {context}

    Category: """
    rag_prompt = PromptTemplate.from_template(template)

    retriever = vectordb.as_retriever(
        search_type="similarity",
        search_kwargs={'filter': {'product_num': product_num}},
    )

    retrieval_chain = (
        {
            "context": retriever | format_docs,
            "review_title": RunnablePassthrough(),
            "review_text": RunnablePassthrough(),
        }
        | rag_prompt
        | llm
        | StrOutputParser()
    )
    return retrieval_chain.invoke({"review_title": review_title, "review_text": review_text})


classify(
    review_title="Terrible",
    review_text="This baking sheet is terrible. It stains so easily and i've tried everything to get it clean",
    product_num="8888999",
)

Error stack:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File <command-3066972537097411>, line 1
----> 1 issue_recommendation(
      2     review_title="Terrible",
      3     review_text="This baking sheet is terrible. It stains so easily and i've tried everything to get it clean. I've maybe used it 5 times and it looks like it's 20 years old. The side of the pan also hold water, so when you pick it up off the drying rack, water runs out. I would never purchase these again.",
      4     product_num="8888999"
      5 
      6 )

File <command-3066972537097410>, line 44, in issue_recommendation(review_title, review_text, product_num)
     36 retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={'filter': {'product_num': product_num}})
     38 retrieval_chain = (
     39         {"context": retriever | format_docs, "review_text": RunnablePassthrough()}
     40         | rag_prompt
     41         | llm
     42         | StrOutputParser()
     43 )
---> 44 return retrieval_chain.invoke({"review_title": review_title, "review_text": review_text})

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:1762, in RunnableSequence.invoke(self, input, config)
   1760 try:
   1761     for i, step in enumerate(self.steps):
-> 1762         input = step.invoke(
   1763             input,
   1764             # mark each step as a child run
   1765             patch_config(
   1766                 config, callbacks=run_manager.get_child(f"seq:step:{i+1}")
   1767             ),
   1768         )
   1769 # finish the root run
   1770 except BaseException as e:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:2327, in RunnableParallel.invoke(self, input, config)
   2314     with get_executor_for_config(config) as executor:
   2315         futures = [
   2316             executor.submit(
   2317                 step.invoke,
   (...)
   2325             for key, step in steps.items()
   2326         ]
-> 2327         output = {key: future.result() for key, future in zip(steps, futures)}
   2328 # finish the root run
   2329 except BaseException as e:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:2327, in <dictcomp>(.0)
   2314     with get_executor_for_config(config) as executor:
   2315         futures = [
   2316             executor.submit(
   2317                 step.invoke,
   (...)
   2325             for key, step in steps.items()
   2326         ]
-> 2327         output = {key: future.result() for key, future in zip(steps, futures)}
   2328 # finish the root run
   2329 except BaseException as e:

File /usr/lib/python3.10/concurrent/futures/_base.py:451, in Future.result(self, timeout)
    449     raise CancelledError()
    450 elif self._state == FINISHED:
--> 451     return self.__get_result()
    453 self._condition.wait(timeout)
    455 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File /usr/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

File /usr/lib/python3.10/concurrent/futures/thread.py:58, in _WorkItem.run(self)
     55     return
     57 try:
---> 58     result = self.fn(*self.args, **self.kwargs)
     59 except BaseException as exc:
     60     self.future.set_exception(exc)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/runnables/base.py:1762, in RunnableSequence.invoke(self, input, config)
   1760 try:
   1761     for i, step in enumerate(self.steps):
-> 1762         input = step.invoke(
   1763             input,
   1764             # mark each step as a child run
   1765             patch_config(
   1766                 config, callbacks=run_manager.get_child(f"seq:step:{i+1}")
   1767             ),
   1768         )
   1769 # finish the root run
   1770 except BaseException as e:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/retrievers.py:121, in BaseRetriever.invoke(self, input, config)
    117 def invoke(
    118     self, input: str, config: Optional[RunnableConfig] = None
    119 ) -> List[Document]:
    120     config = ensure_config(config)
--> 121     return self.get_relevant_documents(
    122         input,
    123         callbacks=config.get("callbacks"),
    124         tags=config.get("tags"),
    125         metadata=config.get("metadata"),
    126         run_name=config.get("run_name"),
    127     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/retrievers.py:223, in BaseRetriever.get_relevant_documents(self, query, callbacks, tags, metadata, run_name, **kwargs)
    221 except Exception as e:
    222     run_manager.on_retriever_error(e)
--> 223     raise e
    224 else:
    225     run_manager.on_retriever_end(
    226         result,
    227         **kwargs,
    228     )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/retrievers.py:216, in BaseRetriever.get_relevant_documents(self, query, callbacks, tags, metadata, run_name, **kwargs)
    214 _kwargs = kwargs if self._expects_other_args else {}
    215 if self._new_arg_supported:
--> 216     result = self._get_relevant_documents(
    217         query, run_manager=run_manager, **_kwargs
    218     )
    219 else:
    220     result = self._get_relevant_documents(query, **_kwargs)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_core/vectorstores.py:654, in VectorStoreRetriever._get_relevant_documents(self, query, run_manager)
    650 def _get_relevant_documents(
    651     self, query: str, *, run_manager: CallbackManagerForRetrieverRun
    652 ) -> List[Document]:
    653     if self.search_type == "similarity":
--> 654         docs = self.vectorstore.similarity_search(query, **self.search_kwargs)
    655     elif self.search_type == "similarity_score_threshold":
    656         docs_and_similarities = (
    657             self.vectorstore.similarity_search_with_relevance_scores(
    658                 query, **self.search_kwargs
    659             )
    660         )

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/vectorstores/chroma.py:348, in Chroma.similarity_search(self, query, k, filter, **kwargs)
    331 def similarity_search(
    332     self,
    333     query: str,
    (...)
    336     **kwargs: Any,
    337 ) -> List[Document]:
    338     """Run similarity search with Chroma.
    339 
    340     Args:
    (...)
    346         List[Document]: List of documents most similar to the query text.
    347     """
--> 348     docs_and_scores = self.similarity_search_with_score(
    349         query, k, filter=filter, **kwargs
    350     )
    351     return [doc for doc, _ in docs_and_scores]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/vectorstores/chroma.py:437, in Chroma.similarity_search_with_score(self, query, k, filter, where_document, **kwargs)
    429     results = self.__query_collection(
    430         query_texts=[query],
    431         n_results=k,
    (...)
    434         **kwargs,
    435     )
    436 else:
--> 437     query_embedding = self._embedding_function.embed_query(query)
    438     results = self.__query_collection(
    439         query_embeddings=[query_embedding],
    440         n_results=k,
    (...)
    443         **kwargs,
    444     )
    446 return _results_to_docs_and_scores(results)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/embeddings/openai.py:691, in OpenAIEmbeddings.embed_query(self, text)
    682 def embed_query(self, text: str) -> List[float]:
    683     """Call out to OpenAI's embedding endpoint for embedding query text.
    684 
    685     Args:
    (...)
    689         Embedding for the text.
    690     """
--> 691     return self.embed_documents([text])[0]

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/embeddings/openai.py:662, in OpenAIEmbeddings.embed_documents(self, texts, chunk_size)
    659 # NOTE: to keep things simple, we assume the list may contain texts longer
    660 #       than the maximum context and use length-safe embedding function.
    661 engine = cast(str, self.deployment)
--> 662 return self._get_len_safe_embeddings(texts, engine=engine)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-65a09d8c-062d-4f4f-9c52-1bf534f6511e/lib/python3.10/site-packages/langchain_community/embeddings/openai.py:465, in OpenAIEmbeddings._get_len_safe_embeddings(self, texts, engine, chunk_size)
    459 if self.model.endswith("001"):
    460     # See: https://github.com/openai/openai-python/
    461     #      issues/418#issuecomment-1525939500
    462     # replace newlines, which can negatively affect performance.
    463     text = text.replace("\n", " ")
--> 465 token = encoding.encode(
    466     text=text,
    467     allowed_special=self.allowed_special,
    468     disallowed_special=self.disallowed_special,
    469 )
    471 # Split tokens into chunks respecting the embedding_ctx_length
    472 for j in range(0, len(token), self.embedding_ctx_length):

File /databricks/python/lib/python3.10/site-packages/tiktoken/core.py:116, in Encoding.encode(self, text, allowed_special, disallowed_special)
    114     if not isinstance(disallowed_special, frozenset):
    115         disallowed_special = frozenset(disallowed_special)
--> 116     if match := _special_token_regex(disallowed_special).search(text):
    117         raise_disallowed_special_token(match.group())
    119 try:

TypeError: expected string or buffer

The embeddings seem to work fine when I test them in isolation (see the sketch after the version list below). The chain also works fine when I remove the context and retriever from it, so the problem seems to be related to the embeddings step. Examples on the LangChain website instantiate the retriever via Chroma.from_documents(), whereas I load the Chroma vector store from a persisted path. I also tried invoking with review_text only (instead of both review_title and review_text), but the error persists. Not sure why this is happening. These are the package versions I work with:

Name: openai Version: 1.6.1

Name: langchain Version: 0.0.354
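
For reference, this is roughly what testing the embeddings and retriever in isolation looks like (a sketch that reuses the embedding_model and vectordb objects from the code above; the query strings and the hard-coded product_num are just example values):

emb = embedding_model.embed_query("stainless steel baking sheet care")  # plain string in, vector out
print(len(emb))  # dimensionality of the returned embedding

retriever = vectordb.as_retriever(
    search_type="similarity",
    search_kwargs={'filter': {'product_num': "8888999"}},
)
docs = retriever.get_relevant_documents("baking sheet stains easily")  # also takes a plain string
print(len(docs))  # number of matching chunks from the persisted Chroma collection

Both calls receive a plain str, which is why they behave normally outside the chain.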

Solution:

I've come across the same issue, and it turned out that LangChain passes the whole key-value input (a dict) to encoding.encode(), while it requires a str. A workaround is to use itemgetter() to pull out the string input directly. It might look something like this:

retrieval_chain = (
    {
        "document": itemgetter("question") | self.retriever,
        "question": itemgetter("question"),
    }
    | prompt
    | model
    | StrOutputParser()
)

You can find the reference here
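
Applied to the classify() chain from the question, the same idea might look roughly like this (a sketch that reuses the retriever, rag_prompt, llm and format_docs defined in the question; not verified against that exact environment):

from operator import itemgetter

retrieval_chain = (
    {
        # itemgetter("review_text") pulls a plain str out of the input dict, so the
        # retriever (and the tiktoken encoding call behind its embeddings) receives a
        # string instead of the whole {"review_title": ..., "review_text": ...} dict.
        "context": itemgetter("review_text") | retriever | format_docs,
        "review_title": itemgetter("review_title"),
        "review_text": itemgetter("review_text"),
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

result = retrieval_chain.invoke({"review_title": review_title, "review_text": review_text})

With itemgetter() forwarding the two review fields, RunnablePassthrough() is no longer needed in the mapping.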
