update@Release the Wikipedia knowledge base

2023-04-19 13:14:42 +08:00 · 2023-04-19 13:14:42 +08:00 · 82515cb1fc
parent e85f4a598a
commit 82515cb1fc
8 changed files with 390 additions and 8 deletions
--- a/README.md
+++ b/README.md
@@ -11,8 +11,15 @@ https://github.com/yanqiangmiffy/Chinese-LangChain
 ![](https://github.com/yanqiangmiffy/Chinese-LangChain/blob/master/images/web_demo.png)
 ![](https://github.com/yanqiangmiffy/Chinese-LangChain/blob/master/images/web_demo_new.png)

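+## 🚋 Usage Guide
+
+- Select a knowledge base and ask questions in its domain
+
+## 🏗️ Deployment Guide
+
 ## 🚀 Features

+- 📝 2023/04/19 Released a preprocessed text corpus of 450,000 Wikipedia entries along with its FAISS index vectors
 - 🐯 2023/04/19 Adopted the ChuanhuChatGPT skin
 - 📱 2023/04/19 Added web search, which requires a working network connection! (Thanks to [@wanghao07456](https://github.com/wanghao07456) for the idea)
 - 📚 2023/04/18 Added knowledge base selection to the web UI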
@@ -24,6 +31,14 @@ https://github.com/yanqiangmiffy/Chinese-LangChain

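 ## 🧰 Knowledge Bases

+### Building a Knowledge Base
+
+- Wikipedia-zh
+
+> For details, see corpus/zh_wikipedia/README.md
+
+### Knowledge Base Vector Indexes
+
 | Knowledge base data | FAISS vectors |
 |--------|----|
 |💹 [Large-scale financial research report knowledge graph](http://openkg.cn/dataset/fr2kg)|Link: https://pan.baidu.com/s/1FcIH5Fi3EfpS346DnDu51Q?pwd=ujjv Extraction code: ujjv |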
--- a/corpus/zh_wikipedia/README.md
+++ b/corpus/zh_wikipedia/README.md
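@@ -0,0 +1,114 @@
+## Knowledge Base Construction
+
+
+### 1 Building the Wikipedia Corpus
+
+Reference tutorial: https://blog.51cto.com/u_15127535/2697309
+
+
+I. Wikipedia
+
+Wikipedia is a multilingual, collaborative encyclopedia project built on wiki technology, and an online encyclopedia written in many languages. It was co-founded by Jimmy Wales and Larry Sanger. The website went live on January 13, 2001, and the online encyclopedia project formally began on January 15, 2001.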
+
+
+
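+II. Processing Wikipedia
+
+1 Environment setup: (1) the programming language is Python 3; (2) the third-party library Gensim, a Python toolkit that includes a class for processing Wikipedia dump data and is easy to use.
+Gensim: https://github.com/RaRe-Technologies/gensim
+
+Install gensim with pip install gensim.
+
+(3) The third-party library OpenCC, which converts Chinese characters, including conversion between Simplified and Traditional Chinese.
+
+OpenCC: https://github.com/BYVoid/OpenCC. OpenCC is implemented in C++; if you are comfortable with C++, you can follow its instructions and build the source with make.
+
+OpenCC also has a Python implementation that can be installed with pip (pip install opencc-python). It is slower than the C++ version but simple to install and easy to use, so installing via pip is recommended.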
+
+
+
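+2 Downloading the data
+
+The Chinese Wikipedia dump is updated and backed up monthly. In most cases, download the latest dump from https://dumps.wikimedia.org/zhwiki/latest/; the file we downloaded is zhwiki-latest-pages-articles.xml.bz2.
+
+The Chinese Wikipedia dump generally consists of the following parts:
+
+
+
+Word-vector training uses the article text, so the steps below process the article text.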
+
+
+
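+3 Extracting the data
+
+The downloaded data is a compressed file (bz2, gz) and does not need to be decompressed. A script that processes the Wikipedia dump with gensim has already been written:
+
+wikidata_process: https://github.com/bamtercelboo/corpus_process_script/tree/master/wikidata_process
+
+Usage:
+
+python wiki_process.py zhwiki-latest-pages-articles.xml.bz2 zhwiki-latest.txt
+
+This step takes a while. When it finishes, you have a file of Chinese Wikipedia article text (zhwiki-latest.txt).
+
+The output file looks like this: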
+
+歐幾里得 西元前三世紀的古希臘數學家 現在被認為是幾何之父 此畫為拉斐爾的作品 雅典學院 数学 是利用符号语言研究數量 结构 变化以及空间等概念的一門学科
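
The point that the dump can be read without unpacking it first can be illustrated with the standard library alone. The sketch below is not the gensim-based wiki_process.py from this commit: `peek_compressed_lines` is a hypothetical helper name, the tiny archive is a made-up stand-in for a real dump, and the function only previews raw lines rather than extracting article text.

```python
import bz2
import itertools
import os
import tempfile


def peek_compressed_lines(path, limit=5):
    """Return the first `limit` text lines of a .bz2 file, decompressing lazily."""
    # bz2.open in text mode streams the archive; nothing is unpacked on disk.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, limit)]


# Tiny stand-in archive so the sketch runs without downloading a real dump.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.xml.bz2")
    with bz2.open(demo_path, mode="wt", encoding="utf-8") as out:
        out.write("<mediawiki>\n<page>\n<title>数学</title>\n</page>\n</mediawiki>\n")
    print(peek_compressed_lines(demo_path, limit=3))
```

Because only the requested lines are decompressed, the same call on a multi-gigabyte dump should return quickly, which makes it a cheap sanity check before starting the long extraction run.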
+
+
+
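+4 Converting Traditional Chinese to Simplified Chinese
+
+The file produced by the script above contains many Traditional Chinese characters, which need to be converted to Simplified Chinese.
+
+We use OpenCC for the conversion; a Python script has already been written for this:
+
+chinese_t2s
+
+https://github.com/bamtercelboo/corpus_process_script/tree/master/chinese_t2s
+
+Usage:
+
+python chinese_t2s.py --input input_file --output output_file
+
+For example:
+
+python chinese_t2s.py --input zhwiki-latest.txt --output zhwiki-latest-simplified.txt
+
+The output file looks like this: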
+
+欧几里得 西元前三世纪的古希腊数学家 现在被认为是几何之父 此画为拉斐尔的作品 雅典学院 数学 是利用符号语言研究数量 结构 变化以及空间等概念的一门学科
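
The chinese_t2s.py script in this commit reads the whole input into a list before converting it. For a full dump, a line-by-line variant may use less memory. The following is a minimal sketch of that idea, not the script itself: `convert_file`, `toy_convert`, and `TOY_TABLE` are hypothetical names, and the three-character table is only a stand-in so the example runs without OpenCC installed. Unlike the script, it strips only the trailing newline and leaves tabs in place.

```python
def convert_file(src_path, dst_path, convert):
    """Apply `convert` to each line of src_path and write the result to dst_path."""
    count = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:  # only one line is held in memory at a time
            dst.write(convert(line.rstrip("\n")) + "\n")
            count += 1
    return count


# Toy stand-in converter covering three characters only; this is NOT a real
# Traditional-to-Simplified table.
TOY_TABLE = str.maketrans({"數": "数", "學": "学", "幾": "几"})


def toy_convert(text):
    """Translate the few characters listed in TOY_TABLE."""
    return text.translate(TOY_TABLE)
```

In real use, the callable passed as `convert` would be `opencc.OpenCC('t2s').convert`, the same call chinese_t2s.py makes.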
+
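+5 Cleaning the corpus
+
+The steps above already produce the data we want, but some tasks, such as word-vector training, need a little more processing. The resulting data still contains many English, Japanese, and German characters, Chinese punctuation, garbled characters, and so on. We strip these out and keep only Chinese characters. Keeping only Chinese characters is just one approach, and different tasks call for different processing. A script has already been written for this:
+
+clean
+
+https://github.com/bamtercelboo/corpus_process_script/tree/master/clean
+
+Usage:
+
+python clean_corpus.py --input input_file --output output_file
+
+For example:
+
+python clean_corpus.py --input zhwiki-latest-simplified.txt --output zhwiki-latest-simplified_cleaned.txt
+
+Result: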
+
+input:
+
+哲学	哲学（英语：philosophy）是对普遍的和基本的问题的研究，这些问题通常和存在、知识、价值、理性、心灵、语言等有关。
+
+output:
+
+哲学哲学英语是对普遍的和基本的问题的研究这些问题通常和存在知识价值理性心灵语言等有关
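
The filtering rule can also be written as a single regular expression. This is a sketch under the assumption that "Chinese character" means the same range that clean_corpus.py checks (U+4E00 through U+9FA5); `keep_chinese` and `NON_CHINESE` are hypothetical names, not part of the committed script.

```python
import re

# Matches every character outside U+4E00..U+9FA5, the range clean_corpus.py tests.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fa5]")


def keep_chinese(line):
    """Drop every character outside U+4E00..U+9FA5, including tabs and punctuation."""
    return NON_CHINESE.sub("", line)


sample = "哲学\t哲学（英语：philosophy）是对普遍的和基本的问题的研究。"
print(keep_chinese(sample))
```

Note that this range excludes CJK extension blocks and full-width punctuation, so rare characters outside it are removed as well. Whether that is acceptable depends on the downstream task.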
+
+
+
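+III. Data processing scripts
+
+I recently created a new repository on GitHub, corpus-process-script (https://github.com/bamtercelboo/corpus_process_script). It will hold scripts for processing Chinese and English data, written in any language, each with a detailed README. I hope it is of some help.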
+References
+
--- a/corpus/zh_wikipedia/chinese_t2s.py
+++ b/corpus/zh_wikipedia/chinese_t2s.py
@@ -0,0 +1,82 @@
+#!/usr/bin/env python
+# -*- coding:utf-8 _*-
+"""
+@author:quincy qiang
+@license: Apache Licence
+@file: chinese_t2s.py
+@time: 2023/04/19
+@contact: yanqiangmiffy@gamil.com
+@software: PyCharm
+@description: Convert Traditional Chinese text to Simplified Chinese with OpenCC
+"""
+import sys
+import os
+import opencc
+from optparse import OptionParser
+
+
+class T2S(object):
+    def __init__(self, infile, outfile):
+        self.infile = infile
+        self.outfile = outfile
+        self.cc = opencc.OpenCC('t2s')
+        self.t_corpus = []
+        self.s_corpus = []
+        self.read(self.infile)
+        self.t2s()
+        self.write(self.s_corpus, self.outfile)
+
+    def read(self, path):
+        print(path)
+        if not os.path.isfile(path):
+            print("path is not a file")
+            sys.exit(1)
+        now_line = 0
+        with open(path, encoding="UTF-8") as f:
+            for line in f:
+                now_line += 1
+                line = line.replace("\n", "").replace("\t", "")
+                self.t_corpus.append(line)
+        print("read finished")
+
+    def t2s(self):
+        now_line = 0
+        all_line = len(self.t_corpus)
+        for line in self.t_corpus:
+            now_line += 1
+            if now_line % 1000 == 0:
+                sys.stdout.write("\rhandling with the {} line, all {} lines.".format(now_line, all_line))
+            self.s_corpus.append(self.cc.convert(line))
+        sys.stdout.write("\rhandling with the {} line, all {} lines.".format(now_line, all_line))
+        print("\nhandling finished")
+
+    def write(self, lines, path):
+        print("writing now......")
+        if os.path.exists(path):
+            os.remove(path)
+        # The context manager closes the file even if a write fails.
+        with open(path, encoding="UTF-8", mode="w") as file:
+            for line in lines:
+                file.write(line + "\n")
+        print("writing finished.")
+
+
+if __name__ == "__main__":
+    print("Traditional Chinese to Simplified Chinese")
+    # input = "./wiki_zh_10.txt"
+    # output = "wiki_zh_10_sim.txt"
+    # T2S(infile=input, outfile=output)
+
+    parser = OptionParser()
+    parser.add_option("--input", dest="input", default="", help="traditional file")
+    parser.add_option("--output", dest="output", default="", help="simplified file")
+    (options, args) = parser.parse_args()
+
+    input = options.input
+    output = options.output
+
+    try:
+        T2S(infile=input, outfile=output)
+        print("All Finished.")
+    except Exception as err:
+        print(err)
--- a/corpus/zh_wikipedia/clean_corpus.py
+++ b/corpus/zh_wikipedia/clean_corpus.py
@@ -0,0 +1,88 @@
+#!/usr/bin/env python
+# -*- coding:utf-8 _*-
+"""
+@author:quincy qiang
+@license: Apache Licence
+@file: clean_corpus.py
+@time: 2023/04/19
+@contact: yanqiangmiffy@gamil.com
+@software: PyCharm
+@description: Remove non-Chinese characters from a text corpus
+"""
+"""
+    FILE :  clean_corpus.py
+    FUNCTION : None
+"""
+import sys
+import os
+from optparse import OptionParser
+
+
+class Clean(object):
+    def __init__(self, infile, outfile):
+        self.infile = infile
+        self.outfile = outfile
+        self.corpus = []
+        self.remove_corpus = []
+        self.read(self.infile)
+        self.remove(self.corpus)
+        self.write(self.remove_corpus, self.outfile)
+
+    def read(self, path):
+        print("reading now......")
+        if not os.path.isfile(path):
+            print("path is not a file")
+            sys.exit(1)
+        now_line = 0
+        with open(path, encoding="UTF-8") as f:
+            for line in f:
+                now_line += 1
+                line = line.replace("\n", "").replace("\t", "")
+                self.corpus.append(line)
+        print("read finished.")
+
+    def remove(self, list):
+        print("removing now......")
+        for line in list:
+            re_list = []
+            for word in line:
+                if self.is_chinese(word) is False:
+                    continue
+                re_list.append(word)
+            self.remove_corpus.append("".join(re_list))
+        print("remove finished.")
+
+    def write(self, lines, path):
+        print("writing now......")
+        if os.path.exists(path):
+            os.remove(path)
+        # The context manager closes the file even if a write fails.
+        with open(path, encoding="UTF-8", mode="w") as file:
+            for line in lines:
+                file.write(line + "\n")
+        print("writing finished")
+
+    def is_chinese(self, uchar):
+        """Return True if uchar is a Chinese character in the range U+4E00 to U+9FA5."""
+        if (uchar >= u'\u4e00') and (uchar <= u'\u9fa5'):
+            return True
+        else:
+            return False
+
+
+if __name__ == "__main__":
+    print("clean corpus")
+
+    parser = OptionParser()
+    parser.add_option("--input", dest="input", default="", help="input file")
+    parser.add_option("--output", dest="output", default="", help="output file")
+    (options, args) = parser.parse_args()
+
+    input = options.input
+    output = options.output
+
+    try:
+        Clean(infile=input, outfile=output)
+        print("All Finished.")
+    except Exception as err:
+        print(err)
--- a/corpus/zh_wikipedia/wiki_process.py
+++ b/corpus/zh_wikipedia/wiki_process.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python
+# -*- coding:utf-8 _*-
+"""
+@author:quincy qiang
+@license: Apache Licence
+@file: wiki_process.py
+@time: 2023/04/19
+@contact: yanqiangmiffy@gamil.com
+@software: PyCharm
+@description: https://blog.csdn.net/weixin_40871455/article/details/88822290
+"""
+import logging
+import sys
+from gensim.corpora import WikiCorpus
+
+logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s', level=logging.INFO)
+'''
+    extract data from wiki dumps(*articles.xml.bz2) by gensim.
+    @2019-3-26
+'''
+
+
+def help():
+    print("Usage: python wiki_process.py zhwiki-20190320-pages-articles-multistream.xml.bz2 wiki.zh.txt")
+
+
+if __name__ == '__main__':
+    if len(sys.argv) < 3:
+        help()
+        sys.exit(1)
+    logging.info("running %s" % ' '.join(sys.argv))
+    inp, outp = sys.argv[1:3]
+    i = 0
+
+    output = open(outp, 'w', encoding='utf8')
+    wiki = WikiCorpus(inp, dictionary={})
+    for text in wiki.get_texts():
+        output.write(" ".join(text) + "\n")
+        i = i + 1
+        if (i % 10000 == 0):
+            logging.info("Save " + str(i) + " articles")
+    output.close()
+    logging.info("Finished saving " + str(i) + " articles")
+
+    # Run from the command line:
+    # python wiki_process.py cache/zh_wikipedia/zhwiki-latest-pages-articles.xml.bz2 wiki.zh.txt
--- a/create_knowledge.py
+++ b/create_knowledge.py
@@ -10,7 +10,8 @@
 @description: - emoji：https://emojixd.com/pocket/science
 """
 import os
-
+import pandas as pd
+from langchain.schema import Document
 from langchain.document_loaders import UnstructuredFileLoader
 from langchain.embeddings.huggingface import HuggingFaceEmbeddings
 from langchain.vectorstores import FAISS
@@ -20,6 +21,9 @@ embedding_model_name = '/root/pretrained_models/text2vec-large-chinese'
 docs_path = '/root/GoMall/Knowledge-ChatGLM/cache/financial_research_reports'
 embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)

+
+# Wikipedia data processing
+
 # docs = []

 # with open('docs/zh_wikipedia/zhwiki.sim.utf8', 'r', encoding='utf-8') as f:
@@ -30,13 +34,46 @@ embeddings = HuggingFaceEmbeddings(model_name=embedding_model_name)
 # vector_store = FAISS.from_documents(docs, embeddings)
 # vector_store.save_local('cache/zh_wikipedia/')

+
+
 docs = []

-for doc in tqdm(os.listdir(docs_path)):
-    if doc.endswith('.txt'):
-        # print(doc)
-        loader = UnstructuredFileLoader(f'{docs_path}/{doc}', mode="elements")
-        doc = loader.load()
-        docs.extend(doc)
+with open('cache/zh_wikipedia/wiki.zh-sim-cleaned.txt', 'r', encoding='utf-8') as f:
+    for idx, line in tqdm(enumerate(f)):
+        metadata = {"source": f'doc_id_{idx}'}
+        docs.append(Document(page_content=line.strip(), metadata=metadata))
+
 vector_store = FAISS.from_documents(docs, embeddings)
-vector_store.save_local('cache/financial_research_reports')
+vector_store.save_local('cache/zh_wikipedia/')
+
+
+# Financial research report data processing
+# docs = []
+#
+# for doc in tqdm(os.listdir(docs_path)):
+#     if doc.endswith('.txt'):
+#         # print(doc)
+#         loader = UnstructuredFileLoader(f'{docs_path}/{doc}', mode="elements")
+#         doc = loader.load()
+#         docs.extend(doc)
+# vector_store = FAISS.from_documents(docs, embeddings)
+# vector_store.save_local('cache/financial_research_reports')
+
+
+# League of Legends
+
+docs = []
+
+lol_df = pd.read_csv('cache/lol/champions.csv')
+# lol_df.columns = ['id', '英雄简称', '英雄全称', '出生地', '人物属性', '英雄类别', '英雄故事']
+print(lol_df)
+
+for idx, row in lol_df.iterrows():
+    metadata = {"source": f'doc_id_{idx}'}
+    text = ' '.join(map(str, row.values))
+    # for col in ['英雄简称', '英雄全称', '出生地', '人物属性', '英雄类别', '英雄故事']:
+    #     text += row[col]
+    docs.append(Document(page_content=text, metadata=metadata))
+
+vector_store = FAISS.from_documents(docs, embeddings)
+vector_store.save_local('cache/lol/')
--- a/images/wiki_process.png
+++ b/images/wiki_process.png
--- a/resources/OpenCC-1.1.6-cp310-cp310-manylinux1_x86_64.whl
+++ b/resources/OpenCC-1.1.6-cp310-cp310-manylinux1_x86_64.whl