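This post and its resources were last updated on 2021-10-10 by sky995.
Using the QQ database as an example:
The source data is an unsorted txt file.
step1: Split, sort, and merge the source file, using the qq field as the key (phone also works).
I wrote my own script for this; it has to be used together with EmEditor.
Use EmEditor to split the source file by lines, 75 million lines per file, which gives roughly 10 files.
Merge-sort these ten files; the end result must be a single sorted source file.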
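The merge half of step1 can be sketched with Python's standard heapq.merge, which holds only one pending line per input file in memory. This is a minimal sketch rather than the author's script: the function name merge_sorted_files is hypothetical, and it assumes every input file is already sorted by its first comma-separated integer field.

```python
import heapq
from contextlib import ExitStack


def merge_sorted_files(paths, out_path):
    """Merge files that are each already sorted by their first comma-separated integer."""
    def key(line):
        return int(line.split(",", 1)[0])

    with ExitStack() as stack:
        inputs = [stack.enter_context(open(p, "r", encoding="utf-8")) for p in paths]
        out = stack.enter_context(open(out_path, "w", encoding="utf-8"))
        # heapq.merge lazily yields lines from all inputs in key order
        for line in heapq.merge(*inputs, key=key):
            # make sure a final line without a newline does not fuse with the next one
            out.write(line if line.endswith("\n") else line + "\n")
```

Each input still has to be sorted on its own first, for example chunk by chunk as the full script at the end of the post does.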
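step2: Use EmEditor to split the sorted source file by lines, 1 million lines per file, which gives roughly 720 files.
These serve as the data source for the tables created later: each file maps to one table, so 720 tables in total.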
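If EmEditor is not available, the line-based split can be approximated in Python. The sketch below is an assumption-laden stand-in, not the author's workflow: split_by_lines is a hypothetical helper, and the default prefix simply mirrors the MargedFileOutPut_N.txt names that the import script expects.

```python
import itertools
import os


def split_by_lines(src_path, out_dir, lines_per_file=1_000_000, prefix="MargedFileOutPut_"):
    """Write consecutive chunks of src_path to out_dir/<prefix><n>.txt, with n starting at 1."""
    os.makedirs(out_dir, exist_ok=True)
    written = []
    with open(src_path, "r", encoding="utf-8") as src:
        for n in itertools.count(1):
            # read the next lines_per_file lines; an empty list means end of file
            chunk = list(itertools.islice(src, lines_per_file))
            if not chunk:
                break
            out_path = os.path.join(out_dir, f"{prefix}{n}.txt")
            with open(out_path, "w", encoding="utf-8") as out:
                out.writelines(chunk)
            written.append(out_path)
    return written
```

Because the input is already sorted, every chunk covers a contiguous key range, which is what makes the begin/end index in step3 possible.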
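step3: Batch-create the databases and import the data
First create an index table used to look up the database name, with the fields database_name, begin, end (the script below calls it qq_database_index, and each "database" is a table named qq_database_N).
begin and end are the first and last key of each sorted table.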
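To make the meaning of begin and end concrete, here is a small sketch that derives one index row from one sorted chunk file. The helper name index_row is hypothetical, and it assumes the file is sorted by its first comma-separated integer field; the shell script in this post gets the same values from SQL queries instead.

```python
def index_row(table_name, sorted_path):
    """Return (table_name, begin, end): the first and last key found in a sorted file."""
    first = last = None
    with open(sorted_path, "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            if first is None:
                first = line
            last = line
    if first is None:
        raise ValueError(f"{sorted_path} contains no data lines")

    def key(line):
        return int(line.split(",", 1)[0])

    return table_name, key(first), key(last)
```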
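Taking Linux as the example: batch-create the databases and compressed tables (compressed tables take less disk space and can improve query efficiency).
Note that compressed tables cannot be modified. Here is the shell script; you need some background to adapt it.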
#!/bin/bash
index=1
USER_NAME="root"
PASSWD=""
DB_NAME=""
HOST_NAME="127.0.0.1"
DB_PORT="3306"
endIndex=720
# -s gives plain tab-separated output; -N drops the column-name header,
# so the first field of the result is the value itself
MYSQL_ETL="mysql -h${HOST_NAME} -P${DB_PORT} -u${USER_NAME} -p${PASSWD} ${DB_NAME} -s -N -e"
for ((i=index; i<=endIndex; i++))
do
    table_name="qq_database_${i}"
    database_path="/var/lib/mysql-files/qq_database/MargedFileOutPut_${i}.txt"
    times=$(date "+%Y-%m-%d %H:%M:%S")
    echo "[${times}] Insert Data ${table_name}"
    # create the table
    create_table="CREATE TABLE ${table_name} ( qq bigint UNSIGNED NOT NULL,phone bigint UNSIGNED NOT NULL,PRIMARY KEY (qq), INDEX phone_index(phone) USING BTREE) ENGINE = MyISAM;"
    exec_create_table=$($MYSQL_ETL "${create_table}")
    # load the sorted chunk file into the table
    load_data="LOAD DATA INFILE '${database_path}' REPLACE INTO TABLE ${table_name} FIELDS TERMINATED BY ',' enclosed by '' lines terminated by '\n' (qq,phone);"
    exec_load_data=$($MYSQL_ETL "${load_data}")
    # read the smallest and largest qq of this table
    query_begin="select * from ${table_name} order by qq asc limit 1;"
    query_end="select * from ${table_name} order by qq desc limit 1;"
    query_begin_done=$($MYSQL_ETL "${query_begin}")
    query_end_done=$($MYSQL_ETL "${query_end}")
    array=(${query_begin_done})
    begin=${array[0]}
    array=(${query_end_done})
    end=${array[0]}
    # record the key range in the index table
    insert_index="INSERT INTO qq_database_index (database_name, begin, end) VALUES ('${table_name}',${begin},${end});"
    insert_index_done=$($MYSQL_ETL "${insert_index}")
    #pack (the directory must be the data directory of the database in DB_NAME)
    myisampack /var/lib/mysql/bind_search_service/${table_name}
    myisamchk -rq /var/lib/mysql/bind_search_service/${table_name}
    #update
    #remove file
    > /boot/bigfile
    rm ${database_path}
    times=$(date "+%Y-%m-%d %H:%M:%S")
    echo "[${times}] Insert Data ${table_name} Done!"
done
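step4:
After the script finishes, refresh the tables: flush tables;
step5: Query
First query the index table with begin <= keys <= end and take the database name from the returned row.
Then run a second query: SELECT * FROM database_name WHERE qq = keys;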
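The first lookup is a range search over non-overlapping intervals, so it can also be done in memory once the index rows are loaded. The following is a minimal sketch under stated assumptions: find_table is a hypothetical helper, and index_rows is assumed to be a list of (database_name, begin, end) tuples sorted by begin.

```python
import bisect


def find_table(index_rows, key):
    """Return the database_name whose [begin, end] range contains key, or None."""
    begins = [row[1] for row in index_rows]
    # position of the last range whose begin is <= key
    pos = bisect.bisect_right(begins, key) - 1
    if pos < 0:
        return None
    name, begin, end = index_rows[pos]
    return name if begin <= key <= end else None


rows = [("qq_database_1", 10000, 19999), ("qq_database_2", 20000, 35000)]
print(find_table(rows, 20000))  # qq_database_2
print(find_table(rows, 5))      # None
```

The returned name is then substituted into the second query, so only one of the 720 tables is ever touched for a primary-key lookup.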
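With sharding and indexes, queries are fast and the storage footprint is small. A primary-key lookup takes roughly 0.05s or less. A lookup through the phone index, however, has to query every shard.
Write a loop that builds the table names and handle the logic properly; the query time is then also roughly within 0.5s.
Finally, here is the Python source for merge-sorting the files: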
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 2021/3/9 10:12
# @Author : Smida
# @FileName: sortDatabase.py
# @Software: PyCharm
import os
import time


class SortDatabaseManager():
    dataPath = "E:\\ariDownload\\裤子\\q绑\\qqSearch_split_6\\OutPut"  # data directory
    dataFiles = [i for i in os.listdir(dataPath) if i[-3::] == 'txt']  # names of all txt files in the directory
    theQQMaxMap = {}
    theSplitFlag = ','
    theDataPosition = 0
    timeScale = 0

    @staticmethod
    def printLog(msg):
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S',time.localtime())}] -> {msg}")

    @staticmethod
    def caculateTimeSpan(fileSize,timeScale):
        # estimated seconds for a file, based on the MB/s measured on the previous file
        return fileSize/timeScale if timeScale else "Null"

    @staticmethod
    def getFileSize(filePath):
        # file size in MB
        return round(os.path.getsize(filePath) / float(1024 * 1024),2)

    def sortFile(self,path, chunk):
        # Read the file in chunks of about `chunk` bytes, sort each chunk by qq,
        # and write every sorted chunk to its own temporary file.
        self.printLog(f"Start splitting file {path} \n chunk size is {chunk}")
        baseDir, baseFile = os.path.split(path)
        fileIndex = 1
        files = []
        with open(path, 'r') as f:
            while True:
                lines = f.readlines(chunk)
                # a last line without a newline would fuse with its neighbour after sorting
                lines = [l if l.endswith("\n") else l + "\n" for l in lines]
                lines.sort(key=lambda x: int(x.split(",")[0]))
                if lines:
                    newFileName = os.path.join(baseDir, f"{baseFile[1:-4]}_{fileIndex}.txt")
                    with open(newFileName, 'a') as sf:
                        sf.write(''.join(lines))
                    files.append(newFileName)
                    fileIndex += 1
                else:
                    break
        return files

    def mergeFiles(self,fileList: list,filePath: str) -> str:
        """
        :param fileList: a list of file absolute path
        :param filePath: path of the merged output file (overwritten)
        :return: a string of merged file absolute path
        """
        self.printLog(f"Start merging files, overwriting output {filePath}")
        fs = [open(file_, 'r') for file_ in fileList]
        tempDict = {}  # open file -> its current (smallest unwritten) line
        mergedFile = open(filePath, 'w+')
        for f in fs:
            initLine = f.readline()
            if initLine:
                tempDict[f] = initLine
            else:
                f.close()
        while tempDict:
            # write the line with the smallest qq, then advance that file
            min_item = min(tempDict.items(), key=lambda x: int(x[1].split(",")[0]))
            mergedFile.write(min_item[1])
            nextLine = min_item[0].readline()
            if nextLine:
                tempDict[min_item[0]] = nextLine
            else:
                del tempDict[min_item[0]]
                min_item[0].close()
        mergedFile.close()
        for file_ in fileList:
            self.printLog(f"Removing temporary file {file_}")
            os.remove(file_)
        return os.path.join(filePath)

    def getFilePaths(self):
        pathList = []
        for fileName in self.dataFiles:
            pathList.append(f"{self.dataPath}\\{fileName}")
        return pathList

    def setTimeScale(self,fileSize,timeSpan):
        self.timeScale = fileSize // timeSpan

    # Sort every file on its own, then merge all of them into one sorted output
    def startSortFile(self):
        allStartTime = time.time()
        filePathList = []
        for fileName in self.dataFiles:
            filePath = f"{self.dataPath}\\{fileName}"
            # qqSearch_1.txt is skipped here (presumably already sorted) but still joins the final merge
            if fileName == "qqSearch_1.txt":
                continue
            fileSize = self.getFileSize(filePath)
            startTime = time.time()
            self.printLog(f"Processing file: {fileName} estimated time: {self.caculateTimeSpan(fileSize, self.timeScale)}s")
            self.mergeFiles(self.sortFile(filePath,1024 * 1024 * 500),filePath)
            endTime = time.time()
            self.setTimeScale(fileSize,endTime - startTime)
        self.printLog("Starting final merge...")
        for i in self.dataFiles:
            filePathList.append(f"{self.dataPath}\\{i}")
        self.mergeFiles(filePathList, "MargedFileOutPut.txt")
        allEndTime = time.time()
        self.printLog(f"Done! Elapsed {allEndTime-allStartTime}")


oj = SortDatabaseManager()
path = oj.startSortFile()
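As with the QQ database, a similar approach works for any data that can be stored as bigint, and it is worth trying when the server is under-specced.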