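This post and its resources were last updated on 2021-10-10 by sky995
Taking the QQ database as an example:
The source data is an unsorted txt file.
step1: Split, sort, and merge the source file, using the qq field as the key (phone works too).
I wrote my own script for this; it needs to be used together with EmEditor.
Use EmEditor to split the source file by line, 75 million lines per file, which gives about 10 files.
Merge-sort these ten files; the end result must be a single sorted source file.
step2: Use EmEditor to split the sorted source file by line, 1 million lines per file, which gives about 720 files.
These files are the data sources for the tables created later. Each file corresponds to one table, so there are 720 tables.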
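For readers without EmEditor, the line-based split in step2 could be sketched in Python roughly as follows. This is only an illustration, not the tool used in this post: the function name `split_by_lines` is made up here, and the output prefix simply mirrors the `MargedFileOutPut_N.txt` names that the import script expects.

```python
import itertools
import os


def split_by_lines(src_path, out_dir, lines_per_file, prefix="MargedFileOutPut_"):
    """Copy consecutive runs of `lines_per_file` lines into numbered files.

    Because the source is already sorted, every output file is sorted too,
    and the key ranges of the output files do not overlap.
    """
    out_paths = []
    with open(src_path, "r", encoding="utf-8") as src:
        for index in itertools.count(1):
            # islice pulls at most lines_per_file lines without loading the whole file
            chunk = list(itertools.islice(src, lines_per_file))
            if not chunk:
                break
            out_path = os.path.join(out_dir, f"{prefix}{index}.txt")
            with open(out_path, "w", encoding="utf-8") as dst:
                dst.writelines(chunk)
            out_paths.append(out_path)
    return out_paths
```

Calling `split_by_lines("sorted.txt", "out", 1_000_000)` would produce `out/MargedFileOutPut_1.txt`, `out/MargedFileOutPut_2.txt`, and so on, with the last file holding whatever lines remain.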
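step3: Batch-create the database tables and import the data
First create a table used to look up the database (table) name, with the fields database_name, begin, end.
begin and end are the first and last key of each sorted table.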
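To make the role of the index table concrete, here is a small in-memory sketch in Python. It is not part of the original scripts: `build_index` and `find_table` are hypothetical helper names, and the keys are made-up demo values. The idea is that each row stores the first and last key of one sorted shard, so a binary search over the begin values finds the only shard that can contain a given key.

```python
import bisect


def build_index(shards):
    """shards maps a table name to its sorted list of keys.

    Returns rows of (database_name, begin, end), ordered by begin.
    """
    rows = [(name, keys[0], keys[-1]) for name, keys in shards.items() if keys]
    rows.sort(key=lambda row: row[1])
    return rows


def find_table(index_rows, key):
    """Return the table whose [begin, end] range contains key, else None."""
    begins = [row[1] for row in index_rows]
    # Rightmost row whose begin is <= key
    pos = bisect.bisect_right(begins, key) - 1
    if pos < 0:
        return None
    name, begin, end = index_rows[pos]
    return name if begin <= key <= end else None


demo_index = build_index({
    "qq_database_1": [10001, 10002, 10500],
    "qq_database_2": [10777, 20000, 31000],
})
print(find_table(demo_index, 20000))
```

A key that falls in the gap between two shards (for example 10600 here) is not stored anywhere, so the lookup can stop after the index step.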
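Taking Linux as an example: batch-create the tables and compress them (compressed tables take up less disk space and improve query efficiency).
Note that compressed tables cannot be modified. The shell script is pasted below; you need some background knowledge to adapt it to your environment.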
#!/bin/bash
index=1
USER_NAME="root"
PASSWD=""
DB_NAME=""
HOST_NAME="127.0.0.1"
DB_PORT="3306"
endIndex=720
# -s: plain non-tabular output; -N: no column-name header, so the first field is the value itself
MYSQL_ETL="mysql -h${HOST_NAME} -P${DB_PORT} -u${USER_NAME} -p${PASSWD} ${DB_NAME} -s -N -e"
for ((i=index; i<=endIndex; i++))
do
    table_name="qq_database_${i}"
    database_path="/var/lib/mysql-files/qq_database/MargedFileOutPut_${i}.txt"
    times=$(date "+%Y-%m-%d %H:%M:%S")
    echo "[${times}] Insert Data ${table_name}"
    # create one MyISAM table per source file
    create_table="CREATE TABLE ${table_name} ( qq bigint UNSIGNED NOT NULL,phone bigint UNSIGNED NOT NULL,PRIMARY KEY (qq), INDEX phone_index(phone) USING BTREE) ENGINE = MyISAM;"
    exec_create_table=$($MYSQL_ETL "${create_table}")
    # bulk-load the file into the table
    load_data="LOAD DATA INFILE '${database_path}' REPLACE INTO TABLE ${table_name} FIELDS TERMINATED BY ',' enclosed by '' lines terminated by '\n' (qq,phone);"
    exec_load_data=$($MYSQL_ETL "${load_data}")
    # smallest and largest qq in this table; explicit ORDER BY so the result does not depend on storage order
    query_begin="select * from ${table_name} order by qq asc limit 1;"
    query_end="select * from ${table_name} order by qq desc limit 1;"
    query_begin_done=$($MYSQL_ETL "${query_begin}")
    query_end_done=$($MYSQL_ETL "${query_end}")
    # word splitting turns "qq<TAB>phone" into an array; element 0 is qq
    array=(${query_begin_done})
    begin=${array[0]}
    array=(${query_end_done})
    end=${array[0]}
    # register the table and its key range in the index table
    insert_index="INSERT INTO qq_database_index (database_name, begin, end) VALUES ('${table_name}',${begin},${end});"
    insert_index_done=$($MYSQL_ETL "${insert_index}")
    #pack: compress the table, then rebuild its indexes (the path must be the data directory of the database in use)
    myisampack /var/lib/mysql/bind_search_service/${table_name}
    myisamchk -rq /var/lib/mysql/bind_search_service/${table_name}
    #update
    #remove file
    > /boot/bigfile
    rm "${database_path}"
    times=$(date "+%Y-%m-%d %H:%M:%S")
    echo "[${times}] Insert Data ${table_name} Done!"
done
step4: Refresh the tables
After the script finishes, the tables need to be refreshed: flush tables;
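step5: Query
First query the database index table with begin <= keys <= end, and take the database (table) name from the result.
Then run a second query: SELECT * FROM database_name WHERE qq = keys;
With sharded tables and indexes, queries are very efficient and the data takes little space. A lookup by primary key takes roughly 0.05s or less. A lookup by the phone index, of course, has to query every shard table.
Write a loop that builds the table names and handle the logic properly; the query time is then also roughly within 0.5s.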
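The two query paths in step5 can be sketched as below. To keep the example self-contained it uses Python's built-in sqlite3 with an in-memory database instead of MySQL, so it only illustrates the routing logic; `build_demo`, `query_by_qq`, and `query_by_phone` are hypothetical names, and the rows are made-up demo values.

```python
import sqlite3

INDEX_TABLE = "qq_database_index"


def build_demo(conn, shards):
    """Create the index table plus one table per shard (demo data only)."""
    conn.execute(
        f'CREATE TABLE {INDEX_TABLE} (database_name TEXT, "begin" INTEGER, "end" INTEGER)'
    )
    for name, rows in shards.items():
        conn.execute(f"CREATE TABLE {name} (qq INTEGER PRIMARY KEY, phone INTEGER NOT NULL)")
        conn.execute(f"CREATE INDEX {name}_phone_index ON {name} (phone)")
        conn.executemany(f"INSERT INTO {name} (qq, phone) VALUES (?, ?)", rows)
        keys = [qq for qq, _ in rows]
        conn.execute(f"INSERT INTO {INDEX_TABLE} VALUES (?, ?, ?)", (name, min(keys), max(keys)))


def query_by_qq(conn, qq):
    """Index lookup first, then a primary-key lookup in the single matching shard."""
    row = conn.execute(
        f'SELECT database_name FROM {INDEX_TABLE} WHERE "begin" <= ? AND "end" >= ?',
        (qq, qq),
    ).fetchone()
    if row is None:
        return None
    # Table names cannot be bound as parameters; this one comes from our own index table
    return conn.execute(f"SELECT qq, phone FROM {row[0]} WHERE qq = ?", (qq,)).fetchone()


def query_by_phone(conn, phone):
    """phone is not the shard key, so every shard has to be checked in a loop."""
    names = [r[0] for r in conn.execute(
        f'SELECT database_name FROM {INDEX_TABLE} ORDER BY "begin"'
    )]
    results = []
    for name in names:
        results.extend(
            conn.execute(f"SELECT qq, phone FROM {name} WHERE phone = ?", (phone,)).fetchall()
        )
    return results
```

The qq path touches exactly two tables regardless of how many shards exist, while the phone path issues one indexed query per shard, which matches the difference in query time described above.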
Finally, here is the Python source for merge-sorting the files
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# @Time : 2021/3/9 10:12
# @Author : Smida
# @FileName: sortDatabase.py
# @Software: PyCharm
import os
import time


class SortDatabaseManager():
    dataPath = "E:\\ariDownload\\裤子\\q绑\\qqSearch_split_6\\OutPut"  # data directory
    dataFiles = [i for i in os.listdir(dataPath) if i.endswith('.txt')]  # names of all txt files in that directory
    theQQMaxMap = {}
    theSplitFlag = ','
    theDataPosition = 0
    timeScale = 0  # processing speed in MB/s, measured on the previous file

    @staticmethod
    def printLog(msg):
        print(f"[{time.strftime('%Y-%m-%d %H:%M:%S',time.localtime())}] -> {msg}")

    @staticmethod
    def caculateTimeSpan(fileSize, timeScale):
        # estimated seconds for a file of fileSize MB; "Null" until a speed has been measured
        return fileSize / timeScale if timeScale else "Null"

    @staticmethod
    def getFileSize(filePath):
        # file size in MB, rounded to two decimals
        return round(os.path.getsize(filePath) / float(1024 * 1024), 2)

    def sortFile(self, path, chunk):
        # Read the file in pieces of roughly `chunk` bytes, sort each piece in memory
        # and write it to its own temporary file. Returns the temporary file paths.
        self.printLog(f"Splitting file {path} \n chunk size hint: {chunk}")
        baseDir, baseFile = os.path.split(path)
        fileIndex = 1
        files = []
        with open(path, 'r') as f:
            while True:
                lines = f.readlines(chunk)
                if not lines:
                    break
                lines.sort(key=lambda x: int(x.split(",")[0]))
                newFileName = os.path.join(baseDir, f"{baseFile[1:-4]}_{fileIndex}.txt")
                with open(newFileName, 'a') as sf:
                    sf.write(''.join(lines))
                files.append(newFileName)
                fileIndex += 1
        return files

    def mergeFiles(self, fileList: list, filePath: str) -> str:
        """
        :param fileList: a list of file absolute path
        :param filePath: path of the merged output file (overwritten)
        :return: a string of merged file absolute path
        """
        self.printLog(f"Merging files, overwriting output at {filePath}")
        fs = [open(file_, 'r') for file_ in fileList]
        tempDict = {}  # open file -> its current (not yet written) line
        mergedFile = open(filePath, 'w+')
        for f in fs:
            initLine = f.readline()
            if initLine:
                tempDict[f] = initLine
            else:
                f.close()  # empty input file, nothing to merge
        while tempDict:
            # pick the file whose current line has the smallest key
            min_item = min(tempDict.items(), key=lambda x: int(x[1].split(",")[0]))
            mergedFile.write(min_item[1])
            nextLine = min_item[0].readline()
            if nextLine:
                tempDict[min_item[0]] = nextLine
            else:
                del tempDict[min_item[0]]
                min_item[0].close()
        mergedFile.close()
        for file_ in fileList:
            self.printLog(f"Removing temporary file {file_}")
            os.remove(file_)
        return filePath

    def getFilePaths(self):
        pathList = []
        for fileName in self.dataFiles:
            pathList.append(f"{self.dataPath}\\{fileName}")
        return pathList

    def setTimeScale(self, fileSize, timeSpan):
        # true division: floor division would round slow speeds down to 0
        self.timeScale = fileSize / timeSpan if timeSpan else 0

    # Sort every file on its own, then merge all of them into one sorted file
    def startSortFile(self):
        allStartTime = time.time()
        filePathList = []
        for fileName in self.dataFiles:
            filePath = f"{self.dataPath}\\{fileName}"
            if fileName == "qqSearch_1.txt":
                # this file is skipped in the per-file sort and only takes part in the final merge
                continue
            fileSize = self.getFileSize(filePath)
            startTime = time.time()
            self.printLog(f"Processing file: {fileName} estimated time: {self.caculateTimeSpan(fileSize, self.timeScale)}s")
            self.mergeFiles(self.sortFile(filePath, 1024 * 1024 * 500), filePath)
            endTime = time.time()
            self.setTimeScale(fileSize, endTime - startTime)
        self.printLog("Starting final merge...")
        for i in self.dataFiles:
            filePathList.append(f"{self.dataPath}\\{i}")
        self.mergeFiles(filePathList, "MargedFileOutPut.txt")
        allEndTime = time.time()
        self.printLog(f"Done! Elapsed: {allEndTime-allStartTime}s")


if __name__ == "__main__":
    oj = SortDatabaseManager()
    oj.startSortFile()
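As a side note, the final k-way merge can also be written with the standard library's `heapq.merge`, which keeps the current line of each input in a heap instead of scanning all of them with `min()` on every step. The following is a minimal alternative sketch, not the script used in this post; `merge_sorted_files` and `sort_key` are names chosen here, and it assumes every input is already sorted by its first comma-separated field and that every line ends with a newline.

```python
import heapq
from contextlib import ExitStack


def sort_key(line):
    """Numeric value of the first comma-separated field."""
    return int(line.split(",")[0])


def merge_sorted_files(in_paths, out_path):
    """Merge already-sorted text files into one sorted file, streaming line by line."""
    with ExitStack() as stack:
        sources = [stack.enter_context(open(p, "r", encoding="utf-8")) for p in in_paths]
        out = stack.enter_context(open(out_path, "w", encoding="utf-8"))
        # heapq.merge is lazy: only one pending line per input file is held in memory
        out.writelines(heapq.merge(*sources, key=sort_key))
    return out_path
```

Unlike `mergeFiles` above, this version does not delete its inputs, so cleanup would have to be done separately.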
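As with the QQ database, any data set whose key can be stored as a bigint can use a similar approach. It is worth trying when the server is not well equipped.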