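🤖 AI answer: Keyword mining with MySQL usually means extracting high-frequency words, important keywords, or potential search terms from text stored in the database (articles, comments, logs, product descriptions, and so on). MySQL is not a dedicated keyword-mining tool, but its string functions, aggregate functions, and regular-expression support are enough for basic to moderately advanced keyword mining.
Common methods and techniques for keyword mining in MySQL are listed below: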
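🔍 I. Core goals of keyword mining
1. Extract high-frequency words (those that occur most often)
2. Compute term frequency (TF)
3. Remove stop words (low-information words such as “的”, “是”, “了”)
4. Identify keyword combinations (phrases, n-grams)
5. Combine with external data (such as user search logs or product titles)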
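🧰 II. Common techniques (implemented in MySQL)
✅ 1. Tokenization (manual splitting + string functions)
MySQL has no built-in dictionary-based Chinese word segmenter (its ngram full-text parser only produces fixed-length character n-grams), so Chinese text either needs to be pre-tokenized with an external tool (for example in Python or Java) or roughly approximated with a stored procedure.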
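As a rough illustration of the external preprocessing route, the Python sketch below tokenizes each document outside the database and produces `(doc_id, word, tf)` rows that could then be bulk-inserted into a MySQL table. The regex tokenizer, the function name `tokenize_to_rows`, and the sample documents are assumptions made for illustration; for Chinese text a real segmentation library would take the place of the regex.

```python
import re
from collections import Counter

# Hypothetical tokenizer: runs of ASCII letters, lowercased.
# For Chinese text a segmentation library would replace this regex.
WORD_RE = re.compile(r"[a-z]+")


def tokenize_to_rows(docs):
    """Turn {doc_id: text} into sorted (doc_id, word, tf) rows."""
    rows = []
    for doc_id, text in sorted(docs.items()):
        counts = Counter(WORD_RE.findall(text.lower()))
        for word, tf in sorted(counts.items()):
            rows.append((doc_id, word, tf))
    return rows


if __name__ == "__main__":
    sample = {1: "MySQL keyword mining with MySQL", 2: "keyword ideas"}
    for row in tokenize_to_rows(sample):
        print(row)
```

Doing the tokenization outside MySQL keeps the SQL side down to plain `GROUP BY` aggregation over pre-split rows.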
Example: splitting text into words on spaces (English text; punctuation is not handled)
sql
-- Assumes a table `articles` with a text column `content`
SELECT
  word,
  COUNT(*) AS frequency
FROM (
  SELECT TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', n.n), ' ', -1)) AS word
  FROM articles
  INNER JOIN (
    SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
    -- extend with more numbers to cover longer texts
  ) n ON CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, ' ', '')) >= n.n - 1
) t
WHERE word != ''
GROUP BY word
ORDER BY frequency DESC
LIMIT 20;
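⚠️ Note: this method only works for English or other text that is already space-delimited, and it returns at most as many words per row as there are numbers in the sequence.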
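The nested `SUBSTRING_INDEX` call can be hard to read, so here is a small Python sketch of what it is meant to compute. The helpers `substring_index`, `nth_word`, and `split_like_query` are hypothetical names written for this illustration; they mimic the documented behaviour of MySQL's `SUBSTRING_INDEX` for a non-empty delimiter and assume the join lets `n` run from 1 up to the number of spaces plus one.

```python
def substring_index(s, delim, count):
    """Emulate MySQL SUBSTRING_INDEX for a non-empty delimiter."""
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    if count < 0:
        # Everything to the right of the count-th delimiter from the end.
        return delim.join(parts[count:])
    return ""


def nth_word(s, n):
    """The nested call from the query: the n-th space-separated token (1-based)."""
    return substring_index(substring_index(s, " ", n), " ", -1).strip()


def split_like_query(s):
    """Let n run from 1 to (number of spaces + 1), then drop empty tokens."""
    spaces = len(s) - len(s.replace(" ", ""))
    words = [nth_word(s, n) for n in range(1, spaces + 2)]
    return [w for w in words if w != ""]


if __name__ == "__main__":
    print(split_like_query("red  green blue"))
```

Consecutive spaces produce empty tokens, which is why the query filters with `word != ''`. Unlike the SQL version, this sketch does not cap `n` at the length of a fixed number sequence.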
✅ 2. Extracting keywords with regular expressions (REGEXP)
sql
-- Keep only tokens made up entirely of ASCII letters (tokens with attached punctuation are dropped)
SELECT
  word,
  COUNT(*) AS count
FROM (
  SELECT
    SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', n.n), ' ', -1) AS word
  FROM articles
  INNER JOIN (
    SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
    -- number sequence; extend as needed
  ) n ON CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, ' ', '')) >= n.n - 1
  WHERE SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', n.n), ' ', -1) REGEXP '^[a-zA-Z]+$'
) t
GROUP BY word
ORDER BY count DESC;
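To drop stop words, keep them in a `stopwords` table (one `word` per row) and filter against it in the query: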
sql
SELECT
  word,
  COUNT(*) AS frequency
FROM (
  SELECT
    TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(content, ' ', n.n), ' ', -1)) AS word
  FROM articles
  INNER JOIN (SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3) n
    ON CHAR_LENGTH(content) - CHAR_LENGTH(REPLACE(content, ' ', '')) >= n.n - 1
) t
WHERE word NOT IN (SELECT word FROM stopwords)
  AND word != ''
GROUP BY word
ORDER BY frequency DESC;
2. Simulating segmentation inside the database (rough):
sql
-- Split by character (suits single-character analysis)
SELECT
  SUBSTR(content, n.n, 1) AS char_word,
  COUNT(*) AS freq
FROM articles
INNER JOIN (
  SELECT 1 n UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4
  -- extend to match the actual text length
) n ON n.n <= CHAR_LENGTH(content)
GROUP BY char_word
ORDER BY freq DESC;
This suits single-character analysis: for example, “人工智能” is split into “人”, “工”, “智”, “能”.
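Single characters often carry little meaning on their own, so a common crude step up is to count overlapping character n-grams instead, which also touches on the n-gram goal listed earlier. The Python sketch below is one possible way to do this outside the database; `char_ngrams` and `top_ngrams` are hypothetical helper names, and bigrams (`n=2`) are just an assumed default.

```python
import re
from collections import Counter


def char_ngrams(text, n=2):
    """Overlapping character n-grams taken inside each run of word characters."""
    grams = []
    # \w+ matches runs of letters, digits, and CJK characters,
    # so n-grams never span punctuation or whitespace.
    for run in re.findall(r"\w+", text):
        grams.extend(run[i:i + n] for i in range(len(run) - n + 1))
    return grams


def top_ngrams(texts, n=2, k=3):
    """Count n-grams over all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    return counts.most_common(k)


if __name__ == "__main__":
    print(char_ngrams("人工智能"))
    print(top_ngrams(["人工智能", "人工智能应用"]))
```

Bigrams still include fragments that are not real words, so the counts are best read as candidates to review rather than as final keywords.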
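✅ 5. Computing TF-IDF (term frequency-inverse document frequency)
TF-IDF is used to judge how important a keyword is.
Steps:
1. Count how often each word occurs in each document (TF)
2. Count how many documents in the whole corpus contain the word (DF)
3. Compute `TF-IDF = TF × log(N/DF)`, where N is the total number of documents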
sql
-- Assumes a table `word_doc_freq` that stores per-document term frequencies (doc_id, word, tf)
-- CTEs require MySQL 8.0 or later
WITH doc_count AS (
  -- total number of documents (N)
  SELECT COUNT(DISTINCT doc_id) AS total_docs FROM word_doc_freq
),
word_df AS (
  -- number of documents that contain each word (DF)
  SELECT word, COUNT(DISTINCT doc_id) AS df
  FROM word_doc_freq
  GROUP BY word
)
SELECT
  w.doc_id,
  w.word,
  w.tf * LOG(d.total_docs / w2.df) AS tfidf
FROM word_doc_freq w
CROSS JOIN doc_count d
JOIN word_df w2 ON w.word = w2.word
ORDER BY tfidf DESC;
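To sanity-check the arithmetic on a small sample, the same three steps can be reproduced in a few lines of Python. This is only a sketch under assumptions: the function name `tfidf` and the sample rows are made up for illustration, the rows mirror the assumed `(doc_id, word, tf)` layout of `word_doc_freq`, and the natural logarithm is used to match MySQL's single-argument `LOG()`.

```python
import math
from collections import defaultdict


def tfidf(rows):
    """rows: iterable of (doc_id, word, tf). Returns {(doc_id, word): score}."""
    rows = list(rows)
    # N: number of distinct documents.
    total_docs = len({doc_id for doc_id, _, _ in rows})
    # DF: set of documents that contain each word.
    docs_per_word = defaultdict(set)
    for doc_id, word, _ in rows:
        docs_per_word[word].add(doc_id)
    # TF-IDF = TF * ln(N / DF) for every (document, word) pair.
    return {
        (doc_id, word): tf * math.log(total_docs / len(docs_per_word[word]))
        for doc_id, word, tf in rows
    }


if __name__ == "__main__":
    sample = [(1, "mysql", 2), (1, "keyword", 1), (2, "keyword", 1), (2, "ideas", 3)]
    for key, score in sorted(tfidf(sample).items(), key=lambda kv: -kv[1]):
        print(key, round(score, 3))
```

A word that appears in every document gets a score of 0 because `log(N/DF)` is `log(1)`, which is how TF-IDF pushes down ubiquitous terms.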