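TF-IDF (term frequency–inverse document frequency) is a weighting technique widely used in information retrieval and data mining. TF stands for term frequency and IDF for inverse document frequency.

The main idea of TF-IDF: if a word or phrase occurs with a high frequency (TF) in one article and rarely occurs in other articles, it is considered to discriminate well between categories and is suitable for classification. The TF-IDF weight is simply TF * IDF. TF is the frequency with which a term occurs in document d. The idea behind IDF is that the fewer documents contain term t (that is, the smaller n is), the larger the IDF, and the better t discriminates between categories.

This is also where IDF falls short. Suppose m documents in category C contain term t and k documents in the other categories contain it, so that n = m + k documents contain t in total. When m is large, n is large too, and the IDF formula yields a small value, which suggests that t discriminates poorly. In practice, however, a term that occurs frequently in the documents of one category represents that category well. Such terms should receive a higher weight and be selected as feature words that distinguish the category from the others.

A high term frequency within a particular document, combined with a low document frequency across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and keep the important ones.

In NLP, TF-IDF is a common statistical method for evaluating how important a word is to one document in a collection or corpus, and it is typically used to extract text features, i.e. keywords. A word's importance grows in proportion to the number of times it appears in the document and shrinks as the word becomes more frequent across the corpus.

Search engines often apply variants of TF-IDF weighting as a measure or ranking of the relevance between a document and a user query. Besides TF-IDF, Internet search engines also use ranking methods based on link analysis to decide the order in which documents appear in the search results.

The exact idf formula differs slightly from source to source. Some add 1 to k in the denominator to keep the denominator from being 0; others add 1 to both the numerator and the denominator, which is a smoothing technique. This post sticks to the original idf formula, because it matches the one used in gensim.

We use three sample texts (text1, text2 and text3, listed below). They introduce football, basketball and volleyball respectively, and together they form the corpus.

After tokenizing, we remove the stopwords and count how many times each word occurs. Taking text3 as an example, the resulting count dictionary is:

    Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})

After preprocessing, each of the three sample texts gives us such a count dictionary holding the number of occurrences of every word in that text. We then use the TF-IDF model already implemented in gensim to print, for each article, the three words with the highest TF-IDF and their tfidf values.

The gensim output matches our expectations fairly well: the football article yields the keywords football and rugby, the basketball article yields play and cm, and the volleyball article yields net and teammate.

With this understanding of the TF-IDF model we can also implement it ourselves, which is the best way to learn an algorithm.

The hand-written TF-IDF model extracts the same keywords as gensim. The last two words for the basketball article differ only because several words have exactly the same tfidf value, so which of the tied words make the top three depends on the order in which the ties happen to be sorted. One question remains, though: the computed tfidf values are different. Why?

The reason is that gensim normalizes the resulting tf-idf vector, turning it into a unit vector. We therefore need to add this normalization step to our own code. Once it is added, the output agrees with the gensim result.

Gensim is a famous Python module for NLP, and its source code is well worth reading when you have time. Later posts will cover other applications of TF-IDF; discussion is welcome.

The complete code is listed at the end of this post. These are my own study notes, based on the blog of 山阴少年.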
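Introduction to TF-IDF
In NLP, TF-IDF is computed as:
$$\mathrm{tfidf} = \mathrm{tf} \times \mathrm{idf}$$
tf is the term frequency, i.e. how often a word occurs in a document: if a word occurs i times in a document that contains N words in total, then tf = i/N.
idf is the inverse document frequency: if the corpus contains n documents and a word occurs in k of them, then
$$\mathrm{idf} = \log_2 \frac{n}{k}$$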
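The two definitions above can be checked numerically with a minimal sketch. The function names and the sample numbers are illustrative assumptions, not part of gensim or of the code later in this post. The two `+1` variants are the alternatives that some sources use to avoid a zero denominator or to smooth the estimate.

```python
import math

def tf(occurrences, total_words):
    """Term frequency: i / N."""
    return occurrences / total_words

def idf(n_docs, docs_with_term):
    """Unsmoothed idf with a base-2 logarithm: log2(n / k)."""
    return math.log2(n_docs / docs_with_term)

def idf_plus_one_denominator(n_docs, docs_with_term):
    """Variant that adds 1 to k so the denominator can never be 0."""
    return math.log2(n_docs / (docs_with_term + 1))

def idf_smoothed(n_docs, docs_with_term):
    """Variant that adds 1 to both numerator and denominator (smoothing)."""
    return math.log2((n_docs + 1) / (docs_with_term + 1))

# Hypothetical numbers: a word occurs 4 times in a 100-word document
# and appears in 1 of the 4 documents of the corpus.
weight = tf(4, 100) * idf(4, 1)
print(weight)  # 0.04 * log2(4) = 0.08

# A word that appears in every document gets an unsmoothed idf of 0.
print(idf(4, 4))  # 0.0
```

Note that the `+1`-in-the-denominator variant goes negative for a word that occurs in every document, which is one reason the variants are not interchangeable.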
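If the corpus contains D documents, the tfidf value of word i in document j is
$$\mathrm{tfidf}_{i,j} = \mathrm{tf}_{i,j} \times \log_2 \frac{D}{\mathrm{df}_i}$$
where $\mathrm{tf}_{i,j}$ is the term frequency of word i in document j and $\mathrm{df}_i$ is the number of documents that contain word i.
Texts and preprocessing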
    text1 = """
    Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes.
    """

    text2 = """
    Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated.
    """

    text3 = """
    Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net.
    """
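Next comes the preprocessing of the texts.
First strip the newline characters from the text, then split it into sentences, tokenize each sentence, and drop the punctuation. The complete Python code follows; its input parameter is the article text:

    import nltk
    import string

    # Text preprocessing: split the text into sentences,
    # tokenize them, and drop punctuation tokens.
    def get_tokens(text):
        text = text.replace('\n', '')
        sents = nltk.sent_tokenize(text)  # sentence splitting
        tokens = []
        for sent in sents:
            for word in nltk.word_tokenize(sent):  # tokenization
                if word not in string.punctuation:  # drop punctuation
                    tokens.append(word)
        return tokens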
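    from collections import Counter
    from nltk.corpus import stopwords  # stopword list

    # Remove the stopwords from the raw text and build the count
    # dictionary, i.e. the number of occurrences of each word.
    def make_count(text):
        tokens = get_tokens(text)
        stop_words = set(stopwords.words('english'))
        filtered = [w for w in tokens if w not in stop_words]  # drop stopwords
        count = Counter(filtered)
        return count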
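For readers who do not have the NLTK data files at hand, the counting step can be sketched with the standard library alone. The regex tokenizer and the tiny `STOP_WORDS` set below are hypothetical stand-ins for `nltk.word_tokenize` and NLTK's English stopword list, and the sample sentence is adapted from text3, so the counts will not match the real pipeline exactly. Like the code above, the sketch compares case-sensitively, which is why a capitalized "A" survives the filter.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (NLTK's real list is much longer).
STOP_WORDS = {"a", "the", "of", "is", "it", "to", "be", "over", "only", "before"}

def simple_count(text):
    """Tokenize on runs of letters, drop stopwords, count what remains."""
    tokens = re.findall(r"[A-Za-z]+", text)
    # Case-sensitive membership test: "A" is kept because only "a" is listed.
    return Counter(w for w in tokens if w not in STOP_WORDS)

sample = ("A third teammate volleys it across the net. "
          "A team is allowed only three touches of the ball "
          "before it must be returned over the net.")
counts = simple_count(sample)
print(counts.most_common(2))
```

Lowercasing the tokens before filtering would merge "A" with "a" and remove it; the original pipeline does not do this, which is why 'A' and 'To' show up in its count dictionary.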
TF-IDF in Gensim

    from nltk.corpus import stopwords  # stopword list
    from gensim import corpora, models

    # Training with gensim's Tfidf model
    def get_words(text):
        tokens = get_tokens(text)
        stop_words = set(stopwords.words('english'))
        filtered = [w for w in tokens if w not in stop_words]
        return filtered

    # Token lists for the three texts
    count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
    countlist = [count1, count2, count3]

    # Train the TfidfModel in gensim
    dictionary = corpora.Dictionary(countlist)
    new_dict = {v: k for k, v in dictionary.token2id.items()}  # id -> token
    corpus2 = [dictionary.doc2bow(count) for count in countlist]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf = tfidf2[corpus2]

    # Output
    print("\nTraining by gensim Tfidf Model.......\n")
    for i, doc in enumerate(corpus_tfidf):
        print("Top words in document %d" % (i + 1))
        sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # list of (id, score)
        for num, score in sorted_words[:3]:
            print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

The output:

    Training by gensim Tfidf Model.......

    Top words in document 1
        Word: football, TF-IDF: 0.84766
        Word: rugby, TF-IDF: 0.21192
        Word: known, TF-IDF: 0.14128
    Top words in document 2
        Word: play, TF-IDF: 0.29872
        Word: cm, TF-IDF: 0.19915
        Word: diameter, TF-IDF: 0.19915
    Top words in document 3
        Word: net, TF-IDF: 0.45775
        Word: teammate, TF-IDF: 0.34331
        Word: across, TF-IDF: 0.22888
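The gensim calls above may look opaque at first: `corpora.Dictionary` assigns every distinct token an integer id, and `doc2bow` turns a token list into sparse `(token_id, count)` pairs. The pure-Python sketch below imitates that idea only. `build_vocab` and `to_bow` are hypothetical helper names, the two toy documents are made up, and the ids assigned here are not guaranteed to match the ones gensim would choose.

```python
from collections import Counter

def build_vocab(docs):
    """Map each distinct token to an integer id, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Return the document as (token_id, count) pairs sorted by id."""
    counts = Counter(doc)
    return sorted((vocab[token], n) for token, n in counts.items())

# Two made-up token lists standing in for the preprocessed articles.
docs = [["ball", "net", "ball"], ["net", "court"]]
vocab = build_vocab(docs)
bows = [to_bow(doc, vocab) for doc in docs]

print(vocab)  # {'ball': 0, 'net': 1, 'court': 2}
print(bows)   # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```

The inverted mapping `new_dict` in the gensim code exists for the same reason it would here: the model works on ids, and the ids have to be translated back into words for printing.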
Implementing the TF-IDF model by hand
Here is my own TF-IDF implementation (it continues from the preprocessing code above):

    import math

    # tf: relative frequency of word within one document
    def tf(word, count):
        return count[word] / sum(count.values())

    # Number of documents in count_list that contain word
    def n_containing(word, count_list):
        return sum(1 for count in count_list if word in count)

    # idf, with a base-2 logarithm
    def idf(word, count_list):
        return math.log2(len(count_list) / n_containing(word, count_list))

    # tf-idf
    def tfidf(word, count, count_list):
        return tf(word, count) * idf(word, count_list)

    # TF-IDF test
    count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
    countlist = [count1, count2, count3]
    print("Training by original algorithm......\n")
    for i, count in enumerate(countlist):
        print("Top words in document %d" % (i + 1))
        scores = {word: tfidf(word, count, countlist) for word in count}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # list of (word, score)
        for word, score in sorted_words[:3]:
            print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

The output:

    Training by original algorithm......

    Top words in document 1
        Word: football, TF-IDF: 0.30677
        Word: rugby, TF-IDF: 0.07669
        Word: known, TF-IDF: 0.05113
    Top words in document 2
        Word: play, TF-IDF: 0.05283
        Word: inches, TF-IDF: 0.03522
        Word: worth, TF-IDF: 0.03522
    Top words in document 3
        Word: net, TF-IDF: 0.10226
        Word: teammate, TF-IDF: 0.07669
        Word: across, TF-IDF: 0.05113
The gensim source code that computes the tf-idf values (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py) shows that gensim normalizes the resulting tf-idf vector to unit length. Adding that normalization step to the hand-written code:

    import numpy as np

    # Normalize the scores to a unit vector (L2 norm)
    def unitvec(sorted_words):
        values = [item[1] for item in sorted_words]
        l2_norm = float(np.linalg.norm(values))
        unit_vector = [(item[0], item[1] / l2_norm) for item in sorted_words]
        return unit_vector

    # TF-IDF test
    count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
    countlist = [count1, count2, count3]
    print("Training by original algorithm......\n")
    for i, count in enumerate(countlist):
        print("Top words in document %d" % (i + 1))
        scores = {word: tfidf(word, count, countlist) for word in count}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # list of (word, score)
        sorted_words = unitvec(sorted_words)  # normalize
        for word, score in sorted_words[:3]:
            print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

The output:

    Training by original algorithm......

    Top words in document 1
        Word: football, TF-IDF: 0.84766
        Word: rugby, TF-IDF: 0.21192
        Word: known, TF-IDF: 0.14128
    Top words in document 2
        Word: play, TF-IDF: 0.29872
        Word: shooting, TF-IDF: 0.19915
        Word: diameter, TF-IDF: 0.19915
    Top words in document 3
        Word: net, TF-IDF: 0.45775
        Word: teammate, TF-IDF: 0.34331
        Word: back, TF-IDF: 0.22888
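One property of the normalization step is worth spelling out: multiplying a whole vector by a positive constant does not change its unit vector. After normalization it therefore no longer matters whether tf is a relative frequency (count divided by document length, as in the code above) or a raw count, which, as far as I can tell, is what gensim's default configuration uses. The sketch below demonstrates only the scaling property; the counts, idf values and document length are made up for illustration.

```python
import math

def unit_vector(values):
    """Scale a list of numbers so that its L2 norm is 1."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]

# Hypothetical data for one document with three distinct words.
counts = [4, 3, 2]             # raw occurrence counts
idfs = [1.585, 1.585, 0.585]   # made-up idf values
total_words = 50               # made-up document length

raw_scores = [c * w for c, w in zip(counts, idfs)]                # tf as raw count
rel_scores = [c / total_words * w for c, w in zip(counts, idfs)]  # tf as count / N

# The two score vectors differ by the constant factor 1 / total_words,
# so they normalize to the same unit vector (up to rounding error).
print(unit_vector(raw_scores))
print(unit_vector(rel_scores))
```

This also explains why the unnormalized values printed earlier are all smaller than gensim's by a factor that is constant within each document but differs between documents.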
Summary
The complete code for this post:

    import math
    import string
    from collections import Counter  # counting

    import nltk
    import numpy as np
    from nltk.corpus import stopwords  # stopword list
    from gensim import corpora, models

    text1 = """
    Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. Unqualified, the word football is understood to refer to whichever form of football is the most popular in the regional context in which the word appears. Sports commonly called football in certain places include association football (known as soccer in some countries); gridiron football (specifically American football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); and Gaelic football. These different variations of football are known as football codes.
    """

    text2 = """
    Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period of play (overtime) is mandated.
    """

    text3 = """
    Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across the net. A team is allowed only three touches of the ball before it must be returned over the net.
    """

    # Text preprocessing: split the text into sentences,
    # tokenize them, and drop punctuation tokens.
    def get_tokens(text):
        text = text.replace('\n', '')
        sents = nltk.sent_tokenize(text)  # sentence splitting
        tokens = []
        for sent in sents:
            for word in nltk.word_tokenize(sent):  # tokenization
                if word not in string.punctuation:  # drop punctuation
                    tokens.append(word)
        return tokens

    # Remove the stopwords from the raw text and build the count
    # dictionary, i.e. the number of occurrences of each word.
    def make_count(text):
        tokens = get_tokens(text)
        stop_words = set(stopwords.words('english'))
        filtered = [w for w in tokens if w not in stop_words]  # drop stopwords
        count = Counter(filtered)
        return count

    # tf: relative frequency of word within one document
    def tf(word, count):
        return count[word] / sum(count.values())

    # Number of documents in count_list that contain word
    def n_containing(word, count_list):
        return sum(1 for count in count_list if word in count)

    # idf, with a base-2 logarithm
    def idf(word, count_list):
        return math.log2(len(count_list) / n_containing(word, count_list))

    # tf-idf
    def tfidf(word, count, count_list):
        return tf(word, count) * idf(word, count_list)

    # Normalize the scores to a unit vector (L2 norm)
    def unitvec(sorted_words):
        values = [item[1] for item in sorted_words]
        l2_norm = float(np.linalg.norm(values))
        unit_vector = [(item[0], item[1] / l2_norm) for item in sorted_words]
        return unit_vector

    # TF-IDF test
    count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
    countlist = [count1, count2, count3]
    print("Training by original algorithm......\n")
    for i, count in enumerate(countlist):
        print("Top words in document %d" % (i + 1))
        scores = {word: tfidf(word, count, countlist) for word in count}
        sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)  # list of (word, score)
        sorted_words = unitvec(sorted_words)  # normalize
        for word, score in sorted_words[:3]:
            print("    Word: %s, TF-IDF: %s" % (word, round(score, 5)))

    # Training with gensim's Tfidf model
    def get_words(text):
        tokens = get_tokens(text)
        stop_words = set(stopwords.words('english'))
        filtered = [w for w in tokens if w not in stop_words]
        return filtered

    # Token lists for the three texts
    count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
    countlist = [count1, count2, count3]

    # Train the TfidfModel in gensim
    dictionary = corpora.Dictionary(countlist)
    new_dict = {v: k for k, v in dictionary.token2id.items()}  # id -> token
    corpus2 = [dictionary.doc2bow(count) for count in countlist]
    tfidf2 = models.TfidfModel(corpus2)
    corpus_tfidf = tfidf2[corpus2]

    # Output
    print("\nTraining by gensim Tfidf Model.......\n")
    for i, doc in enumerate(corpus_tfidf):
        print("Top words in document %d" % (i + 1))
        sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)  # list of (id, score)
        for num, score in sorted_words[:3]:
            print("    Word: %s, TF-IDF: %s" % (new_dict[num], round(score, 5)))

    """
    Output:

    Training by original algorithm......

    Top words in document 1
        Word: football, TF-IDF: 0.84766
        Word: rugby, TF-IDF: 0.21192
        Word: word, TF-IDF: 0.14128
    Top words in document 2
        Word: play, TF-IDF: 0.29872
        Word: inches, TF-IDF: 0.19915
        Word: points, TF-IDF: 0.19915
    Top words in document 3
        Word: net, TF-IDF: 0.45775
        Word: teammate, TF-IDF: 0.34331
        Word: bat, TF-IDF: 0.22888

    Training by gensim Tfidf Model.......

    Top words in document 1
        Word: football, TF-IDF: 0.84766
        Word: rugby, TF-IDF: 0.21192
        Word: known, TF-IDF: 0.14128
    Top words in document 2
        Word: play, TF-IDF: 0.29872
        Word: cm, TF-IDF: 0.19915
        Word: diameter, TF-IDF: 0.19915
    Top words in document 3
        Word: net, TF-IDF: 0.45775
        Word: teammate, TF-IDF: 0.34331
        Word: across, TF-IDF: 0.22888
    """