weightSMART(tm)
weightSMART() is part of the R package tm.
SMART Weightings
Description
Weight a term-document matrix according to a combination of weights specified in SMART notation.
Usage
weightSMART(m, spec = "nnn", control = list())
Arguments
Argument: m
A TermDocumentMatrix in term frequency format.
Argument: spec
A character string consisting of three characters. The first letter specifies a term frequency schema, the second a document frequency schema, and the third a normalization schema. See Details for the available built-in schemata.
Argument: control
A list of control parameters. See Details.
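To make the three-letter structure of spec concrete, the following base R sketch splits a specification such as "ntc" into its three components. The helper name parse_spec is hypothetical and is not part of tm.

```r
# Hypothetical helper (not part of tm): split a SMART spec into its letters.
parse_spec <- function(spec) {
  stopifnot(is.character(spec), length(spec) == 1L, nchar(spec) == 3L)
  parts <- substring(spec, 1:3, 1:3)  # one character per position
  c(term_frequency = parts[1],
    document_frequency = parts[2],
    normalization = parts[3])
}

parse_spec("ntc")
```

Read this way, "ntc" selects natural term frequency, idf document frequency, and cosine normalization.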
Details
Formally, this function is of class WeightingFunction with the additional attributes Name and Acronym.
The first letter of spec specifies a weighting schema for term frequencies of m:
"n" (natural) \mathit{tf}_{i,j} counts the number of occurrences n_{i,j} of a term t_i in a document d_j. The input term-document matrix m is assumed to be in this standard term frequency format already.
"l" (logarithm) is defined as 1 + \log(\mathit{tf}_{i,j}).
"a" (augmented) is defined as 0.5 + \frac{0.5 * \mathit{tf}_{i,j}}{\max_i(\mathit{tf}_{i,j})}.
"b" (boolean) is defined as 1 if \mathit{tf}_{i,j} > 0 and 0 otherwise.
"L" (log average) is defined as \frac{1 + \log(\mathit{tf}_{i,j})}{1 + \log(\mathrm{ave}_{i \in j}(\mathit{tf}_{i,j}))}.
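The term frequency schemata above can be sketched in base R on a small dense matrix. This is an illustration under stated assumptions, not the package's implementation: the helper tf_weight and the toy data are hypothetical, the natural logarithm is used (the package may use a different base), zero counts are left at zero, and the average for "L" is taken over the terms that occur in the document.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1,
               1, 3, 0),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("oil", "price"), c("d1", "d2", "d3")))

# Hypothetical helper (not part of tm): apply one term frequency schema.
tf_weight <- function(tf, schema = c("n", "l", "a", "b", "L")) {
  schema <- match.arg(schema)
  nz <- tf > 0  # entries where the term occurs in the document
  w <- switch(schema,
    n = tf,                                                  # raw counts
    l = 1 + log(tf),                                         # logarithm
    a = 0.5 + 0.5 * sweep(tf, 2, apply(tf, 2, max), "/"),    # augmented
    b = nz * 1,                                              # boolean
    L = sweep(1 + log(tf), 2,                                # log average
              1 + log(colSums(tf) / colSums(nz)), "/"))
  w[!nz] <- 0  # assumption: absent terms keep weight 0
  w
}

tf_weight(tf, "l")
tf_weight(tf, "a")
```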
The second letter of spec specifies a weighting schema of document frequencies for m:
"n" (no) is defined as 1.
"t" (idf) is defined as \log \frac{N}{\mathit{df}_t}, where N is the number of documents and \mathit{df}_t is the number of documents in which term t occurs.
"p" (prob idf) is defined as \max(0, \log(\frac{N - \mathit{df}_t}{\mathit{df}_t})).
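The document frequency schemata can be sketched the same way. The helper df_weight and the toy matrix are hypothetical, and the natural logarithm is an assumption; the package may use a different base.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1, 1,
               1, 3, 0, 0,
               0, 0, 0, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "opec"),
                             c("d1", "d2", "d3", "d4")))

# Hypothetical helper (not part of tm): one weight per term (row).
df_weight <- function(tf, schema = c("n", "t", "p")) {
  schema <- match.arg(schema)
  N  <- ncol(tf)          # number of documents
  df <- rowSums(tf > 0)   # number of documents containing each term
  switch(schema,
    n = rep(1, length(df)),            # no document frequency weighting
    t = log(N / df),                   # idf
    p = pmax(0, log((N - df) / df)))   # prob idf, floored at 0
}

df_weight(tf, "t")
df_weight(tf, "p")
```

In this toy data "opec" occurs in a single document and therefore receives the largest idf weight, while "oil" occurs in three of four documents and receives the smallest.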
The third letter of spec specifies a schema for normalization of m:
"n" (none) is defined as 1.
"c" (cosine) is defined as \sqrt{\mathrm{col\_sums}(m ^ 2)}.
"u" (pivoted unique) is defined as \mathit{slope} * \sqrt{\mathrm{col\_sums}(m ^ 2)} + (1 - \mathit{slope}) * \mathit{pivot}, where both slope and pivot must be set via named tags in the control list.
"b" (byte size) is defined as \frac{1}{\mathit{CharLength}^\alpha}. The parameter α must be set via the named tag alpha in the control list.
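The normalization quantities above can be computed per document (column) as in the following base R sketch. The helper norm_component is hypothetical, and so is the char_length argument, which stands in for the character length of each document; how the package obtains that value is not stated here. The sketch only evaluates the expressions as written and does not claim how the package applies them.

```r
# Toy weighted matrix: rows are terms, columns are documents.
m <- matrix(c(3, 0,
              4, 2),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("oil", "price"), c("d1", "d2")))

# Hypothetical helper (not part of tm): one value per document (column).
norm_component <- function(m, schema = c("n", "c", "u", "b"),
                           control = list(), char_length = NULL) {
  schema <- match.arg(schema)
  len <- sqrt(colSums(m^2))  # Euclidean length of each document vector
  switch(schema,
    n = rep(1, ncol(m)),
    c = len,
    u = {
      stopifnot(!is.null(control$slope), !is.null(control$pivot))
      control$slope * len + (1 - control$slope) * control$pivot
    },
    b = {
      stopifnot(!is.null(control$alpha), !is.null(char_length))
      1 / char_length^control$alpha
    })
}

norm_component(m, "c")
norm_component(m, "u", control = list(slope = 0.5, pivot = 3))
norm_component(m, "b", control = list(alpha = 1), char_length = c(10, 20))
```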
The final result is the product of the chosen term frequency component, the chosen document frequency component, and the chosen normalization component.
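As a worked illustration of combining the three components, the following base R sketch computes an "ntc"-style weighting on a toy matrix. It rests on two assumptions that the text does not confirm: the cosine factor is applied as the reciprocal of the document vector length, which is the conventional reading of cosine normalization, and that length is computed on the already weighted matrix. The natural logarithm and all variable names are likewise choices of this sketch.

```r
# Toy term-document matrix: rows are terms, columns are documents.
tf <- matrix(c(2, 0, 1, 1,
               1, 3, 0, 0,
               0, 0, 0, 4),
             nrow = 3, byrow = TRUE,
             dimnames = list(c("oil", "price", "opec"),
                             c("d1", "d2", "d3", "d4")))

# "n": natural term frequency, the counts are used as they are.
tf_part <- tf

# "t": idf, one value per term.
idf <- log(ncol(tf) / rowSums(tf > 0))

# R recycles idf down each column, so row i is scaled by idf[i].
weighted <- tf_part * idf

# "c": divide each column by its Euclidean length (assumed convention).
ntc <- sweep(weighted, 2, sqrt(colSums(weighted^2)), "/")

round(ntc, 3)
```

Under these assumptions every document column of ntc has unit Euclidean length, so documents of different sizes become directly comparable.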
Value
The weighted matrix.
Author(s)
Ingo Feinerer
References
Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, ISBN 0521865719.
Examples
library("tm")  # provides weightSMART, TermDocumentMatrix and the crude corpus
data("crude")
TermDocumentMatrix(crude,
                   control = list(removePunctuation = TRUE, stopwords = TRUE,
                                  weighting = function(x) weightSMART(x, spec = "ntc")))