R language, tm.plugin.factiva package: help documentation for the FactivaSource() function

loveR · posted on 2012-10-1 10:44:47

FactivaSource(tm.plugin.factiva)
FactivaSource() belongs to the R package: tm.plugin.factiva

                                    Factiva Source

                                       Translator: LoveR, the biostatistic.net robot

----------Description----------

Construct a source for an input containing a set of articles exported from Factiva in the XML or HTML format.

----------Usage----------

  FactivaSource(x, encoding = "UTF-8",
            format = c("auto", "XML", "HTML"))

----------Arguments----------

Argument: x
Either a character string identifying the file, or a connection.

Argument: encoding
A character string giving the encoding of x. It is ignored unless the XML or HTML input does not include this information, which should not normally happen with files exported from Factiva.

Argument: format
The format of the file or connection identified by x (see "Details").

----------Details----------

This function can import both XML and HTML files. If format is set to "auto" (the default), the file extension is used to guess the format: if the file name ends with ".xml" or ".XML", XML is assumed; otherwise, the file is assumed to be in the HTML format.
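
The extension-based guess described above can be sketched in base R. This is an illustration only, not the package's actual code: resolve_format() is a hypothetical helper, and using match.arg() to turn the format vector into its first element ("auto") is an assumption based on the common R idiom for such defaults.

```r
# Hypothetical helper (not part of tm.plugin.factiva): mirrors the documented
# rule for format = "auto" using only base R.
resolve_format <- function(path, format = c("auto", "XML", "HTML")) {
  # match.arg() selects "auto" when the caller leaves format at its default
  format <- match.arg(format)
  if (format != "auto") {
    return(format)
  }
  # Documented rule: names ending in ".xml" or ".XML" are XML, anything else HTML
  if (grepl("\\.(xml|XML)$", path)) "XML" else "HTML"
}

print(resolve_format("export.xml"))
print(resolve_format("export.htm"))
print(resolve_format("export.dat", format = "XML"))
```

Under this rule, a file whose name does not end in ".xml" or ".XML" is treated as HTML, so passing format explicitly is the safer choice whenever the extension may be missing or unusual.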

This function imports the body of the articles, and also sets several meta-data variables on individual documents:

DateTimeStamp: The publication date.

Heading: The title of the article.

Origin: The newspaper the article comes from.

Edition: The (local) variant of the newspaper.

Section: The part of the newspaper containing the article.

Subject: One or several keywords defining the subject.

Coverage: One or several keywords identifying the covered regions.

WordCount: The number of words in the article.

Publisher: The publisher of the newspaper.

Rights: The copyright information associated with the article.

Language: This information is set automatically if readerControl = list(language = NA) is passed (see the example below). Otherwise, the manually specified language is set for all articles. If omitted, the default, "en", is used.

It is advisable to export articles from Factiva in the XML format rather than in HTML when possible, since the latter does not provide completely clean information. In particular, dates are not guaranteed to be parsed correctly if the machine from which the HTML file was exported uses a locale different from that of the machine where it is read.
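
The locale problem can be reproduced with base R alone. The sketch below is not taken from the package: parse_in_locale() is a hypothetical helper, and the "%d %B %Y" date layout is an assumption chosen for illustration, not necessarily the layout Factiva uses in its HTML exports. It shows that a date written with a French month name fails to parse when LC_TIME expects English month names.

```r
# Hypothetical helper: parse a date string with LC_TIME temporarily set to a
# given locale, restoring the previous setting on exit.
parse_in_locale <- function(text, locale, format = "%d %B %Y") {
  old <- Sys.getlocale("LC_TIME")
  on.exit(Sys.setlocale("LC_TIME", old), add = TRUE)
  Sys.setlocale("LC_TIME", locale)
  as.Date(text, format = format)
}

# The "C" locale uses English month names
english <- parse_in_locale("1 October 2012", "C")
# The same date written on a French-locale machine
french <- parse_in_locale("1 octobre 2012", "C")

print(english)
print(is.na(french))
```

The second call returns NA rather than raising an error, which is why mismatched locales can silently produce missing publication dates.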

----------Value----------

An object of class XMLSource, which extends the class Source, representing a set of articles from Factiva.

----------Note----------

Some Factiva articles have been found to contain unescaped characters that are not allowed in XML files. If such articles are included in the input you are trying to import, the XML parser fails, printing a few error messages, and the corpus is not created at all.

If you experience this bug, please report it to the Factiva Customer Service, which will fix the offending article; feel free to ask the maintainer of the present package for help if needed. In the meantime, you can exclude the problematic article from the XML file. To identify it, export only one half of the original corpus at a time, as many times as needed, and see which half fails; you will eventually find the culprit. (If you know XML, you can use an XML validator to find the relevant part of the file and fix it by hand.)
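
The halving procedure described above is a bisection search. A minimal base R sketch follows; it is an illustration, not part of the package. find_culprit() and parses_ok() are hypothetical names, and parses_ok() stands in for whatever check you use in practice (for example, re-exporting a subset of articles and trying to import it). The sketch assumes a failure is caused by an individual article, and it finds one offending article per run.

```r
# Hypothetical helper: narrow a failing set of articles down to one culprit
# by repeatedly testing the first half of what remains.
# parses_ok(items) must return TRUE when the given subset imports cleanly.
find_culprit <- function(items, parses_ok) {
  stopifnot(length(items) > 0, !parses_ok(items))
  while (length(items) > 1) {
    half <- seq_len(length(items) %/% 2)
    first <- items[half]
    if (!parses_ok(first)) {
      items <- first          # the problem is in the first half
    } else {
      items <- items[-half]   # otherwise it must be in the second half
    }
  }
  items
}

# Simulated corpus of 10 articles where article 7 breaks the parser
articles <- sprintf("article-%02d", 1:10)
parses_ok <- function(subset) !("article-07" %in% subset)

culprit <- find_culprit(articles, parses_ok)
print(culprit)
```

Each round halves the candidate set, so the number of exports needed grows with the logarithm of the corpus size rather than with the corpus size itself.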

----------Author(s)----------

Milan Bouchet-Valat

----------See Also----------

readFactivaXML and readFactivaHTML, the functions that actually parse individual articles.

getSources, to list the available sources.

----------Examples----------

## Not run:
## Load an XML file
library(tm)
library(tm.plugin.factiva)  # provides FactivaSource()
file <- system.file("factiva_test.xml", package = "tm.plugin.factiva")
source <- FactivaSource(file)
# language = NA lets the reader set each article's language automatically
corpus <- Corpus(source, readerControl = list(language = NA))

# See the contents of the documents
inspect(corpus)

# See meta-data associated with first article
meta(corpus[[1]])

## End(Not run)

## For an HTML file
library(tm)
library(tm.plugin.factiva)  # provides FactivaSource()
file <- system.file("factiva_test.html", package = "tm.plugin.factiva")
source <- FactivaSource(file)
# language = NA lets the reader set each article's language automatically
corpus <- Corpus(source, readerControl = list(language = NA))

# See the contents of the documents
inspect(corpus)

# See meta-data associated with first article
meta(corpus[[1]])

When reposting, please credit the source: biostatistic.net (http://www.biostatistic.net).
