POS Tagging: Tag Type Lookup Table (Penn Treebank Project)
Published: 2019-06-13


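When analyzing English text, we often care about the part of speech of each word and the role it plays in the sentence. The process of identifying the part of speech of each word in a text is called part-of-speech (POS) tagging.

English has eight main parts of speech:

1. Noun

2. Pronoun

3. Verb

4. Adjective

5. Adverb

6. Preposition

7. Conjunction

8. Interjection

Others include numerals and articles.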

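When you use a third-party tool such as NLTK for POS tagging, the results can be more fine-grained than the eight parts of speech above. The tags that NLTK assigns can be looked up in the POS tagset published by the Penn Treebank Project, shown in the figure below:

[Figure: Penn Treebank POS tagset table]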
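To relate the fine-grained Penn Treebank tags back to the coarse parts of speech listed earlier, a rough prefix-based grouping can help. The sketch below is illustrative only: the `COARSE_BY_PREFIX` table and the `coarse_pos` helper are hypothetical names introduced here, and the grouping itself is a simplifying assumption (for example, `IN` also covers subordinating conjunctions).

```python
# Rough, illustrative mapping from Penn Treebank tag prefixes to coarse
# parts of speech. The grouping is an assumption made for this sketch.
COARSE_BY_PREFIX = [
    ("PRP", "pronoun"),      # PRP, PRP$
    ("WP", "pronoun"),       # WP, WP$
    ("NN", "noun"),          # NN, NNS, NNP, NNPS
    ("VB", "verb"),          # VB, VBD, VBG, VBN, VBP, VBZ
    ("MD", "verb"),          # modal auxiliaries
    ("JJ", "adjective"),     # JJ, JJR, JJS
    ("RB", "adverb"),        # RB, RBR, RBS
    ("IN", "preposition"),   # also covers subordinating conjunctions
    ("TO", "preposition"),   # "to", which is also the infinitive marker
    ("CC", "conjunction"),
    ("UH", "interjection"),
    ("CD", "numeral"),
    ("DT", "article"),       # determiners in general
]


def coarse_pos(tag):
    """Return a coarse part of speech for a Penn Treebank tag, or 'other'."""
    for prefix, coarse in COARSE_BY_PREFIX:
        if tag.startswith(prefix):
            return coarse
    return "other"


print(coarse_pos("NNS"))   # plural noun tag
print(coarse_pos("PRP$"))  # possessive pronoun tag
print(coarse_pos(","))     # punctuation falls through to 'other'
```

Several tags map onto the same coarse category, which is one way to see how the Penn Treebank tagset carries more information than the eight traditional parts of speech.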

 

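As an example, we use NLTK to tag a passage of English text.

The passage is taken from a Washington Post report of March 13, 2019 on Canada grounding Boeing 737 aircraft. The original text reads: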

After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals.

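We split the passage into sentences, tokenize each sentence into words, tag each word with its part of speech, and then print the tagging result for each sentence in a loop. The code is as follows: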

import nltk

# Assumes the required NLTK tokenizer and tagger data are already installed (see nltk.download).
passage = """After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals."""

sentences = nltk.sent_tokenize(passage)   # split the passage into sentences
for sent in sentences:
    tokens = nltk.word_tokenize(sent)     # split the sentence into tokens
    pos_tags = nltk.pos_tag(tokens)       # list of (word, tag) tuples
    print(pos_tags)

The print() call produces the following output:

[('After', 'IN'), ('the', 'DT'), ('Lion', 'NNP'), ('Air', 'NNP'), ('crash', 'NN'), (',', ','), ('questions', 'NNS'), ('were', 'VBD'), ('raised', 'VBN'), (',', ','), ('so', 'IN'), ('Boeing', 'NNP'), ('sent', 'VBD'), ('further', 'JJ'), ('instructions', 'NNS'), ('that', 'IN'), ('it', 'PRP'), ('said', 'VBD'), ('pilots', 'NNS'), ('should', 'MD'), ('know', 'VB'), (',', ','), ('”', 'FW'), ('he', 'PRP'), ('said', 'VBD'), (',', ','), ('according', 'VBG'), ('to', 'TO'), ('the', 'DT'), ('Associated', 'NNP'), ('Press', 'NNP'), ('.', '.')]
[('“Those', 'JJ'), ('relate', 'NN'), ('to', 'TO'), ('the', 'DT'), ('specific', 'JJ'), ('behavior', 'NN'), ('of', 'IN'), ('this', 'DT'), ('specific', 'JJ'), ('type', 'NN'), ('of', 'IN'), ('aircraft', 'NN'), ('.', '.')]
[('As', 'IN'), ('a', 'DT'), ('result', 'NN'), (',', ','), ('training', 'NN'), ('was', 'VBD'), ('given', 'VBN'), ('by', 'IN'), ('Boeing', 'NNP'), (',', ','), ('and', 'CC'), ('our', 'PRP$'), ('pilots', 'NNS'), ('have', 'VBP'), ('taken', 'VBN'), ('it', 'PRP'), ('and', 'CC'), ('put', 'VB'), ('it', 'PRP'), ('into', 'IN'), ('our', 'PRP$'), ('manuals', 'NNS'), ('.', '.')]

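How to read the output above: each sentence in the passage is a list, and each word with its part of speech is a tuple, with the word itself on the left and the POS abbreviation on the right. The meaning of each abbreviation can be looked up in the Penn Treebank POS tags table.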
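For convenience, the lookup can also be done in code. The following is a minimal sketch that covers only the tags appearing in the output above, with descriptions taken from the Penn Treebank tagset; the `PENN_TAGS` dictionary and the `describe` helper are hypothetical names introduced here, not part of NLTK.

```python
# Descriptions for the Penn Treebank tags that appear in the output above.
# This is a partial table; the full tagset contains more tags.
PENN_TAGS = {
    "CC": "coordinating conjunction",
    "DT": "determiner",
    "FW": "foreign word",
    "IN": "preposition or subordinating conjunction",
    "JJ": "adjective",
    "MD": "modal",
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "TO": "to",
    "VB": "verb, base form",
    "VBD": "verb, past tense",
    "VBG": "verb, gerund or present participle",
    "VBN": "verb, past participle",
    "VBP": "verb, non-3rd person singular present",
}


def describe(tagged):
    """Attach a readable description to each (word, tag) tuple."""
    fallback = "punctuation or unlisted tag"
    return [(word, tag, PENN_TAGS.get(tag, fallback)) for word, tag in tagged]


# A few tuples copied from the tagger output shown above.
sample = [("our", "PRP$"), ("pilots", "NNS"), ("have", "VBP"),
          ("taken", "VBN"), ("it", "PRP"), (".", ".")]
for word, tag, meaning in describe(sample):
    print(f"{word:8}{tag:6}{meaning}")
```

Punctuation tags such as `,` and `.` are simply the punctuation mark itself, so the sketch groups them under a fallback description instead of listing each one.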

We modify the code slightly so that the result is easier to read:

import nltk

# Assumes the required NLTK tokenizer and tagger data are already installed (see nltk.download).
passage = """After the Lion Air crash, questions were raised, so Boeing sent further instructions that it said pilots should know,” he said, according to the Associated Press. “Those relate to the specific behavior of this specific type of aircraft. As a result, training was given by Boeing, and our pilots have taken it and put it into our manuals."""

sentences = nltk.sent_tokenize(passage)   # split the passage into sentences
for sent in sentences:
    tokens = nltk.word_tokenize(sent)     # split the sentence into tokens
    pos_tags = nltk.pos_tag(tokens)       # list of (word, tag) tuples
    for word, tag in pos_tags:
        # Print each word followed by its tag in parentheses, all on one line
        print("{}({}) ".format(word, tag), end="")

The output is as follows (each tag appears in parentheses immediately after its word):

After(IN) the(DT) Lion(NNP) Air(NNP) crash(NN) ,(,) questions(NNS) were(VBD) raised(VBN) ,(,) so(IN) Boeing(NNP) sent(VBD) further(JJ) instructions(NNS) that(IN) it(PRP) said(VBD) pilots(NNS) should(MD) know(VB) ,(,) ”(FW) he(PRP) said(VBD) ,(,) according(VBG) to(TO) the(DT) Associated(NNP) Press(NNP) .(.) “Those(JJ) relate(NN) to(TO) the(DT) specific(JJ) behavior(NN) of(IN) this(DT) specific(JJ) type(NN) of(IN) aircraft(NN) .(.) As(IN) a(DT) result(NN) ,(,) training(NN) was(VBD) given(VBN) by(IN) Boeing(NNP) ,(,) and(CC) our(PRP$) pilots(NNS) have(VBP) taken(VBN) it(PRP) and(CC) put(VB) it(PRP) into(IN) our(PRP$) manuals(NNS) .(.)
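One detail worth noticing in the output: the opening curly quote stays attached to its word, giving the token “Those tagged as JJ, followed by relate tagged as NN, and the closing curly quote ” is tagged as FW. These look like tagging errors rather than the intended analysis. A possible workaround, which is not part of the original post, is to replace typographic quotes with ASCII quotes before tokenizing. The sketch below shows only that preprocessing step; `QUOTE_MAP` and `normalize_quotes` are hypothetical names, and how much the tagging improves afterwards would need to be checked by rerunning the tagger.

```python
# Map typographic (curly) quotes to their ASCII equivalents.
QUOTE_MAP = str.maketrans({
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
})


def normalize_quotes(text):
    """Return text with curly quotes replaced by ASCII quotes."""
    return text.translate(QUOTE_MAP)


# The second sentence of the passage, which starts with a curly quote.
sentence = "\u201cThose relate to the specific behavior of this specific type of aircraft."
print(normalize_quotes(sentence))
```

The normalized string can then be passed to nltk.sent_tokenize() in place of the raw passage.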

 

References:

1. https://en.wikipedia.org/wiki/Part_of_speech

2. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

3. https://www.washingtonpost.com/local/trafficandcommuting/canada-grounds-boeing-737-max-8-leaving-us-as-last-major-user-of-plane/2019/03/13/25ac2414-459d-11e9-90f0-0ccfeec87a61_story.html?utm_term=.f359a714d4d8

 

Reposted from: https://www.cnblogs.com/creatures-of-habit/p/10520079.html
