NLTK, short for Natural Language Toolkit, is one of the most widely used Python libraries in NLP research. It was developed in Python by Steven Bird and Edward Loper at the University of Pennsylvania and now comprises more than one hundred thousand lines of code. It is an open-source project that ships with datasets, Python modules, and tutorials, and it remains one of the standard foundational libraries for English natural language processing in Python.
1. English Tokenization (sentence splitting / word segmentation)
Text cannot be fed into a model as one long block; we usually split it into units that carry independent meaning, such as characters, words, or phrases. This process is called tokenization, and it is typically the first step in solving an NLP problem. NLTK provides two kinds of tokenization: sentence tokenization, which splits a text into sentences, and word tokenization, which splits a text into words.
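As a quick illustration of the difference between the two, here is a minimal sketch on a made-up sentence (it assumes the punkt tokenizer data has already been fetched with nltk.download('punkt')):

from nltk import sent_tokenize, word_tokenize

sample = "NLTK is a Python library. It helps with NLP tasks."

# Sentence tokenization: split the text into sentences
print(sent_tokenize(sample))
# expected: ['NLTK is a Python library.', 'It helps with NLP tasks.']

# Word tokenization: split the text into word-level tokens
print(word_tokenize(sample))
# expected: ['NLTK', 'is', 'a', 'Python', 'library', '.', 'It', 'helps', 'with', 'NLP', 'tasks', '.']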
import nltk
from nltk import word_tokenize, sent_tokenize
import matplotlib
matplotlib.use('Agg')
# Read in the data
corpus = open('./data/text.txt', 'r').read()
# Check the type
print("Type of corpus:", type(corpus))
Type of corpus: <class 'str'>
# Split the corpus into sentences
sentences = sent_tokenize(corpus)
sentences
["A ``knowledge engineer'' interviews experts in a certain domain and tries to embody their knowledge in a computer program for carrying out some task.",
'How well this works depends on whether the intellectual mechanisms required for the task are within the present state of AI.',
'When this turned out not to be so, there were many disappointing results.',
'One of the first expert systems was MYCIN in 1974, which diagnosed bacterial infections of the blood and suggested treatments.',
'It did better than medical students or practicing doctors, provided its limitations were observed.',
'Namely, its ontology included bacteria, symptoms, and treatments and did not include patients, doctors, hospitals, death, recovery, and events occurring in time.',
'Its interactions depended on a single patient being considered.',
'Since the experts consulted by the knowledge engineers knew about patients, doctors, death, recovery, etc., it is clear that the knowledge engineers forced what the experts told them into a predetermined framework.',
'In the present state of AI, this has to be true.',
'The usefulness of current expert systems depends on their users having common sense.']
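Word tokenization can be applied to these sentences in the same way; a minimal sketch reusing the sentences list produced above:

# Split each sentence into word-level tokens
words = [word_tokenize(sentence) for sentence in sentences]
# Inspect the tokens of the first sentence
print(words[0])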
# A passage of text
text = """The National Wrestling Association was an early professional wrestling sanctioning body created in 1930 by the National Boxing Association (NBA) (now the World Boxing Association, WBA) as an attempt to create a governing body for professional wrestling in the United States. The group created a number of "World" level championships as an attempt to clear up the professional wrestling rankings which at the time saw a number of different championships promoted as the "true world championship". The National Wrestling Association's NWA World Heavyweight Championship was later considered part of the historical lineage of the National Wrestling Alliance's NWA World Heavyweight Championship when then National Wrestling Association champion Lou Thesz won the National Wrestling Alliance championship, folding the original championship into one title in 1949."""
# Split into sentences
tokenized_sentence = nltk.sent_tokenize(text)
# Split into words
tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]
# Part-of-speech tagging
tagged_words = [nltk.pos_tag(word) for word in tokenized_words]
# The chunker definition is not shown in the original; assume a simple NP grammar
# (optional determiner + adjectives + nouns) as a stand-in
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
# Recognize NP chunks
word_tree = [chunker.parse(word) for word in tagged_words]
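To inspect what the chunker produced, the parse trees can be printed or their NP subtrees collected. The following sketch reuses the word_tree list built above; the exact chunks depend on the grammar assumed there:

# Print the chunk tree for the first sentence
print(word_tree[0])

# Collect the text of every NP chunk found in the first sentence
np_chunks = [" ".join(token for token, tag in subtree.leaves())
             for subtree in word_tree[0].subtrees()
             if subtree.label() == 'NP']
print(np_chunks)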
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
# Example sentence for the synset
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
dog.examples()[0]
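The dictionary-style gloss shown just before this block comes from the same synset; a minimal sketch reusing the dog synset defined above (it assumes the WordNet corpus has been downloaded with nltk.download('wordnet')):

# Gloss for the synset (the definition string shown above)
print(dog.definition())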