Python 3 Natural Language Processing (NLTK): Language Big Data
Published: 2019-06-06


NLTK

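NLTK is a Python library for processing text. Written text carries an enormous volume of data, which is why this post belongs to the big-data series.

This post only scratches the surface. I do not work in this area yet; I wrote it out of occasional curiosity.

Loading everything from NLTK's book module

  • The book module contains all the sample data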
from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
text1
text2

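Searching a text

  1. concordance finds every occurrence of a word in a text and prints each one with its surrounding context.
  2. similar finds words that appear in contexts similar to those of the search word; this could be used for relevance matching in a search engine.
  3. common_contexts finds the contexts that two or more words share.
  4. dispersion_plot draws a dispersion plot showing where words occur in the text.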
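text1.concordance('monstrous')  # look up the word 'monstrous' in text1
# concordance: UK [kən'kɔːd(ə)ns]  US [kən'kɔrdns]
# n. harmony, agreement; index of words; an index of the important words in a work or an author's collected works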
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us ,
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal
they might scout at Moby Dick as a monstrous fable , or still worse and more de
th of Radney .'" CHAPTER 55 Of the Monstrous Pictures of Whales . I shall ere l
ing Scenes . In connexion with the monstrous pictures of whales , I am strongly
ere to enter upon those still more monstrous stories of them which are to be fo
ght have been rummaged out of this monstrous cabinet there is no telling . But
of Whale - Bones ; for Whales of a monstrous size are oftentimes cast up dead u
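To make the idea behind concordance concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation; the function name concordance_lines, the width parameter and the sample sentence are all made up for illustration. It matches the word case-insensitively (as the output above does with "Monstrous") and cuts the context on each side to a fixed number of characters.

```python
def concordance_lines(tokens, word, width=30):
    """Return one context line per occurrence of `word` (case-insensitive)."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            # Keep at most `width` characters of context on each side.
            left = ' '.join(tokens[:i])[-width:]
            right = ' '.join(tokens[i + 1:])[:width]
            lines.append('{:>{w}} {} {:<{w}}'.format(left, token, right, w=width))
    return lines

# Hypothetical sample text, already split into tokens.
sample = 'the monstrous whale and a most Monstrous size of the whale'.split()
for line in concordance_lines(sample, 'monstrous', width=12):
    print(line)
```

Right-aligning the left context and left-aligning the right context is what keeps the search word in the same column on every line.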
text2.concordance('affection')
Displaying 25 of 79 matches:
, however , and , as a mark of his affection for the three girls , he left them
t . It was very well known that no affection was ever supposed to exist between
deration of politeness or maternal affection on the side of the former , the tw
d the suspicion -- the hope of his affection for me may warrant , without impru
hich forbade the indulgence of his affection . She knew that his mother neither
rd she gave one with still greater affection . Though her late conversation wit
 can never hope to feel or inspire affection again , and if her home be uncomfo
m of the sense , elegance , mutual affection , and domestic comfort of the fami
, and which recommended him to her affection beyond every thing else . His soci
ween the parties might forward the affection of Mr . Willoughby , an equally st
 the most pointed assurance of her affection . Elinor could not be surprised at
he natural consequence of a strong affection in a young and ardent mind . This
 opinion . But by an appeal to her affection for her mother , by representing t
 every alteration of a place which affection had established as perfect with hi
e will always have one claim of my affection , which no other can possibly shar
f the evening declared at once his affection and happiness . " Shall we see you
ause he took leave of us with less affection than his usual behaviour has shewn
ness ." " I want no proof of their affection ," said Elinor ; " but of their en
onths , without telling her of his affection ;-- that they should part without
ould be the natural result of your affection for her . She used to be all unres
distinguished Elinor by no mark of affection . Marianne saw and listened with i
th no inclination for expense , no affection for strangers , no profession , an
till distinguished her by the same affection which once she had felt no doubt o
al of her confidence in Edward ' s affection , to the remembrance of every mark
 was made ? Had he never owned his affection to yourself ?" " Oh , no ; but if
text1.similar('monstrous')
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless
text2.similar('monstrous')
very so exceedingly heartily a as good great extremely remarkably
sweet vast amazingly
text2.common_contexts(['monstrous','very'])
a_pretty am_glad a_lucky is_pretty be_glad
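The a_pretty style output above can be read as "previous word, gap, next word". As a rough sketch of that idea (plain Python, not how NLTK computes it internally; contexts_by_word, shared_contexts and the sample sentence are hypothetical names and data), one can collect the (previous, next) pair around every token and intersect the sets for two words:

```python
from collections import defaultdict

def contexts_by_word(tokens):
    """Map each lowercased word to the set of (previous, next) pairs around it."""
    lowered = [t.lower() for t in tokens]
    contexts = defaultdict(set)
    for prev, word, nxt in zip(lowered, lowered[1:], lowered[2:]):
        contexts[word].add((prev, nxt))
    return contexts

def shared_contexts(tokens, word_a, word_b):
    """Return the sorted contexts in which both words occur."""
    ctx = contexts_by_word(tokens)
    return sorted(ctx[word_a.lower()] & ctx[word_b.lower()])

# Hypothetical sample, loosely modelled on the output above.
sample = ('a monstrous pretty girl and a very pretty house , '
          'am very glad , am monstrous glad').split()
print(shared_contexts(sample, 'monstrous', 'very'))
```

Words that share many such contexts are the ones a method like similar would be expected to rank highly.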
# Show where each word occurs in the text, measured as the word offset from the beginning.
# Each stripe represents an instance of a word,
# and each row represents the entire text.
text4.dispersion_plot(['citizens','democracy','freedom','duties','America','liberty'])
# dispersion: UK [dɪ'spɜːʃ(ə)n]  US [dɪ'spɝʒn]
# n. scattering; (statistics/mathematics) deviation; dispelling

[Figure: lexical dispersion plot of the selected words in text4 (Inaugural Address Corpus)]
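The data behind a dispersion plot is simply the list of token positions at which each word occurs. A small sketch of that step, without any plotting (word_offsets and the sample sentence are hypothetical; the match here is case-sensitive):

```python
def word_offsets(tokens, words):
    """Map each target word to the token positions where it occurs."""
    targets = set(words)
    offsets = {w: [] for w in words}
    for position, token in enumerate(tokens):
        if token in targets:
            offsets[token].append(position)
    return offsets

# Hypothetical sample text, already split into tokens.
sample = 'we the citizens love freedom and freedom needs citizens'.split()
print(word_offsets(sample, ['citizens', 'freedom', 'duties']))
```

A word that never occurs in the text, for example because it is misspelled, ends up with an empty list and therefore an empty row in the plot.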

print(text3.generate('monstrous'))  # prints None here: in the NLTK release used for this post, generate() apparently produced no text
None

Counting vocabulary

len(text3)
44764
sorted(set(text3))
['!', "'", '(', ')', ',', ',)', '.', '.)', ':', ';', ';)', '?', '?)', 'A', 'Abel', 'Abelmizraim', 'Abidah', 'Abide', 'Abimael', 'Abimelech', 'Abr', 'Abrah', 'Abraham', 'Abram', 'Accad', 'Achbor', 'Adah', 'Adam', 'Adbeel', 'Admah', 'Adullamite', 'After', 'Aholibamah', 'Ahuzzath', 'Ajah', 'Akan', 'All', 'Allonbachuth', 'Almighty', 'Almodad', 'Also', 'Alvah', 'Alvan', 'Am', 'Amal', 'Amalek', 'Amalekites', 'Ammon', 'Amorite', 'Amorites', 'Amraphel', 'An', 'Anah', 'Anamim', 'And', 'Aner', 'Angel', 'Appoint', 'Aram', 'Aran', 'Ararat', 'Arbah', 'Ard', 'Are', 'Areli', 'Arioch', 'Arise', 'Arkite', 'Arodi', 'Arphaxad', 'Art', 'Arvadite', 'As', 'Asenath', 'Ashbel', 'Asher', 'Ashkenaz', 'Ashteroth', 'Ask', 'Asshur', 'Asshurim', 'Assyr', 'Assyria', 'At', 'Atad', 'Avith', 'Baalhanan', 'Babel', 'Bashemath', 'Be', 'Because', 'Becher', 'Bedad', 'Beeri', 'Beerlahairoi', 'Beersheba', 'Behold', 'Bela', 'Belah', 'Benam', 'Benjamin', 'Beno', 'Beor', 'Bera', 'Bered', 'Beriah', 'Bethel', 'Bethlehem', 'Bethuel', 'Beware', 'Bilhah', 'Bilhan', 'Binding', 'Birsha', 'Bless', 'Blessed', 'Both', 'Bow', 'Bozrah', 'Bring', 'But', 'Buz', 'By', 'Cain', 'Cainan', 'Calah', 'Calneh', 'Can', 'Cana', 'Canaan', 'Canaanite', 'Canaanites', 'Canaanitish', 'Caphtorim', 'Carmi', 'Casluhim', 'Cast', 'Cause', 'Chaldees', 'Chedorlaomer', 'Cheran', 'Cherubims', 'Chesed', 'Chezib', 'Come', 'Cursed', 'Cush', 'Damascus', 'Dan', 'Day', 'Deborah', 'Dedan', 'Deliver', 'Diklah', 'Din', 'Dinah', 'Dinhabah', 'Discern', 'Dishan', 'Dishon', 'Do', 'Dodanim', 'Dothan', 'Drink', 'Duke', 'Dumah', 'Earth', 'Ebal', 'Eber', 'Edar', 'Eden', 'Edom', 'Edomites', 'Egy', 'Egypt', 'Egyptia', 'Egyptian', 'Egyptians', 'Ehi', 'Elah', 'Elam', 'Elbethel', 'Eldaah', 'EleloheIsrael', 'Eliezer', 'Eliphaz', 'Elishah', 'Ellasar', 'Elon', 'Elparan', 'Emins', 'En', 'Enmishpat', 'Eno', 'Enoch', 'Enos', 'Ephah', 'Epher', 'Ephra', 'Ephraim', 'Ephrath', 'Ephron', 'Er', 'Erech', 'Eri', 'Es', 'Esau', 'Escape', 'Esek', 'Eshban', 'Eshcol', 'Ethiopia', 
'Euphrat', 'Euphrates', 'Eve', 'Even', 'Every', 'Except', 'Ezbon', 'Ezer', 'Fear', 'Feed', 'Fifteen', 'Fill', 'For', 'Forasmuch', 'Forgive', 'From', 'Fulfil', 'G', 'Gad', 'Gaham', 'Galeed', 'Gatam', 'Gather', 'Gaza', 'Gentiles', 'Gera', 'Gerar', 'Gershon', 'Get', 'Gether', 'Gihon', 'Gilead', 'Girgashites', 'Girgasite', 'Give', 'Go', 'God', 'Gomer', 'Gomorrah', 'Goshen', 'Guni', 'Hadad', 'Hadar', 'Hadoram', 'Hagar', 'Haggi', 'Hai', 'Ham', 'Hamathite', 'Hamor', 'Hamul', 'Hanoch', 'Happy', 'Haran', 'Hast', 'Haste', 'Have', 'Havilah', 'Hazarmaveth', 'Hazezontamar', 'Hazo', 'He', 'Hear', 'Heaven', 'Heber', 'Hebrew', 'Hebrews', 'Hebron', 'Hemam', 'Hemdan', 'Here', 'Hereby', 'Heth', 'Hezron', 'Hiddekel', 'Hinder', 'Hirah', 'His', 'Hitti', 'Hittite', 'Hittites', 'Hivite', 'Hobah', 'Hori', 'Horite', 'Horites', 'How', 'Hul', 'Huppim', 'Husham', 'Hushim', 'Huz', 'I', 'If', 'In', 'Irad', 'Iram', 'Is', 'Isa', 'Isaac', 'Iscah', 'Ishbak', 'Ishmael', 'Ishmeelites', 'Ishuah', 'Isra', 'Israel', 'Issachar', 'Isui', 'It', 'Ithran', 'Jaalam', 'Jabal', 'Jabbok', 'Jac', 'Jachin', 'Jacob', 'Jahleel', 'Jahzeel', 'Jamin', 'Japhe', 'Japheth', 'Jared', 'Javan', 'Jebusite', 'Jebusites', 'Jegarsahadutha', 'Jehovahjireh', 'Jemuel', 'Jerah', 'Jetheth', 'Jetur', 'Jeush', 'Jezer', 'Jidlaph', 'Jimnah', 'Job', 'Jobab', 'Jokshan', 'Joktan', 'Jordan', 'Joseph', 'Jubal', 'Judah', 'Judge', 'Judith', 'Kadesh', 'Kadmonites', 'Karnaim', 'Kedar', 'Kedemah', 'Kemuel', 'Kenaz', 'Kenites', 'Kenizzites', 'Keturah', 'Kiriathaim', 'Kirjatharba', 'Kittim', 'Know', 'Kohath', 'Kor', 'Korah', 'LO', 'LORD', 'Laban', 'Lahairoi', 'Lamech', 'Lasha', 'Lay', 'Leah', 'Lehabim', 'Lest', 'Let', 'Letushim', 'Leummim', 'Levi', 'Lie', 'Lift', 'Lo', 'Look', 'Lot', 'Lotan', 'Lud', 'Ludim', 'Luz', 'Maachah', 'Machir', 'Machpelah', 'Madai', 'Magdiel', 'Magog', 'Mahalaleel', 'Mahalath', 'Mahanaim', 'Make', 'Malchiel', 'Male', 'Mam', 'Mamre', 'Man', 'Manahath', 'Manass', 'Manasseh', 'Mash', 'Masrekah', 'Massa', 'Matred', 'Me', 'Medan', 
'Mehetabel', 'Mehujael', 'Melchizedek', 'Merari', 'Mesha', 'Meshech', 'Mesopotamia', 'Methusa', 'Methusael', 'Methuselah', 'Mezahab', 'Mibsam', 'Mibzar', 'Midian', 'Midianites', 'Milcah', 'Mishma', 'Mizpah', 'Mizraim', 'Mizz', 'Moab', 'Moabites', 'Moreh', 'Moreover', 'Moriah', 'Muppim', 'My', 'Naamah', 'Naaman', 'Nahath', 'Nahor', 'Naphish', 'Naphtali', 'Naphtuhim', 'Nay', 'Nebajoth', 'Neither', 'Night', 'Nimrod', 'Nineveh', 'Noah', 'Nod', 'Not', 'Now', 'O', 'Obal', 'Of', 'Oh', 'Ohad', 'Omar', 'On', 'Onam', 'Onan', 'Only', 'Ophir', 'Our', 'Out', 'Padan', 'Padanaram', 'Paran', 'Pass', 'Pathrusim', 'Pau', 'Peace', 'Peleg', 'Peniel', 'Penuel', 'Peradventure', 'Perizzit', 'Perizzite', 'Perizzites', 'Phallu', 'Phara', 'Pharaoh', 'Pharez', 'Phichol', 'Philistim', 'Philistines', 'Phut', 'Phuvah', 'Pildash', 'Pinon', 'Pison', 'Potiphar', 'Potipherah', 'Put', 'Raamah', 'Rachel', 'Rameses', 'Rebek', 'Rebekah', 'Rehoboth', 'Remain', 'Rephaims', 'Resen', 'Return', 'Reu', 'Reub', 'Reuben', 'Reuel', 'Reumah', 'Riphath', 'Rosh', 'Sabtah', 'Sabtech', 'Said', 'Salah', 'Salem', 'Samlah', 'Sarah', 'Sarai', 'Saul', 'Save', 'Say', 'Se', 'Seba', 'See', 'Seeing', 'Seir', 'Sell', 'Send', 'Sephar', 'Serah', 'Sered', 'Serug', 'Set', 'Seth', 'Shalem', 'Shall', 'Shalt', 'Shammah', 'Shaul', 'Shaveh', 'She', 'Sheba', 'Shebah', 'Shechem', 'Shed', 'Shel', 'Shelah', 'Sheleph', 'Shem', 'Shemeber', 'Shepho', 'Shillem', 'Shiloh', 'Shimron', 'Shinab', 'Shinar', 'Shobal', 'Should', 'Shuah', 'Shuni', 'Shur', 'Sichem', 'Siddim', 'Sidon', 'Simeon', 'Sinite', 'Sitnah', 'Slay', 'So', 'Sod', 'Sodom', 'Sojourn', 'Some', 'Spake', 'Speak', 'Spirit', 'Stand', 'Succoth', 'Surely', 'Swear', 'Syrian', 'Take', 'Tamar', 'Tarshish', 'Tebah', 'Tell', 'Tema', 'Teman', 'Temani', 'Terah', 'Thahash', 'That', 'The', 'Then', 'There', 'Therefore', 'These', 'They', 'Thirty', 'This', 'Thorns', 'Thou', 'Thus', 'Thy', 'Tidal', 'Timna', 'Timnah', 'Timnath', 'Tiras', 'To', 'Togarmah', 'Tola', 'Tubal', 'Tubalcain', 'Twelve', 'Two', 
'Unstable', 'Until', 'Unto', 'Up', 'Upon', 'Ur', 'Uz', 'Uzal', 'We', 'What', 'When', 'Whence', 'Where', 'Whereas', 'Wherefore', 'Which', 'While', 'Who', 'Whose', 'Whoso', 'Why', 'Wilt', 'With', 'Woman', 'Ye', 'Yea', 'Yet', 'Zaavan', 'Zaphnathpaaneah', 'Zar', 'Zarah', 'Zeboiim', 'Zeboim', 'Zebul', 'Zebulun', 'Zemarite', 'Zepho', 'Zerah', 'Zibeon', 'Zidon', 'Zillah', 'Zilpah', 'Zimran', 'Ziphion', 'Zo', 'Zoar', 'Zohar', 'Zuzims', 'a', 'abated', 'abide', 'able', 'abode', 'abomination', 'about', 'above', 'abroad', 'absent', 'abundantly', 'accept', 'accepted', 'according', 'acknowledged', 'activity', 'add', 'adder', 'afar', 'afflict', 'affliction', 'afraid', 'after', 'afterward', 'afterwards', 'aga', 'again', 'against', 'age', 'aileth', 'air', 'al', 'alive', 'all', 'almon', 'alo', 'alone', 'aloud', 'also', 'altar', 'altogether', 'always', 'am', 'among', 'amongst', 'an', 'and', 'angel', 'angels', 'anger', 'angry', 'anguish', 'anointedst', 'anoth', 'another', 'answer', 'answered', 'any', 'anything', 'appe', 'appear', 'appeared', 'appease', 'appoint', 'appointed', 'aprons', 'archer', 'archers', 'are', 'arise', 'ark', 'armed', 'arms', 'army', 'arose', 'arrayed', 'art', 'artificer', 'as', 'ascending', 'ash', 'ashamed', 'ask', 'asked', 'asketh', 'ass', 'assembly', 'asses', 'assigned', 'asswaged', 'at', 'attained', 'audience', 'avenged', 'aw', 'awaked', 'away', 'awoke', 'back', 'backward', 'bad', 'bade', 'badest', 'badne', 'bak', 'bake', 'bakemeats', 'baker', 'bakers', 'balm', 'bands', 'bank', 'bare', 'barr', 'barren', 'basket', 'baskets', 'battle', 'bdellium', 'be', 'bear', 'beari', 'bearing', 'beast', 'beasts', 'beautiful', 'became', 'because', 'become', 'bed', 'been', 'befall', 'befell', 'before', 'began', 'begat', 'beget', 'begettest', 'begin', 'beginning', 'begotten', 'beguiled', 'beheld', 'behind', 'behold', 'being', 'believed', 'belly', 'belong', 'beneath', 'bereaved', 'beside', 'besides', 'besought', 'best', 'betimes', 'better', 'between', 'betwixt', 'beyond', 
'binding', 'bird', 'birds', 'birthday', 'birthright', 'biteth', 'bitter', 'blame', 'blameless', 'blasted', 'bless', 'blessed', 'blesseth', 'blessi', 'blessing', 'blessings', 'blindness', 'blood', 'blossoms', 'bodies', 'boldly', 'bondman', 'bondmen', 'bondwoman', 'bone', 'bones', 'book', 'booths', 'border', 'borders', 'born', 'bosom', 'both', 'bottle', 'bou', 'boug', 'bough', 'bought', 'bound', 'bow', 'bowed', 'bowels', 'bowing', 'boys', 'bracelets', 'branches', 'brass', 'bre', 'breach', 'bread', 'breadth', 'break', 'breaketh', 'breaking', 'breasts', 'breath', 'breathed', 'breed', 'brethren', 'brick', 'brimstone', 'bring', 'brink', 'broken', 'brook', 'broth', 'brother', 'brought', 'brown', 'bruise', 'budded', 'build', 'builded', 'built', 'bulls', 'bundle', 'bundles', 'burdens', 'buried', 'burn', 'burning', 'burnt', 'bury', 'buryingplace', 'business', 'but', 'butler', 'butlers', 'butlership', 'butter', 'buy', 'by', 'cakes', 'calf', 'call', 'called', 'came', 'camel', 'camels', 'camest', 'can', 'cannot', 'canst', 'captain', 'captive', 'captives', 'carcases', 'carried', 'carry', 'cast', 'castles', 'catt', 'cattle', 'caught', 'cause', 'caused', 'cave', 'cease', 'ceased', 'certain', 'certainly', 'chain', 'chamber', 'change', 'changed', 'changes', 'charge', 'charged', 'chariot', 'chariots', 'chesnut', 'chi', 'chief', 'child', 'childless', 'childr', 'children', 'chode', 'choice', 'chose', 'circumcis', 'circumcise', 'circumcised', 'citi', 'cities', 'city', 'clave', 'clean', 'clear', 'cleave', 'clo', 'closed', 'clothed', 'clothes', 'cloud', 'clusters', 'co', 'coat', 'coats', 'coffin', 'cold', ...]
len(set(text3))
2789
len(text3)/len(set(text3))
16.050197203298673
text3.count('smote')
5
100*text4.count('a')/len(text4)
1.4643016433938312
def lexical_diversity(text):
    # lexical: UK ['leksɪk(ə)l]  US ['lɛksɪkl]
    # adj. relating to vocabulary; (linguistics) relating to dictionaries or lexicography
    # diversity: UK [daɪ'vɜːsɪtɪ; dɪ-]  US [dɪˈvəsɪti]
    # n. variety; difference
    return len(text)/len(set(text))

def percentage(count, total):
    return 100*count/total

print('Lexical diversity of text3: {}'.format(lexical_diversity(text3)))
print('Percentage of text4 tokens that are the word a: {}'.format(percentage(text4.count('a'),len(text4))))
Lexical diversity of text3: 16.050197203298673
Percentage of text4 tokens that are the word a: 1.4643016433938312
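The two ratios above work on any list of tokens, not only on the NLTK texts. The sketch below repeats them on a tiny hand-made list so the arithmetic can be checked by hand. The name tokens_per_type is my own; it computes the same ratio as lexical_diversity above, that is, the average number of times each distinct token is used. The empty-text guard is an addition.

```python
def tokens_per_type(tokens):
    """Average number of uses per distinct token: len(tokens) / len(set(tokens))."""
    if not tokens:
        raise ValueError('cannot compute a ratio for an empty text')
    return len(tokens) / len(set(tokens))

def percentage(count, total):
    """Share of `count` in `total`, expressed in percent."""
    return 100 * count / total

# Six tokens, four distinct ones: 'to', 'be', 'or', 'not'.
sample = 'to be or not to be'.split()
print(tokens_per_type(sample))
print(percentage(sample.count('be'), len(sample)))
```

With six tokens and four distinct words, each word is used 6 / 4 times on average, and 'be' accounts for 2 of the 6 tokens.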

Lists

sent1 = ['Call', 'me','Ishmael','.']
print('Contents of sent1: {}'.format(sent1))
print('Length of sent1: {}'.format(len(sent1)))
print('Lexical diversity of sent1: {}'.format(lexical_diversity(sent1)))
Contents of sent1: ['Call', 'me', 'Ishmael', '.']
Length of sent1: 4
Lexical diversity of sent1: 1.0
sent1,sent2,sent3,sent4 # these lists are predefined by nltk.book
(['Call', 'me', 'Ishmael', '.'], ['The',  'family',  'of',  'Dashwood',  'had',  'long',  'been',  'settled',  'in',  'Sussex',  '.'], ['In',  'the',  'beginning',  'God',  'created',  'the',  'heaven',  'and',  'the',  'earth',  '.'], ['Fellow',  '-',  'Citizens',  'of',  'the',  'Senate',  'and',  'of',  'the',  'House',  'of',  'Representatives',  ':'])
sent4+sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
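sent1.append('Some')  # append() modifies sent1 in place and returns None
sent1  # the list below suggests the append had been run four times by this point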
['Call', 'me', 'Ishmael', '.', 'Some', 'Some', 'Some', 'Some']
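The difference between + and append() is easy to miss in an interactive session, so here is a short self-contained illustration with made-up variable names. Concatenation builds a new list and leaves its operands alone, while append() changes the list in place and returns None, which is why re-running an append grows the list each time.

```python
first = ['Call', 'me', 'Ishmael', '.']
second = ['Some', 'years', 'ago']

combined = first + second        # new list; first and second are unchanged
result = first.append('Some')    # mutates first in place and returns None

print(combined)
print(result)
print(first)
```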

List indexing

type(text4)
nltk.text.Text
text4[173]
'awaken'
text4.index('awaken')
173
text5[16715:16735]
['U86', 'thats', 'why', 'something', 'like', 'gamefly', 'is', 'so', 'good', 'because', 'you', 'can', 'actually', 'play', 'a', 'full', 'game', 'without', 'buying', 'it']
text6[1600:1625]
['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist', 'commune', '.', 'We', 'take', 'it', 'in', 'turns', 'to', 'act', 'as', 'a', 'sort', 'of', 'executive', 'officer', 'for', 'the', 'week']
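Two details of the calls above are worth spelling out on a plain list: index() returns the position of the first match only, and a slice [start:end] stops just before end, so it contains end - start items. The sketch below uses a shortened copy of the text6 excerpt; the variable names are illustrative.

```python
tokens = ['We', "'", 're', 'an', 'anarcho', '-', 'syndicalist',
          'commune', '.', 'We', 'take', 'it']

first_we = tokens.index('We')                 # position of the first 'We'
second_we = tokens.index('We', first_we + 1)  # search again, starting after it
window = tokens[3:8]                          # items 3, 4, 5, 6 and 7

print(first_we, second_we)
print(window)
```

The optional second argument of list.index() is a start position, which is one simple way to find later occurrences.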

Variables

sent1 = ['Call','me','Ishmael','.']
my_sent = ['Bravely','bold','Sir','Robin',',','rode','forth','from','Camelot','.']
noun_phrase = my_sent[1:4]
print('Sliced list: noun_phrase -> {}'.format(noun_phrase))
wOrDs = sorted(noun_phrase)
print('Sorted list: wOrDs -> {}'.format(wOrDs))
Sliced list: noun_phrase -> ['bold', 'Sir', 'Robin']
Sorted list: wOrDs -> ['Robin', 'Sir', 'bold']

Strings

name = 'bright'
print('First letter of name: {}'.format(name[0]))
print(name[:4])
print(name*2)
print(name + '!')
First letter of name: b
brig
brightbright
bright!
' '.join(['Monty', 'Python'])
'Monty Python'
'Monty Python'.split()
['Monty', 'Python']
saying = ['After','all','is','said','and','done','more','is','said','than','done']
tokens = set(saying)
tokens = sorted(tokens)
tokens[-2:]
['said', 'than']
fdist1 = FreqDist(text1)
vocabulary1 = fdist1.keys()
type(vocabulary1)
dict_keys
fdist1.plot(50, cumulative=True)
# Cumulative frequency plot for the 50 most frequently used words in Moby Dick,
# which account for nearly half of the tokens.

[Figure: cumulative frequency plot for the 50 most frequent words in Moby Dick]
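In NLTK 3, FreqDist is built on Python's collections.Counter, so the numbers behind a cumulative plot can be sketched with the standard library alone. The example below is an illustration on a made-up sentence, not a reproduction of fdist1.plot: it walks through the most frequent words and records what share of all tokens has been covered so far.

```python
from collections import Counter

# Hypothetical sample: eight tokens in total.
tokens = 'the whale and the sea and the ship'.split()
counts = Counter(tokens)

running = 0
cumulative = []
for word, count in counts.most_common(2):
    running += count
    # Share of all tokens covered by the words seen so far.
    cumulative.append((word, running / len(tokens)))

print(cumulative)
```

Here 'the' (3 of 8 tokens) and 'and' (2 of 8) together already cover 5 of the 8 tokens, the same kind of statement the comment above makes about the 50 most frequent words in Moby Dick.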

fdist1.hapaxes() #the words that occur once only
['Herman', 'Melville', ']', 'ETYMOLOGY', 'Late', 'Consumptive', 'School', 'threadbare', 'lexicons', 'mockingly', 'flags', 'mortality', 'signification', 'HACKLUYT', 'Sw', 'HVAL', 'roundness', 'Dut', 'Ger', 'WALLEN', 'WALW', 'IAN', 'RICHARDSON', 'KETOS', 'GREEK', 'CETUS', 'LATIN', 'WHOEL', 'ANGLO', 'SAXON', 'WAL', 'HWAL', 'SWEDISH', 'ICELANDIC', 'BALEINE', 'BALLENA', 'FEGEE', 'ERROMANGOAN', 'Librarian', 'painstaking', 'burrower', 'grub', 'Vaticans', 'stalls', 'higgledy', 'piggledy', 'gospel', 'promiscuously', 'commentator', 'belongest', 'sallow', 'Pale', 'Sherry', 'loves', 'bluntly', 'Subs', 'thankless', 'Hampton', 'Court', 'hie', 'refugees', 'pampered', 'Michael', 'Raphael', 'unsplinterable', 'GENESIS', 'JOB', 'JONAH', 'punish', 'ISAIAH', 'soever', 'cometh', 'incontinently', 'perisheth', 'PLUTARCH', 'MORALS', 'breedeth', 'Whirlpooles', 'Balaene', 'arpens', 'PLINY', 'Scarcely', 'TOOKE', 'LUCIAN', 'TRUE', 'catched', 'OCTHER', 'VERBAL', 'TAKEN', 'MOUTH', 'ALFRED', '890', 'gudgeon', 'retires', 'MONTAIGNE', 'APOLOGY', 'RAIMOND', 'SEBOND', 'Nick', 'RABELAIS', 'cartloads', 'STOWE', 'ANNALS', 'LORD', 'BACON', 'Touching', 'ork', 'DEATH', 'sovereignest', 'bruise', 'HAMLET', 'leach', 'Mote', 'availle', 'returne', 'againe', 'worker', 'Dinting', 'paine', 'thro', 'maine', 'FAERIE', 'Immense', 'til', 'DAVENANT', 'PREFACE', 'GONDIBERT', 'spermacetti', 'Hosmannus', 'Nescio', 'VIDE', 'Spencer', 'Talus', 'flail', 'threatens', 'jav', 'lins', 'WALLER', 'SUMMER', 'ISLANDS', 'Commonwealth', 'Civitas', 'OPENING', 'SENTENCE', 'HOBBES', 'LEVIATHAN', 'Silly', 'Mansoul', 'chewing', 'sprat', 'PILGRIM', 'PROGRESS', 'Created', 'PARADISE', 'LOST', '---"', 'Hugest', 'Stretched', 'Draws', 'FULLLER', 'PROFANE', 'HOLY', 'STATE', 'DRYDEN', 'ANNUS', 'MIRABILIS', 'aground', 'EDGE', 'TEN', 'SPITZBERGEN', 'PURCHAS', 'wantonness', 'fuzzing', 'vents', 'HERBERT', 'INTO', 'ASIA', 'AFRICA', 'SCHOUTEN', 'SIXTH', 'CIRCUMNAVIGATION', 'Elbe', 'ducat', 'herrings', 'GREENLAND', 'Several', 'Fife', 'Anno', '1652', 
'Pitferren', 'SIBBALD', 'FIFE', 'KINROSS', 'Myself', 'Sperma', 'ceti', 'fierceness', 'RICHARD', 'STRAFFORD', 'LETTER', 'BERMUDAS', 'PHIL', 'TRANS', '1668', 'PRIMER', 'COWLEY', '1729', '"...', 'frequendy', 'insupportable', 'disorder', 'ULLOA', 'SOUTH', 'AMERICA', 'sylphs', 'petticoat', 'Oft', 'Tho', 'RAPE', 'LOCK', 'NAT', 'wales', 'JOHNSON', 'COOK', 'dung', 'lime', 'juniper', 'UNO', 'VON', 'TROIL', 'LETTERS', 'BANKS', 'SOLANDER', '1772', 'Nantuckois', 'JEFFERSON', 'MEMORIAL', 'MINISTER', 'REFERENCE', 'PARLIAMENT', 'SOMEWHERE', 'guarding', 'protecting', 'robbers', 'BLACKSTONE', 'Rodmond', 'suspends', 'attends', 'FALCONER', 'Bright', 'roofs', 'domes', 'rockets', 'Around', 'unwieldy', 'COWPER', 'VISIT', 'LONDON', 'HUNTER', 'DISSECTION', 'SMALL', 'SIZED', 'aorta', 'gushing', 'PALEY', 'THEOLOGY', 'mammiferous', 'hind', 'BARON', 'CUVIER', 'COLNETT', 'PURPOSE', 'EXTENDING', 'SPERMACETI', 'Floundered', 'chace', 'peopling', 'Gather', 'Led', 'instincts', 'trackless', 'Assaulted', 'voracious', 'spiral', 'MONTGOMERY', 'WORLD', 'FLOOD', 'Paean', 'fatter', 'Flounders', 'CHARLES', 'LAMB', 'TRIUMPH', '1690', 'OBED', 'Susan', 'HAWTHORNE', 'TWICE', 'bespeak', 'raal', 'COOPER', 'PILOT', 'Berlin', 'Gazette', 'ECKERMANN', 'CONVERSATIONS', 'GOETHE', 'ESSEX', 'WAS', 'ATTACKED', 'FINALLY', 'DESTROYED', 'OWEN', 'CHACE', 'FIRST', 'SAID', 'VESSEL', 'YORK', '1821', 'piping', 'dimmed', 'phospher', 'ELIZABETH', 'OAKES', 'SMITH', 'amounted', '440', 'SCORESBY', 'Mad', 'agonies', 'endures', 'infuriated', 'rears', 'snaps', 'propelled', 'observers', 'opportunities', 'habitudes', 'BEALE', 'offensively', 'artful', 'mischievous', 'FREDERICK', 'DEBELL', '1840', 'October', 'Raise', 'ay', 'THAR', 'bowes', 'os', 'ROSS', 'ETCHINGS', 'CRUIZE', '1846', 'Globe', 'transactions', 'relate', 'HUSSEY', 'SURVIVORS', 'parried', 'MISSIONARY', 'JOURNAL', 'TYERMAN', 'boldest', 'persevering', 'REPORT', 'DANIEL', 'SPEECH', 'SENATE', 'APPLICATION', 'ERECTION', 'BREAKWATER', 'CAPTORS', 'WHALEMAN', 'ADVENTURES', 'BIOGRAPHY', 
'GATHERED', 'HOMEWARD', 'COMMODORE', 'PREBLE', 'REV', 'CHEEVER', 'MUTINEER', 'BROTHER', 'ANOTHER', 'MCCULLOCH', 'COMMERCIAL', 'reciprocal', 'clews', 'SOMETHING', 'UNPUBLISHED', 'CURRENTS', 'Pedestrians', 'recollect', 'gateways', 'VOYAGER', 'ARCTIC', 'NEWSPAPER', 'TAKING', 'RETAKING', 'HOBOMACK', 'MIRIAM', 'FISHERMAN', 'appliance', 'RIBS', 'TRUCKS', 'Terra', 'Del', 'Fuego', 'DARWIN', 'NATURALIST', ";--'", '!\'"', 'WHARTON', 'Loomings', 'spleen', 'regulating', 'circulation', 'Whenever', 'drizzly', 'hypos', 'philosophical', 'Cato', 'Manhattoes', 'reefs', 'downtown', 'gazers', 'Circumambulate', 'Corlears', 'Coenties', 'Slip', 'Whitehall', 'Posted', 'sentinels', 'spiles', 'pier', 'lath', 'counters', 'desks', 'loitering', 'shady', 'Inlanders', 'lanes', 'alleys', 'attract', 'dale', 'dreamiest', 'shadiest', 'quietest', 'enchanting', 'Saco', 'crucifix', 'Deep', 'mazy', 'Tiger', 'Tennessee', 'Rockaway', 'Persians', 'deity', 'Narcissus', 'ungraspable', 'hazy', 'quarrelsome', 'offices', 'abominate', 'toils', 'trials', 'barques', 'schooners', 'broiling', 'buttered', 'judgmatically', 'peppered', 'reverentially', 'idolatrous', 'dotings', 'ibis', 'roasted', 'bake', 'plumb', 'Van', 'Rensselaers', 'Randolphs', 'Hardicanutes', 'lording', 'tallest', 'decoction', 'Seneca', 'Stoics', 'Testament', 'promptly', 'rub', 'infliction', 'BEING', 'PAID', 'urbane', 'ills', 'monied', 'consign', 'prevalent', 'violate', 'Pythagorean', 'commonalty', 'police', 'surveillance', 'programme', 'solo', 'CONTESTED', 'ELECTION', 'PRESIDENCY', 'UNITED', 'STATES', 'ISHMAEL', 'BLOODY', 'AFFGHANISTAN', 'managers', 'genteel', 'comedies', 'farces', 'cunningly', 'disguises', 'cajoling', 'unbiased', 'freewill', 'discriminating', 'overwhelming', 'undeliverable', 'itch', 'forbidden', 'ignoring', 'lodges', 'Carpet', 'Bag', 'Manhatto', 'candidates', 'penalties', 'Tyre', 'Carthage', 'imported', 'cobblestones', 'bitingly', 'shouldering', 'price', 'fervent', 'asphaltic', 'pavement', 'flinty', 'projections', 'soles', 'Too', 
'cheapest', 'cheeriest', 'invitingly', 'particles', 'peer', 'Angel', 'Doom', 'wailing', 'gnashing', 'Wretched', 'entertainment', 'Moving', 'emigrant', 'poverty', 'creak', 'lodgings', 'zephyr', 'hob', 'toasting', 'observest', 'sashless', 'glazier', 'reasonest', 'chinks', 'crannies', 'lint', 'chattering', 'shiverings', 'cob', 'redder', 'Orion', 'glitters', 'conservatories', 'president', 'temperance', 'blubbering', 'straggling', 'wainscots', 'reminding', 'oilpainting', 'besmoked', 'defaced', 'unequal', 'crosslights', 'hags', 'delineate', 'bewitched', 'ponderings', 'boggy', 'soggy', 'squitchy', 'froze', 'heath', 'icebound', 'represents', 'Horner', 'foundered', 'clubs', 'harvesting', 'hacking', 'horrifying', 'Mixed', 'Nathan', 'Swain', 'corkscrew', 'Blanco', 'sojourning', 'fireplaces', 'duskier', 'cockpits', 'rarities', 'Projecting', 'Within', 'shelves', 'flasks', 'bustles', 'deliriums', 'Abominable', 'tumblers', 'cylinders', 'goggling', 'deceitfully', 'tapered', 'Parallel', 'pecked', 'footpads', 'Fill', 'shilling', 'examining', 'SKRIMSHANDER', 'accommodated', 'unoccupied', 'haint', 'pose', 'whalin', 'decidedly', 'objectionable', 'wander', 'Battery', 'ruminating', 'adorning', 'potatoes', 'sartainty', 'diabolically', 'steaks', 'undress', 'looker', 'rioting', 'Grampus', 'seed', 'Feegees', 'tramping', 'Enveloped', 'bedarned', 'eruption', 'officiating', 'brimmers', 'complained', 'potion', 'colds', 'catarrhs', 'liquor', 'arrantest', 'topers', 'obstreperously', 'aloof', 'desirous', 'hilarity', 'coffer', 'Southerner', 'mountaineers', 'Alleghanian', 'missed', 'supernaturally', 'congratulate', 'multiply', 'bachelor', 'abominated', 'tidiest', 'bedwards', 'shan', 'tablecloth', 'Skrimshander', 'bump', 'spraining', 'eider', 'yoking', 'rickety', 'whirlwinds', 'knockings', 'dismissed', 'popped', 'cherishing', 'chuckled', 'chuckle', 'mightily', 'catches', 'bamboozingly', 'overstocked', 'toothpick', 'rayther', 'BROWN', 'slanderin', 'farrago', 'BROKE', 'Sartain', 'Mt', 'Hecla', 
'persist', 'mystifying', 'unsay', 'criminal', 'Wall', 'purty', 'sarmon', 'rips', 'tellin', 'bought', 'balmed', 'curios', 'sellin', 'inions', 'fooling', 'idolators', 'Depend', 'reg', 'lar', 'spliced', 'Johnny', 'sprawling', 'Arter', 'glim', 'jiffy', 'irresolute', 'vum', 'WON', 'Folding', 'scrutiny', 'porcupine', 'moccasin', 'ponchos', 'parade', 'rainy', 'remembering', 'commended', 'cobs', 'Nod', 'footfall', 'unlacing', 'blackish', 'plasters', 'inkling', 'Placing', 'crammed', 'scalp', 'mildewed', 'Ignorance', 'parent', 'nonplussed', 'undressing', 'checkered', 'Thirty', 'frogs', 'quaked', 'wrapall', 'dreadnaught', 'fumbled', 'Remembering', 'manikin', 'tenpin', 'andirons', 'jambs', 'bricks', 'appropriate', 'applying', 'hastier', 'withdrawals', 'antics', 'devotee', 'extinguishing', 'unceremoniously', 'bagged', 'sportsman', 'woodcock', 'uncomfortableness', 'deliberating', 'puffed', 'sang', 'Stammering', 'conjured', 'responses', 'debel', 'flourishing', 'Angels', 'flourishings', 'peddlin', 'sleepe', 'grunted', 'gettee', 'motioning', 'comely', 'insured', 'Counterpane', 'parti', 'triangles', 'interminable', 'caper', 'supperless', '21st', 'hemisphere', 'sigh', 'Sixteen', 'ached', 'coaches', 'stockinged', 'slippering', 'misbehaviour', 'unendurable', 'stepmothers', 'misfortunes', 'steeped', 'shudderingly', 'confounding', 'soberly', 'recurred', 'predicament', 'unlock', 'bridegroom', 'clasp', 'hugged', 'rouse', 'snore', 'scratch', 'Throwing', 'expostulations', 'unbecomingness', 'matrimonial', 'dawning', 'overture', 'innate', 'compliment', 'civility', 'rudeness', 'toilette', 'dressing', 'donning', 'gaspings', 'booting', 'caterpillar', 'outlandishness', 'manners', 'education', 'undergraduate', 'dreamt', 'cowhide', 'pinched', 'curtains', 'indecorous', 'contented', 'restricting', 'donned', 'lathering', 'unsheathes', 'whets', 'Rogers', 'cutlery', 'Afterwards', 'baton', 'Breakfast', 'pleasantly', 'bountifully', 'laughable', 'bosky', 'unshorn', 'gowns', 'toasted', 'lingers', 'tarried', 
'barred', 'Grub', 'Park', 'assurance', 'polish', 'occasioned', 'embarrassed', 'bashfulness', 'duelled', 'winking', 'tastes', 'sheepishly', 'bashful', 'icicle', 'admirer', 'cordially', 'grappling', 'genteelly', 'eschewed', 'undivided', '6', 'circulating', 'nondescripts', 'Chestnut', 'jostle', 'Regent', 'Lascars', 'Bombay', 'Apollo', 'Feegeeans', 'Tongatobooarrs', 'Erromanggoans', 'Pannangians', 'Brighggians', 'weekly', 'Vermonters', 'stalwart', 'frames', 'felled', 'strutting', 'wester', 'bombazine', 'cloak', 'mow', 'gloves', 'joins', 'outfit', 'waistcoats', 'Hay', 'Seed', 'tract', 'dearest', 'pave', 'eggs', 'patrician', 'parks', 'scraggy', 'scoria', 'Herr', 'dowers', 'nieces', 'reservoirs', 'maples', 'bountiful', 'proffer', 'passer', 'cones', 'blossoms', 'superinduced', 'carnation', 'Salem', 'sweethearts', 'Puritanic', 'Whaleman', 'Wrapping', 'Each', 'quote', 'TALBOT', 'Near', 'Desolation', '1st', 'SISTER', 'ROBERT', 'WILLIS', 'ELLERY', 'NATHAN', 'COLEMAN', 'WALTER', 'CANNY', 'SETH', 'GLEIG', 'Forming', 'ELIZA', '31st', 'MARBLE', 'SHIPMATES', 'EZEKIEL', 'HARDY', 'AUGUST', '3d', '1833', 'WIDOW', 'Shaking', 'glazed', 'Affected', 'relatives', 'unhealing', 'sympathetically', 'wounds', 'bleed', 'blanks', ...]
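A hapax is just a word whose count is exactly one, so the same list can be derived from any frequency table. A minimal sketch with collections.Counter (the sample sentence and variable names are made up for illustration):

```python
from collections import Counter

tokens = ('call me ishmael . some years ago never mind '
          'how long precisely . call me').split()
counts = Counter(tokens)

# Keep only the words that occur exactly once.
hapaxes = [word for word, count in counts.items() if count == 1]
print(hapaxes)
```

'call', 'me' and '.' each occur twice in the sample and are therefore left out.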

Fine-grained selection of words

  1. The set of all w such that w is an element of V (the vocabulary) and w has property P:
    \(\{w \mid w \in V \text{ and } P(w)\}\)
  2. The corresponding Python expression is given:
    [w for w in V if p(w)]
V = set(text1)
long_words = [w for w in V if len(w)>15]
sorted(long_words)
['CIRCUMNAVIGATION', 'Physiognomically', 'apprehensiveness', 'cannibalistically', 'characteristically', 'circumnavigating', 'circumnavigation', 'circumnavigations', 'comprehensiveness', 'hermaphroditical', 'indiscriminately', 'indispensableness', 'irresistibleness', 'physiognomically', 'preternaturalness', 'responsibilities', 'simultaneousness', 'subterraneousness', 'supernaturalness', 'superstitiousness', 'uncomfortableness', 'uncompromisedness', 'undiscriminating', 'uninterpenetratingly']
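The property P(w) in the comprehension can be any boolean expression, including a combination of conditions. As a sketch under made-up data, the example below keeps words that are both long and used more than once; the thresholds and the sample sentence are arbitrary choices for illustration.

```python
from collections import Counter

tokens = ('the whalemen saw the whale ; the whalemen chased the whale '
          'and the harpooneer struck').split()
counts = Counter(tokens)
vocabulary = set(tokens)

# P(w): w has more than 7 characters AND occurs more than once.
selected = sorted(w for w in vocabulary if len(w) > 7 and counts[w] > 1)
print(selected)
```

'harpooneer' is long enough but occurs only once, and 'whale' occurs twice but is too short, so only 'whalemen' satisfies both conditions.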
 

This post is drawn from the book Natural Language Processing with Python.

Reposted from: https://www.cnblogs.com/brightyuxl/p/8973951.html
