Term extraction from English text using the stopword method

Import the modules

In [1]:
import termextract.english_plaintext
import termextract.core

from pprint import pprint  # used to pretty-print the results in this sample

Load the English plain text

The text is excerpted from the Wikipedia article on artificial intelligence ( https://en.wikipedia.org/wiki/A.I._Artificial_Intelligence ).

In [2]:
# The with-statement closes the file automatically after reading
with open("eng_sample_s.txt", "r", encoding="utf-8") as f:
    text = f.read()
print(text)
Artificial intelligence (AI) is the intelligence exhibited by machines. In computer science, an ideal "intelligent" machine is a flexible rational agent that perceives its environment and takes actions that maximize its chance of success at an arbitrary goal.[1] Colloquially, the term "artificial intelligence" is likely to be applied when a machine uses cutting-edge techniques to competently perform or mimic "cognitive" functions that we intuitively associate with human minds, such as "learning" and "problem solving".[2] The colloquial connotation, especially among the public, associates artificial intelligence with machines that are "cutting-edge" (or even "mysterious"). This subjective borderline around what constitutes "artificial intelligence" tends to shrink over time; for example, optical character recognition is no longer perceived as an exemplar of "artificial intelligence" as it is nowadays a mundane routine technology.[3] Modern examples of AI include computers that can beat professional players at Chess and Go, and self-driving cars that navigate crowded city streets.

AI research is highly technical and specialized, and is deeply divided into subfields that often fail to communicate with each other.[4] Some of the division is due to social and cultural factors: subfields have grown up around particular institutions and the work of individual researchers. AI research is also divided by several technical issues. Some subfields focus on the solution of specific problems. Others focus on one of several possible approaches or on the use of a particular tool or towards the accomplishment of particular applications.

The central problems (or goals) of AI research include reasoning, knowledge, planning, learning, natural language processing (communication), perception and the ability to move and manipulate objects.[5] General intelligence is still among the field's long-term goals.[6] Currently popular approaches include statistical methods, computational intelligence and traditional symbolic AI. There are a large number of tools used in AI, including versions of search and mathematical optimization, logic, methods based on probability and economics, and many others. The AI field is interdisciplinary, in which a number of sciences and professions converge, including computer science, mathematics, psychology, linguistics, philosophy and neuroscience, as well as other specialized fields such as artificial psychology.

The field was founded on the claim that a central property of humans, human intelligence—the sapience of Homo sapiens sapiens—"can be so precisely described that a machine can be made to simulate it."[7] This raises philosophical arguments about the nature of the mind and the ethics of creating artificial beings endowed with human-like intelligence, issues which have been explored by myth, fiction and philosophy since antiquity.[8] Artificial intelligence has been the subject of tremendous optimism[9] but has also suffered stunning setbacks.[10] Today AI techniques have become an essential part of the technology industry, providing the heavy lifting for many of the most challenging problems in computer science.[11]


Extract compound terms (available as either a dictionary or a list)

In [3]:
frequency = termextract.english_plaintext.cmp_noun_dict(text)
pprint(frequency)

#term_list = termextract.english_plaintext.cmp_noun_list(text)
#pprint(term_list)
{'AI': 1,
 'AI field': 1,
 'AI research': 2,
 'AI techniques': 1,
 'Artificial intelligence': 1,
 'Artificial intelligence AI': 1,
 'Chess': 1,
 'Colloquially': 1,
 'Currently popular approaches include statistical methods': 1,
 'Modern examples of AI include computers': 1,
 'ability': 1,
 'accomplishment of particular applications': 1,
 'among': 1,
 'applied': 1,
 'approaches': 1,
 'arbitrary': 1,
 'artificial psychology': 1,
 'associates artificial intelligence': 1,
 'beat professional players': 1,
 'cars': 1,
 'central': 1,
 'central property of humans': 1,
 'challenging': 1,
 'chance of success': 1,
 'claim': 1,
 'colloquial connotation': 1,
 'communicate': 1,
 'competently perform': 1,
 'computational intelligence': 1,
 'computer': 1,
 'computer science': 1,
 'constitutes artificial intelligence tends': 1,
 'cultural factors': 1,
 'deeply divided': 1,
 'divided': 1,
 'division': 1,
 'due': 1,
 'economics': 1,
 'environment': 1,
 'especially among': 1,
 'essential': 1,
 'ethics of creating artificial beings endowed': 1,
 'even mysterious': 1,
 'example': 1,
 'exemplar of artificial intelligence': 1,
 'explored': 1,
 'fail': 1,
 'fiction': 1,
 'field': 1,
 "field's": 1,
 'flexible rational agent': 1,
 'focus': 1,
 'founded': 1,
 'goals of AI research include reasoning': 1,
 'grown': 1,
 'heavy lifting': 1,
 'highly technical': 1,
 'human intelligence—the sapience of Homo sapiens sapiens—can': 1,
 'human minds': 1,
 'ideal intelligent machine': 1,
 'including computer science': 1,
 'including versions of search': 1,
 'individual researchers': 1,
 'intelligence': 2,
 'intelligence exhibited': 1,
 'interdisciplinary': 1,
 'intuitively associate': 1,
 'issues': 1,
 'knowledge': 1,
 'learning': 2,
 'linguistics': 1,
 'logic': 1,
 'machine': 2,
 'machines': 2,
 'manipulate': 1,
 'mathematical optimization': 1,
 'mathematics': 1,
 'maximize': 1,
 'methods based': 1,
 'mimic cognitive functions': 1,
 'mind': 1,
 'move': 1,
 'mundane routine': 1,
 'myth': 1,
 'natural language processing communication': 1,
 'nature': 1,
 'navigate crowded city streets': 1,
 'neuroscience': 1,
 'nowadays': 1,
 'optical character recognition': 1,
 'particular institutions': 1,
 'particular tool': 1,
 'perceived': 1,
 'perceives': 1,
 'perception': 1,
 'philosophy': 2,
 'planning': 1,
 'precisely described': 1,
 'probability': 1,
 'professions converge': 1,
 'providing': 1,
 'psychology': 1,
 'public': 1,
 'raises philosophical arguments': 1,
 'sciences': 1,
 'shrink': 1,
 'simulate': 1,
 'social': 1,
 'solution of specific': 1,
 'specialized': 1,
 'specialized fields': 1,
 'subfields': 2,
 'subfields focus': 1,
 'subject of tremendous optimism9': 1,
 'subjective borderline': 1,
 'suffered stunning': 1,
 'takes actions': 1,
 'technical issues': 1,
 'techniques': 1,
 'technology industry': 1,
 'term artificial intelligence': 1,
 'time': 1,
 'tools': 1,
 'towards': 1,
 'traditional symbolic AI': 1}
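
The general idea of stopword-based extraction can be sketched as follows. This is only an illustration, not the code of termextract.english_plaintext: the STOPWORDS set and the candidate_terms function are hypothetical, and the real module evidently applies further rules (for example, it keeps "of" inside terms such as 'chance of success').

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "the", "is", "are", "by", "in", "and", "that", "its", "at"}

def candidate_terms(text):
    """Count maximal runs of non-stopword words; punctuation also ends a run."""
    counts = Counter()
    # Split at punctuation first so a term never spans a comma or a period.
    for chunk in re.split(r"[^\w\s'-]+", text):
        run = []
        for word in chunk.split():
            if word.lower() in STOPWORDS:
                # A stopword closes the current run of candidate words.
                if run:
                    counts[" ".join(run)] += 1
                run = []
            else:
                run.append(word)
        if run:
            counts[" ".join(run)] += 1
    return dict(counts)

sample = "Artificial intelligence is the intelligence exhibited by machines."
print(candidate_terms(sample))
```

On this one-sentence sample the sketch yields 'Artificial intelligence', 'intelligence exhibited' and 'machines', which also appear among the library's results above.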

Generate the LR scores from the frequencies

In [4]:
lr = termextract.core.score_lr(
    frequency,
    ignore_words=termextract.english_plaintext.IGNORE_WORDS,
    lr_mode=1, average_rate=1)
pprint(lr)
{'AI': 5.916079783099616,
 'AI field': 2.892507608519078,
 'AI research': 4.090623489235047,
 'AI techniques': 2.892507608519078,
 'Artificial intelligence': 3.1301691601465746,
 'Artificial intelligence AI': 3.8701091424450826,
 'Chess': 1.0,
 'Colloquially': 1.0,
 'Currently popular approaches include statistical methods': 2.1189261887185906,
 'Modern examples of AI include computers': 2.13480888082866,
 'ability': 1.0,
 'accomplishment of particular applications': 1.5422108254079407,
 'among': 1.4142135623730951,
 'applied': 1.0,
 'approaches': 2.0,
 'arbitrary': 1.0,
 'artificial psychology': 3.027400104035091,
 'associates artificial intelligence': 3.7288210710016374,
 'beat professional players': 1.5874010519681994,
 'cars': 1.0,
 'central': 1.4142135623730951,
 'central property of humans': 1.4142135623730951,
 'challenging': 1.0,
 'chance of success': 1.2599210498948732,
 'claim': 1.0,
 'colloquial connotation': 1.4142135623730951,
 'communicate': 1.0,
 'competently perform': 1.4142135623730951,
 'computational intelligence': 2.8284271247461903,
 'computer': 2.449489742783178,
 'computer science': 2.0597671439071177,
 'constitutes artificial intelligence tends': 2.9262229190053666,
 'cultural factors': 1.4142135623730951,
 'deeply divided': 1.4142135623730951,
 'divided': 1.4142135623730951,
 'division': 1.0,
 'due': 1.0,
 'economics': 1.0,
 'environment': 1.0,
 'especially among': 1.4142135623730951,
 'essential': 1.0,
 'ethics of creating artificial beings endowed': 1.9310155543137495,
 'even mysterious': 1.4142135623730951,
 'example': 1.0,
 'exemplar of artificial intelligence': 2.6833582480460967,
 'explored': 1.0,
 'fail': 1.0,
 'fiction': 1.0,
 'field': 1.4142135623730951,
 "field's": 1.0,
 'flexible rational agent': 1.5874010519681994,
 'focus': 1.4142135623730951,
 'founded': 1.0,
 'goals of AI research include reasoning': 2.261751222748436,
 'grown': 1.0,
 'heavy lifting': 1.4142135623730951,
 'highly technical': 1.681792830507429,
 'human intelligence—the sapience of Homo sapiens sapiens—can': 1.6888822547634152,
 'human minds': 1.5650845800732873,
 'ideal intelligent machine': 1.5874010519681994,
 'including computer science': 1.9441612972396656,
 'including versions of search': 1.4877378261644902,
 'individual researchers': 1.4142135623730951,
 'intelligence': 5.656854249492381,
 'intelligence exhibited': 2.8284271247461903,
 'interdisciplinary': 1.0,
 'intuitively associate': 1.4142135623730951,
 'issues': 1.4142135623730951,
 'knowledge': 1.0,
 'learning': 1.0,
 'linguistics': 1.0,
 'logic': 1.0,
 'machine': 1.4142135623730951,
 'machines': 1.0,
 'manipulate': 1.0,
 'mathematical optimization': 1.4142135623730951,
 'mathematics': 1.0,
 'maximize': 1.0,
 'methods based': 1.681792830507429,
 'mimic cognitive functions': 1.5874010519681994,
 'mind': 1.0,
 'move': 1.0,
 'mundane routine': 1.4142135623730951,
 'myth': 1.0,
 'natural language processing communication': 1.681792830507429,
 'nature': 1.0,
 'navigate crowded city streets': 1.681792830507429,
 'neuroscience': 1.0,
 'nowadays': 1.0,
 'optical character recognition': 1.5874010519681994,
 'particular institutions': 2.0,
 'particular tool': 2.0,
 'perceived': 1.0,
 'perceives': 1.0,
 'perception': 1.0,
 'philosophy': 1.0,
 'planning': 1.0,
 'precisely described': 1.4142135623730951,
 'probability': 1.0,
 'professions converge': 1.4142135623730951,
 'providing': 1.0,
 'psychology': 1.4142135623730951,
 'public': 1.0,
 'raises philosophical arguments': 1.5874010519681994,
 'sciences': 1.0,
 'shrink': 1.0,
 'simulate': 1.0,
 'social': 1.0,
 'solution of specific': 1.2599210498948732,
 'specialized': 1.4142135623730951,
 'specialized fields': 1.4142135623730951,
 'subfields': 1.4142135623730951,
 'subfields focus': 1.4142135623730951,
 'subject of tremendous optimism9': 1.4142135623730951,
 'subjective borderline': 1.4142135623730951,
 'suffered stunning': 1.4142135623730951,
 'takes actions': 1.4142135623730951,
 'technical issues': 1.681792830507429,
 'techniques': 1.4142135623730951,
 'technology industry': 1.4142135623730951,
 'term artificial intelligence': 3.7288210710016374,
 'time': 1.0,
 'tools': 1.0,
 'towards': 1.0,
 'traditional symbolic AI': 2.557759296802023}
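
Several of the scores above are consistent with a simple reading of LR: for each word, count how often another word appears directly to its left and to its right across all terms (weighted by term frequency), then take the geometric mean of (left + 1) * (right + 1) over the words of a term. For instance, 'colloquial connotation' gives 4 ** (1/4), about 1.414, and 'flexible rational agent' gives 16 ** (1/6), about 1.587. The sketch below implements only that reading. The function name lr_scores is hypothetical, and the handling of ignore_words, lr_mode and average_rate is left out, so this is not the implementation of termextract.core.score_lr.

```python
from collections import defaultdict

def lr_scores(frequency):
    """Simplified LR: geometric mean of (left + 1) * (right + 1) per word."""
    left = defaultdict(int)   # word -> how often some word precedes it
    right = defaultdict(int)  # word -> how often some word follows it
    for term, freq in frequency.items():
        words = term.split()
        # Each adjacent pair contributes one right link and one left link.
        for prev, nxt in zip(words, words[1:]):
            right[prev] += freq
            left[nxt] += freq

    scores = {}
    for term in frequency:
        words = term.split()
        product = 1.0
        for w in words:
            product *= (left[w] + 1) * (right[w] + 1)
        # The exponent 1 / (2 * n) turns the product into a geometric mean.
        scores[term] = product ** (1 / (2 * len(words)))
    return scores

toy_frequency = {
    "colloquial connotation": 1,
    "flexible rational agent": 1,
    "Chess": 1,
}
toy_lr = lr_scores(toy_frequency)
for term, score in toy_lr.items():
    print(term, score, sep="\t")
```

A word that never connects to another word, such as 'Chess', scores exactly 1.0 under this formula, matching the many 1.0 entries in the output.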

Combine the frequencies and the LR scores to compute the FLR importance

In [6]:
term_imp = termextract.core.term_importance(frequency, lr)
pprint(term_imp)
{'AI': 5.916079783099616,
 'AI field': 2.892507608519078,
 'AI research': 8.181246978470094,
 'AI techniques': 2.892507608519078,
 'Artificial intelligence': 3.1301691601465746,
 'Artificial intelligence AI': 3.8701091424450826,
 'Chess': 1.0,
 'Colloquially': 1.0,
 'Currently popular approaches include statistical methods': 2.1189261887185906,
 'Modern examples of AI include computers': 2.13480888082866,
 'ability': 1.0,
 'accomplishment of particular applications': 1.5422108254079407,
 'among': 1.4142135623730951,
 'applied': 1.0,
 'approaches': 2.0,
 'arbitrary': 1.0,
 'artificial psychology': 3.027400104035091,
 'associates artificial intelligence': 3.7288210710016374,
 'beat professional players': 1.5874010519681994,
 'cars': 1.0,
 'central': 1.4142135623730951,
 'central property of humans': 1.4142135623730951,
 'challenging': 1.0,
 'chance of success': 1.2599210498948732,
 'claim': 1.0,
 'colloquial connotation': 1.4142135623730951,
 'communicate': 1.0,
 'competently perform': 1.4142135623730951,
 'computational intelligence': 2.8284271247461903,
 'computer': 2.449489742783178,
 'computer science': 2.0597671439071177,
 'constitutes artificial intelligence tends': 2.9262229190053666,
 'cultural factors': 1.4142135623730951,
 'deeply divided': 1.4142135623730951,
 'divided': 1.4142135623730951,
 'division': 1.0,
 'due': 1.0,
 'economics': 1.0,
 'environment': 1.0,
 'especially among': 1.4142135623730951,
 'essential': 1.0,
 'ethics of creating artificial beings endowed': 1.9310155543137495,
 'even mysterious': 1.4142135623730951,
 'example': 1.0,
 'exemplar of artificial intelligence': 2.6833582480460967,
 'explored': 1.0,
 'fail': 1.0,
 'fiction': 1.0,
 'field': 1.4142135623730951,
 "field's": 1.0,
 'flexible rational agent': 1.5874010519681994,
 'focus': 1.4142135623730951,
 'founded': 1.0,
 'goals of AI research include reasoning': 2.261751222748436,
 'grown': 1.0,
 'heavy lifting': 1.4142135623730951,
 'highly technical': 1.681792830507429,
 'human intelligence—the sapience of Homo sapiens sapiens—can': 1.6888822547634152,
 'human minds': 1.5650845800732873,
 'ideal intelligent machine': 1.5874010519681994,
 'including computer science': 1.9441612972396656,
 'including versions of search': 1.4877378261644902,
 'individual researchers': 1.4142135623730951,
 'intelligence': 11.313708498984761,
 'intelligence exhibited': 2.8284271247461903,
 'interdisciplinary': 1.0,
 'intuitively associate': 1.4142135623730951,
 'issues': 1.4142135623730951,
 'knowledge': 1.0,
 'learning': 2.0,
 'linguistics': 1.0,
 'logic': 1.0,
 'machine': 2.8284271247461903,
 'machines': 2.0,
 'manipulate': 1.0,
 'mathematical optimization': 1.4142135623730951,
 'mathematics': 1.0,
 'maximize': 1.0,
 'methods based': 1.681792830507429,
 'mimic cognitive functions': 1.5874010519681994,
 'mind': 1.0,
 'move': 1.0,
 'mundane routine': 1.4142135623730951,
 'myth': 1.0,
 'natural language processing communication': 1.681792830507429,
 'nature': 1.0,
 'navigate crowded city streets': 1.681792830507429,
 'neuroscience': 1.0,
 'nowadays': 1.0,
 'optical character recognition': 1.5874010519681994,
 'particular institutions': 2.0,
 'particular tool': 2.0,
 'perceived': 1.0,
 'perceives': 1.0,
 'perception': 1.0,
 'philosophy': 2.0,
 'planning': 1.0,
 'precisely described': 1.4142135623730951,
 'probability': 1.0,
 'professions converge': 1.4142135623730951,
 'providing': 1.0,
 'psychology': 1.4142135623730951,
 'public': 1.0,
 'raises philosophical arguments': 1.5874010519681994,
 'sciences': 1.0,
 'shrink': 1.0,
 'simulate': 1.0,
 'social': 1.0,
 'solution of specific': 1.2599210498948732,
 'specialized': 1.4142135623730951,
 'specialized fields': 1.4142135623730951,
 'subfields': 2.8284271247461903,
 'subfields focus': 1.4142135623730951,
 'subject of tremendous optimism9': 1.4142135623730951,
 'subjective borderline': 1.4142135623730951,
 'suffered stunning': 1.4142135623730951,
 'takes actions': 1.4142135623730951,
 'technical issues': 1.681792830507429,
 'techniques': 1.4142135623730951,
 'technology industry': 1.4142135623730951,
 'term artificial intelligence': 3.7288210710016374,
 'time': 1.0,
 'tools': 1.0,
 'towards': 1.0,
 'traditional symbolic AI': 2.557759296802023}
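
Comparing this output with the LR scores suggests that the importance here is simply frequency multiplied by LR: 'AI research' has frequency 2 and LR 4.0906, giving 8.1812, while terms with frequency 1 keep their LR score unchanged. A minimal sketch of that relationship follows. The function name flr_importance is hypothetical, and the rule is inferred from the numbers above rather than from the library source.

```python
def flr_importance(frequency, lr):
    """Multiply each term's frequency by its LR score."""
    return {term: frequency[term] * lr[term] for term in frequency}

# Values copied from the outputs above.
sample_frequency = {"AI research": 2, "AI": 1}
sample_lr = {"AI research": 4.090623489235047, "AI": 5.916079783099616}

sample_imp = flr_importance(sample_frequency, sample_lr)
for term, value in sample_imp.items():
    print(term, value, sep="\t")
```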

Display the terms in descending order of importance using collections

In [7]:
import collections
data_collection = collections.Counter(term_imp)
for cmp_noun, value in data_collection.most_common():
    print(cmp_noun, value, sep="\t")
intelligence	11.313708498984761
AI research	8.181246978470094
AI	5.916079783099616
Artificial intelligence AI	3.8701091424450826
term artificial intelligence	3.7288210710016374
associates artificial intelligence	3.7288210710016374
Artificial intelligence	3.1301691601465746
artificial psychology	3.027400104035091
constitutes artificial intelligence tends	2.9262229190053666
AI techniques	2.892507608519078
AI field	2.892507608519078
subfields	2.8284271247461903
computational intelligence	2.8284271247461903
intelligence exhibited	2.8284271247461903
machine	2.8284271247461903
exemplar of artificial intelligence	2.6833582480460967
traditional symbolic AI	2.557759296802023
computer	2.449489742783178
goals of AI research include reasoning	2.261751222748436
Modern examples of AI include computers	2.13480888082866
Currently popular approaches include statistical methods	2.1189261887185906
computer science	2.0597671439071177
philosophy	2.0
particular institutions	2.0
learning	2.0
particular tool	2.0
machines	2.0
approaches	2.0
including computer science	1.9441612972396656
ethics of creating artificial beings endowed	1.9310155543137495
human intelligence—the sapience of Homo sapiens sapiens—can	1.6888822547634152
technical issues	1.681792830507429
methods based	1.681792830507429
navigate crowded city streets	1.681792830507429
highly technical	1.681792830507429
natural language processing communication	1.681792830507429
mimic cognitive functions	1.5874010519681994
flexible rational agent	1.5874010519681994
raises philosophical arguments	1.5874010519681994
ideal intelligent machine	1.5874010519681994
optical character recognition	1.5874010519681994
beat professional players	1.5874010519681994
human minds	1.5650845800732873
accomplishment of particular applications	1.5422108254079407
including versions of search	1.4877378261644902
psychology	1.4142135623730951
subfields focus	1.4142135623730951
takes actions	1.4142135623730951
suffered stunning	1.4142135623730951
subjective borderline	1.4142135623730951
deeply divided	1.4142135623730951
cultural factors	1.4142135623730951
field	1.4142135623730951
heavy lifting	1.4142135623730951
professions converge	1.4142135623730951
colloquial connotation	1.4142135623730951
mathematical optimization	1.4142135623730951
technology industry	1.4142135623730951
individual researchers	1.4142135623730951
focus	1.4142135623730951
precisely described	1.4142135623730951
intuitively associate	1.4142135623730951
among	1.4142135623730951
central	1.4142135623730951
issues	1.4142135623730951
specialized fields	1.4142135623730951
competently perform	1.4142135623730951
techniques	1.4142135623730951
subject of tremendous optimism9	1.4142135623730951
divided	1.4142135623730951
especially among	1.4142135623730951
specialized	1.4142135623730951
even mysterious	1.4142135623730951
central property of humans	1.4142135623730951
mundane routine	1.4142135623730951
solution of specific	1.2599210498948732
chance of success	1.2599210498948732
Chess	1.0
providing	1.0
field's	1.0
claim	1.0
sciences	1.0
nature	1.0
arbitrary	1.0
fiction	1.0
perceives	1.0
towards	1.0
challenging	1.0
example	1.0
applied	1.0
interdisciplinary	1.0
perception	1.0
communicate	1.0
explored	1.0
neuroscience	1.0
essential	1.0
environment	1.0
planning	1.0
founded	1.0
manipulate	1.0
logic	1.0
due	1.0
ability	1.0
mind	1.0
time	1.0
tools	1.0
division	1.0
myth	1.0
shrink	1.0
economics	1.0
knowledge	1.0
Colloquially	1.0
social	1.0
simulate	1.0
perceived	1.0
grown	1.0
maximize	1.0
fail	1.0
linguistics	1.0
nowadays	1.0
public	1.0
cars	1.0
mathematics	1.0
move	1.0
probability	1.0
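
When only the highest-ranked terms are of interest, most_common accepts a count. The sketch below uses a small hand-made dictionary (values rounded from the results above) in place of term_imp. The name sample_term_imp and the cutoff of 3 are arbitrary choices for illustration.

```python
import collections

# Stand-in for term_imp, with values rounded from the results above.
sample_term_imp = {
    "Chess": 1.0,
    "AI": 5.92,
    "intelligence": 11.31,
    "AI research": 8.18,
}

# most_common(n) returns the n highest-valued items, largest first.
top_terms = collections.Counter(sample_term_imp).most_common(3)
for cmp_noun, value in top_terms:
    print(cmp_noun, value, sep="\t")
```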