14  Using spaCy

14.1 Installation

pip install spacy


python -m spacy download en_core_web_sm
python -m spacy download zh_core_web_sm
python -m spacy download en_core_web_lg
python -m spacy download zh_core_web_lg


Official download page for spaCy models: https://github.com/explosion/spacy-models/tags


pip install ./en_core_web_lg-3.5.0.tar.gz
pip install ./zh_core_web_lg-3.5.0.tar.gz

14.1.1 Run spaCy with GPU


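pip install -U spacy[cuda113]
import spacy
# Use the GPU if one is available, otherwise fall back to the CPU.
# Call this before loading any pipeline.
spacy.prefer_gpu()
nlp = spacy.load("en_core_web_sm")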

14.2 Doc Container

Figure 14.1 spaCy containers
import spacy
nlp = spacy.load("en_core_web_sm")
text = """The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia. With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York."""

doc = nlp(text)

14.2.1 token

for i, token in enumerate(doc[:15]):
    print(i, token, sep="\t")
0   The
1   United
2   States
3   of
4   America
5   (
6   U.S.A.
7   or
8   USA
9   )
10  ,
11  commonly
12  known
13  as
14  the
for i, token in enumerate(text[:15]):
    print(i, token, sep="\t")
0   T
1   h
2   e
4   U
5   n
6   i
7   t
8   e
9   d
11  S
12  t
13  a
14  t
for i, token in enumerate(text.split()[:15]):
    print(i, token, sep="\t")
0   The
1   United
2   States
3   of
4   America
5   (U.S.A.
6   or
7   USA),
8   commonly
9   known
10  as
11  the
12  United
13  States
14  (U.S.
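
The three outputs above differ because spaCy's tokenizer applies language-specific rules instead of splitting on characters or whitespace. As a minimal sketch, the comparison can be reproduced on a short fragment with a blank English pipeline. It assumes that spacy.blank("en"), which carries only the tokenizer, is enough here, so no model download is needed. The variable names are illustrative.

```python
import spacy

# A blank English pipeline contains only the rule-based tokenizer.
nlp = spacy.blank("en")

fragment = "America (U.S.A. or USA), commonly"

# Whitespace splitting keeps punctuation glued to the words.
whitespace_tokens = fragment.split()

# spaCy splits off brackets and commas but keeps the abbreviation "U.S.A." whole.
spacy_tokens = [token.text for token in nlp(fragment)]

print(whitespace_tokens)
print(spacy_tokens)
```

The spaCy tokens line up with the first output above, while the whitespace tokens line up with the third.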

14.2.2 sentences

for sent in doc.sents:
    print(sent, end="\n\n")
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]

At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]

The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.

With a population of more than 331 million people, it is the third most populous country in the world.

The national capital is Washington, D.C., and the most populous city is New York.

The doc.sents attribute is a generator, so it cannot be indexed directly. To pick out a sentence by position, convert it to a list first.

sentence1 = list(doc.sents)[0]
print (sentence1)
The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
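
The following is a small sketch of the same idea that runs without a trained model. It assumes a blank English pipeline with the rule-based sentencizer component, which stands in for the parser-based sentence boundaries of en_core_web_sm. The example text and variable names are made up for illustration.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # rule-based sentence splitter, no model needed

doc = nlp("This is one sentence. Here is another. And a third.")

# doc.sents yields Span objects lazily, so it supports neither len() nor indexing.
sentences = list(doc.sents)
first_sentence = sentences[0]

# If only the first sentence is needed, next() avoids building the full list.
first_again = next(iter(doc.sents))

print(len(sentences), first_sentence.text)
```

Converting to a list is convenient for short texts. For long documents, iterating over doc.sents directly avoids holding every sentence in memory at once.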

14.2.3 Token Attributes

The Token object has many attributes that are vital to performing NLP in spaCy. We will be working with a few of them, such as:

  • .text
  • .head
  • .left_edge
  • .right_edge
  • .ent_type_
  • .iob_
  • .lemma_
  • .morph
  • .pos_
  • .dep_
  • .lang_

token attributes in spaCy: https://spacy.io/api/attributes

token = sentence1[2]
print(" 1", token.text)
print(" 2", token.ent_type_)
print(" 3", token.lemma_)
print(" 4", token.is_alpha)
print(" 5", token.is_space)
print(" 6", token.is_stop)
print(" 7", token.is_punct)
print(" 8", token.is_currency)
print(" 9", token.is_upper)
print("10", token.pos_) # Part of Speech
 1 States
 2 GPE
 3 States
 4 True
 5 False
 6 False
 7 False
 8 False
 9 False
10 PROPN
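
Some of these attributes are purely lexical and do not depend on a trained model, whereas attributes such as pos_ and ent_type_ are filled in by pipeline components. The sketch below illustrates the lexical ones on a blank English pipeline. The example sentence and the flags dictionary are assumptions made for illustration.

```python
import spacy

# Only the tokenizer is present, so only lexical attributes are meaningful here.
nlp = spacy.blank("en")
doc = nlp("The price is $9.40 today!")

# Collect a few boolean flags per token, keyed by the token text.
flags = {
    token.text: {
        "is_alpha": token.is_alpha,
        "is_punct": token.is_punct,
        "is_currency": token.is_currency,
        "like_num": token.like_num,
        "is_stop": token.is_stop,
    }
    for token in doc
}

for text, values in flags.items():
    print(text, values, sep="\t")
```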

14.2.4 Part of Speech Tagging (POS)

In computational linguistics, understanding parts of speech is essential. spaCy offers an easy way to parse a text and identify its parts of speech. Below, we iterate over each token (word or punctuation mark) in the first sentence and print its part-of-speech tag and its dependency label.

for token in sentence1:
    print (token.text, token.pos_, token.dep_, sep="\t")
The DET det
United  PROPN   compound
States  PROPN   nsubj
of  ADP prep
America PROPN   pobj
(   PUNCT   punct
U.S.A.  PROPN   appos
or  CCONJ   cc
USA PROPN   conj
)   PUNCT   punct
,   PUNCT   punct
commonly    ADV advmod
known   VERB    acl
as  ADP prep
the DET det
United  PROPN   compound
States  PROPN   pobj
(   PUNCT   punct
U.S.    PROPN   appos
or  CCONJ   cc
US  PROPN   conj
)   PUNCT   punct
or  CCONJ   cc
America PROPN   conj
,   PUNCT   punct
a   DET det
country NOUN    attr
primarily   ADV advmod
located VERB    acl
in  ADP prep
North   PROPN   compound
America PROPN   pobj
.   PUNCT   punct
from spacy import displacy
displacy.render(sentence1, style="dep")
[Figure: displaCy dependency visualization of sentence1. Each token is shown with its part-of-speech tag, and labeled arcs such as det, compound, nsubj, prep, and pobj connect tokens to their heads.]

14.2.5 Named Entity Recognition

Another essential NLP task is named entity recognition, or NER. I spoke about NER in the last notebook. Here, I'd like to demonstrate how to perform basic NER with spaCy. Again, we iterate over the doc object as we did above, but instead of iterating over doc.sents, we iterate over doc.ents. For now, I simply want to print each entity's text (the string itself) and its corresponding label (note the trailing _ in label_, which returns the label as a string rather than as an integer ID). I will explain this process in much greater detail in the next two notebooks.

for ent in doc.ents:
    print (ent.text, ent.label_)
The United States of America GPE
the United States GPE
America GPE
North America LOC
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
Russia GPE
more than 331 million CARDINAL
Washington GPE
New York GPE
Type          Description
PERSON        People, including fictional.
NORP          Nationalities or religious or political groups.
FAC           Buildings, airports, highways, bridges, etc.
ORG           Companies, agencies, institutions, etc.
GPE           Countries, cities, states.
LOC           Non-GPE locations, mountain ranges, bodies of water.
PRODUCT       Objects, vehicles, foods, etc. (Not services.)
EVENT         Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART   Titles of books, songs, etc.
LAW           Named documents made into laws.
LANGUAGE      Any named language.
DATE          Absolute or relative dates or periods.
TIME          Times smaller than a day.
PERCENT       Percentage, including "%".
MONEY         Monetary values, including unit.
QUANTITY      Measurements, as of weight or distance.
ORDINAL       "first", "second", etc.
CARDINAL      Numerals that do not fall under another type.

We can pass each entity's label to spacy.explain() to print a short description alongside it.

for ent in doc.ents:
    print(f'Entity: {ent}, Label: {ent.label_}, {spacy.explain(ent.label_)}')
Entity: The United States of America, Label: GPE, Countries, cities, states
Entity: U.S.A., Label: GPE, Countries, cities, states
Entity: USA, Label: GPE, Countries, cities, states
Entity: the United States, Label: GPE, Countries, cities, states
Entity: U.S., Label: GPE, Countries, cities, states
Entity: US, Label: GPE, Countries, cities, states
Entity: America, Label: GPE, Countries, cities, states
Entity: North America, Label: LOC, Non-GPE locations, mountain ranges, bodies of water
Entity: 50, Label: CARDINAL, Numerals that do not fall under another type
Entity: five, Label: CARDINAL, Numerals that do not fall under another type
Entity: 326, Label: CARDINAL, Numerals that do not fall under another type
Entity: Indian, Label: NORP, Nationalities or religious or political groups
Entity: 3.8 million square miles, Label: QUANTITY, Measurements, as of weight or distance
Entity: 9.8 million square kilometers, Label: QUANTITY, Measurements, as of weight or distance
Entity: fourth, Label: ORDINAL, "first", "second", etc.
Entity: The United States, Label: GPE, Countries, cities, states
Entity: Canada, Label: GPE, Countries, cities, states
Entity: Mexico, Label: GPE, Countries, cities, states
Entity: Bahamas, Label: GPE, Countries, cities, states
Entity: Cuba, Label: GPE, Countries, cities, states
Entity: Russia, Label: GPE, Countries, cities, states
Entity: more than 331 million, Label: CARDINAL, Numerals that do not fall under another type
Entity: third, Label: ORDINAL, "first", "second", etc.
Entity: Washington, Label: GPE, Countries, cities, states
Entity: D.C., Label: GPE, Countries, cities, states
Entity: New York, Label: GPE, Countries, cities, states
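
spacy.explain() is a plain glossary lookup, so it works without loading a pipeline and also covers dependency labels and part-of-speech tags. A minimal sketch, where the list of labels is chosen only for illustration:

```python
import spacy

# Entity labels, a dependency label, and a part-of-speech tag from this chapter.
labels = ["GPE", "LOC", "NORP", "nsubj", "PROPN"]

# Build a small lookup table from label to human-readable description.
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(label, description, sep="\t")
```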
displacy.render(doc, style="ent")
[Figure: displaCy entity visualization of the full text, with each recognized entity highlighted and tagged with its label (GPE, LOC, CARDINAL, NORP, QUANTITY, ORDINAL).]

14.2.6 noun phrases

for noun_chunk in doc.noun_chunks:
    # Print noun chunk
    print(noun_chunk)
The United States
the United States
a country
North America
50 states
a federal district
five major unincorporated territories
326 Indian reservations
3.8 million square miles
9.8 million square kilometers
the world's third- or fourth-largest country
total area.[d
The United States
significant land borders
the north
the south
limited maritime borders
the Bahamas
a population
more than 331 million people
the third most populous country
the world
The national capital
the most populous city
New York

14.3 Standard Pipes from spaCy


Figure 14.2 The spaCy processing pipeline

14.3.1 add pipeline example

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
<spacy.pipeline.sentencizer.Sentencizer at 0x1dc1fa3ed00>
%timeit doc = nlp(text)
170 µs ± 49.2 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
nlp2 = spacy.load("en_core_web_sm")
%timeit doc = nlp2(text)
24.1 ms ± 409 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
nlp.pipeline
[('sentencizer', <spacy.pipeline.sentencizer.Sentencizer at 0x1dc1fa3ed00>)]
nlp2.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1dc1f8f3100>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1dc1f8f32e0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1dc236d5c80>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1dc1f9f6980>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1dc1fa0e7c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1dc1b776cf0>)]
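
The timing gap above comes from how many components each pipeline runs. A blank pipeline holds only the tokenizer until components are added. The short sketch below inspects a blank English pipeline before and after adding the sentencizer, using nlp.pipe_names and nlp.has_pipe. The variable names are illustrative.

```python
import spacy

nlp = spacy.blank("en")

# A blank pipeline has no components; the tokenizer is not listed as one.
names_before = list(nlp.pipe_names)

# Add the rule-based sentence splitter as the only component.
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)

# has_pipe() checks for a component by name without touching the pipeline.
has_parser = nlp.has_pipe("parser")

print(names_before, names_after, has_parser)
```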

14.3.2 customize pipeline

# Load a small language model for English, but exclude named entity
# recognition ('ner') and syntactic dependency parsing ('parser').
nlp = spacy.load('en_core_web_sm', exclude=['ner', 'parser'])
# Examine the active components under the Language object 'nlp'
nlp.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1dc21cb4a00>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1dc238acee0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1dc25aaec80>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1dc25ac5500>)]
# Analyse the pipeline and store the analysis under 'pipe_analysis'
pipe_analysis = nlp.analyze_pipes(pretty=True)

============================= Pipeline Overview =============================

#   Component         Assigns       Requires   Scores      Retokenizes
-   ---------------   -----------   --------   ---------   -----------
0   tok2vec           doc.tensor                           False      
1   tagger            token.tag                tag_acc     False      
2   attribute_ruler                                        False      
3   lemmatizer        token.lemma              lemma_acc   False      

✔ No problems found.
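
Excluding components at load time removes them for good. When a component is only unwanted for part of a job, nlp.select_pipes() can switch it off temporarily instead. The sketch below shows this on a blank English pipeline with a sentencizer, which is an assumption made so that the example runs without a trained model. The same call takes names such as 'ner' or 'parser' on a full pipeline.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")
sample = "One sentence. Another sentence."

# Inside the block the sentencizer is disabled, so no sentence boundaries are set.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    sents_inside = nlp(sample).has_annotation("SENT_START")

# Leaving the block restores the component automatically.
names_after = list(nlp.pipe_names)
sents_after = nlp(sample).has_annotation("SENT_START")

print(names_inside, sents_inside)
print(names_after, sents_after)
```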

14.3.3 Merging noun phrases and named entities

import spacy
nlp = spacy.load("en_core_web_sm")
# Add the built-in component that merges each noun chunk into a single token
nlp.add_pipe("merge_noun_chunks")
doc = nlp(text)
nlp.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1dc22fad400>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1dc22fadfa0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1dc26880d60>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1dc280c4b00>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1dc280d21c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1dc26880ba0>),
 ('merge_noun_chunks',
  <function spacy.pipeline.functions.merge_noun_chunks(doc: spacy.tokens.doc.Doc) -> spacy.tokens.doc.Doc>)]
for i, token in enumerate(doc[:10]):
    print(i, token, sep="\t")
0   The United States
1   of
2   America
3   (
4   U.S.A.
5   or
6   USA
7   )
8   ,
9   commonly
# Swap the noun-chunk merger for the built-in component that merges named entities
nlp.remove_pipe("merge_noun_chunks")
nlp.add_pipe("merge_entities")
nlp.pipeline
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1dc22fad400>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1dc22fadfa0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1dc26880d60>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1dc280c4b00>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1dc280d21c0>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1dc26880ba0>),
 ('merge_entities',
  <function spacy.pipeline.functions.merge_entities(doc: spacy.tokens.doc.Doc)>)]
doc = nlp(text)
for i, token in enumerate(doc[:10]):
    print(i, token, sep="\t")
0   The United States of America
1   (
2   U.S.A.
3   or
4   USA
5   )
6   ,
7   commonly
8   known
9   as
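
Both merging components rely on spaCy's retokenizer. The sketch below merges a span by hand with doc.retokenize() to show the effect on token boundaries. The span is chosen manually here, which is an assumption made so that the example runs on a blank pipeline. In the components above, the spans come from doc.noun_chunks and doc.ents instead.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The United States of America is a country.")

tokens_before = [token.text for token in doc]

# Merge the first five tokens into one; the Doc is modified in place
# when the context manager exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[0:5])

tokens_after = [token.text for token in doc]

print(tokens_before)
print(tokens_after)
```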

14.4 DTM construction with spaCy

# Create list
examples = ["Helsinki is the capital of Finland",
            "Tallinn is the capital of Estonia",
            "The two capitals are joined by a ferry connection",
            "Travelling between Helsinki and Tallinn takes about two hours",
            "Ferries depart from downtown Helsinki and Tallinn"]

docs = list(nlp.pipe(examples))

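LEMMA is an attribute ID from spacy.attrs that identifies the lemma feature. Passing it to the count_by() method of a Doc object tells spaCy to count how often each lemma occurs. The result is a dictionary that maps the hash value of each lemma to its count, so the hashes have to be looked up in the vocabulary to recover the strings.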

from spacy.attrs import LEMMA
lemma_counts = {i: doc.count_by(LEMMA) for i, doc in enumerate(docs)}
{0: {332692160570289739: 1,
  10382539506755952630: 1,
  7425985699627899538: 1,
  15481038060779608540: 1,
  886050111519832510: 1,
  4881666681900411319: 1},
 1: {7392857733388117912: 1,
  10382539506755952630: 1,
  7425985699627899538: 1,
  15481038060779608540: 1,
  886050111519832510: 1,
  15428882767191480669: 1},
 2: {7425985699627899538: 1,
  11711838292424000352: 1,
  15481038060779608540: 1,
  10382539506755952630: 1,
  16238441731120403936: 1,
  16764210730586636600: 1,
  11901859001352538922: 1,
  16008623592554433546: 1,
  14753437861310164020: 1},
 3: {9016120516514741834: 1,
  7508752285157982505: 1,
  332692160570289739: 1,
  2283656566040971221: 1,
  7392857733388117912: 1,
  6789454535283781228: 1,
  883782512640661246: 1},
 4: {16008623592554433546: 1,
  11568774473013387390: 1,
  7831658034963690409: 1,
  18137549281339502438: 1,
  332692160570289739: 1,
  2283656566040971221: 1,
  7392857733388117912: 1}}
lemma_counts = {i: {docs[i].vocab[k].text: v for k, v in counter.items()}
    for i, counter in lemma_counts.items()}
import pandas as pd
df = pd.DataFrame.from_dict(lemma_counts).sort_index(ascending=True)
df = df.fillna(0).T
   Estonia  Finland  Helsinki  Tallinn    a  about two hour  and   be  between   by  ...  depart  downtown  ferry  from  join   of  take  the  travel  two
0      0.0      1.0       1.0      0.0  0.0             0.0  0.0  1.0      0.0  0.0  ...     0.0       0.0    0.0   0.0   0.0  1.0   0.0  1.0     0.0  0.0
1      1.0      0.0       0.0      1.0  0.0             0.0  0.0  1.0      0.0  0.0  ...     0.0       0.0    0.0   0.0   0.0  1.0   0.0  1.0     0.0  0.0
2      0.0      0.0       0.0      0.0  1.0             0.0  0.0  1.0      0.0  1.0  ...     0.0       0.0    1.0   0.0   1.0  0.0   0.0  1.0     0.0  1.0
3      0.0      0.0       1.0      1.0  0.0             1.0  1.0  0.0      1.0  0.0  ...     0.0       0.0    0.0   0.0   0.0  0.0   1.0  0.0     1.0  0.0
4      0.0      0.0       1.0      1.0  0.0             0.0  1.0  0.0      0.0  0.0  ...     1.0       1.0    1.0   1.0   0.0  0.0   0.0  0.0     0.0  0.0

5 rows × 22 columns
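
The hash-to-string step can be tried in isolation. The sketch below counts by LOWER instead of LEMMA. This is a deliberate substitution so that the example runs on a blank pipeline, because lemmas require a trained or rule-based lemmatizer. The lookup through nlp.vocab.strings is an equivalent alternative to the vocab[k].text lookup used above.

```python
import spacy
from spacy.attrs import LOWER

nlp = spacy.blank("en")
doc = nlp("The cat saw the other cat")

# count_by() returns {hash: count}; the keys are 64-bit string hashes.
hash_counts = doc.count_by(LOWER)

# Resolve each hash back to its string through the shared StringStore.
counts = {nlp.vocab.strings[key]: n for key, n in hash_counts.items()}

print(counts)
```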

14.4.1 manipulate data with pandas

word_df = df.stack().reset_index(). \
    rename(columns = {'level_0': "id", "level_1":"word", 0: "N"}). \
    sort_values(by=['id', "N"])
id word N
0 0 Estonia 0.0
3 0 Tallinn 0.0
4 0 a 0.0
5 0 about two hour 0.0
6 0 and 0.0
... ... ... ...
94 4 and 1.0
100 4 depart 1.0
101 4 downtown 1.0
102 4 ferry 1.0
103 4 from 1.0

110 rows × 3 columns

word_df.query('N > 0')
id word N
1 0 Finland 1.0
2 0 Helsinki 1.0
7 0 be 1.0
10 0 capital 1.0
17 0 of 1.0
19 0 the 1.0
22 1 Estonia 1.0
25 1 Tallinn 1.0
29 1 be 1.0
32 1 capital 1.0
39 1 of 1.0
41 1 the 1.0
48 2 a 1.0
51 2 be 1.0
53 2 by 1.0
54 2 capital 1.0
55 2 connection 1.0
58 2 ferry 1.0
60 2 join 1.0
63 2 the 1.0
65 2 two 1.0
68 3 Helsinki 1.0
69 3 Tallinn 1.0
71 3 about two hour 1.0
72 3 and 1.0
74 3 between 1.0
84 3 take 1.0
86 3 travel 1.0
90 4 Helsinki 1.0
91 4 Tallinn 1.0
94 4 and 1.0
100 4 depart 1.0
101 4 downtown 1.0
102 4 ferry 1.0
103 4 from 1.0
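
The wide-to-long reshaping is easier to follow on a tiny table. The sketch below applies the same stack, reset_index, and query steps to a made-up two-document matrix. The column values are illustrative and not taken from the data above.

```python
import pandas as pd

# A toy document-term matrix: rows are documents, columns are words.
dtm = pd.DataFrame(
    {"capital": [1.0, 0.0], "ferry": [0.0, 1.0], "the": [1.0, 1.0]},
    index=[0, 1],
)

# stack() turns columns into an inner index level, giving one row per (doc, word).
long_df = dtm.stack().reset_index()
long_df.columns = ["id", "word", "N"]

# Keep only the words that actually occur in each document.
present = long_df.query("N > 0")

print(present)
```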

14.4.2 wordcloud

conda install -c conda-forge wordcloud
pip install stylecloud
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import random
worddic = {row.word: row.N + random.randint(5, 15)  for row in word_df.itertuples()}
from wordcloud import WordCloud
wordcloud = WordCloud(
        background_color="white", max_words=100,
        max_font_size=250, random_state=42, width=1000,
        height=800, margin=2).generate_from_frequencies(worddic)
plt.imshow(wordcloud, interpolation='bilinear')

import stylecloud
#def gen_stylecloud(text=None,
#    file_path=None,   # path to the input text/CSV file
#    size=512,  # size of the stylecloud (width and height)
#    icon_name='fas fa-flag',  # icon name for the stylecloud shape (e.g. fas fa-grin)
#    palette='cartocolors.qualitative.Bold_5',  # color palette (via palettable)
#    colors=None,
#    background_color="white",  # background color
#    max_font_size=200,  # maximum font size in the stylecloud
#    max_words=2000,  # maximum number of words the stylecloud can contain
#    stopwords=True,  # boolean; filter out common stop words
#    custom_stopwords=STOPWORDS,
#    icon_dir='.temp',
#    output_name='stylecloud.png',   # output file name of the stylecloud
#    gradient=None,  # gradient direction
#    font_path=os.path.join(STATIC_PATH,'Staatliches-Regular.ttf'), # font used by the stylecloud
#    random_state=None,  # random state controlling word placement and colors
#    collocations=True,
#    invert_mask=False,
#    pro_icon_path=None,
#    pro_css_path=None)
stylecloud.gen_stylecloud(text = text)

Font Awesome icons for the shape: https://fontawesome.com/v4.7.0/icons/
Google Fonts for the font file: https://fonts.google.com/

import spacy
nlp = spacy.load('zh_core_web_trf')

doc = nlp('暨南大学是中国第一所由政府创办的华侨学府。“暨南”二字出自《尚书·禹贡》:“东渐于海,西被于流沙,朔南暨,声教讫于四海。”意即面向南洋,将中华文化远播到五洲四海。学校目前是中央统战部、教育部、广东省共建的国家“双一流”建设高校,直属中央统战部管理。')

doc = " ".join([token.text for token in doc if not token.is_stop])
暨南 大学 中国 第一 政府 创办 华侨 学府 暨南 二字 出自 尚书·禹贡 东渐于海 西 流沙 朔南 暨 声 教讫 四海 意 即面 南洋 中华 文化 远播 五洲 四海 学校 中央 统战部 教育部 广东省 共建 国家 双一流 建设 高校 直属 中央 统战部 管理
stylecloud.gen_stylecloud(
    text = doc,
    font_path = "data/MaShanZheng-Regular.ttf",
)

stylecloud.gen_stylecloud(
    text = doc,
    font_path = "data/MaShanZheng-Regular.ttf",
    icon_name='fas fa-heart',
)

14.4.3 similarity matrix

from sklearn.metrics.pairwise import cosine_similarity
# Evaluate cosine similarity between vectors
sim = cosine_similarity(df.values)
array([[1.        , 0.66666667, 0.40824829, 0.15430335, 0.15430335],
       [0.66666667, 1.        , 0.40824829, 0.15430335, 0.15430335],
       [0.40824829, 0.40824829, 1.        , 0.        , 0.12598816],
       [0.15430335, 0.15430335, 0.        , 1.        , 0.42857143],
       [0.15430335, 0.15430335, 0.12598816, 0.42857143, 1.        ]])
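
The first off-diagonal entry can be checked by hand. The first two example sentences share four of their six lemmas (be, the, capital, of), so their cosine similarity is 4 / (√6 · √6) = 2/3. A minimal NumPy sketch of that calculation, restricted to the eight lemmas that occur in these two sentences:

```python
import numpy as np

# Columns: Helsinki, Finland, Tallinn, Estonia, be, the, capital, of
doc0 = np.array([1, 1, 0, 0, 1, 1, 1, 1], dtype=float)
doc1 = np.array([0, 0, 1, 1, 1, 1, 1, 1], dtype=float)


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


similarity = cosine(doc0, doc1)
print(similarity)
```

Leaving out the columns that are zero in both documents does not change the result, because they contribute nothing to the dot product or to the norms.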

14.5 token/Doc Similarity

a = docs[0][0]
b = docs[0][-1]
a.similarity(b)
C:\Users\xinlu\AppData\Local\Temp\ipykernel_40628\3344546738.py:1: UserWarning:

[W007] The model you're using has no word vectors loaded, so the result of the Token.similarity method will be based on the tagger, parser and NER, which may not give useful similarity judgements. This may happen if you're using one of the small models, e.g. `en_core_web_sm`, which don't ship with word vectors and only use context-sensitive tensors. You can always add your own word vectors, or use one of the larger models instead if available.

In spaCy we can do this same thing at the document level. Through word vectors we can calculate the similarity between two documents. Let’s look at the example from spaCy’s documentation.

nlp = spacy.load('en_core_web_lg')  # make sure to use larger package!
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, doc2, doc1.similarity(doc2), sep='\n')
I like salty fries and hamburgers.
Fast food tastes very good.

14.6 transformer model in spaCy

nlp_trf = spacy.load('en_core_web_trf')
nlp_trf.pipeline
[('transformer',
  <spacy_transformers.pipeline_component.Transformer at 0x1dc3cd93100>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1dc29cd5dc0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1dc2b700cf0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1dc53115c80>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1dc53118380>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1dc417a37b0>)]
# Feed an example sentence to the model; store output under 'example_doc'
example_doc = nlp_trf("Helsinki is the capital of Finland.")

# Check the length of the Doc object
len(example_doc)

The first item in the tensors list under index 0 contains the output for individual Tokens.

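# Check the shape of the first item in the list
# (assumes the spacy-transformers trf_data.tensors API that ships with the 3.5 models used in this chapter)
example_doc._.trf_data.tensors[0].shape
(1, 11, 768)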

The second item under index 1 holds the output for the entire Doc.

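# Check the shape of the second item in the list
# (assumes the spacy-transformers trf_data.tensors API that ships with the 3.5 models used in this chapter)
example_doc._.trf_data.tensors[1].shape
(1, 768)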

In both cases, the Transformer output is stored in a tensor, which is a mathematical term for describing a “bundle” of numerical objects (e.g. vectors) and their shape.

In the case of Tokens, we have a batch of 1 that consists of 11 vectors with 768 dimensions each.

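We can look at a slice of these vectors using the expression [:10], which selects the first ten of the 11 vectors. NumPy abbreviates the printed rows and columns with an ellipsis.

Note that we need the preceding [0] to enter the first "batch" of vectors in the tensor.

# Check the first ten vectors of the tensor
example_doc._.trf_data.tensors[0][0][:10]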
array([[-0.14800896, -0.36132938, -0.52402526, ...,  0.33104226,
        -0.1899541 , -0.08459951],
       [-1.0568136 , -0.659217  ,  0.1645886 , ...,  1.0160642 ,
        -1.5546881 , -0.36001885],
       [-0.83849084, -0.5810041 , -0.07826854, ...,  0.55888575,
        -1.3520969 , -0.37117025],
       [-0.33014834, -0.22176161, -0.00310517, ...,  2.5474327 ,
         0.8233686 , -1.045297  ],
       [-0.7981953 , -1.5335865 , -0.16577926, ...,  1.4347023 ,
        -0.6925767 , -0.33236212],
       [-0.8362223 , -0.81882554, -1.3890662 , ...,  0.10648382,
         0.17524342, -0.29888782]], dtype=float32)
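
The effect of the [0] index and of slicing on the tensor shape can be checked on a placeholder array of the same shape. The sketch below uses a NumPy array of zeros as a stand-in for the Transformer output, so only the shapes are meaningful.

```python
import numpy as np

# Stand-in for the token-level output: 1 batch, 11 vectors, 768 dimensions.
tensor = np.zeros((1, 11, 768), dtype=np.float32)

# [0] enters the first (and only) batch.
batch = tensor[0]

# Slicing the first axis selects vectors (rows).
first_ten_vectors = batch[:10]

# Slicing the last axis selects dimensions within every vector.
first_ten_dims = batch[:, :10]

print(batch.shape, first_ten_vectors.shape, first_ten_dims.shape)
```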

spaCy Doc object with 7 Token objects represented by 11 vectors

Figure 14.3 spaCy tokens vs. Transformer tokens
# Access the Transformer tokens under the key 'input_texts'
example_doc._.trf_data.tokens['input_texts']

14.7 Chinese NLP examples

14.7.1 Custom dictionary

import spacy
nlp = spacy.load('zh_core_web_sm')
doc = nlp('调整给水,注意给水流量与蒸汽流量相匹配,注意过热度,保证主蒸汽温度不超限。')

token_list = [f"{i}\t{token.text}\t{token.is_stop}" for i, token in enumerate(doc)]
print("\n".join(token_list))
0   调整  False
1   给水  False
2   ,   True
3   注意  True
4   给   True
5   水流量 False
6   与   True
7   蒸汽  False
8   流量  False
9   相匹配 False
10  ,   True
11  注意  True
12  过   True
13  热度  False
14  ,   True
15  保证  False
16  主蒸  False
17  汽温度 False
18  不   True
19  超限  False
20  。   True
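proper_nouns = ['给水流量','蒸汽流量','过热度','主蒸汽']
# Add the domain terms to the user dictionary of the pkuseg word segmenter
nlp.tokenizer.pkuseg_update_user_dict(proper_nouns)
doc = nlp('调整给水,注意给水流量与蒸汽流量相匹配,注意过热度,保证主蒸汽温度不超限。')

token_list = [f"{i}\t{token.text}\t{token.is_stop}" for i, token in enumerate(doc)]
print("\n".join(token_list))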
0   调整  False
1   给水  False
2   ,   True
3   注意  True
4   给水流量    False
5   与   True
6   蒸汽流量    False
7   相匹配 False
8   ,   True
9   注意  True
10  过热度 False
11  ,   True
12  保证  False
13  主蒸汽 False
14  温度  False
15  不   True
16  超限  False
17  。   True

14.7.2 Custom stop words

from spacy.lang.zh.stop_words import STOP_WORDS

print(STOP_WORDS) # <- set of spaCy's default stop words

for word in ["保证", "超限"]:
    STOP_WORDS.add(word) # add stop words
    lexeme = nlp.vocab[word]
    lexeme.is_stop = True
doc = nlp('调整给水,注意给水流量与蒸汽流量相匹配,注意过热度,保证主蒸汽温度不超限。')

token_list = [f"{i}\t{token.text}\t{token.is_stop}" for i, token in enumerate(doc)]
print("\n".join(token_list))
0   调整  False
1   给水  False
2   ,   True
3   注意  True
4   给水流量    False
5   与   True
6   蒸汽流量    False
7   相匹配 False
8   ,   True
9   注意  True
10  过热度 False
11  ,   True
12  保证  True
13  主蒸汽 False
14  温度  False
15  不   True
16  超限  True
17  。   True
for word in ["保证", "超限"]:
    STOP_WORDS.remove(word) # remove stop words
    lexeme = nlp.vocab[word]
    lexeme.is_stop = False

doc = nlp('调整给水,注意给水流量与蒸汽流量相匹配,注意过热度,保证主蒸汽温度不超限。')

token_list = [f"{i}\t{token.text}\t{token.is_stop}" for i, token in enumerate(doc)]
print("\n".join(token_list))
0   调整  False
1   给水  False
2   ,   True
3   注意  True
4   给水流量    False
5   与   True
6   蒸汽流量    False
7   相匹配 False
8   ,   True
9   注意  True
10  过热度 False
11  ,   True
12  保证  False
13  主蒸汽 False
14  温度  False
15  不   True
16  超限  False
17  。   True
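
The same mechanism applies to other languages. The sketch below repeats the idea on a blank English pipeline, which is an assumption made so that it runs without the Chinese model, and then filters the stop words out. It only sets the is_stop flag on the vocabulary entries, since that flag is what token.is_stop reads. The example sentence is made up.

```python
import spacy

nlp = spacy.blank("en")
sentence = "guarantee the temperature limit"

before = [(token.text, token.is_stop) for token in nlp(sentence)]

# Mark two domain words as stop words on their vocabulary entries.
for word in ["guarantee", "limit"]:
    nlp.vocab[word].is_stop = True

after = [(token.text, token.is_stop) for token in nlp(sentence)]

# A typical use: keep only the tokens that are not stop words.
content_words = [token.text for token in nlp(sentence) if not token.is_stop]

print(before)
print(after)
print(content_words)
```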

14.8 Speeding up processing with nlp.pipe

When processing many documents, use nlp.pipe. It takes the texts as a sequence of strings and processes them in batches, which is much more efficient than calling nlp on each text one by one. If you are processing only a single text, call nlp(text) directly instead. Source: https://spacy.io/usage/processing-pipelines

  1. Save the code below into the file example.py
  2. Run it in a terminal: python example.py
import spacy

texts = [
    "Net income was $9.4 million compared to the prior year of $2.7 million.",
    "Revenue exceeded twelve billion dollars, with a loss of $1b.",
]

# The __main__ guard is needed when n_process > 1 on platforms that start
# worker processes with "spawn" (e.g. Windows and macOS).
if __name__ == "__main__":
    nlp = spacy.load("en_core_web_sm")
    for doc in nlp.pipe(texts, n_process=4, batch_size=200):
        # Do something with the doc here
        print([(ent.text, ent.label_) for ent in doc.ents])
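
When each text comes with metadata such as a record ID, nlp.pipe can carry it along through the as_tuples=True option, which yields (doc, context) pairs. The sketch below uses a blank English pipeline and made-up records so that it runs without a trained model. With en_core_web_sm loaded instead, doc.ents would be available in the loop as well.

```python
import spacy

nlp = spacy.blank("en")

# Each record is a (text, context) pair; the context is passed through untouched.
records = [
    ("Net income was $9.4 million.", {"id": 1}),
    ("Revenue exceeded twelve billion dollars.", {"id": 2}),
]

token_counts = {}
for doc, context in nlp.pipe(records, as_tuples=True, batch_size=2):
    # Store the number of tokens per record ID.
    token_counts[context["id"]] = len(doc)

print(token_counts)
```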

14.9 spaCy tutorials / models