File size: 3,388 Bytes
cfd3735
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "dab86b60",
   "metadata": {},
   "source": [
    "# Spacy Text Splitter\n",
    "Another alternative to NLTK is to use Spacy.\n",
    "\n",
    "1. How the text is split: by Spacy\n",
    "2. How the chunk size is measured: by length function passed in (defaults to number of characters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "f1de7767",
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a long document we can split up.\n",
    "with open('../../../state_of_the_union.txt') as f:\n",
    "    state_of_the_union = f.read()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "f4ec9b90",
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain.text_splitter import SpacyTextSplitter\n",
    "text_splitter = SpacyTextSplitter(chunk_size=1000)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "cef2b29e",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman.\n",
      "\n",
      "Members of Congress and the Cabinet.\n",
      "\n",
      "Justices of the Supreme Court.\n",
      "\n",
      "My fellow Americans.  \n",
      "\n",
      "\n",
      "\n",
      "Last year COVID-19 kept us apart.\n",
      "\n",
      "This year we are finally together again. \n",
      "\n",
      "\n",
      "\n",
      "Tonight, we meet as Democrats Republicans and Independents.\n",
      "\n",
      "But most importantly as Americans. \n",
      "\n",
      "\n",
      "\n",
      "With a duty to one another to the American people to the Constitution. \n",
      "\n",
      "\n",
      "\n",
      "And with an unwavering resolve that freedom will always triumph over tyranny. \n",
      "\n",
      "\n",
      "\n",
      "Six days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways.\n",
      "\n",
      "But he badly miscalculated. \n",
      "\n",
      "\n",
      "\n",
      "He thought he could roll into Ukraine and the world would roll over.\n",
      "\n",
      "Instead he met a wall of strength he never imagined. \n",
      "\n",
      "\n",
      "\n",
      "He met the Ukrainian people. \n",
      "\n",
      "\n",
      "\n",
      "From President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.\n"
     ]
    }
   ],
   "source": [
    "texts = text_splitter.split_text(state_of_the_union)\n",
    "print(texts[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ff3064a7",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.1"
  },
  "vscode": {
   "interpreter": {
    "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}