Overview
========

A regex that tokenizes JavaScript.

```js
var jsTokens = require("js-tokens").default
var jsString = "var foo=opts.foo;\n..."
jsString.match(jsTokens)
// ["var", " ", "foo", "=", "opts", ".", "foo", ";", "\n", ...]
```

Installation
============

`npm install js-tokens`

```js
import jsTokens from "js-tokens"
// or:
var jsTokens = require("js-tokens").default
```

Usage
=====

### `jsTokens` ###

A regex with the `g` flag that matches JavaScript tokens.

The regex _always_ matches, even invalid JavaScript and the empty string.

The next match is always directly after the previous.
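
Because each match starts where the previous one ended, the matched pieces can be joined back into the original input. The sketch below illustrates that property. `tokenize` is a hypothetical helper and `toyTokens` is a deliberately tiny stand-in regex, not the real `jsTokens`; passing the real regex as the first argument is assumed to work the same way.

```js
// Hypothetical helper: collects every match of a global regex into an array.
// lastIndex is reset first, since global regexes remember where they stopped.
function tokenize(regex, source) {
  var tokens = []
  var match
  regex.lastIndex = 0
  while ((match = regex.exec(source))) {
    tokens.push(match[0])
    // An empty match (for example on the empty string) does not advance
    // lastIndex, so stop here instead of looping forever.
    if (match[0] === "") break
  }
  return tokens
}

// Toy stand-in for jsTokens: whitespace, names, numbers, the empty string,
// or any single character.
var toyTokens = /\s+|[A-Za-z_$][\w$]*|\d+|^$|[\s\S]/g

var source = "var foo=opts.foo;\n"
var tokens = tokenize(toyTokens, source)
// tokens.join("") === source, because no character is skipped between matches.
```

The explicit check for an empty match matters for any regex that, like `jsTokens`, can match the empty string.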

### `var token = matchToToken(match)` ###

```js
import {matchToToken} from "js-tokens"
// or:
var matchToToken = require("js-tokens").matchToToken
```

Takes a `match` returned by `jsTokens.exec(string)`, and returns a
`{type: String, value: String}` object. The following types are available:

- string
- comment
- regex
- number
- name
- punctuator
- whitespace
- invalid

Multi-line comments and strings also have a `closed` property indicating if the
token was closed or not (see below).

Comments and strings both come in several flavors. To distinguish them, check if
the token starts with `//`, `/*`, `'`, `"` or `` ` ``.

Names are ECMAScript IdentifierNames, that is, including both identifiers and
keywords. You may use [is-keyword-js] to tell them apart.

Whitespace includes both line terminators and other whitespace.

[is-keyword-js]: https://github.com/crissdev/is-keyword-js
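
As a rough illustration of telling the flavors apart, the sketch below inspects how a token's `value` starts. `tokenFlavor` and the flavor labels are hypothetical names chosen for this example, and the tokens are written by hand rather than produced by `matchToToken`.

```js
// Hypothetical helper: refines a {type, value} token (the shape returned by
// matchToToken) into a more specific flavor by looking at how it starts.
function tokenFlavor(token) {
  var value = token.value
  if (token.type === "comment") {
    return value.slice(0, 2) === "//" ? "single-line comment" : "multi-line comment"
  }
  if (token.type === "string") {
    if (value[0] === "'") return "single-quoted string"
    if (value[0] === '"') return "double-quoted string"
    return "template string"
  }
  // All other types have a single flavor.
  return token.type
}

// Hand-written tokens for illustration only.
var examples = [
  {type: "comment", value: "// note"},
  {type: "comment", value: "/* note", closed: false},
  {type: "string", value: "`hi ${name}`", closed: true},
  {type: "name", value: "foo"}
]
var flavors = examples.map(tokenFlavor)
```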

ECMAScript support
==================

The intention is to always support the latest ECMAScript version whose feature
set has been finalized.

If adding support for a newer version requires changes, a new version with a
major version bump will be released.

Currently, ECMAScript 2018 is supported.

Invalid code handling
=====================

Unterminated strings are still matched as strings. JavaScript strings cannot
contain (unescaped) newlines, so unterminated strings simply end at the end of
the line. Unterminated template strings can contain unescaped newlines, though,
so they go on to the end of input.
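
The following is a minimal sketch of how an unterminated string can still be matched and how a `closed` flag can be derived. The pattern is a simplified stand-in for double-quoted strings only, not the regex used by this package, and `matchString` is a hypothetical helper.

```js
// Simplified double-quoted string pattern. The body stops at an unescaped
// newline, and the optional final group records whether a closing quote was found.
var stringLike = /"(?:[^"\\\r\n]|\\[\s\S])*(")?/

// Hypothetical helper: returns the matched text and whether it was closed.
function matchString(source) {
  var match = stringLike.exec(source)
  if (!match) return null
  return {value: match[0], closed: match[1] !== undefined}
}

var terminated = matchString('"abc" rest')
var unterminated = matchString('"abc\nrest')
// The unterminated string ends at the end of its line and is reported as not closed.
```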

Unterminated multi-line comments are also still matched as comments. They
simply go on to the end of the input.

Unterminated regex literals are likely matched as division and whatever is
inside the regex.

Invalid ASCII characters have their own capturing group.

Invalid non-ASCII characters are treated as names, to simplify the matching of
names (except unicode spaces which are treated as whitespace). Note: See also
the [ES2018](#es2018) section.

Regex literals may contain invalid regex syntax. They are still matched as
regex literals. They may also contain repeated regex flags, to keep the regex
simple.

Strings may contain invalid escape sequences.

Limitations
===========

Tokenizing JavaScript using regexes (in fact, _one single regex_) won’t be
perfect. But that’s not the point either.

You may compare jsTokens with [esprima] by using `esprima-compare.js`.
See `npm run esprima-compare`!

[esprima]: http://esprima.org/

### Template string interpolation ###

Template strings are matched as single tokens, from the starting `` ` `` to the
ending `` ` ``, including interpolations (whose tokens are not matched
individually).

Matching template string interpolations requires recursive balancing of `{` and
`}`, something that JavaScript regexes cannot do. Only one level of nesting is
supported.

### Division and regex literals collision ###

Consider this example:

```js
var g = 9.82
var number = bar / 2/g
var regex = / 2/g
```

A human can easily understand that in the `number` line we’re dealing with
division, and in the `regex` line we’re dealing with a regex literal. How come?
Because humans can look at the whole code to put the `/` characters in context.
A JavaScript regex cannot. It only sees forwards. (Well, ES2018 regexes can also
look backwards. See the [ES2018](#es2018) section.)

When the `jsTokens` regex scans through the above, it will see the following
at the end of both the `number` and `regex` rows:

```js
/ 2/g
```

It is then impossible to know if that is a regex literal, or part of an
expression dealing with division.
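
To make the collision concrete, here is a small sketch that applies a naive regex-literal pattern to both lines. `regexLiteralLike` is a simplified, hypothetical pattern written for this illustration, not the one used inside `jsTokens`.

```js
// Simplified, hypothetical regex-literal pattern: a slash, a body without
// unescaped slashes or newlines, a closing slash, then optional flags.
var regexLiteralLike = /\/(?:[^\/\\\r\n]|\\.)+\/[gmiyus]*/

var lines = ["var number = bar / 2/g", "var regex = / 2/g"]
var found = lines.map(function (line) {
  return regexLiteralLike.exec(line)[0]
})
// Both entries are the same text, so the matched text alone cannot
// distinguish division from a regex literal.
```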

Here is a similar case:

```js
foo /= 2/g
foo(/= 2/g)
```

The first line divides the `foo` variable by `2/g`. The second line calls the
`foo` function with the regex literal `/= 2/g`. Again, since `jsTokens` only
sees forwards, it cannot tell the two cases apart.

There are some cases where we _can_ tell division and regex literals apart,
though.

First off, we have the simple cases where there’s only one slash in the line:

```js
var foo = 2/g
foo /= 2
```

Regex literals cannot contain newlines, so the above cases are correctly
identified as division. Things are only problematic when there is more than
one non-comment slash in a single line.

Secondly, not every character is a valid regex flag.

```js
var number = bar / 2/e
```

The above example is also correctly identified as division, because `e` is not a
valid regex flag. I initially wanted to future-proof by allowing `[a-zA-Z]*`
(any letter) as flags, but it is not worth it since it increases the number of
ambiguous cases. So only the standard `g`, `m`, `i`, `y`, `u` and `s` flags are
allowed. This means that the above example will be identified as division as
long as you don’t rename the `e` variable to some permutation of `gmiyus` 1 to 6
characters long.
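
A quick way to check whether a variable name would reintroduce the ambiguity is sketched below. `looksLikeFlags` is a hypothetical pattern derived from the sentence above (any mix of the letters in `gmiyus`, 1 to 6 characters long); it is an assumption about the rule, not code taken from this package.

```js
// Sketch: would renaming the variable to `name` make `bar / 2/name` ambiguous?
// Assumes the flag set described above: any mix of g, m, i, y, u, s, 1 to 6 long.
var looksLikeFlags = /^[gmiyus]{1,6}$/

var names = ["e", "g", "gi", "sticky"]
var results = names.map(function (name) {
  return looksLikeFlags.test(name)
})
// "e" is not a flag letter, and "sticky" contains t, c and k,
// so only "g" and "gi" look like flags.
```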

Lastly, we can look _forward_ for information.

- If the token following what looks like a regex literal is not valid after a
  regex literal, but is valid in a division expression, then the regex literal
  is treated as division instead. For example, a flagless regex cannot be
  followed by a string, number or name, but all of those three can be the
  denominator of a division.
- Generally, if what looks like a regex literal is followed by an operator, the
  regex literal is treated as division instead. This is because regexes are
  seldom used with operators (such as `+`, `*`, `&&` and `==`), but division
  could likely be part of such an expression.
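
The first rule can be sketched with a negative lookahead. The pattern below is a simplified, hypothetical approximation (ASCII only, flagless literals only) and is much cruder than the real regex source; it only shows the idea of rejecting a regex-like match based on what follows it.

```js
// Simplified, hypothetical pattern: a flagless regex-like run of text that is
// NOT followed by the start of a string, number or name.
var flaglessRegexLike = /\/(?:[^\/\\\r\n]|\\.)+\/(?!\s*[\w$'"])/

// "/ b /" is followed by the name `c`, so it cannot be a flagless regex
// literal and no match is produced (the slashes are left to be division).
var division = flaglessRegexLike.exec("a / b / c")

// "/ab/" is followed by ")", which may follow a regex literal.
var literal = flaglessRegexLike.exec("foo(/ab/)")
```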

Please consult the regex source and the test cases for precise information on
when regex or division is matched (should you need to know). In short, you
could sum it up as:

If the end of a statement looks like a regex literal (even if it isn’t), it
will be treated as one. Otherwise it should work as expected (if you write sane
code).

### ES2018 ###

ES2018 added some nice regex improvements to the language.

- [Unicode property escapes] should allow telling names and invalid non-ASCII
  characters apart without blowing up the regex size.
- [Lookbehind assertions] should allow telling division and regex
  literals apart in more cases.
- [Named capture groups] might simplify some things.

These things would be nice to do, but are not critical. They probably have to
wait until the oldest maintained Node.js LTS release supports those features.
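
As an illustration of the first point, the sketch below uses Unicode property escapes to approximate an IdentifierName check. It assumes a runtime with ES2018 regex support, and `identifierName` is a hypothetical pattern that ignores `\uXXXX` escapes inside names; it is not part of this package.

```js
// Sketch only: approximates IdentifierName with Unicode property escapes.
// Ignores \uXXXX escapes inside names, which a full solution would handle.
var identifierName = /^[$_\p{ID_Start}][$\u200c\u200d\p{ID_Continue}]*$/u

// A name made of non-ASCII letters is accepted.
var isName = identifierName.test("föö")

// An arrow symbol is non-ASCII but not a valid identifier character.
var isArrowName = identifierName.test("→")
```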

[Unicode property escapes]: http://2ality.com/2017/07/regexp-unicode-property-escapes.html
[Lookbehind assertions]: http://2ality.com/2017/05/regexp-lookbehind-assertions.html
[Named capture groups]: http://2ality.com/2017/05/regexp-named-capture-groups.html

License
=======

[MIT](LICENSE).