Erlang Central

Matching Words

Revision as of 01:42, 4 September 2006 by Bfulgham (Talk | contribs)



You want to select words from a string.


Determine the defining features of a word for your specific application, then write a regular expression that models this idea.

Words_1 = "[^ ]+".        % as many non-space characters as possible
Words_2 = "[A-Za-z'-]+".  % as many letters, apostrophes, and hyphens

1> regexp:first_match("'alpha-beta gamma", Words_1).
{match,1,11}
2> string:substr("'alpha-beta gamma",1,11).
"'alpha-beta"
3> regexp:first_match("'alpha-beta&or gamma", Words_2).
{match,1,11}
4> string:substr("'alpha-beta&or gamma",1,11).
"'alpha-beta"


Erlang does not have a built-in definition of a word in a string. On the one hand, this is inconvenient, since you have to define your own meaning of "word". On the other hand, it is the correct behavior, since the concept of a word varies significantly between applications, locales, encodings, and input sources.

Even within a single application, a "word" can take many forms. Languages usually pluralize singular nouns, attach possessive modifiers, allow hyphenated word combinations, and so forth. The regular expression you use must cover the range of words you expect to encounter.

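At the time of this revision (2006), there was no Perl-compatible regular expression module for Erlang. Later Erlang/OTP releases include the re module in the standard library, which provides Perl-compatible regular expressions.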
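
As a minimal sketch, assuming an Erlang/OTP release that ships the standard re module (newer than this revision), every word in a string can be collected in one call instead of just the first match. The module name word_match and the functions words/2 and demo/0 are hypothetical names chosen for illustration; the two patterns are the ones defined above.

```erlang
-module(word_match).
-export([words/2, demo/0]).

%% Return every non-overlapping match of Pattern in String
%% as a list of strings, or [] when nothing matches.
words(String, Pattern) ->
    %% global: find all matches; {capture, first, list}: return only
    %% the whole match for each hit, as an Erlang string.
    case re:run(String, Pattern, [global, {capture, first, list}]) of
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch -> []
    end.

%% Apply both word definitions from this recipe to the same input.
demo() ->
    Input = "'alpha-beta&or gamma",
    {words(Input, "[^ ]+"), words(Input, "[A-Za-z'-]+")}.
```

Calling word_match:demo() shows how the choice of definition changes the result: the first pattern keeps "'alpha-beta&or" as a single word, while the second splits it into "'alpha-beta" and "or" because "&" is not in its character class.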