Erlang Central

Matching Words


You want to select words from a string.


Determine the defining features of a word for your specific application, then write a regular expression that models this idea.

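%% regexp:matches/2 reports each match as a 1-based {Start, Length} pair;
%% turn those pairs into the substrings they describe, in order.
matches(H,{match,M}) -> matches(H,M,[]).
matches(_,[],Acc) -> lists:reverse(Acc);
matches(H,[{I,L}|T],Acc) ->
    matches(H,T,[string:substr(H,I,L)|Acc]).

words(String, Regexp) -> matches(String,regexp:matches(String, Regexp)).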

Words_1 = "[^ ]+".        % as many non-space characters as possible
Words_2 = "[A-Za-z'-]+".  % as many letters, apostrophes, and hyphens as possible

1> words("'alpha-beta gamma theta", Words_1).
["'alpha-beta", "gamma", "theta"]
2> words("'alpha-beta&or gamma theta", Words_2).
["'alpha-beta", "or", "gamma", "theta"]
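The regexp module used above is not available in later OTP releases, where the re module takes its place. The following is a minimal sketch of the same words/2 function on top of re:run/3, not the original recipe's code; the module name words_re is only an illustration.

```erlang
-module(words_re).
-export([words/2]).

%% Return every non-overlapping match of Regexp in String, in order.
%% The `global` option makes re:run/3 find all matches, and
%% {capture, first, list} returns each whole match as a string
%% rather than as an offset/length pair.
words(String, Regexp) ->
    case re:run(String, Regexp, [global, {capture, first, list}]) of
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch -> []
    end.
```

Because re:run/3 can return the matched text directly, the helper that converted positions into substrings is no longer needed. A string with no matches yields an empty list instead of a failed function clause.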


Erlang has no built-in definition of a word in a string. This is inconvenient, because you must define what "word" means yourself, but it is also the right behavior: the concept of a word varies significantly between applications, locales, encodings, and input sources.

Even within a single application, the definition has to account for how the language forms words: most languages pluralize nouns, attach possessive modifiers, allow hyphenated compounds, and so forth. The regular expression must reflect the full range of words you expect to encounter.
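As one illustration of tailoring the pattern, the sketch below assumes a word must start and end with a letter, with apostrophes and hyphens allowed only between letters. This is just one possible definition, restricted to ASCII letters; the module name strict_words and the pattern itself are assumptions made for the example, and it uses the re module rather than regexp.

```erlang
-module(strict_words).
-export([words/1]).

%% One or more letters, optionally followed by groups of a single
%% apostrophe or hyphen and more letters. Leading or trailing
%% punctuation is therefore never part of a word.
-define(WORD, "[A-Za-z]+(?:['-][A-Za-z]+)*").

%% Return all words in String according to the ?WORD pattern.
words(String) ->
    case re:run(String, ?WORD, [global, {capture, first, list}]) of
        {match, Matches} -> [Word || [Word] <- Matches];
        nomatch -> []
    end.
```

With this pattern, "'alpha-beta&or gamma's theta -" yields "alpha-beta", "or", "gamma's", and "theta": the possessive and the hyphenated compound are kept whole, while the stray leading apostrophe and the lone hyphen are dropped.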