Exercise: auto-indexing books
Write a script that produces an index of the words in a text:
- everything mapped to lower case
- words are defined as
/[a-zA-Z']+/
- possessives are dropped (e.g. John's → John)
- all '-suffixes are dropped (e.g. they're → they)
- "stop words" are removed (see file stop.words)
- single-letter "words" are removed ("a" and "i" are already stop words)
- output shows: word line1,line2,... for each remaining word
Implement it as a shell script, as a PHP script and as a Perl script.
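
The rules above can be sketched in a few lines before porting them to the three target languages. The Python version below is only an illustrative reference, not a model solution. The names build_index and format_index, the sample text and the tiny stop-word set are all hypothetical; a real run would read the words from the file stop.words. It also makes three assumptions the exercise does not state: a '-suffix means everything from the first apostrophe onward, a word is listed once per line even if it occurs there several times, and the output is sorted alphabetically.

```python
import re

WORD_RE = re.compile(r"[a-zA-Z']+")  # word pattern given in the exercise


def build_index(lines, stop_words):
    """Map each indexed word to the list of line numbers it appears on."""
    index = {}
    for lineno, line in enumerate(lines, start=1):
        for token in WORD_RE.findall(line.lower()):
            # Assumption: drop everything from the first apostrophe onward,
            # which covers possessives (john's) and other '-suffixes (they're).
            word = token.split("'", 1)[0]
            if len(word) < 2 or word in stop_words:
                continue  # skip empty/single-letter words and stop words
            line_numbers = index.setdefault(word, [])
            # Assumption: record each line only once per word.
            if not line_numbers or line_numbers[-1] != lineno:
                line_numbers.append(lineno)
    return index


def format_index(index):
    """Render 'word line1,line2,...' entries, sorted by word (an assumption)."""
    return [
        f"{word} {','.join(map(str, nums))}"
        for word, nums in sorted(index.items())
    ]


# Hypothetical sample input; the stop-word set stands in for stop.words.
sample_text = [
    "John's book is a good book",
    "They're reading the book John wrote",
]
sample_stop_words = {"a", "i", "is", "the"}

for entry in format_index(build_index(sample_text, sample_stop_words)):
    print(entry)
```

With this sample input the first entry printed is "book 1,2", and "they're" is indexed as "they" on line 2. Stripping the suffix before the stop-word check matters: a form such as "they're" is then filtered whenever "they" is in the stop list.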