Exercise: auto-indexing books

Write a script that produces an index of the words in a text:
  • everything mapped to lower case
  • words are defined as /[a-zA-Z']+/
  • possessives are dropped (e.g. John's → John)
  • all '-suffixes are dropped (e.g. they're → they)
  • "stop words" are removed (see file stop.words)
  • remove single-letter "words" ("a" and "i" are stop words)
  • output shows "word line1,line2,..." for each remaining word, where line1,line2,... are the numbers of the lines containing it
Implement as shell, PHP and Perl scripts.
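
The rules above can be sketched in Python as a reference to compare your shell, PHP and Perl versions against. This is only one reading of the specification, not a model answer. The function names (build_index, format_index, load_stop_words) are hypothetical, and several details are assumptions: stop.words is taken to hold one word per line, stop words are checked after suffixes are dropped, leading apostrophes (as in a quoted 'word') are stripped, each line number is listed once per word, and the output is sorted alphabetically.

```python
import re
from collections import defaultdict

# A word is any run of letters and apostrophes, as the exercise defines it.
WORD_RE = re.compile(r"[a-zA-Z']+")


def load_stop_words(path):
    """Read stop words from a file (assumed format: one word per line)."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}


def build_index(lines, stop_words=frozenset()):
    """Map each indexed word to the sorted list of line numbers it occurs on."""
    index = defaultdict(set)
    for line_no, line in enumerate(lines, start=1):
        for token in WORD_RE.findall(line.lower()):
            # Assumption: strip leading apostrophes (opening quotes), then
            # drop everything from the first remaining apostrophe onwards,
            # which removes 's, 're, 'll and similar suffixes.
            word = token.lstrip("'").split("'", 1)[0]
            # Skip empty and single-letter "words" and stop words.
            if len(word) < 2 or word in stop_words:
                continue
            index[word].add(line_no)
    # Assumption: alphabetical order by word, ascending line numbers.
    return {word: sorted(nums) for word, nums in sorted(index.items())}


def format_index(index):
    """Render the index as 'word line1,line2,...' strings, one per word."""
    return [f"{word} {','.join(map(str, nums))}" for word, nums in index.items()]
```

To index a file, pass its lines and the stop-word set, e.g. format_index(build_index(open("book.txt"), load_stop_words("stop.words"))), where book.txt is a placeholder name. Collecting line numbers in a set keeps a word that appears twice on one line from being listed twice.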