Predictions!

Now that you know that the average message length differs between ham and spam, build a simple classifier based on the number of words in a message.

  1. Write a block that takes in a message and, based on how many words it contains, returns either "ham" or "spam" as its classification.
    [Image: classifying a message that evaluates to "ham"]
    [Image: classifying a message that evaluates to "spam"]
    Implement this block in Snap! and use it to classify the messages in our data. You can use a regular loop to call your classify block on the second item in each row of the data, or you can use this faster method based on the keep block:
    [Image: using keep to apply a custom classify block to the data]
  2. Play around with different threshold values—that is, the number of words above which you decide that a message is spam.
  3. What was the best threshold you found?
  4. How many messages did it classify incorrectly?
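
For readers who want to check their Snap! results in a text language, the steps above can be sketched in Python. This is only an illustration under stated assumptions: each row is assumed to be a (label, message) pair, the function names classify, count_errors, and best_threshold are hypothetical, and the sample rows are invented rather than taken from the lab's data.

```python
def classify(message, threshold):
    """Return "spam" if the message has more than `threshold` words, else "ham"."""
    return "spam" if len(message.split()) > threshold else "ham"


def count_errors(rows, threshold):
    """Count the rows whose predicted label differs from the true label."""
    return sum(1 for label, message in rows if classify(message, threshold) != label)


def best_threshold(rows, candidates):
    """Return (threshold, errors) for the candidate with the fewest errors.

    Ties go to the first candidate tried.
    """
    scored = ((t, count_errors(rows, t)) for t in candidates)
    return min(scored, key=lambda pair: pair[1])


# Invented toy rows for illustration only; assumed layout is (label, message).
rows = [
    ("ham", "see you at lunch"),
    ("ham", "ok thanks"),
    ("spam", "you have won a free prize call now to claim it today"),
    ("spam", "urgent your account needs attention reply with your details now"),
    ("ham", "running late can you save me a seat near the front please"),
]

threshold, errors = best_threshold(rows, range(0, 15))
print(threshold, errors)
```

Trying every candidate threshold and counting the mistakes for each mirrors steps 2 to 4: the threshold with the fewest errors is the best one found, and its error count is the number of messages classified incorrectly.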