CrowdFlower Competition Scripts: Approaching NLP

The CrowdFlower Search Results Relevance competition was a great opportunity for Kagglers to approach a tricky Natural Language Processing problem. With 1,326 teams, there was plenty of room for fierce competition and helpful collaboration. We pulled some of our favorite scripts that you'll want to review before approaching your next NLP project or competition. Keep reading for more on:

The instability of a quadratic weighted kappa metric
How to use a stemmer and a lemmatizer
Machine Learning Classification using Google Charts
Set-based similarities (with a seaborn visualization)

Kappa Intuition

Created by: Triskelion
Language: Python

What motivated you to create the script?

I initially created this script because I was not yet familiar with the "quadratic weighted kappa" evaluation metric. I think a good way to familiarize yourself with a metric is to try it out on a small toy example and tinker with it a bit. That's basically what this script does: it shows how the metric reacts to slightly different predictions, in comparison to more familiar metrics.

What did you learn from the code / output?

See the full code on Scripts

What struck me most in the output was that a single big mistaken prediction can damage your score more than 50% small mistaken predictions.
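To make the toy-example idea concrete, here is a minimal sketch of quadratic weighted kappa in plain Python. It is not Triskelion's script: the function names, the 1-4 rating scale, and the 20-item toy data are all assumptions chosen for illustration. It compares one prediction that is off by three rating levels against three predictions that are each off by one.

```python
def quadratic_weighted_kappa(actual, predicted, min_rating, max_rating):
    """Quadratic weighted kappa for integer ratings in [min_rating, max_rating]."""
    n_ratings = max_rating - min_rating + 1
    n = len(actual)

    # Observed confusion matrix: rows are actual ratings, columns are predictions.
    observed = [[0] * n_ratings for _ in range(n_ratings)]
    for a, p in zip(actual, predicted):
        observed[a - min_rating][p - min_rating] += 1

    hist_actual = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_ratings))
                 for j in range(n_ratings)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_ratings):
        for j in range(n_ratings):
            # Penalty grows with the squared distance between ratings.
            weight = (i - j) ** 2 / (n_ratings - 1) ** 2
            numerator += weight * observed[i][j]
            # Expected count if actual and predicted ratings were independent.
            denominator += weight * hist_actual[i] * hist_pred[j] / n

    if denominator == 0:  # every rating identical on both sides
        return 1.0
    return 1.0 - numerator / denominator


def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical toy data: 20 items, five of each rating from 1 to 4.
actual = [1, 2, 3, 4] * 5

# One big mistake: a single true 1 predicted as 4.
one_big = list(actual)
one_big[0] = 4

# Three small mistakes: each prediction is one level too high.
three_small = list(actual)
three_small[0] = 2  # true 1
three_small[1] = 3  # true 2
three_small[2] = 4  # true 3

kappa_big = quadratic_weighted_kappa(actual, one_big, 1, 4)
kappa_small = quadratic_weighted_kappa(actual, three_small, 1, 4)
print("one big mistake:      kappa=%.2f accuracy=%.2f" % (kappa_big, accuracy(actual, one_big)))
print("three small mistakes: kappa=%.2f accuracy=%.2f" % (kappa_small, accuracy(actual, three_small)))
```

On this toy data, accuracy ranks the single big mistake higher (0.95 versus 0.85), while kappa ranks it lower (0.82 versus 0.94), because the penalty grows with the squared distance between the true and predicted rating.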

The winner of the competition, Chenglong Chen, showed on the blog that he had an understanding of the instability of the metric too. He used this knowledge to increase the number of bagged ensemble models and actively tried to combat this instability. An understanding of the metric (and his smart decoding method) gave him an edge.

Porter Stemmer

Created by: Taposh Dutta-Roy (aka OverfitterScientist)
Language: Python

What motivated you to create the script?

For any Natural Language Processing problem, a stemmer and a lemmatizer are needed. When I looked at the solutions, I did not see anyone using them, so I wanted to provide sample code for folks to use.
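The difference between the two tools can be sketched with a deliberately simplified example. The code below is not the Porter algorithm and not the author's script: the suffix list and the lemma lookup table are hypothetical stand-ins. In practice one would more likely reach for a library such as NLTK, which provides `PorterStemmer` and `WordNetLemmatizer`.

```python
# Toy illustration only: these suffix rules are NOT the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical rules, checked in order

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lexical database.
LEMMAS = {"children": "child", "mice": "mouse", "ran": "run"}

def toy_lemmatize(word):
    """Map a word to its dictionary form if known, else return it unchanged."""
    word = word.lower()
    return LEMMAS.get(word, word)

query = "painted wooden boxes for children"
stems = [toy_stem(w) for w in query.split()]
lemmas = [toy_lemmatize(w) for w in query.split()]
print(stems)
print(lemmas)
```

The stemmer chops suffixes by rule, so it handles "painted" and "boxes" but leaves the irregular plural "children" alone; the lemmatizer relies on vocabulary knowledge, so it maps "children" to "child" but only covers words it knows.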

See the full code on Scripts

What can other data scientists learn from your script?

Novice data scientists can learn about the building blocks of text analytics, such as a stemmer and a lemmatizer. They can also understand how sentiment is determined from text.

I had previously blogged about how to do simple sentiment analysis using Google's word2vec. This script, combined with that blog post, should help them with sentiment analysis.

Pure Python - No Blackbox Test

Created by: the1owl
Language: Python

What motivated you to create the script?

This was an exploratory script and my first attempt at Machine Learning Classification. I used it as my own benchmark against the established machine learning classifiers I was able to test. Ultimately it performed on par with some of the weaker classifiers, but it provided the flexibility for further exploration.

See the full code on Scripts

I really enjoyed this competition, and the fact that its dataset was not obfuscated made it all the more interesting. I hope you enjoy the script and find some of the easy-to-implement Google Charts applicable to your future scripts.

Visualization Using Seaborn

Created by: saihttam
Language: Python

What motivated you to create the script?

On the forums there was no code that showed how to use set-based similarities, so I thought I would add this visualization script to contribute a different set of features. As for the visualization, I typically try to use existing libraries, since they do a lot of things right out of the box and aim to answer the relevant questions. Lately, I have often used seaborn.

What did you learn from the code / output?

I played around with corrections of the query/description as shown on the forums, e.g. fixing words (harleydavidson) and using synonyms (children versus kids). The plots indicated that these fixes improved the similarity globally, including for queries that were not highly rated.
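As a rough sketch of what a set-based similarity looks like, the example below computes Jaccard and Dice similarity between the word sets of a query and a description, before and after applying corrections. It is not saihttam's script: the correction map, the example strings, and the function names are hypothetical, loosely based on the examples mentioned above.

```python
def token_set(text):
    """Lowercase a string and return its set of whitespace-separated words."""
    return set(text.lower().split())

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def dice(a, b):
    """Twice the intersection size divided by the sum of the set sizes."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

# Hypothetical correction map: one split word and one synonym.
CORRECTIONS = {"harleydavidson": "harley davidson", "kids": "children"}

def fix(text):
    """Replace each word that has a known correction."""
    return " ".join(CORRECTIONS.get(w, w) for w in text.lower().split())

# Hypothetical query and product description.
query = "harleydavidson jacket kids"
description = "Harley Davidson leather jacket for children"

unfixed = jaccard(token_set(query), token_set(description))
fixed = jaccard(token_set(fix(query)), token_set(fix(description)))
print("Jaccard unfixed: %.3f, fixed: %.3f" % (unfixed, fixed))
print("Dice fixed: %.3f" % dice(token_set(fix(query)), token_set(fix(description))))
```

Each similarity becomes one numeric feature per query/description pair, so computing it on both the fixed and unfixed text yields two related features that a model can weigh separately.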

See the full code on Scripts

What can other data scientists learn from your script?

I generally use Python, and Python offers excellent libraries, so my approach has been to look at what these libraries provide. Seaborn, for example, has a great gallery where you can explore some of the plots that may be useful for your problem.

How did the output of this script help you in the competition?

I added two sets of features: similarities on fixed queries/descriptions and similarities on unfixed queries/descriptions, which improved my submissions slightly.
