NLP Classification of Yelp Reviews

This week I had been studying up on Natural Language Processing, so I decided to make my weekly blog post an NLP classification project. I had previously done some work with the Yelp Fusion API on a non-text-modeling project, so I knew it was one of the best free APIs available with review text data and an associated business class for each review. After trying out API calls on a few different types of businesses in NYC, I settled on Gyms and Barbershops as two classes of reviews with sufficient sample size. I intentionally chose business classes that, while loosely related in terms of beautifying the body, were different enough to likely produce classifiable text reviews.

In order to access Yelp review text, I first had to download lists of Gym and Barbershop names from the Yelp Fusion API to subsequently request reviews for. I downloaded 500 business names for each class, the maximum number in NYC I could access while maintaining balanced classes. I then downloaded the maximum 3 reviews for each of the 1,000 businesses in JSON form and read them into Pandas after some reformatting. This resulted in 3,000 total datapoints, 1,500 for each class, with each datapoint containing a 140-character snippet of a business review and the class label for that review.
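A rough sketch of that download step is below. The category aliases, location string, and paging parameters are my assumptions about the request setup; the Fusion API does cap search results at 50 per call and reviews at 3 per business.

```python
import requests
import pandas as pd

API_KEY = "YELP_API_KEY"  # assumption: your own Yelp Fusion API key
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

def get_businesses(category, total=500):
    """Page through /businesses/search, 50 results per call (the API max)."""
    businesses = []
    for offset in range(0, total, 50):
        resp = requests.get(
            "https://api.yelp.com/v3/businesses/search",
            headers=HEADERS,
            params={"categories": category, "location": "NYC",
                    "limit": 50, "offset": offset},
        )
        businesses += resp.json()["businesses"]
    return businesses

def get_reviews(business_id):
    """Fetch the (up to 3) review snippets the API returns per business."""
    resp = requests.get(
        f"https://api.yelp.com/v3/businesses/{business_id}/reviews",
        headers=HEADERS,
    )
    return [r["text"] for r in resp.json().get("reviews", [])]

rows = []
for label, category in [("Gym", "gyms"), ("Barber", "barbers")]:
    for biz in get_businesses(category):
        for text in get_reviews(biz["id"]):
            rows.append({"text": text, "label": label})

df = pd.DataFrame(rows)  # ~3,000 rows: review snippet + class label
```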

Initial frequency distribution of the top 25 tokens by class

I used nltk to tokenize the reviews, remove stopwords and stem the tokens for subsequent EDA. After taking an initial look at the frequency distribution of the top 25 words in both classes, I noticed a few issues that I wanted to correct with custom additions to the standard nltk stopwords. Firstly, the tokens included a few different punctuation marks that were showing up high in the frequency distribution, so I added these to the stopwords. The most frequent tokens also included alternative forms of existing stopwords like ‘s (is) and n’t (not), which I added to the stopwords for consistency. Finally, I noticed that barber and gym were the most frequent remaining tokens, which I suspected could have been part of the reason Yelp returned this subset of reviews for my API request. Not wanting to overfit my model to these terms, I added them to my list of stopwords.
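Continuing from the dataframe above, a minimal version of that preprocessing looks something like this (the custom stopword additions are abbreviated for illustration; my full list covered all the punctuation and contraction tokens described above):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("punkt")
nltk.download("stopwords")

# Standard English stopwords plus the custom additions described above
stop_words = set(stopwords.words("english"))
stop_words |= {".", ",", "!", "'s", "n't", "barber", "gym"}

stemmer = PorterStemmer()

def preprocess(review):
    """Tokenize, lowercase, drop stopwords/punctuation, then stem."""
    tokens = nltk.word_tokenize(review.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

df["tokens"] = df["text"].apply(preprocess)
```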

Wordcloud for top 25 tokens by class after removing custom stopwords

I next produced wordclouds of the top 25 most frequent tokens in each class after removing the enlarged list of stopwords. Not surprisingly, many of the top tokens for both business classes were directly related to what they did, such as ‘fit’/‘workout’ for Gyms and ‘haircut’/‘hair’ for Barbers. Both wordclouds also included words more indirectly associated with the activities of each class, like ‘month’ reflecting the fact that gym memberships tend to be monthly and ‘guy’ reflecting the fact that barbers often specialize in a single gender’s hair. There was also some general review terminology common to both classes, including ‘great’, ‘friend’ and ‘time’. I could have added a list of generic review terms to my stopwords, but developing that list would have consumed a lot of time, so I decided to dive into modeling first to see how it performed; if needed, I could rely on TFIDF to down-weight words common across all documents, or add stopwords later.
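The wordclouds came from per-class frequency distributions along these lines (a sketch using nltk's FreqDist and the wordcloud package; the plot styling is an assumption):

```python
import matplotlib.pyplot as plt
from nltk import FreqDist
from wordcloud import WordCloud

for label in ["Gym", "Barber"]:
    # Flatten this class's token lists into one frequency distribution
    tokens = [t for toks in df.loc[df["label"] == label, "tokens"] for t in toks]
    freq = dict(FreqDist(tokens).most_common(25))

    wc = WordCloud(background_color="white").generate_from_frequencies(freq)
    plt.imshow(wc, interpolation="bilinear")
    plt.axis("off")
    plt.title(label)
    plt.show()
```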

After vectorizing my data using TFIDF and performing a train-test split, I first built a Naive Bayes model using GridSearchCV to find the optimal alpha of .3. I used f1 score as my performance metric in order to ensure that my model performed equally well across both classes. Naive Bayes outperformed the Dummy Classifier with a test set f1 score of .895 compared to .497. The Confusion Matrix revealed that Naive Bayes was slightly more likely to classify a review as Barber (53.6% of the time) and misclassified more reviews as Barber than Gym (6.9% vs. 3.2%).
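In code, that step looked roughly like this. Macro-averaged f1 scores both classes equally; the alpha grid, split size, and dummy strategy are assumptions:

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.naive_bayes import MultinomialNB

# Re-join the stemmed tokens so TfidfVectorizer can treat them as documents
docs = df["tokens"].apply(" ".join)
X = TfidfVectorizer().fit_transform(docs)
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# Dummy baseline: random guessing over two balanced classes lands near .5
dummy = DummyClassifier(strategy="uniform", random_state=42).fit(X_train, y_train)
print(f1_score(y_test, dummy.predict(X_test), average="macro"))

# Grid search over the Naive Bayes smoothing parameter; .3 won out
grid = GridSearchCV(MultinomialNB(), param_grid={"alpha": [0.1, 0.3, 0.5, 1.0]},
                    scoring="f1_macro", cv=5)
grid.fit(X_train, y_train)
nb = grid.best_estimator_

print(f1_score(y_test, nb.predict(X_test), average="macro"))
print(confusion_matrix(y_test, nb.predict(X_test), normalize="all"))
```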

Confusion Matrix for Random Forest Classification of Yelp reviews

I next built a Random Forest Classification model to compare to Naive Bayes. Random Forest had slightly lower performance with a test set f1 score of .892. The Confusion Matrix results were the opposite of Naive Bayes: Random Forest was more likely to classify reviews as Gym (54.6% of the time) and misclassified more reviews as Gym than Barber (7.9% vs. 3.4%).
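A sketch of the Random Forest comparison, reusing the split from above (the hyperparameters here are assumptions):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, f1_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print(f1_score(y_test, rf.predict(X_test), average="macro"))
# normalize="all" expresses each cell as a share of all test points,
# which is how the misclassification percentages above are quoted
print(confusion_matrix(y_test, rf.predict(X_test), normalize="all"))
```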

Confusion Matrix for Gradient Boost Classification of Yelp reviews

I built a Gradient Boost model to see if it could improve on the performance of Random Forest but it actually underperformed the other models with a test set f1 score of .877. The Confusion Matrix revealed that Gradient Boost was skewed towards classifying datapoints as Gym (61.2% of the time), which led it to misclassify a large number of reviews as Gym (12.4%), despite rarely misclassifying datapoints as Barber (1.2% of the time).
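The Gradient Boost model follows the same pattern (default hyperparameters assumed):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix, f1_score

gb = GradientBoostingClassifier(random_state=42)
gb.fit(X_train, y_train)

print(f1_score(y_test, gb.predict(X_test), average="macro"))
print(confusion_matrix(y_test, gb.predict(X_test), normalize="all"))
```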

Confusion Matrix for Voting Classification of Yelp reviews

I finally wanted to see if combining the relative strengths of my 3 models into a Soft Voting Classifier could improve performance. The Voting Classifier did improve the test set f1 score to .914. The influence of Gradient Boost showed in the results: the ensemble was more likely to classify reviews as Gym (56.5% of the time) and misclassified more reviews as Gym (7.8%) than Barber (1.4%). However, this Voting Classifier still outperformed a two-model Voting Classifier that excluded Gradient Boost.
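The ensemble itself is a soft VotingClassifier over the three models, re-fit on the training set (equal weighting is an assumption):

```python
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import confusion_matrix, f1_score

# Soft voting averages each model's predicted class probabilities
voter = VotingClassifier(
    estimators=[("nb", nb), ("rf", rf), ("gb", gb)],
    voting="soft",
)
voter.fit(X_train, y_train)

print(f1_score(y_test, voter.predict(X_test), average="macro"))
print(confusion_matrix(y_test, voter.predict(X_test), normalize="all"))
```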

My top-performing Soft Voting Classifier achieved a strong test set f1 score of .914, which greatly outperformed the Dummy Classifier's test set f1 score of .497. The Voting Classifier skewed slightly towards the Gym class, both correctly and incorrectly classifying points as belonging to this class more often. With more time, I would like to continue testing new models in hopes of reducing this skew and further increasing the f1 score.
