Class Lecture & Project

EMSE 6575 - Applied Machine Learning for Analytics

Course Description (3 credits)

Methods and techniques for discovering patterns and relationships in aggregated data, with practical focus on engineering problems. Tools, techniques, and methods explored in the context of their application.

Topic 1 - Linear Regression

Topic Description

Linear Regression: Classical OLS Regression; Underfitting & Overfitting; Cross-Validation; Model Goodness-of-fit; Higher-Order Regression. SKLearn Regression Tutorial; SKLearn Underfitting/Overfitting Tutorial; SKLearn User Guide Ordinary Least Squares; Wikipedia Linear Regression. The dataset to use is the diamonds dataset (https://www.kaggle.com/shivam2503/diamonds).
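As a rough illustration of how cross-validation exposes underfitting and overfitting in higher-order regression, the sketch below fits polynomials of several degrees by least squares and compares their held-out error. It uses synthetic data rather than the diamonds dataset, and the helper name `kfold_mse`, the chosen degrees, and the noise level are assumptions made for the example only.

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Mean held-out squared error of a polynomial least-squares fit."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train], y[train], degree)  # OLS on polynomial terms
        pred = np.polyval(coeffs, x[test])
        errors.append(np.mean((pred - y[test]) ** 2))
    return float(np.mean(errors))

# Synthetic data (assumption): a quadratic trend plus Gaussian noise.
rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 60)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.5, size=x.size)

# Degree 1 is too simple for this trend, degree 2 matches it,
# and degree 9 has far more flexibility than the data needs.
scores = {d: kfold_mse(x, y, d) for d in (1, 2, 9)}
for d, mse in scores.items():
    print(f"degree {d}: cross-validated MSE = {mse:.3f}")
```

On data like this, the straight line should show a clearly larger held-out error than the quadratic, while very high degrees tend to fit the training folds closely without improving the held-out error.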

Lecture Code

Reinforcement Learning

Recording - n!4GDHgW

Topic 2 - Logistic Regression

Topic Description

Logistic Regression: Discriminative Classification; Contingency Tables; Accuracy Scores; the Logistic Function; Multinomial Logistic Regression. SKLearn Logistic Regression Tutorial; SKLearn User Guide Logistic Regression; Wikipedia Logistic Regression

Lecture Code

Assignment

Recording - s9Q5&N#M

Topic 3 - Naïve Bayes

Topic Description

Naïve Bayes Classification: Generative Classification; Bayes’ Rule; Conditional Independence; Discrete & Continuous Naïve Bayesian Classification. SKLearn Naïve Bayes; NLTK Book, Chapter 06

Lecture Code

Assignment

Recording - @8QxdjR0

Topic 4 - Support Vector Machine

Topic Description

Support Vector Machines: Support Vector Classification; the Kernel Trick; Loss Functions; SKLearn User Guide Support Vector Machines; SKLearn Supervised Learning Tutorial

Lecture Code

Topic 5 - ROC Curve

Topic Description

Model Evaluation Metrics: Comparing across classifiers; Precision, Recall, and the ROC Curve; Grid Search; SKLearn Precision Recall Tutorial; SKLearn ROC Tutorial; SKLearn ROC Example; SKLearn User Guide Grid Search
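To make the relationship between precision, recall, and the ROC curve concrete, here is a minimal sketch in plain Python. The labels and scores are made-up toy values, and the function names are hypothetical; in practice a library such as scikit-learn would normally be used for these metrics.

```python
def confusion_counts(y_true, scores, threshold):
    """Count TP, FP, FN, TN when predicting 1 for scores >= threshold."""
    tp = fp = fn = tn = 0
    for label, score in zip(y_true, scores):
        pred = 1 if score >= threshold else 0
        if pred == 1 and label == 1:
            tp += 1
        elif pred == 1 and label == 0:
            fp += 1
        elif pred == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall(y_true, scores, threshold):
    tp, fp, fn, _ = confusion_counts(y_true, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def roc_points(y_true, scores):
    """(false positive rate, true positive rate) at each distinct threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion_counts(y_true, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy data (assumption): three positives and three negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(precision_recall(y_true, scores, threshold=0.5))  # (0.75, 1.0)
print(round(auc(roc_points(y_true, scores)), 4))        # 0.8889
```

Because the area under the curve summarizes performance over all thresholds, it is one way to compare classifiers that produce scores on different scales.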

Lecture Note

Lecture Code

Recording - 756v&*?e

Topic 6 - Neural Network

Topic Description

Neural Networks, Part 0 – Perceptrons, feed-forward nets, backpropagation
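The perceptron learning rule can be sketched in a few lines of plain Python. The example below trains a single perceptron on the logical AND function; the function names, learning rate, and epoch count are assumptions chosen for illustration. It covers only the single-unit update rule, not feed-forward networks or backpropagation.

```python
def predict(weights, bias, x):
    """Step activation: output 1 if the weighted sum is positive."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, lr=1.0, epochs=20):
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in zip(samples, labels):
            error = target - predict(weights, bias, x)
            # Move the decision boundary only when the prediction is wrong.
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND is linearly separable, so the rule converges on it.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
print([predict(weights, bias, x) for x in samples])  # [0, 0, 0, 1]
```

A single perceptron cannot represent functions that are not linearly separable, such as XOR, which is one motivation for multi-layer feed-forward networks.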

Lecture Code

Assignment

Recording

Topic 7 - Convolutional Neural Networks

Topic Description

Intro to deep learning. Convolutional Neural Networks; using convolutional neural networks for feature extraction and scoring.

Lecture Note

Recording

Topic 8 - Recurrent Neural Network

Topic Description

PCA, SVD and LSA: Overview of Unsupervised learning; Eigenvalue decomposition; K-Means Clustering: Clustering, Model selection for clustering; Screeplots; Comparison to Dimensionality Reduction; SKLearn User Guide Clustering

Intro

GPU

Recording - $=k20&xG

Topic 9 - Object Detection

Topic Description

Getting into text mining: Data wrangling and mechanics; Textual datasets; Data Cleaning & Text Data Structures; “Bag of Words” & N-Grams; TF-IDF weighting. Assigned readings: pandas tutorial; NLTK Book, Chapter 03; SKLearn Feature Extraction
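A bag-of-words representation, n-grams, and TF-IDF weighting can be sketched with the standard library alone. The three-document corpus and the function names below are made up for illustration, and the weighting shown (term frequency times the log of inverse document frequency) is only one common variant; libraries such as scikit-learn apply a smoothed form by default.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Per-document weights: tf = count / doc length, idf = ln(N / doc freq)."""
    bags = [Counter(tokenize(d)) for d in docs]        # bag of words per document
    n_docs = len(bags)
    doc_freq = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in bag.items()
        })
    return weights

# Toy corpus (assumption).
docs = ["the cat sat", "the dog sat", "the dog barked"]
weights = tf_idf(docs)

print(ngrams(tokenize(docs[0]), 2))  # [('the', 'cat'), ('cat', 'sat')]
for term, w in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {w:.3f}")
```

A term that appears in every document, such as "the" here, receives a weight of zero under this variant, while a term unique to one document receives the highest weight.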

Single Rectangle

Multiple Rectangles

Classification

Recording - ?@4Whfz0

Topic 10 - Generative Adversarial Networks

Topic Description

Adversarial Neural Networks

Lecture Note

Lecture Code

Recording - 9br^.pVF

Topic 11 - Autonomous Car

Topic Description

Extra topic on building an autonomous car with machine learning.

Lecture Code

Final Project - PCA of Facebook and Twitter LDA Evaluation Metrics

Project Description

Topic models are easy to train, but do they generate useful topics? In this post, we discuss several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach. PCA is a statistical technique used to re-express highly correlated multivariate data in uncorrelated components that capture independent pieces of information represented in the larger data.
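The idea of re-expressing correlated metrics as uncorrelated components can be illustrated with a short NumPy sketch. The data here is synthetic: four hypothetical metrics generated from two latent factors stand in for the Mallet diagnostics, which are not reproduced. The variable names and mixing weights are assumptions for the example, not the project's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n_topics = 50

# Synthetic stand-in (assumption): two latent factors drive four metrics,
# so the metrics are highly correlated with one another.
latent = rng.normal(size=(n_topics, 2))
mixing = np.array([[1.0, 0.9, 0.1, 0.0],
                   [0.0, 0.2, 1.0, 0.8]])
X = latent @ mixing + 0.1 * rng.normal(size=(n_topics, 4))

# Standardize each metric so no single scale dominates.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via eigendecomposition of the covariance of the standardized data.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                         # topics expressed in components
explained = eigvals / eigvals.sum()          # share of variance per component

# The component scores are uncorrelated: off-diagonal covariances vanish.
score_cov = np.cov(scores, rowvar=False)
off_diag = score_cov - np.diag(np.diag(score_cov))

print("explained variance ratio:", np.round(explained, 3))
print("largest off-diagonal covariance:", np.abs(off_diag).max())
```

The columns of `eigvecs` are the loadings; inspecting which metrics load heavily on each leading component is how the components would be interpreted.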

To accomplish this, we use Mallet to generate fifty topics for a corpus of over 264K posts found on publicly available Facebook pages related to COVID-19 and fifty topics for a corpus of ~11 million Twitter posts related to COVID-19. We use hashtag pooling to generate topics for the Twitter corpus, and Python to calculate diagnostic measures from the Mallet topic-term frequency output files.

Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) term frequency, and 2) term specificity. Furthermore, we found that, on average, topics with common, specific terms score significantly higher on coherence than topics with uncommon, unspecific terms. However, we also found several cases of poor topics that scored relatively high on coherence. In other words, our results suggest that topics that use common, specific terms should be easier to interpret, but interpretability doesn’t imply that a topic is composed of terms that are specific or central to a corpus.