EMSE 6575 - Applied Machine Learning for Analytics



Course Description (3 credits)

Methods and techniques for discovering patterns and relationships in aggregated data, with practical focus on engineering problems. Tools, techniques, and methods explored in the context of their application.


Topic 1 - Linear Regression

Topic Description

Linear Regression: Classical OLS Regression; Underfitting & Overfitting; Cross-Validation; Model Goodness-of-fit; Higher-Order Regression. SKLearn Regression Tutorial; SKLearn Underfitting/Overfitting Tutorial; SKLearn User Guide Ordinary Least Squares; Wikipedia Linear Regression. The dataset to use is the diamonds dataset (https://www.kaggle.com/shivam2503/diamonds).

Lecture Code

nbviewer

Reinforcement Learning

nbviewer

Recording - n!4GDHgW

zoom


Topic 2 - Logistic Regression

Topic Description

Logistic Regression: Discriminative Classification; Contingency Tables; Accuracy Scores; the Logistic Function; Multinomial Logistic Regression; SKLearn Logistic Regression Tutorial; SKLearn User Guide Logistic Regression; Wikipedia Logistic Regression

Lecture Code

nbviewer

Assignment

nbviewer

Recording - s9Q5&N#M

zoom


Topic 3 - Naïve Bayes

Topic Description

Naïve Bayes Classification: Generative Classification; Bayes’ Rule; Conditional Independence; Discrete & Continuous Naïve Bayesian Classification; SKLearn Naïve Bayes; NLTK Book, Chapter 06

Lecture Code

nbviewer

Assignment

nbviewer

Recording - @8QxdjR0

zoom


Topic 4 - Support Vector Machine

Topic Description

Support Vector Machines: Support Vector Classification; the Kernel Trick; Loss Functions; SKLearn User Guide Support Vector Machines; SKLearn Supervised Learning Tutorial

Lecture Code

nbviewer


Topic 5 - ROC Curve

Topic Description

Model Evaluation Metrics: Comparing across classifiers; Precision, Recall, and the ROC Curve; Grid Search; SKLearn Precision Recall Tutorial; SKLearn ROC Tutorial; SKLearn ROC Example; SKLearn User Guide Grid Search

Lecture Note

Slides PDF

Lecture Code

nbviewer

Recording - 756v&*?e

zoom


Topic 6 - Neural Network

Topic Description

Neural Networks, Part 0 – Perceptrons, feed-forward nets, backpropagation

Lecture Code

nbviewer

Assignment

nbviewer

Recording

zoom


Topic 7 - Convolutional Neural Networks

Topic Description

Intro to deep learning. Convolutional Neural Networks; using convolutional neural networks for feature extraction and scoring.

Lecture Note

Slides PDF

Recording

zoom


Topic 8 - Recurrent Neural Network

Topic Description

PCA, SVD and LSA: Overview of Unsupervised Learning; Eigenvalue Decomposition; K-Means Clustering: Clustering, Model Selection for Clustering; Scree Plots; Comparison to Dimensionality Reduction; SKLearn User Guide Clustering

Intro

nbviewer

GPU

nbviewer

Recording - $=k20&xG

zoom


Topic 9 - Object Detection

Topic Description

Getting into text mining: Data Wrangling and Mechanics; Textual Datasets; Data Cleaning & Text Data Structures: “Bag of Words” & N-Grams; TF-IDF Weighting. Assigned readings: pandas tutorial; NLTK Book, Chapter 03; Sklearn Feature Extraction

Single Rectangle

nbviewer

Multiple Rectangles

nbviewer

Classification

nbviewer

Recording - ?@4Whfz0

zoom


Topic 10 - Generative Adversarial Networks

Topic Description

Adversarial Neural Networks

Lecture Note

Slides PDF

Lecture Code

nbviewer

Recording - 9br^.pVF

zoom


Topic 11 - Autonomous Car

Topic Description

Extra topic on building an autonomous car with machine learning.

Lecture Code

nbviewer


Final Project - PCA of Facebook and Twitter LDA Evaluation Metrics

Project Description

Topic models are easy to train, but do they generate useful topics? In this post, we discuss several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach. PCA is a statistical technique used to re-express highly correlated multivariate data in uncorrelated components that capture independent pieces of information represented in the larger data.

To accomplish this, we use Mallet to generate fifty topics for a corpus of over 264K posts found on publicly available Facebook pages related to COVID-19 and fifty topics for a corpus of ~11 million Twitter posts related to COVID-19. We use hashtag pooling to generate topics for the Twitter corpus, and Python to calculate diagnostic measures from the Mallet topic-term frequency output files.

Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) term frequency, and 2) term specificity. Furthermore, we found that topics with common, specific terms score significantly better on coherence, on average, than topics with uncommon, unspecific terms. However, we also found several cases of poor topics that scored relatively high on coherence. In other words, our results suggest that topics using common, specific terms should be easier to interpret, but interpretability doesn’t imply that a topic is composed of terms that are specific or central to a corpus.
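The PCA step described above can be illustrated with a minimal sketch. This is not the project's code: the data here is synthetic, the four columns are hypothetical stand-ins for correlated topic diagnostics, and the two hidden factors are an assumption chosen only to mimic the "two primary factors" finding. It shows how standardized, correlated metrics are re-expressed as uncorrelated components.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a topics-by-metrics table: 50 topics, 4 metrics.
# Two hidden factors drive the metrics, so the columns are highly correlated.
# (Assumption: real Mallet diagnostics would be loaded from its output files.)
n_topics = 50
factors = rng.normal(size=(n_topics, 2))
loadings = np.array([[1.0, 0.9, 0.1, 0.0],
                     [0.0, 0.2, 1.0, 0.8]])
metrics = factors @ loadings + 0.05 * rng.normal(size=(n_topics, 4))

# Standardize each metric so PCA works on the correlation structure.
z = (metrics - metrics.mean(axis=0)) / metrics.std(axis=0, ddof=1)

# PCA via singular value decomposition of the standardized matrix.
u, s, vt = np.linalg.svd(z, full_matrices=False)
explained_variance_ratio = s**2 / np.sum(s**2)

# Component scores: one row per topic, one column per principal component.
scores = z @ vt.T

# The components are uncorrelated: off-diagonal covariances are ~0.
score_cov = np.cov(scores, rowvar=False)

print("explained variance ratio:", np.round(explained_variance_ratio, 3))
print("component loadings (rows = components):")
print(np.round(vt, 2))
```

In this toy setup the first two components should capture nearly all of the variance, because only two factors generate the data. With real diagnostics, the rows of `vt` (the loadings) are what one would inspect to interpret each component.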

Project Post
Medium Post
Project Workflow

nbviewer

Coherence Calculation

nbviewer

Model Evaluation

nbviewer


