Class Lecture & Project
EMSE 6575 - Applied Machine Learning for Analytics
Course Description (3 credits)
Methods and techniques for discovering patterns and relationships in aggregated data, with practical focus on engineering problems. Tools, techniques, and methods explored in the context of their application.
Topic 1 - Linear Regression
Topic Description
Linear Regression: Classical OLS Regression; Underfitting & Overfitting; Cross-Validation; Model Goodness-of-Fit; Higher-Order Regression. Readings: SKLearn Regression Tutorial; SKLearn Underfitting/Overfitting Tutorial; SKLearn User Guide, Ordinary Least Squares; Wikipedia, Linear Regression. Dataset: the diamonds dataset (https://www.kaggle.com/shivam2503/diamonds).
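As a rough illustration of classical OLS and goodness-of-fit, the sketch below fits a one-variable regression in closed form and reports R². It is plain Python rather than the SKLearn code used in lecture, and the `carat` and `price` values are made-up toy numbers, not rows from the diamonds dataset.

```python
def fit_ols(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(xs, ys, intercept, slope):
    """Goodness-of-fit: share of the variance in y explained by the line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical toy data (NOT taken from the diamonds dataset).
carat = [0.2, 0.4, 0.6, 0.8, 1.0]
price = [400.0, 800.0, 1200.0, 1600.0, 2000.0]

intercept, slope = fit_ols(carat, price)
fit_quality = r_squared(carat, price, intercept, slope)
print(intercept, slope, fit_quality)
```

On real data, R² computed on the training set alone can hide overfitting, which is why the topic pairs it with cross-validation.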
Lecture Code
Reinforcement Learning
Recording - n!4GDHgW
Related Links
Topic 2 - Logistic Regression
Topic Description
Logistic Regression: Discriminative Classification; Contingency Tables; Accuracy Scores; the Logistic Function; Multinomial Logistic Regression. Readings: SKLearn Logistic Regression Tutorial; SKLearn User Guide, Logistic Regression; Wikipedia, Logistic Regression.
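The logistic function, a contingency table, and an accuracy score can be sketched in a few lines of plain Python. This is an illustration only: the weights, bias, and samples are hypothetical values picked for the example, and no model is actually trained here.

```python
import math


def logistic(z):
    """Logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, bias):
    """P(y = 1 | features) under a binary logistic regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return logistic(z)


def contingency_table(y_true, y_pred):
    """Count true/false positives and negatives for binary labels 0/1."""
    table = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            table["tp" if truth == 1 else "fp"] += 1
        else:
            table["fn" if truth == 1 else "tn"] += 1
    return table


def accuracy(table):
    """Fraction of predictions that match the true label."""
    return (table["tp"] + table["tn"]) / sum(table.values())


# Hypothetical weights and samples, chosen only for illustration.
weights, bias = [2.0, -1.0], 0.0
samples = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 2.0]]
y_true = [1, 0, 1, 1]

# Classify as positive when the predicted probability is at least 0.5.
y_pred = [1 if predict_proba(x, weights, bias) >= 0.5 else 0 for x in samples]
table = contingency_table(y_true, y_pred)
print(table, accuracy(table))
```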
Lecture Code
Assignment
Recording - s9Q5&N#M
Related Links
Topic 3 - Naïve Bayes
Topic Description
Naïve Bayes Classification: Generative Classification; Bayes’ Rule; Conditional Independence; Discrete & Continuous Naïve Bayesian Classification. Readings: SKLearn Naïve Bayes; NLTK Book, Chapter 06.
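Bayes' Rule combined with the conditional independence assumption can be sketched for the discrete case as follows. The class priors and per-word likelihoods are hypothetical numbers supplied by hand; in practice they would be estimated from training data.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior P(class | observed features) for a discrete naive Bayes model.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: P(feature | class)}}
    observed:    list of features seen in the item being classified
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        # Conditional independence: multiply the per-feature likelihoods.
        for feature in observed:
            score *= likelihoods[label][feature]
        scores[label] = score
    # Bayes' Rule: divide by the evidence so the posteriors sum to 1.
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}


# Hypothetical probabilities, chosen only for illustration.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}

posterior = naive_bayes_posterior(priors, likelihoods, ["free"])
print(posterior)
```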
Lecture Code
Assignment
Recording - @8QxdjR0
Related Links
Topic 4 - Support Vector Machine
Topic Description
Support Vector Machines: Support Vector Classification; the Kernel Trick; Loss Functions. Readings: SKLearn User Guide, Support Vector Machines; SKLearn Supervised Learning Tutorial.
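One way to see the kernel trick is to check that a kernel evaluated on the raw inputs equals an ordinary dot product in an expanded feature space. The sketch below does this for a homogeneous degree-2 polynomial kernel on 2-D inputs and also shows the hinge loss commonly paired with SVMs. The function names and the sample points are illustrative assumptions, not part of the course code.

```python
import math


def dot(u, v):
    """Ordinary dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return dot(x, y) ** 2


def poly2_features(x):
    """Explicit feature map for 2-D inputs whose dot product equals poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)


def hinge_loss(label, score):
    """Hinge loss for a label in {-1, +1} and a real-valued decision score."""
    return max(0.0, 1.0 - label * score)


# Hypothetical sample points.
x, y = (1.0, 2.0), (3.0, 4.0)
kernel_value = poly2_kernel(x, y)                            # computed in 2-D
explicit_value = dot(poly2_features(x), poly2_features(y))   # computed in 3-D
print(kernel_value, explicit_value)
```

The two values agree up to floating-point rounding, but the kernel never builds the expanded features, which is what makes high-dimensional feature spaces affordable.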
Lecture Code
Related Links
Topic 5 - ROC Curve
Topic Description
Model Evaluation Metrics: Comparing across Classifiers; Precision, Recall, and the ROC Curve; Grid Search. Readings: SKLearn Precision Recall Tutorial; SKLearn ROC Tutorial; SKLearn ROC Example; SKLearn User Guide, Grid Search.
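Precision, recall, and the points of an ROC curve all come from the same four counts at a given decision threshold. The sketch below is a plain-Python illustration with hypothetical labels and scores; it assumes both classes are present and does not cover grid search.

```python
def confusion_counts(labels, scores, threshold):
    """Return (tp, fp, fn, tn) when scores >= threshold are predicted positive."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(labels, scores, threshold):
    """Precision = tp / (tp + fp); recall = tp / (tp + fn)."""
    tp, fp, fn, _ = confusion_counts(labels, scores, threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_points(labels, scores):
    """(false positive rate, true positive rate) pairs, strictest threshold first."""
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp, fp, fn, tn = confusion_counts(labels, scores, threshold)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Hypothetical classifier scores and true labels.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.3]

print(precision_recall(labels, scores, 0.5))
print(roc_points(labels, scores))
```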
Lecture Note
Lecture Code
Recording - 756v&*?e
Related Links
Topic 6 - Neural Network
Topic Description
Neural Networks, Part 0 – Perceptrons, feed-forward nets, backpropagation
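A single perceptron with the classic error-driven update rule can be sketched as below. It covers only the perceptron, not multi-layer feed-forward nets or backpropagation, and the AND-gate data, epoch count, and learning rate are illustrative assumptions.

```python
def predict(weights, bias, x):
    """Step activation: output 1 when the weighted sum is positive, else 0."""
    activation = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1 if activation > 0 else 0


def train_perceptron(data, epochs=20, learning_rate=1.0):
    """Perceptron learning rule: nudge weights by (target - prediction) * input."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in data:
            error = target - predict(weights, bias, x)
            weights = [w + learning_rate * error * xi for w, xi in zip(weights, x)]
            bias += learning_rate * error
    return weights, bias


# Logical AND is linearly separable, so a single perceptron can learn it.
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_gate)
predictions = [predict(weights, bias, x) for x, _ in and_gate]
print(weights, bias, predictions)
```

A problem that is not linearly separable, such as XOR, would not converge with this rule, which is one motivation for multi-layer networks.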
Lecture Code
Assignment
Recording
Topic 7 - Convolutional Neural Networks
Topic Description
Intro to deep learning. Convolutional Neural Networks; using convolutional neural networks for feature extraction and scoring.
Lecture Note
Recording
Related Links
Topic 8 - Recurrent Neural Network
Topic Description
PCA, SVD and LSA: Overview of Unsupervised Learning; Eigenvalue Decomposition. K-Means Clustering: Clustering; Model Selection for Clustering; Scree Plots; Comparison to Dimensionality Reduction. Reading: SKLearn User Guide, Clustering.
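The K-Means loop mentioned above alternates between assigning points to the nearest centroid and recomputing each centroid as the mean of its points. The sketch below is a one-dimensional, plain-Python illustration; the points, starting centroids, and fixed iteration count are assumptions made for the example.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Lloyd's algorithm on 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data with two obvious groups.
points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

The result depends on the starting centroids and on the chosen number of clusters, which is where model selection for clustering comes in.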
Intro
GPU
Recording - $=k20&xG
Topic 9 - Object Detection
Topic Description
Getting into text mining: data wrangling and mechanics; textual datasets; data cleaning and text data structures: “Bag of Words” and N-Grams; TF-IDF weighting. Assigned readings: pandas tutorial; NLTK Book, Chapter 03; SKLearn Feature Extraction.
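Bag-of-words counts, n-grams, and TF-IDF weighting can be sketched without any libraries. The version below assumes whitespace tokenization and the textbook weighting tf × log(N / df); SKLearn's feature extraction uses a smoothed variant, so its numbers will differ. The three documents are made up for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n):
    """All runs of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(documents):
    """Per-document {term: weight} using tf * log(N / df), with no smoothing."""
    tokenized = [tokenize(doc) for doc in documents]
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(documents)
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)  # the bag of words for this document
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical mini-corpus.
documents = ["the cat sat", "the dog sat", "the cat ran"]
weights = tf_idf(documents)
bigrams = ngrams(tokenize(documents[0]), 2)
print(weights[1], bigrams)
```

A term that appears in every document, such as "the" here, gets weight zero, while a term unique to one document gets the largest weight.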
Single Rectangle
Multiple Rectangles
Classification
Recording - ?@4Whfz0
Topic 10 - Generative Adversarial Networks
Topic Description
Adversarial Neural Networks
Lecture Note
Lecture Code
Recording - 9br^.pVF
Topic 11 - Autonomous Car
Topic Description
Extra topic: building an autonomous car with machine learning.
Lecture Code
Final Project - PCA of Facebook and Twitter LDA Evaluation Metrics
Project Description
Topic models are easy to train, but do they generate useful topics? In this post, we discuss several diagnostic metrics that Mallet uses to assess topic quality and conduct a principal component analysis (PCA) to determine which underlying features are most important. Since many of the evaluation metrics are highly correlated, PCA is an appropriate analytical approach. PCA is a statistical technique used to re-express highly correlated multivariate data in uncorrelated components that capture independent pieces of information represented in the larger data.
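To make the "highly correlated metrics" point concrete, the sketch below runs PCA on just two variables using the closed-form eigenvalues of a 2×2 covariance matrix. It is an illustration only: `metric_a` and `metric_b` are hypothetical numbers, not Mallet output, and the project itself works with more than two metrics.

```python
import math


def covariance_matrix(xs, ys):
    """Sample variances and covariance of two variables: (var_x, cov_xy, var_y)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    return var_x, cov_xy, var_y


def eigenvalues_2x2(var_x, cov_xy, var_y):
    """Eigenvalues of [[var_x, cov_xy], [cov_xy, var_y]], largest first."""
    mean = (var_x + var_y) / 2
    spread = math.sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    return mean + spread, mean - spread


# Two hypothetical, strongly correlated diagnostic metrics (NOT Mallet output).
metric_a = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_b = [2.1, 3.9, 6.2, 7.8, 10.1]

var_a, cov_ab, var_b = covariance_matrix(metric_a, metric_b)
first, second = eigenvalues_2x2(var_a, cov_ab, var_b)

# Share of total variance captured by the first principal component.
explained_by_first = first / (first + second)
print(explained_by_first)
```

When two metrics move together this closely, the first component carries almost all of the variance, so the second metric adds little independent information.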
To accomplish this, we use Mallet to generate fifty topics for a corpus of over 264K posts from publicly available Facebook pages related to COVID-19, and fifty topics for a corpus of ~11 million Twitter posts related to COVID-19. We use hashtag pooling to generate topics for the Twitter corpus, and Python to calculate diagnostic measures from Mallet topic-term frequency output files.
Based on our interpretation of the PCA results, we believe LDA topics are distinguished by two primary factors: 1) term frequency, and 2) term specificity. Furthermore, we found that topics with common, specific terms score significantly higher on coherence, on average, than topics with uncommon, unspecific terms. However, we also found several cases of poor topics that scored relatively high on coherence. In other words, our results suggest that topics using common, specific terms should be easier to interpret, but interpretability doesn’t imply that a topic is composed of terms that are specific or central to a corpus.