Class Lectures & Assignments
SEAS 6401 - Data Analytics Foundations & Practicum
Course Description (3 credits)
Introduction to concepts and techniques in data analytics. Basic techniques of data science; algorithms for data mining; basics of statistical modeling and their “Big Data” applications. Concepts, abstractions, and practical techniques. Restricted to students in the MS in data analytics program. (Fall, Every Year).
Lecture 1 - DB-100
Lecture Description
This lecture introduces the courseware, Apache Spark, and Databricks. It gives a high-level explanation of clusters, jobs, and stages: just enough that we can set those details aside until we discuss the architecture and focus more narrowly on the API for now.
- Reading Data - CSV
- Reading Data - Parquet
- Reading Data - JSON
- Reading Data - Text
- Reading Data - JDBC
- Reading Data - Summary
- Writing Data - Summary
Lab Assignment
Lecture 2 - DB-105
Lecture Description
Go through a set of notebooks for Databricks courses DB-100, DB-105, and JEFS (“Just Enough for Spark”). The links below are to the DB-105 course lectures and labs.
Lab Assignment
Extras
- Catalyst Optimizer
- Machine Learning Pipeline Demo
- Databricks Environment
- Transformations And Actions Lab
- Review Questions
Lecture 3 - Machine Learning Deployment
Lecture Description
This course teaches data scientists and data engineers best practices for deploying machine learning models into production. First, it explores common production issues faced when deploying machine learning solutions. Second, it implements various deployment options including batch, continuous with Spark Streaming, and on demand with RESTful and containerized services. This includes integrations with databases, data streams, and hosted endpoints. Finally, it covers monitoring machine learning models once they have been deployed into production.
- Introduction
- Production Issues
- Batch Deployment
- Streaming Deployment
- Real-Time Deployment with SageMaker
- Drift Monitoring
- Alerting
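As a rough illustration of the drift monitoring topic listed above, the sketch below compares a feature's training-time values with its values at serving time using the two-sample Kolmogorov-Smirnov statistic. This is not taken from the course notebooks: the function names, the sample data, and the 0.3 threshold are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two
    samples: 0.0 means identical distributions, 1.0 means no overlap.
    """
    ref_sorted = sorted(reference)
    cur_sorted = sorted(current)
    max_gap = 0.0
    # The largest gap between two step functions occurs at an observed value.
    for value in set(ref_sorted) | set(cur_sorted):
        ref_cdf = bisect_right(ref_sorted, value) / len(ref_sorted)
        cur_cdf = bisect_right(cur_sorted, value) / len(cur_sorted)
        max_gap = max(max_gap, abs(ref_cdf - cur_cdf))
    return max_gap


def has_drifted(reference, current, threshold=0.3):
    """Flag drift when the KS statistic exceeds a threshold.

    The default threshold is an illustrative assumption, not a recommendation.
    """
    return ks_statistic(reference, current) > threshold


# Hypothetical feature values seen during training and in production.
training_values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
serving_values = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]

print("KS statistic:", ks_statistic(training_values, serving_values))
print("Drift detected:", has_drifted(training_values, serving_values))
```

In a production setting the flag would typically feed an alerting step; a statistical test with a p-value is a common alternative to a fixed threshold.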
Capstone
Lab Assignment
Lecture 4 - Distributed Natural Language Processing
Lecture Description
In this course, data scientists learn how to process large amounts of text in a distributed manner using both single-node and distributed libraries. By the end of the course, you will have the tools to train machine learning models on features generated from your text corpus, such as TF-IDF scores and word embeddings.
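To make the TF-IDF feature mentioned above concrete, here is a minimal single-node sketch in plain Python. It assumes whitespace tokenization, length-normalized term frequency, and the unsmoothed inverse document frequency log(N / df); library implementations often use smoothed variants, so their scores may differ. The function name and the sample corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus.
corpus = [
    "spark processes text in parallel",
    "spark trains models",
    "text features feed models",
]
for doc_scores in tf_idf(corpus):
    top_term = max(doc_scores, key=doc_scores.get)
    print(top_term, round(doc_scores[top_term], 3))
```

A term that occurs in every document gets a score of zero, which is the behavior that makes TF-IDF useful for down-weighting common words.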
Lab Assignment
Lecture 5 - Machine Learning with Apache Spark
Lecture Description
Go through a set of notebooks for the Databricks course “Machine Learning with Apache Spark”. The labs and capstone notebook are below.
- Data Cleansing
- Linear Regression I
- Linear Regression II
- MLflow Tracking
- MLflow Model Registry
- Decision Trees
- Hyperparameter Tuning
- Hyperopt
- MLlib Deployment Options
- XGBoost
- Inference with Pandas UDFs
- Training with Pandas UDFs
- Koalas
Capstone
Lab Assignment
- Data Exploration Lab
- Linear Regression I Lab
- Linear Regression II Lab
- MLflow Lab
- Hyperparameter Tuning Lab
- Hyperopt Lab
- Pandas UDF Lab
Extras
Lecture 6 - Just Enough Scala for Spark
Lecture Description
Define several variables and initialize them with values of different data types. Discover the data type of each variable and understand how Scala infers data types. Try assigning a value of a different data type to a variable and check what happens. Create an expression using variables.
Capstone
Lab Assignment
- Values, Variables, Data Types Lab
- Conditional and Control Statements Lab
- Methods, Functions, Packages Lab
- Collections Lab
- Functional Programming Lab
- Classes, Tuples and More Lab
- String and Utility Functions Lab
- Exceptions Lab
Lecture 7 - Deep Learning with Keras
Lecture Description
Hands-on Deep Learning with Keras, TensorFlow, and Apache Spark™
Lab Assignment
- Keras Lab
- Advanced Keras Lab
- MLflow Lab
- Hyperopt Lab
- Horovod Lab
- Lime for CNNs Lab
- Transfer Learning Lab
- Generative Adversarial Networks
- Best Practices
Lecture 8 - Introduction to Reinforcement Learning
Lecture Description
Types of Machine Learning problems, Reinforcement Learning problem, Agent, Environment, RL vocabulary, RL shortcomings.
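The agent, environment, and reward vocabulary above can be illustrated with a minimal interaction loop. The sketch below is an assumption-laden toy, not the course's lab code and not the OpenAI Gym API: `CorridorEnvironment`, `random_policy`, and `run_episode` are hypothetical names, and the environment is a one-dimensional corridor invented for this example.

```python
import random


class CorridorEnvironment:
    """Hypothetical 1-D corridor: start in cell 0, goal in the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        """Put the agent back at the start and return the initial state."""
        self.state = 0
        return self.state

    def step(self, action):
        """Apply an action (-1 = left, +1 = right).

        Returns (next_state, reward, done). Walls clamp the position.
        """
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 0.0 if done else -1.0  # every non-terminal step costs 1
        return self.state, reward, done


def random_policy(state, rng):
    """A policy maps a state to an action; this one ignores the state."""
    return rng.choice([-1, 1])


def run_episode(env, policy, rng, max_steps=1000):
    """Run one episode and return the total (undiscounted) reward."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state, rng)             # the agent acts
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward


rng = random.Random(0)
print("Return of one random episode:", run_episode(CorridorEnvironment(), random_policy, rng))
```

Replacing `random_policy` with a policy that always moves right gives the best possible return in this corridor, which is the kind of improvement that policy iteration and value iteration compute systematically.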
Lab Assignment
- OpenAI gym Lab
- MDP Linear Equation Lab
- MDP Lab
- Policy Evaluation Lab
- Policy Iteration Lab
- Value Iteration Lab
- Asynchronous Lab
- First-visit MC Prediction Lab
- Every-visit MC Prediction Lab
- Prediction Lab
- First-visit MC Prediction Lab - Gridworld Problem
- Every-visit MC Prediction Lab - Gridworld Problem
- Gridworld Problem
Final Project
Project Description
YouTube, which is owned by Google, is an influential and popular online video-sharing platform and is often ranked among the largest search engines. Because it makes uploading and sharing videos convenient, it had reached 1.9 billion users worldwide by the end of 2019. Each day, more than 1 million videos are viewed in the U.S., and almost 5 million videos are viewed globally.
Our goals for this project are therefore to identify the key features that predict which trending videos are liked the most in the U.S., and to use machine learning to train one or more models on that prediction, then evaluate and improve their performance. We targeted a broad audience: anyone interested in the topic. Ideally, we wanted to show the process of using big data to build a model that explains and predicts the question of interest.