SEAS 6401 - Data Analytics Foundations & Practicum


Course Description (3 credits)

Introduction to concepts and techniques in data analytics. Basic techniques of data science; algorithms for data mining; basics of statistical modeling and their “Big Data” applications. Concepts, abstractions, and practical techniques. Restricted to students in the MS in data analytics program. (Fall, Every Year).


Lecture 1 - DB-100

Lecture Description

This lecture introduces the courseware, Apache Spark, and Databricks. It gives a high-level explanation of clusters, jobs, and stages: just enough that we can set those topics aside until we discuss the architecture. This lets us focus more narrowly on the API for now.

Lab Assignment

Lecture 2 - DB-105

Lecture Description

Go through a set of notebooks for Databricks courses DB-100, DB-105, and JEFS (“Just Enough for Spark”). The links below are to the DB-105 course lectures and labs.

Lab Assignment
Extras

Lecture 3 - Machine Learning Deployment

Lecture Description

This course teaches data scientists and data engineers best practices for deploying machine learning models into production. First, it explores common production issues faced when deploying machine learning solutions. Second, it implements various deployment options including batch, continuous with Spark Streaming, and on demand with RESTful and containerized services. This includes integrations with databases, data streams, and hosted endpoints. Finally, it covers monitoring machine learning models once they have been deployed into production.

Capstone
Lab Assignment


Lecture 4 - Distributed Natural Language Processing

Lecture Description

In this course, data scientists will learn how to process large amounts of text in a distributed manner using both single-node and distributed libraries. By the end of the course, they will have the tools necessary to train machine learning models using features generated from a text corpus, such as TF-IDF scores and word embeddings.
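To make the TF-IDF feature concrete, here is a minimal single-node sketch in plain Python. It is an illustration under stated assumptions, not the course's implementation: the function name `tf_idf`, the whitespace tokenizer, the sample corpus, and the unsmoothed formula (term frequency divided by document length, times `log(N / df)`) are all choices made for this example. Library implementations usually apply smoothing and differ in detail.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {term: score} dict per document.

    Assumed formula for this sketch: tf = count / document length,
    idf = log(N / df). Library implementations often smooth the idf term.
    """
    # Naive tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical three-document corpus, for illustration only.
corpus = [
    "spark processes text in parallel",
    "spark trains models on text features",
    "word embeddings capture meaning",
]
scores = tf_idf(corpus)

# Terms that appear in fewer documents receive a higher weight.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In a distributed setting the same idea is typically expressed with Spark's feature transformers rather than hand-written loops; the per-document dictionaries above correspond to the feature vectors a model would be trained on.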

Lab Assignment

Lecture 5 - Machine Learning with Apache Spark

Lecture Description

Go through a set of notebooks for the Databricks course “Machine Learning with Apache Spark”. The labs and capstone notebook are below.

Capstone
Lab Assignment
Extras


Lecture 6 - Just Enough Scala for Spark

Lecture Description

Define several variables and initialize them with values of different data types. Discover the data type of each variable and understand how Python auto-infers data types. Try assigning a value of a different data type to a variable and check what happens. Create an expression using variables.
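The exercise above can be sketched as follows. The example is written in Python because the description refers to Python's type inference; all variable names and values are illustrative and are not taken from the course notebooks.

```python
# Variables initialized with values of different data types.
# All names and values are illustrative, not from the course notebooks.
count = 42          # int
ratio = 3.14        # float
name = "Spark"      # str
is_ready = True     # bool
items = [1, 2, 3]   # list

# Python infers each type from the assigned value; type() reveals it.
inferred = {
    "count": type(count).__name__,
    "ratio": type(ratio).__name__,
    "name": type(name).__name__,
    "is_ready": type(is_ready).__name__,
    "items": type(items).__name__,
}
print(inferred)

# Assigning a value of a different type is allowed: the name is simply
# rebound to a new object, and its reported type changes accordingly.
count = "forty-two"
rebound_type = type(count).__name__
print(rebound_type)

# An expression built from variables; mixing float and int yields a float.
area = ratio * len(items) ** 2
print(type(area).__name__)
```

Because Python is dynamically typed, the reassignment succeeds silently. A statically typed language would reject the same reassignment at compile time.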

Capstone
Lab Assignment

Lecture 7 - Deep Learning with Keras

Lecture Description

Hands-on deep learning with Keras, TensorFlow, and Apache Spark™.

Lab Assignment


Lecture 8 - Introduction to Reinforcement Learning

Lecture Description

Types of Machine Learning problems, Reinforcement Learning problem, Agent, Environment, RL vocabulary, RL shortcomings.
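The agent and environment vocabulary can be illustrated with a small interaction loop. This is a minimal sketch under stated assumptions, not material from the lecture: the class names `BanditEnvironment` and `EpsilonGreedyAgent`, the payout probabilities, and the epsilon-greedy strategy are all hypothetical choices for this example. A multi-armed bandit is used because it is the simplest case, with actions and rewards but no state.

```python
import random


class BanditEnvironment:
    """Toy environment: each action pays 1.0 with a fixed hidden probability."""

    def __init__(self, win_probs, seed=0):
        self.win_probs = win_probs
        self.rng = random.Random(seed)

    def step(self, action):
        # The reward is the environment's only feedback to the agent.
        return 1.0 if self.rng.random() < self.win_probs[action] else 0.0


class EpsilonGreedyAgent:
    """Tracks the average reward per action and mostly picks the best one."""

    def __init__(self, n_actions, epsilon=0.1, seed=1):
        self.epsilon = epsilon
        self.values = [0.0] * n_actions   # estimated reward per action
        self.counts = [0] * n_actions     # times each action was tried
        self.rng = random.Random(seed)

    def act(self):
        if self.rng.random() < self.epsilon:
            # Explore: try a random action.
            return self.rng.randrange(len(self.values))
        # Exploit: pick the action with the highest current estimate.
        return max(range(len(self.values)), key=self.values.__getitem__)

    def learn(self, action, reward):
        # Incremental update of the running mean for the chosen action.
        self.counts[action] += 1
        self.values[action] += (reward - self.values[action]) / self.counts[action]


# Hypothetical payout probabilities; the agent never sees them directly.
env = BanditEnvironment([0.2, 0.5, 0.8])
agent = EpsilonGreedyAgent(n_actions=3)

total_reward = 0.0
for _ in range(2000):
    action = agent.act()            # agent chooses an action
    reward = env.step(action)       # environment responds with a reward
    agent.learn(action, reward)     # agent updates its estimates
    total_reward += reward

print("estimated values:", [round(v, 2) for v in agent.values])
print("action counts:", agent.counts)
```

The loop also hints at one of the shortcomings the lecture lists: the agent learns only from trial and error, so it has to spend some interactions on poorly paying actions before its estimates become reliable.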

Lab Assignment

Final Project

Project Description

YouTube is an influential and popular online video-sharing platform, owned by Google, that is often rated as one of the largest search engines. Because it makes uploading and sharing videos so convenient, it had reached 1.9 billion users worldwide by the end of 2019. Each day, more than 1 million videos are viewed in the U.S., and almost 5 million are viewed globally.

Hence, our goals for this project are to identify the key features that predict which trending videos are liked the most in the U.S., and to use machine learning to train one or more models on that prediction task, then evaluate and improve their performance. We target a broad audience: anyone who is interested in the topic. Ideally, we want to show the process of using big data to build a model that explains and predicts the question of interest.

Report PDF

Slides PDF

nbviewer


