Projects & Publications
PROJECT LIST
Quantitative Research Assistant in RDC Funded Research Project
Project Title: The Impact of Housing Assistance on Residential Environmental Exposures
Funded by: US Department of Housing and Urban Development (HUD) Healthy Homes Program Homes Technical Research (DCHHU0054-19)
Organization: Environmental Occupational Health at The George Washington University, The Milken Institute School of Public Health
Supervisors: Dr MyDzung T. Chu and Dr Ami Zota
Project Summary
Racial/ethnic and socioeconomic disparities in housing conditions play a critical role in stark health disparities observed in the U.S. Housing assistance may impact health by improving housing quality and the residential environment. The objective of our project is to characterize the relationship between housing assistance and residential environmental exposures. We hypothesize that residential exposures will be significantly different between residents living in federally-assisted housing stock and residents living in other types of low-income housing due to differences in: compliance with federal regulations and policies specific to public housing, management and maintenance practices, and physical attributes of the housing stock. In this work, we will use the linkage of National Health and Nutrition Examination Survey (NHANES) data to the U.S. Department of Housing and Urban Development (HUD) administrative records covering the period 1999-2016.
Project Slides
Smoke Free Project in RPubs
Smoke Free Project Version 2 in RPubs
Classification on Secondhand Smoke Exposure Using Machine Learning Models
on Dec 2021
Project Summary
This project investigates how well different statistical and machine learning models classify whether a person is exposed to secondhand smoke (SHS), using demographic predictors. SHS exposure is measured by serum cotinine from a blood examination. The models are then evaluated, and the one with the lowest RMSE and highest AUROC score is selected. The project has the following steps:
- Implemented statistical and ML models in R and R Markdown to classify whether people have minor SHS exposure.
- Performed EDA, data visualization, train-test split, and model iterations.
- Built different statistical and machine learning models, including GLM, Random Forest, and XGBoost with hyperparameter tuning.
- Interpreted the model results and selected the best model with the lowest RMSE and highest AUROC score.
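AUROC, one of the two selection criteria listed above, can be computed directly from predicted scores. The project itself was written in R, so the following is only a minimal Python sketch of the idea. The function name `auroc`, the model names, and the toy labels and scores are illustrative assumptions, not project code or data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: 1 = exposed, 0 = not exposed.
y_true = [0, 0, 1, 1]
model_a = [0.1, 0.4, 0.35, 0.8]  # one positive is ranked below a negative
model_b = [0.1, 0.2, 0.7, 0.9]   # every positive outranks every negative

# Keep the candidate with the highest AUROC.
candidates = {"model_a": auroc(y_true, model_a), "model_b": auroc(y_true, model_b)}
best = max(candidates, key=candidates.get)
```

The pairwise form is quadratic in the number of cases, which is fine for illustration; a rank-based formula would be the usual choice for large samples.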
Publication in RPubs
Project Poster
Evaluation of a Public Policy Intervention Using Interrupted Time Series Regression
on Dec 2021
Project Summary
In this project, I introduce the interrupted time series regression method through a real-world example of a public policy intervention. I first set out the causal question of the project, including the proposed cause and the proposed effect.
I then discuss the sampling strategy and its implications for the external validity of the pilot data. The measures underlying the construct validity of the study are defined before the quasi-experimental design is implemented. Finally, I describe the main method, interrupted time series analysis, and interpret the results.
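A common way to set up an interrupted time series analysis is segmented regression, y_t = b0 + b1*t + b2*D_t + b3*(t - T0)*D_t, where D_t is 1 from the intervention time T0 onward. The sketch below fits that model by ordinary least squares using only the Python standard library. It is an illustration under stated assumptions, not the specification used in the project: the helper names `solve_linear_system` and `fit_its` and the synthetic series are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_its(y, start):
    """Fit y_t = b0 + b1*t + b2*D_t + b3*(t - start)*D_t by least squares.

    D_t is 1 from time index `start` onward and 0 before it.
    Returns [b0, b1, b2, b3]: baseline level, baseline slope,
    level change at the intervention, and slope change after it.
    """
    rows = []
    for t in range(len(y)):
        d = 1.0 if t >= start else 0.0
        rows.append([1.0, float(t), d, (t - start) * d])
    k = 4
    # Normal equations: (X'X) b = X'y.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical noise-free series: slope 0.5 before the intervention at t = 10,
# then a level drop of 3 and a slope change of -0.2.
start = 10
series = [20 + 0.5 * t + (-3 - 0.2 * (t - start) if t >= start else 0.0)
          for t in range(20)]
b0, b1, b2, b3 = fit_its(series, start)
```

Here b2 estimates the immediate level change and b3 the change in trend. A real analysis would also need standard errors and a check for autocorrelated residuals, which this sketch leaves out.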
Publication in RPubs
Research Paper
Text Analysis on Chinese Digital Collections Using Twitter Tweets
Oct 2020 - Dec 2021
Project Summary
First, we used NLP in Python to search for keywords and gather digital collection links from tweets. The tweets were collected with Social Feed Manager from the George Washington University library. The following Python code performs the text analysis.
Python Code - getting most frequently used hashtags
Python Code - text analysis on Twitter tweets
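As a rough idea of what hashtag counting involves, here is a minimal sketch that is separate from the linked project code. It assumes the tweets are already available as plain text strings. The function name `top_hashtags`, the regular expression, and the sample tweets are illustrative assumptions.

```python
import re
from collections import Counter

# A hashtag is taken to be "#" followed by word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def top_hashtags(tweets, n=3):
    """Return the n most frequent hashtags, ignoring case."""
    counts = Counter()
    for text in tweets:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(text))
    return counts.most_common(n)


# Hypothetical tweet texts, not taken from the project data.
sample_tweets = [
    "New scans in our #DigitalCollections of #ChineseArt",
    "Browse rare maps online #digitalcollections #maps",
    "Exhibit opens today #ChineseArt #DigitalCollections",
]
top = top_hashtags(sample_tweets, n=2)
# top == [("digitalcollections", 3), ("chineseart", 2)]
```

Lowercasing before counting treats `#ChineseArt` and `#chineseart` as the same tag, which is usually what a frequency ranking needs.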
Then, we created interactive maps and useful browsing functions in Tableau and ArcGIS with detailed information of collections. Both of these products were published on George Washington University's website and they are widely used by researchers.
Tableau
ArcGIS
Finally, we developed a web-based application to allow users to interact with the database of collections, along with map visualization of collections’ locations. We also designed and implemented multiple advanced searching functions in Python using the Streamlit platform.
Web-App in Streamlit
- Version One
- Version Two - updated to use Streamlit's Session State feature
Regression Analysis Final Report
on May 2021
Project Summary
This project demonstrates, step by step, how to use stepwise selection in a regression model analysis.
First, I presented the motivation for using Log(Y) rather than Y as the dependent variable. I estimated the linear regression of Log(Y) on an appropriate set of explanatory variables for the properties and interpreted the results. I then performed and documented a diagnostic analysis of the final selected model. Finally, I forecasted the median and mean price of a real estate property from a given set of explanatory variables and provided a 95% prediction interval for Y and an approximate 95% confidence interval for E[Y].
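Forecasting on the original scale from a Log(Y) model requires a back-transformation. If Log(Y) given the predictors is assumed normal with mean mu and standard deviation sigma, then the median of Y is exp(mu) and the mean is exp(mu + sigma^2 / 2). The sketch below illustrates this under that normality assumption. The function names and the numeric inputs are hypothetical and are not results from the report.

```python
import math


def back_transform(mu_hat, sigma_hat):
    """Convert a fitted log-scale mean and residual SD to the original scale.

    Assumes log(Y) given the predictors is normal with mean mu_hat and
    standard deviation sigma_hat, so Y is log-normal.
    """
    median = math.exp(mu_hat)
    mean = math.exp(mu_hat + sigma_hat ** 2 / 2)
    return median, mean


def back_transform_interval(lower_log, upper_log):
    """Exponentiate the endpoints of an interval computed for log(Y)."""
    return math.exp(lower_log), math.exp(upper_log)


# Hypothetical fitted values on the log scale.
median_price, mean_price = back_transform(mu_hat=12.0, sigma_hat=0.4)
low, high = back_transform_interval(11.2, 12.8)
```

Because exp is increasing, a prediction interval for Log(Y) maps to a prediction interval for Y simply by exponentiating its endpoints. The mean always exceeds the median here, which is why the two forecasts are reported separately.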
Two-Way ANOVA Final Report
on May 2021
Project Summary
The goal of this project is to perform a two-way Analysis of Variance (ANOVA) on a given problem. I then performed a diagnostic analysis of the ANOVA and, finally, interpreted the results. The following link is the project report for the two-way ANOVA analysis.
Built Database in MySQL
on May 2021
Project Summary
The goal of this project is to restructure a flattened dataset and load it into a MySQL database, to demonstrate the convenience of storing the dataset in a database, and to give end users an easier and more efficient way to search for specific information. The dataset was collected from the Zomato API as .json files (raw data) and stored in the comma-separated values file Zomato.csv. We explored the dataset by visualizing the fetched information to understand it better.
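To illustrate what restructuring a flattened file into related tables looks like, here is a small sketch. It uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server. The column names, table names, and sample rows are hypothetical and do not reflect the actual Zomato.csv schema.

```python
import csv
import io
import sqlite3

# Hypothetical flattened rows; the real Zomato.csv has different columns.
FLAT_CSV = """restaurant,city,cuisine
Spice Hub,Delhi,Indian
Spice Hub,Delhi,Chinese
Pasta Bar,Mumbai,Italian
"""

conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
conn.executescript("""
    CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE restaurant (
        id INTEGER PRIMARY KEY,
        name TEXT,
        city_id INTEGER REFERENCES city(id),
        UNIQUE (name, city_id)
    );
    CREATE TABLE restaurant_cuisine (
        restaurant_id INTEGER REFERENCES restaurant(id),
        cuisine TEXT,
        PRIMARY KEY (restaurant_id, cuisine)
    );
""")

# Each flat row contributes to up to three tables; repeated values are skipped.
# INSERT OR IGNORE is SQLite syntax (MySQL spells it INSERT IGNORE).
for row in csv.DictReader(io.StringIO(FLAT_CSV)):
    conn.execute("INSERT OR IGNORE INTO city (name) VALUES (?)", (row["city"],))
    city_id = conn.execute(
        "SELECT id FROM city WHERE name = ?", (row["city"],)
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO restaurant (name, city_id) VALUES (?, ?)",
        (row["restaurant"], city_id),
    )
    restaurant_id = conn.execute(
        "SELECT id FROM restaurant WHERE name = ? AND city_id = ?",
        (row["restaurant"], city_id),
    ).fetchone()[0]
    conn.execute(
        "INSERT OR IGNORE INTO restaurant_cuisine VALUES (?, ?)",
        (restaurant_id, row["cuisine"]),
    )

# Example end-user query: which restaurants in Delhi serve Chinese food?
delhi_chinese = conn.execute("""
    SELECT r.name
    FROM restaurant AS r
    JOIN city AS c ON c.id = r.city_id
    JOIN restaurant_cuisine AS rc ON rc.restaurant_id = r.id
    WHERE c.name = 'Delhi' AND rc.cuisine = 'Chinese'
""").fetchall()
```

After normalization each restaurant and city is stored once, and a question that would require scanning and de-duplicating the flat file becomes a single join.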
Data Analysis in Python
Database Creation in Python
Project Slides
The Impact of Big Data on Risk Management and Methods to Reduce Risk
on Jul 2021
Paper Summary
This literature survey summarizes some of the risks in different kinds of IT systems. It also introduces methods and approaches that apply big data techniques to mitigate those risks.
Prediction of YouTube Video Likes in PySpark
on Dec 2020
Project Summary
The goals of this project are to identify the key features that predict which trending videos are liked the most in the U.S., and to use machine learning to train models for prediction and then evaluate and improve their performance. The project has the following steps:
- Implemented statistical and ML regression models to predict trending YouTube video likes in Spark Databricks.
- Performed data cleaning, including missing value imputation, outlier removal, and autocorrelation detection.
- Prepared the data for modeling through data extraction, distribution and correlation analysis, NLP, and one-hot encoding.
- Built model pipeline with data transformation and k-fold cross-validation.
- Performed hyperparameter tuning in Decision Tree & RF models.
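The k-fold cross-validation step in the list above can be pictured with a plain-Python sketch of how the folds are formed. This is not the PySpark pipeline from the project. The function name `k_fold_indices` and the sample size are illustrative assumptions.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Folds are contiguous and differ in size by at most one sample.
    Shuffle the data beforehand if its order is not random.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=3))
```

Each sample is used for validation exactly once and for training in the other folds, so averaging a metric over the folds gives a steadier estimate than a single train-test split when comparing hyperparameter settings.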
PySpark code in Databricks
Project report
Project slides
Calories Prediction Using Recipe Ingredients
on Dec 2020
Project Summary
For this class project, we built two models to predict the type of cuisine and the number of calories from a list of ingredients. The project steps were as follows:
- Predicted a recipe's calories by conducting text analysis on its ingredients.
- Performed EDA, text preprocessing (removing punctuation, removing stop words, and lemmatizing), train-test split, and model iterations in Python (scikit-learn).
- Built different machine learning models on TF-IDF features, including Random Forest, Linear Regression, and Passive-Aggressive Regressor.
- Tuned hyperparameters and selected the model with the lowest RMSE.
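The text preprocessing and TF-IDF steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the scikit-learn code used in the project: lemmatizing is omitted because it needs an NLP library, the weighting is the textbook tf * log(N / df) rather than scikit-learn's smoothed variant, and the stop-word list, function names, and recipes are hypothetical.

```python
import math
import string
from collections import Counter

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"of", "and", "a", "the", "to", "cup", "cups", "tbsp", "tsp"}


def preprocess(text):
    """Lowercase, strip punctuation, and drop stop words and bare numbers."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS and not w.isdigit()]


def tf_idf(documents):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical ingredient lists.
recipes = [
    "2 cups of flour, butter and sugar",
    "1 tbsp of butter, garlic and chicken",
]
vectors = tf_idf(recipes)
```

An ingredient that appears in every recipe, such as butter here, gets weight zero, while ingredients that distinguish one recipe from another get positive weights. These weights are what a regression model would take as input features.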
With these two models, we then created a web application so people can try them out. The app was built with Streamlit and hosted on Streamlit Sharing. To see the app, click the Streamlit badge below.