Statistical/Data Analysis

1. Exploratory Analysis of Primate Hand and Foot Digit Proportions

Read full paper Here.

This semester, I completed an independent study analyzing a previously unused autopod digit length dataset provided by the Boyer Lab at Duke University. The project aimed to improve our understanding of how hand proportions evolved in primates and to explore potential covariation among digits.

For this project:

  1. I spent a significant amount of time on data collection and preprocessing. The dataset I was given was incomplete, with many gaps and underrepresented taxonomic groups. To fill these in, I collected CT scans for each missing taxon and measured each associated manual bone (the metacarpals, proximal phalanges, and intermediate phalanges) using the biological modeling software Avizo.

  2. Normalized digit lengths using the geometric mean of each individual's metacarpal and phalanx lengths

  3. Performed a principal component analysis (PCA) on the normalized digit length data to identify major axes of variation and covariation among the different bones

  4. Wrote a final report summarizing the significance, methods, and results of the study
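
As a rough illustration of steps 2 and 3, here is a minimal sketch in Python with NumPy. It assumes the normalization divides each bone length by the geometric mean of that individual's lengths, which is one common reading of "geometric mean normalization" and not necessarily the exact procedure in the paper. The function names and the measurement values are made up for the example.

```python
import numpy as np

def geometric_mean_normalize(lengths):
    """Divide each row (one individual) by the geometric mean of its bone lengths."""
    lengths = np.asarray(lengths, dtype=float)
    geo_mean = np.exp(np.log(lengths).mean(axis=1, keepdims=True))
    return lengths / geo_mean

def pca(data):
    """PCA via SVD of the column-centered data.

    Returns (scores, loadings, explained_variance_ratio).
    """
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    scores = u * s          # coordinates of each individual on the components
    return scores, vt, explained

# Hypothetical lengths in mm: columns are metacarpal, proximal phalanx,
# intermediate phalanx; rows are individuals. Not real measurements.
example = np.array([
    [30.0, 20.0, 12.0],
    [45.0, 28.0, 15.0],
    [22.0, 18.0, 11.0],
    [60.0, 35.0, 25.0],
])

shape = geometric_mean_normalize(example)
scores, loadings, explained = pca(shape)
print(explained)  # share of variance captured by each component
```

Dividing by the geometric mean removes overall size, so the components describe differences in proportion and not in absolute length.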

2. Chance in Games: A Look into Luck vs Skill

Read full paper w/code Here.

I recently participated in the 2024-2025 TriComm Math Modeling Competition, in which we were tasked with building a model to predict the 2024 March Madness winner within 48 hours.

Highlights:

  1. Gathered and preprocessed historical game log data from the 2022 and 2023 seasons for our 68 teams. Surprisingly, this was the bulk of the project and took us over 15 hours to complete.

  2. Created a composite metric based on skill and consistency

  3. Modeled team wins as a function of skill under informative priors, fitted using the Metropolis-Hastings algorithm

  4. Determined to what extent wins could be explained by a team’s skill and how much was left to chance

  5. In reality, UConn won March Madness 2024. Our model predicted UConn was second most likely to win (out of 68 teams)—not too shabby, right?

  6. Our team received special recognition from the judges for sound mathematical modeling and innovative analysis
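
For readers unfamiliar with Metropolis-Hastings, the sketch below shows the general idea on a toy problem. It is not our competition model: it assumes a single coefficient `beta` linking skill to win probability through a logistic function, a normal prior on `beta`, and a random-walk proposal. The skill values, win counts, and prior settings are all invented for illustration.

```python
import numpy as np

def log_posterior(beta, skill, wins, games, prior_mean, prior_sd):
    """Log of (binomial likelihood x normal prior), up to an additive constant."""
    logits = beta * skill
    log_p = -np.logaddexp(0.0, -logits)   # log of win probability
    log_q = -np.logaddexp(0.0, logits)    # log of loss probability
    log_lik = np.sum(wins * log_p + (games - wins) * log_q)
    log_prior = -0.5 * ((beta - prior_mean) / prior_sd) ** 2
    return log_lik + log_prior

def metropolis_hastings(skill, wins, games, n_samples=5000, step=0.3,
                        prior_mean=1.0, prior_sd=1.0, seed=0):
    """Random-walk Metropolis-Hastings sampler for beta."""
    rng = np.random.default_rng(seed)
    beta = prior_mean
    current = log_posterior(beta, skill, wins, games, prior_mean, prior_sd)
    samples = np.empty(n_samples)
    accepted = 0
    for i in range(n_samples):
        proposal = beta + rng.normal(0.0, step)
        candidate = log_posterior(proposal, skill, wins, games,
                                  prior_mean, prior_sd)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio.
        if np.log(rng.uniform()) < candidate - current:
            beta, current = proposal, candidate
            accepted += 1
        samples[i] = beta
    return samples, accepted / n_samples

# Toy data: five hypothetical teams, 30 games each.
skill = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
games = np.full(5, 30)
wins = np.array([8, 12, 15, 19, 23])

samples, acceptance_rate = metropolis_hastings(skill, wins, games)
posterior_mean = samples[1000:].mean()   # drop the first 1000 draws as burn-in
print(posterior_mean, acceptance_rate)
```

A positive posterior mean for `beta` means higher skill goes with more wins in the toy data. The spread of the samples gives a sense of how much of the outcome the skill term leaves unexplained.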

Some Thoughts: The vast majority of our time on this project was spent on data collection and preprocessing. Putting our time and focus into collecting historical data both hurt and helped us. On one hand, it helped us develop an incredibly accurate model. On the other hand, it was a 48-hour competition, and by the time everything was said and done, we didn't have enough time to write the actual report! As a result, our submission was quite unpolished. It was an interesting feeling to think we might have a winning idea but had paid for it with the ability to present it as one. Overall, it was a very valuable lesson in opportunity cost.

3. Predicting Profit Amongst Fortune 1000 Companies

Read full paper Here.

Regression analysis paper written using R.

Highlights:

  1. Fit several linear models to predict profit on a dataset with over 1000 observations and checked for multicollinearity among predictors by calculating variance inflation factors (VIFs)

  2. Ran initial model comparisons using AIC/BIC, then evaluated the models with cross-validation by comparing their root mean squared errors (RMSEs) and R-squared values

  3. Wrote the methodology and discussion sections of the report detailing results and presented findings to professors and students in the statistics department
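
The paper itself was done in R. As a language-neutral illustration of the two checks above, here is a small Python/NumPy sketch of a VIF calculation and k-fold cross-validated RMSE. The data are synthetic and the helper names (`fit_ols`, `vif`, `cv_rmse`) are hypothetical; none of it comes from the Fortune 1000 dataset.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ intercept + X."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def vif(X):
    """VIF_j = 1 / (1 - R^2_j), where R^2_j regresses predictor j on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        resid = X[:, j] - predict(fit_ols(others, X[:, j]), others)
        r2 = 1.0 - resid.var() / X[:, j].var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

def cv_rmse(X, y, k=5, seed=0):
    """Average out-of-fold RMSE over k shuffled folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        coef = fit_ols(X[train_idx], y[train_idx])
        resid = y[test_idx] - predict(coef, X[test_idx])
        errors.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(errors))

# Synthetic predictors: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 3.0 * x3 + rng.normal(size=n)

vifs = vif(X)
rmse = cv_rmse(X, y)
print(vifs, rmse)
```

In this toy setup the two near-duplicate predictors get large VIFs while the independent one stays close to 1, which is the pattern a multicollinearity check is meant to flag.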

4. GoodReads and Gender: An NLP Analysis of Gender Perceptions in Book Reviews

Read full paper Here.

Highlights:

  1. Developed and trained five Word2Vec models on genre-specific datasets to uncover linguistic associations between gender and literary stereotypes in a database of over 15 million Goodreads book reviews, ultimately presenting our findings to professors and students in the statistics department

  2. Executed a comprehensive data cleaning process using the langdetect library and NLTK, followed by deep learning techniques to vectorize and analyze text, revealing significant gender bias patterns across genres

  3. Utilized cosine similarity metrics and data visualization libraries to measure and compare the orientation of word vectors, identifying strong linguistic associations between gendered words and stereotypes

  4. Designed a classification algorithm using tokenization to identify male-centered versus female-centered reviews through weighted counts of gendered language, and analyzed rating distributions to determine the impact of gender focus on review ratings

  5. In the report, I wrote the “Results and Methods” section
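
To make steps 3 and 4 more concrete, here is a minimal Python sketch of cosine similarity between word vectors and a weighted-count labeler for reviews. It is an illustration under stated assumptions, not the code from the paper: the term lists, their weights, the "neutral" fallback label, and the function names are all hypothetical, and in practice the vectors would come from the trained Word2Vec models.

```python
import re
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1 = same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical weighted lexicons; the real lists and weights may differ.
MALE_TERMS = {"he": 1.0, "him": 1.0, "his": 1.0, "hero": 2.0}
FEMALE_TERMS = {"she": 1.0, "her": 1.0, "hers": 1.0, "heroine": 2.0}

def gender_focus(review):
    """Label a review by comparing weighted counts of gendered tokens."""
    tokens = re.findall(r"[a-z']+", review.lower())
    male = sum(MALE_TERMS.get(t, 0.0) for t in tokens)
    female = sum(FEMALE_TERMS.get(t, 0.0) for t in tokens)
    if male > female:
        return "male-centered"
    if female > male:
        return "female-centered"
    return "neutral"  # assumption: ties and reviews with no gendered terms

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(gender_focus("She saved her brother and he thanked her"))
```

With labels like these, rating distributions can be compared across the male-centered and female-centered groups.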