2023 UD Mathematical Data Science Symposium

The 2023 University of Delaware Mathematical Data Science Symposium will take place on May 19, 2023 at Willard Hall (Room 109), University of Delaware. The symposium consists of 15-minute talks on various practical topics in data science, from predicting ice cream flavors to quantifying corporate credit risk. The symposium will be open to the whole data science community at UD. This year, up to two projects will receive a Best Presentation Award. Let the best project win!

Please register by Wednesday 05/17 so we can supply sufficient coffee and cookies!

Program

8:45am

Welcome and Introduction

9:00am	Ice Cream Flavors: Marketing Strategies that Help Determine Popular Products
	Rob Salati, Nikki Pilla, Chara Angelidou, Angela Kuczykowski
	Our plan is to research the most relevant marketing strategies that help determine popular products, in this case ice cream flavors. From the dataset, we plan to determine the favorites, i.e., the most highly reviewed ice cream flavors. We will then investigate which ingredients these favorite flavors contain. These ingredients will theoretically create the most popular flavor when combined.

9:20am	Are You a Robot?
	Calvin Adkins, Justin Jacobowitz, Yamini Pravallika Medapati, Juneeth Kumar Padarti, Cameron Spiess and Hareesh Kumar Tadapaneni
	Have you ever been prompted to perform a task to prove that you are not a robot? The task is typically given by a CAPTCHA system whose goal is to determine whether the user is human. We set out to build a computer-driven identifier that attempts one of these tasks: identifying a target number in a panel of many numbers. We used the UCI ML hand-written dataset to construct panels of numbers for our classifier to identify. We simulated conditions similar to the CAPTCHA system by rotating the target number and randomizing its location. Our classifier proved useful in helping us identify and locate the target number in the panel.

9:40am	Methods and Accuracy Metric for Predicting Corporate Credit Ratings
	Anthony Angone, Julianna Dorsch, Alex Mulrooney, Brian Orak, Murugesan Somasundaram
	In this presentation, we will investigate the performance of various classifiers in predicting the credit ratings assigned to companies given their financial data. As this is tabular data, we compare multiple tree-based methods. Then, we develop multiple accuracy metrics to gauge the performance of each model as accurately as possible, and attempt to maximize each model's performance on these metrics.

10:00am	Exploring Interpolation Methods: A Comparison Study of Kriging and Inverse Distance Weighting (IDW)
	Nana Abena Konadu Osei Tutu
	Interpolation is a common technique used in various fields, such as geostatistics, environmental sciences, and geography, to estimate values at unsampled locations based on observations at sampled locations. Kriging and IDW are two widely used interpolation methods based on different assumptions and models. This project aims to compare the performance of kriging and IDW for spatial interpolation using field data: saturated hydraulic conductivity, soil moisture content, and soil compaction.
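
For readers unfamiliar with IDW, the idea can be sketched in a few lines: each estimate is a weighted average of the sampled values, with weights proportional to 1 / distance^power. This is a minimal illustration only, not the presenter's implementation; the function name `idw_interpolate`, the default `power=2.0`, and the `eps` tolerance are assumptions made for this sketch.

```python
import numpy as np


def idw_interpolate(sample_xy, sample_values, query_xy, power=2.0, eps=1e-12):
    """Estimate values at query_xy by inverse distance weighting (illustrative)."""
    sample_xy = np.asarray(sample_xy, dtype=float)
    sample_values = np.asarray(sample_values, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)

    # Pairwise Euclidean distances, shape (n_queries, n_samples).
    dists = np.linalg.norm(query_xy[:, None, :] - sample_xy[None, :, :], axis=2)

    # Closer samples get larger weights; clamp distances to avoid dividing by zero.
    weights = 1.0 / np.maximum(dists, eps) ** power
    estimates = (weights @ sample_values) / weights.sum(axis=1)

    # A query that coincides with a sample location returns that sample's value.
    hit_rows, hit_cols = np.nonzero(dists < eps)
    estimates[hit_rows] = sample_values[hit_cols]
    return estimates


if __name__ == "__main__":
    # Two made-up sample points and one query halfway between them.
    samples = [[0.0, 0.0], [2.0, 0.0]]
    values = [0.0, 10.0]
    print(idw_interpolate(samples, values, [[1.0, 0.0]]))
```

Unlike kriging, this sketch fits no spatial covariance model; the only tuning parameter is the exponent `power`.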

10:20am	Coffee & Tea Break

10:40am	Comparison of Nearest Neighbor and Support Vector Machine Techniques for Mitigating Class Imbalance to Predict Customer Purchasing Behavior
	Cole Plum, Julia Rothstein, Margaux Deputy
	In recent years, online shopping has been rapidly increasing in volume. Before the development of e-commerce, shopping took place solely through in-person interactions. However, with the ever-increasing advancement of internet technology, more and more consumers choose online shopping, often preferring it for its convenience and the ability to shop from the comfort of their homes. Online shopping also allows the website to gather information on its users, such as the time they spend on a certain product or related products, bounce rates, exit rates, and page values for specific web pages. This study presents a review of classification models for predicting customers' purchasing intention using the information provided by a user's session. The main issue we will explore is the imbalanced nature of the dataset. We perform a hands-on comparison between two techniques, nearest neighbors and support vector machines, to predict whether a customer will buy a product, while employing various resampling techniques and adapted classification algorithms in an effort to mitigate any unwanted effects of the imbalanced data.
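
As an illustration of what a resampling technique can look like, the sketch below implements random oversampling: minority-class rows are duplicated at random until every class matches the size of the largest one. The abstract does not say which resampling methods the team used, so this is an assumed example; the function name `random_oversample` and the `seed` parameter are hypothetical.

```python
import numpy as np


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class rows at random until all classes are equal in size."""
    rng = np.random.default_rng(seed)
    features = np.asarray(features)
    labels = np.asarray(labels)

    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()

    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw extra indices with replacement to reach the majority class size.
        extra = rng.choice(idx, size=target - count, replace=True)
        selected.append(np.concatenate([idx, extra]))

    # Shuffle so that classes are not grouped together in the output.
    order = rng.permutation(np.concatenate(selected))
    return features[order], labels[order]


if __name__ == "__main__":
    # Made-up data: 8 sessions without a purchase (0), 2 with a purchase (1).
    X = np.arange(10).reshape(-1, 1)
    y = np.array([0] * 8 + [1] * 2)
    X_bal, y_bal = random_oversample(X, y)
    print(np.bincount(y_bal))
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class proportions.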

11:00am	A Comparative Study Using the ModelNet Dataset for Point Cloud Classification Using Different Machine Learning Models and Deep Learning Models
	Ravi Teja Chigurupati, Dhana Lakshmi Kankanala, Kishore Kumar Reddy Madithati, Ashish Reddy Mulaka, Lalith Teja Nagidi, Harshitha Paladugu Chengalraya
	A point cloud is a set of data points in 3D space that represents the shape or surface of a physical object. We use the ModelNet dataset, which contains CAD models from 40 categories of commonly used objects such as beds, chairs, desks, dressers, and sofas. We apply different machine learning and deep learning techniques, including convolutional neural networks, an MLP classifier, a random forest classifier, and PointNet, to analyze how these models perform on this kind of dataset. We also vary the number of samples to understand how well the models predict as the sample count is reduced or increased.

11:20am	Tesla Stock Price Prediction
	Chandralekha Eluri, Greshma Vachepalli, Lakshmi Sudini
	We have chosen the "Tesla stock data from 2010 to 2020" dataset for stock price prediction. We will predict future stock prices from Tesla's past stock prices. The regression analysis can be carried out using a multiple linear regression model. We will use modifications such as Lasso or Ridge regression to understand the impact of regularization on model performance. We also plan to predict the stock price using a long short-term memory (LSTM) network.

11:40am	Patient Length of Stay
	Bhavya Arora, Sabrina Casas, Komali Challa, Gaurav Kumar, Elisha Shrestha, Olushola Olufemi Soyoye
	The recent COVID-19 pandemic has raised alarms about one of the most overlooked areas: healthcare management. While data science has various use cases in healthcare management, patient length of stay (LOS) is one critical parameter to observe and predict in order to improve the efficiency of healthcare management in a hospital. Predicting it helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plan optimized to minimize LOS and lower the chance of staff/visitor infection. Also, prior knowledge of LOS can aid in logistics such as room and bed allocation planning.

Noon	Lunch Break

1:30pm	City Walkability
	Lane McLaughlin, Galen Sweet, Cameron Ibrahim, Ryan Wolynetz, Seemy Hodge, Zolfa Saadat
	Walkability refers to the ease with which a person can reach their destination on foot. The National Walkability Index, developed by the EPA, is a tool used to measure the walkability of communities all over the United States. The guiding principle of our analysis is to help a city planner identify how to make their city more walkable. We will approach our analysis from several perspectives. First, we will analyze which features most impact walkability. From there, we intend to make an interactive tool for predicting the walkability of a city given these variables. Finally, we would like to analyze the distribution of walkable cities to see which areas need walkability improvements.

1:50pm	Diabetes Prediction & Diabetic Retinopathy Detection using AI
	Sri Vishnuvardhan Reddy Akepati, Venkata Sai Vardhan Kataru, Nagendra Sai Nandimandalam, Charitha Nagamalla and Sujith Yeluru
	Diabetes is a global epidemic, affecting millions and leading to serious complications. In our machine learning project, we aim to tackle two critical objectives: predicting diabetes (binary) and detecting diabetic retinopathy (multi-class). Through cutting-edge algorithms and deep learning techniques, we achieve high accuracy in both classifications, analyzing medical records and retinal images to inform clinical decision-making and improve patient outcomes. Our results pave the way for more accurate and effective tools for early detection and treatment, addressing two of the most pressing challenges in diabetes management and screening. Keywords: Deep Learning, Diabetes, Binary classification, multi-class classification.

2:10pm	Practical Methodology of Select Text Classification Techniques
	Vitali Kay, Ethan Kempista, Jae Kim, Zhifei Yuliu
	Text classification is one of the main use cases of natural language processing. Multiple methods exist to solve this problem, ranging in accuracy, sophistication, and processing speed. The purpose of this presentation is to study several of these techniques as applied to a large open-source body of classified data. The dataset chosen for the analysis is the well-known Enron spam/ham email collection of 30K+ email samples. Once the data is obtained, the group loosely follows the Keras NLP guide to format and present various methods of analysis. There is no intention to identify “the best” method, because the practical results will vary with the structure of the source data and the requirements for explainability. Rather, the goal is to run multiple independent analyses in parallel, in anticipation of them acting in concert with each other.

2:30pm	Airline Sentiment Analysis: Uncovering Customer Opinions on Twitter
	Abdul Rahuman Aslam Moopan Abdulwahab, Bala Subrahmanyam Boyina, Melvin Oswald Sahaya Anbarasu, Akash Parmar, Raghu Krishna Soudri
	The airline industry is a very competitive market that has grown over the past two decades. In this industry, customer satisfaction is crucial to staying ahead of the competition. Airline companies resort to traditional customer feedback forms, which are tedious and time-consuming. Meanwhile, with the tremendous increase in the use of social media, consumers are sharing their thoughts, feelings, and experiences on various social media platforms. Twitter is a valuable source for analyzing customer satisfaction: it can give airlines insight into how customers perceive their services and help them improve their offerings. In this report, we propose a sentiment analysis tool for Twitter data on airlines that categorizes tweets shared by users as positive, negative, or neutral. We trained four machine learning models (a random forest classifier, a support vector classifier, a multinomial naïve Bayes classifier, and a bagging classifier) and achieved an accuracy of 92%. We also evaluated the models on other metrics such as precision, recall, and F1-score.

2:50pm	The Netflix Prize
	Hamad El Kahza, Mahruna Kader, Arnab Roy

3:10pm	Coffee & Tea Break

4:00pm	Data Analytics in Football/Soccer
	Casey Carr
	I will create my own versions of some of the most important models being used in the field of football/soccer analytics.

4:20pm	Instance Segmentation for Accessibility: Submission to the Accessibility, Vision, and Autonomy 2023 Challenge
	Amani Arman Kiruga, Nii Otu Tackie-Otoo, Osman Mohamed, Qize Zhang, Emmanuel Adebayo
	In this project, our team is participating in the CVPRW 2023 AVA Segmentation Challenge, which is focused on developing vision-based accessibility systems. The challenge includes a synthetic instance segmentation benchmark that simulates scenarios where autonomous systems interact with pedestrians with disabilities and includes categories such as wheelchairs and walking canes. Our proposed method compares two state-of-the-art techniques for image segmentation: ViT (Vision Transformer) and SAM (Segment Anything Model). ViT is a transformer model shown in previous work to outperform convolution-based methods on instance segmentation and image recognition. SAM, on the other hand, is a large model trained on millions of images for segmentation using visual prompts akin to text prompts in NLP. The prompts can consist of bounding box points, which we generate using an object detection model. Since SAM does not predict labels for segmented objects, we instead use a trained object detection model to predict labels. We find that both approaches show promising results for instance segmentation in virtual worlds.

4:40pm	Loan Risk Analysis
	Logan Borys, Marydol Soto Santarriaga, Emma Weber
	Our goal was to investigate risk analysis for three types of loans: medical, credit card, and home improvement. We used different techniques from MATH637 to predict whether or not a person will default on a loan and to identify which loans have a higher risk of customers defaulting. We evaluated the performance of each method we implemented to understand which methods are recommended. Moreover, we identified which variables are the most relevant for the prediction. Using all of this information, we developed an early detection model for the highest-risk loan type with a current length of six months. That way, banks can have an idea of how likely someone is to default.

5:00pm	Structural Adhesives
	Paul Samuel, Ravva Pavan Uttej, Ravi Teja, Sai Mahesh, Ananth Durbha
	The use of machine learning to uncover material behavior has increased tremendously in the recent past, supplanting the conventional methods used to study it. Along similar lines, in this project we are interested in using machine learning methods to determine the constitutive law that governs the material behavior of structural adhesives. Structural adhesives are widely used as weight-saving materials and have relevance in the aerospace and automotive industries. We begin with a data-driven approach, where a synthetic dataset is generated using numerical analysis (finite element simulations). A surrogate ML model is then trained on this dataset of stress-strain data pairs. The next step would be to couple the physics of the problem with data, and have a neural network minimize a physics-based cost function, ultimately yielding the constitutive law of the adhesive.

5:20pm	Performance Evaluation of the EfficientDet Model on Microscopic Images and the DOTA Dataset (RGB Images)
	High-Praise Akomolede, Sasank Gogineni, Sai Akhil Reddy Gunnam, Anirudh Padullaparthi, Yashwanth Tekumudi
	The need to identify objects with great accuracy, at both the microscopic and macroscopic levels, has recently been of great interest. Object detection at the macroscopic level has been deemed an easy feat given the numerous easily accessible algorithms, but the detection of cells and particles in microscopic images is a common and challenging task. Microscopic images present difficulties due to small and densely packed objects, poor signal quality relative to background noise, and the complex shapes and appearances of the objects. Current methods still face challenges in addressing these issues. This project uses the EfficientDet network to analyze both the DOTA dataset (RGB images) and microscopic images (greyscale images) containing cells and microrobots. The EfficientNet network was designed to be computationally efficient by scaling the depth, width, and resolution of the network with a compound scaling method. The microscopic images will be used to create a dataset and test the efficiency of the EfficientDet model on greyscale microscopic images in comparison to the RGB images of the DOTA dataset. Accuracy and performance will be computed on both datasets to evaluate the EfficientDet model on different grades of images.

5:40pm

Closing remarks

Organizers

The symposium is organized by Professor Dominique Guillot and Professor Vu Dinh from the Department of Mathematical Sciences. The organizers are grateful for the support and sponsorship of the Department of Mathematical Sciences.